Unicode support

From DCppWiki

Note: this proposal is largely irrelevant, as the new ADC protocol is designed to use UTF-8 from the ground up.

(Note that this proposal is geared towards working over old Unicode-unaware hubs. A proposal for aware hubs should probably let the hub communicate the chosen encoding to all clients in the hub, i.e. force everyone's encoding to be the same. This is, after all, a scenario which makes sense: clients should understand each other, hence use the same encoding.)

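UTF-8 is US-ASCII clean/transparent, which is good: every US-ASCII string is already valid UTF-8 with identical bytes, and non-ASCII characters only ever produce bytes above 127. This allows UTF-8 to be readily used in $NickList/$GetNickList, chat and $MyINFO in current hubs.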
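The transparency claim can be checked with a small sketch. This is only an illustration: the helper names and the sample nicks are made up for this example, and the delimiter set is an assumption based on the commands mentioned above.

```python
# Sketch: US-ASCII text is byte-identical in UTF-8, and non-ASCII characters
# never produce bytes below 0x80, so protocol delimiters stay unambiguous.

# Assumption: delimiter bytes used by the commands mentioned above.
DELIMITERS = b"$|<> "


def utf8_is_ascii_transparent(text: str) -> bool:
    """True if every ASCII character encodes to its own single byte and
    every non-ASCII character encodes only to bytes >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            if encoded != bytes([ord(ch)]):
                return False
        elif any(b < 0x80 for b in encoded):
            return False
    return True


def count_delimiters(payload: bytes) -> int:
    """Count how many delimiter bytes occur in an encoded payload."""
    return sum(payload.count(bytes([d])) for d in DELIMITERS)


# Hypothetical sample nicks, one pure ASCII and one with non-ASCII characters.
ascii_nick = "SomeUser"
unicode_nick = "Søren_日本"
```

Because no multi-byte sequence contains a delimiter byte, an unaware hub can still split such commands at the right places even though it does not understand the text between the delimiters.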

The biggest problem is how to communicate the choice of encoding to other clients. Optimally, a client should know the encoding used by each client immediately on login to a hub, without any roundtrips to each client or other inefficient methods.

One proposed way to do unicode-flagged chat is:

<<nick>>?<message>|

The ? here is a special flag value. The current proposal is 0xA0: UTF-8 chat that contains only US-ASCII characters then looks like normal chat, which keeps it backwards compatible. On the other hand, Unicode clients must decode the chat message in two steps, the nickname and the message part separately (well, depending on your decoder: some will just ignore the invalid 0xA0 byte, some will discard the entire string). Most current clients send a space in this position and won't allow the user to edit it. The overall format of the message, including <, > and |, does not change for UTF-8 encoded chat, so no hub upgrades are needed. Unicode chat will look garbled to Unicode-ignorant clients whenever it uses character codes above 127, though it's easy to detect a Unicode message and discard it if the client doesn't support it. The 0xA0 flag is not optimal for descr/nick encoding, though: a client need not chat right after login, so other clients would not learn its encoding until its first chat message.
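A minimal sketch of the two-step decoding described above might look like the following. The function names are hypothetical, the legacy fallback encoding (cp1252) is purely an assumption about what an unaware client would use, and the parser assumes a nick never contains ">".

```python
FLAG_UTF8 = 0xA0    # proposed flag byte from the text above
FLAG_LEGACY = 0x20  # the plain space most current clients send
LEGACY_ENCODING = "cp1252"  # assumption: stand-in for the local code page


def build_utf8_chat(nick: str, message: str) -> bytes:
    """Build a flagged chat message: <nick> + 0xA0 + message + |."""
    return (b"<" + nick.encode("utf-8") + b">" + bytes([FLAG_UTF8])
            + message.encode("utf-8") + b"|")


def parse_chat(raw: bytes):
    """Return (nick, message, is_utf8) for one chat message.

    The nick and the message are decoded separately, because the 0xA0
    flag byte between them is not valid UTF-8 on its own.
    """
    if not (raw.startswith(b"<") and raw.endswith(b"|")):
        raise ValueError("not a chat message")
    end_of_nick = raw.index(b">")  # assumes nicks never contain '>'
    if end_of_nick + 1 >= len(raw) - 1:
        raise ValueError("missing flag byte")
    nick_bytes = raw[1:end_of_nick]
    flag = raw[end_of_nick + 1]
    body = raw[end_of_nick + 2:-1]
    if flag == FLAG_UTF8:
        return nick_bytes.decode("utf-8"), body.decode("utf-8"), True
    if flag == FLAG_LEGACY:
        return (nick_bytes.decode(LEGACY_ENCODING, errors="replace"),
                body.decode(LEGACY_ENCODING, errors="replace"), False)
    raise ValueError("unexpected byte after nick")


def decodes_as_single_utf8_string(raw: bytes) -> bool:
    """Show why one-step decoding fails for flagged messages."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A client that does not support Unicode could use the same flag check to detect and discard flagged messages instead of showing garbled text.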

It's also been said that the 0xA0 flag in chat should/could be used for CTCP instead of UTF-8 chat. I think that might be a good idea, though using CTCP in main chat to tell clients which encoding one is using for descr/nick doesn't seem very efficient to me. It would still work for chat (just like 0xA0), provided the chat is embedded in the CTCP message as well.

Another way to do this is to use a bit in the "info-byte" of the MyINFO. The highest bit is, as far as I know, unused by NMDC and not used by any large number of people. Even if somebody uses it, this usage is probably more useful. Setting this bit means the client is using UTF-8 in descr, nick and chat. This is the best method in terms of latency and bandwidth: the MyINFO message is already propagated to every client in the hub, which is optimal for these kinds of flags.
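The bit manipulation involved is small. The sketch below works on the raw info-byte only and does not parse a full MyINFO message; the function names are hypothetical, and 0x01 is used purely as an example value for the existing low bits.

```python
UTF8_FLAG_BIT = 0x80  # highest bit of the info-byte, per the proposal above


def set_utf8_flag(info_byte: int) -> int:
    """Return the info-byte with the proposed UTF-8 bit set.
    The existing low bits are left untouched."""
    return (info_byte | UTF8_FLAG_BIT) & 0xFF


def uses_utf8(info_byte: int) -> bool:
    """True if the sender announced UTF-8 for descr, nick and chat."""
    return bool(info_byte & UTF8_FLAG_BIT)


def legacy_bits(info_byte: int) -> int:
    """Return the info-byte without the proposed flag, as an unaware
    client that only inspects the low bits would see it."""
    return info_byte & 0x7F


example = set_utf8_flag(0x01)  # 0x01 is only an example value
```

Since the flag makes the info-byte larger than 127, a receiving client would need to read it from the raw bytes before decoding the rest of the message as text.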