Date: 24/6/2002
I've got my first multi-byte codepage working, which opens the door for a host of others. Someone sent me a spam in GB2312, so I thought that would be as good a place to start as any, seeing as no-one has sent me an ISO-2022-JP message yet!?!
There are 2 distinct parts of new code that I've written to support multibyte codepages. First there is the decoding of characters from the 7 or 8 bit packed format into fixed width values, then there is the conversion to UTF-16 (Unicode). With most codepages I don't foresee a problem in decoding the characters to fixed width data, but the mapping to Unicode is another matter.
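
To make the two stages concrete, here is a minimal sketch in Python. It is not Scribe's actual code. It assumes the GB2312 text arrives in its common 8 bit EUC-CN packing (lead and trail bytes both in 0xA1..0xFE), and the names `unpack_euc_cn`, `to_unicode` and `SAMPLE_MAP` are made up for illustration. The table holds only two entries; a real one would cover the whole character set.

```python
# Stage 1: unpack the 8 bit byte stream into fixed width codes.
# Stage 2: look each code up in a mapping table to get Unicode.

INVALID = 0xFFFF  # hypothetical marker for bytes that could not be unpacked


def unpack_euc_cn(data):
    """Turn EUC-CN packed bytes into a list of fixed width integer codes."""
    codes = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII passes through as a single byte.
            codes.append(b)
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Lead byte plus trail byte become one 16 bit code.
            codes.append((b << 8) | data[i + 1])
            i += 2
        else:
            # Stray or truncated byte: flag it and move on.
            codes.append(INVALID)
            i += 1
    return codes


# Tiny illustrative table: GB2312 code -> Unicode code point.
SAMPLE_MAP = {
    0xD6D0: 0x4E2D,  # 中
    0xCEC4: 0x6587,  # 文
}


def to_unicode(codes, table):
    """Map fixed width codes to a Unicode string, U+FFFD for unknowns."""
    out = []
    for code in codes:
        if code < 0x80:
            out.append(chr(code))
        else:
            out.append(chr(table.get(code, 0xFFFD)))
    return "".join(out)


text = to_unicode(unpack_euc_cn(b"A\xd6\xd0\xce\xc4"), SAMPLE_MAP)
utf16 = text.encode("utf-16-le")  # the 16 bit form an editor could hold
```

Keeping the unpacking separate from the table lookup means a new codepage only needs its own unpacker and its own table. The lookup code stays the same.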

Conversion tables for multibyte character sets run to many K of character pairs, even in binary (the GB2312 map is 29k). So to keep Scribe modular and neat I've implemented codepage map files for the large mapping tables that Scribe needs to convert various codepages into Unicode, which it can (now) use internally, at least for the editor. This way vanilla Scribe can remain a small download and you can optionally download the mapping files for the codepages you're interested in. To make that easier I thought it'd be nice to have an "install optional component" menu which finds the files needed and downloads them to the right directory for you. That goes equally well for plugins and whatnot, as well as codepage map files. So what I'll probably do is have a script that the program can call to get a list of optional components for any application, returning the names and paths of the components which the user can install/uninstall.
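
As a rough illustration of what a binary map file could look like, the sketch below stores each mapping as a pair of little-endian 16 bit values (source code, Unicode code point). This layout is an assumption for the example, not the format Scribe actually uses, and `pack_map` and `load_map` are hypothetical names. For scale, GB2312 defines 7,445 characters, so at 4 bytes a pair a table like this would come to a bit over 29K.

```python
import struct

# One record: source codepage value, then Unicode code point, both uint16.
PAIR = struct.Struct("<HH")


def pack_map(pairs):
    """Serialise a {source_code: unicode_code_point} dict to bytes."""
    return b"".join(PAIR.pack(src, uni) for src, uni in sorted(pairs.items()))


def load_map(blob):
    """Parse bytes produced by pack_map back into a dict."""
    if len(blob) % PAIR.size != 0:
        raise ValueError("map file is not a whole number of pairs")
    return {src: uni for src, uni in PAIR.iter_unpack(blob)}


sample = {0xD6D0: 0x4E2D, 0xCEC4: 0x6587}
blob = pack_map(sample)    # what would be written to the map file
table = load_map(blob)     # what the program would read back at startup
```

A flat array of fixed size records like this needs no parsing beyond one pass over the file, which suits something that is downloaded once and loaded on demand.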

Which brings me to the editor itself: the Unicode (UTF-16) version of the editor is humming along nicely. Last night I integrated it into Scribe as an optional component. You can select which version of the editor to use for the time being, as the 16bit version is still fairly untested. The control accepts input in 8bit or 16bit form. The codepage setting is used to convert 8bit input into UTF-16 for internal representation, and the text is likewise converted back to the appropriate 8bit representation when saving a message. This means that the Unicode version of the editor can support a huge new range of codepages: multibyte, UTF, anything, as long as the codepage subsystem can translate it from 8bit strings into UTF-16 and back again. Which is just a matter of coding really. One codepage at a time.
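
The round trip described above can be sketched with Python's built-in codecs standing in for the codepage subsystem. The function names `load_message` and `save_message` are hypothetical, and the choice of little-endian UTF-16 without a byte order mark is an assumption for the example.

```python
def load_message(raw, codepage):
    """8bit bytes in the message's codepage -> UTF-16 for the editor."""
    return raw.decode(codepage).encode("utf-16-le")


def save_message(utf16, codepage):
    """UTF-16 from the editor -> 8bit bytes in the message's codepage."""
    return utf16.decode("utf-16-le").encode(codepage)


raw = b"\xd6\xd0\xce\xc4"                # two GB2312 characters
internal = load_message(raw, "gb2312")   # what the editor works on
saved = save_message(internal, "gb2312") # what goes back into the message
```

The editor only ever sees UTF-16. Supporting another codepage means teaching the two conversion functions about it, with no changes to the control itself.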

All this however doesn't address the other issues, like recipients, subject and so on not being catered for. Currently they are converted (as best as I can) into the interface codepage, which is dependent on the translation you're using. This means that basically any multibyte charset used in parts of the email other than the body of the message is going to be lost or incorrectly rendered. There is only 1 way to solve this in the short term, and that is to rewrite the UI to use UTF-8, for every translation, all the time. This is the least invasive approach to supporting Unicode throughout Scribe. But it's been a good while since I posted a good release, and people have been annoyed at the degraded codepage support in the last 4-5 releases. I know it's getting better, it's just kind of an "it's gotta get worse before it gets better" sort of thing. So I want to get a good version out the door, and then I'll sequester myself away for a month to write up UTF-8 support for all the controls etc.
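
The loss in subjects and recipients mentioned above is easy to reproduce. The sketch below assumes, purely for illustration, that the interface codepage is Windows-1252 (`cp1252`). The helper `survives` is a hypothetical name, and Python's codecs again stand in for the real conversion code.

```python
def survives(text, codepage):
    """True if text comes back unchanged after a trip through codepage."""
    return text.encode(codepage, errors="replace").decode(codepage) == text


subject = "Re: \u4e2d\u6587"  # "Re: " followed by two Chinese characters

# Forced into a single byte interface codepage, the Chinese is replaced.
lossy = subject.encode("cp1252", errors="replace")

in_cp1252 = survives(subject, "cp1252")
in_utf8 = survives(subject, "utf-8")
```

Any single byte interface codepage will have this problem with some script or other, whereas UTF-8 can carry every one of them in an 8bit string.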

This ties in nicely with going to UTF-8 for the language resource files, so that will all happen at the same time. Finally I feel like I'm getting somewhere with international support. Long time coming I know, but well, it's a labour of love, and I'm learning along the way. I'm working smarter not harder these days, thank goodness.

If you guys haven't already seen it, there is a rough roadmap for Scribe. It shows some of the plans to add functionality coming up over the next year or so. Check it out sometime, it's going to be my "todo" list for a while.
