part counter: string encodings

I did a little research about the encodings used for JavaScript strings and contributed to a thread on the V8 list which eventually clarified some issues for me. It turns out JavaScript strings are broken in (at least) two ways:

The JavaScript specification sometimes implies a string should be UTF-16 and sometimes implies a string should be UCS-2. Goofy!
Each character in a UCS-2 string is a single 16-bit quantity. By contrast, a character in a UTF-16 string may be a "surrogate pair" of two 16-bit quantities. Even though JavaScript sometimes implies strings should be UTF-16, its built-in string handling (including regular expressions) isn't savvy to surrogate pairs. A pair may therefore be split or misinterpreted, producing garbage which may rattle around inside an application for a long time before anyone notices anything's amiss.
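
To make the second problem concrete, here is a small sketch in plain JavaScript. It uses U+1D11E (the musical G clef) purely as an illustrative character outside the 16-bit range; the variable names are mine and have nothing to do with Counterpart.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF does not fit in 16 bits, so UTF-16
// stores it as a surrogate pair: 0xD834 followed by 0xDD1E.
const clef = "\uD834\uDD1E";

// The string is one character to a reader, but JavaScript counts
// 16-bit units, not characters.
const lengthInUnits = clef.length;          // 2
const highSurrogate = clef.charCodeAt(0);   // 0xD834
const lowSurrogate = clef.charCodeAt(1);    // 0xDD1E

// Ordinary string operations will happily cut the pair in half,
// leaving a lone surrogate that is not a valid character on its own.
const firstHalf = clef.substring(0, 1);

// A plain regular expression also treats each half separately.
const dotMatches = clef.match(/./g).length; // 2
```

Nothing in this snippet raises an error, which is the trouble: the lone surrogate in `firstHalf` is accepted silently.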

I'm also pretty sure there's lots of code out there (and lots of coders) that doesn't handle surrogate pairs correctly. (This isn't a flaw in JavaScript.)

I've decided strings will enter Counterpart as UCS-2 by default, and if they cannot be converted cleanly to UCS-2, Counterpart will throw an exception. This guarantees valid strings because UCS-2 is a proper subset of UTF-16, which means that any string which plays by the rules of UCS-2 also plays by the rules of UTF-16. The downside here is that Counterpart does not, by default, support UTF-16, which may annoy some folks.
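
As a rough illustration of what "converts cleanly to UCS-2" could mean, the sketch below rejects any string containing a surrogate code unit (0xD800 through 0xDFFF). The function name `assertUcs2` is hypothetical, and this is not Counterpart's actual implementation, which does its checking natively while reading the file.

```javascript
// Hypothetical helper, for illustration only; not part of Counterpart.
// Returns the string unchanged if every 16-bit unit is a complete
// character, and throws if any unit belongs to a surrogate pair.
function assertUcs2(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      throw new RangeError(
        "surrogate code unit at index " + i + "; string is not clean UCS-2"
      );
    }
  }
  return str;
}

// A string that stays within 16-bit characters passes through.
const ok = assertUcs2("na\u00EFve \u20AC");

// A string containing a surrogate pair is rejected.
let rejected = false;
try {
  assertUcs2("clef: \uD834\uDD1E");
} catch (e) {
  rejected = e instanceof RangeError;
}
```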

The first practical effect of this policy is a change to org.icongarden.fileSystem.openFile. It still has a property encoding which, if it's a string, names the encoding of the file. But if this property is instead an object, then the object may specify an encoding for both the file and the strings which will be read from the file. The file property, if present, overrides the default encoding of the file (UTF-8), and the string property, if present, overrides the default encoding of strings which will be read from the file (UCS-2). The only possibility other than the default for string is UTF-16.
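
The defaulting rules above can be sketched as a small function. This is only a model of the behavior described in this post: the name `resolveEncoding` is hypothetical, and the exact spelling of the encoding names ("UTF-8", "UCS-2", "UTF-16") is an assumption.

```javascript
// Hypothetical model of how the encoding property is interpreted.
// Not Counterpart's code; encoding name spellings are assumed.
function resolveEncoding(encoding) {
  // Defaults described in the post: files are UTF-8, strings are UCS-2.
  const resolved = { file: "UTF-8", string: "UCS-2" };

  if (typeof encoding === "string") {
    // A bare string names the encoding of the file only.
    resolved.file = encoding;
  } else if (encoding !== null && typeof encoding === "object") {
    // An object may override either default independently.
    if (encoding.file !== undefined) resolved.file = encoding.file;
    if (encoding.string !== undefined) resolved.string = encoding.string;
  }

  // The only string encodings on offer are UCS-2 and UTF-16.
  if (resolved.string !== "UCS-2" && resolved.string !== "UTF-16") {
    throw new RangeError("unsupported string encoding: " + resolved.string);
  }
  return resolved;
}

const defaults = resolveEncoding(undefined);           // UTF-8 file, UCS-2 strings
const fileOnly = resolveEncoding("UTF-16");            // UTF-16 file, UCS-2 strings
const optIn = resolveEncoding({ string: "UTF-16" });   // UTF-8 file, UTF-16 strings
```

The point of the object form is that a caller has to ask for UTF-16 strings explicitly; nobody gets surrogate pairs by accident.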

This decision is somewhat paternalistic; for insight into why I chose to be this way, see the aforementioned thread on the V8 list.

1 comment:

Pete said...: I should add, I suppose, that the read function of org.icongarden.fileSystem.openFile does not yet support UTF-16 properly. I'm not sure how to allocate the buffer for a string whose characters may each be 16 or 32 bits wide without wasting substantial memory (temporarily) in many cases. So, at present, you can ask for UTF-16, but if there are any surrogate pairs, Counterpart will run out of buffer space and throw an exception. (sigh); October 28, 2008 at 1:04 AM
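
One possible way around the buffer problem in the comment above, offered as a sketch and not as what Counterpart does: make a first pass over the file's UTF-8 bytes to count how many 16-bit units the string will need, then allocate exactly that much. The function name `countUtf16Units` is hypothetical, and the sketch assumes the input is already valid UTF-8.

```javascript
// Hypothetical sizing pass; assumes well-formed UTF-8 input.
// Each UTF-8 sequence of 1 to 3 bytes becomes one 16-bit unit;
// each 4-byte sequence becomes a surrogate pair (two units).
function countUtf16Units(bytes) {
  let units = 0;
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    if ((b & 0xC0) === 0x80) {
      continue; // continuation byte: already counted with its lead byte
    }
    units += (b & 0xF8) === 0xF0 ? 2 : 1;
  }
  return units;
}

// "a", "é", "€", and U+1D11E encoded as UTF-8.
const sample = [
  0x61,
  0xC3, 0xA9,
  0xE2, 0x82, 0xAC,
  0xF0, 0x9D, 0x84, 0x9E
];
const needed = countUtf16Units(sample); // 5
```

The cost is reading the bytes twice, which trades some time for not over-allocating.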

Tuesday, October 28, 2008
