- The JavaScript specification sometimes implies a string should be UTF-16 and sometimes implies a string should be UCS-2. Goofy!
- Each character in a UCS-2 string is a 16-bit quantity. By contrast, each character in a UTF-16 string may be a "surrogate pair" of two 16-bit quantities. Even though JavaScript sometimes implies strings should be UTF-16, its built-in support for string handling (including regular expressions) isn't savvy to surrogate pairs, which may ultimately lead to their being split or misinterpreted, producing garbage which may rattle around inside an application for a long time before anyone notices anything's amiss.
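To make the hazard concrete, here is a small sketch in plain JavaScript (nothing Counterpart-specific) of how the built-in string operations treat a character outside the Basic Multilingual Plane. The choice of U+1D11E as the sample character is mine.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane,
// so JavaScript stores it as a surrogate pair of two 16-bit code units.
const clef = "\uD834\uDD1E";

const length = clef.length;           // counts code units, not characters
const firstUnit = clef.charCodeAt(0); // the high surrogate on its own
const half = clef.slice(0, 1);        // splits the pair, leaving a lone surrogate
const dotMatches = clef.match(/^.$/); // without the `u` flag, `.` matches one code unit
```

Here `length` is 2, `half` is a string holding only the unpaired high surrogate, and `dotMatches` is `null` because the regular expression sees two "characters" rather than one.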
I've decided strings will enter Counterpart as UCS-2 by default, and if they cannot be converted cleanly to UCS-2, Counterpart will throw an exception. This guarantees valid strings because UCS-2 is a proper subset of UTF-16, which means that any string which plays by the rules of UCS-2 also plays by the rules of UTF-16. The downside here is that Counterpart does not, by default, support UTF-16, which may annoy some folks.
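As a rough illustration of what "converted cleanly to UCS-2" could mean, the check below rejects any string containing a surrogate code unit. This is a sketch under my own assumptions: `assertUcs2` is a hypothetical helper, not a function Counterpart exposes, and Counterpart's real check may well live in native code and behave differently.

```javascript
// Hypothetical helper, not part of Counterpart: reject any string that
// contains a surrogate code unit (0xD800 through 0xDFFF). A string with no
// surrogates is valid as both UCS-2 and UTF-16.
function assertUcs2(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      throw new RangeError(
        "surrogate code unit at index " + i + "; string is not clean UCS-2"
      );
    }
  }
  return str; // unchanged, so the call can be used inline
}
```

A string such as `"héllo"` passes through untouched, while `"a\uD834\uDD1E"` throws at index 1.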
The first practical effect of this policy is a change to org.icongarden.fileSystem.openFile. It still has a property encoding which, if it's a string, names the encoding of the file. But if this property is instead an object, then the object may specify an encoding for both the file and the strings which will be read from the file. The file property, if present, overrides the default encoding of the file (UTF-8), and the string property, if present, overrides the default encoding of strings which will be read from the file (UCS-2). The only possibility other than the default for string is UTF-16.
This decision is somewhat paternalistic; for insight into why I chose to be this way, see the aforementioned thread on the V8 list.
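The rules for the encoding property can be sketched as a small normalizing function. Everything here is illustrative: `resolveEncoding` is a hypothetical name, the exact spelling of the encoding names ("UTF-8", "UCS-2", "UTF-16") is assumed, and this is not how Counterpart is actually implemented.

```javascript
// Hypothetical sketch of how the `encoding` property might be interpreted;
// the function name and encoding spellings are assumptions for illustration.
function resolveEncoding(encoding) {
  // Defaults described in the post: UTF-8 for the file, UCS-2 for strings.
  const resolved = { file: "UTF-8", string: "UCS-2" };

  if (typeof encoding === "string") {
    // A plain string names the encoding of the file only.
    resolved.file = encoding;
  } else if (encoding !== null && typeof encoding === "object") {
    // An object may override either default independently.
    if (encoding.file !== undefined) resolved.file = encoding.file;
    if (encoding.string !== undefined) resolved.string = encoding.string;
  }

  // The only alternative to the default string encoding is UTF-16.
  if (resolved.string !== "UCS-2" && resolved.string !== "UTF-16") {
    throw new RangeError("string encoding must be UCS-2 or UTF-16");
  }
  return resolved;
}
```

Under these assumptions, omitting the property yields both defaults, and passing `{ string: "UTF-16" }` keeps the file as UTF-8 while opting in to UTF-16 strings.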
1 comment:
I should add, I suppose, that the read function of org.icongarden.fileSystem.openFile does not yet support UTF-16 properly. I'm not sure how to allocate the buffer for a string whose characters may each be 16 or 32 bits wide without wasting substantial memory (temporarily) in many cases. So, at present, you can ask for UTF-16, but if there are any surrogate pairs, Counterpart will run out of buffer space and throw an exception. (sigh)