Tuesday, October 28, 2008

string encodings

I did a little research about the encodings used for JavaScript strings and contributed to a thread on the V8 list which eventually clarified some issues for me. It turns out JavaScript strings are broken in (at least) two ways:
  1. The JavaScript specification sometimes implies a string should be UTF-16 and sometimes implies a string should be UCS-2. Goofy!
  2. Each character in a UCS-2 string is a 16-bit quantity. By contrast, each character in a UTF-16 string may be a "surrogate pair" of two 16-bit quantities. Even though JavaScript sometimes implies strings should be UTF-16, its built-in support for string handling (including regular expressions) isn't savvy to surrogate pairs, which may ultimately lead to their being split or misinterpreted, producing garbage which may rattle around inside an application for a long time before anyone notices anything's amiss.
As well, I'm pretty sure there's lots of code (and coders) out there which (who) don't handle surrogate pairs correctly. (This isn't a flaw in JavaScript.)
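To make the hazard concrete, here's a minimal sketch (print is the output function in V8's shell; substitute whatever your host provides):

    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual
    // Plane, so UTF-16 encodes it as the surrogate pair D834 DD1E.
    var clef = "\uD834\uDD1E";
    print(clef.length);          // 2: two 16-bit units, not one character
    print(clef.charAt(0));       // the high surrogate alone, not a character
    print(clef.substring(0, 1)); // splits the pair, producing exactly the
                                 // sort of garbage described above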

I've decided strings will enter Counterpart as UCS-2 by default, and if they cannot be converted cleanly to UCS-2, Counterpart will throw an exception. This guarantees valid strings because UCS-2 is a proper subset of UTF-16, which means that any string which plays by the rules of UCS-2 also plays by the rules of UTF-16. The downside here is that Counterpart does not, by default, support UTF-16, which may annoy some folks.

The first practical effect of this policy is a change to org.icongarden.fileSystem.openFile. It still has a property encoding which, if it's a string, names the encoding of the file. But if this property is instead an object, then the object may specify an encoding for both the file and the strings which will be read from the file. The file property, if present, overrides the default encoding of the file (UTF-8), and the string property, if present, overrides the default encoding of strings which will be read from the file (UCS-2). The only possibility other than the default for string is UTF-16.
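Concretely, a call under the new policy might look like this sketch (the path is illustrative, and ISO-8859-1 works only if your local iconv knows that name):

    var fs = extensions.org.icongarden.fileSystem;

    // Open a Latin-1 file but receive UTF-16 strings from read, overriding
    // both defaults (UTF-8 for the file, UCS-2 for strings).
    var file = fs.openFile(
        ["home", "me", "legacy.txt"],
        { encoding: { file: "ISO-8859-1", string: "UTF-16" } }
    );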

This decision is somewhat paternalistic; for insight into why I chose to be this way, see the aforementioned thread on the V8 list.

Sunday, October 19, 2008

debut JSON

I've checked in some JSON functionality, specifically a lightly modified version of Crockford's json2.js. Crockford's code had global effects, and I dislike that, so I made some changes. The parse and stringify functions appear within extensions.org.icongarden.json (big surprise). As well, the extension doesn't modify the prototype of Date, which means stringify doesn't convert instances of Date; Date isn't part of JSON anyway. If I end up doing anything special with Date, it won't involve altering any prototype.
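A quick sketch of what calling the codec looks like (the object's contents are illustrative):

    var json = extensions.org.icongarden.json;

    // Round-trip an object through the codec.
    var text = json.stringify({ project: "Counterpart", revision: 42 });
    var back = json.parse(text);
    print(back.project); // "Counterpart"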

There was a discussion on the V8 list recently about JSON performance, with various folk making all kinds of claims unsupported by data, so I thought I'd generate some. Note this is the farthest thing from scientific or comprehensive, but it's a start. I wrote a test script which collects a hierarchy of objects describing some sub-directories on my disk, converts it to JSON, and finally converts it back to the hierarchy it started with. Each object contains a couple of strings and a number in addition to zero or more child nodes. Here's the output of my test script (a sketch of the harness itself appears after the numbers):
  • milliseconds to enumerate: 203
  • milliseconds to stringify: 2053
  • milliseconds to parse: 1060
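The harness was essentially this shape. This is a reconstruction, not the checked-in script, and enumerateTree is a stand-in for whatever actually walks the sub-directories:

    var json = extensions.org.icongarden.json;

    // Time one operation and report it in the format shown above.
    function time(label, operation) {
        var start = new Date().getTime();
        var result = operation();
        print("milliseconds to " + label + ": " + (new Date().getTime() - start));
        return result;
    }

    var tree = time("enumerate", function () { return enumerateTree(); });
    var text = time("stringify", function () { return json.stringify(tree); });
    var back = time("parse",     function () { return json.parse(text); });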

Surprisingly, the enumerate operation is by far the fastest. It's surprising because enumerating the sub-directories involves several hundred system calls, whereas I presume the stringify and parse operations involve none (aside, perhaps, from the occasional brk). Worse, at least some of these system calls touch the disk.

Let's ignore the stringify time for a moment and focus on the parse time. It's over five times the enumerate time, which is surprising enough, but when you consider how parse works, it's even more surprising. This function is basically a validating front end for eval involving a regular expression, which makes me wonder if V8's eval or regular expression engine could be a lot faster.

Finally, let's ponder stringify. It's all JavaScript, so it makes sense that it's slower than parse, but an order of magnitude slower than the enumerate operation? That seems crazy.

I'm going to leave things the way they are now because the important thing is to establish an interface for these functions so I can build other things atop it, but I have a feeling I'll be coming back to this.

Update: I renamed parse and stringify as decode and encode, respectively. Now I can refer to the JSON extension as a "codec" and stick out my chest in some kind of geek macho display.

Monday, October 13, 2008

org.icongarden.fileSystem.remove

I've checked in org.icongarden.fileSystem.remove, which, unsurprisingly, removes entries from the file system. The path to the directory entry is specified as the function's single argument, an array in the usual style. If the entry is a directory which is not empty, the function throws an exception. Otherwise, the entry is removed. (If you need to remove a directory which contains files, use enumerateDirectory to discover all the children and then remove the deepest ones first. I'll probably write a script extension later to do this.)
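Until that extension exists, a recursive removal might look like the sketch below. I'm assuming enumerateDirectory returns objects with name and isDirectory properties; the real interface may differ.

    var fs = extensions.org.icongarden.fileSystem;

    // Remove a directory and everything beneath it, deepest entries first.
    function removeTree(path) {
        var children = fs.enumerateDirectory(path);
        for (var i = 0; i < children.length; i++) {
            var child = path.concat(children[i].name);
            if (children[i].isDirectory) {
                removeTree(child); // empty the subdirectory before removing it
            } else {
                fs.remove(child);
            }
        }
        fs.remove(path); // the directory is empty now
    }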

UNIX allows the removal of a file while it is open and Windows doesn't. Consequently, if you want to write a cross-platform script, you'll need to remove files only when you think they aren't open. (They may of course have been opened by some program other than your script, but you can only do your best and handle exceptions appropriately.)
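In practice that means wrapping removal in a try/catch, something like this sketch:

    var fs = extensions.org.icongarden.fileSystem;

    // On Windows, removing a file someone still has open throws; treat
    // that as "try again later" rather than as a fatal error.
    try {
        fs.remove(["tmp", "scratch.dat"]);
    } catch (e) {
        print("couldn't remove scratch.dat yet: " + e);
    }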

byteOffset and byteCount are -1 when a file is closed

I just checked in the fix for a bug which provides an additional "feature": Once an instance of openFile is closed, its byteOffset and byteCount properties will be -1 rather than cause an exception if you try to access them. This provides a way to test whether a given openFile is open or closed since -1 is invalid in all other cases. Thanks to a chat room whose name I may not speak aloud for enduring my confused blunderings on this.
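So a script can test for openness with something like this hypothetical helper:

    // A closed openFile reports -1 for byteOffset and byteCount, and -1
    // is invalid in every other case, so this test is unambiguous.
    function isOpen(file) {
        return file.byteCount !== -1;
    }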

Sunday, October 12, 2008

debut read

I've just checked in the read function for the openFile object, but before I describe how it works, I need to go back on a couple of things I said earlier.

I've backed away from letting the client set the encoding of the file at any time. It was annoyingly complicated, and that was enough to remind me that I'm trying to make the C++ code in this project as simple as possible and push as much complexity as possible up into JavaScript. I also could not think of a use case for a file with multiple text encodings. So now encoding is a property of the object passed to openFile, and, once a file is open, its encoding cannot be changed. (The default is UTF-8.)

I've also realized there are fewer uses for getting byteOffset. After reading from a file, byteOffset reflects what has gone on (and will go on) with the read function's buffering, which has more to do with efficiency than providing a useful value for callers. Setting byteOffset might still be useful if you're keeping track of the value yourself by means of the return value of write, but as soon as you involve read, all bets are off.

Now then, how does the read function work? The interface is pretty simple. It takes a single argument, a number, which is the count of characters the caller wants to read. The return value is a string which contains the characters. If the string contains fewer characters than were requested, it means read encountered the end of the file.
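Reading a whole file therefore looks like this sketch (the path is illustrative, and the file's encoding defaults to UTF-8):

    var fs = extensions.org.icongarden.fileSystem;

    var file = fs.openFile(["tmp", "example.txt"]); // read defaults to true
    var text = "";
    for (;;) {
        var chunk = file.read(1024); // ask for up to 1024 characters
        text += chunk;
        if (chunk.length < 1024) {   // a short read means end of file
            break;
        }
    }
    file.close();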

One disadvantage of the simple buffering approach I took is that files which are both read and written are likely to buffer spans of the file which do not correspond to disk blocks, which means the buffering won't be as quick as it might be. If anybody ends up caring about this, it's entirely fixable, but it'll require a little more complexity than I'd like at first.

Saturday, October 11, 2008

openFile tweaks

Recent minor revisions to openFile include:
  • debut of the close function
  • the write function now returns the number of bytes it wrote; keeping track of where you are in a file yourself will almost certainly be a lot faster than repeatedly polling the byteOffset property (see the sketch after this list)
  • exceptions which occur as a result of an attempt to read a file not opened for reading or write a file not opened for writing will be phrased a bit more intuitively
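Here's the sort of bookkeeping I mean, sketched (the path is illustrative):

    var fs = extensions.org.icongarden.fileSystem;

    // Keep a running byte offset from write's return value instead of
    // polling the byteOffset property after every call.
    var file = fs.openFile(["tmp", "log.txt"], { write: true, create: true });
    var offset = 0;
    offset += file.write("first record\n");
    offset += file.write("second record\n");
    // offset is now the byte position just past the second record
    file.close();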
Next, I think I will tackle read, which will involve some non-trivial buffering logic due to character encoding.

I also still needed to figure out a way to reliably trigger an openFile object to be collected as garbage so I could reliably test the callback function which destroys the corresponding C++ object. I found a way. It seems it's not enough to let a script fall through the last curly-brace and then collect the garbage. However, replacing one newly created instance of openFile with another newly created instance seems to make the first "eligible" for garbage collection. I'm not sure why V8 thinks there is a difference once the script has ended, but at least I have a way to test my weak reference callbacks.
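In script form, the trick looks like this sketch:

    var fs = extensions.org.icongarden.fileSystem;

    var file = fs.openFile(["tmp", "a.txt"], { write: true, create: true });
    file = fs.openFile(["tmp", "b.txt"], { write: true, create: true });
    // The first instance is now unreferenced; force a collection so its
    // weak-reference callback fires and the C++ object is destroyed.
    gc();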

Thursday, October 9, 2008

openFile

I've checked in a new portion of org.icongarden.fileSystem called openFile. (Actually, I checked this in a few days ago and I just haven't gotten around to writing about it yet. I really wish I had more time to work on this project!) This is the most elaborate bit of interface I've done for Counterpart so far.

openFile is a function which takes one or two arguments. The first argument is an array in the style of other functions in fileSystem, with the exception that the last member of the array is the name of a file. The second argument is an object whose properties describe how the file is to be opened. If the second argument is not present, it's as if the caller passed an empty object.

The read property is a boolean which specifies whether the caller wants to be able to read from the file once it is open. If it is absent, the assumption is that the caller does want to read from the file. Why on earth would anyone want to open a file and not read from it? Well, it really comes down to a matter of not trusting yourself. If you know your intent is to write some data to a file and close it, you can specify at the outset that you want it to be impossible for you to read from the file. This is nice to have in a large program or one that gets shuffled around a lot during development.

The write property is a boolean which specifies whether the caller wants to be able to write to the file once it is open. If this property is absent, the assumption is that the caller does not want to write to the file. It should be a little more obvious why you would want to prevent yourself from writing to a file you've opened; since a mistake could destroy data, it's nice to have a way to ensure such a mistake is impossible.

If the caller attempts to open a file for neither writing nor reading, an exception is thrown.
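A write-only open, by contrast, is perfectly legal. It looks like this sketch (the path is illustrative, and the file is assumed to exist already):

    var fs = extensions.org.icongarden.fileSystem;

    // Opened this way, the file cannot be read, even by accident.
    var out = fs.openFile(["tmp", "report.txt"], { read: false, write: true });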

The create property specifies whether a file which does not already exist should be created. That's right: one creates and opens a file in a single step. If you only want to create a file, you must endure having opened it at the same time. In combination with some additional properties, this helps make race conditions less likely when you're using the file system as a communications medium. If the create property is absent and the file does not yet exist, openFile throws an exception. If it is present, it may take several forms.

If the create property is a boolean, then it merely specifies whether the file should be created as described above. If the create property is an object, and the object is empty, it has the same effect as a boolean true.

If the create object contains a must property, this property is a boolean which specifies whether openFile must successfully create the file. In other words, if the file already exists, openFile will throw an exception. This is useful when you are using the file as a signifier for exclusive access to a directory or in some other communications scheme that requires atomic exclusion.

If the create object contains a readOnly property, this specifies that subsequent attempts to open the file for writing will fail. (On UNIX, this corresponds roughly to chmod u-w.) The readOnly property has no effect on the present call to openFile.
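Summing up, the three forms of create look like this sketch (paths illustrative):

    var fs = extensions.org.icongarden.fileSystem;

    // Plain boolean: create the file if it doesn't exist yet.
    var a = fs.openFile(["tmp", "a.txt"], { write: true, create: true });

    // must: creation has to succeed, so two racing scripts can't both
    // win; this is the lock-file case described above.
    var lock = fs.openFile(["tmp", "work.lock"],
                           { write: true, create: { must: true } });

    // readOnly: later attempts to open the file for writing will fail,
    // though this call itself may still write to it.
    var frozen = fs.openFile(["tmp", "manifest.txt"],
                             { write: true, create: { readOnly: true } });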

So let's suppose you've called openFile and it didn't throw an exception; what now? openFile has returned to its caller an object, and this object has several properties.

The simplest one is encoding. This is the name of the character encoding to use with this file. You can change encoding to any of the names supported by your local implementation of iconv. (I'll get around to making a list of common names eventually. For now, the only encoding actually supported is UTF-8, which is also the default.) When reading from or writing to the file, characters are automagically converted between this encoding and the JavaScript encoding, which is UTF-16. You can change encoding at any time, but you do need to be sure that the file you are reading or writing is encoded accordingly. (Generally, a given file will have just one encoding.)

The byteCount property is a number which specifies how many bytes are contained in the file. If you set byteCount to a smaller value, the file will get shorter, and if you set it to a larger value, the file will get longer (and zeroes will be written into any new portion). However, it can be difficult to set the correct value because many encodings have characters which may be more than one byte, and unless your code has exhaustive knowledge of the contents of the file and the byte-width of each character the file contains, it's usually impossible to know how big to make the file. There are two cases in which it's easier. First, you can always set byteCount to zero to remove all bytes from the file. And second, if you read or write the file and then remember what the value of byteCount was, you can set it back to that value later. (You can also set byteCount to the value of byteOffset, which is the next property discussed.)

The byteOffset property specifies where in the file the next read or write operation will begin. The first offset in the file is zero. You can set byteOffset to any value you like — even one beyond the end of the file — but you must exercise the same care you would with byteCount as described above. One additional safe and simple operation is to set byteOffset to byteCount. This means the next write to the file will occur at its end.
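The two safe idioms from the last couple of paragraphs, sketched:

    var fs = extensions.org.icongarden.fileSystem;
    var file = fs.openFile(["tmp", "data.txt"],
                           { read: true, write: true, create: true });

    file.byteOffset = file.byteCount; // position the next write at the end
    file.write("appended\n");

    file.byteCount = 0;               // throw away the file's contents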

Finally we come to the properties which act on the file. Unsurprisingly, they are both functions, read and write. So far, I've only implemented write; I'll describe read once it's implemented. One tricky aspect of these functions is that if the caller does not specify it wants to be able to write the file when opening it, there is no write property. Likewise, if the caller doesn't specify it wants to be able to read the file when opening it, there is no read property, though this will probably be less common. So, if and when it's inappropriate to perform an operation on a file, it's not just that you can't perform it; you can't even try to perform it. Daniel convinced me that was too clever by half because the failure mode involves tearing down the entire script unrecoverably. I had thought this was a feature, but he convinced me it was a bug. So now read and write are both exposed unconditionally, and if they are called inappropriately, they throw an exception.

The write function writes characters to the file in the file's current encoding. The characters result from converting the single argument to write into a string. The value of byteOffset will advance by the number of bytes occupied by the characters after they have been converted into the file's encoding. The value of byteCount will grow by the appropriate amount if the write operation extends beyond the end of the file.
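Note that the advance is measured in bytes of the file's encoding, not in characters; this sketch assumes the default UTF-8:

    var fs = extensions.org.icongarden.fileSystem;
    var file = fs.openFile(["tmp", "greeting.txt"], { write: true, create: true });

    // 'é' occupies two bytes in UTF-8, so writing these five characters
    // advances byteOffset by six.
    file.write("héllo");
    print(file.byteOffset); // 6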

The file will automagically be closed when its object becomes unreferenced and the garbage collector gets around to destroying it. I'm still working on getting that to happen reliably. I'll also add a close function which allows you to explicitly close the file sooner than it would otherwise.

Saturday, October 4, 2008

gc extension

For various reasons, I exposed the gc extension provided by V8. This is a global function which allows scripts to collect garbage at any time.

Normally, scripts shouldn't need to concern themselves with this. The garbage collector should decide for itself the best time to collect the garbage; that's its job. But I need it for a couple of reasons.

First, I've been meaning to have the bootstrap script collect the garbage just before returning. This script doesn't do a ton of work, but it does some, and it seems there's no point in cloning heap objects which will only get tossed out anyway.

Second, I need a way to exercise C++ code's capability to create "weak" objects. Weak objects can have callbacks such that when the garbage collector gets around to deleting them, they can release whatever native resource they might have been asked to hold. The best way I could think of to test this was to force garbage collection.

Unfortunately, the gc extension puts an object into the global namespace which doesn't behave like other objects. I found I could copy it into another object, but I couldn't delete the original. I had wanted to move it into a namespace called something like com.google.v8, but that was not to be.
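Roughly what I observed, as a sketch:

    // Copying the function into another object works fine...
    var v8 = { gc: gc };
    v8.gc(); // collects garbage just like the original

    // ...but the global property refuses to go away.
    print(delete this.gc); // false; gc remains in the global namespace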

Given that I was stuck with gc in the global namespace, my scheme to populate that space with TLDs ("org", "com", etc.) had been thwarted. So now extensions live under 'extensions'. This caused a ripple effect throughout the existing code which chewed up a good chunk of the afternoon. Sucks.

From an architectural perspective, this was something I was considering anyway. With an arbitrary collection of TLDs in the global namespace, the risk of collision with a name in a user script became non-trivial. Now user scripts will be able to know to avoid a small, well-known set of names, specifically 'gc' and 'extensions'. I may even get around to putting user scripts into their own namespace so they don't have to worry about colliding with anything at all.