Wednesday, November 5, 2008

headersComplete

Hoo boy. I have really not had much time for this project lately. A few days ago, I did manage to do yet more prep work for what will eventually become session support. It may not seem like it at first, but bear with me.

extensions.org.icongarden.http needed to be able to report whether the script has finished writing the response headers. Since these headers should be in ASCII (specifically a variant called NET-ASCII, which ends lines with a CR LF sequence), my assumption is that people will be using the writeASCII function to write headers. Into the extension I embedded a little state machine which watches characters as they pass through writeASCII and writeUTF8 and detects the sequence CR LF CR LF, which signals the end of the headers. In fact, the machine has a state for each character in that sequence, so that the script doesn't have to get this exactly right; the extension will finish up if necessary.

And when might that happen? When the script calls writeUTF8, the function checks the machine's state and supplies whatever portion of the CR LF CR LF sequence appears to have been neglected, and only then writes the UTF-8 data. It does this on the assumption that if you're writing UTF-8 you must not be writing headers any more, because headers must be ASCII. You can of course use writeASCII to write the body of your response, but if you do, you're responsible for providing the CR LF CR LF yourself.
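
Concretely, the mechanism amounts to something like this (a JavaScript sketch; the real thing is C++ and the names here are made up):

  // A JavaScript sketch of the detector; the real implementation lives in
  // C++, and HeaderTracker, observe, and finishHeaders are invented names.
  var SEQUENCE = '\r\n\r\n';

  function HeaderTracker() {
      this.matched = 0;                 // characters of CR LF CR LF seen so far
      this.headersComplete = false;
  }

  // Every character passing through writeASCII runs through this.
  HeaderTracker.prototype.observe = function (text) {
      for (var i = 0; i < text.length && !this.headersComplete; i += 1) {
          if (text.charAt(i) === SEQUENCE.charAt(this.matched)) {
              this.matched += 1;
              this.headersComplete = (this.matched === SEQUENCE.length);
          } else {
              // the run was broken; a CR might be the start of a new run
              this.matched = (text.charAt(i) === '\r') ? 1 : 0;
          }
      }
  };

  // Called when writeUTF8 sees that the headers were never finished: supply
  // whatever part of the terminator the script neglected.
  HeaderTracker.prototype.finishHeaders = function (writeRawAscii) {
      if (!this.headersComplete) {
          writeRawAscii(SEQUENCE.slice(this.matched));
          this.headersComplete = true;
      }
  };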

But the real purpose of the state machine is to enable a property called headersComplete, which is a (read-only) boolean indicating whether the state machine has detected that a script is done sending response headers. The forthcoming cookies extension will use this to verify that it can write response header lines (which will contain cookies). I expect that the cookies extension will throw an exception if headers are complete when the script tries to write cookies into the response.

This is all very low-level stuff and I expect that most scripts will use a higher-level layer written entirely in JavaScript which abstracts away these details.

Update: Ah, silly me. For some reason I got the impression that HTTP headers were in NET-ASCII, but that turns out not to be the case. At least one of the cookie response headers is specified as having UTF-8, and, really, why not? It's not as if UTF-8 will confuse anyone who consumes text as bytes, even if they think it's ASCII. I tried to take advantage of a detail which simply does not exist.

Tuesday, October 28, 2008

string encodings

I did a little research about the encodings used for JavaScript strings and contributed to a thread on the V8 list which eventually clarified some issues for me. It turns out JavaScript strings are broken in (at least) two ways:
  1. The JavaScript specification sometimes implies a string should be UTF-16 and sometimes implies a string should be UCS-2. Goofy!
  2. Each character in a UCS-2 string is a 16-bit quantity. By contrast, each character in a UTF-16 string may be a "surrogate pair" of two 16-bit quantities. Even though JavaScript sometimes implies strings should be UTF-16, its built-in support for string handling (including regular expressions) isn't savvy to surrogate pairs, which may ultimately lead to their being split or misinterpreted, producing garbage which may rattle around inside an application for a long time before anyone notices anything's amiss.
As well, I'm pretty sure there's lots of code (and coders) out there which (who) don't handle surrogate pairs correctly. (This isn't a flaw in JavaScript.)
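
A quick illustration of the second problem:

  // U+1D306 (TETRAGRAM FOR CENTRE) lies outside the Basic Multilingual
  // Plane, so a JavaScript string stores it as the surrogate pair
  // \uD834 \uDF06.
  var s = '\uD834\uDF06';
  s.length;      // 2 -- length counts 16-bit units, not characters
  s.charAt(0);   // '\uD834', a lone high surrogate, which is garbage on its own
  s.split('');   // ['\uD834', '\uDF06'] -- the pair has been split in two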

I've decided strings will enter Counterpart as UCS-2 by default, and if they cannot be converted cleanly to UCS-2, Counterpart will throw an exception. This guarantees valid strings because UCS-2 is a proper subset of UTF-16, which means that any string which plays by the rules of UCS-2 also plays by the rules of UTF-16. The downside here is that Counterpart does not, by default, support UTF-16, which may annoy some folks.

The first practical effect of this policy is a change to org.icongarden.fileSystem.openFile. The object passed to it still has an encoding property which, if it's a string, names the encoding of the file. But if this property is instead an object, then that object may specify an encoding for both the file and the strings which will be read from the file. Its file property, if present, overrides the default encoding of the file (UTF-8), and its string property, if present, overrides the default encoding of strings which will be read from the file (UCS-2). The only possibility other than the default for string is UTF-16.
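
For example (the path property and the exact spelling of the encoding names are assumptions here; only the two shapes of encoding come from the description above):

  var fs = extensions.org.icongarden.fileSystem;

  // encoding as a string: names the encoding of the file; strings read
  // from it default to UCS-2.
  var plain = fs.openFile({
      path: ['home', 'me', 'notes.txt'],   // path property assumed for the example
      encoding: 'UTF-8'
  });

  // encoding as an object: file and string encodings set independently.
  var wide = fs.openFile({
      path: ['home', 'me', 'notes.txt'],
      encoding: { file: 'UTF-8', string: 'UTF-16' }
  });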

This decision is somewhat paternalistic; for insight into why I chose to be this way, see the aforementioned thread on the V8 list.

Sunday, October 19, 2008

debut JSON

I've checked in some JSON functionality, specifically a lightly modified version of Crockford's json2.js. Crockford's code had global effects, and I dislike that, so I made some changes. The parse and stringify functions appear within extensions.org.icongarden.json (big surprise). As well, the extension doesn't modify the prototype of Date, which means stringify doesn't convert instances of Date, which isn't part of JSON anyway. If I end up doing anything special with Date, it won't involve altering any prototype.
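
Usage is what you'd expect (the object here is just a made-up example):

  var json = extensions.org.icongarden.json;

  var text = json.stringify({ name: 'counterpart', answer: 42 });
  // text is something like '{"name":"counterpart","answer":42}'
  var value = json.parse(text);   // back to an equivalent object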

There was a discussion on the V8 list recently about JSON performance, with various folk making all kinds of claims unsupported by data, so I thought I'd generate some. Note this is the farthest thing from scientific or comprehensive, but it's a start. I wrote a test script which collects a hierarchy of objects describing some sub-directories on my disk, converts it to JSON, and finally converts it back to the hierarchy it started with. Each object contains a couple of strings and a number in addition to zero or more child nodes. Here's the output of my test script (a rough sketch of the script follows the timings):
  • milliseconds to enumerate: 203
  • milliseconds to stringify: 2053
  • milliseconds to parse: 1060
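
The harness is roughly this shape (the entry properties returned by enumerateDirectory and the print function are stand-ins, so don't take the details literally):

  var fs = extensions.org.icongarden.fileSystem;
  var json = extensions.org.icongarden.json;

  // Build a hierarchy of plain objects describing a directory tree. The
  // name/isDirectory properties on each enumerated entry are stand-ins for
  // whatever enumerateDirectory actually returns.
  function enumerate(path) {
      var node = { name: path[path.length - 1], path: path.join('/'), children: [], count: 0 };
      var entries = fs.enumerateDirectory(path);
      for (var i = 0; i < entries.length; i += 1) {
          node.count += 1;
          if (entries[i].isDirectory) {
              node.children.push(enumerate(path.concat(entries[i].name)));
          }
      }
      return node;
  }

  // Time a chunk of work and report it; print stands in for whatever the
  // shell provides for output.
  function time(label, work) {
      var start = new Date().getTime();
      var result = work();
      print('milliseconds to ' + label + ': ' + (new Date().getTime() - start));
      return result;
  }

  var tree = time('enumerate', function () { return enumerate(['home', 'me', 'src']); });
  var text = time('stringify', function () { return json.stringify(tree); });
  var back = time('parse', function () { return json.parse(text); });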

Surprisingly, the enumerate operation is by far the fastest. It's surprising because enumerating the sub-directories involves several hundred system calls, whereas I presume the stringify and parse operations involve none (aside, perhaps, from the occasional brk). Worse, at least some of these system calls touch the disk.

Let's ignore the stringify time for a moment and focus on the parse time. It's over five times the enumerate time, which is surprising enough, but when you consider how parse works, it's even more surprising. This function is basically a validating front end for eval involving a regular expression, which makes me wonder if V8's eval or regular expression engine could be a lot faster.
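
For anyone who hasn't read json2.js, parse boils down to something like this (heavily condensed, not the actual source): blank out escapes and scalar tokens, check that only JSON punctuation remains, then let eval do the real parsing.

  function evalParse(text) {
      var stripped = String(text)
          .replace(/\\(?:["\\\/bfnrt]|u[0-9a-fA-F]{4})/g, '@')
          .replace(/"[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?/g, ']')
          .replace(/(?:^|:|,)(?:\s*\[)+/g, '');
      if (/^[\],:{}\s]*$/.test(stripped)) {
          return eval('(' + text + ')');
      }
      throw new SyntaxError('evalParse: not JSON');
  }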

Finally, let's ponder stringify. It's all JavaScript, so it makes sense that it's slower than parse, but an order of magnitude slower than the enumerate operation? That seems crazy.

I'm going to leave things the way they are now because the important thing is to establish an interface for these functions so I can build other things atop it, but I have a feeling I'll be coming back to this.

Update: I renamed parse and stringify as decode and encode, respectively. Now I can refer to the JSON extension as a "codec" and stick out my chest in some kind of geek macho display.

Monday, October 13, 2008

org.icongarden.fileSystem.remove

I've checked in org.icongarden.fileSystem.remove, which, unsurprisingly, removes entries from the file system. The path to the directory entry is specified as the function's single argument, in the usual array style. If the entry is a directory which is not empty, the function throws an exception. Otherwise, the entry is removed. (If you need to remove a directory which contains files, use enumerateDirectory to discover all the children and then remove the deepest ones first. I'll probably write a script extension later to do this; a sketch follows.)
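
Something like this, probably (the entry properties returned by enumerateDirectory are stand-ins):

  var fs = extensions.org.icongarden.fileSystem;

  // Remove a directory and everything beneath it, deepest entries first.
  function removeTree(path) {
      var entries = fs.enumerateDirectory(path);
      for (var i = 0; i < entries.length; i += 1) {
          var child = path.concat(entries[i].name);
          if (entries[i].isDirectory) {
              removeTree(child);
          } else {
              fs.remove(child);
          }
      }
      fs.remove(path);    // by now the directory itself is empty
  }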

UNIX allows the removal of a file while it is open and Windows doesn't. Consequently, if you want to write a cross-platform script, you'll need to remove files only when you think they aren't open. (They may of course have been opened by some program other than your script, but you can only do your best and handle exceptions appropriately.)

byteOffset and byteCount are -1 when a file is closed

I just checked in the fix for a bug, which provides an additional "feature": once an instance of openFile is closed, its byteOffset and byteCount properties will be -1 rather than raising an exception when you access them. This provides a way to test whether a given openFile is open or closed, since -1 is invalid in all other cases. Thanks to a chat room whose name I may not speak aloud for enduring my confused blunderings on this.
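
So a script can test for openness like so:

  // -1 never occurs for an open file, so it doubles as a "closed" flag.
  function isOpen(file) {
      return file.byteCount !== -1;
  }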

Sunday, October 12, 2008

debut read

I've just checked in the read function for the openFile object, but before I describe how it works, I need to go back on a couple of things I said earlier.

I've backed away from letting the client set the encoding of the file at any time. It was annoyingly complicated, and that was enough to remind me that I'm trying to make the C++ code in this project as simple as possible and push as much complexity as possible up into JavaScript. I also could not think of a use case for a file with multiple text encodings. So now encoding is a property of the object passed to openFile, and, once a file is open, its encoding cannot be changed. (The default is UTF-8.)

I've also realized there are fewer uses for getting byteOffset. After reading from a file, byteOffset reflects what has gone on (and will go on) with the read function's buffering, which has more to do with efficiency than providing a useful value for callers. Setting byteOffset might still be useful if you're keeping track of the value yourself by means of the return value of write, but as soon as you involve read, all bets are off.

Now then, how does the read function work? The interface is pretty simple. It takes a single argument, a number, which is the count of characters the caller wants to read. The return value is a string which contains the characters. If the string contains fewer characters than were expected, it means read encountered the end of the file.
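
So reading an entire file is a short loop:

  // Read to the end of the file in fixed-size chunks; a short (or empty)
  // chunk means read hit the end.
  function readAll(file) {
      var pieces = [];
      for (;;) {
          var chunk = file.read(4096);
          pieces.push(chunk);
          if (chunk.length < 4096) {
              return pieces.join('');
          }
      }
  }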

One disadvantage of the simple buffering approach I took is that files which are both read and written are likely to buffer spans of the file which do not correspond to disk blocks, which means the buffering won't be as quick as it might be. If anybody ends up caring about this, it's entirely fixable, but it'll require a little more complexity than I'd like at first.

Saturday, October 11, 2008

openFile tweaks

Recent minor revisions to openFile include:
  • debut of the close function
  • the write function now returns the number of bytes it wrote; keeping track of where you are in a file yourself will almost certainly be a lot faster than repeatedly polling the byteOffset property (see the sketch just after this list)
  • exceptions which occur as a result of an attempt to read a file not opened for reading or write a file not opened for writing will be phrased a bit more intuitively
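
For instance, to write a handful of lines while tracking the offset yourself:

  // file is an instance of openFile opened for writing; the options used
  // to open it are omitted here.
  function writeLines(file, lines) {
      var offset = 0;
      for (var i = 0; i < lines.length; i += 1) {
          offset += file.write(lines[i] + '\n');   // write returns the bytes written
      }
      return offset;    // current position, without ever touching byteOffset
  }
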
Next, I think I will tackle read, which will involve some non-trivial buffering logic due to character encoding.

I also still needed to figure out a way to reliably trigger an openFile object to be collected as garbage so I can reliably test the callback function which destroys the corresponding C++ object. I found a way: it seems it's not enough to let a script fall through the last curly brace and then collect the garbage. However, replacing one newly created instance of openFile with another newly created instance seems to make the first "eligible" for garbage collection. I'm not sure why V8 thinks there is a difference once the script has ended, but at least I have a way to test my weak reference callbacks.
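
In other words, a test script can do something like this (openFile's options are elided and the path property is an assumption):

  var fs = extensions.org.icongarden.fileSystem;

  // Create one instance, then overwrite the variable with a second one;
  // the first becomes eligible for collection.
  var file = fs.openFile({ path: ['tmp', 'gc-test-a.txt'] });   // path property assumed
  file = fs.openFile({ path: ['tmp', 'gc-test-b.txt'] });       // first instance now eligible
  // ...force a garbage collection however the embedding allows, and the
  // weak-reference callback for the first instance should fire.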