Wednesday, November 5, 2008

headersComplete

Hoo boy. I have really not had much time for this project lately. A few days ago, I did manage to do yet more prep work for what will eventually become session support. It may not seem like it at first, but bear with me.

extensions.org.icongarden.http needed to be able to report whether the script has finished writing the response headers. Since these headers should be in ASCII (specifically a variant called NET-ASCII, which ends lines with a CR LF sequence), my assumption is that people will be using the writeASCII function to write headers. Into the extension I embedded a little state machine which watches characters as they pass through writeASCII and writeUTF8 and detects the sequence CR LF CR LF, which signals the end of the headers. In fact, the machine has a state for each character in that sequence, so that the script doesn't have to get this exactly right; the extension will finish up if necessary.

And when might that happen? When the script calls writeUTF8, the function determines the machine's state and provides however many characters of the CR LF CR LF sequence which seem to have been neglected before writing any of the UTF-8 data. It does this under the assumption that if you're writing UTF-8 then you must not be writing headers any more, because headers must be ASCII. You can of course use writeASCII to write the body of your response, but if you do that then you're responsible for providing the CR LF CR LF yourself.
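
Here's a minimal JavaScript sketch of that state machine, just to make the mechanics concrete; the real machine lives in C++ inside the extension, and everything here except the CR LF CR LF sequence and the writeASCII/writeUTF8 names is illustrative:

    var TERMINATOR = "\r\n\r\n";
    var state = 0; // how many characters of CR LF CR LF have been matched

    // Called on every character passing through writeASCII (and writeUTF8).
    function watch(text) {
        for (var i = 0; i < text.length && state < 4; i++) {
            if (text.charAt(i) === TERMINATOR.charAt(state)) {
                state++;
            } else {
                state = (text.charAt(i) === "\r") ? 1 : 0;
            }
        }
    }

    // writeUTF8 would call something like this first, supplying whatever
    // portion of the terminator the script neglected to write.
    function finishHeaders(emit) {
        if (state < 4) {
            emit(TERMINATOR.substring(state));
            state = 4;
        }
    }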

But the real purpose of the state machine is to enable a property called headersComplete, which is a (read-only) boolean indicating whether the state machine has detected that a script is done sending response headers. The forthcoming cookies extension will use this to verify that it can write response header lines (which will contain cookies). I expect that the cookies extension will throw an exception if headers are complete when the script tries to write cookies into the response.
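
In other words, I expect the cookies extension to do something along these lines; setCookie is a made-up name, and only headersComplete and writeASCII come from the actual interface:

    function setCookie(name, value) {
        var http = extensions.org.icongarden.http;
        if (http.headersComplete) {
            throw new Error("headers are complete; too late to set a cookie");
        }
        http.writeASCII("Set-Cookie: " + name + "=" + value + "\r\n");
    }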

This is all very low-level stuff and I expect that most scripts will use a higher-level layer written entirely in JavaScript which abstracts away these details.

Update: Ah, silly me. For some reason I got the impression that HTTP headers were in NET-ASCII, but that turns out not to be the case. At least one of the cookie response headers is specified as having UTF-8, and, really, why not? It's not as if UTF-8 will confuse anyone who consumes text as bytes, even if they think it's ASCII. I tried to take advantage of a detail which simply does not exist.

Tuesday, October 28, 2008

string encodings

I did a little research about the encodings used for JavaScript strings and contributed to a thread on the V8 list which eventually clarified some issues for me. It turns out JavaScript strings are broken in (at least) two ways:
  1. The JavaScript specification sometimes implies a string should be UTF-16 and sometimes implies a string should be UCS-2. Goofy!
  2. Each character in a UCS-2 string is a 16-bit quantity. By contrast, each character in a UTF-16 string may be a "surrogate pair" of two 16-bit quantities. Even though JavaScript sometimes implies strings should be UTF-16, its built-in support for string handling (including regular expressions) isn't savvy to surrogate pairs, which may ultimately lead to their being split or misinterpreted, producing garbage which may rattle around inside an application for a long time before anyone notices anything's amiss.
As well, I'm pretty sure there's lots of code (and there are lots of coders) out there which doesn't (who don't) handle surrogate pairs correctly. (This isn't a flaw in JavaScript.) The sketch below shows how easily a pair gets mangled.
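
A quick illustration, using U+1D11E (MUSICAL SYMBOL G CLEF), which UTF-16 encodes as the surrogate pair 0xD834 0xDD1E:

    var clef = "\uD834\uDD1E"; // one character to a human reader
    clef.length;               // 2, because JavaScript counts 16-bit units
    clef.charAt(0);            // "\uD834", a lone high surrogate
    clef.substring(0, 1);      // an invalid string that can rattle around unnoticed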

I've decided strings will enter Counterpart as UCS-2 by default, and if they cannot be converted cleanly to UCS-2, Counterpart will throw an exception. This guarantees valid strings because UCS-2 is a proper subset of UTF-16, which means that any string which plays by the rules of UCS-2 also plays by the rules of UTF-16. The downside here is that Counterpart does not, by default, support UTF-16, which may annoy some folks.

The first practical effect of this policy is a change to org.icongarden.fileSystem.openFile. It still has a property encoding which, if it's a string, names the encoding of the file. But if this property is instead an object, then the object may specify an encoding for both the file and the strings which will be read from the file. The file property, if present, overrides the default encoding of the file (UTF-8), and the string property, if present, overrides the default encoding of strings which will be read from the file (UCS-2). The only possibility other than the default for string is UTF-16.
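
For example (the paths and encoding names here are made up, and which names work depends on the local iconv):

    var fs = extensions.org.icongarden.fileSystem;

    // As before: a string names the encoding of the file itself.
    var a = fs.openFile(["", "tmp", "legacy.txt"], { encoding: "ISO-8859-1" });

    // New: an object may name both encodings.
    var b = fs.openFile(["", "tmp", "legacy.txt"], {
        encoding: {
            file: "ISO-8859-1", // overrides the file default, UTF-8
            string: "UTF-16"    // overrides the string default, UCS-2
        }
    });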

This decision is somewhat paternalistic; for insight into why I chose to be this way, see the aforementioned thread on the V8 list.

Sunday, October 19, 2008

debut JSON

I've checked in some JSON functionality, specifically a lightly modified version of Crockford's json2.js. Crockford's code had global effects, and I dislike that, so I made some changes. The parse and stringify functions appear within extensions.org.icongarden.json (big surprise). As well, the extension doesn't modify the prototype of Date, which means stringify doesn't convert instances of Date, which isn't part of JSON anyway. If I end up doing anything special with Date, it won't involve altering any prototype.
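
Usage is what you'd expect; here's a round trip (the object contents are made up):

    var json = extensions.org.icongarden.json;
    var text = json.stringify({ name: "counterpart", checkins: 14 });
    var tree = json.parse(text); // structurally equal to the original object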

There was a discussion on the V8 list recently about JSON performance, with various folk making all kinds of claims unsupported by data, so I thought I'd generate some. Note this is the farthest thing from scientific or comprehensive, but it's a start. I wrote a test script which collects a hierarchy of objects describing some sub-directories on my disk, converts it to JSON, and finally converts it back to the hierarchy it started with. Each object contains a couple of strings and a number in addition to zero or more child nodes. Here's the output of my test script:
  • milliseconds to enumerate: 203
  • milliseconds to stringify: 2053
  • milliseconds to parse: 1060

Surprisingly, the enumerate operation is by far the fastest. It's surprising because enumerating the sub-directories involves several hundred system calls, whereas I presume the stringify and parse operations involve none (aside, perhaps, from the occasional brk). Worse, at least some of these system calls touch the disk.

Let's ignore the stringify time for a moment and focus on the parse time. It's over five times the enumerate time, which is surprising enough, but when you consider how parse works, it's even more surprising. This function is basically a validating front end for eval: a regular expression checks that the text is safe to evaluate, and then eval does the actual parsing. That makes me wonder whether V8's eval or its regular expression engine could be a lot faster.

Finally, let's ponder stringify. Unlike parse, it does all its work in JavaScript rather than leaning on eval, so it makes sense that it's slower, but an order of magnitude slower than the enumerate operation? That seems crazy.

I'm going to leave things the way they are now because the important thing is to establish an interface for these functions so I can build other things atop it, but I have a feeling I'll be coming back to this.

Update: I renamed parse and stringify as decode and encode, respectively. Now I can refer to the JSON extension as a "codec" and stick out my chest in some kind of geek macho display.

Monday, October 13, 2008

org.icongarden.fileSystem.remove

I've checked in org.icongarden.fileSystem.remove, which, unsurprisingly, removes entries from the file system. The path to the directory entry is specified as the function's single argument, an array in the usual style. If the entry is a directory which is not empty, the function throws an exception. Otherwise, the entry is removed. (If you need to remove a directory which contains files, use enumerateDirectory to discover all the children and then remove the deepest ones first. I'll probably write a script extension later to do this.)
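
Here's roughly what I have in mind for that script extension, as a sketch; it leans on the recursive form of enumerateDirectory described in the September 29 entry below:

    function removeTree(path) {
        var fs = extensions.org.icongarden.fileSystem;
        function descend(prefix, entries) {
            for (var i = 0; i < entries.length; i++) {
                var child = prefix.concat([entries[i].name]);
                if (entries[i].type === "directory") {
                    descend(child, entries[i].children); // empty it first
                }
                fs.remove(child);
            }
        }
        descend(path, fs.enumerateDirectory(path, true));
        fs.remove(path); // finally, remove the now-empty root
    }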

UNIX allows the removal of a file while it is open and Windows doesn't. Consequently, if you want to write a cross-platform script, you'll need to remove files only when you think they aren't open. (They may of course have been opened by some program other than your script, but you can only do your best and handle exceptions appropriately.)

byteOffset and byteCount are -1 when a file is closed

I just checked in the fix for a bug which provides an additional "feature": Once an instance of openFile is closed, its byteOffset and byteCount properties will be -1 rather than cause an exception if you try to access them. This provides a way to test whether a given openFile is open or closed since -1 is invalid in all other cases. Thanks to a chat room whose name I may not speak aloud for enduring my confused blunderings on this.
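
So a script can test openness with nothing more than this:

    function isOpen(file) {
        return file.byteCount !== -1; // -1 is impossible for an open file
    }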

Sunday, October 12, 2008

debut read

I've just checked in the read function for the openFile object, but before I describe how it works, I need to go back on a couple of things I said earlier.

I've backed away from letting the client set the encoding of the file at any time. It was annoyingly complicated, and that was enough to remind me that I'm trying to make the C++ code in this project as simple as possible and push as much complexity as possible up into JavaScript. I also could not think of a use case for a file with multiple text encodings. So now encoding is a property of the object passed to openFile, and, once a file is open, its encoding cannot be changed. (The default is UTF-8.)

I've also realized there are fewer uses for getting byteOffset. After reading from a file, byteOffset reflects what has gone on (and will go on) with the read function's buffering, which has more to do with efficiency than providing a useful value for callers. Setting byteOffset might still be useful if you're keeping track of the value yourself by means of the return value of write, but as soon as you involve read, all bets are off.

Now then, how does the read function work? The interface is pretty simple. It takes a single argument, a number, which is the count of characters the caller wants to read. The return value is a string which contains the characters. If the string contains fewer characters than were expected, it means read encountered the end of the file.
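
So reading a whole file looks something like this (the path is made up):

    var file = extensions.org.icongarden.fileSystem.openFile(["", "tmp", "notes.txt"]);
    var text = "";
    for (;;) {
        var chunk = file.read(1024);    // ask for up to 1024 characters
        text += chunk;
        if (chunk.length < 1024) break; // a short read means end of file
    }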

One disadvantage of the simple buffering approach I took is that files which are both read and written are likely to buffer spans of the file which do not correspond to disk blocks, which means the buffering won't be as quick as it might be. If anybody ends up caring about this, it's entirely fixable, but it'll require a little more complexity than I'd like at first.

Saturday, October 11, 2008

openFile tweaks

Recent minor revisions to openFile include:
  • debut of the close function
  • the write function now returns the number of bytes it wrote; keeping track of where you are in a file yourself will almost certainly be a lot faster than repeatedly polling the byteOffset property (see the sketch after this list)
  • exceptions which occur as a result of an attempt to read a file not opened for reading or write a file not opened for writing will be phrased a bit more intuitively
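
Keeping track of the offset yourself now looks like this (the path is made up):

    var fs = extensions.org.icongarden.fileSystem;
    var file = fs.openFile(["", "tmp", "out.txt"], { write: true, create: true });
    var offset = 0;
    offset += file.write("first record\n");
    offset += file.write("second record\n"); // no byteOffset polling required
    file.close();
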
Next, I think I will tackle read, which will involve some non-trivial buffering logic due to character encoding.

I also still need to figure out a way to reliably trigger an openFile object to be collected as garbage so I can reliably test the callback function which destroys the corresponding C++ object. I found a way: it seems it's not enough to let a script fall through the last curly-brace and then collect the garbage, but replacing one newly created instance of openFile with another newly created instance seems to make the first "eligible" for garbage collection. I'm not sure why V8 thinks there is a difference once the script has ended, but at least I have a way to test my weak reference callbacks.
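
Concretely, the pattern that works looks like this (the paths are made up):

    var fs = extensions.org.icongarden.fileSystem;
    var file = fs.openFile(["", "tmp", "first"], { write: true, create: true });
    file = fs.openFile(["", "tmp", "second"], { write: true, create: true });
    gc(); // the first instance is now eligible; its weak callback should fire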

Thursday, October 9, 2008

openFile

I've checked in a new portion of org.icongarden.fileSystem called openFile. (Actually, I checked this in a few days ago and I just haven't gotten around to writing about it yet. I really wish I had more time to work on this project!) This is the most elaborate bit of interface I've done for Counterpart so far.

openFile is a function which takes one or two arguments. The first argument is an array in the style of other functions in fileSystem, with the exception that the last member of the array is the name of a file. The second argument is an object whose properties describe how the file is to be opened. If the second argument is not present, it's as if the caller passed an empty object.

The read property is a boolean which specifies whether the caller wants to be able to read from the file once it is open. If it is absent, the assumption is that the caller does want to read from the file. Why on earth would anyone want to open a file and not read from it? Well, it really comes down to a matter of not trusting yourself. If you know your intent is to write some data to a file and close it, you can specify at the outset that you want it to be impossible for you to read from the file. This is nice to have in a large program or one that gets shuffled around a lot during development.

The write property is a boolean which specifies whether the caller wants to be able to write to the file once it is open. If this property is absent, the assumption is that the caller does not want to write to the file. It should be a little more obvious why you would want to prevent yourself from writing to a file you've opened; since a mistake could destroy data, it's nice to have a way to ensure such a mistake is impossible.

If the caller attempts to open a file for neither writing nor reading, an exception is thrown.
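
For example, a script that only appends to a journal might protect itself like this (the path is made up):

    var file = extensions.org.icongarden.fileSystem.openFile(
        ["", "home", "me", "journal.txt"],
        { read: false, write: true }
    );
    // Any attempt to read this file is now guaranteed to fail.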

The create property specifies whether a file which does not already exist should be created. That's right: one creates and opens a file in a single step. If you only want to create a file, you must endure having opened it at the same time. In combination with some additional properties, this helps make race conditions less likely when you're using the file system as a communications medium. If the create property is absent and the file does not yet exist, openFile throws an exception. If it is present, it may take several forms.

If the create property is a boolean, then it merely specifies whether the file should be created as described above. If the create property is an object, and the object is empty, it has the same effect as a boolean true.

If the create object contains a must property, this property is a boolean which specifies whether openFile must successfully create the file. In other words, if the file already exists, openFile will throw an exception. This is useful when you are using the file as a signifier for exclusive access to a directory or in some other communications scheme that requires atomic exclusion.

If the create object contains a readOnly property, this specifies that subsequent attempts to open the file for writing will fail. (On UNIX, this corresponds roughly to chmod u-w.) The readOnly property has no effect on the present call to openFile.
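
Summing up the three forms of create (the paths are made up):

    var fs = extensions.org.icongarden.fileSystem;

    // boolean: create the file if it doesn't already exist
    fs.openFile(["", "tmp", "a"], { write: true, create: true });

    // empty object: same as true
    fs.openFile(["", "tmp", "b"], { write: true, create: {} });

    // must: throw unless this call actually created the file (a crude lock)
    fs.openFile(["", "tmp", "app.lock"], { write: true, create: { must: true } });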

So let's suppose you've called openFile and it didn't throw an exception; what now? openFile has returned to its caller an object, and this object has several properties.

The simplest one is encoding. This is the name of the character encoding to use with this file. You can change encoding to any of the names supported by your local implementation of iconv. (I'll get around to making a list of common names eventually. For now, the only encoding actually supported is UTF-8, which is also the default.) When reading from or writing to the file, characters are automagically converted between this encoding and the JavaScript encoding, which is UTF-16. You can change encoding at any time, but you do need to be sure that the file you are reading or writing is encoded accordingly. (Generally, a given file will have just one encoding.)

The byteCount property is a number which specifies how many bytes are contained in the file. If you set byteCount to a smaller value, the file will get shorter, and if you set it to a larger value, the file will get longer (and zeroes will be written into any new portion). However, it can be difficult to set the correct value because many encodings have characters which may be more than one byte, and unless your code has exhaustive knowledge of the contents of the file and the byte-width of each character the file contains, it's usually impossible to know how big to make the file. There are two cases in which it's easier. First, you can always set byteCount to zero to remove all bytes from the file. And second, if you read or write the file and then remember what the value of byteCount was, you can set it back to that value later. (You can also set byteCount to the value of byteOffset, which is the next property discussed.)

The byteOffset property specifies where in the file the next read or write operation will begin. The first offset in the file is zero. You can set byteOffset to any value you like — even one beyond the end of the file — but you must exercise the same care you would with byteCount as described above. One additional safe and simple operation is to set byteOffset to byteCount. This means the next write to the file will occur at its end.
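
Given an open file object, the two safe operations just described look like this:

    file.byteCount = 0;               // truncate: remove every byte
    file.byteOffset = file.byteCount; // seek to the end; the next write appends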

Finally we come to the properties which act on the file. Unsurprisingly, they are both functions, read and write. So far, I've only implemented write; I'll describe read once it's implemented. One tricky aspect of my original design was that if the caller didn't specify it wanted to be able to write the file when opening it, there was no write property. Likewise, if the caller didn't specify it wanted to be able to read the file when opening it, there was no read property, though that would probably be less common. So, if and when it was inappropriate to perform an operation on a file, it wasn't just that you couldn't perform it; you couldn't even try to perform it. Daniel convinced me that was too clever by half because the failure mode involves tearing down the entire script unrecoverably. I had thought this was a feature, but he convinced me it was a bug. So now read and write are both exposed unconditionally, and if they are called inappropriately, they throw an exception.

The write function writes characters to the file in the file's current encoding. The characters result from converting the single argument to write into a string. The value of byteOffset will advance by the number of bytes occupied by the characters after they have been converted into the file's encoding. The value of byteCount will advance by an appropriate amount if the write operation would have extended beyond the end of the file.

The file will automagically be closed when its object becomes unreferenced and the garbage collector gets around to destroying it. I'm still working on getting that to happen reliably. I'll also add a close function which allows you to explicitly close the file sooner than it would otherwise.

Saturday, October 4, 2008

gc extension

For various reasons, I exposed the gc extension provided by V8. This is a global function which allows scripts to collect garbage at any time.

Normally, scripts shouldn't need to concern themselves with this. The garbage collector should decide for itself the best time to collect the garbage; that's its job. But I need it for a couple of reasons.

First, I've been meaning to have the bootstrap script collect the garbage just before returning. This script doesn't do a ton of work, but it does some, and it seems there's no point in cloning heap objects which will only get tossed out anyway.

Second, I need a way to exercise C++ code's capability to create "weak" objects. Weak objects can have callbacks such that when the garbage collector gets around to deleting them, they can release whatever native resource they might have been asked to hold. The best way I could think of to test this was to force garbage collection.

Unfortunately, the gc extension puts an object into the global namespace which doesn't behave like other objects. I found I could copy it into another object, but I couldn't delete the original. I had wanted to move it into a namespace called something like com.google.v8, but that was not to be.

Given that I was stuck with gc in the global namespace, my scheme to populate that space with TLDs ("org", "com", etc.) had been thwarted. So now extensions live under 'extensions'. This caused a ripple effect throughout the existing code which chewed up a good chunk of the afternoon. Sucks.

From an architectural perspective, this was something I was considering anyway. With an arbitrary collection of TLDs in the global namespace, the risk of collision with a name in a user script became non-trivial. Now user scripts will be able to know to avoid a small, well-known set of names, specifically 'gc' and 'extensions'. I may even get around to putting user scripts into their own namespace so they don't have to worry about colliding with anything at all.

Monday, September 29, 2008

enumerateDirectory recursively

I've just checked in a revision to org.icongarden.fileSystem.enumerateDirectory which takes one or two parameters. The first argument is the same as before, and the second optional argument is a boolean indicating whether the caller wishes the method to descend into child directories. If so, then each returned object whose property type has the value "directory" will have an additional property children whose value is an array of zero or more objects with the same set of properties as its parent.
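
For instance, counting the regular files in a tree looks like this (the path is made up):

    var entries = org.icongarden.fileSystem.enumerateDirectory(["", "home", "me"], true);

    function countFiles(entries) {
        var n = 0;
        for (var i = 0; i < entries.length; i++) {
            if (entries[i].type === "directory") {
                n += countFiles(entries[i].children);
            } else if (entries[i].type === "regular file") {
                n++;
            }
        }
        return n;
    }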

debut enumerateDirectory and some housekeeping

I've just checked in the method org.icongarden.fileSystem.enumerateDirectory. Like its sibling setWorkingDirectory, it takes a single parameter which must be an array of pathname components as returned by its sibling getWorkingDirectory. It returns an array of objects whose properties name, type, and inode describe each directory entry. If I end up supporting Windows, inode won't be present. type has one of the following values:
  • (unknown to extension)
  • (unknown to system)
  • regular file
  • directory
  • named pipe
  • socket
  • character device
  • block device
  • symbolic link
On Windows, not all of these will be possibilities.

I've retired org.icongarden.workingDirectory, and org.icongarden.fileSystem will have a much broader scope. I always intended to have an org.icongarden.fileSystem; I don't know why I got distracted with org.icongarden.workingDirectory.

I also tweaked the way exceptions are thrown in C++ so it's easier to "demand" that conditions be true in a way which causes a JavaScript exception if they're not. Making this super-convenient is key to actually doing it often enough to be useful to JavaScript programmers.

getWorkingDirectory and setWorkingDirectory

I did end up exposing the working directory as an array of strings as planned. However, I found the syntax for exposing that array directly to be ugly and awkward and confusing and cumbersome, so I opted for a pair of methods, org.icongarden.fileSystem.getWorkingDirectory and org.icongarden.fileSystem.setWorkingDirectory. If the first array element is empty, it means the array represents an absolute ("full") path; I expect to use this convention elsewhere within this particular extension. getWorkingDirectory always returns an absolute path, but you can pass a relative path to setWorkingDirectory and it will figure out the right thing to do. Neither of these functions handles duplicate slashes or "..", since they are unneeded; these conventions result from storing paths as strings (which is wrong, in my view). I'll probably end up having to deal with these cases eventually anyway for the sake of compatibility.
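
In practice (the paths are made up):

    var fs = org.icongarden.fileSystem;
    var cwd = fs.getWorkingDirectory();         // e.g. ["", "home", "me"]
    fs.setWorkingDirectory(["projects"]);       // relative: resolved against the current directory
    fs.setWorkingDirectory(["", "var", "www"]); // absolute: the first element is empty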

Sunday, September 28, 2008

fork on Win32

I got some questions in a chat room the other day which prompted me to realize I'm using fork in a way that's completely within specification but uncommon. I decided I had better understand fork on Windows before going too much farther. And it looks as if I'm screwed.

Most people call fork as a means to an end. They want to start another process or they want some concurrency. I call fork as an end in itself; I want precisely what it does, no more and no less. I want a clone of the current process, including the address space, so I can drop privileges and run to completion as a regular (non-root) user.

It turns out Windows doesn't support fork. At all.

If you're one of those people who call fork as a means to an end, you have alternatives. If you were just going to call exec after fork, then replace both calls with CreateProcess. If you just wanted some concurrency, then call CreateThread instead.

But I'm not one of those people. I actually want fork to fork. So I'm screwed.

The question now is what to do about it. Do I abandon support for Windows? Or do I re-architect into separate processes? I am frankly leaning toward ditching Windows support, since I would never in a million years use it myself, and since fork is super-cheap on Linux, which is key to being able to claim Counterpart is fast.

Update: It seems silly to kick Windows to the curb just because I'll have to write some different engine code that will be slower. The valuable code is the extensions. At worst, the Windows implementation can be a regular CGI program, and someone who cares enough will come along and figure out how to make it fast. That someone might even be me.

rethought workingDirectory

Overnight, I rethought how workingDirectory should work. One of my big-picture goals here is to expose things in a way which will make sense to JavaScript programmers rather than C++ programmers. To that end, I think the current working directory should be exposed as an array of pathname components rather than a set of functions. This will not only eliminate the need to understand platform differences but will also provide a single point of interface. I will need to figure out how to convince V8 to watch for changes to this array after I expose it, but I suspect that's mostly a matter of research rather than development. The enumerate functionality shouldn't be strongly associated with working directories. I think I'll probably go back to fleshing out org.icongarden.fileSystem.

Saturday, September 27, 2008

debut org.icongarden.workingDirectory

I haven't had much time to work on Counterpart lately, but I did get a few hours this afternoon to throw at it, and I've checked in the beginning of the org.icongarden.workingDirectory extension. For now, it has two methods:
  • get returns a string containing a native path to the current working directory.
  • enumerate returns an array containing the names of the directory entries contained by the current working directory. In the near future, this array will contain objects instead of strings, and each object will describe various other aspects of the corresponding directory entry, such as its type and its modification date.
Both these methods trust the environment variable PWD if it's present, which is normally the case because Counterpart sets it just before calling user scripts. However, if it's not set, these methods will figure out what it ought to be and set it before returning so they will run faster in the future. Once I finish enumerate, I plan to add two more methods:
  • descend changes the working directory to the child of the current working directory specified in its single parameter.
  • ascend changes the working directory to the parent of the current working directory.
This should be enough for the kinds of server-side scripts I have written in the past. The super-simple API is designed to insulate the client programmer from having to know too much about how file systems on specific platforms are organized.
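
As they stand, the two methods look like this in use (the example results are made up):

    var path = org.icongarden.workingDirectory.get();        // e.g. "/home/me"
    var names = org.icongarden.workingDirectory.enumerate(); // e.g. ["notes.txt", "src"]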

These methods are implemented in C++. I may eventually build a script-based front end to them which knows how to explore file systems and/or represent paths (without separator characters, of course).

logo

[image: the Counterpart project logo]
Above is the logo for the project. I posted it here in hopes of being able to refer to it from the project site, but it appears that does not work; perhaps blogspot denies hot-linking. In any case, it's now archived here in case I get run over by a truck.