Impossible Day
2024-10-03
Impossible day is an Recurse event where you have one day to hack together a project that feels… well… impossible.
This is a whole day to spend on something that is well beyond the edge of your abilities — something that feels totally impossible right now.
Recap
Impossible day is over now. I made decent headway into exploring stable hashing for JS code, but there’s a lot further to go there. The next big chunk of work would be creating a means of loading the JS in given a specific hash and then utilizing that in an editor to do Unison style write-labels-read-hashes. I’ll very likely revisit this idea and see how far I can take it. If you were inspired by this project or want me to continue pushing it further let me know.
Idea
I want to build a content-addressable module system for JavaScript. This project is inspired by Unison’s content addressability. I’m pretty sure this is generally impossible, but I’m going to take a good whack at it.
The principle is this: given two bits of javascript code that are functionality the same (even if syntactically different) they should hash to the same thing. I should then be able to request that hash from a module system and get the related code. These aren’t modules in the traditional sense, they’re just bits of code. I may need to add some base types for the hashes (e.g. this hash refers to a function)
Starting point
I want to use Deno for this but I also want people to be able to play with it so I’ll probably just build it in val.town. I’m going to embed it below so you can see my live progress. Just refresh the page.
Okay, so to start:
- write a simple hash function
- write a function to transform a JS string to an AST
- write the base content hashing algorithm
Hashing
I don’t really want to do anything fancy for hashing so I’m just using a SHA256 hashing function. Deno uses the same crypto.subtle
API that browsers expose, but you’ve got to do a little extra faffing.
const hash = (content: string) =>
crypto.subtle
.digest("SHA-256", new TextEncoder().encode(content))
.then((b) => new Uint8Array(b))
.then(Array.from)
.then((a) => a.map((b: any) => b.toString(16).padStart(2, "0")).join(""));
JS to AST
I’m going to assume knowledge of ASTs throughout this.
Content-hashing a string of JS wouldn’t be very useful because little things like whitespace would result in different hashes. Converting to the AST removes that variance and gives me the opportunity to apply other normalizations to further reduce variance. My immediate goal is just to get straight AST hashing done though, so that’s where we start.
Content Hashing
I’m using acorn for AST generation. The absolute simplest content hashing is JS string -> AST -> string -> hash.
Here’s my base test case:
const variantA = js`console.log('hello world!')`;
const variantB = js`console.log( "hello world!" );`;
assertEqual(variantA, variantB);
These two JS variants pretty similar with small differences (quotes, spaces, semicolons). I didn’t think any of those changes were structurally important, so I expected the ASTs to be the same. They’re actually not though. Here’s the output of a basic assert:
[Diff] + Actual / - Expected
Node {
body: [
Node {
- end: 27,
+ end: 30,
expression: Node {
arguments: [
Node {
- end: 26,
- raw: "'hello world!'",
- start: 12,
+ end: 27,
+ raw: '"hello world!"',
+ start: 13,
type: "Literal",
value: "hello world!",
},
],
callee: Node {
computed: false,
end: 11,
object: Node {
end: 7,
name: "console",
start: 0,
type: "Identifier",
},
optional: false,
property: Node {
end: 11,
name: "log",
start: 8,
type: "Identifier",
},
start: 0,
type: "MemberExpression",
},
- end: 27,
+ end: 29,
optional: false,
start: 0,
type: "CallExpression",
},
start: 0,
type: "ExpressionStatement",
},
],
- end: 27,
+ end: 30,
sourceType: "script",
start: 0,
type: "Program",
}
The start
and end
values are mostly noise for my purpose, so I’ll actually need to remove them, but that’s not a big deal. What’s really surprising to me though is the raw
differences between the two strings. I mean, I guess it shouldn’t be. "
and '
do have semantic differences. Instead of programmatically handling that myself, I’m thinking about taking the lazy way out and just sending the content through something like prettier. It’ll do some unnecessary formatting, but it’ll do some of its own normalization without me having to think too much about it.
Edit: Upon revisiting this I realized that raw
was likely just safe to delete too. So I’ve nixed that which gets me to a place where my naive algorithm works.
Naive Normalization
I’ve implemented two basic things:
- A
walk
function that can generically walk an object (in this case the AST) and modify it - A
normalize
function to start massaging the AST into something stable for hashing
During normalization, we basically are just deleting anything that’s not semantically necessary.
Code Chunking
When considering content addressing bits of code, it doesn’t really make sense (to me) to have multiple top level declarations to be hashed up together. Said differently, when passing a string of JavaScript to the js
template string function, that could contain multiple top level things: variables, functions, classes, etc. We’re passing whole JS programs. The next step then, is to break up these programs into rule level declarations and do the hashing from there.
The Program
root node has a body
key that’s an array of expressions or declarations. I just split the body and hash each declaration individually.
const normalizedASTs = ast.body
.map((chunk) => structuredClone(chunk))
.map((chunk) => {
walk(chunk, normalize);
return chunk;
});
Instead of just returning a single normalized AST and hash, I return a contentMap
which is each an object where the key is the hash and the value is the normalized AST.
Serialization
I’m realizing it may be necessary to retain the fields from the normalized AST to get code back out of the content map. Ultimately the content-addressable system isn’t really valuable if it’s only one way. I guess the next step will be figuring out serialization of the AST.
Edit: Well, a quick test using a library called astring
indicates that it may not be important that the raw
, start
, and end
properties are retained. Good to know! Love it when something turns out to be simpler than expected.