Building a Language Server for Typical

2024-09-09

This document is a work in progress

I’ve pushed up a WIP LSP for this on GitHub.

Introduction

As I’m working on building out my Deno Desktop Framework, I’ve been eyeing Typical as a serialization library for communicating between TypeScript and Rust. There’s a lot that I like about it, and even though I technically won’t need the asymmetric fields, it still solves the basic problem I have: serialization between Rust and TypeScript.

Typical has its own DSL that looks a little something like this:

struct SendEmailRequest {
    to: String = 0
    subject: String = 1
    body: String = 2
}

choice SendEmailResponse {
    success = 0
    error: String = 1
}

It’d be nice to have syntax highlighting and it’d be doubly nice to have a language server. Obviously a tall ask for a one-off DSL, right?… right?

Okay, no, I’m not writing that by hand. Thankfully I don’t have to. There’s a project called Langium that takes in an EBNF-style grammar and generates a language server from it. They’ve got a playground that I plugged some Typical syntax into and (mostly) figured out how to write the grammar for. The one thing that’s not really working as I expect is the path string for the import. For the PATH terminal I ended up having to put the single quotes inside the terminal instead of having them stand alone, because it wasn’t parsing / and . correctly otherwise. Probably a bug with the lexer? Anyway, I digress.

Let’s explore how I used Langium to build out the LSP.

Writing the Grammar

First things first, we need the EBNF grammar that Langium uses for its base generation. Here’s what I came up with:

grammar Typical

entry Schema:
    (imports+=Import | declarations+=Declaration)*;

Import:
    'import' path=PATH ('as' alias=ID)?;

Declaration:
    variant=('struct' | 'choice') name=ID '{' 
        (fields+=Field | deleted+=Deleted)+
    '}';

Field:
    (Rule)? name=ID (':' Type)? '=' index=INDEX;

Deleted:
    'deleted' indexes+=INDEX*;

fragment Rule:
    rule=('asymmetric' | 'optional' | 'required');

fragment Type:
    type=('String' | 'Bool' | 'Bytes' | 'F64' | 'S64' | 'U64' | 'Unit' | ArrayType | CustomType);

ArrayType:
    '[' Type ']';

CustomType:
    (module=ID '.')? type=[Declaration:ID];

terminal ID: /[_a-zA-Z][\w_]*/;
terminal PATH: /'[\w_\/\.-]*'/;
terminal INDEX returns number: /[0-9][\w_]*/;

hidden terminal COMMENT: /#[^\n\r]*/;
hidden terminal WS: /\s+/;

You can play with the grammar and see the AST it would generate in the playground.

I worked through the Schema to see how Stephan is constructing the AST internally. Let’s break it down by type.

Schema

This just represents the entire Typical document. From the source it’s just described as a struct of comments, imports, and declarations.

//https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L21
pub struct Schema {
    pub comment: Vec<String>,
    pub imports: BTreeMap<Identifier, Import>,
    pub declarations: Vec<Declaration>,
}

In an EBNF you need an entry point for the grammar.

entry Schema:
    (imports+=Import | declarations+=Declaration)*;

It’s fairly straightforward even if it looks a little confusing. Schema is zero or more (via the *) imports or declarations. Because both are assigned with +=, imports and declarations end up being arrays of those types.
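As a rough illustration of what "arrays of those types" means, here is a minimal sketch of the resulting AST shape. The interfaces are hypothetical and simplified; the code Langium actually generates differs in detail (for example, it adds bookkeeping properties such as $type).

```typescript
// Hypothetical, simplified AST shapes for illustration only.
interface Import {
  path: string; // still wrapped in single quotes, as matched by PATH
  alias?: string;
}

interface Declaration {
  variant: "struct" | "choice";
  name: string;
}

interface Schema {
  imports: Import[];
  declarations: Declaration[];
}

// Roughly what parsing the following input would produce (fields omitted):
//   import 'bar.t' as baz
//   struct Task { done: Bool = 0 }
const schema: Schema = {
  imports: [{ path: "'bar.t'", alias: "baz" }],
  declarations: [{ variant: "struct", name: "Task" }],
};

console.log(schema.imports.length, schema.declarations.length);
```

Since imports and declarations land in separate arrays, their order relative to each other is not visible from these two properties alone.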

Note that comments aren’t included in this. Comments are defined as hidden terminals, which Langium’s lexer handles specially. The definition for comments is at the bottom of the file:

hidden terminal COMMENT: /#[^\n\r]*/;

Let’s look at the grammar for imports and declarations in turn.

Import

You can import Typical types from a different file.

import 'foo.t'
import 'bar.t' as baz

In the above example, foo could be used to reference types from the foo.t file. So if there was a struct called Test it could be accessed by foo.Test. The import from bar.t has an alias of baz so if the same Test struct was in bar.t it’d be accessed by baz.Test.
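The naming rule above can be sketched as a small function. This is an illustration only: importNamespace is a hypothetical helper, and the assumption that a path with directories contributes just its file name (minus the extension) is not confirmed by anything in this post.

```typescript
// Derive the name used to qualify types that come from an import.
// Assumption: without an alias, the name is the file name minus its extension.
function importNamespace(path: string, alias?: string): string {
  if (alias !== undefined) {
    return alias;
  }
  const fileName = path.split("/").pop() ?? path;
  const dot = fileName.lastIndexOf(".");
  return dot > 0 ? fileName.slice(0, dot) : fileName;
}

console.log(importNamespace("foo.t"));        // import 'foo.t'
console.log(importNamespace("bar.t", "baz")); // import 'bar.t' as baz
```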

Here’s Typical’s Rust representation of the import.

//https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L28
pub struct Import {
    pub source_range: SourceRange,
    pub path: PathBuf, // The literal path
    pub namespace: Option<Namespace>, // A normalized form of the path
}

//https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L87C1-L93C2
pub struct Namespace {
    // This is a representation of a path to a schema, relative to the base directory, i.e., the
    // parent of the schema path provided by the user. However, it differs from paths as follows:
    // - It doesn't include the file extension in the final component.
    // - It can only contain "normal" path components. For example, `.` and `..` are not allowed.
    pub components: Vec<Identifier>,
}

Something about the above that I haven’t quite figured out is how it stores the import alias. At first I thought that’s what the namespace was, but that’s just path components. I’m not too sure and haven’t spent the time tracing the execution logic to see where it’s grabbed from. Regardless, it’ll need to be represented in my grammar.

Import:
    'import' path=PATH ('as' alias=ID)?;

So this parses the literal import, a property path of type PATH (specified below), and optionally the literal as followed by a property alias of type ID.

PATH and ID are terminals, kind of like comments, except they’re not hidden terminals. These are patterns that can be captured to represent some type in the code. Langium strays a bit in how you define terminals by allowing you to just write them as regular expressions, which is what I did. I was never really able to specify the PATH terminal without the single quotes that surround the import file path. At first I put them directly in the import definition, but that didn’t really work; I think it expected whitespace between the quotes and the other text. Regardless, where I landed works fine. It just means I end up needing to strip out the first and the last character when manipulating the path later.

terminal ID: /[_a-zA-Z][\w_]*/;
terminal PATH: /'[\w_\/\.-]*'/;
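Stripping the quotes that the PATH terminal captures could look something like the following. This is a minimal sketch; unquotePath is a hypothetical helper name, not something Langium provides.

```typescript
// Remove the surrounding single quotes that the PATH terminal includes.
// Input that is not quoted is returned unchanged.
function unquotePath(token: string): string {
  const quoted =
    token.length >= 2 && token.startsWith("'") && token.endsWith("'");
  return quoted ? token.slice(1, -1) : token;
}

console.log(unquotePath("'foo.t'"));
```

Checking for the quotes before slicing keeps the helper safe to call on a value that has already been unquoted.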

You’ll see ID and PATH referenced later. Let’s move onto declarations.

Declaration

This is really the meat of the Typical DSL. A declaration is either a choice (similar to an enum in languages like Rust) or a struct. Refer back to the example I included above.

Here’s how Typical implements it in its schema:

//https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L35
pub struct Declaration {
    pub source_range: SourceRange,
    pub comment: Vec<String>,
    pub variant: DeclarationVariant,
    pub name: Identifier,
    pub fields: Vec<Field>,
    pub deleted: BTreeSet<usize>,
}

//https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L45
pub enum DeclarationVariant {
    Struct,
    Choice,
}

The important parts here are variant, name, fields, and deleted. Here’s my definition of Declaration:

Declaration:
    variant=('struct' | 'choice') name=ID '{' 
        (fields+=Field | deleted+=Deleted)+
    '}';

I capture variant as either the struct or choice literal. name is an identifier, and the body of the declaration is surrounded by { and } characters. The body itself looks a lot like the Schema rule, except that it uses a + to indicate there should be one or more matches for it to be valid. I’ll note that, strictly speaking, there should only ever be one deleted clause (which is a place to indicate which indices are no longer being used). Instead of overcomplicating the grammar specification, I’ve left a looser definition that can later be checked via validations.
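The kind of check that the looser grammar defers to validation might look like this. It is a dependency-free sketch: the interfaces are simplified stand-ins for the generated AST, checkDeleted is a hypothetical name, and wiring it into Langium’s validation services is left out.

```typescript
// Simplified stand-ins for the generated AST nodes.
interface Deleted {
  indexes: number[];
}

interface Declaration {
  name: string;
  deleted: Deleted[];
}

// Return a list of problems; an empty list means the declaration is fine.
function checkDeleted(declaration: Declaration): string[] {
  const problems: string[] = [];
  const count = declaration.deleted.length;
  if (count > 1) {
    problems.push(
      `${declaration.name}: expected at most one 'deleted' clause, found ${count}`
    );
  }
  return problems;
}

const task: Declaration = {
  name: "Task",
  deleted: [{ indexes: [1] }, { indexes: [2] }],
};
console.log(checkDeleted(task));
```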

Deleted is just

Deleted:
    'deleted' indexes+=INDEX*;

and INDEX is another terminal

terminal INDEX returns number: /[0-9][\w_]*/;

One special note about INDEX is that it has this weird returns clause. This defines the TypeScript type that the terminal maps to. By default it’s a string. In this case, we want to map to a number, so we need to manually specify returns number.
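One consequence worth sketching: the INDEX pattern accepts more than plain digits, so some tokens it matches do not convert to a usable number. The sketch below assumes the string-to-number conversion behaves like JavaScript’s Number(); convertIndex is a hypothetical helper, not Langium’s converter.

```typescript
// The INDEX pattern from the grammar, anchored to match a whole token.
const INDEX = /^[0-9][\w_]*$/;

// Convert a matched INDEX token to a number.
// Assumption: the conversion behaves like JavaScript's Number().
function convertIndex(token: string): number {
  if (!INDEX.test(token)) {
    throw new Error(`not an INDEX token: ${token}`);
  }
  return Number(token);
}

console.log(convertIndex("12"));
// "1abc" matches the pattern, but Number("1abc") is NaN.
console.log(Number.isNaN(convertIndex("1abc")));
```

Under that assumption, a stricter pattern such as /[0-9]+/ would reject these tokens in the lexer instead of passing NaN along.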

Deleted is the easy part. Let’s look at Fields.

Field

A field is just an entry inside a declaration. So given the Typical struct

struct Task {
    done: Bool = 0
}

The done: Bool = 0 part is the field. It looks like there’s a lot going on in the source, but there are only a few important parts.

//https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L51C1-L58C2
pub struct Field {
    pub source_range: SourceRange,
    pub comment: Vec<String>,
    pub rule: Rule,
    pub name: Identifier,
    pub r#type: Type, // Uses TypeVariant
    pub index: usize,
}

//https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L61
pub enum Rule {
    Asymmetric,
    Optional,
    Required,
}

//https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L74C1-L84C2
pub enum TypeVariant {
    Array(Box<Type>),
    Bool,
    Bytes,
    Custom(Option<Identifier>, Identifier), // (import, name)
    F64,
    S64,
    String,
    U64,
    Unit,
}

Okay, breaking this down, we have a rule (asymmetric, optional, or required), a name, a type, and an index. Here’s what I came up with:

Field:
    (Rule)? name=ID (':' Type)? '=' index=INDEX;

The rule syntax is optional because it defaults to required. Likewise, the type syntax is optional because it defaults to Unit. I’ve defined Rule as a fragment. This wasn’t strictly necessary, but it helped keep the definition cleaner. I could’ve done the same thing for variant in the declaration.

fragment Rule:
    rule=('asymmetric' | 'optional' | 'required');
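Because both rule and type may be missing from a parsed field, code that consumes the AST has to fill in the defaults described above. A minimal sketch, using hypothetical simplified shapes (the type is reduced to a plain string here, which ignores arrays and custom types):

```typescript
type Rule = "asymmetric" | "optional" | "required";

// What the parser hands back: rule and type may be absent.
interface ParsedField {
  rule?: Rule;
  name: string;
  type?: string; // simplified: a real AST also needs array and custom types
  index: number;
}

// The same field with Typical's defaults applied.
interface NormalizedField {
  rule: Rule;
  name: string;
  type: string;
  index: number;
}

function normalizeField(field: ParsedField): NormalizedField {
  return {
    rule: field.rule ?? "required",
    name: field.name,
    type: field.type ?? "Unit",
    index: field.index,
  };
}

// `success = 0` from the SendEmailResponse example has neither rule nor type.
console.log(normalizeField({ name: "success", index: 0 }));
```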

I’ve already talked about INDEX, so types are next.

Type

In the previous section we saw the TypeVariant as expressed in Typical’s schema code. It’s made up of primitive types (String, Bool, U64, etc.), arrays, and custom types. Arrays are interesting because they’re a recursive type definition, meaning any valid type can be an array’s element type (even other array types). Custom types are references to other types defined in the user’s Typical schema.
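To make the recursion concrete, here is a sketch of TypeVariant as a TypeScript discriminated union, with a function that renders a type back to Typical syntax. The names (Type, kind, render) are made up for this illustration and are not what Langium generates.

```typescript
// A TypeScript mirror of Typical's TypeVariant, for illustration only.
type Type =
  | { kind: "primitive"; name: "String" | "Bool" | "Bytes" | "F64" | "S64" | "U64" | "Unit" }
  | { kind: "array"; element: Type } // recursive: the element is any Type
  | { kind: "custom"; module?: string; name: string };

// Render a type back to its Typical source form.
function render(type: Type): string {
  switch (type.kind) {
    case "primitive":
      return type.name;
    case "array":
      return `[${render(type.element)}]`;
    case "custom":
      return type.module ? `${type.module}.${type.name}` : type.name;
  }
}

// An array of arrays of strings.
const nested: Type = {
  kind: "array",
  element: { kind: "array", element: { kind: "primitive", name: "String" } },
};
console.log(render(nested));
```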

Here’s the grammar I came up with

fragment Type:
    type=('String' | 'Bool' | 'Bytes' | 'F64' | 'S64' | 'U64' | 'Unit' | ArrayType | CustomType);

ArrayType:
    '[' Type ']';

CustomType:
    (module=ID '.')? type=[Declaration:ID];

Type here is expressed as a fragment because I want it to be defined inline in the field. If Type were a regular rule, the field AST would have something like fields[0].type.type, which is slightly annoying. With Type as a fragment, I was able to inline it in the Field definition and keep the AST flatter.

CustomType refers either to a declaration in the current file (if no module is specified) or to one in a different file (if a module is specified). Something else I want to call out for CustomType is that the type section has this weird square bracket syntax. [Declaration:ID] essentially means it’s looking for an ID terminal that references a Declaration. This provides autocomplete for declaration types, but the default scoping rules provided by Langium mean it only does so for the current file. We’ll have to expand on that!
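The lookup rule itself (current file without a module, imported file with one) can be sketched independently of Langium. Everything here is hypothetical: SchemaFile and resolveCustomType are made-up names, and this plain lookup is not Langium’s scoping API.

```typescript
interface Declaration {
  name: string;
}

// A parsed file plus the files it imports, keyed by namespace (alias or file name).
interface SchemaFile {
  declarations: Declaration[];
  imports: Map<string, SchemaFile>;
}

// Resolve a CustomType: no module means the current file,
// otherwise look in the file imported under that module name.
function resolveCustomType(
  current: SchemaFile,
  name: string,
  module?: string
): Declaration | undefined {
  const target = module === undefined ? current : current.imports.get(module);
  return target?.declarations.find((d) => d.name === name);
}

const foo: SchemaFile = { declarations: [{ name: "Test" }], imports: new Map() };
const main: SchemaFile = {
  declarations: [{ name: "Task" }],
  imports: new Map([["foo", foo]]),
};

console.log(resolveCustomType(main, "Test", "foo")?.name);
```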

Resolving References

TODO