Building a Language Server for Typical
2024-09-09
This document is a work in progress. I’ve pushed up a WIP LSP for this on GitHub.
Introduction
As I’m working on building out my Deno Desktop Framework, I’ve been eyeing Typical as a serialization library to communicate between TypeScript and Rust. There’s a lot that I like about it, and even though I technically won’t need the asymmetric fields, it still solves the basic problem I have: serialization between Rust and TypeScript.
Typical has its own DSL that looks a little something like this:
struct SendEmailRequest {
    to: String = 0
    subject: String = 1
    body: String = 2
}

choice SendEmailResponse {
    success = 0
    error: String = 1
}
It’d be nice to have syntax highlighting and it’d be doubly nice to have a language server. Obviously a tall ask for a one-off DSL, right?… right?
Okay, no, I’m not writing that. Thankfully I don’t have to. There’s a project called Langium that takes in an EBNF-style grammar and generates a language server from it. They’ve got a playground that I plugged some Typical syntax into and (mostly) figured out how to write the grammar for. The one thing that’s not really working as I expect is the path string for the import. For the PATH terminal I ended up having to put the single quotes (') in the terminal itself instead of having them stand alone, because it wasn’t parsing / and . correctly otherwise. Probably a bug with the lexer? Anyway, I digress.
Let’s explore how I used Langium to build out the LSP.
Writing the Grammar
First things first, we need the EBNF grammar that Langium uses for its base generation. Here’s what I came up with:
grammar Typical

entry Schema:
    (imports+=Import | declarations+=Declaration)*;

Import:
    'import' path=PATH ('as' alias=ID)?;

Declaration:
    variant=('struct' | 'choice') name=ID '{'
        (fields+=Field | deleted+=Deleted)+
    '}';

Field:
    (Rule)? name=ID (':' Type)? '=' index=INDEX;

Deleted:
    'deleted' indexes+=INDEX*;

fragment Rule:
    rule=('asymmetric' | 'optional' | 'required');

fragment Type:
    type=('String' | 'Bool' | 'Bytes' | 'F64' | 'S64' | 'U64' | 'Unit' | ArrayType | CustomType);

ArrayType:
    '[' Type ']';

CustomType:
    (module=ID '.')? type=[Declaration:ID];

terminal ID: /[_a-zA-Z][\w_]*/;
terminal PATH: /'[\w_\/\.-]*'/;
terminal INDEX returns number: /[0-9][\w_]*/;

hidden terminal COMMENT: /#[^\n\r]*/;
hidden terminal WS: /\s+/;
You can play with the grammar and see the AST it would generate in the playground.
I worked through the Schema to see how Stephan is constructing the AST internally. Let’s break it down by type.
Schema
This just represents the entire Typical document. From the source it’s just described as a struct of comments, imports, and declarations.
// https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L21
pub struct Schema {
    pub comment: Vec<String>,
    pub imports: BTreeMap<Identifier, Import>,
    pub declarations: Vec<Declaration>,
}
The grammar needs an entry point, which Langium marks with the entry keyword.
entry Schema:
    (imports+=Import | declarations+=Declaration)*;
It’s fairly straightforward even if it looks a little confusing. Schema is zero or more (via the *) imports or declarations. Because of the += assignments, imports and declarations here end up being arrays of those types.
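To make the array behaviour concrete, here’s a rough hand-written sketch of the shape this rule produces. Langium generates its own AST interfaces from the grammar; the SchemaSketch, ImportSketch, and DeclarationSketch names below are simplified stand-ins made up for illustration, not the generated output.

```typescript
// Simplified stand-ins for the AST nodes. Langium's generated types carry
// extra metadata that is left out of this sketch.
interface ImportSketch { path: string; alias?: string }
interface DeclarationSketch { variant: 'struct' | 'choice'; name: string }

// Each `+=` assignment in the grammar appends to an array property.
interface SchemaSketch {
  imports: ImportSketch[];
  declarations: DeclarationSketch[];
}

// Roughly what one import followed by the two example declarations would yield.
// Note the path still includes its quotes, because PATH captures them.
const schema: SchemaSketch = {
  imports: [{ path: "'foo.t'" }],
  declarations: [
    { variant: 'struct', name: 'SendEmailRequest' },
    { variant: 'choice', name: 'SendEmailResponse' },
  ],
};

console.log(schema.imports.length, schema.declarations.map((d) => d.name));
```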
Note that comments aren’t included in this. Comments are defined as a hidden terminal and handled specially by Langium’s lexer. The definition for comments is at the bottom of the file:
hidden terminal COMMENT: /#[^\n\r]*/;
Let’s look at the grammar for imports and declarations in turn.
Import
You can import Typical types from a different file.
import 'foo.t'
import 'bar.t' as baz
In the above example, foo could be used to reference types from the foo.t file. So if there was a struct called Test, it could be accessed as foo.Test. The import from bar.t has an alias of baz, so if the same Test struct was in bar.t it’d be accessed as baz.Test.
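As a small sketch of that naming rule: the qualifier is the alias when there is one, otherwise the file name without its extension. The importNamespace helper is hypothetical, and taking only the last path component for nested paths is my assumption rather than something confirmed against Typical’s source.

```typescript
// Hypothetical helper: the name used to qualify types from an import.
// Assumes the path has already had its surrounding quotes removed.
function importNamespace(path: string, alias?: string): string {
  if (alias !== undefined) {
    return alias;
  }
  // Assumption: use the final path component and drop its extension.
  const fileName = path.split('/').pop() ?? path;
  const dot = fileName.lastIndexOf('.');
  return dot > 0 ? fileName.slice(0, dot) : fileName;
}

console.log(importNamespace('foo.t'));        // foo
console.log(importNamespace('bar.t', 'baz')); // baz
```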
Here’s Typical’s Rust representation of the import.
// https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L28
pub struct Import {
    pub source_range: SourceRange,
    pub path: PathBuf,                // The literal path
    pub namespace: Option<Namespace>, // A normalized form of the path
}

// https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L87C1-L93C2
pub struct Namespace {
    // This is a representation of a path to a schema, relative to the base directory, i.e., the
    // parent of the schema path provided by the user. However, it differs from paths as follows:
    // - It doesn't include the file extension in the final component.
    // - It can only contain "normal" path components. For example, `.` and `..` are not allowed.
    pub components: Vec<Identifier>,
}
Something about the above that I haven’t quite figured out is how it stores the import alias. At first I thought that’s what the namespace was, but that’s just path components. I’m not too sure and haven’t spent the time tracing the execution logic to see where it’s grabbed from. Regardless, it’ll need to be represented in my grammar.
Import:
    'import' path=PATH ('as' alias=ID)?;
So this parses the literal import, a property path of type PATH (specified below), and optionally the literal as followed by a property alias of type ID.
PATH and ID are terminals, kind of like comments, except they’re not hidden terminals. These are patterns that can be captured to represent some type in the code. Langium strays a bit from plain EBNF in how you define terminals by allowing you to write them as regular expressions, which is what I did. I was never really able to specify the PATH terminal without the single quotes that surround the import file path. At first I put the quotes directly in the Import rule, but that didn’t really work; I think the parser expected whitespace between them and the rest of the text. Regardless, where I landed works fine. It just means I need to strip the first and last characters when manipulating the path later.
terminal ID: /[_a-zA-Z][\w_]*/;
terminal PATH: /'[\w_\/\.-]*'/;
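Here’s a quick way to poke at those two patterns outside of Langium, along with the quote stripping mentioned above. The regexes are copied from the grammar and anchored with ^ and $ so they have to match a whole token; stripQuotes is a hypothetical helper name.

```typescript
// The grammar's terminal patterns, anchored so they must match a whole token.
const ID = /^[_a-zA-Z][\w_]*$/;
const PATH = /^'[\w_\/\.-]*'$/;

// Hypothetical helper: drop the surrounding quotes that PATH captures.
function stripQuotes(raw: string): string {
  return raw.slice(1, -1);
}

console.log(ID.test('SendEmailRequest'));  // true
console.log(PATH.test("'types/foo.t'"));   // true
console.log(PATH.test('types/foo.t'));     // false, the quotes are part of the terminal
console.log(stripQuotes("'types/foo.t'")); // types/foo.t
```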
You’ll see ID and PATH referenced later. Let’s move on to declarations.
Declaration
This is really the meat of the Typical DSL. A declaration is either a choice (similar to an enum in languages like Rust) or a struct. Refer back to the example I included above.
Here’s how Typical implements it in its schema:
// https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L35
pub struct Declaration {
    pub source_range: SourceRange,
    pub comment: Vec<String>,
    pub variant: DeclarationVariant,
    pub name: Identifier,
    pub fields: Vec<Field>,
    pub deleted: BTreeSet<usize>,
}

// https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L45
pub enum DeclarationVariant {
    Struct,
    Choice,
}
The important parts here are variant, name, fields, and deleted. Here’s my definition of Declaration:
Declaration:
    variant=('struct' | 'choice') name=ID '{'
        (fields+=Field | deleted+=Deleted)+
    '}';
I capture variant as either the struct or choice literal. name is an identifier, and the body of the declaration is surrounded by { and } characters. The body itself looks a lot like the Schema rule, except that it uses a + to indicate there should be one or more matches for it to be valid. I’ll note that strictly speaking there should only ever be one deleted clause (which is a place to indicate which indices are no longer being used). Instead of overcomplicating the grammar, I’ve left a looser definition that can later be checked via validations.
Deleted is just:
Deleted:
    'deleted' indexes+=INDEX*;
and INDEX is another terminal:
terminal INDEX returns number: /[0-9][\w_]*/;
One special note about INDEX is that it has this weird returns clause. This defines the TypeScript type that the terminal maps to. By default it’s a string. In this case, we want to map to a number, so we need to manually specify returns number.
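The single deleted clause rule that the grammar leaves loose could be checked with something like the sketch below. Langium lets you register validation checks, but to keep this self-contained it’s written as a plain function over simplified stand-in types (DeclarationNode, FieldNode, and DeletedNode are hypothetical names). The second check, that a field can’t reuse a deleted index, is my assumption about what Typical expects.

```typescript
// Simplified stand-ins for the AST nodes produced by the grammar.
interface DeletedNode { indexes: number[] }
interface FieldNode { name: string; index: number }
interface DeclarationNode {
  name: string;
  fields: FieldNode[];
  deleted: DeletedNode[];
}

// Hypothetical check: returns human-readable problems instead of reporting
// them through Langium's validation machinery.
function checkDeclaration(decl: DeclarationNode): string[] {
  const problems: string[] = [];
  if (decl.deleted.length > 1) {
    problems.push(`${decl.name}: expected at most one deleted clause, found ${decl.deleted.length}`);
  }
  // Assumption: a field may not reuse an index that is listed as deleted.
  const deletedIndexes: number[] = [];
  for (const clause of decl.deleted) {
    deletedIndexes.push(...clause.indexes);
  }
  for (const field of decl.fields) {
    if (deletedIndexes.indexOf(field.index) !== -1) {
      problems.push(`${decl.name}.${field.name}: index ${field.index} is listed as deleted`);
    }
  }
  return problems;
}

// Two deleted clauses in one declaration is the case the grammar lets through.
const task: DeclarationNode = {
  name: 'Task',
  fields: [{ name: 'done', index: 0 }],
  deleted: [{ indexes: [1] }, { indexes: [2] }],
};
console.log(checkDeclaration(task));
```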
Deleted is the easy part. Let’s look at Fields.
Field
A field is just an entry inside a declaration. So given the Typical struct:
struct Task {
    done: Bool = 0
}
The done: Bool = 0 part is the field. It looks like there’s a lot going on in the source, but there are only a few important parts.
// https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L51C1-L58C2
pub struct Field {
    pub source_range: SourceRange,
    pub comment: Vec<String>,
    pub rule: Rule,
    pub name: Identifier,
    pub r#type: Type, // Uses TypeVariant
    pub index: usize,
}

// https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L61
pub enum Rule {
    Asymmetric,
    Optional,
    Required,
}

// https://github.com/stepchowfun/typical/blob/c575520dc9df5d91d5dced702b6c1b8171a78dd1/src/schema.rs#L74C1-L84C2
pub enum TypeVariant {
    Array(Box<Type>),
    Bool,
    Bytes,
    Custom(Option<Identifier>, Identifier), // (import, name)
    F64,
    S64,
    String,
    U64,
    Unit,
}
Okay, breaking this down we have a rule (asymmetric | optional | required), a name, a type, and an index. Here’s what I came up with:
Field:
    (Rule)? name=ID (':' Type)? '=' index=INDEX;
The rule syntax is optional because it defaults to required. Likewise, the type syntax is optional because it defaults to Unit. I’ve defined Rule as a fragment. This wasn’t strictly necessary, but it helped keep the definition cleaner. I could’ve done the same thing for variant in the declaration.
fragment Rule:
    rule=('asymmetric' | 'optional' | 'required');
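Since both the rule and the type can be left out, anything consuming the AST has to fill in those defaults itself. A minimal sketch, assuming a simplified ParsedField shape where the type is just a string (the real AST node is richer) and a hypothetical withDefaults helper:

```typescript
type Rule = 'asymmetric' | 'optional' | 'required';

// Simplified stand-in for a parsed field. `rule` and `type` are absent when
// they were omitted in the source.
interface ParsedField {
  rule?: Rule;
  name: string;
  type?: string;
  index: number;
}

// Hypothetical helper that fills in the defaults described above.
function withDefaults(field: ParsedField): Required<ParsedField> {
  return {
    rule: field.rule ?? 'required',
    name: field.name,
    type: field.type ?? 'Unit',
    index: field.index,
  };
}

// `success = 0` from the SendEmailResponse example has neither a rule nor a type.
console.log(withDefaults({ name: 'success', index: 0 }));
```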
I’ve already talked about INDEX, so types are next.
Type
In the previous section we saw the TypeVariant as expressed in Typical’s schema code. It’s made up of primitive types (String, Bool, U64, etc.), arrays, and custom types. Arrays are interesting because they’re a recursive type definition, meaning any valid type can be an array type (even other array types). Custom types are references to other types defined in the user’s Typical schema.
Here’s the grammar I came up with:
fragment Type:
    type=('String' | 'Bool' | 'Bytes' | 'F64' | 'S64' | 'U64' | 'Unit' | ArrayType | CustomType);

ArrayType:
    '[' Type ']';

CustomType:
    (module=ID '.')? type=[Declaration:ID];
Type here is expressed as a fragment because I want it to be inlined into the field. If Type was a regular rule then the field AST would have something like fields[0].type.type, which is slightly annoying. With Type as a fragment, I was able to inline it in the Field definition and keep the AST flatter.
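Because arrays can nest, anything that walks a field’s type ends up recursive. Here’s a rough sketch that renders a type back into Typical syntax. The TypeSketch shapes are simplified stand-ins, and kind is a hypothetical discriminator added for this example (Langium’s generated nodes are tagged with a $type property instead).

```typescript
type Primitive = 'String' | 'Bool' | 'Bytes' | 'F64' | 'S64' | 'U64' | 'Unit';

// Simplified stand-ins for the ArrayType and CustomType nodes.
interface ArrayTypeSketch { kind: 'array'; type: TypeSketch }
interface CustomTypeSketch { kind: 'custom'; module?: string; type: string }
type TypeSketch = Primitive | ArrayTypeSketch | CustomTypeSketch;

// Render a type back to Typical syntax, recursing through nested arrays.
function render(t: TypeSketch): string {
  if (typeof t === 'string') {
    return t;
  }
  if (t.kind === 'array') {
    return `[${render(t.type)}]`;
  }
  return t.module ? `${t.module}.${t.type}` : t.type;
}

const nested: TypeSketch = { kind: 'array', type: { kind: 'array', type: 'U64' } };
console.log(render(nested)); // [[U64]]
```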
CustomType either refers to a declaration in the current file (if no module is specified) or in a different file (if a module is specified). Something else I want to call out for CustomType is that the type property has this weird square bracket syntax. [Declaration:ID] is a cross-reference: it essentially means the parser is looking for an ID terminal that refers to a Declaration by name. This provides autocomplete for declaration types, but the default scoping rules provided by Langium mean it only does so for the current file. We’ll have to expand on that!
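Setting Langium’s scoping API aside, the lookup a CustomType needs can be sketched with plain data. Everything here is an assumption for illustration: FileSketch reduces each file to the names it declares, and resolves is a hypothetical helper, not how Langium or Typical implement it.

```typescript
// Simplified stand-in: a file is the names it declares plus its imports,
// keyed by namespace (the alias or the file name).
interface FileSketch {
  declarations: string[];
  imports: Record<string, FileSketch>;
}

// Hypothetical lookup: true when `type` (or `module.type`) can be resolved.
function resolves(file: FileSketch, type: string, module?: string): boolean {
  if (module === undefined) {
    // No module: the declaration has to live in the current file.
    return file.declarations.indexOf(type) !== -1;
  }
  // With a module: look in the imported file registered under that namespace.
  const imported: FileSketch | undefined = file.imports[module];
  return imported !== undefined && imported.declarations.indexOf(type) !== -1;
}

const bar: FileSketch = { declarations: ['Test'], imports: {} };
const main: FileSketch = { declarations: ['Task'], imports: { baz: bar } };

console.log(resolves(main, 'Task'));        // true
console.log(resolves(main, 'Test'));        // false, Test lives in the imported file
console.log(resolves(main, 'Test', 'baz')); // true
```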
Resolving References
TODO