Here's a high-level overview of how joinery works.

Compilation proceeds in several phases:
- Tokenize.
  - Split the source into identifiers, punctuation, literals, etc. All tokens contain the original source code, location information, and surrounding whitespace.
- Parse into AST.
  - We use the `peg` crate, a Parsing Expression Grammar (PEG) parser generator. PEGs are a bit ad hoc as grammars go, but `peg` is a very nice library.
  - We make heavy use of `#[derive]` macros to implement the AST types.
- Check types.
  - The internal type system is defined in `src/types.rs`. This is distinct from the simplistic "source level" type system parsed by `src/ast.rs`, and better suited to doing inference.
  - Name lookup is handled in `src/scopes.rs`. Note that SQL requires several different kinds of scopes.
- Apply transforms.
  - A list of transforms is supplied by each database driver.
  - Transforms use Rust pattern matching to match parts of the AST, and build new AST nodes using `sql_quote!`. Note that `sql_quote!` outputs tokens, so we need to call back into the parser. This is closely patterned after Rust procedural macros using `syn` and `quote`.
  - After applying a transform, we may need to check types again to support later transforms. This works a bit like an LLVM analysis pass, where specific transforms may indicate that they require types, and the harness ensures that valid types are available.
  - The output of a transform must be structurally valid BigQuery SQL, though after a certain point it may no longer type check.
- Emit SQL.
  - This consumes AST nodes and emits them as database-specific strings. We prefer to do as much work as possible using AST transforms, but sometimes we can't represent database-specific features in the AST.
- Run.
  - This is a slightly dodgy layer that knows how to run SQL. Mostly it's intended for running our test suites, not for production use. Some of the Rust database drivers have problems reading complex data types back into Rust.
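
To make the tokenizer phase a bit more concrete, here is a minimal sketch of tokens that remember their original text, their location, and the whitespace in front of them, so the source can be rebuilt exactly. The `Token` struct, its fields, and the `tokenize` and `to_source` functions are all hypothetical and much simpler than what lives in `src/tokenizer.rs`; for example, this sketch treats every punctuation character as its own token and doesn't handle string or float literals.

```rust
/// A hypothetical token type. This is NOT joinery's real definition.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    /// Whitespace that appeared immediately before the token.
    pub leading_ws: String,
    /// The token text, exactly as written in the source.
    pub text: String,
    /// Byte offset of `text` within the source string.
    pub start: usize,
}

/// Split `src` into word tokens and single-character punctuation tokens.
/// Returns the tokens plus any whitespace left over at the end of the input.
pub fn tokenize(src: &str) -> (Vec<Token>, String) {
    let mut tokens = Vec::new();
    let mut ws = String::new();
    let mut chars = src.char_indices().peekable();
    while let Some(&(start, c)) = chars.peek() {
        if c.is_whitespace() {
            // Accumulate whitespace so it can be attached to the next token.
            ws.push(c);
            chars.next();
            continue;
        }
        let mut text = String::new();
        if c.is_alphanumeric() || c == '_' {
            // Identifiers, keywords, and integers: consume a run of word characters.
            while let Some(&(_, c2)) = chars.peek() {
                if c2.is_alphanumeric() || c2 == '_' {
                    text.push(c2);
                    chars.next();
                } else {
                    break;
                }
            }
        } else {
            // Anything else is treated as one-character punctuation.
            text.push(c);
            chars.next();
        }
        tokens.push(Token {
            leading_ws: std::mem::take(&mut ws),
            text,
            start,
        });
    }
    (tokens, ws)
}

/// Rebuild the original source from the tokens and the trailing whitespace.
pub fn to_source(tokens: &[Token], trailing_ws: &str) -> String {
    let mut out = String::new();
    for token in tokens {
        out.push_str(&token.leading_ws);
        out.push_str(&token.text);
    }
    out.push_str(trailing_ws);
    out
}
```

Because each token keeps its whitespace, `to_source(&tokens, &trailing)` returns the input string unchanged, which is the property that lets later phases print unmodified parts of a query as the user wrote them.
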
These traits are implemented by many of the nodes in the AST:
- In `src/tokenizer.rs`.
  - `Spanned` keeps track of where in the source code a token or AST node was found. This includes a `file_id`.
  - `ToTokens` is used to convert AST nodes back into source code. This is used by `sql_quote!` to build up SQL strings.
- In `src/ast.rs`.
  - `Emit` is used to convert AST nodes into database-specific SQL strings. This can be derived using `#[derive(Emit)]`, in which case it just calls `EmitDefault`. `Emit` is where we override and customize how specific parts of the AST are emitted for certain databases.
  - `EmitDefault` can be automatically derived to emit an AST node by printing every token it contains, recursively. This is optional.
  - `Drive` and `DriveMut` are generic AST walking interfaces provided by `derive-visitor`. We use these for lots of custom AST traversals, especially transforms.
  - `Node` is a helper trait that indicates that a given type implements most of the tokenizer and AST traits.
- In `src/infer/mod.rs`.
  - `InferTypes` handles the main part of type inference.
  - `InferColumnName` is what figures out that `SELECT a, b AS c` creates two columns named `a` and `c`.
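
The split between `Emit` and `EmitDefault` is easier to see in code. The following is only a sketch of the pattern, with hand-written impls standing in for the derive macros. The trait signatures, the `Target` enum, and the `Token`, `Float64Type`, and `Cast` types are all assumptions made for illustration and do not match the real definitions in `src/ast.rs`. The BigQuery `FLOAT64` to Trino `DOUBLE` rename is just an example of a database-specific override.

```rust
/// Hypothetical list of output dialects.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Target {
    BigQuery,
    Trino,
}

/// Emit a node by printing every token it contains, recursively.
pub trait EmitDefault {
    fn emit_default(&self, target: Target, out: &mut String);
}

/// Emit a node as database-specific SQL.
pub trait Emit {
    fn emit(&self, target: Target, out: &mut String);
}

/// A stand-in for a raw token. It is emitted as its text, separated from
/// the previous output by a single space.
pub struct Token(pub &'static str);

impl Emit for Token {
    fn emit(&self, _target: Target, out: &mut String) {
        if !out.is_empty() {
            out.push(' ');
        }
        out.push_str(self.0);
    }
}

/// A node with a custom `Emit`: the type is spelled differently on Trino.
pub struct Float64Type {
    pub token: Token,
}

impl Emit for Float64Type {
    fn emit(&self, target: Target, out: &mut String) {
        match target {
            Target::Trino => Token("DOUBLE").emit(target, out),
            Target::BigQuery => self.token.emit(target, out),
        }
    }
}

/// A node with no custom behavior: `Emit` just forwards to `EmitDefault`.
pub struct Cast {
    pub cast: Token,
    pub lparen: Token,
    pub expr: Token,
    pub as_token: Token,
    pub ty: Float64Type,
    pub rparen: Token,
}

impl Cast {
    /// Build `CAST(<expr> AS FLOAT64)` from hard-coded tokens.
    pub fn new(expr: &'static str) -> Self {
        Cast {
            cast: Token("CAST"),
            lparen: Token("("),
            expr: Token(expr),
            as_token: Token("AS"),
            ty: Float64Type { token: Token("FLOAT64") },
            rparen: Token(")"),
        }
    }
}

// Roughly what a derived `EmitDefault` might do: call `emit` on each field in order.
impl EmitDefault for Cast {
    fn emit_default(&self, target: Target, out: &mut String) {
        self.cast.emit(target, out);
        self.lparen.emit(target, out);
        self.expr.emit(target, out);
        self.as_token.emit(target, out);
        self.ty.emit(target, out);
        self.rparen.emit(target, out);
    }
}

// Roughly what a derived `Emit` might do: defer to `EmitDefault`.
impl Emit for Cast {
    fn emit(&self, target: Target, out: &mut String) {
        self.emit_default(target, out);
    }
}

/// Convenience helper that emits a node into a fresh string.
pub fn emit_to_string<T: Emit>(node: &T, target: Target) -> String {
    let mut out = String::new();
    node.emit(target, &mut out);
    out
}
```

In this sketch `Cast` never mentions Trino, yet `emit_to_string(&Cast::new("x"), Target::Trino)` still picks up the override, because the default emitter recurses through `Emit` on each child.
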
These traits are implemented by types:
- In `src/unification.rs`.
  - `Unify` is used to combine two types into a single type. It's what helps us determine that `ARRAY(1, 2.0, NULL)` is in fact an `ARRAY<FLOAT64>`.
- We should probably pull out more shared traits for types.
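
As a rough illustration of what unification does for the array example above, here is a small sketch. The `Type` enum, the `Unify` signature, and the `array_literal_type` helper are hypothetical and far smaller than the real type system in `src/types.rs`. The coercion rules are also assumptions: `NULL` unifies with anything, and `INT64` widens to `FLOAT64`.

```rust
/// A tiny, hypothetical subset of SQL types.
#[derive(Debug, Clone, PartialEq)]
pub enum Type {
    /// The type of a bare `NULL` literal.
    Null,
    Bool,
    Int64,
    Float64,
    Array(Box<Type>),
}

/// Combine two types into a single type that can represent both.
pub trait Unify: Sized {
    fn unify(&self, other: &Self) -> Result<Self, String>;
}

impl Unify for Type {
    fn unify(&self, other: &Self) -> Result<Self, String> {
        match (self, other) {
            // Identical types unify to themselves.
            (a, b) if a == b => Ok(a.clone()),
            // NULL takes on whatever type it is combined with.
            (Type::Null, t) | (t, Type::Null) => Ok(t.clone()),
            // Integers widen to floats.
            (Type::Int64, Type::Float64) | (Type::Float64, Type::Int64) => Ok(Type::Float64),
            // Arrays unify element-wise.
            (Type::Array(a), Type::Array(b)) => Ok(Type::Array(Box::new(a.unify(b)?))),
            (a, b) => Err(format!("cannot unify {:?} and {:?}", a, b)),
        }
    }
}

/// Infer the type of an array literal by folding `unify` over its elements.
pub fn array_literal_type(elements: &[Type]) -> Result<Type, String> {
    let mut element_type = Type::Null;
    for t in elements {
        element_type = element_type.unify(t)?;
    }
    Ok(Type::Array(Box::new(element_type)))
}
```

With these rules, folding over `[Int64, Float64, Null]` gives `Array(Float64)`, while something like `Int64` against `Bool` is reported as an error.
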
These traits are implemented by database drivers:
- In `src/drivers/mod.rs`.
  - `Locator` implements a URL-like locator for a database.
  - `Driver` provides the main interface for talking to a database. This also provides things like the list of transforms to apply to the AST. This is object safe and can be referred to as `Box<dyn Driver>`.
  - `DriverImpl` is a helper for `Driver`. This knows about how a database represents types and values, and it isn't object safe.
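
One common way to pair a trait that works behind `Box<dyn ...>` with a richer helper trait that doesn't is a blanket impl. The sketch below shows that general pattern only. We are not claiming this is how `src/drivers/mod.rs` wires the two traits together, and every method, the `FakeDriver` type, the `open_driver` function, and the `fake:` locator scheme are made up for the example.

```rust
use std::fmt::Debug;

/// Hypothetical helper trait. It cannot be used as `dyn DriverImpl` because
/// `open` has no `self` receiver, and `Value` differs per database.
pub trait DriverImpl {
    /// How this database represents a value read back into Rust.
    type Value: Debug;

    fn open(locator: &str) -> Self;
    fn name(&self) -> &'static str;
    fn run(&self, sql: &str) -> Vec<Self::Value>;
}

/// Hypothetical main interface. Every method takes `&self` and uses only
/// concrete types, so `Box<dyn Driver>` works.
pub trait Driver {
    fn name(&self) -> &'static str;
    fn run_to_strings(&self, sql: &str) -> Vec<String>;
}

/// Anything implementing the helper trait gets the main interface for free.
impl<D: DriverImpl> Driver for D {
    fn name(&self) -> &'static str {
        DriverImpl::name(self)
    }

    fn run_to_strings(&self, sql: &str) -> Vec<String> {
        // Erase the database-specific value type by formatting each value.
        self.run(sql).iter().map(|v| format!("{:?}", v)).collect()
    }
}

/// A fake database that "runs" a query by returning its length.
pub struct FakeDriver {
    pub locator: String,
}

impl DriverImpl for FakeDriver {
    type Value = i64;

    fn open(locator: &str) -> Self {
        FakeDriver { locator: locator.to_string() }
    }

    fn name(&self) -> &'static str {
        "fake"
    }

    fn run(&self, sql: &str) -> Vec<i64> {
        vec![sql.len() as i64]
    }
}

/// Pick a driver based on a URL-like locator string.
pub fn open_driver(locator: &str) -> Option<Box<dyn Driver>> {
    if locator.starts_with("fake:") {
        Some(Box::new(FakeDriver::open(locator)))
    } else {
        None
    }
}
```

Callers only ever see `Box<dyn Driver>`, so code that runs queries or collects transforms doesn't need to be generic over the database.
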
These traits are implemented by transforms:
- In `src/transforms/mod.rs`.
  - `Transform` provides an interface for custom AST transforms, generally using `DriveMut` to walk the AST and `sql_quote!` to build new AST nodes. If a transform requires up-to-date type information, it must override `requires_types` to return `true`.
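
Here is a minimal sketch of how a harness might use `requires_types` to decide when to re-run type checking. Apart from the name `requires_types`, everything here is hypothetical: the `Transform` signature is simplified, `Ast` is a plain string with a validity flag, and `check_types` only counts how often it runs. The real harness works on real AST nodes.

```rust
/// A stand-in for the AST: just SQL text plus type-checking bookkeeping.
#[derive(Debug, Default)]
pub struct Ast {
    pub sql: String,
    /// True if type information is up to date with `sql`.
    pub types_valid: bool,
    /// How many times the type checker has run.
    pub type_checks: u32,
}

/// Simplified, hypothetical version of the transform interface.
pub trait Transform {
    /// Transforms that need up-to-date types override this to return `true`.
    fn requires_types(&self) -> bool {
        false
    }

    fn transform(&mut self, ast: &mut Ast);
}

/// Pretend type checker: marks types as valid and counts the run.
pub fn check_types(ast: &mut Ast) {
    ast.types_valid = true;
    ast.type_checks += 1;
}

/// Apply transforms in order, re-checking types only when a transform asks
/// for them and an earlier transform has invalidated them.
pub fn apply_transforms(ast: &mut Ast, transforms: &mut [Box<dyn Transform>]) {
    for t in transforms.iter_mut() {
        if t.requires_types() && !ast.types_valid {
            check_types(ast);
        }
        t.transform(ast);
        // Any rewrite may invalidate the inferred types.
        ast.types_valid = false;
    }
}

/// Example transform: textual replacement, optionally claiming to need types.
pub struct Replace {
    pub from: &'static str,
    pub to: &'static str,
    pub needs_types: bool,
}

impl Transform for Replace {
    fn requires_types(&self) -> bool {
        self.needs_types
    }

    fn transform(&mut self, ast: &mut Ast) {
        ast.sql = ast.sql.replace(self.from, self.to);
    }
}
```

The design choice worth noting is that transforms never call the type checker themselves. They only declare what they need, and the harness skips the re-check whenever the types are still valid or nobody downstream asked for them.
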
See also `src/scopes.rs`. We have more than one kind of scope, but the shared trait interface is still in flux.