Skip to content

Latest commit

 

History

History
238 lines (161 loc) · 8.74 KB

Design.md

File metadata and controls

238 lines (161 loc) · 8.74 KB

Phoron Design and Implementation

Overall Design

The Phoron (or Jasmin) syntax is very much line-oriented in the spirit of most assemblers. This makes parsing and code generation relatively straightforward. The complexity lies in the way the Constant Pool is constructed, how the Attributes are generated, as well as handling the offsets within the Constant Pool as well as lables within specific Instructions.

This project uses the phoron_core crate to perform the low-level serialisation of the type-checked and decorated AST into JVM bytecode.

References:

  1. The JVMww 19 Reference
  2. The Jasmin User Guide.

Algorithm

  • Read the Phoron source file, lex it, and then parse it into an AST.
  • Type-check the AST and generate a decorated AST.
  • Use the type information in the decorated AST to generate the Constant Pool (CP) (as described in the next section).
  • Fix all the indices used in instructions and labels, into the CP, using the generated CP representation in the previous step.
  • Construct the ClassFile structure as requited by phoron_core.
  • Use phoron_core to serialise to JVM bytecode.`

Constant Pool

The CP is represented in Phoron as:

  • A hashmap of key-value pairs where each key is a String, and each value is a (usize, CpInfo) pair, where CpInfo is the phoron_core type representing CP entries.
  • The String key represents the name of the CP entity.
  • The usize component of the value represents the CP index proper, and the CpInfo represents the
  • This hashmap is then used to construct a vector of CpInfo objects according to the scheme accepted by phoron_core.

CP entries are generated from the following sources:

  • Instructions such as ldc, ldc_w, ldc2_w et al, which explicitly index into literal constants stored in the CP.
  • Class, Fieldref, Methodref, InterfaceMethoref, and NameAndType definitions in the Phoron file.

Attributes

How Phoron directives map to the JVM model:

.version populates the major_version and minor_versionfields in the ClassFile object.

.source generates a SourceFile attribute. If this attribute is not explicitly specified, the name of the Phoron file is taken as the value for this attribute.

.class is simply associated with the ClassFile object being generated during the current run of Phoron. Its index in the CP is also used for the this_class field in ClassFile. The is used to set various access flags as valid for classes.

.super is used to populate the super_class attribute (as an index into the CP) of the ClassFile. The is used to set various access flags as valid for classes.

.interface sets the [ACC_INTERFACE}() access flag in the ClassFile.

.implements popiulates the interfaces field of ClassFile.

.field sets the fields fields of ClassFile. Each entry in this vector is a FieldInfo object.

.method populates the methods field of ClassFile. Each entry in this vector is a MethodInfo object.

.limit .limit stack N and CodeAttribute populate the max_stack and max_locals fields of the CodeAttribute of the relevant method.

.var generates a LocalVariable entry in the LocalVariableTable attribute.

.line generates a LineNumber entry in the LineNumberTable attribute.

.throws populates the Exceptions attribute of the relevant method.

.catch creates entries in the exception_table of the CodeAttribute of the relevant method.

Error Reporting

Since Phoron is mroe strongly (read strictly) typed than Jasmin, it behooves us to provide better error reporting at all stages than what is currently available in Jasmin. To that end, this section is divided into three sub-sections - the first describes the actual implementation details for the error-reporting mechanism, the second describes how the error recovery mechanism which is used, mutatis mutandis, at each stage of processing to provide decent error messages. and the third specifies the template for diagnostic reports.

Implementation Details

When a source file is read in, it is first converted into a SourceFile:

pub struct SourceFile {
  src_file: String, // the name of the source file
  src: String, // the actual raw sourc code
  beginnings: Vec<Po>> // absolute byte offsets for the beginning of each line
}

This is then fed into the lexer, which produces tokens with spans:

pub struct Spanned<T> {
  node: T,
  span: Span
}
pub struct Span {
  low: Pos,
  high: Pos
}
pub struct Pos(u32);

where each u32 value is an absolute byte offset from the beginning of the source code.

pub enum Token {
}
pub fn lex(&self) -> LexerResult<Spanned<Token>> 

During parsing, each AST node is likewise decorated with a span for the entire non-terminal AST:

pub fn parse(&mut self) -> ParserResult<Spanned<PhoronProgram>> 
pub struct PhoronProgram {
  pub header: Spanned<PhoronHeader>,
  pub body: Spanned<PhoronBody>,
}

and so on.

Error Recovery

If we are in a function, parse_x, then we skip tokens until an element in FIRST(x) (meaning a token that can start the phrase x) is encountered. If encountered, parsing continues from this stage. If not encountered, then we look for a token in FOLLOW(x) (meaning a token that marks the end of the phrase x). If encountered, then we skip parsing this phrase, report the error, and continue parsing the next phrase. If not encountered, then we report the error, and stop parsing since error recovery is not meaningful at this stage.

Diagnostic Template

Phoron is line-based, and so the greatest possible span will correspond to a single line. However, this template is more general purpose that could potentially work with multiple-lines (by merging spans), and the template shown below describes how a single line might be reported along with the diagnostic information, but, again, a similar mechanism could be used to reporting multiple lines.

  <File:Line:Col>: <Error Text>
            |
 <LINE NUM> | <SOURCE CODE LINE> generated from the span
            |          ^^^^   

The template is represented by the following structure:

pub struct Diagnostic {
  src_file: String, // name of the source file
  line: u32, // line number
  col: u32, // column number
  stage: Stage, // lexer, parser, cp analyzer, or code gen
  level: Level, // info, warning, error
  src: String, // source code corresponding to the merged span
}


which is generated by the following reporting function:

  pub fn report_error(stage: Stage, level: Level, span: Span, text: String) 

   pub enum Level {
     Warning,
     Info, 
     Error
   }
    pub enum Stage {
      Lexer,
      Parser,
      ConstantPoolAnalyzer,
      CodegGenerator,
    }

and is used by the emitter to generate a nicely-formatted error/diagnostic message.

So we need the followoing functionality in order to generate and report as per this template:

- `merge_span` - this will take another span, and "merger" that with the current span. This is used to convert the token spans into spans for the AST nodes. The algorithm is simple:

      new_span.low = min(curr_span.low, other_span.low), and
      new_span.high = max(curr_span.higg, other_span.high)

  So, essentially merging intervals.

- `span_to_location` : this will take a `Span` and generate the <File, Line number, Column number> information. This will make use of the `beginnings` field of the `SourceFile` for faster lookups using binary search.

  So, for instance, we have the following situation:

  ```
    Beginnings         Line Numbers
      0                  0
     13                  1
     21                  2
     31                  3
     34                  4

  ```

  Then, given a span like so:

  ```
  Span {
    low: Pos(15),
    high: Pos(18),
  }
  ```

  Then `File` will be the source file name, `Line ` will be 2 (1 + pos using Binary Search), and `Column` will be 3 (low - beginning + 1).

- `span_to_source` - this will take a `Span` and generate the region of source code as a string (a single line in the case of `Phoron`) in the format specified above.

- `emit_diagnostic` - this will generate the pretty-printed (and colour-coded) diagnostic report in the form of the template above.