Esrap TODO
As access to special variables can be slow, it may make sense to
store *cache*
and *head*
in slots of a structure context
and
this structure in a new special variable *context*
.
We’re interested in the production/failure, and the position.
The vast majority of parses result in failures, and we cons up a
failed-parse
for each. Unless the whole parse fails, the only
thing we are interested in is the position.
…and even if the whole parse fails, we use the additional information only to display a fancy error message.
So: unless parse
has been called with :debug t
, use the
fixnum
indicating the position to indicate a failed parse.
In the case of a success and matching a sequence we don’t really
need the whole result object for each submatch. Maybe return result
as multiple values from each rule, and have with-cached-result
cons up the object to store it when?
(or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9")
can be implemented
with a range-check for char-code
. Similarly (or “foobar” “foo”
“bar”) can be implemented using tricks from pkhuong’s STRING-CASE.
The cache is a big bottleneck. Try out a few different designs. Early experiments show that while it’s easy to make something that conses less, it’s not trivial to make it much faster than the current simple hash-table based version.
Some statistics:
- 5-10% of positions in a given text have only failure results. If we can efficiently record the rules these are for…
- 40-50% of positions in a given text end up with exactly one successful result, irrespective of number of failures. Not sure if we can use this.
- 75% of positions in a given text end up with results. This should make a good estimate for the size of the cache needed.
GC is another related bottleneck. Not because we cons so much, but because we have this massive cache that keeps being written to, so we have boxed objects on dirty pages.
To reduce the GC pressure first optimize the result handling. If the issue still exists, see the first option below.
Maybe: Map rule to a position cache. In the position cache, need to be able to differentiate between 3 states: no result, success, failure. Need to also be able to store the result. If we store results in a single global result vector, and use N bits per position in the position cache: 0 no result, 1 failure, anything else is the position of the result object in the global vector.
Maybe: PCL-style multikey cache.
Maybe: Basic two-level cache. (Version of this on a branch.)
Parsing is currently thread-safe if parse
has been
compiler-macroexanded. *RULES*
needs locking, but isn’t used
during actual parsing.
Rules should be contained in grammars, so that symbols like cl:if
can refer to different rules in different contexts. Grammars can
also enforce rule numbering, making caching results easier. It
should be possible to inherit from other grammars.
but I would like to have that behind a flag so it can be turned off for debugging. No need to export or document the flag.
If keeping it doesn’t make things clearly worse or implementation much harder, I would keep it. Therefore I would prefer ADD-RULE and &co to default to grammar at runtime, and have that as either a keyword or an optional argument. Analogously to how INTERN &co work.
should IMO choose the grammar (if not explicitly given) at compile-time if at all possible. Earlier is better there, I suspect.
I see the attraction, but this means that all rules in a grammar are public and prone to conflict, no?
If I have a grammar G1 in package BAR that specifies a rule called BAR:WHITESPACE and a G2 in FOO that specified FOO:WHITESPACE, then G3 in QUUX can use both without problems. If both rules are really called “WHITESPACE”, things can get confusing pretty quickly…
Have standard-grammar
instances that define things like digit,
whitespace, ascii, etc.
This will probably be done in a different system.
(defrule decimal (+ (or "0" "1" ...))
(:subseq-function parse-integer))
Make it easy to specify character ranges, eg. (char #\0 #\9)
.
Structures and classes have a mixture of documentation strings and documentation comments. Which do we want? After deciding, make this consistent.
For parse errors, in particular “incomplete parse” errors, provide a better description of where the parse actually failed, which rule or rules were involved and what input was expected.