This is the artifact for the POPL 2022 paper "A formal foundation for symbolic evaluation with merging". A copy of the paper is included in the artifact.
The artifact is packaged as a Docker image. See the hardware requirements and installation sections for more details. The repository is simply the source of the Docker image.
The paper uses Rosette and Rosette* to refer to versions 3 and 4 of the Rosette system, respectively. The artifact may refer to them as Rosette 3 and Rosette 4.
- List of claims
- Hardware requirements
- Installation and sanity testing
- Evaluation instructions
- Additional artifact description
We mechanized our symbolic semantics and proved its correctness in Lean.
We also created a reference interpreter named Leanette, which is proven correct.
These are inside the subdirectory `lean`.
The following table links sections in the paper with files under the `lean` directory.
File | Description | Note |
---|---|---|
`lean/src/cs/lang.lean` | Syntax and concrete semantics of λ_c | Section 3.1 |
`lean/src/cs/detc.lean` | Determinism of λ_c | Theorem 1 |
`lean/src/cs/sym.lean` | Symbolic factory interface | Section 4.1 |
`lean/src/cs/svm.lean` | Evaluation rules for S_c | Section 4.2 |
`lean/src/cs/thm.lean` | Properties and correctness of S_c | Theorems 2–8 |
`lean/src/interp/sym/defs.lean` | Leanette implementation | Section 6.1 |
`lean/src/interp/sym/thm.lean` | Correctness of Leanette | Theorem 9 |
`lean/src/sym_factory/sym_factory.lean` | Symbolic factory F_L | Section 6.1 |
`lean/src/sym_factory/red.lean` | Reducibility and correctness of F_L | Theorem 10 |
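To give a flavor of what the mechanization looks like, here is a small illustrative Lean fragment of a core-language syntax in the style of λ_c. This is our own sketch, not the artifact's definitions; see `lean/src/cs/lang.lean` for the actual syntax.

```lean
-- Illustrative only: a toy expression syntax in the style of a
-- core language like λ_c. The artifact's definitions differ.
inductive exp
| lit : bool → exp              -- boolean literal
| var : ℕ → exp                 -- variable (as an index)
| lam : exp → exp               -- function abstraction
| app : exp → exp → exp         -- function application
| ite : exp → exp → exp → exp   -- conditional
```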
We also created a Racket implementation, known as Rosette*, which implements the symbolic semantics with an optimized symbolic factory.
The subdirectory `rosette-4/rosette` (or this link if you wish to view it outside the Docker image) contains an implementation of Rosette*, as described in Section 6.2.
To make sure that Rosette* is implemented correctly, we tested it via solver-aided differential testing. The subdirectory `diff-testing` contains the differential testing setup that tests Leanette against Rosette*, as described in Section 6.3 (and also Figure 7, Table 1, and Table 2).
We compared the interfaces of Rosette* and Rosette by porting benchmarks written in Rosette to Rosette*. The two subdirectories `rosette-benchmarks-3` and `rosette-benchmarks-4` contain the benchmarks described in Sections 7.1.1 and 7.1.2. The directory `interface` contains tools that read these benchmarks to generate Table 3.
We tested that the performance of Rosette* matches that of Rosette on the benchmarks mentioned above. The subdirectory `perf` contains a tool that tests the performance of Rosette* and Rosette on the SymPro benchmarks (which are in `rosette-benchmarks-3` and `rosette-benchmarks-4`), as described in Section 7.2 (and Table 4). The following table lists the SymPro benchmarks.
Benchmark | ID | Entry file |
---|---|---|
Bagpipe | `bagpipe` | `bagpipe/setups/textbook/run.rkt` |
Bonsai | `bonsai` | `bonsai/nanoscala.rkt` |
Cosette | `cosette` | `cosette/cidr-benchmarks/oracle-12c-bug.rkt` |
Ferrite | `ferrite` | `ferrite/rename.rkt` |
Fluidics | `fluidics` | `fluidics/ex2.rkt` |
GreenThumb | `greenthumb` | `greenthumb/GA/output/0/driver-0.rkt` |
IFCL | `ifcl` | `ifcl/test.rkt` |
Memsynth | `memsynth` | `memsynth/case-studies/synthesis/ppc/ppc0.rkt` |
Neutrons | `neutrons` | `neutrons/filterWedgeProp.rkt` |
Nonograms | `nonograms` | `nonograms/puzzle/src/run-batch-learn-rules.rkt` |
Quivela | `quivela` | `quivela/test-inc-arg.rkt` |
RTR | `rtr` | `rtr/benchmarks/all.rkt` |
SynthCL | `synthcl` | `synthcl/examples/sobelFilter/test.rkt` |
Wallingford | `wallingford` | `wallingford/tests/all-tests.rkt` |
WebSynth | `websynth` | `websynth/test/all-tests.rkt` |
Separately, the subdirectory `jitterbug-benchmarks` contains a tool that tests the performance of Rosette* and Rosette on the Jitterbug benchmarks.
To use this artifact, you will need a machine capable of running Docker, with at least 8 GB of free disk space and at least 16 GB of RAM. Note that the artifact's Docker image does not work on ARM-based systems such as Apple M1.
For reference, we tested this artifact on a machine running Linux 5.4.0 and Docker 20.10.8 with an Intel Core i7-7700K CPU @ 4.20GHz, configured with 16GB of RAM. The results you obtain may vary from ours depending on your particular machine setup, including CPU, available RAM, concurrently running processes, etc.
The only required installation is Docker. See https://docs.docker.com/install/ for details on how to install it. If you are using macOS (non-ARM-based), Docker for Mac is a good solution.
After installing Docker, you can download and run the Docker image by running:
# Download image (~2GB download)
$ docker pull unsat/leanette-popl22-artifact:latest
# Run Image
$ docker run -it --name leanette-artifact unsat/leanette-popl22-artifact:latest
This will drop you into a shell inside the container, in the `/workspace` directory, which contains all of the source code and experiments.
We have installed `vim` in the container for convenience, to edit and view files; you can also use Docker to copy files into and out of the container (see https://docs.docker.com/engine/reference/commandline/cp/).
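For example, to copy a file out of the container to your host's current directory (the container path here is just an illustration):
$ docker cp leanette-artifact:/workspace/some-file.txt .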
If you leave the container and wish to get back, you will need to restart it with `docker start -ia leanette-artifact`. If you wish to instead start from a fresh copy of the image, run `docker rm leanette-artifact` to remove the container, and then follow the instructions for the first run of the container above.
To test that Leanette works, run:
$ cd /workspace/lean
$ lean src/workspace/load.lean
This should output:
"(ans (state #t #t) (term_bool #t))"
To test that Rosette* works, run:
$ cd /workspace
$ raco cross -q --workspace /workspace/rosette-4 racket examples/correct-ex-1.rkt
This should produce the output `(unsat)`.
Next, run:
$ raco cross -q --workspace /workspace/rosette-4 racket examples/incorrect-ex-1.rkt
This should produce a model, for example:
(model
[y 0])
Compared to Rosette, Rosette* adds the `assume` statement to the language. Therefore, `correct-ex-1.rkt`, which uses the `assume` statement, should fail to run in Rosette:
$ raco cross -q --workspace /workspace/rosette-3 racket examples/correct-ex-1.rkt
This should fail with output beginning with `examples/correct-ex-1.rkt:12:4: assume: unbound identifier`.
However, the program `examples/rosette-3.rkt`, which does not use the `assume` statement, should succeed when run with Rosette:
$ raco cross -q --workspace /workspace/rosette-3 racket examples/rosette-3.rkt
This should produce a model, for example:
(model
[y -1])
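To see why `assume` matters here, the following minimal Racket sketch is in the same spirit as these example files (the actual `correct-ex-1.rkt` and `incorrect-ex-1.rkt` may differ):

```racket
#lang rosette

; Hypothetical sketch, not the artifact's actual example files.
(define-symbolic y integer?)

(define (abs-val x) (if (< x 0) (- x) x))

; With the assumption, no counterexample exists: prints (unsat).
(verify
 (begin
   (assume (not (= y 0)))        ; the statement Rosette* adds
   (assert (> (abs-val y) 0))))

; Without the assumption, y = 0 refutes the assertion:
; prints a counterexample such as (model [y 0]).
(verify (assert (> (abs-val y) 0)))
```

Running the first query under Rosette 3 would fail at `assume`, which is exactly what the sanity test above checks.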
Test the oracle on a predefined program that passes the differential testing (Figure 7b in the paper) by running:
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket tester.rkt ./predefined/fig-7b.rkt
You should see the following output:
======= Program ./predefined/fig-7b.rkt =======
Computed in X.XXs
Rosette*:
(halt (state #t #f))
Rosette* evaluation is trivial
Rosette* state size: 2
Computed in X.XXs
Leanette:
(ans (state #t #f) (undef))
Leanette evaluation is non-trivial
Leanette state size: 46
Leanette and Rosette* agree
Note: it may appear that Leanette's state is trivial, but that's because the output shown above is also reduced by Rosette's symbolic factory. The actual state size is 46, which is not trivial.
Next, test the reporter by running:
$ raco cross -q --workspace /workspace/rosette-4 racket report.rkt
You should see the following output:
Range | Count | Avg | (Leanette state) Sym | Max | Avg | (Rosette* state) Sym | Max | Avg | (Leanette time) Max | Avg | (Rosette time) Max | Avg | Validated? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 - 9 | 1 | 9 | 1 | 46 | 46 | 0 | +nan.0 | +nan.0 | X.X | X.X | X.X | X.X | ✓ |
Similarly, test the oracle on a predefined program that does not pass the differential testing (Figure 7a in the paper) by running:
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket tester.rkt ./predefined/fig-7a.rkt
You should see the following output:
======= Program ./predefined/fig-7a.rkt =======
Computed in X.XXs
Rosette*:
(ans (state #t #t) #t)
Rosette* evaluation is trivial
Rosette* state size: 2
Leanette runs out of fuel
Leanette evaluation is trivial
Leanette state size: 0
Next, test the reporter by running:
$ raco cross -q --workspace /workspace/rosette-4 racket report.rkt
You should see the following output:
program at ./predefined/fig-7a.rkt fails differential testing because Leanette runs out of fuel
Note that `report.rkt` is the script that generates Table 1. Because a sanity-test run has only one trivial data point, many of the aggregate values are not meaningful and are reported as `+nan.0`.
Run the following command:
$ cloc --version
It should produce the output `1.86`.
There is no sanity test for this step (the previous sanity tests already verify that Rosette* and Rosette can be run).
In general, output related to time duration will not exactly match the paper due to differences in hardware and machine configuration. However, we hope it won't deviate significantly, and that the data you obtain will maintain relative order with respect to other data points. For example, if a table in the paper indicates that Rosette* is faster than Rosette on a particular benchmark, Rosette* should still be faster than Rosette in your run, even though the measured times will not exactly match.
Expected time: less than 1 minute
To check all proofs, run the following commands:
$ cd /workspace/lean
$ ./reset.sh
$ leanproject build
The script `reset.sh` resets the `lean` directory (e.g., clears compiled files). The command `leanproject build` checks the proofs in every file and warns about any use of `sorry` (i.e., an unchecked proof).
There should be exactly one warning, as follows:
lean/src/test_sorry.lean:1:0: warning: declaration 'test_sorry' uses sorry
This is an intentional `sorry` inserted into the project to ensure that `leanproject build` can detect unchecked proofs.
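For reference, a declaration of the following shape is enough to trigger such a warning (a minimal sketch; the actual contents of `test_sorry.lean` may differ):

```lean
-- Hypothetical sketch: a declaration whose proof is the `sorry`
-- placeholder, which `leanproject build` reports as a warning.
theorem test_sorry : ∀ (n : ℕ), n + 0 = n := sorry
```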
The file `test_sorry.lean` is not used otherwise. To validate this, you can run:
$ git rm src/test_sorry.lean
$ git commit -m "remove test_sorry.lean" # we require that the worktree is clean, so commit the change
$ ./reset.sh
$ leanproject build
To check the proofs in a specific file, run `lean <file.lean>`. Unless the file is `test_sorry.lean`, there should be no warnings or errors.
Expected time: N/A
You can find the implementation in `/workspace/rosette-4/rosette` (or this link if you wish to view it outside the Docker image).
If you wish, you may create your own Rosette* files and run them. See the reference for more details.
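For instance, you could create a file like the following (a hypothetical example of ours, not part of the artifact) and run it with `raco cross -q --workspace /workspace/rosette-4 racket <file.rkt>`:

```racket
#lang rosette

; my-example.rkt (hypothetical): ask the solver for an integer
; strictly between 2 and 4; prints (model [x 3]).
(define-symbolic x integer?)

(solve (begin (assert (> x 2))
              (assert (< x 4))))
```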
More thorough evaluation of Rosette* is described in the next sections of this guide (solver-aided differential testing and performance evaluation).
Expected time: less than 4 hours
The results of Figure 7 should already be reproduced as a part of the sanity testing process. This section details differential testing of randomly generated programs, which consists of multiple steps:
- Compile Leanette so that it can readily run tests
- Generate tests
- Run the oracle on the generated tests
- Check the quality of the tests (coverage) by seeing if the tests can detect mistakes in intentionally incorrect interpreters.
Expected time: less than 1 minute
Simply run:
$ cd /workspace/lean
$ leanproject build
If you have already done this as part of the previous step (and have not run `reset.sh` since), you can skip this step.
Expected time: less than 1 minute
There are two possible choices here. You do not need to do both. Only one suffices.
Generate 10,000 test files as described in the paper by running:
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket generation.rkt
The script `reset.sh` resets the `diff-testing` directory (e.g., clears previously generated test files and results). The script `generation.rkt` generates 10,000 test files in the directory `generation` with seed 42.
Feel free to inspect them. If you wish to reproduce Table 1, we recommend keeping the default test generation setting; otherwise, you can change the seed to another number. See the reference for how to do so.
If you don't have enough time to run the oracle on all 10,000 test files, you can generate fewer programs by running:
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket generation.rkt --sub-iterations <sub> --iterations <main>
The above command will generate `<sub>` × `<main>` test files. In the default test generation setting, `<sub>` is 20 and `<main>` is 500, yielding 20 × 500 = 10,000 files. We recommend lowering `<sub>` first.
For example, the command:
$ raco cross -q --workspace /workspace/rosette-4 racket generation.rkt --sub-iterations 1 --iterations 500
will generate 500 test files. On our machine with four threads, the oracle can check all 500 programs in under 10 minutes.
Note that any setting that deviates from the default will not reproduce Table 1; the differential testing should still pass, however (see the test report section for more details).
Expected time: depends on how many tests you generate. For 10,000 tests:
- Expected time with one thread: less than 16 hours
- Expected time with four threads: less than 4 hours
Run the oracle:
$ cd /workspace/diff-testing
$ raco cross -q --workspace /workspace/rosette-4 racket tester.rkt ./generation
This command runs the oracle with 4 threads and an 80-second timeout, prints the results to the screen, and creates a log file at `/workspace/diff-testing/workspace/log.rktd` that the reporter will read.
Due to hardware differences, you may need to adjust these parameters; see the reference for more details.
(Note that the paper used a timeout of 40s, but inside Docker we found that 40s is not enough, so we bumped the timeout up to 80s.)
Expected time: less than 1 minute
Report the result by running:
$ cd /workspace/diff-testing
$ raco cross -q --workspace /workspace/rosette-4 racket report.rkt
The output consists of two parts: the programs that do not pass the differential testing and a table similar to Table 1 in the paper.
First, for programs that do not pass the differential testing, there are 5 possible reasons:
- Leanette and Rosette* disagree: this should never happen, as it would indicate that Rosette* and Leanette actually disagree. If you find one, please let the authors know!
- Leanette has a fatal error: this should never happen; it likely indicates a technical error somewhere. If you find one, please let the authors know!
- Leanette runs out of fuel: this is possible when the generated program contains an unreachable infinite loop but Leanette's symbolic factory is not strong enough to deduce the unreachability. The case is accounted for in our framework (see Section 5 and Example 9) and should not be counted as a mistake. It would be very rare for this case to happen; in our evaluation in the paper, it does not arise from randomly generated programs on our machine.
- Rosette* timeout: either the program takes a really long time to evaluate or it contains an infinite loop. It would be very rare for this case to happen; in our evaluation in the paper, it does not arise from randomly generated programs on our machine.
- Leanette timeout: the program takes a really long time to evaluate. It would be very rare for this case to happen; in our evaluation in the paper, it does not arise from randomly generated programs on our machine.
You can re-run the oracle on each program that fails the differential testing to see more details.
With the default test generation setting, all generated programs should terminate, so if you encounter a timeout, you can re-run that specific program with a higher timeout.
Second, for the table: if you use the default test generation setting, the result should be similar to Table 1, with minor differences in evaluation time.
Range | Count | Avg | (Leanette state) Sym | Max | Avg | (Rosette* state) Sym | Max | Avg | (Leanette time) Max | Avg | (Rosette time) Max | Avg | Validated? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 - 12 | 1,006 | 6 | 220 | 635 | 41 | 88 | 20 | 4 | 3.0 | 2.7 | 0.3 | 0.1 | ✓ |
13 - 24 | 1,021 | 19 | 349 | 1,142 | 77 | 174 | 36 | 6 | 2.9 | 2.6 | 1.2 | 0.1 | ✓ |
25 - 38 | 1,042 | 32 | 430 | 2,855 | 102 | 224 | 34 | 6 | 2.8 | 2.7 | 0.3 | 0.1 | ✓ |
39 - 50 | 1,036 | 44 | 433 | 16,979 | 242 | 187 | 34 | 7 | 3.0 | 2.7 | 0.4 | 0.1 | ✓ |
51 - 65 | 1,040 | 58 | 450 | 7,523 | 186 | 227 | 30 | 8 | 2.9 | 2.7 | 0.3 | 0.1 | ✓ |
66 - 77 | 1,040 | 72 | 459 | 30,386 | 474 | 239 | 45 | 8 | 3.6 | 2.7 | 0.3 | 0.1 | ✓ |
78 - 88 | 1,053 | 83 | 454 | 19,835 | 426 | 231 | 36 | 8 | 3.3 | 2.7 | 0.5 | 0.1 | ✓ |
89 - 102 | 1,042 | 95 | 427 | 45,893 | 482 | 190 | 41 | 9 | 4.4 | 2.7 | 0.3 | 0.2 | ✓ |
103 - 119 | 1,009 | 111 | 425 | 61,439 | 686 | 218 | 45 | 9 | 5.2 | 2.7 | 0.4 | 0.2 | ✓ |
120 - 158 | 711 | 129 | 324 | 293,171 | 2,001 | 159 | 36 | 9 | 40.0 | 2.7 | 0.4 | 0.2 | ✓ |
The following table details our injected mistakes:
Patch file | Description |
---|---|
`diff-testing/patch/incorrect_and.patch` | Make `and(b_a, b_b) = b_a` where `b_a` and `b_b` are non-literal symbolic booleans |
`diff-testing/patch/incorrect_or.patch` | Make `or(b, b) = tt` where `b` is a non-literal symbolic boolean |
`diff-testing/patch/incorrect_lt.patch` | Make `op(<, z_a, z_b) = z_b < z_a` where either `z_a` or `z_b` is a non-literal symbolic integer |
`diff-testing/patch/incorrect_call_op.patch` | In `CallOp`, skip `strengthen` |
`diff-testing/patch/incorrect_call.patch` | In `Call`, change the guards for evaluation of the body to be `tt` |
`diff-testing/patch/incorrect_call_bad.patch` | In `CallBad`, use the input symbolic state without asserting γ |
`diff-testing/patch/incorrect_let.patch` | In `Let`, do not bind any variable |
`diff-testing/patch/incorrect_if_sym.patch` | In `IfSym`, swap `not(truth(v))` and `truth(v)` when calling `merge_R(., .)` |
`diff-testing/patch/incorrect_if_true_false.patch` | In `IfTrue` and `IfFalse`, swap the conditions |
`diff-testing/patch/incorrect_error.patch` | In `Err`, do `Abort` |
`diff-testing/patch/incorrect_abort.patch` | In `Abort`, do `Err` |
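To give a concrete sense of these mutations, here is a minimal, self-contained Lean sketch of the kind of change `incorrect_and.patch` makes; the types and names below are ours for illustration, not the artifact's definitions:

```lean
-- Minimal sketch; `sbool`, `mk_and`, and `bad_and` are illustrative
-- names, not the artifact's definitions.
inductive sbool
| lit  : bool → sbool            -- concrete boolean
| var  : string → sbool          -- symbolic variable
| conj : sbool → sbool → sbool   -- symbolic conjunction

open sbool

-- Correct factory operation: simplify literals, otherwise build
-- the conjunction term.
def mk_and : sbool → sbool → sbool
| (lit tt) b := b
| (lit ff) _ := lit ff
| a (lit tt) := a
| _ (lit ff) := lit ff
| a b        := conj a b

-- Injected mistake in the style of incorrect_and.patch: when both
-- operands are non-literal, return the first operand instead of
-- their conjunction.
def bad_and : sbool → sbool → sbool
| (lit tt) b := b
| (lit ff) _ := lit ff
| a (lit tt) := a
| _ (lit ff) := lit ff
| a _        := a
```

A program whose observable result depends on the dropped operand then evaluates differently under BadLeanette and Rosette*, which is exactly the discrepancy the oracle reports.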
There are two possible choices here. You do not need to do both; only one suffices. We recommend the default test generation setting, as it is much quicker and matches the paper.
Expected time: less than 5 minutes
We have already identified test programs in `generation` that cause faultily patched Leanette (BadLeanette) to produce incorrect results.
First, run:
$ cd /workspace/lean
$ leanproject build
$ cd /workspace/diff-testing
$ ./reset.sh
$ raco cross -q --workspace /workspace/rosette-4 racket generation.rkt
$ git status
These commands generate the 10,000 test files and prepare Leanette. `git status` should report that the worktree is clean (no "Untracked files" or "Changes"). If it is not clean, running `git clean -df` and `git reset --hard` from `/workspace` will make the worktree clean.
Second, run:
$ ./test-coverage.sh
The script `test-coverage.sh` patches Leanette with each patch file in turn and runs the oracle on known test programs that cause the patched interpreter to fail differential testing. The expected output is that the oracle reports failure on all of them:
program at ./generation/prog23160718406451015591104819237807520888040421765198.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog100208336541399653250221187751391297369.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog478473500332049125701102687487886134892618785332689813519393109671586439862663588789828131845030156.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog16803.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog666831569393266787587089860792333636466692132130.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog19436053114232394.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog15.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog1331343393596762924043919245445877806201864368967369509032193564215325013490794460521988648842748377343143771160948246706431109369.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog271706286423030047271490064405043091576078507211894322485583089278608494173230103175221675458186681751625690187786390.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog735983.rkt fails differential testing because Leanette and Rosette* disagree
program at ./generation/prog100026410668776740492900673318090171927233255361720922983271273849413253676502727166331182369672360102050306966877856993674940497.rkt fails differential testing because Leanette and Rosette* disagree
Note that Lean will also emit a lot of errors and warnings. This is expected, since the proofs are no longer valid after patching. However, Lean should still be able to run Leanette with no problem.
Expected time: depends on various factors, but it can generally take an extremely long time, as you will run the test oracle on all generated tests 11 times (once per patch).
If you don't want to use the default test generation setting, you can instead run:
$ git status # should report the directory is clean
$ raco cross -q --workspace /workspace/rosette-4 racket patcher.rkt ./patch/<file.patch> ./generation
$ git reset --hard
which will apply the given patch to Leanette and run the oracle on the programs in `generation`.
You can interrupt the process (Ctrl-C) once you notice that the oracle reports that Rosette* and (Bad)Leanette disagree.
If you know Lean, you can also modify Leanette to create your own BadLeanette and run the oracle on `generation` yourself.
Note that the modification must make the interpreter incorrect (it is possible to modify it without compromising its correctness, so you need to be careful).
The expected result is that, for every patch, one of the generated tests in `generation` catches the discrepancy between Rosette* and BadLeanette, provided that enough tests were generated.
Expected time: less than 20 minutes
To generate Table 3, run:
$ cd /workspace/interface
$ ./reset.sh
$ ./run.sh # (~17 minutes)
The script `reset.sh` removes any previously generated outputs. The script `run.sh` counts the number of lines by walking module dependencies, performs line diffing, and reports the result. It prints intermediate results as it runs, ending with a table that matches Table 3 from the paper:
Benchmark | Rosette LoC | Rosette* LoC | LoC diff (+) | LoC diff (-) | Solver |
---|---|---|---|---|---|
Bagpipe~\cite{weitz:bagpipe} | 2,643 | 2,643 | +2 | -2 | Z3 |
Bonsai~\cite{chandra:bonsai} | 437 | 437 | +1 | -1 | Boolector |
Cosette~\cite{chu:cosette} | 774 | 774 | +0 | -0 | Z3 |
Ferrite~\cite{ferrite} | 348 | 348 | +2 | -2 | Z3 |
Fluidics~\cite{willsey:puddle} | 98 | 98 | +0 | -0 | Z3 |
GreenThumb~\cite{phothilimthana:greenthumb} | 3,554 | 3,556 | +13 | -11 | Boolector |
IFCL~\cite{rosette:pldi} | 483 | 483 | +13 | -13 | Boolector |
MemSynth~\cite{memsynth} | 13,303 | 13,307 | +8 | -4 | Z3 |
Neutrons~\cite{pernsteiner:neutrons} | 37,174 | 37,174 | +2 | -2 | Z3 |
Nonograms~\cite{butler:nonograms} | 3,368 | 3,374 | +11 | -5 | Z3 |
Quivela~\cite{aws:quivela} | 1,010 | 1,009 | +10 | -11 | Z3 |
RTR~\cite{kazerounian:rtr} | 1,649 | 1,640 | +12 | -21 | CVC4 |
SynthCL~\cite{rosette:pldi} | 2,257 | 2,256 | +42 | -43 | Boolector |
Wallingford~\cite{borning:wallingford} | 2,532 | 2,533 | +12 | -11 | Z3 |
WebSynth~\cite{rosette:pldi} | 1,181 | 1,189 | +14 | -6 | Z3 |
Jitterbug~\cite{jitterbug} | 29,280 | 28,935 | +478 | -823 | Boolector |
Expected time: less than 1 hour
Run the following command:
$ cd /workspace/perf
$ ./reset.sh
$ ./perf.sh sympro
The output should roughly match Table 4a and the SymPro row of Table 4c in the paper, with minor differences in columns about time duration.
Benchmark | (Rosette) Total | Eval | Solving | Terms | (Rosette*) Total | Eval | Solving | Terms |
---|---|---|---|---|---|---|---|---|
Bagpipe | 25 | 24 | 2 | 47 | 26 | 24 | 2 | 48 |
Bonsai | 23 | 21 | 3 | 664 | 46 | 42 | 3 | 1,222 |
Cosette | 13 | 9 | 5 | 131 | 14 | 10 | 5 | 154 |
Ferrite | 23 | 13 | 10 | 34 | 32 | 13 | 19 | 40 |
Fluidics | 20 | 7 | 13 | 284 | 25 | 8 | 18 | 308 |
GreenThumb | 56 | 8 | 48 | 239 | 6 | 2 | 4 | 74 |
IFCL | 154 | 11 | 143 | 383 | 119 | 12 | 107 | 438 |
MemSynth | 24 | 22 | 2 | 61 | 27 | 25 | 2 | 163 |
Neutrons | 44 | 43 | <\,1 | 444 | 13 | 12 | <\,1 | 172 |
Nonograms | 17 | 3 | 14 | 51 | 23 | 4 | 19 | 73 |
Quivela | 35 | 33 | 2 | 1,113 | 39 | 34 | 4 | 1,340 |
RTR | 353 | 332 | 21 | 741 | 135 | 101 | 34 | 1,106 |
SynthCL | 252 | 16 | 236 | 726 | 247 | 18 | 229 | 732 |
Wallingford | 12 | <\,1 | 11 | 4 | 12 | <\,1 | 11 | 5 |
WebSynth | 17 | 12 | 5 | 1,012 | 24 | 19 | 5 | 778 |

(Total) Best | Worst | Avg | (Eval) Best | Worst | Avg | (Solving) Best | Worst | Avg | (Terms) Best | Worst | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|
0.11 | 1.95 | 0.85 | 0.26 | 2.06 | 0.90 | 0.09 | 1.90 | 1.00 | 0.31 | 2.65 | 1.06 |
Also see the reference for how to run a specific benchmark or fewer benchmarks.
Expected time: less than 24 hours
This step describes how to generate the results described in section 7.2 that compare the performance of the Rosette version of Jitterbug to that of the Rosette* port.
First, switch to the directory containing the Jitterbug benchmarks.
$ cd /workspace/jitterbug-benchmarks
Generating the performance results will take several hours, as the benchmarks are run single-threaded for consistency across runs.
The script will output the raw performance data to the console and to the files `jitterbug-rosette3-data.csv` and `jitterbug-rosette4-data.csv` in this directory.
You may choose to skip this step and instead use the pre-computed copies of the data to see the formatted data without waiting for the benchmarks to complete.
To run the benchmarks and generate fresh versions of the data, run the following:
$ ./run-jitterbug-benchmarks.sh # ~19 hours
Note that running the benchmarks will overwrite the pre-computed copies of the data. You can revert to the pre-computed versions with `git checkout '*.csv'`.
After the command completes (or if you are using the pre-computed copies of the data), run the following command to format the data as it appears in Table 4b and the Jitterbug row of Table 4c in the paper:
$ ./show-performance-table.sh # <1 second
If you are using the pre-computed data, this will produce the output below. If you ran the benchmarks to generate the data yourself, the results may vary depending on your particular machine (e.g., CPU, amount of available RAM, etc.).
Table 4b:
| Total (s.) || Evaluation (s.) || Solving (s.) || Terms (*10^3) ||
+ mean + median + max ++ mean + median + max ++ mean + median + max ++ mean + median + max ++
Rosette | 50 | 22 | 9,963 || 2 | 1 | 69 || 48 | 20 | 9,894 || 119 | 15 | 1,678 ||
Rosette* | 38 | 20 | 4,100 || 1 | 1 | 73 || 36 | 19 | 4,027 || 120 | 23 | 1,837 ||
Performance results for the 668 ported and original Jitterbug benchmarks.
Table 4c (Jitterbug results only)
| Total || Evaluation || Solving || Terms ||
+ best + worst + average ++ best + worst + average ++ best + worst + average ++ best + worst + average ++
Jitterbug | 0.23 | 6.03 | 0.91 || 0.33 | 2.18 | 0.87 || 0.22 | 6.48 | 0.91 || 0.79 | 8.12 | 1.09 ||
Performance ratios between the ported and original code for SymPro and Jitterbug benchmarks.
See the reference for additional artifact description.