CGen is a tool for encoding SHA-1 and SHA-256 hash functions into CNF in DIMACS format, also into ANF polynominal system in PolyBoRi output format. The tool produces compact optimized encodings. The tool allows manipulating CNF encodings further via assignment of variable values and subsequent optimization.
- Project page: https://cgen.sophisticatedways.net.
- Article: https://blog.sophisticatedways.net/2020/10/tailored-compact-cnf-encoding-for-sha-1.html.
- Source code is published under MIT license.
- Source code is available on GitHub: https://github.com/vsklad/cgen.
CGen is built to make it easier to analyse SAT problems through manipulation of their CNF (DIMACS) and ANF (PolyBoRi) representations. SHA-1/SHA-2 (SHA) encodings are chosen as a reference/example implementation. The tool produces SHA algorithm encodings with different assignments of both message and hash values. Design of the tool is not limited to these algorithms.
Beyond encoding techniques, for variable assignments CGen implements unit propagation, equivalences reasoning and limited binary clauses resolution. This way, it is a simple SAT preprocessor similar to SatELite and Coprocessor. CGen does not implement any variant of DPLL algorithm and does not make any decisions with respect to variable values.
CGen implements three notable features.
The encoding is optimized to minimize both number of variables and number of clauses/equations. The assumption is that without redundancies, it may be easier to analyse the problem's complexity and structure. Application of the below techniques results in substantially more compact encodings than those published to date and known to the author. As an example, full SHA-1 is encoded into CNF with 26,156 variables and 127,200 clauses.
- New variables are introduced only when necessary. For example, rotation, negation, operations where all arguments are constants, require no additional variables.
- All boolean algebra operations/primitive functions are encoded in the most efficient possible way.
- For CNF, every combination of bitwise n-nary addition is statically optimized using Espresso Logic Minimizer, as a set of pre-defined clauses. This includes simlifications when one or more operands are constants.
- For boolean algebra operations, any resultant constant values are optimized away during encoding. For example, (x ^ ~x ^ y) is encoded as (y) without any new clauses or variables generated.
- Unit propagation, equivalences reasoning and limited binary clauses resolution executed while assigning variables in an existing encoding.
A set of literals and constants can be grouped and given a name. For example, hash value is a set of 160 literals and constants, depending on the encoding parameters. Such "named variable" may change throughout subsequent manipulations. CGen tracks these changes. Further, CGen looks for internal structure when a large number of binary variables is grouped together, to produce the most compact representation. For example, {1/32/1}/16/32 describes a set of 16 32-bit words with 512 variables corresponding sequentially to each bit. These descriptions are stored within the DIMACS file as specially formatted comments. For a SHA encoding, at least two variables are defined, M for message and H for the hash value.
In particular, named variables are specified as a combination of:
- A 1- or 2-dimentional array of CNF literals and constants. Second dimention allows for mapping arrays of words for SHA algorithms.
- Sequences of literals and constants where diference (step) between elements is the same.
- Sequences of binary constants which are grouped into binary and hexadecimal numbers.
CGen allows defining and assigning named variables via command line. For a new encoding, SHA message if specified, is embedded into it. In all other instances CGen performs basic validation of the input, then applies the implemented pre-processing techniques to optimise the encoding. Also, it is possible to define named variables by editing the DIMACS file manually.
In particular, CGen supports:
- Use of binary, hexadecimal and character-sequence constants (with bits mapped to binary variables).
- Setting of specific bits/binary variables of the named variable.
- Assigning random values to a subset of randomly chosen binary variables.
- Padding of 1-block SHA messages.
- Computing hash value given a constant message, then assigning it along with some (or zero) bits of the message fixed.
Furthermore, it is possible to combine or sequence multiple assignments of the same named variable within the same encoding, running the tool several times with different parameters and analysing the results.
Source code is hosted at GitHub. To obtain and build:
git clone https://github.com/vsklad/cgen
make
CGen has no external dependencies other than C++ STL. The makefile relies on GNU g++ compiler.
To run the tool please launch "./cgen" for OSX/Linux and "cgen.exe" for Windows.
The tool supports the following parameters:
cgen encode (SHA1|SHA256) [-f (CNF|ANF)] [-r <rounds>] [-v <name> <value>]... [<encoder options>] <output file name>
cgen (assign | define) [-v <name> <value>]... <input file name> <output file name>
cgen --help
cgen --version
encode (SHA1|SHA256) - generate the encoding and
save it in the chosen format (-f option) to the specified <output file name>
can generate the encoding for specified number of rounds (-r option),
also partialy or fully assign the message and hash values (-v option)
assign - read <input file name>, assign variables as specified and save it to <output file name>
requires input file in CNF/DIMACS format produced by the same tool
this is due to variable specifications encoded as DIMACS comments
these variable specificatiosn may be added to an externally produced file
using the <define> command
define - define one or more named variables
accepts any CNF/DIMACS files, including those produced by other means
records variable definitions in the output file as DIMACS comments
"random", "compute" and "except" options are not supported within variable definitions
-v <name> <value>
specification of the named variable,
its mapping to binary variables and/or its constant values
named variable is a sequence of CNF literals and constants represented with bits
named variable definitions are recorded in the "var" statements
in comments within the DIMACS file
<name> is the name of the variable, case sensitive
may contain characters '0'..'1', 'a'..'z', 'A'..'Z', '_'
first symbol may not be a digit
an option with the particular variable may only appear once
<value> is the variable value or specification
constant binary variables may be grouped into bytes
and further represented as a string, a hexadecinal or binary number
sequential representation can be further optimized as below
if a value includes spaces, it must surrounded with double-quotes
to allow the shell process it as a single argument
supported specifications:
<value> = ((<hex_value> | <bin_value> | <str_value> | <var_value>) [<pad>][<except_range>]) |
<rnd_value> | compute
<hex_value> = 0x['0'..'9', 'a'..'f', 'A'..'F']...
a big endian hexadecimal constant
<bin_value> = 0b['0'..'1']...
a big endina binary constant
<str_value> = ([\0x21..\0x7F]... | "[\0x20..\0x7F]...") [<pad>][<except_range>]
an ASCII string, with quotes if includes space
big endian format is assumed if mapped to word(s)
<rnd_value> = random:<number>
assigns <number> of binary variables (bits) to random values
binary variables (bits) are chosen randomly as well
<number> must be between 1 and the total number of binary variables
compute
the value is computed and the formula is produced as follows:
1. assign other variables to the specified constant values without <exclusion range>
2. determe the computed value by encoding/evaluating the formula with those assignments
3. encode the formula with all variables and <except_range>
<pad> = pad:(SHA1 | SHA256)
the variable treated as a SHA-1/SHA-256 message and is padded accordingly
<except_range> = except:<first>..<last>
range of binary positions/variables to keep variable,
i.e. ignore the constant assignment of
one-based index, i.e. <first> and <last> must be between 1 and the size of the variable
<var_value> = (<literal> | <hex_value> | <bin_value>)[<sequence_clause>]
<var_value> = "{" <var_value> ["," <var_value>]..."}"[<sequence_clause>]
defines a named variable by describing a set of literals and constants
<sequence_clause> = "/"<count>["/"<step>]
defines a sequence where the first item is associated <var_value>
there are <count> elements in the sequence
each subsequent item is produced by adding <step> to the previous one
for compound items, the same <step> applies to every literal within the item
-f (CNF|ANF)
encoding type/form/format to produce, CNF by default (if missing)
CNF refers to CNF saved in DIMACS format
ANF refers to ANF saved in PolyBoRi output format
-r <value>
number of SHA1 rounds to encode
the value mut be between 1 and 80, defaults to 80 if not specified
-h | --help
output parameters/usage specification
--version
output version number of the tool
--normalize_variables | --nv
reindex variables such that no negations are present
within named variable definitions within the output file;
may add aditional clauses/equations as necessary
<encoder options> [--add_max_args=<value>] [--xor_max_args=<value>]
these options define how the encoder processes addition and xor's for multiple operands
these options are allowed for CNF only (not allowed for ANF)
specify maximal number of binary variables to be encoded together, plus the result
any constants are optimized out when encoding and are outside this number
<add_max_args> is a number between 2 and 6, 3 if not specified
defines maximal number of binary variables to be added together
<xor_max_args> is a number between 2 and 10, 3 if not specified
defines maximal number of binary variables to be xor'ed together
Two variables are pre-defined for both SHA-1 and SHA-256.
M - message, a sequence of up to 55 bytes (440 bits) due to self-imposed 1 block limitation
ASCII/hexadecimal constant is padded according to SHA1 specification
random assignments are not padded
the value is used for encoding, i.e.
the formula is produced for the specified value originally, with maximal optimization
H - hash value, a set of 160 bits for SHA-1 grouped into 5 32-bit words,
256 bits and 8 words for SHA-256 respectively
the value is assigned afterwards with recursive UP and euqivalences optimizations
-
Produce a generic SHA-1 CNF encoding without any message or hash value assigned.
encode SHA1 sha1.cnf
-
Evaluate "sha1.cnf" with "CGen" ASCII string as a message padded for SHA-1, output hash value. "sha1H.cnf" is empty since the formula is fully evaluated/satisfied.
assign -vM string:CGen pad:sha1 sha1.cnf sha1H.cnf
-
Evaluate "sha1.cnf" with "CGen" ASCII string as a message padded for SHA-1. Then, compute hash value and assign it to "sha1.cnf". Store the result in "sha1H.cnf".
assign -vM string:CGen pad:sha1 except:1..512 -vH compute sha1.cnf sha1H.cnf
-
Assign hash value to "sha1.cnf". Result is equivalent to the previous example.
assign -vH 0x8dc19dbaebdceb30360c0e52c97dd6944e9889e9 sha1.cnf sha1H.cnf
-
Generate a SHA-1 encoding with hash value assigned. Result is equivalent to the previous example.
encode SHA1 -vH 0x8dc19dbaebdceb30360c0e52c97dd6944e9889e9 sha1H.cnf
-
Generate SHA-1 encoding for the given ASCII string as a message with hash value computed and assigned, also with the first 8 bits of the message left as variables. Note that since the values are big-endian, the first 8 bits are the most significant bits of the first word of the message.
encode SHA1 -vM string:CGen pad:sha1 except:1..8 -vH compute sha1.cnf
-
Generate a SHA-1 encoding with both message and hash value set to random values except for the random 8 bits of the message which are kept as variables.
encode SHA1 -vM random:504 -vH random:160 sha1_random.cnf
-
Produce a generic SHA-1 ANF encoding without any message or hash value assigned.
encode SHA1 -fANF sha1.anf
It is possible to verify the correctness of the resulting formula using the following techniques and the tool itself.
- Generating a SHA encoding with a particular message outputs its hash value. This value can be verified using any alternative SHA implementation including those available online, e.g. http://www.sha1-online.com.
- Assigning a message to a previously generated SHA encoding outputs its hash value; the encoding may be:
- fully unassigned, with all message and hash value bits as binary variables
- with some bits set beforehand using "assign" command
- with full hash value set beforehand using "assign" command
- assigning the correct message will output SATISFIABLE
- assigning a conflicting message will result in a CONFLICT
A sample encoding is obtained using SHA1-SAT with the following parameters:
sha1-sat --cnf --tseitin-adders --message-bits=0 --hash-bits=0 > vn_original.cnf
CGen can be used to define the "M" and "H" variables. Equivalent definitions can also be added manually to "vn_original.cnf" file as comments.
cgen define -vM {{ "{{" }}32/32/-1}/16/32} -vH {2752/32/-1}/5/32 vn_original.cnf vn_cgen.cnf
Subsequently, assigning a message ("CGen" ASCII string padded for SHA-1) yields its hash value.
cgen assign -vM string:CGen pad:sha1 vn_cgen.cnf vn_evaluated.cnf
Additional variable assignments are possible. Note that variable numbers are native to the encoding and might change after optimizations.
var W = {{ "{{" }}32/32/-1}/16/32, {543/31/-1, 544}/64/32}
var A = {2912/32/-1}/80/32
CGen outputs an encoding in the below text format when ANF is chosen for output.
- Any line that starts with the symbol "c" is considered to be a comment and can be ignored
- The first few lines are comments which show:
- number of variables and equations that form the polynominal system
- encoding parameters (algorithm, number of rounds)
- named variable definitions (see Named variables above)
- Following the header comments, there is a list of equations, one equation per line.
- each equation is represented with an expression and should be interpreted as = 0
- each equation is a set of terms; the terms are summed up (xor'ed) which is represented with symbol "+"
- a term is one of:
- a constant 1
- a single variable
- a product/conjunction of multiple variables represented with "*" symbol
- each variable reference consists of "x" prefix followed by variable number; variable numbers start from 1
- the tool will never output the same term twice within the same equation, duplicates are optimized out
In many respects, CGen is an evolution of work done by other researchers. Below is the list of publications and tools used during CGen development.
- Johanovic et al, 2005, http://csl.sri.com/users/dejan/papers/jovanovic-hashsat-2005.pdf
- Marjin Heule, 2008, https://repository.tudelft.nl/islandora/object/uuid%3Ad41522e3-690a-4eb7-a352-652d39d7ac81
- Norbert Manthey, 2011, https://www.researchgate.net/publication/51934532_Coprocessor_-_a_Standalone_SAT_Preprocessor
- Vegard Nossum, 2012, https://www.duo.uio.no/handle/10852/34912
- Legendre et al., 2014, https://eprint.iacr.org/2014/239.pdf
- Nejati et al., 2016, https://www.researchgate.net/publication/306226194_Adaptive_Restart_and_CEGAR-based_Solver_for_Inverting_Cryptographic_Hash_Functions
- Motara, Irving, 2017, https://researchspace.csir.co.za/dspace/bitstream/handle/10204/9692/Motara_19661_2017.pdf?sequence=1&isAllowed=y
- Mate Soos blog, https://www.msoos.org
- SHA1-SAT by Vegard Nossum
- Espresso Logic Minimizer by Robert Brayton
ANF representation has been inroduced thanks to and based on advice received from Mate Soos. Authors of Bosphorus used the resulting SHA-256 ANF encoding as one of benchmarks discussed in their paper.
CGen does not reuse any pre-existing source code.
I developed CGen part of hobby research into SAT and cryptography. If you have any questions about the tool, you can reach me at [email protected].