An implementation of [1].
If you need the data generated by the code and used in the paper, please contact the author.
[1] Takamasa Okudono, Masaki Waga, Taro Sekiyama and Ichiro Hasuo. Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces. Proc. AAAI 2020, to appear.
The experiment system consists of several programs. Each program takes its required data as zip files, executes its main procedure, and saves the result as a zip file. (The system was originally built to communicate with our S3 bucket, but it has been modified so that it can run locally.)
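For instance, assuming each program is a Python script run through Pipenv (the invocation pattern below is an assumption, not the verbatim command line of this repository):

```sh
# Hypothetical pattern: zip inputs in, one timestamped zip result out
pipenv run python <program>.py <arguments and input zips> -o <output name>
```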
`make_prob_automaton`
: generates a probabilistic WFA randomly

`make_paren_data`
: generates a list of random words with parentheses and their scores. About half of the words are made so that their parentheses match.

`make_word_dataset`
: generates random words consisting of the given alphabet

`train_rnn_on_wfa`
: takes a word set and a WFA and trains RNNs. Intended for the WFA2RNN2WFA experiment.

`train_rnn_on_file`
: takes a list of words and their weights and trains RNNs. Intended for the weighted-parentheses experiment.

`precalc_rnn`
: takes an RNN and a word set, runs the RNN on the words, and returns the accuracy and the elapsed time

`rnn2wfa`
: takes an RNN and runs the RNN2WFA procedure

`wrapup_rnn`
: takes RNNs and words and returns a CSV table of their accuracy

`eval_perf`
: takes a result of `precalc_rnn` and an extracted WFA and returns the accuracy and the elapsed time

`wrapup_eval`
: takes the results of `eval_perf`, analyzes them, and makes a CSV table of the statistics
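To make the data flow concrete, here is a sketch of how the programs chain together for the WFA2RNN2WFA experiment (script names and exact argument spellings are assumptions; the `<...>` placeholders stand for the timestamped zip files produced by earlier steps):

```sh
pipenv run python make_prob_automaton.py ab 10 3 -o wfa    # target WFA
pipenv run python make_word_dataset.py ab 10000 -o words   # word set
pipenv run python train_rnn_on_wfa.py --words <words zip> --wfa <wfa zip> -o rnn
pipenv run python rnn2wfa.py regr eqq_param.json <rnn zip> -o extracted
pipenv run python precalc_rnn.py <rnn zip> -o precalc
pipenv run python eval_perf.py <extracted zip> <precalc zip> -o eval
```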
- All required modules are installed automatically by Pipenv.
- Graphviz: needed to run `depict_wfa`.

The filename of the output is printed at the end of STDOUT.
Arguments of `make_prob_automaton`:

`alphabets`
: alphabet set that the WFA takes

`n_states`
: number of states of the generated WFA

`degree`
: maximum number of outgoing edges for each state. We recommend `3`.

`-o`
: output filename. A timestamp is prepended to this.
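A hypothetical invocation (the script name and the alphabet format are assumptions):

```sh
# Random probabilistic WFA over {a, b} with 10 states and degree 3
pipenv run python make_prob_automaton.py ab 10 3 -o wfa
```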
Arguments of `make_paren_data`:

`alphabet`
: alphabet which the generated words consist of

`--max_length`
: maximum length of generated words. We recommend `20`.

`--num`
: number of words. We recommend `10000`. 10% of them are used for testing and 90% for training.

`--seed`
: random seed

`-o`
: output filename. A timestamp is prepended to this.
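For example (hypothetical script name; the seed is arbitrary):

```sh
# 10000 parenthesized words of length <= 20 over {a, b}
pipenv run python make_paren_data.py ab --max_length 20 --num 10000 --seed 42 -o paren_data
```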
Arguments of `make_word_dataset`:

`alphabets`
: alphabet which the generated words consist of

`n_samples`
: number of words. We recommend `10000`. 10% of them are used for testing and 90% for training.

`--max_length`
: maximum length of generated words. We recommend `20`.

`--length_sampling`
: how words are generated. `all_words` just picks randomly from the possible words; `all_lengths` first picks a length from the possible lengths and then picks a word from the possible words of that length. We recommend `all_lengths`.

`--sorted`
: when this is specified, the words are generated from the language shown in the Experiment section and used for the "realistic" setting.

`-o`
: output filename. A timestamp is prepended to this.
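For example (hypothetical script name):

```sh
# 10000 words over {a, b}, sampling a length first and then a word of that length
pipenv run python make_word_dataset.py ab 10000 --max_length 20 --length_sampling all_lengths -o words
```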
Arguments of `train_rnn_on_wfa`:

`--embed_dim`
: the dimension of the output of the embedding. `50` is recommended.

`--hidden_output_dims`
: the dimensions of the outputs of the hidden layers. `50,50` is recommended.

`--batch_size`
: `1` is recommended.

`--n_epochs`
: number of epochs

`--max_length`
: maximum length of generated words. We recommend `20`.

`-o`
: output filename. A timestamp is prepended to this.

`--words`
: word set generated by `make_word_dataset`

`--wfa`
: WFA file generated by `make_prob_automaton`
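For example, with the recommended hyperparameters (hypothetical script name; the epoch count is arbitrary):

```sh
pipenv run python train_rnn_on_wfa.py --words <words zip> --wfa <wfa zip> \
    --embed_dim 50 --hidden_output_dims 50,50 --batch_size 1 \
    --n_epochs 100 --max_length 20 -o rnn
```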
Arguments of `train_rnn_on_file`:

`--embed_dim`
: the dimension of the output of the embedding. `50` is recommended.

`--hidden_output_dims`
: the dimensions of the outputs of the hidden layers. `50,50` is recommended.

`--batch_size`
: `1` is recommended.

`--n_epochs`
: number of epochs

`--max_length`
: maximum length of generated words. We recommend `20`.

`-o`
: output filename. A timestamp is prepended to this.

`traindata`
: words and values generated by `make_paren_data`
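For example (hypothetical script name; the epoch count is arbitrary):

```sh
pipenv run python train_rnn_on_file.py <paren data zip> \
    --embed_dim 50 --hidden_output_dims 50,50 --batch_size 1 \
    --n_epochs 100 --max_length 20 -o rnn_paren
```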
Arguments of `precalc_rnn`:

`rnn`
: output RNN generated by `train_rnn_on_wfa` or `train_rnn_on_file`

`--paren`
: enable this if the RNN is from `train_rnn_on_file`

`-o`
: output filename. A timestamp is prepended to this.
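For example (hypothetical script name):

```sh
# Pre-run an RNN trained by train_rnn_on_file; drop --paren for train_rnn_on_wfa models
pipenv run python precalc_rnn.py <rnn zip> --paren -o precalc
```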
Arguments of `rnn2wfa`:

`{regr,sample,search}`
: `regr` is our method based on regression; `search` is the BFS-based method.

`eqq_param`
: JSON file that determines the behavior of the equivalence query

`rnn`
: output RNN generated by `train_rnn_on_wfa` or `train_rnn_on_file`

`--tol_rank_init`
: initial value of τ in Algorithm 2 (in the extended version)

`--tol_rank_decay_rate`
: r in Algorithm 2 (in the extended version)

`--tol_rank_lower_bound`
: when τ (in Algorithm 2) becomes less than this value, the procedure is halted

`--timeout`
: timeout after which the procedure is halted and the WFA at that moment is returned

`-o`
: output filename. A timestamp is prepended to this.

`--max_length`
: maximum length of accepted words. We recommend `100`.
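For example, extracting with the regression-based method (hypothetical script name; the timeout value is arbitrary):

```sh
pipenv run python rnn2wfa.py regr eqq_param.json <rnn zip> \
    --timeout 3600 --max_length 100 -o extracted
```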
`wrapup_rnn` makes the list of accuracies of the RNNs generated by `train_rnn_on_wfa` or `train_rnn_on_file`. To specify the target RNNs, rewrite the variable `rnns` in the script.
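For instance, assuming `rnns` is a plain list of result filenames inside the script (the filenames below are made up):

```python
# Hypothetical edit in wrapup_rnn: the RNN result zips to tabulate
rnns = [
    "20200101-000000-rnn.zip",
    "20200102-120000-rnn_paren.zip",
]
```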
Arguments of `eval_perf`:

`extracted`
: extracted WFA file generated by `rnn2wfa`

`precalc`
: calculation file generated by `precalc_rnn`

`-o`
: output filename. A timestamp is prepended to this.
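For example (hypothetical script name):

```sh
pipenv run python eval_perf.py <extracted zip> <precalc zip> -o eval
```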
`wrapup_eval` makes the table of statistics from the results of `eval_perf`. To specify the files, rewrite the variable `eval_perfs` in the script.
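Similarly, a hypothetical edit inside the script (the filenames are made up):

```python
# eval_perf result zips to aggregate into the statistics table
eval_perfs = [
    "20200103-000000-eval.zip",
    "20200104-000000-eval.zip",
]
```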
When `regr` is specified in `rnn2wfa`, the JSON file given to `eqq_param` has to take this form (placeholders are surrounded by angle brackets):
```json
{
    "comment": "<your comment here>",
    "eps": <value corresponding to 'e' in Algorithm 1. We recommend 0.05.>,
    "max_length": <put 20; it is disabled when experimental_constant_allowance is enabled>,
    "eta": <put 0.1; it is disabled when experimental_automatic_eta is enabled>,
    "gamma": <put 0.1; the value does not affect the behavior>,
    "cap_m": <value corresponding to 'M' in Algorithm 1. We recommend 2 or 5.>,
    "depth_eager_search": <put -1; this feature is not used now>,
    "regressor_maker_name": <regressor used to construct 'p' in Algorithm 1. This has to be 'gpr' or 'krr'. We recommend 'gpr'.>,
    "experimental_constant_allowance": true,
    "experimental_automatic_eta": true,
    "experimental_disable_restarting": true,
    "experimental_bfs": true,
    "experimental_sort_by_dist": true,
    "experimental_disable_equivalent": false,
    "experimental_skip_existing": false,
    "experimental_halt_long_word": <length at which to halt the exploration. We recommend 20.>,
    "experimental_wfa_only": false
}
```
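For reference, a concrete configuration assembled from the recommendations above (the comment string is arbitrary):

```json
{
    "comment": "regr configuration with the recommended values",
    "eps": 0.05,
    "max_length": 20,
    "eta": 0.1,
    "gamma": 0.1,
    "cap_m": 2,
    "depth_eager_search": -1,
    "regressor_maker_name": "gpr",
    "experimental_constant_allowance": true,
    "experimental_automatic_eta": true,
    "experimental_disable_restarting": true,
    "experimental_bfs": true,
    "experimental_sort_by_dist": true,
    "experimental_disable_equivalent": false,
    "experimental_skip_existing": false,
    "experimental_halt_long_word": 20,
    "experimental_wfa_only": false
}
```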
When `search` is specified:
```json
{
    "comment": "<your comment here>",
    "e": <value corresponding to 'e' in Algorithm 1. We recommend 0.05.>,
    "quit_number": <value corresponding to 'n' in the paragraph 'The baseline Algorithm: BFS(n)'>,
    "experimental_reset": true,
    "experimental_disable_equivalent": false,
    "experimental_skip_existing": false,
    "experimental_halt_long_word": <length at which to halt the exploration. We recommend 20.>
}
```
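And a concrete instance (the `quit_number` value here is a made-up placeholder; substitute the value of 'n' you actually want to evaluate):

```json
{
    "comment": "search configuration example",
    "e": 0.05,
    "quit_number": 100,
    "experimental_reset": true,
    "experimental_disable_equivalent": false,
    "experimental_skip_existing": false,
    "experimental_halt_long_word": 20
}
```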