No Syntaxation Without Representation: Syntactic Considerations for Neural Machine Translation Data Augmentation
Paper: NoSyntaxationWithoutRepresentation.pdf
- Notebooks to obtain the data:
  - `Download_dataset_iwslt2017.ipynb`: download and produce the 10% sample of the data used in the paper
- Notebooks to train models:
  - `TrainLSTM.ipynb`: all LSTM methods excluding sequence matching methods
  - `similarity_ds_k=2.ipynb` and `similarity_ds_k=10.ipynb`: LSTM sequence matching methods using similarity
  - `TrainTransformer.ipynb`: all Transformer methods
  - `LanguageModel.ipynb`: training the language model used in the `LMsample` and `soft` methods
- Other notebooks:
  - `BEAM_BLEU.ipynb`: evaluation; re-compute the BLEU score with beam search, compute the POS BLEU score
  - `LM_POS_Experiments.ipynb`: experiment; looking at how well the language model matches part of speech
  - `CustomTransformer.ipynb`: development; developing and testing the transformer architecture, contains links to transformer resources
- Functions for transformer models:
  - `embeddingTF.py`: `Embedder` and `PositionalEncoding`
  - `sublayersTF.py`: `SublayerConnection` (layer norm & residual connection), `FeedForward`, `attention`, `MultiHeadedAttention`, and `clones` (replicates layers)
  - `layersTF.py`: `EncoderLayer` and `DecoderLayer`
  - `stacksTF.py`: `Encoder` and `Decoder`, which construct the encoder and decoder stacks from the encoder and decoder layers, respectively
  - `encoderTF.py`: `FullEncoder`, which allows augmentation to occur in the embedding - positional encoding - encoder structure
  - `decoderTF.py`: `FullDecoder`, which allows augmentation to occur in the embedding - positional encoding - decoder structure
  - `seq2seqTF.py`: `Seq2SeqTF`, which contains the custom encoder and decoder and fully defines the transformer seq2seq model
  - `batchTF.py`: `BatchTF`, which formats source and target inputs to yield shifted targets, a source mask, and a target mask (`future_mask` provides decoder-specific masking)
  - `trainTF.py`: `train`, which uses `train_epoch` and `val_epoch` to create the training scheme, plus `greedy_decode` and `translate_corpus`
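The decoder-specific masking that `BatchTF` relies on can be sketched as follows. This is an illustration in the style of the standard "subsequent mask" used in transformer decoders; the actual `future_mask` in `batchTF.py` may differ in details. Entry `(i, j)` of the mask is `True` iff target position `i` is allowed to attend to position `j`, i.e. only to itself and earlier positions.

```python
import torch

def future_mask(size):
    # Illustrative sketch (the repo's future_mask may differ): put ones
    # strictly above the diagonal, then invert, so each position can
    # attend only to itself and earlier positions.
    upper = torch.triu(torch.ones(1, size, size, dtype=torch.uint8), diagonal=1)
    return upper == 0
```

Combined with the source padding mask, this keeps the decoder autoregressive during training even though all target positions are computed in parallel.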
- Functions for LSTM models:
  - `train.py`: training functions for LSTM models
  - `Seq2Seq.py`: model class
  - `EncoderLSTM.py`: encoder class, including functions for all augmentations
  - `DecoderLSTM.py`: decoder class, including functions for seqmix augmentations
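The seqmix-style augmentation named above can be illustrated with a small sketch. Both `mixup_embeddings` and its exact form here are hypothetical, not the repo's API; `EncoderLSTM.py` and `DecoderLSTM.py` may implement the details differently. The idea is to interpolate two aligned embedding sequences with a coefficient drawn from a Beta distribution, mixup-style.

```python
import torch

def mixup_embeddings(emb_a, emb_b, alpha=0.1):
    # Hypothetical sketch of a seqmix/mixup-style augmentation: draw
    # lambda ~ Beta(alpha, alpha) and convexly combine two (seq_len,
    # emb_dim) embedding tensors; lambda is returned so the training
    # loss can be mixed with the same coefficient.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * emb_a + (1.0 - lam) * emb_b, lam
```

A small `alpha` concentrates the Beta distribution near 0 and 1, so most mixed sequences stay close to one of the two originals.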
- Other functions:
  - `load_data.py`: creating and loading pickled datasets and dataloaders
  - `load_lm.py`: load the language model developed in `LanguageModel.ipynb`
- Download the full data from torchtext:

```python
from torchtext.datasets import IWSLT2017
train_iter, valid_iter, test_iter = IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
```
- Run `Download_dataset_iwslt2017.ipynb` to get a 10% sample of the dataset and save it as pickles
- Run `load_and_save(batch1=True)` from `load_data.py` to build the dataloaders used in our LSTM models and save them to pickle files
- Run `load_and_save(batch1=False)` from `load_data.py` to build the dataloaders used in our Transformer models and save them to pickle files
- In our code, we use `load_pickled_dataloaders(batch1=True)` and `load_pickled_dataloaders(batch1=False)` from `load_data.py` to load dataloaders from the pickle files. You'll need to pass in `PARENT_DIR` as the location of your `data` folder.
- Download the following directories and save them to your own `data` folder:
  - `dataloaders10perc`: used for LSTM models and the LM. https://drive.google.com/drive/folders/18K6XpYgTmLZkPtLQw4-8gqUyeAGWPF-u?usp=sharing
  - `dataloaders10perc_batchsize32`: used for transformer models (larger batch size). https://drive.google.com/drive/folders/16_hx53i473FjJfn4sfdLQ4ZTdxDehUBT?usp=sharing
- In our code, we use `load_pickled_dataloaders(batch1=True)` and `load_pickled_dataloaders(batch1=False)` from `load_data.py` to load dataloaders from the pickle files for the LSTM and transformer models, respectively. You'll need to pass in `PARENT_DIR` as the location of your `data` folder.
- Package versions:
  - `sys` 3.7.12
  - `tqdm` 4.62.3
  - `numpy` 1.19.5
  - `matplotlib` 3.2.2
  - `pandas` 1.1.5
  - `torch` 1.10.0+cu111
  - `torchtext` 0.11.0
  - `spacy` 2.2.4
  - `transformers` 4.6.0
  - `sentence_transformers` 2.1.0
  - `os`, `typing`, `pickle`, `timeit`, `operator`, `collections`, `copy`, `random`, `math` \*

\* (Python 3.6.9 Standard Library)
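The pinned third-party versions above could be captured in a `requirements.txt` along these lines (a sketch; standard-library modules need no entry, and the `+cu111` build of torch requires installing from the PyTorch CUDA wheel index rather than plain PyPI):

```
tqdm==4.62.3
numpy==1.19.5
matplotlib==3.2.2
pandas==1.1.5
torch==1.10.0+cu111
torchtext==0.11.0
spacy==2.2.4
transformers==4.6.0
sentence-transformers==2.1.0
```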