The context2vec toolkit

With this code you can:

  • Use our pre-trained models to represent the sentential contexts of target words, and the target words themselves, as low-dimensional vector representations.
  • Learn your own context2vec models with your choice of learning corpus and hyperparameters.

Please cite the following paper if you use the code:

context2vec: Learning Generic Context Embedding with Bidirectional LSTM
Oren Melamud, Jacob Goldberger, Ido Dagan. CoNLL, 2016 [pdf].

Requirements

  • Python 2.7
  • Chainer 1.7 (chainer)
  • NLTK 3.0 (NLTK) - optional (only required for the AWE baseline and MSCC evaluation)

Installation

  • Download the code
  • python setup.py install

Quick-start

  • Download pre-trained context2vec models from [here]
  • Unzip a model into MODEL_DIR
  • Run:
python context2vec/eval/explore_context2vec.py MODEL_DIR/MODEL_NAME.params
>> this is a [] book
  • This will embed the entire sentential context 'this is a [] book' and output the top-10 target words whose embeddings are closest to that of the context.
  • Use this as sample code to help you integrate context2vec into your own application; a minimal integration sketch follows below.
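The following sketch shows one way such an integration might look. It assumes, based on explore_context2vec.py, that the ModelReader helper in context2vec/common/model_reader.py exposes a target-embedding matrix w, an index2word vocabulary map, and a model whose context2vec(tokens, position) method returns the context embedding; verify these names against that script before relying on them.

# Minimal integration sketch. Attribute and method names follow
# explore_context2vec.py but are assumptions here -- check that script.
import numpy as np
from context2vec.common.model_reader import ModelReader

model_reader = ModelReader('MODEL_DIR/MODEL_NAME.params')
w = model_reader.w                    # target word embeddings, one row per word
index2word = model_reader.index2word  # maps row index -> word string
model = model_reader.model

# Embed the context of the blank in 'this is a [] book'.
tokens = ['this', 'is', 'a', '[]', 'book']
context_v = model.context2vec(tokens, 3)  # 3 = position of the blank
context_v = context_v / np.sqrt((context_v * context_v).sum())

# Rank target words by similarity between their embeddings and the context.
scores = w.dot(context_v)
for i in np.argsort(scores)[::-1][:10]:
    print('%s\t%.4f' % (index2word[i], scores[i]))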

Training a new context2vec model

  • CORPUS_FILE needs to contain your learning corpus, with one sentence per line and tokens separated by spaces (see the toy example after this list).
  • Run:
python context2vec/train/corpus_by_sent_length.py CORPUS_FILE [max-sentence-length]
  • This will create a directory CORPUS_FILE.DIR containing your preprocessed learning corpus.
  • Run:
python context2vec/train/train_context2vec.py -i CORPUS_FILE.DIR -w WORD_EMBEDDINGS -m MODEL -c lstm --deep yes -t 3 --dropout 0.0 -u 300 -e 10 -p 0.75 -b 100 -g 0
  • This will create a WORD_EMBEDDINGS.targets file with your target word embeddings, a MODEL file, and a MODEL.params file. Put all three in the same directory MODEL_DIR and you're done.
  • See usage documentation for all run-time parameters.
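To make the expected corpus format concrete, here is a toy sketch that writes a tiny CORPUS_FILE in that format; the file name toy_corpus.txt is purely illustrative.

# Write a toy learning corpus: one sentence per line,
# tokens separated by single spaces (file name is illustrative).
sentences = [
    ['this', 'is', 'a', 'good', 'book'],
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
]
with open('toy_corpus.txt', 'w') as f:
    for tokens in sentences:
        f.write(' '.join(tokens) + '\n')

Running corpus_by_sent_length.py on this file then produces toy_corpus.txt.DIR; as the script name suggests, the preprocessing presumably groups sentences by length, which is what makes mini-batched training efficient.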

NOTE:

  • The current code lowercases all corpus words.
  • Use of a GPU and mini-batching is highly recommended to achieve good training speeds.

Evaluation

Microsoft Sentence Completion Challenge (MSCC)

  • Download the train and test datasets from [here].
  • Split the test files into dev and test if you wish to do development tuning.
  • Download the pre-trained context2vec model for MSCC from [here].
  • Or, alternatively, train your own model as follows:
    • Run context2vec/eval/mscc_text_tokenize.py INPUT_FILE OUTPUT_FILE for every INPUT_FILE in the MSCC train set (a sketch of this loop appears after this list).
    • Concatenate all output files into one large learning corpus file.
    • Train a model as explained above.
  • Run:
python context2vec/eval/sentence_completion.py Holmes.machine_format.questions.txt Holmes.machine_format.answers.txt RESULTS_FILE MODEL_NAME.params
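As referenced above, here is a sketch of the tokenize-and-concatenate preprocessing step. The glob pattern and output file names are hypothetical; adapt them to wherever you unpacked the MSCC train set.

# Sketch: tokenize every MSCC training file, then concatenate the
# outputs into one learning corpus. Paths and patterns are hypothetical.
import glob
import subprocess

out_paths = []
for in_path in sorted(glob.glob('mscc_train/*.TXT')):
    out_path = in_path + '.tok'
    subprocess.check_call(
        ['python', 'context2vec/eval/mscc_text_tokenize.py', in_path, out_path])
    out_paths.append(out_path)

with open('mscc_corpus.txt', 'w') as corpus:
    for out_path in out_paths:
        with open(out_path) as f:
            corpus.write(f.read())

The resulting mscc_corpus.txt can then be fed to corpus_by_sent_length.py and train_context2vec.py as described in the training section.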

Senseval-3

  • Download the 'English lexical sample' train and test datasets from [here].
  • Download the Senseval scorer script (scorer2) from [here] and build it.
  • Train your own context2vec model or use one of the pre-trained models provided.
  • For development runs do:
python context2vec/eval/wsd/wsd_main.py EnglishLS.train EnglishLS.train RESULTS_FILE MODEL_NAME.params 1
scorer2 RESULTS_FILE EnglishLS.train.key EnglishLS.sensemap
  • For test runs do:
python context2vec/eval/wsd/wsd_main.py EnglishLS.train EnglishLS.test RESULTS_FILE MODEL_NAME.params 1
scorer2 RESULTS_FILE EnglishLS.test.key EnglishLS.sensemap

Lexical Substitution

The code for the lexical substitution evaluation is included in a separate repository [here].

Known issues

  • All words are converted to lowercase.
  • GPU and mini-batch processing are not supported at test time.

License

Apache 2.0
