
BIU-NLP/ag-evaluation

 
 


AG-Evaluation

AG is a method for reliable evaluation of distributional semantic models.
It was introduced in the paper Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure.

Here we provide:

Requirements

  • Python 2.7
  • gensim (only for the example script)

Example

Run the following command in a shell:

$ python sample.py

The code in sample.py loads a gensim word2vec model and runs evaluation on the 'nn' dataset.
Note that the model it uses (model.vec) covers only part of the vocabulary, so some of the comparisons in the datasets will not be used. To get warnings for out-of-vocabulary (OOV) words, change the print_oov parameter to True.

Can I use a model created by another library (not gensim)?

Sure, the model does not have to be a gensim model.
It just needs to be encapsulated in a class with a method "similarity" which takes two words and returns a score.
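
As a minimal sketch of such a wrapper, the class below exposes a "similarity" method over a plain dictionary of word vectors. The class name DictModel, the toy vectors, and the choice of cosine similarity are assumptions made for illustration. How evaluator.py expects out-of-vocabulary words to be signalled is not covered here (this sketch simply raises KeyError), so check the code before relying on it.

```python
import math


class DictModel(object):
    """Hypothetical wrapper exposing similarity(w1, w2) over a dict of vectors."""

    def __init__(self, vectors):
        # vectors: dict mapping word -> list of floats (all of the same length)
        self.vectors = vectors

    def similarity(self, w1, w2):
        # Cosine similarity; raises KeyError if either word is missing.
        v1, v2 = self.vectors[w1], self.vectors[w2]
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2)


# Toy vectors, for illustration only.
model = DictModel({"cat": [1.0, 0.0], "dog": [1.0, 1.0], "car": [0.0, 1.0]})
print(model.similarity("cat", "dog") > model.similarity("cat", "car"))
```

An object like this could then be passed wherever sample.py passes the gensim model, assuming the evaluator only calls "similarity".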

Can I use the AG method to evaluate models of other languages?

Of course, you just need to provide matching datasets which follow the structure described in the paper.

Can I perform more fine-grained analysis?

Yes, you can filter comparisons by different properties of the Comparison class (declared in evaluator.py).
For example, by changing the lambda in the last line of sample.py from comp: comp.set_name == 'nn' to comp: comp.set_name == 'nn' and comp.compare_type == 'randoms', you include only "positive-random" comparisons in the evaluation.
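
The effect of the two filters can be sketched in isolation. The namedtuple below is a hypothetical stand-in for the Comparison class, modelling only the set_name and compare_type attributes mentioned above. The values 'other' and 'vv' are placeholders, not names taken from the datasets.

```python
from collections import namedtuple

# Hypothetical stand-in for the Comparison class in evaluator.py.
Comparison = namedtuple("Comparison", ["set_name", "compare_type"])

comparisons = [
    Comparison("nn", "randoms"),
    Comparison("nn", "other"),    # placeholder compare_type
    Comparison("vv", "randoms"),  # placeholder set_name
]

# The original filter from sample.py and the narrower one described above.
by_set = lambda comp: comp.set_name == 'nn'
by_set_and_type = lambda comp: comp.set_name == 'nn' and comp.compare_type == 'randoms'

selected_by_set = [c for c in comparisons if by_set(c)]
selected_by_both = [c for c in comparisons if by_set_and_type(c)]
print(len(selected_by_set))
print(len(selected_by_both))
```

The second filter keeps a subset of what the first one keeps, so scores computed with it reflect only the "positive-random" comparisons of the 'nn' dataset.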

The provided datasets

The 'datasets' directory is divided into several sub-directories:

  • "basic" - in these datasets, all the words are base forms
  • "inflected" - these datasets contain the same words as 'basic', but inflected to other forms (to evaluate the effect of rich morphology)
  • "rare" - in these datasets, all the target words are rare (occur fewer than 100 times in Hebrew Wikipedia)
  • "ambiguous" - in these datasets, the target words are morphologically ambiguous (to evaluate the ambiguity effect)
  • "cohyponyms" - datasets in which the preferred-relation is defined as "cohyponyms" (in contrast to "hyponym-hypernym" in the other datasets)

References

If you make use of this software for research purposes, we would appreciate it if you cite the following paper:

@InProceedings{avraham-goldberg:2016:RepEval,
  author    = {Avraham, Oded  and  Goldberg, Yoav},
  title     = {Improving Reliability of Word Similarity Evaluation by Redesigning Annotation Task and Performance Measure},
  booktitle = {Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {106--110},
  url       = {http://anthology.aclweb.org/W16-2519}
}

Contact

For any questions, please contact [email protected]
