Annotating-Non-Restrictive

Code, models, and corpus of non-restrictive noun phrase modifications.
Published in "Annotating and Predicting Non-Restrictive Noun Phrase Modifications" (Stanovsky and Dagan, ACL 2016)

Generating the corpus

To get the annotated corpus, you'll first need to obtain the CoNLL 2009 corpus from LDC (specifically, we'll use CoNLL2009-ST-English-train.txt).

Once you get it, run:

./generateCorpora.sh CoNLL2009-ST-English-train.txt

This will generate the corpus (train, dev and test splits) in the "corpus" directory.

Each CoNLL token will contain these two additional fields:

Restrictiveness, which has the following possible values:
- RSTR -- Marks a restrictive modifier.
- NON-RESTR -- Marks a non-restrictive modifier.
- _ -- Marks an unannotated token.

Modifier Type, marking the type of this modifier. Has the following possible values (see the paper for examples and evaluation):
- _ -- This token is not a modifier.
- APPOS-MOD -- Appositional modifier.
- INF-MOD -- Infinitival modifier.
- POSTADJ-MOD -- Postfix adjectival modifier.
- PP-MOD -- Prepositional modifier.
- PREADJ-MOD -- Prefix adjectival modifier.
- PREVERB-MOD -- Prefix verbal modifier.
- RC-MOD -- Relative Clause modifier.
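
As a rough illustration of how the two extra fields might be consumed, here is a minimal Python sketch. It assumes (this is not confirmed by the repository) that the generated files keep the tab-separated CoNLL 2009 layout with blank lines between sentences, and that Restrictiveness and Modifier Type are appended as the last two columns of each token line. The helper names `read_sentences` and `modifier_labels` and the toy rows are hypothetical; check a generated file and adjust the column indices if the layout differs.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of tab-split token rows; a blank line ends a sentence."""
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def modifier_labels(sentence: List[List[str]]) -> List[Tuple[str, str, str]]:
    """Return (form, restrictiveness, modifier_type) for every annotated token."""
    labels = []
    for row in sentence:
        form = row[1]  # CoNLL 2009: the second column is FORM
        # ASSUMPTION: the two added fields are the last two columns.
        restrictiveness, modifier_type = row[-2], row[-1]
        if restrictiveness != "_":  # "_" marks an unannotated token
            labels.append((form, restrictiveness, modifier_type))
    return labels


# Toy rows (made up, heavily shortened: ID, FORM, Restrictiveness, Modifier Type).
toy_lines = [
    "1\tthe\t_\t_\n",
    "2\tman\t_\t_\n",
    "3\ttall\tNON-RESTR\tPOSTADJ-MOD\n",
    "\n",
    "1\tred\tRSTR\tPREADJ-MOD\n",
    "2\tcar\t_\t_\n",
]

# Count (restrictiveness, modifier type) pairs over all sentences.
counts = Counter(
    (restr, mod_type)
    for sent in read_sentences(toy_lines)
    for _, restr, mod_type in modifier_labels(sent)
)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

To run this on real data, pass an open file handle from the corpus directory to `read_sentences` in place of `toy_lines`.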

classifiers -- Contains the code for the classifiers described in the paper.
diffs -- The diff files which, in conjunction with the CoNLL data, generate our annotated corpus.
features -- The CRF features for each of the training instances, used to train both CRF models.
models -- Pre-trained models, achieving the results described in the paper.
