Simple ukWaC corpus semantic roles API

If you have a directory containing gzipped ukWaC XML files with extracted NP heads, you can use this API to read them. There are two ways to do this: reading the corpus files directly, or reading from an .h5 file. To read from an .h5 file you first need to generate it, which can take ~11-15h depending on the selected options. Direct reading gives you more output flexibility; .h5 reading is roughly 3x faster.

Quick Start

Suppose you want to parse the ukWaC corpus semantic roles and extract as much as possible from the data, and speed is not a concern. The output should contain each governor with its corresponding set of arguments, plus the source and text data. All heads should be lemmatized and delexicalized. In addition, you want to apply a word filter of the top 50k most frequent tokens.

from ukwac_api import Corpus, CorpusReader

reader = CorpusReader('march_2017_heads_low')
reader.create(None, 0, True)  # h5name, compression, lemmatize
reader.generate_freqlist('wordfreq.pickle', 50000)  # wordfreq, size
reader.set_wordfilter('wordfreq.txt')
it = reader.read('set', True, ['source', 'text'], True)  # mode, delexicalize, extract, lemmatize
for govset in it:
    # a single formatted string prints the same under Python 2 and Python 3
    print('%s %s %s' % (govset[:-1], govset[-1]['source'], govset[-1]['text']))

API Classes and Methods

Corpus:

    .create(h5name='ukwac.h5')  # parse and serialize the corpus into .h5 format
    .clean_up()  # remove all .h5 and .pickle files generated by create() method
    .generate_freqlist('wordfreq.pickle', 50000)  # build a list of the n most frequent words from wordfreq.pickle
    .set_wordfilter('wordfreq.txt')  # set word filter

CorpusReader(Corpus):

    .connect(h5file)  # set the path to .h5 corpus file and use it
    .disconnect()  # close .h5 corpus file and switch to direct corpus parsing
    .read(mode, delexicalize, extract, lemmatize)  # return corpus iterator

ukWaC xml head file structure:

<sents>
<text/>
<s>
   <predicate>
      <governor>take/vbd/2</governor>
      <dependencies>
         <dep algorithm="malt" source="i/prp/1" text="i" type="a0">i/prp/1</dep>
         <dep source="take/vbd/2" text="take" type="v">take/vbd/2</dep>
         <dep algorithm="failed" source="very/rb/3 little/jj/4" text="very little" type="a2">very/rb/3 little/jj/4</dep>
         <dep algorithm="malt" source="part/nn/5" text="part" type="a1">part/nn/5</dep>
         <dep algorithm="malt" source="in/in/6 the/dt/7 conversation/nn/8" text="in the conversation" type="am-loc">conversation/nn/8</dep>
      </dependencies>
   </predicate>
</s>
...
</sents>

<sents> – the root element of the xml file, contains <text> and <s> elements.

<text> – has an “id” attribute holding the URL that the sentences below come from, up to the next <text> element.

<s> – an element that contains a sentence processed by head searching algorithms.

<predicate> – sentence predicate block.

<governor> – governor of the respective predicate block, e.g. examine/vbz/12, where examine is a word, vbz is a POS-tag and 12 is word index in a sentence <s> element.

<dependencies> – a block of dependencies for the current sentence governor, contains <dep> elements.

<dep> – a single dependency of the current governor, contains a head word extracted by one of the head searching algorithms, e.g. state/nn/16 where state is a head word, nn is a POS-tag, 16 is a word index. The element also has the following attributes:

  • algorithm – the name of the head searching algorithm that found the head, e.g. malt, malt_span, linear, or failed if no algorithm was able to find a head.

  • source – contains tokenized, lemmatized and POS-tagged clause governed by the <governor>. This clause is used by head searching algorithms.

  • text – contains tokenized, lemmatized clause governed by the <governor>. Similar to “source” but without POS-tags.

  • type – dependency type that follows PropBank annotation.
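
The structure above can be read with the standard library alone. The following is a minimal sketch, not the parser used by ukwac_api: `split_token` and `iter_predicates` are hypothetical helper names, and the sample string is the example from this README without the elided `...` line.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<sents>
<text/>
<s>
   <predicate>
      <governor>take/vbd/2</governor>
      <dependencies>
         <dep algorithm="malt" source="i/prp/1" text="i" type="a0">i/prp/1</dep>
         <dep source="take/vbd/2" text="take" type="v">take/vbd/2</dep>
         <dep algorithm="failed" source="very/rb/3 little/jj/4" text="very little" type="a2">very/rb/3 little/jj/4</dep>
         <dep algorithm="malt" source="part/nn/5" text="part" type="a1">part/nn/5</dep>
         <dep algorithm="malt" source="in/in/6 the/dt/7 conversation/nn/8" text="in the conversation" type="am-loc">conversation/nn/8</dep>
      </dependencies>
   </predicate>
</s>
</sents>"""


def split_token(token):
    """Split 'word/pos/index' into (word, pos, index)."""
    # rsplit keeps any '/' that is part of the word itself
    word, pos, index = token.rsplit('/', 2)
    return word, pos, int(index)


def iter_predicates(xml_string):
    """Yield (governor, deps) for every <predicate> block.

    deps is a list of (heads, type, algorithm); heads is a list because
    a 'failed' dependency keeps the whole multi-token clause.
    """
    root = ET.fromstring(xml_string)
    for predicate in root.iter('predicate'):
        governor = split_token(predicate.findtext('governor'))
        deps = []
        for dep in predicate.iter('dep'):
            heads = [split_token(tok) for tok in dep.text.split()]
            # 'algorithm' is absent on the type="v" entry, so get() returns None
            deps.append((heads, dep.get('type'), dep.get('algorithm')))
        yield governor, deps


for governor, deps in iter_predicates(SAMPLE):
    print(governor, deps)
```

For the real corpus files, the same functions could be fed the decompressed content of each .gz file, for example via `gzip.open(path).read()`.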

Direct reading: CorpusReader.read(mode, delex, extract, lemmatize)

Iteratively read through all .xml/.gz files found in the specified directory.

mode parameter

You can set the reading mode to be either "set", "random_set" or "single". In "set" mode the output will contain the governor and all its respective dependants within one sentence.

('sent id','governor', 'gov postag', 
    (('arg', 'arg postag', 'role', 'alg'), (...), (...))
)

"random_set" mode (.h5 only) returns a govset chosen at random, without replacement, from the .h5 file. The search algorithm looks for a valid chunk consisting of a governor and its arguments and, when one is found, returns it as a tuple.

('sent id','governor', 'gov postag', 
    (('arg', 'arg postag', 'role', 'alg'), (...), (...))
)

In "single" mode the output will contain governor -> dependant pair and their semantic attributes.

('sent id','governor', 'gov postag', 'arg', 'arg postag', 'role', 'algorithm')

In other words, these are two different representations of the same data.
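
To make the relationship between the two representations concrete, here is a small sketch that converts between the tuple layouts shown above. `set_to_single` and `singles_to_sets` are hypothetical helpers written for illustration; they are not part of ukwac_api.

```python
from itertools import groupby


def set_to_single(govset):
    """Expand one 'set' tuple into a list of 'single' tuples."""
    sent_id, governor, gov_postag, args = govset
    return [(sent_id, governor, gov_postag) + tuple(arg) for arg in args]


def singles_to_sets(singles):
    """Group consecutive 'single' tuples that share sentence id and governor."""
    sets = []
    for key, group in groupby(singles, key=lambda s: s[:3]):
        # key is (sent id, governor, gov postag); the rest of each tuple is one arg
        sets.append(key + (tuple(s[3:] for s in group),))
    return sets


govset = (0, 'take', 'vbd',
          (('i', 'prp', 'a0', 'malt'), ('part', 'nn', 'a1', 'malt')))
singles = set_to_single(govset)
print(singles)
print(singles_to_sets(singles) == [govset])
```

The grouping assumes that pairs belonging to one governor are adjacent in the stream; two adjacent predicates with an identical sentence id, governor and POS-tag would be merged by this sketch.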

USAGE:

    reader = CorpusReader('ukwac_heads')
    reader.read('set')

delex parameter

During direct parsing you can also choose whether or not to delexicalize the output. Delexicalization is the process of replacing string data with integers.

For example:

('establish', 'vbg', 'constitution', 'nnp', 1, 3) --> (234, 21, 120, 10, 1, 3)

However, for delexicalization to work you need the .pickle files that the .create() method generates together with the .h5 files. If a required .pickle file is not found, the parameter is ignored. Delexicalization does not apply to the extracted source and text attributes.
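
The idea can be sketched with plain dictionaries. The mappings below are made up so that they reproduce the example above; the real .pickle files and their internal layout are not described in this README, so `build_mapping` and `delexicalize` are illustrative assumptions only.

```python
def build_mapping(items):
    """Assign an integer id to each distinct string, in first-seen order."""
    mapping = {}
    for item in items:
        mapping.setdefault(item, len(mapping))
    return mapping


def delexicalize(record, word2id, pos2id):
    """Replace words and POS-tags in a record with their integer ids."""
    governor, gov_pos, arg, arg_pos, role, alg = record
    return (word2id[governor], pos2id[gov_pos],
            word2id[arg], pos2id[arg_pos], role, alg)


# hypothetical mappings chosen to match the README example
word2id = {'establish': 234, 'constitution': 120}
pos2id = {'vbg': 21, 'nnp': 10}

record = ('establish', 'vbg', 'constitution', 'nnp', 1, 3)
print(delexicalize(record, word2id, pos2id))
```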

USAGE:

    reader = CorpusReader('ukwac_heads')
    reader.read('set', True)

extract parameter

As you can see above, xml head files also contain source and text attributes. You can extract their values by passing a list of attribute names as the extract parameter of the read() method. These values are collected in a dict that is appended to the output, so you can access them as output[-1]['source'], for example.

USAGE:

    reader = CorpusReader('ukwac_heads')
    reader.read('set', True, ['source','text'])

You can list any attribute, e.g. type. Its value will also be added to the dictionary, even though it is already extracted as a main component of the output.
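
A minimal sketch of the resulting output shape, assuming the requested attributes end up in a plain dict that becomes the last element of a record. `attach_extras` is a hypothetical helper, and how the library arranges the dict for several dependants in "set" mode is not shown here.

```python
def attach_extras(record, dep_attributes, extract):
    """Append a dict with the requested <dep> attributes to a record."""
    extras = {name: dep_attributes[name]
              for name in extract if name in dep_attributes}
    return record + (extras,)


# attributes of one <dep> element from the sample file above
dep_attributes = {'algorithm': 'malt', 'source': 'part/nn/5',
                  'text': 'part', 'type': 'a1'}
record = (0, 'take', 'vbd', 'part', 'nn', 'a1', 'malt')

output = attach_extras(record, dep_attributes, ['source', 'text'])
print(output[:-1], output[-1]['source'], output[-1]['text'])
```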

lemmatize parameter

Applies nltk lemmatization to governors and dependants. The source and text attribute values are already lemmatized, since they come from the ukWaC corpus (though not 100%: due to ukWaC preprocessing inconsistencies some words may retain their original form).

USAGE:

    reader = CorpusReader('ukwac_heads')
    reader.read('set', True, ['source','text'], True)

Words filter

You can create or generate a list of words that are allowed in the output. A typical use case is when you need only the top 50k most frequent words in your output. The .generate_freqlist() method picks up the previously saved wordfreq.pickle and generates a list of the n most frequent words for you.

USAGE:

    corpus = Corpus('ukwac_heads')
    corpus.generate_freqlist('wordfreq.pickle', 50000)
    reader = CorpusReader('ukwac_heads', 'wordfreq.txt')
    reader.read()
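
Conceptually, the filter amounts to taking the n most frequent words and dropping records that use anything else. The sketch below assumes the frequency data is a word-to-count mapping and that both the governor and the argument must be in the list; neither detail is confirmed by this README, and `top_n_words` and `apply_wordfilter` are hypothetical names.

```python
from collections import Counter


def top_n_words(word_counts, n):
    """Return the n most frequent words from a word -> count mapping."""
    return [word for word, _ in Counter(word_counts).most_common(n)]


def apply_wordfilter(singles, allowed_words):
    """Keep 'single' tuples whose governor and argument are both allowed."""
    allowed = set(allowed_words)
    # index 1 is the governor, index 3 is the argument
    return [s for s in singles if s[1] in allowed and s[3] in allowed]


counts = {'take': 10, 'part': 7, 'conversation': 3, 'zyx': 1}
top2 = top_n_words(counts, 2)

singles = [
    (0, 'take', 'vbd', 'part', 'nn', 'a1', 'malt'),
    (0, 'take', 'vbd', 'conversation', 'nn', 'am-loc', 'malt'),
]
print(top2)
print(apply_wordfilter(singles, top2))
```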

Algorithms

If you are familiar with the ukWaC head searching algorithms (malt, malt_span, linear, failed), you can choose which algorithms' results to include. A typical use case is excluding NP heads for which head finding failed.

USAGE:

    reader = CorpusReader('ukwac_heads', 'top50k.txt', ['malt','malt_span', 'linear'])
    reader.read()
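
In terms of the "set" output format, this kind of filtering can be pictured as dropping arguments whose algorithm is not in the allowed list. The helper below is a hypothetical illustration, not the library's implementation.

```python
def filter_govset(govset, algorithms):
    """Drop arguments whose head was found by an algorithm not in the list."""
    sent_id, governor, gov_postag, args = govset
    allowed = set(algorithms)
    # each arg is ('arg', 'arg postag', 'role', 'alg'), so index 3 is the algorithm
    kept = tuple(arg for arg in args if arg[3] in allowed)
    return (sent_id, governor, gov_postag, kept)


govset = (0, 'take', 'vbd', (
    ('i', 'prp', 'a0', 'malt'),
    ('very little', 'rb', 'a2', 'failed'),
    ('part', 'nn', 'a1', 'malt'),
))
print(filter_govset(govset, ['malt', 'malt_span', 'linear']))
```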

.h5 file creation: Corpus.create(h5name, compress, lemmatize, mode)

Parse the ukWaC corpus and store the output in three .h5 files corresponding to the train/valid/test sets, split in a 70/20/10 ratio. You can apply lemmatization, which reduces the vocabulary size. All parsed data is stored as integers, e.g. governor "go" is encoded as 3, role type "A0" as 1, "NN" part-of-speech as 5, etc.

The data in .h5 is stored in the following format:

-1, 0, 0, 1, 1, 5, 2, 2, 1, 1, 1, 3, 2, 3, 3, -1, 1, 2, 1, 5, 2, 2, -1, ...

Where -1 is a delimiter for governor set.
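
A flat sequence like this can be cut back into governor sets by splitting on the delimiter. The sketch below only handles the delimiter and makes no assumption about what the individual integers inside a chunk mean; `split_govsets` is a hypothetical helper, and reading the array from the .h5 file itself is left out.

```python
def split_govsets(flat, delimiter=-1):
    """Split a flat integer sequence into chunks closed by the delimiter."""
    chunks, current = [], []
    for value in flat:
        if value == delimiter:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(value)
    # values after the last delimiter form an incomplete chunk and are dropped
    return chunks


flat = [-1, 0, 0, 1, 1, 5, 2, 2, 1, 1, 1, 3, 2, 3, 3, -1, 1, 2, 1, 5, 2, 2, -1]
for chunk in split_govsets(flat):
    print(chunk)
```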

USAGE:

    reader = CorpusReader('ukwac_heads')
    reader.create('ukwac.h5')  # requires ~10-12h

After the operation has finished, you can read the parsed corpus data from the saved .h5 file. When reading from an .h5 file, the mode and delex parameters are ignored. By default, the data stored in .h5 is in 'single' mode format and is already delexicalized. If you do not supply an .h5 name, the create() method generates only the word and POS-tag mappings (.pickle files). Since dumping data into .h5 requires more processing time, generating only the mappings is faster.

h5name parameter

If specified, the create() method dumps the extracted data into an .h5 file. If not specified, only the .pickle dicts for delexicalization and frequency lists are generated.

compress parameter

If h5name is specified, sets the compression level (0-9) applied to the .h5 file.

lemmatize parameter

Applies nltk lemmatization to governors and dependants. The source and text attribute values are already lemmatized, since they come from the ukWaC corpus (though not 100%: due to ukWaC preprocessing inconsistencies some words may retain their original form).

USAGE:

    reader = CorpusReader('ukwac_heads')
    reader.create('ukwac.h5', 0, True)  # h5name, compress, lemmatize
    reader.connect('ukwac.h5')
    reader.read()  # mode and delexicalize params will be ignored

To read the "train", "valid" and "test" .h5 files, instantiate a new Corpus or CorpusReader object for each, connect the required file and call the read() method.

mode parameter

You can set the reading mode to be either "set" or "single". In "set" mode the output will contain the governor and all its respective dependants within one sentence.

('sent id','governor', 'gov postag', 
    'arg0', 'arg0 postag', 'role0', 'alg0', 
    'arg1', 'arg1 postag', 'role1', 'alg1', 
    ...
)

In "single" mode the output will contain governor -> dependant pair and their semantic attributes.

('sent id','governor', 'gov postag', 'arg', 'arg postag', 'role', 'algorithm')

In other words, these are two different representations of the same data.

IMPORTANT!

The .h5 file is created in "single" mode by default. If you attempt to read such a file in "set" mode, it will fail. In other words, if you plan to read the .h5 file in "single" mode, call create('ukwac_single.h5'). If you plan to read it in "set" or "random_set" mode, call create('ukwac_set.h5', mode='set').

USAGE:

    reader = Corpus('ukwac_heads')
    reader.create('ukwac.h5', compress=0, lemmatize=False, mode='set')

Words filter

You can control the number of included governors and their dependants by applying a word filter before .h5 file creation, for example if you only want to include the top 10k or 50k most frequent words in the .h5 file.

USAGE:

    reader = CorpusReader('ukwac_heads', 'wordfilter.txt')
    reader.create('ukwac.h5')  # only the words from wordfilter.txt are included

Algorithms

As with direct parsing, you can control which NP head algorithms' results are included.

USAGE:

    reader = CorpusReader('ukwac_heads', 'wordfilter.txt', ['malt','malt_span'])
    reader.create('ukwac.h5')  # only the words from wordfilter.txt and malt, malt_span algs are included

Direct reading lets you choose delexicalization options at the cost of slower reading times. Reading from .h5 gives the fastest reading times, but no delexicalization option is available.

Keep in mind that converting the corpus into .h5 format reduces its size from 18GB to 8-12GB. On top of that, you can use the compress parameter to reduce the corpus size further, down to 4.3GB.

Speed benchmarks

Direct corpus parsing is 50% slower per file. On the other hand, conversion to .h5 requires ~10-12h, ~2.5GB of RAM and ~12GB of free disk space (~4.3GB with compression=5). The fastest option is to parse the corpus once and create an .h5 file that contains delexicalized elements (int32). Reading from this file is very fast because no intermediate preprocessing is involved. Note that a word filter cannot be applied while reading an .h5 file, only before its creation.

I/O speed (per single xml file):

| type | time | cpu |
| --- | --- | --- |
| direct reading | 4.5s | Intel Core i5-4210U 2700 MHz |
| direct reading (with lemmatization) | 10.5s | Intel Core i5-4210U 2700 MHz |
| creating .pickles | 3.8s | Intel Core i5-4210U 2700 MHz |
| creating .pickles (with lemmatization) | 7.3s | Intel Core i5-4210U 2700 MHz |
| conversion to .h5 | 8.3s | Intel Core i5-4210U 2700 MHz |
| conversion to .h5 (with lemmatization) | 12.5s | Intel Core i5-4210U 2700 MHz |
| reading .h5 (~4MB) | 2s | Intel Core i5-4210U 2700 MHz |
| reading .h5 (~1.4MB, compression=9) | 2s | Intel Core i5-4210U 2700 MHz |

I/O speed (ukWaC corpus 3495 gzipped files ~18GB):

| type | time | cpu |
| --- | --- | --- |
| direct reading | 6-7h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| direct reading | 3h | Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz |
| direct reading (with lemmatization) | 3.4h | Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz |
| direct reading (with lemmatization, extracting) | 4.45h | Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz |
| conversion to .h5 (with lemmatization) | 14-15h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| generating .pickles | 6.4h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| generating .pickles (with lemmatization) | 9.2h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| conversion to .h5, generating .pickles | 11-12h | AMD Opteron(tm) Processor 6380, 2500 MHz |
| reading .h5 | 2h | AMD Opteron(tm) Processor 6380, 2500 MHz |

Relevant resources:

ukWaC corpus

"A large corpus automatically annotated with semantic role information"

"An exploration of semantic features in an unsupervised thematic fit evaluation framework"