Resource generation guide
##Distributional similarity models
The distsim project implements a distributional similarity tool. The input of the process is a corpus, defined in files (with an implementation of the SentenceReader interface). The output of the process consists of two Redis databases for each generated distributional model. The generation process is controlled by a set of configuration files which define the type of the distributional models, as well as various engineering aspects.
The distsim project can be downloaded from the Git repository of the Excitement Open Platform.
A demonstration of the tool, with an executable jar, can be downloaded from the artifactory repository.
For details on distributional similarity methods and a guide to the tool and its options, see the user guide. The rest of this page describes the steps for generating the basic and common models for the EOP (Lin proximity and dependency, balanced exclusion, and DIRT), based on a given set of pre-defined configurations, with special focus on configuration parameters that might need tuning for the given input corpus.
##Building distributional similarity model files
A set of running configurations with applicative scripts is provided with the demo.
As a first step, you can use these configurations to build four distributional models for the given corpus, by applying the build-model script for each of the four model types:
>configurations/lin/build-model configurations/lin/proximity/
>configurations/lin/build-model configurations/lin/dependency/
>configurations/bap/build-model configurations/bap/
>configurations/dirt/build-model configurations/dirt/
The generated model files will be stored in the 'models' directory.
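The four commands above can also be driven from one small script. The following is a minimal sketch, not part of the demo: it assumes it is run from the demo's root directory, that the build-model scripts are executable, and that a failed build reports a non-zero exit status. The names `BUILD_STEPS` and `build_all` are hypothetical.

```python
import subprocess

# (script, configuration directory) pairs, copied from the commands above.
BUILD_STEPS = [
    ("configurations/lin/build-model", "configurations/lin/proximity/"),
    ("configurations/lin/build-model", "configurations/lin/dependency/"),
    ("configurations/bap/build-model", "configurations/bap/"),
    ("configurations/dirt/build-model", "configurations/dirt/"),
]


def build_all(run=subprocess.run):
    """Run every build step in order and stop at the first failure."""
    for script, config_dir in BUILD_STEPS:
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status (an assumption
        # about how build-model reports errors).
        run([script, config_dir], check=True)


if __name__ == "__main__":
    build_all()
```

The `run` parameter exists only so that the command list can be inspected without actually launching the builds.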
To build models based on your own corpus, you may need to modify some of the configurations, as follows.
###Main configuration parameters
####Input corpus
The demo contains a tiny corpus for English and German. To use your own corpus, set the path of the corpus directory or file in the corpus property of the cooccurence-extractor module.
Models: lin, bap, dirt
File: coocuurence-extraction-memory.xml
Module: cooccurence-extractor
Property: corpus
Value: the path to your corpus dir/file
####Sentence reader
The eu.excitementproject.eop.distsim.builders.reader.SentenceReader interface defines the method for reading the next sentence from the corpus, and the representation of the read sentence, e.g., String (raw text) or BasicNode (root of the parsed sentence).
Models: lin, bap, dirt
File: coocuurence-extraction-memory.xml
Module: cooccurence-extractor
Property: sentence-reader-class
Value: the name of the SentenceReader class
There are various implementations of this interface; the following may be relevant for your corpus:
- eu.excitementproject.eop.distsim.builders.reader.LineBasedStringSentenceReader
Returns the next line of the given corpus, as a String.
- eu.excitementproject.eop.distsim.builders.reader.SerializedNodeSentenceReader
Returns a BasicNode representation of the next parsed sentence, by deserializing the next line of the stream.
- eu.excitementproject.eop.distsim.reader.cooccurrence.CollNodeSentenceReader
Returns a BasicNode representation of the next parsed sentence, by converting the next lines from CoNLL representation to a BasicNode.
Required configuration property:
part-of-speech-class - the name of a class which extends the eu.excitementproject.eop.common.representation.PartOfSpeech class, mapping a specific set of part-of-speech tags into the canonical representation defined by eu.excitementproject.eop.common.representation.CanonicalPosTag.
- eu.excitementproject.eop.distsim.builders.reader.UIMANodeSentenceReader
Returns a BasicNode representation of the next parsed sentence, given a UIMA CAS representation of the parsed corpus.
Required configuration property:
ae-template-file - the path to the analysis engine template file of the given UIMA CAS (otherwise, a default one will be selected).
- eu.excitementproject.eop.distsim.builders.reader.UKwacNodeSentenceReader
Returns a BasicNode representation of the next parsed sentence, given a UkWAC corpus.
Required configuration property:
is-corpus-index - true for an indexed UkWAC representation.
- eu.excitementproject.eop.distsim.builders.reader.XMLNodeSentenceReader
Returns a BasicNode representation of the next parsed sentence, given the EOP's serialization of parsed corpus (as defined in the eu.excitementproject.eop.common.representation.parse.tree.dependency.basic.xmldom.XmlTreePartOfSpeechFactory class).
Required configuration property:
ignore-saved-canonical-pos-tag - true if the part-of-speech representation ignores saved canonical tags (default: true).
The configuration files given in the demo, for instance, define XMLNodeSentenceReader for the English corpus, and CollNodeSentenceReader for the German one.
In case none of the above fits your corpus representation, you should implement your own SentenceReader class.
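To illustrate what a CoNLL-based reader has to do, here is a minimal sketch of turning CoNLL lines into one dependency tree per sentence. It is not the EOP implementation: the `Node` class is a hypothetical stand-in for BasicNode, and the column layout assumed is CoNLL-X (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL), which may differ from the variant your corpus uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed-sentence node (not the EOP BasicNode)."""
    form: str
    lemma: str
    pos: str
    deprel: str
    children: list = field(default_factory=list)


def build_tree(rows):
    """Link tokens to their heads; assumes CoNLL-X columns and one root (HEAD = 0)."""
    # Columns used: 0 = ID, 1 = FORM, 2 = LEMMA, 3 = CPOSTAG, 6 = HEAD, 7 = DEPREL.
    nodes = {int(r[0]): Node(form=r[1], lemma=r[2], pos=r[3], deprel=r[7]) for r in rows}
    root = None
    for r in rows:
        token_id, head_id = int(r[0]), int(r[6])
        if head_id == 0:
            root = nodes[token_id]
        else:
            nodes[head_id].children.append(nodes[token_id])
    return root


def read_conll_sentences(lines):
    """Yield the root Node of each sentence; sentences are separated by blank lines."""
    rows = []
    for line in lines:
        if line.strip():
            rows.append(line.rstrip("\n").split("\t"))
        elif rows:
            yield build_tree(rows)
            rows = []
    if rows:  # last sentence without a trailing blank line
        yield build_tree(rows)
```

A reader of this kind yields one tree per call, which mirrors the "next sentence" contract described above; mapping the corpus-specific tags to canonical ones (the role of part-of-speech-class) is left out of the sketch.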
####Co-occurrence extraction
The eu.excitementproject.eop.distsim.builders.cooccurrence.CooccurrenceExtraction interface defines the method for extracting co-occurrences from sentences (see user guide).
Models: lin, bap, dirt
File: coocuurence-extraction-memory.xml
Module: cooccurence-extractor
Property: extraction-class
Value: the name of the CooccurrenceExtraction class
There are various kinds of implementations for this interface:
- eu.excitementproject.eop.distsim.builders.cooccurrence.NodeBasedWordCooccurrenceExtraction
Extraction of co-occurrences, composed of lemma-pos pairs and their dependency relation, from a given parsed sentence, represented by a BasicNode (this class is currently defined in the demo configurations for the lexical models: bap and lin).
Required configuration property:
relevant-pos-list - a comma-separated list of parts of speech, which defines the relevant words for the model (only words assigned to one of these parts of speech are considered). The parts of speech are given in their canonical form, as defined in the eu.excitementproject.eop.common.representation.partofspeech.CanonicalPosTag class.
- eu.excitementproject.eop.distsim.builders.cooccurrence.RawTextBasedWordCooccurrenceExtraction
Extraction of co-occurrences, composed of word pairs, from a given raw-text sentence.
Required configuration properties:
window-size - the size of the window in which word pairs are taken, e.g., for window-size 2, the two words before and the two words after the given word are candidate pairs for it (default: 3).
stop-words-file - the path to a text file of stop words to be filtered from the model, one word per line (default: no stop-word list).
- eu.excitementproject.eop.distsim.builders.cooccurrence.NodeBasedPredArgCooccurrenceExtraction
Extraction of co-occurrences, composed of predicates and arguments, from a given parsed sentence, represented by a BasicNode (this class is currently defined in the demo configurations for the syntactic model: dirt).
- eu.excitementproject.eop.distsim.builders.cooccurrence.TupleBasedPredArgCooccurrenceExtraction
Extraction of co-occurrences, composed of predicates and arguments, based on a given string, composed of a binary predicate and its arguments, in the format: arg1 \t predicate \t arg2
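The effect of window-size and of the stop-word list on raw-text extraction can be sketched as follows. This is an illustration of the idea, not the code of RawTextBasedWordCooccurrenceExtraction: whitespace tokenisation, lower-casing, and the choice that stop words are skipped but still occupy a position in the window are all assumptions. The second function shows one plausible way to split the tab-separated tuple format; both function names are hypothetical.

```python
def window_cooccurrences(sentence, window_size=3, stop_words=frozenset()):
    """Return (word, neighbour) pairs within window_size tokens on either side."""
    # Assumption: tokens are whitespace-separated and compared in lower case.
    tokens = sentence.lower().split()
    pairs = []
    for i, word in enumerate(tokens):
        if word in stop_words:
            continue
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            # Skip the word itself and any stop word inside the window.
            if j != i and tokens[j] not in stop_words:
                pairs.append((word, tokens[j]))
    return pairs


def parse_pred_arg_tuple(line):
    """Split a line of the form 'arg1 <TAB> predicate <TAB> arg2' into its parts."""
    arg1, predicate, arg2 = (part.strip() for part in line.split("\t"))
    return arg1, predicate, arg2
```

With window_size=1 and the stop words "the" and "on", the sentence "the cat sat on the mat" yields only the pairs (cat, sat) and (sat, cat); a larger window trades more candidate pairs for more noise and a larger model.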
####Element-feature extraction
The eu.excitementproject.eop.distsim.builders.elementfeature.ElementFeatureExtraction interface defines the method for extracting elements and features from co-occurrences (see user guide).
The properties of the configured ElementFeatureExtraction class are defined in a separate module, indicated by the element-feature-extraction-module property.
Models: lin, bap, dirt
File: element-feature-counting-memory.xml
Module: element-feature-extractor
Property: element-feature-extraction-module
Value: the name of the configuration module which defines the ElementFeatureExtraction class
The configuration module which defines the ElementFeatureExtraction should include at least one property, 'class', which defines the name of the ElementFeatureExtraction class. There are various implementations of this interface, some of which may require additional configuration properties (besides the 'class' property), as follows:
- eu.excitementproject.eop.distsim.builders.elementfeature.LemmaPosBasedElementFeatureExtraction
Given a co-occurrence of two items, each composed of lemma and pos, and their dependency relation, extracts two element-feature pairs where the element is one lemma-pos and the feature is the other lemma-pos, with or without the dependency relation (this option is currently configured in the demo for the lexical models: lin and bap).
Required configuration properties:
stop-word-file - the path to a text file of stop words to be filtered from the model features, one word per line (default: no stop-word list).
include-dependency-relation - whether the dependency relation should be included as part of the feature (as done in the lin-dependency model) or not (as done in the lin-proximity model).
min-count - the minimal number of occurrences for each element (i.e., a word that occurs fewer than this number of times does not form an element).
- eu.excitementproject.eop.distsim.builders.elementfeature.WordPairBasedElementFeatureExtraction
Given a co-occurrence of two items, each composed of a string word, extracts two element-feature pairs where the element is one of the words and the feature is the other word, with no dependency relation.
Required configuration properties:
stop-word-file - the path to a text file of stop words to be filtered from the model features, one word per line (default: no stop-word list).
min-count - the minimal number of occurrences for each element (i.e., a word that occurs fewer than this number of times does not form an element).
- eu.excitementproject.eop.distsim.builders.elementfeature.BidirectionalPredArgElementFeatureExtraction
Given a co-occurrence of predicate and argument and their dependency relation, extracts the predicate as an element, and the dependency relation with the argument as a feature (this option is currently configured in the demo for the syntactic model: dirt).
Required configuration properties:
stop-word-file - the path to a text file of stop words to be filtered from the model features, one word per line (default: no stop-word list).
slot - whether the argument is the first argument of the given binary predicate (X) or the second one (Y).
min-count - the minimal number of occurrences for each element (i.e., a word that occurs fewer than this number of times does not form an element).
The current configuration of the demo defines min-count 10 for lexical models (lin, bap) and 100 for the syntactic one (dirt).
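The interplay of include-dependency-relation and min-count can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the distsim code: a co-occurrence is modelled as an (item1, relation, item2) triple, the "-inv" suffix for the reversed direction is a hypothetical encoding, and an element's occurrence count is assumed to be the number of co-occurrences it takes part in.

```python
from collections import Counter


def element_feature_pairs(cooccurrence, include_dependency_relation):
    """Turn one co-occurrence into two (element, feature) pairs, one per direction."""
    item1, relation, item2 = cooccurrence
    if include_dependency_relation:
        # Dependency-style features keep the relation; "-inv" (hypothetical)
        # marks the view from the other end of the dependency edge.
        return [(item1, (relation, item2)), (item2, (relation + "-inv", item1))]
    # Proximity-style features keep only the neighbouring item.
    return [(item1, item2), (item2, item1)]


def count_with_min_count(cooccurrences, include_dependency_relation, min_count):
    """Count element-feature pairs, dropping elements seen fewer than min_count times."""
    pair_counts = Counter()
    element_counts = Counter()
    for cooc in cooccurrences:
        for element, feature in element_feature_pairs(cooc, include_dependency_relation):
            pair_counts[(element, feature)] += 1
            element_counts[element] += 1
    return {
        pair: n
        for pair, n in pair_counts.items()
        if element_counts[pair[0]] >= min_count
    }
```

Raising min-count shrinks the model by discarding rare elements, which is why a small corpus usually needs a lower value than the 10 and 100 used in the demo configuration.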
###Usage of the models
Each of the lexical models (e.g., lin and bap) is stored in two Redis databases, one for left-to-right similarities and the other for right-to-left similarities. The Redis databases (rdb files) are located in the model directory under redis/db (redis/db/lin/proximity, redis/db/dependency, redis/db/bap).
Syntactic models are represented by one Redis database, for left-to-right similarities. The Redis databases (rdb files) are located in the model directory under redis/db (redis/db/dirt).
Note:
The process makes use of a sort option which is not available in Windows. In addition, the current official distribution of Redis does not support Windows (Microsoft develops and maintains an experimental Win32/Win64 version of Redis), so it is assumed that the process will be run on a Unix/Linux platform. If you have a problem with this, please let me know.
####Running a resource access program
The lexical models can be accessed by the eu.excitementproject.eop.distsim.resource.SimilarityStorageBasedLexicalResource class, which implements the LexicalResource interface.
You can test your lexical models by applying the eu.excitementproject.eop.distsim.resource.TestLemmaPosSimilarity program, with the appropriate configuration file.
The configuration file of the TestLemmaPosSimilarity program defines the two redis database files (l2r-redis-db-file, r2l-redis-db-file), the number of retrieved rules for each query (top-n-rules), and the name of the resource (resource-name).
>java -cp distsim.jar eu.excitementproject.eop.distsim.resource.TestLemmaPosSimilarity configurations/lin/proximity/knowledge-resource.xml
>java -cp distsim.jar eu.excitementproject.eop.distsim.resource.TestLemmaPosSimilarity configurations/dependency/knowledge-resource.xml
>java -cp distsim.jar eu.excitementproject.eop.distsim.resource.TestLemmaPosSimilarity configurations/bap/knowledge-resource.xml
If your model is not based on lemma-pos elements but on words (e.g., by configuring LineBasedStringSentenceReader and WordPairBasedElementFeatureExtraction on a given raw-text corpus), you can test it by applying the eu.excitementproject.eop.distsim.application.TestWordSimilarity program, with the appropriate configuration file.
>java -cp distsim.jar eu.excitementproject.eop.distsim.application.TestWordSimilarity configurations/lin/proximity/knowledge-resource.xml
The dirt model can be accessed by the eu.excitementproject.eop.core.component.syntacticknowledge.SimilarityStorageBasedDIRTSyntacticResource class, which implements the SyntacticResource interface.
You can test your dirt model by applying the eu.excitementproject.eop.core.component.syntacticknowledge.TestDIRTSimilarity program, with the appropriate configuration file.
The configuration file of the TestDIRTSimilarity program defines the model's Redis l2r database file (l2r-redis-db-file), the number of retrieved rules for each query (top-n-rules), and the name of the resource (resource-name). Note that this program should be run with the EOP core classpath.
>java -cp ... eu.excitementproject.eop.core.component.syntacticknowledge.TestDIRTSimilarity configurations/dirt/knowledge-resource.xml
###Scaling to larger corpus
The current version works in memory. For English, the lexical models based on the huge UkWAC corpus and the DIRT model based on the two Reuters CDs were generated with 64 GB of RAM. For German and Italian, 32 GB of RAM seems to be sufficient.
When moving to a larger corpus, the memory for the Java programs (in the build-model scripts) should be increased accordingly.
In addition, the number of threads should be set according to the capabilities of the system hardware.
If you run into memory problems, there is an option to apply a memory-free map-reduce program. Contact us for more details.
##Wikipedia acquisition tool
The Wiki acquisition tool was developed on an Intel® Core™ i7-2670QM CPU at 2.2 GHz (quad core) with 8 GB of RAM. A complete run on the English Wikipedia took about a week, using the same machine as both the database server and the application server. A weaker machine may therefore need more time.
The tool is multilingual: the IDM module can be tuned for each specific language by implementing the eu.excitementproject.eop.lexicalminer.definition.idm.IIDM and the eu.excitementproject.eop.lexicalminer.definition.idm.SyntacticUtils interfaces, as currently done for English and Italian.
The software requirements are:
- An installed MySQL server (development was done on version 5.5.27).
- Java version 1.6.
- JWPL (Java-based Wikipedia Library) for creating and filling the Wikipedia database. Full instructions on how to get a full Wikipedia JWPL database from Wikipedia dumps can be found at http://code.google.com/p/jwpl/wiki/DataMachine.
- Python 2.7 (for the EasyFirst English parser).
- Create the system schema using the script "CreateDB.sql" supplied under the "DB_Scripts" folder.
- Fill in the Wikipedia Miner configuration file according to your requirements. The file is divided into modules, each responsible for a specific set of parameters. The modules are self-explanatory. The most important module, which you will probably want to edit, is "Extractors", which determines the extractors used to fill the database. Other modules are the JWPL database configuration, the target database configuration (the database that will be filled with rules), and "processing_tools", which determines which processing tools will be used. Note that the classpath should refer to the libraries of the configured tools; the current configuration, for instance, requires the lingpipe, stanford-postagger, stanford-ner, opennlp-tools, and gate jars. In addition, the stopwords file path and the JARS folder path are defined at the top of the configuration file as ENTITY declarations.
Important note: do not fill the database using both the lexicalIDM and syntacticIDM extractors. Using both can result in wrong classifier ranks.
- The system uses the log4j framework for logging. Make sure you have a log4j configuration file (log4j.properties) in the log4j directory. You can change the logger configuration as you wish.
- For the English EasyFirst parser, which runs in a client-server manner, run the server side on port 8081.
- Run the system using
java -Xmx<Allocated Memory Size> eu.excitementproject.eop.lexicalminer.wikipedia.MinerExecuter <Configuration file path>
- We set "Allocated Memory Size" to 4000M, but a bigger value can reduce the running time.
- The system has a recovery mechanism which can recover from a crash and skip the data that has already been processed. In case of a crash, all you have to do is run the system again, and the mechanism will be used automatically.
- During the execution you can view the log files, which are written to the location defined in the log4j.properties file.
- After a fully successful execution you have a database containing all the rules that were extracted by the system. This database lacks some indexes which are important for retrieving the rules. To add those indexes, run the script "CreateIndexes.sql" from the DB_Scripts folder.
- At this point you have a full database ready for use by the retrieval tool. If you chose to run the lexical or syntactic extractors and wish to use an offline classifier, please go to the "Build the offline classifiers" section; otherwise, you can skip to the "How to retrieve rules" section.
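Since a full run can take days and recovery only requires running the miner again, the launch step can be wrapped in a small restart loop. The sketch below is a hypothetical helper, not part of the tool: it assumes that a crash shows up as a non-zero exit status, and, like the command in the steps above, it passes no -cp option, so the classpath has to be provided some other way (for example through the CLASSPATH environment variable).

```python
import subprocess

MINER_CLASS = "eu.excitementproject.eop.lexicalminer.wikipedia.MinerExecuter"


def miner_command(config_path, max_heap="4000M"):
    """Build the java command line for the miner, with the given -Xmx value."""
    return ["java", f"-Xmx{max_heap}", MINER_CLASS, config_path]


def run_with_restarts(command, max_attempts=3, run=subprocess.call):
    """Re-run the command until it exits with status 0 or the attempts run out.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        # Assumption: a crashed run returns a non-zero exit status, and the
        # tool's own recovery mechanism skips already-processed data on re-run.
        if run(command) == 0:
            return attempt
    raise RuntimeError(f"command still failing after {max_attempts} attempts")
```

A call such as `run_with_restarts(miner_command("path/to/config.xml", max_heap="8000M"))` would start the miner with a larger heap; the configuration path here is a placeholder.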
###Build the offline classifiers
- The DB now contains all the rules and indexes, but statistical data must be collected before the classifiers can run. To gather these statistics, run the "CollectStatictics.sql" script from the DB_Scripts folder (this script can take a while).
- To run the offline classifiers, choose which of the supplied classifiers you want in the classifiers configuration file.
- Run the classifier rank calculation process using
java -Xmx<Allocated Memory Size> eu.excitementproject.eop.lexicalminer.definition.classifier.BuildClassifiers <Configuration file path>
An example of such a configuration file is given at: /src/eu/excitementproject/eop/lexicalminer/definition/classifier/BuildClassifiersConfig.xml
- The database is now ready to retrieve rules using the offline classifiers.
###How to retrieve rules
- Define the desired classifier and the other configurable parameters in the retrieval configuration file.
An example of such a configuration file is given at: /src/eu/excitementproject/eop/lexicalminer/LexiclRulesRetrieval/wikipediaLexicalResourceConfig.xml
- Use the LexicalResource interface with an instance of WikipediaLexicalResource, which gets the retrieval configuration file path as a parameter.
An example of this usage is given in the main method of the eu.excitementproject.eop.lexicalminer.LexiclRulesRetrieval.WikipediaLexicalResource class.