oferbr edited this page Jan 21, 2014 · 2 revisions

BIUTEE is a transformation-based EDA, now provided as part of the Excitement Open Platform (EOP). This EDA requires a specific LAP, which is also provided as part of the EOP. This page describes how to run both the LAP and the EDA. An older (and now obsolete) stand-alone version of these components can be found here: http://cs.biu.ac.il/~nlp/downloads/biutee.

BIUTEE is provided with a folder called the BIUTEE Environment, containing its configuration, proprietary working folder, file-based resources, and more. It is also provided with very large DB-based resources, downloaded and handled separately.

This guide explains how to download and run BIUTEE. The steps described here expand on the general EOP detailed manual.

System Requirements

BIUTEE requires the items specified as EOP's Supported Platforms and Prerequisites, and additionally:

  • Memory: At least 5 GB RAM
  • Disk Space: At least 20 GB free

How to Download and Build BIUTEE

  1. Follow the corresponding steps in EOP's detailed manual (How to Download and Build the EOP). In this guide, the local folder to which the EOP source code was extracted is referred to as $EOP.
  2. Download the zip file of EOP-resources as well, as explained in that "detailed manual". After extracting the zip file, you will find a folder named "BIUTEE_Environment" within it. In this guide, this folder will be referred to as $BIUTEE.
  3. Linux only: Make sure the file $BIUTEE/third-party/nagel_sentence_splitter/linux_64/tokenizer is in your system's path, for example by copying it to /usr/bin.
  4. In order to use BIUTEE in the Eclipse IDE, import the code as Maven projects.
Downloading and installing DB-based resources is specified here.
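
As an alternative to copying the tokenizer into /usr/bin (step 3), its folder can be appended to PATH for the current shell. The following is a minimal sketch, assuming BIUTEE is available as a shell variable. The fallback path is a hypothetical placeholder, not a location the EOP prescribes.

```shell
# Assumption: BIUTEE points at the extracted BIUTEE_Environment folder.
# The fallback path below is only a placeholder; adjust it to your setup.
BIUTEE="${BIUTEE:-$HOME/BIUTEE_Environment}"
TOKENIZER_DIR="$BIUTEE/third-party/nagel_sentence_splitter/linux_64"

# Append the tokenizer folder to PATH for this shell session only.
export PATH="$PATH:$TOKENIZER_DIR"

# Report whether the shell can now resolve the 'tokenizer' executable.
if command -v tokenizer >/dev/null 2>&1; then
  echo "tokenizer found at: $(command -v tokenizer)"
else
  echo "tokenizer not found; check that $TOKENIZER_DIR exists" >&2
fi
```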

How to Run BIUTEE

Defining Environment Variables

Create an environment variable called DATA and set it to the path of the data directory in the BIUTEE Environment. On a UNIX system using the bash shell, you can do this with the command below (your-path is the path of the data directory in the BIUTEE Environment):

export DATA="your-path"

From now on, you should see this path when you enter the following command in that terminal:

echo $DATA
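
A minimal sketch of the same setup that also checks that the folder exists. It assumes the data directory sits directly under the BIUTEE Environment as $BIUTEE/data, and the fallback BIUTEE path is a hypothetical placeholder.

```shell
# Assumption: BIUTEE points at the extracted BIUTEE_Environment folder
# (placeholder default), and the data directory is $BIUTEE/data.
BIUTEE="${BIUTEE:-$HOME/BIUTEE_Environment}"
export DATA="$BIUTEE/data"

# Warn early if the path is wrong, instead of failing later inside BIUTEE.
if [ -d "$DATA" ]; then
  echo "DATA=$DATA"
else
  echo "warning: $DATA is not a directory" >&2
fi
```

To keep the variable across terminal sessions, the export line can be added to ~/.bashrc.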

You may use these external tutorials for defining environment variables in Windows and in Linux.

Running Scenarios

BIUTEE can be run via two interfaces:

  1. EOP Interface, accessing LAP and EDA. Currently this entire interface is provided via the class eu.excitementproject.eop.biutee.rteflow.systems.excitement.BiuteeMain, and uses the configuration file $BIUTEE/workdir/biutee.xml.
  2. Legacy Interface, accessing proprietary classes for preprocessing, training and testing. These use the configuration file $BIUTEE/workdir/biutee_legacy.xml (which has the same content as $BIUTEE/workdir/biutee.xml, only with a slightly different structure, more details here).

BIUTEE can be run on these kinds of input:

  1. RTE Pairs - used in the RTE 1-5 main task. It is formatted as an XML file, consisting of a sequence of text-hypothesis pairs. This is the most common kind of input; if you are not sure what to use, use this kind.
  2. RTE Sum - used in RTE 6-7. It is formatted as a folder, with topics, where each topic has documents and hypotheses. To train and test on this input, it must first be indexed (as described later in the steps table).

Running Steps

The following table describes how to run BIUTEE via command line, in different scenarios. The steps are presented in the order in which they should be run.

Note that you must follow only one specific scenario. For example, if you wish to run via the EOP interface and use RTE Pairs input, follow only the EOP + Pairs rows (and the ALL rows, which apply to all scenarios). In that case, you should run steps 1, 3, 4, 5, 10 and 11.

For further details regarding running EOP in general via command line, see here.

# Scenario Step Command Notes
1 ALL Configure general system parameters Edit configuration file biutee.xml / biutee_legacy.xml
  • Parameters like number of threads and knowledge resources. More details here.
2 Legacy + Sum Perform indexing Refer to standalone version
  • To perform this step, please download BIUTEE's previous version and follow the required steps in its user guide (Section 1.7 and Section 3).
3 ALL Run EasyFirst parser server Windows: runeasyfirst.bat
Linux: runeasyfirst.sh
  • Must be run on a separate command line window from BIUTEE.
  • The server must be running whenever BIUTEE's LAP/preprocessing is running, but may be left running at other times. The same EasyFirst run may be used for multiple runs of BIUTEE.
  • You may want to shut it down when it is not required, to conserve system resources. Do it by pressing Ctrl-C.
4 ALL Configure training parameters Edit configuration file
  • Mostly set the dataset to be the devset. More details here.
5 EOP + Pairs Preprocess training data + Train
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.excitement.BiuteeMain -Dexec.args="biutee.xml lap_train,train"
  • In order to just preprocess, instead of lap_train,train provide only lap_train.
  • Similarly, in order to just train, provide only train.
  • [1] The preprocessing output is a Java-serialized file, with a name and path determined by the configuration parameter rte_pairs_preprocess/serialization_filename.
  • [2] The training output is several java-serialized files named labeled_samplesX.ser and serialized_resultsX.ser, and some XML files named model_search_X.xml and model_predictions_X.xml.
6 Legacy + Pairs Preprocess training data
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtepairs.RTEPairsPreProcessor -Dexec.args="biutee_legacy.xml train"
[1]
7 Legacy + Pairs Train
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtepairs.RTEPairsETETrainer -Dexec.args="biutee_legacy.xml"
[2]
8 Legacy + Sum Preprocess training data
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtesum.preprocess.RTESumPreProcessor -Dexec.args="biutee_legacy.xml"
[3] The preprocessing output is a Java-serialized file, with a name and path determined by the configuration parameter rte_sum_preprocess/serialization_filename.
9 Legacy + Sum Train
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtesum.RTESumETETrainer -Dexec.args="biutee_legacy.xml"
[2]
10 ALL Configure testing parameters Edit configuration file
  • Mostly set the dataset to be the testset. More details here.
11 EOP + Pairs Preprocess testing data + Test
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.excitement.BiuteeMain -Dexec.args="biutee.xml lap_test,test"
  • In order to just preprocess, instead of lap_test,test provide only lap_test.
  • Similarly, in order to just test, provide only test.
  • The preprocess output is a series of XMI files in the folder $BIUTEE/workdir/lap_output. Each XMI file is a dump of the UIMA CAS of one text-hypothesis pair.
  • [4] The test output is written in the log file logfile.log.
12 Legacy + Pairs Preprocess testing data
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtepairs.RTEPairsPreProcessor -Dexec.args="biutee_legacy.xml test"
[1]
13 Legacy + Pairs Test
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtepairs.RTEPairsETETester -Dexec.args="biutee_legacy.xml"
[4]
14 Legacy + Sum Preprocess testing data
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtesum.preprocess.RTESumPreProcessor -Dexec.args="biutee_legacy.xml"
[3]
15 Legacy + Sum Test
mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtesum.RteSumETETester -Dexec.args="biutee_legacy.xml"
[4]

NOTES:

  1. All commands must be run from $BIUTEE/workdir. This could be achieved using the cd command, like: cd C:\Biutee\workdir.
  2. For the mvn commands to work, you need the Maven executable to be in your system path. If it is not, add it, or provide full path to it in the commands.
  3. In order to run via the Eclipse IDE, perform the specified steps by running each class denoted by -Dexec.mainClass=, with the program arguments denoted by -Dexec.args= (without the enclosing quotation marks), and with the working directory set to $BIUTEE/workdir.
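
The command pattern shared by the steps in the table can be sketched as a small helper. This is an illustration only: biutee_cmd is a hypothetical function, not part of the EOP, and it prints the Maven command rather than running it, so the output can be inspected first. The fallback value for EOP is a placeholder.

```shell
# Placeholder default; set EOP to the folder where the EOP source was extracted.
EOP="${EOP:-$HOME/eop}"

# Hypothetical helper: print (do not run) the Maven command for one BIUTEE step.
#   $1 = fully qualified main class, $2 = program arguments
biutee_cmd() {
  printf 'mvn -f %s/biutee/pom.xml exec:java -Dexec.mainClass=%s -Dexec.args="%s"\n' \
    "$EOP" "$1" "$2"
}

# EOP + Pairs scenario: step 5 (preprocess + train) and step 11 (preprocess + test).
MAIN=eu.excitementproject.eop.biutee.rteflow.systems.excitement.BiuteeMain
biutee_cmd "$MAIN" "biutee.xml lap_train,train"
biutee_cmd "$MAIN" "biutee.xml lap_test,test"
```

Each printed line is meant to be run from $BIUTEE/workdir, with the EasyFirst server (step 3) already running and the configuration file edited between training and testing (steps 4 and 10).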

JVM Parameters

To improve JVM efficiency, it is recommended to run it with these JVM parameters:

  • -server, for using Java server VM.
  • -Xmx2g, for allocating 2GB of memory. Other values can be used, according to the available memory and the number of threads used. When preprocessing, at least 1.5GB must be allocated. When training and testing, at least 4GB must be allocated, plus an additional 1GB for each additional thread. For example, when using 3 threads, allocate at least 6GB.
  • -XX:+UseParallelGC, -XX:+UseParallelOldGC and -XX:ParallelGCThreads=α, for using parallel garbage collection with α threads. α can be set to the number of threads specified in the configuration file.
In order to specify JVM parameters, put them together, separated by spaces, in the value of the environment variable MAVEN_OPTS. More details in one of the notes here.
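
For instance, the memory rule above (4GB for one thread plus 1GB per additional thread) can be turned into a MAVEN_OPTS value as follows. This is a sketch for the training and testing steps. THREADS and HEAP_GB are illustrative variable names, and THREADS is assumed to match the threads parameter in the configuration file. Newer JVMs may warn about or reject older flags such as -XX:+UseParallelOldGC, in which case that flag can be dropped.

```shell
# Number of threads; assumed to match the 'threads' configuration parameter.
THREADS=3

# 4GB for the first thread, plus 1GB for each additional thread.
HEAP_GB=$((4 + (THREADS - 1)))

# JVM flags are passed to Maven as one space-separated string.
export MAVEN_OPTS="-server -Xmx${HEAP_GB}g -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=${THREADS}"
echo "$MAVEN_OPTS"
```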

Configuration File

IMPORTANT NOTE: Copies of the BIUTEE configuration file are available next to the sample configuration files of the other EDAs of the EOP, both in the config directory and in the EOP Resources archive. However, neither of these copies is effective at the moment, so please do not edit them. Instead, apply changes to the original copy, which resides in the workdir directory of the BIUTEE_Environment directory of the EOP Resources archive.

A key element in the BIUTEE environment is the configuration file, found at $BIUTEE/workdir/biutee.xml. Note that a second configuration file is provided as well: $BIUTEE/workdir/biutee_legacy.xml. This has exactly the same content as the main configuration file, but has a slightly different structure - a section here is a module, and a property is a param.

Most values in the configuration file can stay exactly as provided. The table below details some of the values you may wish (or need) to change.

Section Property Value
rte_pairs_preprocess training_data Path to a pairs dataset XML, for training data.
rte_pairs_preprocess training_data_annotated true/false - indicates whether the training dataset is annotated (has gold-standard annotations).
  • Must be true for training.
rte_pairs_preprocess training_serialization_filename Path to a file where preprocessing output will be written to, for training data.
rte_pairs_preprocess test_data Path to a pairs dataset XML, for test data.
rte_pairs_preprocess test_data_annotated true/false - indicates whether the test dataset is annotated (has gold-standard annotations).
  • If the dataset is annotated, the system will output the test accuracy at the end of the test.
rte_pairs_preprocess test_serialization_filename Path to a file where preprocessing output will be written to, for test data.
rte_sum_preprocess dataset Path to a training sum dataset folder.
  • Note that this parameter is used for both training and test.
rte_sum_preprocess serialization_filename Path to the file where preprocessing output (of the training data) will be written to.
  • Note that this parameter is used for both training and test.
rte_pairs_train_and_test serialized_training_data Path to the file where preprocessing output (of the training data) was written to.
rte_pairs_train_and_test serialized_test_data Path to the file where preprocessing output (of the test data) was written to.
rte_sum_train_and_test training_data A specification of the sum training data, as 3 values joined with #:
  • Dataset name: RTE6 or RTE7
  • Type: DEV or TEST
  • Relative path to the dataset folder
For example: RTE6#DEV#RTE6_DEVSET
rte_sum_train_and_test serialized_training_data Path to the file where preprocessing output (of the training data) was written to.
rte_sum_train_and_test test_data A specification of the sum test data, as 3 values joined with #:
  • Dataset name: RTE6 or RTE7
  • Type: DEV or TEST
  • Relative path to the dataset folder
For example: RTE6#TEST#RTE6_TESTSET
rte_sum_train_and_test serialized_test_data Path to the file where preprocessing output (of the test data) was written to.
rte_pairs_train_and_test, rte_sum_train_and_test threads Number of threads to be used during training and testing.
  • Preprocessing is always single-threaded.
  • The JVM parameter -Xmx must be set according to the number of threads, to allow a heap that is large enough. If it is not set as required, your system may run very slowly, and might crash. Usually, 4GB suffices for a single thread, plus 1GB for each additional thread.
transformations knowledge_resources A comma-separated list of knowledge resources, out of these values: WORDNET, WIKIPEDIA, GEO, CATVAR, BAP, LIN_DEPENDENCY_ORIGINAL, LIN_PROXIMITY_ORIGINAL, LIN_DEPENDENCY_REUTERS, VERB_OCEAN, ORIG_DIRT, REVERB, BINARY_LIN, FRAMENET, SYNTACTIC

These are all values from the enum eu.excitementproject.eop.transformations.builtin_knowledge.KnowledgeResource.
transformations multiword_resources A comma-separated list of lexical knowledge resources, out of these values: WORDNET, WIKIPEDIA, CATVAR, BAP, LIN_DEPENDENCY_ORIGINAL, LIN_PROXIMITY_ORIGINAL, LIN_DEPENDENCY_REUTERS, VERB_OCEAN.
  • Values are from the same enum, with true in their last parameter, except for GEO (which must not be used here).
  • For these resources, the system shall handle multi-word expressions.
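
The #-joined format used by training_data and test_data under rte_sum_train_and_test can be illustrated by splitting the example value into its three parts. This sketch only demonstrates the format; the variable names are illustrative.

```shell
# Example value from the table above: dataset name, type, relative folder path.
value="RTE6#DEV#RTE6_DEVSET"

# Split on '#' into three fields (IFS is changed only for this read).
IFS='#' read -r ds_name ds_type ds_dir <<EOF
$value
EOF

echo "dataset=$ds_name type=$ds_type folder=$ds_dir"
```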

Visual Tracing Tool

BIUTEE's visual tracing tool assists in tracing various aspects and internal steps of the system.

If you wish to use the tool, perform these steps:

  1. Download and install the program GraphViz from http://www.graphviz.org.
  2. The executable dot is installed with the system. Make sure it is in your system's path.
  3. Configure parameters in biutee.xml, under rte_pairs_preprocess and rte_pairs_train_and_test.
  4. Run:
    mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.gui.VisualTracingTool -Dexec.args="biutee.xml"
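
Before launching the tool, it may help to confirm that the executables it relies on can be resolved from the system path. A minimal sketch follows; missing_tools is a hypothetical helper name, not part of BIUTEE.

```shell
# Hypothetical helper: print the name of every given tool that is NOT on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# dot is installed with GraphViz; mvn is needed to launch the tracing tool.
missing=$(missing_tools dot mvn)
if [ -n "$missing" ]; then
  echo "not on PATH:" $missing >&2
else
  echo "dot and mvn are available"
fi
```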

DB-Based Knowledge Resources

@TODO: The links for downloading the resources refer now to the old BIUTEE webpage. Refer to new Maven repository.

Some knowledge resources are stored as MySQL tables, provided as compressed .sql files. In order to use them:

  1. Download the resources from the links in the table below. Each file represents one MySQL schema, and may contain several knowledge resources. You don't need to download them all; you may download only the schema files containing the resources you wish to use.
  2. Install the free SQL server MySQL.
  3. Install its administration tool MySQL Workbench.
  4. Run the server.
  5. Connect to the server via MySQL Workbench, and in it:
    1. Create a user named db_readonly, with password BIUTEE: Users and Privileges --> Add Account
    2. Import the schema files to the database: Data Import/Restore --> Import from Dump Project Folder --> (input folder path containing uncompressed .sql files) --> Load Folder Contents --> (select all required schemas) --> Start Import
    3. Make sure user db_readonly has read (SELECT) privileges to all of the tables in the imported schemas.
  6. Define an environment variable named MYSQL with a value referring to the MySQL server address (name or IP address) and port. For example: dbsql.cs.biu.ac.il:3306.
Schema Name Knowledge Resources in Configuration Schema Download File Size (Compressed)
BAP (Directional Similarity) BAP Download 111 MB
Lin Similarity LIN_DEPENDENCY_ORIGINAL, LIN_PROXIMITY_ORIGINAL Download 236 MB
Original DIRT ORIG_DIRT Download 55 MB
Wikipedia Knowledge Resource WIKIPEDIA Download 214 MB
Binary Lin, Dependency Reuters BINARY_LIN, LIN_DEPENDENCY_REUTERS Download 2.4 GB
Framenet FRAMENET Download 228 KB
Geo (Geographical Knowledge Resource) GEO Download 1.4 MB
ReVerb (Distributional Similarity with Global Constraints) REVERB Download 161 MB
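
Step 6 of the list above can be sketched in bash as follows. The address localhost:3306 is an assumption for a server running on the same machine with MySQL's default port; replace it with your own server address. MYSQL_HOST and MYSQL_PORT are illustrative names, used here only to sanity-check the value.

```shell
# Assumption: MySQL runs locally on its default port; adjust as needed.
export MYSQL="localhost:3306"

# Split the value at the last ':' to check that it has the host:port shape.
MYSQL_HOST="${MYSQL%:*}"
MYSQL_PORT="${MYSQL##*:}"

case "$MYSQL_PORT" in
  ''|*[!0-9]*) echo "warning: port '$MYSQL_PORT' is not numeric" >&2 ;;
  *) echo "MySQL server: host=$MYSQL_HOST port=$MYSQL_PORT" ;;
esac
```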

Log File

The system uses the log4j platform for logging. A log4j properties file is automatically created under $BIUTEE/workdir/log4j.properties with recommended values. If a file under that name already exists, the system uses it instead of creating a new one. There is no need to change any of the definitions in the file, but you may do so if you wish to change the logging behavior. The log4j Manual may be of help.

Under the recommended values, a new log file is created for every run of the system in $BIUTEE/workdir/logfile.log. If this file already exists from a previous run, it is renamed to logfile.log_<date>_<time>.log.
