
Step by Step Tutorial

rzanoli edited this page Nov 28, 2013 · 140 revisions

web page under construction

This guide explains how to set up and use EOP. It offers step-by-step instructions to download, install, configure, and run the EOP code and its related support resources and tools. The guide is intended for users who want to use the software platform as it is; developers who want to contribute to the code should instead follow the instructions reported on the developers distribution page, accessible from the menu bar on the right.

The EOP library contains several components for preprocessing data and for annotating textual entailment relations. Components for pre-processing annotate data with useful information (e.g. lemma, part-of-speech), whereas textual entailment components allow for training new models and then using them to annotate entailment relations in new data.

Each of these facilities is accessible via the EOP application program interface (API). In addition, a command line interface (CLI) is provided for the convenience of experiments and training. In the rest of this guide we report examples for both of these possibilities; the Java code examples have been taken from the material used in the Fall School class for Textual Entailment in Heidelberg (http://fallschool2013.cl.uni-heidelberg.de/). We assume some familiarity with Unix-like command-line shells such as bash for Linux; the reference operating system is Ubuntu 12.04 LTS (http://www.ubuntu-it.org/download/).

Questions about EOP should be directed to the EOP mailing list.

Contents:

  1. [Basic Installation](#basic-installation)
  2. [Hello World! Example](#hello-world-example)
  3. [Advanced Installation](#advanced-installation)
  4. [Preprocessing data sets](#preprocessing-data-sets)
  5. [Annotating by using pre-trained models](#annotating-by-using-pre-trained-models)
  6. [Training new models](#training-new-models)
  7. [Evaluating the results](#evaluating-the-results)
  8. [Sharing the results](#sharing-the-results)

1. Basic Installation

The basic installation is for users who want to start using EOP right away on a PC with at least 4GB of RAM. The basic installation is also a prerequisite for the advanced installation.

Installation, main steps:

1a. [Installing tools and environments](#1a-installing-tools-and-environments)
1b. [Obtaining the EOP code and installing it](#1b-obtaining-the-eop-code-and-installing-it)
1c. [Installing TreeTagger](#1c-installing-treetagger)

1a. Installing tools and environments

EOP is written for Java 1.7+. If your JVM is older than 1.7, you should upgrade. The same holds for Apache Maven: EOP requires version 3.0.x, and older versions might not work. Whether you need to install tools like Maven itself or environments like Eclipse depends on how you intend to run EOP: via the application program interface (API) or via the command line interface (CLI). In the rest of this section we list the tools required to run EOP; for some of them we add a note (i.e. API or CLI) so that users can tell whether that tool is necessary for the chosen EOP mode. When no note is present, the tool is required for both modes.

This is the list of the tools and environments needed by EOP:

  • Java 1.7
  • Ant (1.8.x or later)
  • Eclipse + m2e Maven plugin (Juno or later) - API
  • Maven tool (3.x or later) - CLI

Installing Java 1.7: Java is a programming language and computing platform first released by Sun Microsystems in 1995. There are lots of applications, like EOP, that will not work unless you have Java installed. Regardless of your operating system, you will need to install a Java virtual machine (JVM). Given that we want to use Eclipse for Java development, we need to install the Java Development Kit (JDK); the JDK includes, among other useful things, the source code for the standard Java libraries. There are several sources for the JDK; the most common are OpenJDK and Oracle's JDK.

There are two ways of installing Java in Ubuntu:
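As a minimal sketch (package and PPA names assume Ubuntu 12.04 and may change over time): either install OpenJDK from the standard repositories, or Oracle Java from a third-party PPA.

```shell
# Way 1: OpenJDK 7 from the standard Ubuntu repositories
sudo apt-get update
sudo apt-get install openjdk-7-jdk

# Way 2: Oracle Java 7 via the third-party webupd8team PPA
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

# In both cases, verify the installed version
java -version
```

On a fresh Ubuntu 12.04 system you may first need `sudo apt-get install python-software-properties` to get the `add-apt-repository` command.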

Installing Ant: Apache Ant is a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other. The main known usage of Ant is the build of Java applications. We use Ant to install the TreeTagger (i.e. a tool for annotating text with part-of-speech and lemma information) that can be used to pre-process data sets.

There are two ways of installing Ant in Ubuntu:
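For example (a sketch; the Ant version and mirror URL below are examples and should be checked against http://ant.apache.org/):

```shell
# Way 1: from the standard Ubuntu repositories
sudo apt-get install ant

# Way 2: unpack a binary release and add it to the PATH
wget http://archive.apache.org/dist/ant/binaries/apache-ant-1.9.2-bin.tar.gz
tar -xvzf apache-ant-1.9.2-bin.tar.gz -C ~/
export PATH=~/apache-ant-1.9.2/bin:$PATH

# Verify: should print the Ant version (1.8.x or later is required)
ant -version
```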

Installing Eclipse: Installing Eclipse is relatively easy, but does involve a few steps and software from at least two different sources. Eclipse is a Java-based application and, as such, requires a Java runtime environment (JRE) in order to run.

There are two ways of installing the Eclipse IDE in Ubuntu:
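For example (a sketch; the archive file name below is an example and depends on the release you download):

```shell
# Way 1: from the standard Ubuntu repositories (the packaged version may be old)
sudo apt-get install eclipse

# Way 2: download "Eclipse IDE for Java Developers" from
# http://www.eclipse.org/downloads/ and unpack it, e.g.
tar -xvzf eclipse-java-kepler-R-linux-gtk.tar.gz -C ~/
~/eclipse/eclipse &
```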

In addition to Eclipse we need to install the m2e Maven plugin. Apache Maven is a software project management and comprehension tool based on the concept of a project object model (POM); we use Maven to manage the EOP project's build and tests. The goal of the m2e project is to provide first-class Apache Maven support in the Eclipse IDE, making it easier to edit Maven's pom.xml, run a build from the IDE, and much more. Any “Eclipse for Java” later than the “Juno” version already has this plug-in pre-installed; if you have installed Eclipse Juno, you need to install m2e separately: http://www.eclipse.org/m2e/

Installing the Maven tool: Apache Maven is a software project management and comprehension tool based on the concept of a project object model (POM); we use Maven to manage the EOP project's build and tests. The Maven tool (3.x or later) is required to use EOP via the CLI, and you can download and install it from its web site: http://maven.apache.org/
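For example (a sketch; the version shown is an example, any 3.0.x release should work):

```shell
# Either install the packaged version:
sudo apt-get install maven

# or unpack a binary release and add it to the PATH:
wget http://archive.apache.org/dist/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
tar -xvzf apache-maven-3.0.5-bin.tar.gz -C ~/
export PATH=~/apache-maven-3.0.5/bin:$PATH

# Verify: should report Apache Maven 3.x
mvn -v
```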

1b. Obtaining the EOP code and installing it

EOP functionalities are accessible both via its application program interface (API) and via a command line interface (CLI). Using EOP via the API involves declaring the EOP Maven dependencies in the user's code, whereas the command line can be used by downloading the EOP archive zip file and installing it:

  • [EOP archive zip file distribution](#eop-archive-zip-file-distribution)
  • [EOP maven artifacts distribution](#eop-maven-artifacts-distribution)

EOP archive zip file distribution

This is the distribution for users who want to use EOP via the command line interface (CLI).

  • Download the Excitement-Open-Platform-1.0.2.tar.gz archive file of the code.

  • Copy the archive file from the directory where it has been saved into the directory where you want to have it, e.g. your home directory:

> cp Excitement-Open-Platform-1.0.2.tar.gz ~/
  • Go into your home directory and extract/unpack it, i.e.
> cd ~/
> tar -xvzf Excitement-Open-Platform-1.0.2.tar.gz

It will create the directory Excitement-Open-Platform-1.0.2 containing the source code.

Installing the EOP resources: Resources like WordNet and Wikipedia, as well as the configuration files of the platform and the pre-trained models, are distributed in a separate archive file that has to be downloaded and unpacked before use:

  • Click on this link eop-resources-1.0.2.tar.gz to download the archive file of the resources.

  • Copy the archive file into the EOP-1.0.2 directory created in the previous point, e.g.

> cp eop-resources-1.0.2.tar.gz 
~/Excitement-Open-Platform-1.0.2/target/EOP-1.0.2/eop-resources-1.0.2.tar.gz
  • From the EOP-1.0.2 directory where the archive file has been saved, extract/unpack it, i.e.
> cd  ~/Excitement-Open-Platform-1.0.2/target/EOP-1.0.2/
> tar -xvzf eop-resources-1.0.2.tar.gz

It will create the directory eop-resources-1.0.2 containing all the needed files.

EOP maven artifacts distribution

EOP is also distributed via the EOP Maven artifactory repository. The Maven artifacts include the binary code, the source code and the javadoc jars. All you need to do is specify a dependency on EOP in the pom.xml file of your project; all transitive dependencies are resolved automatically.

<dependencies>
  <dependency>
    <groupId>eu.excitementproject</groupId>
    <artifactId>core</artifactId>
    <version>1.0.2</version>
  </dependency>
</dependencies>

Repositories exist as a place to collect and store artifacts. Whenever a project has a dependency upon an artifact, Maven will first attempt to use a local copy of the specified artifact. If that artifact does not exist in the local repository, it will then attempt to download from a remote repository. The repository elements within a POM specify those alternate repositories to search. To use EOP the following repository has to be included in the pom.xml file of your project:

<repositories>
  <repository>
    <id>FBK</id>
    <url>http://hlt-services4.fbk.eu:8080/artifactory/repo</url>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>
</repositories>

Installing the EOP resources: Resources like WordNet and Wikipedia, as well as the configuration files of the platform and the pre-trained models, are distributed in a separate archive file that has to be downloaded and unpacked before use:

  • Click on this link eop-resources-1.0.2.tar.gz to download the archive file of the resources.

  • Copy it into your home directory, e.g.

> cp eop-resources-1.0.2.tar.gz ~/
  • extract/unpack it, i.e.
> cd  ~/
> tar -xvzf eop-resources-1.0.2.tar.gz

It will create the directory eop-resources-1.0.2 containing all the needed files.

1c. Installing TreeTagger

TreeTagger is a tool for annotating text with part-of-speech and lemma information. It is essential for the EOP German linguistic processing pipelines, and is also needed for some of the English pre-processing. The Excitement Open Platform cannot ship TreeTagger because TreeTagger has its own license, which is not compatible with the EOP one.

So you have to install TreeTagger yourself, after reading the license agreement and agreeing to it: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/Tagger-Licence The actual installation is almost fully automated with a script. (The script will force you to read the license agreement, and won't proceed unless you agree with it.) Installing TreeTagger requires these 3 steps:

  1. Adding the TreeTagger maven dependency to the pom.xml file of the user's project
  2. Downloading the build.xml file needed
  3. Using the ant tool to download and install TreeTagger

Adding the TreeTagger maven dependency:

<!-- TreeTagger related dependencies -->
        <!-- You need to install TreeTagger -->
        <dependency>
                <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
                <artifactId>de.tudarmstadt.ukp.dkpro.core.treetagger-bin</artifactId>
                <version>20130228.0</version>
        </dependency>
        <dependency>
                <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
                <artifactId>de.tudarmstadt.ukp.dkpro.core.treetagger-model-de</artifactId>
                <version>20121207.0</version>
        </dependency>
        <dependency>
                <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
                <artifactId>de.tudarmstadt.ukp.dkpro.core.treetagger-model-en</artifactId>
                <version>20111109.0</version>
        </dependency>
        <dependency>
                <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
                <artifactId>de.tudarmstadt.ukp.dkpro.core.treetagger-model-it</artifactId>
                <version>20101115.0</version>
        </dependency>
        <!-- end of TreeTagger related dependencies -->

Downloading the build.xml file: build.xml is a script provided by DKPro. (Thanks! DKPro.) ...................................................

Using the ant tool to download and install TreeTagger:

  • Move into the following directory, i.e.
> cd ..............................................
  • Run the installation script by calling ANT build tool, i.e.
> ant local-maven 

This command will download and wrap the TreeTagger binary and models as Maven modules, and install them in your local Maven repository (on your computer only). It will take some time. If you encounter an error like "MD5SUM mismatch", please see the last section of this document.

The TreeTagger installation will take some time (about 1 minute). If it completes successfully, it will output “BUILD SUCCESSFUL”.

2. Hello World! Example

We dedicate an entire section to an example showing how to annotate a T/H pair with EOP. Running it is also a good way to check that the Basic Installation works correctly. The proposed example involves running both a linguistic analysis pipeline (LAP) for pre-processing the T/H pair and an entailment decision algorithm (EDA) to see if an entailment relation exists between T and H. In more detail, this task can be split into the following main steps:

  1. Pre-processing a given T/H pair by calling a LAP
  2. Initializing an EDA with a configuration & pre-trained model
  3. Using the selected EDA to see if an entailment relation exists between T and H

We present this example using both the EOP application program interface (API) and the EOP command line interface (CLI).

Application Program Interface (API)

This section shows, with minimal code, how to annotate an entailment relation via the API. We will use the Eclipse IDE to write and run the code.

  1. Open Eclipse IDE
  2. In Eclipse, navigate to File > New > Other… in order to bring up the project creation wizard.
    @TODO add image
  3. Scroll to the Maven folder, open it, and choose Maven Project. Then choose Next.
    @TODO add image
  4. You may choose to Create a simple project or forgo this option. For the purposes of this tutorial, we will choose the simple project. This will create a basic, Maven-enabled Java project. If you require a more advanced setup, leave this setting unchecked, and you will be able to use more advanced Maven project setup features. Leave other options as is, and click Next.
    @TODO add image
  5. Now, you will need to enter information regarding the Maven Project you are creating. You may visit the Maven documentation for a more in-depth look at the Maven Coordinates (http://maven.apache.org/pom.html#Maven_Coordinates). In general, the Group Id should correspond to your organization name, and the Artifact Id should correspond to the project’s name. The version is up to your discretion as is the packing and other fields. If this is a stand-alone project that does not have parent dependencies, you may leave the Parent Project section as is. Fill out the appropriate information, and click Finish.
    @TODO add image
  6. You will now notice that your project has been created. You will place your Java code in /src/main/java, resources in /src/main/resources, and your testing code and resources in /src/test/java and /src/test/resources respectively.
  7. Open the pom.xml file to view the structure Maven has set up. In this file, you can see the information entered in Step 5. You may also use the tabs at the bottom of the window to change to view Dependencies, the Dependency Hierarchy, the Effective POM, and the raw xml code for the pom file in the pom.xml tab.
    Now you have to specify these dependencies in the pom.xml file:
  • the EOP dependencies and the repository where they are stored, as reported in Section: EOP maven artifacts distribution;
  • the TreeTagger dependencies as described in Section: Installing TreeTagger.

@TODO add image

  8. Still in Eclipse, navigate to src/main/java > New > Class in order to start writing your code. You can name it Ex0.
    @TODO add image
  9. Add the following code into the created Java class and navigate to Ex0.java > Run As > Java Application to run the code.

Java code:

package org.excitement;

import java.io.File;

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.uima.jcas.JCas;

import eu.excitementproject.eop.common.EDABasic;
import eu.excitementproject.eop.common.TEDecision;
import eu.excitementproject.eop.common.configuration.CommonConfig;
import eu.excitementproject.eop.core.ClassificationTEDecision;
import eu.excitementproject.eop.core.ImplCommonConfig;
import eu.excitementproject.eop.core.MaxEntClassificationEDA;
import eu.excitementproject.eop.lap.LAPAccess;
import eu.excitementproject.eop.lap.LAPException;
import eu.excitementproject.eop.lap.dkpro.OpenNLPTaggerEN;

/**
* A simple, minimal code that runs one LAP & EDA.
*
* @author Gil
*
*/
public class Ex0
{
    public static void main( String[] args )
    {

            // init logs
            BasicConfigurator.resetConfiguration();
            BasicConfigurator.configure();
            Logger.getRootLogger().setLevel(Level.WARN);

        System.out.println( "Hello World!" );
        
        // Here's T-H of this welcome code.
        String text = "The students had 15 hours of lectures and practice sessions on the topic of Textual Entailment.";
        String hypothesis = "The students must have learned quite a lot about Textual Entailment.";
        // Minimal "running" example for Excitement open platform EDAs (Entailment Decision Algorithms)
        // Basically, it is 3 steps:
        // 1. Doing pre-processing by calling a LAP
        // 2. Initialize an EDA with configuration & pre-trained model
        // 3. Annotating entailment relations with the selected EDA
        
        
        // 1) Do pre-processing, via an LAP.
        // Here, it runs one pipeline based on OpenNLPTaggerEN.
        System.out.println( "Running LAP for the T-H pair." );
        JCas annotated_THpair = null;
        try {
                LAPAccess lap = new OpenNLPTaggerEN(); // make a new OpenNLP based LAP
                annotated_THpair = lap.generateSingleTHPairCAS(text, hypothesis); // ask it to process this T-H.
        } catch (LAPException e)
        {
                System.err.print(e.getMessage());
                System.exit(1);
        }

        // 2) Initialize an EDA with a configuration (& corresponding model)
        // (Model path is given in the configuration file.)
        System.out.println("Initializing the EDA.");
        EDABasic<ClassificationTEDecision> eda = null;
        try {
                // TIE (MaxEntClassificationEDA): a simple configuration with no knowledge resource.
                // extracts features from lemma, tokens and parse tree and use them as features.
                File configFile = new File("/hardmnt/norris0/zanoli/Downloads/eop-resources-1.0.2/configuration-files/MaxEntClassificationEDA_Base+OpenNLP_EN.xml");
                CommonConfig config = new ImplCommonConfig(configFile);
                eda = new MaxEntClassificationEDA();
                eda.initialize(config);
        } catch (Exception e)
        {
                System.err.print(e.getMessage());
                System.exit(1);         
        }
        
        // 3) Now, one input data is ready, and the EDA is also ready.
        // Call the EDA.
        System.out.println("Calling the EDA for decision.");
        TEDecision decision = null; // the generic type that holds Entailment decision result
        try {
                decision = eda.process(annotated_THpair);
        } catch (Exception e)
        {
                System.err.print(e.getMessage());
                System.exit(1);         
        }
        System.out.println("Run complete: EDA returned decision: " + decision.getDecision().toString());
    }
}

Command line interface (CLI)

Another way to run the previous example is by using a standalone command line Java class, which serves as a unique entry point to the main functionalities included in EOP. This class, located in the gui directory, can call both the linguistic analysis pipeline to pre-process the data to be annotated and the selected entailment decision algorithm (EDA); it is the simplest way to use EOP.

Go into the EOP-1.0.2 directory, i.e.

> cd  ~/Excitement-Open-Platform-1.0.2/target/EOP-1.0.2/

and call the Demo class with the needed parameters as reported below, i.e.

> java -Djava.ext.dirs=../EOP-1.0.2/ eu.excitementproject.eop.gui.Demo -config
./eop-resources-1.0.2/configuration-files/MaxEntClassificationEDA_Base+OpenNLP_EN.xml -test
-text "The students had 15 hours of lectures and practice sessions on the topic of Textual Entailment." -hypothesis "The students must have learned quite a lot about Textual Entailment."
-output ./eop-resources-1.0.2/results/

where:

  • MaxEntClassificationEDA_Base+OpenNLP_EN.xml is the configuration file specifying the linguistic analysis pipeline, the EDA and the pre-trained model to be used for annotating the data.
  • test means that the selected EDA has to make its annotation by using a pre-trained model.
  • text is the text.
  • hypothesis is the hypothesis.
  • output is the directory where the result file (results.xml) containing the prediction has to be stored.

3. Advanced Installation

The advanced installation is for users with computer science skills who would like to exploit the full potential of EOP. This installation requires at least 5GB of RAM, and it is the installation needed to run the BIUTEE EDA. Before continuing with the next steps, be sure that the basic installation has already been done.

  • step1
  • step2

4. Preprocessing data sets

All textual entailment algorithms require some level of pre-processing, mostly linguistic annotations like sentence splitting, tokenization, POS tagging, dependency parsing, and so on.

EOP standardizes linguistic analysis modules in two ways:

  1. it has one common output data format
  2. it defines a common interface that all pipelines need to implement.

  • Common data format: In EOP we have borrowed a powerful abstraction called CAS (Common Analysis Structure), a data structure used in Apache UIMA. Its type system is expressive enough to represent any annotation, plus metadata. The following figure shows one CAS example that holds a Text - Hypothesis pair: http://hltfbk.github.io/Excitement-Open-Platform/specification/spec-1.1.3.html#CAS_example. CAS is the output data format of the EOP LAPs: all pipelines (LAPs) output their results as a CAS. You can see a CAS as a big container with a type system that defines many annotations.

  • Common access interfaces: In EOP, all pipelines are provided with the same set of “access methods”. Thus, regardless of your choice of pipeline (e.g. a tagger-only pipeline, or a tagging, parsing, and NER pipeline), they all react to the same set of common methods.

Application Program Interface (API)

Below we report Java code consisting of 4 code fragments. Users should proceed fragment by fragment: first run each fragment, then follow the code comments to understand what happens there. At the end of each code fragment there is one small task; try to solve each of them. As usual, we will use Eclipse to write and run the code.

  1. Create a new Java class with Eclipse and name it Ex1.
  2. Copy the following code into the created class and navigate to Ex1.java > Run As > Java Application to run the code.

Java code:

package org.excitement;

import java.io.File;

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.uima.cas.CASException;
import org.apache.uima.jcas.JCas;
import org.uimafit.util.JCasUtil;

//import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency;

import eu.excitementproject.eop.lap.LAPAccess;
import eu.excitementproject.eop.lap.LAPException;
import eu.excitementproject.eop.lap.PlatformCASProber;
import eu.excitementproject.eop.lap.dkpro.MaltParserEN;
import eu.excitementproject.eop.lap.dkpro.TreeTaggerEN;
// NOTE: the CASAccessUtilities class used below also needs to be imported
// (its package is omitted in the original material).

/**
* This heavily commented code introduces the Linguistic Analysis Pipeline
* (LAP) of EXCITEMENT open platform. Check EX1 exercise sheet first, and proceed
* with this example code.
*
* @author Gil
*/
public class Ex1 {

        public static void main(String[] args) {
                
            // init logs
            BasicConfigurator.resetConfiguration();
            BasicConfigurator.configure();
            Logger.getRootLogger().setLevel(Level.WARN);

            // remove comments of the following methods one by one, and
            // run it, and read it.
            // ex1_1();
            // ex1_2();
            // ex1_3();
            // ex1_4();
        }
        
        /**
         * This code introduces LAPAccess.generateSingleTHPairCAS
         * [URL]
         */
        public static void ex1_1() {
                // Each and every LAP in EXCITEMENT Open Platform (EOP)
                // implements the interface LAPAccess.
                // Here, lets use the TreeTagger based LAP.
                LAPAccess aLap = null;
                try {
                        aLap = new TreeTaggerEN();
                } catch (LAPException e)
                {
                        System.out.println("Unable to initiate TreeTagger LAP: " + e.getMessage());
                }
                
                // LAPs (all implement LAPAccess) basically support 3 types of common methods.
        
                //// First interface: generateSingleTHPair
                // LAPs can generate a specific data format that can be
                // accepted by EOP Entailment Decision Algorithms (EDAs). This is supported
                // by the LAPAccess.generateSingleTHPairCAS.
                
                JCas aJCas = null;
                try {
                        aJCas = aLap.generateSingleTHPairCAS("This is the Text part.", "The Hypothesis comes here.");
                } catch (LAPException e)
                {
                        System.out.println("Unable to run TreeTagger LAP: " + e.getMessage());                                                 
                }
                
                // All output of LAPs is stored in a data type that is called CAS.
                // This data type is borrowed from Apache UIMA: for the moment, just think
                // of it as a data type that can hold any annotation data. One way to see
                // it is as a "smarter" version of the CONLL format; just much more flexible,
                // and, unlike CONLL, an "in-memory" format.
                
                // Take a look at a CAS figure; to see how it stores data of a T-H pair.
                // Figure URL: http://hltfbk.github.io/Excitement-Open-Platform/specification/spec-1.1.3.html#CAS_example

                // Here, let's briefly check what is stored in this actual aJCas.
                // Say, how it is annotated?
                try {
                        // This command checks CAS data, and checks if it is compatible for the EDAs
                        PlatformCASProber.probeCas(aJCas, System.out);
                        // the following command dumps all annotations to text file.
                        CASAccessUtilities.dumpJCasToTextFile(aJCas, "test_dump1.txt");
                        System.out.println("test_dump1.txt file dumped.");
                } catch (LAPException e)
                {
                        System.out.println("Failed to dump CAS data: " + e.getMessage());                                                 
                }
                // TODO Task1_1 check out this file, in Excitement-Open-Platform/fallschool/test_dump1.txt         
                
                System.out.println("method ex1_1() finished");
        }
        
        /**
         * This code introduces LAPAccess.processRawInputFormat
         */
        public static void ex1_2()
        {
                // LAPs also support file based mass pre-processing.
                // As an example let's process RTE3 English data with TreeTagger LAP.

                // Initialize an LAP, here it's TreeTagger
                LAPAccess ttLap = null;
                try {
                        ttLap = new TreeTaggerEN();
                } catch (LAPException e)
                {
                        System.out.println("Unable to initiate TreeTagger LAP: " + e.getMessage());
                }
                
                // Prepare input file, and output directory.
                File f = new File("./src/main/resources/RTE3-dataset/English_dev.xml");
                File outputDir = new File("./target/");
                
                // Call LAP method for file processing.
                // This takes some time. RTE data has 800 cases in it.
                // Each case, will be first annotated as a CAS, and then it will be
                // serialized into one XMI file.
                try {
                        ttLap.processRawInputFormat(f, outputDir);
                } catch (LAPException e)
                {
                        System.out.println("Failed to process EOP RTE data format: " + e.getMessage());                                                 
                }
        
                // TODO Task1_2: now all RTE3 training data is annotated and stored in
                // output dir ( Excitement-Open-Platform/fallschool/target/ )
                // a. Check the files are really there.
                // b. Open up one XMI file to get impression that how the CAS content is
                // stored into XML-based file.
                System.out.println("method ex1_2() finished");
        }
        
        /**
         * This code introduces LAPAccess.addAnnotationOn
         */
        public static void ex1_3()
        {
                // The previous two methods generate "pair data stored in a CAS"
                // (or an XMI file), including the Entailment Pair annotation, and so on.
                
                // But what if you simply want to annotate a sentence, or something
                // like that? E.g. no Entailment pair, just a single text document
                // annotation.
                
                // All LAPs have an addAnnotationOn() method that gives you this
                // capability.
                // The following code shows you how you can do that.
                
                // first, prepare Malt parser based LAP
                LAPAccess malt = null;
                try {
                        malt = new MaltParserEN();
                } catch (LAPException e)
                {
                        System.out.println("Unable to initiate MaltParser (with TreeTagger) LAP: " + e.getMessage());
                }
                
                // and let's annotate something.
                try {
                        // get one empty CAS.
                        JCas aJCas = CASAccessUtilities.createNewJCas();
                        
                        // Before asking LAP to process, you have to set at least two things.
                        // One is language, and the other is document itself.
                        aJCas.setDocumentLanguage("EN"); // ISO 639-1 language code.
                        String doc = "This is a document. You can pass an arbitrary document to a CAS and let the LAP work on it.";
                        aJCas.setDocumentText(doc);
                        malt.addAnnotationOn(aJCas);
                } catch (LAPException e)
                {
                        System.out.println("Failed to process EOP RTE data format: " + e.getMessage());                                                 
                }
                
                // Malt parser annotates the given aJCas document text.
                // But here, there is no Pair, no TEXTVIEW, or HYPOTHESISVIEW.
                
                // TODO Task1_3 Dump this result of malt parser result to a textfile.
                // Check how the CAS stores dependency parser result.
                // (use CASAccessUtilities.dumpJCasToTextFile())
        }
        
        /**
         * This code introduces how you can iterate over added annotations
         * within a JCas.
         */
        public static void ex1_4()
        {
                // So far, so good. But how can we access annotation results
                // stored in a JCas? You can iterate over them, as follows.

                // First, prepare LAP and process a T-H pair.
                LAPAccess malt = null;
                JCas aJCas = null;
                try {
                        malt = new MaltParserEN();
                        aJCas = malt.generateSingleTHPairCAS("We thought that there were many cats in this garden.", "But there was only one cat, among all the gardens in the city.");
                } catch (LAPException e)
                {
                        System.out.println("Unable to initialize MaltParser (with TreeTagger) LAP: " + e.getMessage());                         
                }
                
                // aJCas now holds the T-H pair.
                // Here, let's iterate over the Tokens on Text side.
                try {
                        JCas textView = aJCas.getView("TextView");
                        System.out.println("Listing tokens of TextView.");
                        for (Token tok : JCasUtil.select(textView, Token.class))
                        {
                                String s = tok.getCoveredText(); // .getCoveredText() lets you check the text on the document that this annotation is attached to.
                                int begin = tok.getBegin();
                                int end = tok.getEnd();
                                System.out.println(begin + "-" + end + " " + s);                 
                        }
                } catch (CASException e)
                {
                        System.out.println("Exception while accessing TextView of CAS: " + e.getMessage());                                                 
                }

                // And here, let's iterate over the dependency edges on the Hypothesis side.
                try {
                        JCas hypothesisView = aJCas.getView("HypothesisView");
                        for (Dependency dep : JCasUtil.select(hypothesisView, Dependency.class)) {

                                // One Dependency annotation holds the information for a dependency edge.
                                // Basically, 3 things:
                                // it holds "Governor" (points to a Token), "Dependent" (also to a Token),
                                // and the relationship between them (as a string).
                                Token dependent = dep.getDependent();
                                Token governor = dep.getGovernor();
                                String dTypeStr = dep.getDependencyType();

                                // let's print them with full token information (lemma, pos, loc)
                                // info for the dependent ...
                                int dBegin = dependent.getBegin();
                                int dEnd = dependent.getEnd();
                                String dTokenStr = dependent.getCoveredText();
                                String dLemmaStr = dependent.getLemma().getValue();
                                String dPosStr = dependent.getPos().getPosValue();

                                // info for the governor ...
                                int gBegin = governor.getBegin();
                                int gEnd = governor.getEnd();
                                String gTokenStr = governor.getCoveredText();
                                String gLemmaStr = governor.getLemma().getValue();
                                String gPosStr = governor.getPos().getPosValue();

                                // and finally print the edge with full info
                                System.out.println(dBegin + "-" + dEnd + " " + dTokenStr + "/" + dLemmaStr + "/" + dPosStr);
                                System.out.println("\t ---"+ dTypeStr + " --> ");
                                System.out.println("\t " + gBegin + "-" + gEnd + " " + gTokenStr + "/" + gLemmaStr + "/" + gPosStr);
                        }
                } catch (CASException e)
                {
                        System.out.println("Exception while accessing HypothesisView of CAS: " + e.getMessage());                                                 
                }                

                // TODO [Optional Task] Task 1_4
                // (This is an optional task: you can skip it without affecting later exercises)
                //
                // Try to print out the above T-H pair as two bags of lemmas.
                //
                // You can iterate over Lemma type (you will need to import Lemma class),
                // or, you can iterate over Tokens, and use Token.getLemma() to fetch Lemmas.         
                // Then, you can access Lemma value, by calling Lemma.getValue();
                
                System.out.println("ex1_4() method finished");
                
        }
}
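
The optional Task 1_4 above asks for the T-H pair as two bags of lemmas. Independent of the EOP classes, the bag-building step itself can be sketched with plain Java collections; the lemma strings below are hypothetical stand-ins for the values that Token.getLemma().getValue() would return for the two sentences in ex1_4().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LemmaBagSketch {

    // Build a bag (multiset) of lemmas as a lemma -> count map.
    static Map<String, Integer> toBag(List<String> lemmas) {
        Map<String, Integer> bag = new HashMap<>();
        for (String lemma : lemmas) {
            bag.merge(lemma, 1, Integer::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        // Hypothetical lemma sequences for the T and H sentences of ex1_4().
        List<String> textLemmas = Arrays.asList(
                "we", "think", "that", "there", "be", "many", "cat",
                "in", "this", "garden", ".");
        List<String> hypoLemmas = Arrays.asList(
                "but", "there", "be", "only", "one", "cat", ",",
                "among", "all", "the", "garden", "in", "the", "city", ".");

        System.out.println("Text bag:       " + toBag(textLemmas));
        System.out.println("Hypothesis bag: " + toBag(hypoLemmas));
    }
}
```

In the real exercise you would fill the lists by iterating over the Lemma (or Token) annotations of each view instead of hard-coding the strings.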

Command line interface (CLI)

As usual, we can run the preprocessing via the standalone Java class that serves as a single entry point to the main EOP functionalities.

Go into the EOP-1.0.2 directory, i.e.

> cd  ~/Excitement-Open-Platform-1.0.2/target/EOP-1.0.2/

and call the Demo class with the needed parameters as reported below, i.e.

> java -Djava.ext.dirs=../EOP-1.0.2/ eu.excitementproject.eop.gui.Demo -config
@TODO by Vivi

where:

  • MaxEntClassificationEDA_Base+OpenNLP_EN.xml is the configuration file specifying the linguistic analysis pipeline, the EDA, and the pre-trained model to be used for annotation.
  • test means that the selected EDA performs its annotation using a pre-trained model.
  • text is the text. @TODO by Vivi

5. Annotating by using pre-trained models

Application Program Interface (API)

  1. Create a new Java class with Eclipse and name it as Ex2.
  2. Copy the following code into the page of the created class and navigate to Ex2.java > Run As > Java Application to run the code.

Java code:

package org.excitement;

import java.io.File;

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.uima.jcas.JCas;

import eu.excitementproject.eop.common.DecisionLabel;
import eu.excitementproject.eop.common.EDABasic;
import eu.excitementproject.eop.common.EDAException;
import eu.excitementproject.eop.common.TEDecision;
import eu.excitementproject.eop.common.configuration.CommonConfig;
import eu.excitementproject.eop.common.exception.ComponentException;
import eu.excitementproject.eop.common.exception.ConfigurationException;
import eu.excitementproject.eop.core.ImplCommonConfig;
import eu.excitementproject.eop.core.MaxEntClassificationEDA;
import eu.excitementproject.eop.lap.LAPAccess;
import eu.excitementproject.eop.lap.LAPException;
//import eu.excitementproject.eop.lap.dkpro.MaltParserEN;
import eu.excitementproject.eop.lap.dkpro.TreeTaggerEN;

/**
* This example code shows how you can initiate and use an EDA to annotate entailment
* relations by using a pre-trained model.
*
* @author Gil
*
*/

public class Ex2 {

        public static void main(String[] args) {

            // init logs
            BasicConfigurator.resetConfiguration();
            BasicConfigurator.configure();
            Logger.getRootLogger().setLevel(Level.INFO);

            // read and run each of the code sections, one by one.
            
        ex2_1(); // initialize() and process() of EDA
        
        }
        
        /**
         * This method shows initializing an EDA with one existing (already trained) model.
         */
        public static void ex2_1() {
            // All EDAs implement the EDABasic interface. Here, we will visit
            // the "process mode" of an EDA.
            
                ///////
                /// Step #1: initialize an EDA
                ///////
            // First we need an instance of an EDA. We will use a TIE instance.
                // (MaxEntClassificationEDA)
                @SuppressWarnings("rawtypes") // why this? will be explained later.
                EDABasic eda = null;
                try {
                        eda = new MaxEntClassificationEDA();
                        // To start "process mode" we need to initialize the EDA.
                        // We have two TIE configurations in /src/main/resources/config/
                        // let's use "lexical one": MaxEntClassificationEDA_Base+WN+VO_EN.xml
                        File configFile = new File("src/main/resources/config/MaxEntClassificationEDA_Base+WN+VO_EN.xml");

                        CommonConfig config = new ImplCommonConfig(configFile);
                        eda.initialize(config);
                }
                catch (EDAException e)
                {
                        System.out.println("Failed to init the EDA: "+ e.getMessage());
                        System.exit(1);
                }
                catch (ConfigurationException e)
                {
                        System.out.println("Failed to init the EDA: "+ e.getMessage());
                        System.exit(1); 
                }
                catch (ComponentException e)
                {
                        System.out.println("Failed to init the EDA: "+ e.getMessage());
                        System.exit(1);
                }    
                
                // TODO Task ex2_1_a: Take a look at the configuration file.
                // (XML file that holds the above configuration: src/main/resources/config/MaxEntClassificationEDA_Base+WN+VO_EN.xml)
                //
                // It is TIE configuration with lexical features (without parse trees),
                // and the configuration uses WordNet and VerbOcean.
                // First note the "model path", since we load an already existing model.
                // /fallschool/ code already holds that model in the given path.
                //
                // Note that there are two file paths (for WordNet and VerbOcean), and
                // one model path. When you install EOP on your own computer, you probably
                // need to update those paths.
                // (Note that /eop-resources-1.0.2/ can be downloaded from project webpage,
                // as described in Ex0 exercise sheet)
                
                // Full list of configurations and model files can be found in
                // /Excitement-Open-Platform/core/src/main/resources/configuration-files
                // and
                // /Excitement-Open-Platform/core/src/main/resources/model
                
                ///////
                /// Step #2: call process(), and check the result.
                ///////
                
                // Okay. now the EDA is ready. Let's prepare one T-H pair and use it.
            // simple Text and Hypothesis.
                // Note that (as written in the configuration file), current configuration
                // needs TreeTaggerEN Annotations
        String text = "The sale was made to pay Yukos' US$ 27.5 billion tax bill, Yuganskneftegaz was originally sold for US$ 9.4 billion to a little known company Baikalfinansgroup which was later bought by the Russian state-owned oil company Rosneft.";
        String hypothesis = "Baikalfinansgroup was sold to Rosneft.";
        
        JCas thPair = null;
        try {
                LAPAccess lap = new TreeTaggerEN();
                thPair = lap.generateSingleTHPairCAS(text, hypothesis); // ask it to process this T-H.
        } catch (LAPException e)
        {
                System.err.print("LAP annotation failed:" + e.getMessage());
                System.exit(1);
        }
        
        // Now the pair is ready in the CAS. call process() method to get
        // Entailment decision.
        
        // Entailment decisions are represented with "TEDecision"
        // class.
        TEDecision decision = null;
        try {
                decision = eda.process(thPair);
        } catch (EDAException e)
        {
                System.err.print("EDA reported exception" + e.getMessage());
                System.exit(1);
        }
        catch (ComponentException e)
        {
                System.err.print("EDA reported exception" + e.getMessage());
                System.exit(1);
        }
        
        // And let's look at the result.
        DecisionLabel r = decision.getDecision();
        System.out.println("The result is: " + r.toString());
        
        
        // and you can call process() as many times as you like.
        // ...
        // once all is done, we can call this.
        eda.shutdown();
        
        // TODO Task ex2_1_b
        // Try to ask some more T-H pairs from RTE3 English data.
        // You can find the RTE3 data in /fallschool/src/main/resources/RTE3-dataset
        // Just randomly pick a few pairs from the data, type them in the code of this
        // file (as String t = "xxx", String h = "yyy"), and annotate them with the lap,
        // get JCas data, and call EDA process() to get the result.
         
        }
        
}

Command line interface (CLI)

Another way to run EOP is via the command line, using a standalone Java class that serves as a single entry point to the main EOP functionalities. The class, located in the gui directory, calls both the linguistic analysis pipeline to pre-process the data to be annotated and the selected entailment algorithm (EDA). It is the simplest way to use EOP.

Go into the EOP-1.0.2 directory, i.e.

> cd  ~/Excitement-Open-Platform-1.0.2/target/EOP-1.0.2/

and call the Demo class with the needed parameters as reported below, i.e.

> java -Djava.ext.dirs=../EOP-1.0.2/ eu.excitementproject.eop.gui.Demo -config
./eop-resources-1.0.2/configuration-files/MaxEntClassificationEDA_Base+OpenNLP_EN.xml -test
-text "The students had 15 hours of lectures and practice sessions on the topic of Textual Entailment." -hypothesis "The students must have learned quite a lot about Textual Entailment."
-output ./eop-resources-1.0.2/results/

where:

  • MaxEntClassificationEDA_Base+OpenNLP_EN.xml is the configuration file specifying the linguistic analysis pipeline, the EDA, and the pre-trained model to be used for annotation.
  • test means that the selected EDA performs its annotation using a pre-trained model. @TODO by Vivi

6. Training new models

Application Program Interface (API)

  1. Create a new Java class with Eclipse and name it as Ex3.
  2. Copy the following code into the page of the created class and navigate to Ex3.java > Run As > Java Application to run the code.

Java code:

package org.excitement;

import java.io.File;

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.uima.jcas.JCas;

import eu.excitementproject.eop.common.DecisionLabel;
import eu.excitementproject.eop.common.EDABasic;
import eu.excitementproject.eop.common.EDAException;
import eu.excitementproject.eop.common.TEDecision;
import eu.excitementproject.eop.common.configuration.CommonConfig;
import eu.excitementproject.eop.common.exception.ComponentException;
import eu.excitementproject.eop.common.exception.ConfigurationException;
import eu.excitementproject.eop.core.ImplCommonConfig;
import eu.excitementproject.eop.core.MaxEntClassificationEDA;
import eu.excitementproject.eop.lap.LAPAccess;
import eu.excitementproject.eop.lap.LAPException;
//import eu.excitementproject.eop.lap.dkpro.MaltParserEN;
import eu.excitementproject.eop.lap.dkpro.TreeTaggerEN;

/**
* This example code shows how you can train a new model on a new data set.
*
* @author Gil
*
*/

public class Ex3 {

        public static void main(String[] args) {

            // init logs
            BasicConfigurator.resetConfiguration();
            BasicConfigurator.configure();
            Logger.getRootLogger().setLevel(Level.INFO);

            // read and run each of the code sections, one by one.
            
            ex3_1(); // startTraining() of EDA
        }
        
        /**
         * This method shows how to train an EDA with a given configuration and training data.
         *
         * The example training process takes a fair amount of time.
         * (around 10-15 minutes).
         * It is recommended to read / follow the code first to its end,
         * before actually running it.
         */
        public static void ex3_1()
        {
                // The other mode of the EDA is training mode. Let's check how this is done
                // with one training example.
                
                // Training also requires the configuration file.
                // We will load a configuration file first.                 
                CommonConfig config = null;
                try {
                        // the configuration uses only WordNet (no VerbOcean)
                        File configFile = new File("src/main/resources/config/MaxEntClassificationEDA_Base+WN_EN.xml");
                        config = new ImplCommonConfig(configFile);
                }
                catch (ConfigurationException e)
                {
                        System.out.println("Failed to read configuration file: "+ e.getMessage());
                        System.exit(1);
                }
                
                // TODO task ex3_1_a
                // Check the above configuration XML file by opening and reading it.
                // Check the following values under the section
                // "eu.excitementproject.eop.core.MaxEntClassificationEDA" (last section).
                //                 modelFile: the new model will be generated here.
                //                 trainDir: here the configuration expects pre-processed RTE training data as a set of XMI files.
                // Where will the new model be generated? Where does the configuration
                // expect to read pre-processed training data?
                // Also check the first section:
                // What LAP does it require? (top section, "activatedLAP")
                
                // WARNING: each EDA has different procedures for training.
                // So other EDAs like BIUTEE might expect different parameters
                // for training. One needs to consult EDA-specific documentations
                // to check this.
                
                // Before calling startTraining() we have to provide
                // pre-processed training data. The EDA will train itself on
                // the data pointed to by trainDir.
                                        
                try {
                        LAPAccess ttLap = new TreeTaggerEN();
                        // Prepare input file, and output directory.
                        File f = new File("./src/main/resources/RTE3-dataset/English_dev.xml");
                        File outputDir = new File("./target/training/"); // as written in configuration!
                        if (!outputDir.exists())
                        {
                                outputDir.mkdirs();
                        }
                        ttLap.processRawInputFormat(f, outputDir);
                } catch (LAPException e)
                {
                        System.out.println("Training data annotation failed: " + e.getMessage());                         
                        System.exit(1);
                }
                        
                // Okay, now RTE3 data are all tagged and stored in the
                // trainDir. Let's ask EDA to train itself.
                try {
                        @SuppressWarnings("rawtypes")
                        EDABasic eda = null;
                        eda = new MaxEntClassificationEDA();
                        eda.startTraining(config); // This *MAY* take some time.
                }
                catch (EDAException e)
                {
                        System.out.println("Failed to do the training: "+ e.getMessage());
                        System.exit(1);
                }   
                catch (ConfigurationException e)
                {
                        System.out.println("Failed to do the training: "+ e.getMessage());
                        System.exit(1);
                }   
                catch (ComponentException e)
                {
                        System.out.println("Failed to do the training: "+ e.getMessage());
                        System.exit(1);
                }   

                System.out.print("Training completed.");
                
                // TODO task ex3_1_b
                // Go to "modelFile" path and check that the newly trained model
                // has been generated.
                
                // TODO task ex3_1_c
                // modify ex3_1 to use this configuration and its newly trained model.
                // Check it actually works.
        
                // TODO [OPTIONAL] task ex3_1_d
                // (You can skip optional tasks without affecting the final mini project)
                //
                // The best configuration known for current EOP-TIE EDA is
                // "MaxEntClassificationEDA_Base+WN+VO+TP+TPPos+TS_EN.xml"
                // Try to train a model based on this configuration.
                // You can find the configuration in Excitement-Open-Platform/config/
                //
                // You have to edit the configuration to update file paths,
                // output dir, etc.
                // you also need to provide the proper LAP pre-processing.
        }                
}

Command line interface (CLI)

Another way to run EOP is via the command line, using a standalone Java class that serves as a single entry point to the main EOP functionalities. The class, located in the gui directory, calls both the linguistic analysis pipeline to pre-process the data to be annotated and the selected entailment algorithm (EDA). It is the simplest way to use EOP.

Go into the EOP-1.0.2 directory, i.e.

> cd  ~/Excitement-Open-Platform-1.0.2/target/EOP-1.0.2/

and call the Demo class with the needed parameters as reported below, i.e.

> java -Djava.ext.dirs=../EOP-1.0.2/ eu.excitementproject.eop.gui.Demo -config
@TODO by Vivi

where:

  • MaxEntClassificationEDA_Base+OpenNLP_EN.xml is the configuration file specifying the linguistic analysis pipeline, the EDA, and the path where the trained model will be stored. @TODO by Vivi

7. Evaluating the results

Application Program Interface (API)
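
Pending the official example, entailment results are typically evaluated by accuracy: the fraction of T-H pairs whose predicted label matches the gold label. Below is a minimal, EOP-independent sketch of that computation; the label strings are hypothetical stand-ins for the DecisionLabel values an EDA would return.

```java
import java.util.Arrays;
import java.util.List;

public class AccuracySketch {

    // Accuracy = number of matching gold/predicted labels / total pairs.
    static double accuracy(List<String> gold, List<String> predicted) {
        if (gold.size() != predicted.size()) {
            throw new IllegalArgumentException("gold and predicted differ in size");
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return gold.isEmpty() ? 0.0 : (double) correct / gold.size();
    }

    public static void main(String[] args) {
        // Hypothetical gold and predicted labels for four T-H pairs.
        List<String> gold = Arrays.asList(
                "Entailment", "NonEntailment", "Entailment", "NonEntailment");
        List<String> predicted = Arrays.asList(
                "Entailment", "Entailment", "Entailment", "NonEntailment");
        System.out.println("Accuracy: " + accuracy(gold, predicted)); // 3 of 4 correct
    }
}
```

In practice, the gold labels come from the annotated RTE data set and the predicted labels from calling the EDA's process() on each pair.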

Command line interface (CLI)

Another way to run EOP is via the command line, using a standalone Java class that serves as a single entry point to the main EOP functionalities. The class, located in the gui directory, calls both the linguistic analysis pipeline to pre-process the data to be annotated and the selected entailment algorithm (EDA). It is the simplest way to use EOP.

8. Sharing the results
