GermanLexicalResources
This page documents the German lexical resource modules (those implementing the LexicalResource interface) in the EXCITEMENT Open Platform (EOP).
At the moment, the core of EOP contains four German lexical resources:
- DerivBase: The resource holds groups of derivationally related words (e.g., sleepy_adj, sleep_n, sleepless_adj), and returns lexical rules based on the assumption that derivationally related words share a common meaning and are thus useful for Textual Entailment.
- DewakDistributional: A resource based on distributional similarities observed in the deWac corpus. The resource holds the 10k most frequent terms and their pairwise similarities, and returns lexical rules based on those similarities (its basic assumption is similar to that of TransDM).
- GermaNetWrapper: An implementation that interacts with GermaNet, the German WordNet. It generates lexical rules for lemma pairs which share a semantic relationship in GermaNet (synonyms, hypernyms, ...). Note that GermaNet itself is not provided; the user has to install it separately in order to use this resource.
- TransDM: A distributional similarity resource that "translates" an English syntactic vector space into German, and refines the information with monolingual German data. It returns lexical rules based on the assumption that words with similar contexts are synonymous and thus also entail each other.
For a sample configuration file which initializes these resources, please see [https://github.com/hltfbk/Excitement-Open-Platform/blob/master/core/src/test/resources/german_resource_test_configuration.xml].
This resource implements a German lexical resource based on derivational information from DErivBase v1.3. The resource contains groups of lemmas, so-called derivational families, which share a morphological (and ideally also a semantic) relationship, e.g. "sleep, sleepy, to sleep, sleepless". DErivBase was generated with a rule-based approach: content words from the SdeWaC corpus are grouped into derivational families with the help of manually written derivation rules. For Textual Entailment, we assume a bidirectional entailment relationship between two words that occur in the same derivational family.
The resource is accessed via the class core.component.lexicalknowledge.derivbase.DerivBase. It loads information from one of two existing files: one contains only the derivational families; the other contains both the derivational families and confidence scores for all pairs of lemmas within one family. The confidence score within DErivBase reflects the connectedness of two lemmas within one family: the score is calculated as 1/n, where n is the length of the derivation path. Thus, a score of 1.00 is given only to pairs that are directly linked by a single rule, 0.5 to pairs linked by two rules, 0.33 to pairs linked by three rules, etc.
We transfer the DErivBase-internal confidence scores to the confidence scores as defined in the EXCITEMENT project: A confidence score of 0 means "no entailment", 0.5 means "don't know", 1.0 means "entailment". Since we assume that derivationally related lemmas also entail each other (bidirectionally), we map the scores from DErivBase into the scale 0.5-1.0. The following are some examples for DErivBase-internal and corresponding EXCITEMENT confidence scores:
DErivBase = 1.0; EXCITEMENT = 1.0
DErivBase = 0.5; EXCITEMENT = 0.75
DErivBase = 0.33; EXCITEMENT = 0.665
etc.
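
The examples above are consistent with a linear rescaling of the DErivBase score into the range 0.5-1.0. The following Python sketch illustrates that reading. The formula `0.5 + score / 2` is inferred from the listed examples, not taken from the DerivBase implementation, and both function names are hypothetical.

```python
def derivbase_score(path_length: int) -> float:
    """DErivBase-internal confidence: 1/n for a derivation path of n rules."""
    if path_length < 1:
        raise ValueError("path_length must be at least 1")
    return 1.0 / path_length


def to_excitement_confidence(score: float) -> float:
    """Map a DErivBase score to the EXCITEMENT scale.

    Assumption: a linear rescaling, inferred from the documented examples
    (1.0 -> 1.0, 0.5 -> 0.75, 0.33 -> 0.665).
    """
    return 0.5 + score / 2.0


if __name__ == "__main__":
    for n in (1, 2, 3):
        s = derivbase_score(n)
        print(f"path length {n}: DErivBase = {s:.2f}, "
              f"EXCITEMENT = {to_excitement_confidence(s):.3f}")
```

Note that the documented value 0.665 results from applying the mapping to the rounded score 0.33; with the unrounded 1/3 the result is slightly higher.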
DerivBaseResource is a simple LexicalResource and does not support LexicalResourceWithRelation.
DerivBaseResource has two configurable values, indicating if confidence scores for derivationally related lemma pairs in the resource should be used, and in which way.
Section | Property | Value | Requirement |
---|---|---|---|
DerivBaseResource | useScores | Boolean value. Specifies if rule confidence scores should be used (true) or not (false). | N/A |
DerivBaseResource | derivationSteps | Integer value between 1 and 10. Specifies how many derivation steps are accepted to derive lemma l2 from lemma l1, and to still count this lemma pair as "entailment". Thus, this value influences how many lemma pairs of the DErivBase resource are considered in the EXCITEMENT platform. | Only effective if useScores is set to true. |
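
One way to read derivationSteps is as an upper bound on the derivation path length of a lemma pair. The Python sketch below illustrates that interpretation only; the function name, the data layout, and the example path lengths are invented for illustration and do not come from the DerivBase code or data.

```python
from typing import Dict, Tuple

LemmaPair = Tuple[str, str]


def filter_by_derivation_steps(
    path_lengths: Dict[LemmaPair, int], derivation_steps: int
) -> Dict[LemmaPair, int]:
    """Keep only lemma pairs whose derivation path has at most
    `derivation_steps` rules (assumed meaning of the parameter)."""
    if not 1 <= derivation_steps <= 10:
        raise ValueError("derivationSteps must be between 1 and 10")
    return {
        pair: n for pair, n in path_lengths.items() if n <= derivation_steps
    }


# Made-up path lengths for one derivational family.
example_family = {
    ("sleep_n", "sleepy_adj"): 1,
    ("sleep_n", "sleepless_adj"): 1,
    ("sleepy_adj", "sleepless_adj"): 2,
}

if __name__ == "__main__":
    print(filter_by_derivation_steps(example_family, 1))
```

Under this reading, a larger derivationSteps value admits more lemma pairs, each with a lower confidence score.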
This resource implements a German lexical resource based on corpus term distribution. It uses the distance vectors which have been gathered from DeWac, a web corpus for German. The vectors are based on the 10k most frequent words observed in the corpus. Similarity is calculated with five different similarity measures (balAPinc, lin, linOpt, jaccard, dice). Only pairs which achieve a predefined minimum similarity are stored in the resource (for balAPinc: .7, for lin: .6, for linOpt: .6, for jaccard: .8, for dice: .9).
As a confidence score, the resource returns the distributional similarity score which has been calculated for the lemma-POS pairs. Thus, depending on the measure used, it lies between .6 and 1.0.
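
The per-measure cutoffs listed above can be summarized as a small lookup table. The Python sketch below is only an illustration: the function name is hypothetical, and treating the minimum as inclusive (`>=`) is an assumption, since the text does not say whether a pair exactly at the threshold is stored.

```python
# Minimum similarity per measure, as listed in the description above.
MIN_SIMILARITY = {
    "balAPinc": 0.7,
    "lin": 0.6,
    "linOpt": 0.6,
    "jaccard": 0.8,
    "dice": 0.9,
}


def is_stored(measure: str, similarity: float) -> bool:
    """Return True if a pair with this similarity would pass the cutoff.

    Assumption: the threshold is inclusive.
    """
    return similarity >= MIN_SIMILARITY[measure]


if __name__ == "__main__":
    print(is_stored("dice", 0.95))
    print(is_stored("dice", 0.85))
```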
DewakDistributional is a simple LexicalResource and does not support LexicalResourceWithRelation.
No values to configure.
This class implements a German lexical resource based on GermaNet 7.0, the German WordNet. The implementation accesses GermaNet via the GermaNet API. It supports both LexicalResource and LexicalResourceWithRelation.
For the relations, it supports both OwnRelationSpecifier (with GermaNetRelation; possible relation types: synonym, hypernym, hyponym, causes, entails) and CanonicalRelationSpecifier (possible relation types: TERuleRelation.Entailment or .Nonentailment).
A confidence score can be set for each OwnRelation in the configuration. If a configuration is used but the scores are not defined, all relation confidences default to 0.0. If no configuration is used, they all default to 1.0.
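
The default behaviour described above can be sketched in Python as follows. This is not the GermaNetWrapper code: the function name is hypothetical, the dictionary stands in for an already parsed configuration, and falling back to 0.0 per missing relation (rather than only when all scores are missing) is an assumption.

```python
from typing import Dict, Optional

RELATIONS = ("synonym", "hypernym", "hyponym", "causes", "entails")


def resolve_confidences(
    config: Optional[Dict[str, float]]
) -> Dict[str, float]:
    """Return one confidence per relation.

    - No configuration at all (None): every relation gets 1.0.
    - Configuration given: relations without a score get 0.0
      (per-relation fallback is an assumption of this sketch).
    """
    if config is None:
        return {relation: 1.0 for relation in RELATIONS}
    return {relation: float(config.get(relation, 0.0)) for relation in RELATIONS}


if __name__ == "__main__":
    print(resolve_confidences(None))
    print(resolve_confidences({"synonym": 0.9, "hypernym": 0.7}))
```

The practical consequence is that a configuration which omits the confidence values effectively disables the relations, so the scores should be set explicitly whenever a configuration is used.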
Note 1: The EXCITEMENT project cannot and does not redistribute GermaNet. Users of this component must obtain it under a proper license agreement from Tuebingen University. The GermaNet API, however, is provided with the project.
Note 2: In the current version (1.0.2 and previous), GermaNet's antonymy relation is not supported. It will be integrated in a later version.
The GermaNet resource has a few configurable values: the path to the GermaNet data itself, and a set of double values that indicate the "confidence" of each own relation when it is treated as "entailment".
Section | Property | Value | Requirement |
---|---|---|---|
GermaNetWrapper | germaNetFilesPath | Path to the GermaNet resource, which users have to install on their own machine. | N/A |
GermaNetWrapper | causesConfidence | Confidence score indicating how reliable the GermaNet 'causes' relation is considered. Value between 0 and 1. Causes are only used for rules LHS - RHS. | N/A |
GermaNetWrapper | entailsConfidence | Confidence score indicating how reliable the GermaNet 'entails' relation is considered. Value between 0 and 1. Entails are only used for rules LHS - RHS. | N/A |
GermaNetWrapper | hypernymConfidence | Confidence score indicating how reliable the GermaNet 'hypernym' relation is considered. Value between 0 and 1. Hypernyms are only used for rules LHS - RHS. | N/A |
GermaNetWrapper | synonymConfidence | Confidence score indicating how reliable the GermaNet 'synonym' relation is considered. Value between 0 and 1. Synonyms are used for both rules RHS - LHS and LHS - RHS. | N/A |
GermaNetWrapper | hyponymConfidence | Confidence score indicating how reliable the GermaNet 'hyponym' relation is considered. Value between 0 and 1. Hyponyms are only used for rules RHS - LHS. | N/A |
GermaNetWrapper | antonymConfidence | Not integrated yet. | N/A |
Added in version 1.1.
This multilingual resource contains syntax-based distributional information. It is motivated by the fact that available corpora for English are still larger than for other languages, and syntactic analysis is still more reliable. As a result, the creation of high-quality syntax-based distributional resources for other languages is still problematic.
TransDM addresses this problem by "translating" the standard English syntax-based distributional resource (Baroni and Lenci's Distributional Memory) into German using a simple translation lexicon, and complementing it with co-occurrence information gathered from a German corpus. This combination outperforms monolingual German distributional models while keeping coverage constant.
For each word pair, the similarity is computed in both the translated model and the monolingual German model, and the higher of the two values is used.
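
The combination step can be sketched in Python as below. The function name, the dictionary-based model representation, and the example values are all hypothetical; using the remaining score when a pair is missing from one model is an assumption of this sketch, not a documented property of TransDM.

```python
from typing import Dict, Optional, Tuple

WordPair = Tuple[str, str]


def combined_similarity(
    pair: WordPair,
    translated: Dict[WordPair, float],
    monolingual: Dict[WordPair, float],
) -> Optional[float]:
    """Return the higher of the two model similarities for `pair`.

    If only one model knows the pair, its score is used (assumption);
    if neither does, None is returned.
    """
    scores = [model[pair] for model in (translated, monolingual) if pair in model]
    return max(scores) if scores else None


# Made-up similarity values for illustration only.
translated_model = {("Hund", "Tier"): 0.42, ("Auto", "Wagen"): 0.81}
monolingual_model = {("Hund", "Tier"): 0.57}

if __name__ == "__main__":
    print(combined_similarity(("Hund", "Tier"), translated_model, monolingual_model))
```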
Two similarity measures are implemented:
- cosine (standard vector similarity measure)
- balapinc (especially suitable for Textual Entailment)
The underlying English corpus is ukWac, the translation lexicon is dict.cc, and the German corpus is sdeWac. The resource contains the 2 million word pairs with the highest similarity values achieved with the method described above.
GermanTransDmResource is a simple LexicalResource and does not support LexicalResourceWithRelation.
GermanTransDmResource has one configurable value, specifying which similarity measure(s) of the resource should be considered.
Section | Property | Value | Requirement |
---|---|---|---|
GermanTransDm | simMeasure | Specifies which similarity measure of this LexicalResource should be used. Three options are available: all (i.e., lemma pairs with both high cosine and balapinc measures are considered), cosine (only pairs with high cosine measures are considered), balapinc (only pairs with high balapinc measures are considered). The default value is all. | N/A |