Parallel corpus mining

Metadata

  • Status: Proposed (NB: still has to be discussed with relevant researchers)
  • Type: Specific
  • Work Package: WP3
  • Research Coordinators: Time in Translation group
  • Coordinators for CLARIAH: Jesse de Does, Vincent Vandeghinste
  • Participating Institutes: INT, UU
  • End-users: Time in Translation group
  • Developers: (Who is involved in implementing this use-case (if any)? Try to mention name, institute, role/responsibility)
  • Interest Groups: (a list of CLARIAH interest groups, such as Text and DevOps, for which this use case may be relevant. See the list of IGs at: https://github.com/clariah/ig/.)
  • Task IDs: WP3 search engine extensions: parallel corpora; treebanks

Description

Progress in studying verbal tense and aspect semantics can be made by applying quantitative corpus methods in the field of semantic micro-typology, in particular by exploiting the possibilities of translation corpora.

What is the research about?

Tense-aspect categories found across languages.

What problem is hindering the research?

The absence of a flexible, open-source, and user-friendly environment for exploring the corpus data.

What is needed to do the research?

We propose extending BlackLab, BlackLab Server, and AutoSearch to support:

  • parallel concordancing
  • extraction of relevant statistics
  • upload of researcher-created parallel data into AutoSearch
  • exploitation of existing parallel corpora
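
To make the first two items concrete, here is a minimal sketch of what parallel concordancing and a simple tense-mapping statistic could look like over sentence-aligned data. It does not use or reflect the BlackLab or AutoSearch APIs. The `Token` class, the tense labels, the `ALIGNED` sample, and both function names are assumptions made for illustration only, and the tense label is placed on the main verb as a simplification.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    tense: str  # hypothetical label such as "PRES_PERF"; "" for untensed tokens


def sent(*tokens):
    """Build a sentence (list of Token) from (form, lemma, tense) triples."""
    return [Token(*t) for t in tokens]


# Hypothetical English-Dutch sample: one (source, target) pair per aligned sentence.
ALIGNED = [
    (sent(("I", "I", ""), ("have", "have", ""), ("seen", "see", "PRES_PERF"),
          ("him", "he", "")),
     sent(("Ik", "ik", ""), ("heb", "hebben", ""), ("hem", "hij", ""),
          ("gezien", "zien", "PRES_PERF"))),
    (sent(("He", "he", ""), ("has", "have", ""), ("lived", "live", "PRES_PERF"),
          ("here", "here", "")),
     sent(("Hij", "hij", ""), ("woont", "wonen", "PRES"), ("hier", "hier", ""))),
    (sent(("She", "she", ""), ("walked", "walk", "PAST"), ("home", "home", "")),
     sent(("Zij", "zij", ""), ("liep", "lopen", "PAST"), ("naar", "naar", ""),
          ("huis", "huis", ""))),
]


def parallel_concordance(pairs, source_tense):
    """Yield (source text, target text, target tenses) for every aligned
    pair whose source sentence contains a token with the given tense."""
    for src, tgt in pairs:
        if any(tok.tense == source_tense for tok in src):
            yield (
                " ".join(tok.form for tok in src),
                " ".join(tok.form for tok in tgt),
                [tok.tense for tok in tgt if tok.tense],
            )


def tense_mapping_counts(pairs, source_tense):
    """Count which target-side tenses co-occur with a source-side tense."""
    counts = Counter()
    for _, _, target_tenses in parallel_concordance(pairs, source_tense):
        counts.update(target_tenses)
    return counts


if __name__ == "__main__":
    for src_text, tgt_text, tenses in parallel_concordance(ALIGNED, "PRES_PERF"):
        print(f"{src_text}  |  {tgt_text}  |  {tenses}")
    print(tense_mapping_counts(ALIGNED, "PRES_PERF"))
```

A real implementation would run such queries against an index rather than an in-memory list, and would rely on word-level alignment rather than sentence-level co-occurrence. The sketch only shows the shape of the result a researcher might expect.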

Data

Parallel UD-enriched corpora (tagging, lemmatization, dependency syntax)

  • created by researchers
  • existing corpora (OPUS, etc.)
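
UD-enriched data is commonly exchanged in the CoNLL-U format, which has ten tab-separated columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). As a rough sketch of how tense information could be read from such a file, the following parser keeps only the columns relevant here. The function names and the `SAMPLE` sentence are hypothetical, and the sketch assumes the corpora are delivered as CoNLL-U, which this use case does not state.

```python
def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": feats,
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


def tensed_verbs(sentence):
    """Return (lemma, Tense) for each verb or auxiliary carrying a Tense feature."""
    return [
        (tok["lemma"], tok["feats"]["Tense"])
        for tok in sentence
        if tok["upos"] in ("VERB", "AUX") and "Tense" in tok["feats"]
    ]


# Hypothetical sample sentence, built with explicit tabs for readability.
SAMPLE = "\n".join([
    "# text = Hij woont hier",
    "\t".join(["1", "Hij", "hij", "PRON", "_",
               "Case=Nom|Person=3|PronType=Prs", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "woont", "wonen", "VERB", "_",
               "Number=Sing|Tense=Pres|VerbForm=Fin", "0", "root", "_", "_"]),
    "\t".join(["3", "hier", "hier", "ADV", "_", "_", "2", "advmod", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        print(tensed_verbs(sentence))
```

The UD `Tense` feature covers morphological tense only, so periphrastic forms such as the perfect would need extra logic that combines auxiliaries with participles.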

Tools

  • an extended version of BlackLab/AutoSearch
  • visualization and analysis tools developed by the Time in Translation group

What software and services are involved?

How to evaluate this?

  • Is the researcher satisfied?

References

References to related resources and publications and especially links to related use-cases: