
xxxxx This is a mockup site, data and links are empty xxxxx


FAIR Embeddings for Textual Cultural Heritage

Valid and scalable solutions to research questions for textual cultural heritage (TeCH) data depend on access to contextualized embeddings. Word embeddings are abstract, distributed, dense representations of language units (characters, words, phrases) learned by data-intensive representation-learning algorithms implemented as deep neural network architectures. To ensure that DH researchers use state-of-the-art technology for tackling complex TeCH problems, they must have access to pre-trained multi-level embeddings for their respective languages that follow the FAIR principles (findable, accessible, interoperable, and reusable). FAIR Embeddings for Textual Cultural Heritage (FAIR eTeCH) will pioneer FAIR embeddings for the Scandinavian languages which, through a collaboration with national libraries and an innovative use of regulations pertaining to derived data, circumvent restrictions on copyrighted and sensitive data.

One of the greatest challenges for large-scale DH research is access to original or direct data (e.g., the content of a newspaper article) because of copyright restrictions. In Denmark, for instance, a newspaper article in the Danish Mediestream has to be more than a century old before a researcher is granted free data-mining access. Embeddings, however, have the status of derived data, as they do not allow reconstruction of the original data source. Embeddings trained on large newspaper collections are more than adequate for solving problems related to semantic similarity and semantic drift.

Metadata

Level 1 metadata: Data citation


Publication date: January 28, 2019

DOI: 00.0000/zenodo.0000002

License (for files): Creative Commons Attribution 4.0 International



Versions

Version     DOI                      Date
Version 1   00.0000/zenodo.0000001   Jan 28, 2019
Version 2   00.0000/zenodo.0000002   May 23, 2019

Cite all versions? You can cite all versions by using the DOI 00.0000/zenodo.0000000. This DOI represents all versions and will always resolve to the latest one.



Share

[!LINKS HERE for export to social media!]

Cite as

@misc{nielbo_2019_0000002,
author = {Nielbo and Gerdes and Møldrup-Dalum},
title = {{FAIR Embeddings for Textual Cultural Heritage v2 [dataset]}},
month = jan,
year = 2019,
doi = {00.0000/zenodo.0000002},
url = {https://doi.org/00.0000/zenodo.0000002}
}



Export

BibTeX, DataCite, JSON, Mendeley


Level 2 metadata: Domain-specific requirements

Data file dk-news-1919-2001-300d-1M contains one million word vectors trained on Danish newspapers in Mediestream from 1919 to 2001. Each word is embedded in a 300-dimensional space. The first line of the file gives the vocabulary size and the vector dimensionality. Each subsequent line contains a word followed by its vector values, separated by spaces. Words are ordered by descending frequency.
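For illustration, the file follows the layout sketched below. The words and values shown are hypothetical placeholders, not actual entries from the data file, and the ellipses stand for the remaining vector components:

1000000 300
og 0.0123 -0.0047 ... 0.0310
avis -0.0256 0.0198 ... -0.0072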

[!Standards for level 2 & 3 metadata are developed by the Nordic Digital Humanities Laboratory in order to facilitate interoperability and reusability in the Nordic region!]

Level 3 metadata: Fine-level requirements

[!e.g., fine-grained corpus specifications, model specifications, OCR accuracy per decade!]
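As a rough illustration only, a level 3 record could bundle such specifications as structured fields. All field names below are hypothetical; the actual schema will be defined by the Nordic Digital Humanities Laboratory standards mentioned above:

level3_metadata = {
    "corpus": {
        "source": "Mediestream",    # hypothetical field layout
        "period": [1919, 2001],
        "genre": "newspaper",
    },
    "model": {
        "dimensions": 300,          # from the level 2 description above
        "vocabulary_size": 1000000,
    },
    "ocr_accuracy_per_decade": {
        "1919-1928": None,          # placeholder; one entry per decade
    },
}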

Machine-accessible interface

Download pre-trained embeddings

git clone https://github.com/knielbo/fair-etech.git
cd fair-etech
unzip dk-news-1919-2001-300d-1M.zip  # unpack the embedding archive (named after the data file described above)

Load text models in Python

The text models can easily be loaded in Python using the following code:

import io

def load_vectors(filename):
    # the first line gives the vocabulary size and vector dimensionality,
    # as described under level 2 metadata above
    with io.open(filename, 'r', encoding='utf-8', newline='\n', errors='ignore') as f:
        n, d = map(int, f.readline().split())
        data = {}
        for line in f:
            tokens = line.rstrip().split(' ')
            # store the vector as a list of floats; in Python 3 a bare map
            # object is a one-shot iterator and would be exhausted after a single pass
            data[tokens[0]] = [float(t) for t in tokens[1:]]
    return data

vectors = load_vectors('dk-news-1919-2001-300d-1M.vec')
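Once loaded, the vectors support the semantic-similarity queries mentioned above via cosine similarity. The sketch below uses only the standard library; the query words 'avis' and 'presse' are hypothetical examples and assume both occur in the vocabulary:

import math

def cosine_similarity(u, v):
    # cosine of the angle between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(vectors['avis'], vectors['presse']))

Alternatively, a .vec file in this format can be loaded with gensim's KeyedVectors.load_word2vec_format(filename, binary=False), which provides similarity queries out of the box.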
