This is an Akka-based microservice that acts as a dictionary for multilingual word embeddings.
The service runs in two modes:
- embeddings conversion
- dictionary
The first mode is activated by setting the environment variable DO_CONVERSION=true. It converts the embeddings from csv to parquet files, after which the service terminates.
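A minimal sketch of the reading half of such a conversion, written in Python for brevity (the service itself runs on the JVM). It assumes the .vec files use the common text layout for word embeddings: a header line with the word count and the dimension, then one word per line followed by its space-separated components. That layout, the name read_vec and the sample data are assumptions, not taken from this repository. Writing the rows to parquet is left out because it needs a third-party library.

```python
import io


def read_vec(stream):
    """Yield (word, vector) pairs from a text stream in the assumed .vec layout."""
    # Header line: "<number of words> <dimension>"
    _count, dim = (int(x) for x in stream.readline().split())
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue  # skip empty or malformed rows
        # The last `dim` fields are the components; whatever precedes them is the word.
        word = " ".join(parts[:-dim])
        yield word, [float(x) for x in parts[-dim:]]


# Hypothetical two-word sample, invented for illustration
SAMPLE = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
rows = list(read_vec(SAMPLE))
```

Taking the components from the end of the line keeps the parser working if a token itself contains a space.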
All embeddings must reside in /data/ft/ (a mounted volume) and have filenames of the form wiki.<language>_en.vec (e.g. wiki.fr_en.vec), except for English, which is wiki.en.vec.
The reason for these names is that the vectors for all languages other than English are aligned with the English ones.
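The naming rule can be expressed as a small helper. This is an illustrative Python sketch, not code from this repository; the function name embedding_path is hypothetical.

```python
from pathlib import PurePosixPath

# Mounted volume holding the embeddings, as stated in this README
DATA_DIR = PurePosixPath("/data/ft")


def embedding_path(language):
    """Return the expected embedding file for a language code such as 'fr'."""
    # English is the reference space; every other language is aligned to it.
    name = "wiki.en.vec" if language == "en" else f"wiki.{language}_en.vec"
    return DATA_DIR / name


print(embedding_path("fr"))  # /data/ft/wiki.fr_en.vec
print(embedding_path("en"))  # /data/ft/wiki.en.vec
```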
See scripts/download.sh and scripts/rename.sh to download the raw vectors from Amazon AWS.
The second mode, dictionary, is the main mode of execution. The service provides:
- vector dictionary for all supported languages
- synonyms (k-nn)
- analogies (king - man + woman ~= queen; also k-nn)
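Both synonyms and analogies reduce to a nearest-neighbour search over the vectors. The following Python sketch illustrates the idea with cosine similarity on toy data. The function names, the choice of cosine similarity and the 2-d vectors are assumptions made for illustration; the service's actual implementation may differ.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def nearest(query, vectors, k=1, exclude=()):
    """Return the k words whose vectors have the highest cosine similarity to query."""
    q = normalize(query)
    scored = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append((sim, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


def analogy(a, b, c, vectors, k=1):
    """Solve a - b + c ~= ?, e.g. king - man + woman."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded so they cannot be returned as the answer.
    return nearest(query, vectors, k, exclude={a, b, c})


# Toy 2-d vectors, invented for illustration: axis 0 ~ royalty, axis 1 ~ gender
TOY = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "monarch": [1.0, 0.9],
}

print(analogy("king", "man", "woman", TOY))             # ['queen']
print(nearest(TOY["king"], TOY, exclude={"king"}))      # ['monarch']
```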
Check com.haufe.umantis.ds.embdict.messages for the available query messages.
sbt docker:publishLocal
docker-compose build
docker-compose up -d
The dictionary service needs access to the word vectors in /data/ft through a mounted volume. Ports 5150 and 5151 need to be exposed.
OPENBLAS_NUM_THREADS: 1
LANGUAGES: ${LANGUAGES:-ar,bg,ca,cs,da,de,el,en,es,et,fi,fr,he,hr,hu,id,it,mk,nl,no,pl,pt,ro,ru,sk,sl,sv,tr,uk,vi}
DICTIONARY_SIZE: ${DICTIONARY_SIZE:-all}
DO_CONVERSION: ${DO_CONVERSION:-false}
RESCALE_VECTORS: ${RESCALE_VECTORS:-true}
DICT_TYPES: ${DICT_TYPES:-parallel}
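The ${NAME:-default} form used above falls back to the default when the variable is unset or empty. A small Python sketch of that substitution rule, for illustration only; the helper name resolve is hypothetical and this is not how the service itself reads its settings.

```python
import os


def resolve(name, default, env=None):
    """Mimic ${NAME:-default}: use the default when NAME is unset or empty."""
    env = os.environ if env is None else env
    return env.get(name) or default


print(resolve("DO_CONVERSION", "false", {}))                         # false
print(resolve("DO_CONVERSION", "false", {"DO_CONVERSION": "true"}))  # true
print(resolve("DICT_TYPES", "parallel", {"DICT_TYPES": ""}))         # parallel
```

In practice this means that running the service with DO_CONVERSION=true set in the shell switches it to conversion mode, while leaving the variable unset keeps the dictionary mode.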
This service uses akka-kryo-serialization instead of the classic Java serializer to serialize messages, for performance reasons.
If new messages need to be added, they must be explicitly listed in the configuration file. Check src/main/resources/kryo_serializer.conf for further information.
This software is released under the terms of the GNU GPL 3 License. It was developed by Nicola Bova at Haufe-Umantis, Barcelona, Spain.
Email: nicola dot bova at gmail dot com