Word2Vec

Julia interface to word2vec

Word2Vec takes a text corpus as input and produces word vectors as output. Training is done using the original C code; all other functionality is implemented in pure Julia. See demo for more details.

Release Notes

Installation

Pkg.add("Word2Vec")

Note: Only Linux and OS X are supported.

Functions

All exported functions are documented, so typing ? functionname at the Julia REPL shows the help for that function. For a list of functions, see here.

Examples

First, download a text corpus, for example http://mattmahoney.net/dc/text8.zip.
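The download step can also be scripted from Julia. The following is a minimal sketch, not part of the package: the helper name fetch_text8 is hypothetical, and it assumes Julia 1.6 or later (for the Downloads standard library) and an unzip executable on the PATH.

```julia
using Downloads  # standard library in Julia 1.6 and later

# Hypothetical helper: download the zipped corpus and unpack it into `dir`.
# Assumes an `unzip` executable is available on the PATH.
function fetch_text8(dir::AbstractString = pwd())
    url = "http://mattmahoney.net/dc/text8.zip"
    zip_path = joinpath(dir, "text8.zip")
    Downloads.download(url, zip_path)      # save the archive to zip_path
    run(`unzip -o $zip_path -d $dir`)      # extract, overwriting existing files
    return joinpath(dir, "text8")          # path of the extracted corpus
end
```

Calling fetch_text8() with no arguments places the extracted text8 file in the current working directory.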

Suppose the file text8 is stored in the current working directory. We can train the model with the function word2vec.

julia> word2vec("text8", "text8-vec.txt", verbose = true)
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.04%  Words/thread/sec: 350.44k

Now we can import the word vectors from text8-vec.txt into Julia.

julia> model = wordvectors("./text8-vec.txt")
WordVectors 71291 words, 100-element Float64 vectors
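For readers curious about what is being imported, the text output of word2vec starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word with its components. The sketch below parses that layout by hand. It only illustrates the file format and is not the package's loader: read_text_vectors is a hypothetical name, and the sample file contents are made up.

```julia
# Parse the plain-text word2vec format: a header line "vocab_size dim",
# then one line per word: the word followed by `dim` floats.
function read_text_vectors(path::AbstractString)
    open(path) do io
        header = split(readline(io))
        vocab_size = parse(Int, header[1])
        dim = parse(Int, header[2])
        vocab = Vector{String}(undef, vocab_size)
        vectors = Matrix{Float64}(undef, dim, vocab_size)  # one column per word
        for i in 1:vocab_size
            fields = split(readline(io))
            vocab[i] = String(fields[1])
            vectors[:, i] = parse.(Float64, fields[2:dim+1])
        end
        return vocab, vectors
    end
end

# Tiny hand-written file standing in for text8-vec.txt (values are made up).
path = tempname()
write(path, "2 3\nbook 0.1 0.2 0.3\nnovel 0.4 0.5 0.6\n")
vocab, vectors = read_text_vectors(path)
```

Storing one word per column keeps each word vector contiguous in memory, since Julia arrays are column-major.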

The vector representation of a word can be obtained using get_vector.

julia> get_vector(model, "book")
100-element Array{Float64,1}:
 -0.05446138539336186
  0.001090934639284009
  0.06498087707990222
  ⋮
 -0.0024113040415322516
  0.04755140828570571
  0.039764719065723826

The words most similar to a given word, for example book, ranked by cosine similarity, can be found using cosine_similar_words.

julia> cosine_similar_words(model, "book")
10-element Array{String,1}:
 "book"
 "books"
 "diary"
 "story"
 "chapter"
 "novel"
 "preface"
 "poem"
 "tale"
 "bible"
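The ranking above can be illustrated with a small self-contained sketch. It is not the package's implementation: the names cosine and most_similar are hypothetical, and the two-dimensional vectors are invented so the result is easy to check by hand.

```julia
using LinearAlgebra  # dot, norm

# Cosine similarity between two vectors.
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Return the n words whose vectors are closest to the query word's vector.
function most_similar(vocab, vectors, word; n = 3)
    q = vectors[:, findfirst(==(word), vocab)]
    scores = [cosine(q, vectors[:, j]) for j in eachindex(vocab)]
    order = sortperm(scores, rev = true)   # highest similarity first
    return vocab[order[1:min(n, length(order))]]
end

# Toy 2-dimensional vectors, invented for illustration.
toy_vocab = ["book", "novel", "car"]
toy_vectors = [1.0 0.9 0.0;
               0.1 0.3 1.0]   # one column per word

most_similar(toy_vocab, toy_vectors, "book"; n = 2)
```

As in the output above, the query word itself ranks first because its similarity to itself is 1.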

Word vectors have many interesting properties. For example, vector("king") - vector("man") + vector("woman") is close to vector("queen"), which can be checked with analogy_words.

julia> analogy_words(model, ["king", "woman"], ["man"])
5-element Array{String,1}:
 "queen"
 "empress"
 "prince"
 "princess"
 "throne"
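The vector arithmetic behind this result can be sketched on toy data. This is an illustration under stated assumptions rather than the package's code: toy_analogy is a hypothetical name, the vectors are invented, and excluding the query words from the candidates is a choice made for this sketch.

```julia
using LinearAlgebra  # dot, norm

# Rank words by cosine similarity to sum(pos) - sum(neg).
function toy_analogy(vocab, vectors, pos, neg; n = 1)
    col(w) = vectors[:, findfirst(==(w), vocab)]
    target = sum(col.(pos)) - sum(col.(neg))
    # Leave out the words used in the query itself.
    candidates = [w for w in vocab if !(w in pos) && !(w in neg)]
    scores = [dot(target, col(w)) / (norm(target) * norm(col(w))) for w in candidates]
    order = sortperm(scores, rev = true)   # highest similarity first
    return candidates[order[1:min(n, length(order))]]
end

# Toy 2-dimensional vectors, invented so that king - man + woman equals queen.
toy_vocab = ["king", "man", "woman", "queen", "apple"]
toy_vectors = [1.0 0.0  0.0  1.0 -1.0;
               1.0 1.0 -1.0 -1.0  0.2]   # one column per word

toy_analogy(toy_vocab, toy_vectors, ["king", "woman"], ["man"])
```

With these vectors the target is (1, -1), which coincides with the toy vector for queen, so queen is the top-ranked candidate.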

References

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", In Proceedings of Workshop at ICLR, 2013.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean, "Distributed Representations of Words and Phrases and their Compositionality", In Proceedings of NIPS, 2013.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig, "Linguistic Regularities in Continuous Space Word Representations", In Proceedings of NAACL HLT, 2013.

Acknowledgements

The design of the package is inspired by Daniel Rodriguez (@danielfrg)'s Python word2vec interface.

Reporting Bugs

Please file an issue to report a bug or request a feature.
