This is a compressed suffix tree based language model with infinite context size, capable of indexing terabyte-sized text collections.
The new multi-threaded extension, which performs parallel construction (and was not used in the published results listed below), has not been tested rigorously. For stability and correctness, please check out the older single-threaded version:
git checkout 8163b55fe9a9dfad1d1dbcc89a09d451c5fe217b
This code is the basis of the following papers:
- Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn: Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees. EMNLP 2015: 2409-2418
- Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn: Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees. TACL 2016: 477-490
- Ehsan Shareghi, Trevor Cohn, Gholamreza Haffari: Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling. EMNLP 2016: 944-948
Please cite our EMNLP 2015 and TACL 2016 papers if you use our code.
- Check out the repository:
https://github.com/eehsan/cstlm.git
git submodule update --init
cd build
cmake ..
make -j
cd build
rm -rf ../collections/unittest/
./create-collection.x -i ../UnitTestData/data/training.data -c ../collections/unittest
./create-collection.x -i ../UnitTestData/data/training.data -c ../collections/unittest -1
./unit-test.x
Create collection:
./create-collection.x -i toyfile.txt -c ../collections/toy
Build index (including the quantities needed for modified KN):
./build-index.x -c ../collections/toy/ -m
Query index (e.g., modified KN with 5-grams; drop -m for standard KN):
./query-index-knm.x -c ../collections/toy/ -p test.txt -m -n 5
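The three steps above can be chained into a single script. The following is only a minimal sketch, not part of the repository: it assumes it is run from the build directory, and the names run, run_pipeline and DRY_RUN are hypothetical helpers introduced here for illustration. It uses only the commands and flags shown above. By default it just prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: create a collection, build the index, and query it in one go.
# Assumption: run from the build directory, where the *.x binaries live.
set -eu

# Hypothetical helper: print a command, and execute it only if DRY_RUN=0.
run() {
    printf '%s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Hypothetical helper: the three README steps for one training/test pair.
run_pipeline() {
    train=$1     # training text, e.g. toyfile.txt
    col=$2       # collection directory, e.g. ../collections/toy/
    queries=$3   # text to score, e.g. test.txt
    order=$4     # n-gram order, e.g. 5
    run ./create-collection.x -i "$train" -c "$col"
    run ./build-index.x -c "$col" -m
    run ./query-index-knm.x -c "$col" -p "$queries" -m -n "$order"
}

# Dry run with the toy values used above.
run_pipeline toyfile.txt ../collections/toy/ test.txt 5
```

Keeping -m in both the build and query steps matches the modified KN setting shown above; for standard KN, the -m on the query command would be dropped.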
Create collection:
./create-collection.x -i toyfile.txt -c ../collections/toy -1
Build index (including the quantities needed for modified KN):
./build-index.x -c ../collections/toy/ -m
- Check out the MMKN branch of the repository:
https://github.com/eehsan/cstlm/tree/MMKN
Check out what Matthias has developed in his repository.