Welcome to CSTLM

CSTLM is a language model based on compressed suffix trees. It supports an unbounded (infinite) context size and can index terabyte-sized text collections.

Disclaimer

The new multi-threaded extension, which performs parallel construction, has not been tested rigorously and was not used in the published results listed below. For stability and correctness, please check out the older single-threaded version:

git checkout 8163b55fe9a9dfad1d1dbcc89a09d451c5fe217b

References

This code is the basis of the following papers:

Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn: Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees. EMNLP 2015: 2409-2418
Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn: Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees. TACL 2016: 477-490
Ehsan Shareghi, Trevor Cohn, Gholamreza Haffari: Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling. EMNLP 2016: 944-948

Please cite our EMNLP 2015 and TACL 2016 papers if you use our code.

Compile instructions

Check out the repository: https://github.com/eehsan/cstlm.git
git submodule update --init
cd build
cmake ..
make -j
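
The compile steps above call git, cmake and make. As a minimal sketch (the helper name check_build_tools is hypothetical and not part of CSTLM), the following function reports which of those tools are missing from PATH before you start the build. It does not check compiler versions or library dependencies.

```shell
# Hypothetical helper, not part of CSTLM: report any of the given
# commands that cannot be found on PATH. Returns 0 if all are present.
check_build_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The tools used by the compile instructions.
check_build_tools git cmake make || echo "install the missing tools before running cmake"
```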

Run unit tests to ensure correctness

cd build
rm -rf ../collections/unittest/
./create-collection.x -i ../UnitTestData/data/training.data -c ../collections/unittest
./create-collection.x -i ../UnitTestData/data/training.data -c ../collections/unittest -1
./unit-test.x

Usage instructions (Word based language model)

Create collection:

./create-collection.x -i toyfile.txt -c ../collections/toy

Build index (including quantities for modified KN)

./build-index.x -c ../collections/toy/ -m

Query index (here with modified KN and 5-grams; drop -m for KN):

./query-index-knm.x -c ../collections/toy/ -p test.txt -m -n 5
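
The three word-based steps above can be chained so that a failure in one step stops the rest. The sketch below uses only the commands and flags shown in this section. The function name cstlm_word_pipeline and the DRY_RUN variable are hypothetical conveniences, and the sketch assumes it is run from the build directory.

```shell
# Hypothetical wrapper, not part of CSTLM.
# Arguments: training file, test file, collection directory, n-gram order (default 5).
# With DRY_RUN=1 the commands are printed instead of executed.
cstlm_word_pipeline() {
    train=$1
    test_file=$2
    collection=$3
    order=${4:-5}

    # Prefix every command with "echo" in dry-run mode, otherwise run it directly.
    if [ "${DRY_RUN:-0}" = "1" ]; then run=echo; else run=; fi

    # "&&" stops the chain at the first failing step.
    $run ./create-collection.x -i "$train" -c "$collection" &&
    $run ./build-index.x -c "$collection/" -m &&
    $run ./query-index-knm.x -c "$collection/" -p "$test_file" -m -n "$order"
}

# Print the commands for the toy example without running them.
(DRY_RUN=1; cstlm_word_pipeline toyfile.txt test.txt ../collections/toy 5)
```

In dry-run mode with the toy arguments, the function prints the same three commands listed in this section. Without DRY_RUN it executes them in order.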

Usage instructions (Character based language model)

Create collection:

./create-collection.x -i toyfile.txt -c ../collections/toy -1

Build index (including quantities for modified KN)

./build-index.x -c ../collections/toy/ -m

Generalized Modified Kneser-Ney

Check out the MMKN branch of the repository: https://github.com/eehsan/cstlm/tree/MMKN

Moses integration and parallel construction

Check out what Matthias has developed in his repository.
