EncodingEstimator: Detect encoding of strings

This gem detects the encoding of strings and files based on their content, which is useful when you load data from sources with unknown encodings. It uses character distribution statistics to determine which candidate encoding yields the most plausible text.

Usage in Ruby Code

The gem has two major high-level methods. Use the first one when you want to know how a string is encoded:

detection = EncodingEstimator.detect( File.read( 'foo.txt' ), languages: [ :en, :de ] )
puts "Encoding: #{detection.result.encoding}"

The second one is a shortcut for when you just want a string of unknown encoding converted to a UTF-8 encoded string (UTF-8 should be the Ruby default):

utf8_txt = EncodingEstimator.ensure_utf8( File.read( 'foo.txt' ), languages: [ :en, :de ] )

More detailed tutorials can be found here.

If you need more control over the operations to perform, just have a look at EncodingEstimator::Detector and EncodingEstimator::Conversion.

Installation

Add this line to your application's Gemfile:

gem 'encoding_estimator'

And then execute:

$ bundle

Or install it yourself as:

$ gem install encoding_estimator

Note: if you want to use the multithreaded versions of the algorithms, please also install the parallel and ruby-progressbar gems.
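Since the two gems are optional, a script may want to check for them before asking for multiple threads. The following is a minimal sketch using only RubyGems itself; the helper name `gem_installed?` and the thread count of 4 are made up for illustration and are not part of this gem.

```ruby
require 'rubygems'

# Hypothetical helper (not provided by encoding_estimator): returns true
# if a gem with the given name is installed for the current Ruby.
def gem_installed?(name)
  Gem::Specification.find_by_name(name)
  true
rescue Gem::LoadError
  # Raised (as Gem::MissingSpecError on current RubyGems) when no such gem exists.
  false
end

optional = %w[parallel ruby-progressbar]
missing  = optional.reject { |name| gem_installed?(name) }

# Fall back to single-threaded operation (0 threads) if anything is missing.
threads = missing.empty? ? 4 : 0
puts "Missing optional gems: #{missing.join(', ')}" unless missing.empty?
puts "Threads to request: #{threads}"
```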

Command line utilities

This gem provides two command line utilities: encest-detect and encest-gen.

encest-detect

This tool detects the encoding of files. It offers command line options that you should use whenever you know more about a file (e.g. which languages it could be written in or which encodings it could have).

usage: encest-detect [options]
    --encodings, -e   Encodings to test (default: iso-8859-1,utf-16le,windows-1251)
    --operations, -o  Operations (enc/dec) to test (default: dec)
    --languages, -l   Language profiles to apply (default: en,de)
    --threads, -t     Number of threads to use (0 to disable multithreading, default)
    --help, -h        Display help
    other arguments: files to parse

Please note that the -l argument accepts the two-letter codes of the included language profiles as well as paths to language model files. Such files can be generated with encest-gen.

The output might look like this:

$ encest-detect -l en,de,fr */*.txt
de/iso-8859-1.txt: dec_iso-8859-1
    keep_utf-8: 0.9983638601518013
    dec_iso-8859-1: 1.0
    dec_utf-16le: 0.0
    dec_windows-1251: 0.9984215377764598
en/utf-16le.txt: dec_utf-16le
    keep_utf-8: 0.0
    dec_iso-8859-1: 0.3981167811176304
    dec_utf-16le: 1.0
    dec_windows-1251: 0.005410547626031029
fr/utf-8.txt: keep_utf-8
    keep_utf-8: 1.0
    dec_iso-8859-1: 0.9957726010451553
    dec_utf-16le: 0.0
    dec_windows-1251: 0.9957810888135232
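If you want to post-process this output in a script, the format shown above is easy to parse. The sketch below assumes the layout is exactly as in the example (an unindented `file: winner` line followed by indented `candidate: score` lines); `parse_encest_output` is a hypothetical helper, not something the gem ships.

```ruby
# Hypothetical helper: turns encest-detect output (layout assumed from the
# example above) into { file => { winner: String, scores: { name => Float } } }.
def parse_encest_output(text)
  results = {}
  current = nil
  text.each_line do |line|
    line = line.chomp
    next if line.strip.empty? || line.start_with?('$')

    if line.start_with?(' ')
      # Indented line: "    <candidate>: <score>"
      name, score = line.strip.split(': ', 2)
      results[current][:scores][name] = Float(score) if current
    else
      # Header line: "<file>: <winning candidate>"
      file, winner = line.rpartition(': ').values_at(0, 2)
      current = file
      results[file] = { winner: winner, scores: {} }
    end
  end
  results
end

sample = <<~OUT
  de/iso-8859-1.txt: dec_iso-8859-1
      keep_utf-8: 0.9983638601518013
      dec_iso-8859-1: 1.0
      dec_utf-16le: 0.0
      dec_windows-1251: 0.9984215377764598
  en/utf-16le.txt: dec_utf-16le
      keep_utf-8: 0.0
      dec_iso-8859-1: 0.3981167811176304
      dec_utf-16le: 1.0
      dec_windows-1251: 0.005410547626031029
OUT

parsed = parse_encest_output(sample)
parsed.each do |file, result|
  # A small margin between the two best scores signals an ambiguous result.
  best, second = result[:scores].values.sort.reverse.first(2)
  puts format('%s: %s (margin to runner-up: %.4f)', file, result[:winner], best - second)
end
```

In the example, the German ISO-8859-1 file wins only narrowly over windows-1251, while the UTF-16LE file is unambiguous. Restricting the candidate encodings or languages is the obvious way to reduce such near-ties.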

encest-gen

This tool generates the language models used by the encest-detect tool (and by the other classes in this gem). The language models are very simple JSON files that look something like this:

{"W":0.222539,"ä":0.288427,"-":0.513657,"Z":0.118473 ... }

The encest-gen command generates these scores from a large amount of input text. The language models shipped with this gem were generated from Wikipedia dumps, but you can use any (UTF-8-encoded) text files you like. Create a directory, let's call it pt (for Portuguese), and extract the files you want to learn the language model from (e.g. the Wikipedia dump) into it. Please split large files into smaller chunks of text (max ~20MiB); otherwise Ruby will crash with NoMemoryError and you won't see a progress bar.

Usage of encest-gen is quite simple:

usage: encest-gen [options]
    --threshold, -t  Minimum character count threshold to include a char in the model (default: 0.00001)
    --threads, -n    Number of threads used to process the files (default: 4)
    --silent, -s     Disable progressbars and other outputs
    --help, -h       Display help
    other arguments: lang1=directory1 ... langN=directoryN

So for our Portuguese language model on an 8-core machine, we call:

encest-gen -n 8 pt=/path/to/the/directory/with/text

The command will produce a file called pt.json, which is your new language model.
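To make the idea of a per-character model more concrete, here is a minimal sketch that builds a character-to-score hash from a text and serializes it as JSON. It is an illustration only: the scoring used here (plain relative frequency) and the helper name `build_char_model` are assumptions, and encest-gen may compute its scores differently.

```ruby
require 'json'

# Illustrative only: maps each character to its relative frequency and
# drops characters rarer than `threshold`. Not the gem's actual formula.
def build_char_model(text, threshold: 0.00001)
  counts = text.each_char.each_with_object(Hash.new(0)) { |char, acc| acc[char] += 1 }
  total  = counts.values.sum.to_f
  counts
    .map { |char, count| [char, count / total] }
    .select { |_char, freq| freq >= threshold }
    .to_h
end

model = build_char_model('Olá, mundo! Olá, Brasil!')
json  = JSON.generate(model)
puts json

# Such a file can be read back with the standard JSON library.
reloaded = JSON.parse(json)
puts reloaded.fetch('á')
```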

How it works

This gem uses a statistical approach to determine the encoding of an input string. It interprets the input in each of the encodings under test and compares the resulting character distribution against one or more language models. The detector then returns a likelihood score for every encoding.
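The general idea can be sketched in a few lines of plain Ruby. This is not the gem's implementation: the toy model weights, the helper name `score_as` and the averaging and normalization steps are all assumptions chosen to keep the example small.

```ruby
# Toy language model with made-up weights: character => score.
TOY_MODEL = { 'e' => 1.0, ' ' => 0.9, 'n' => 0.8, 'ü' => 0.5, 'ß' => 0.4 }.freeze

# Hypothetical helper: interpret the raw bytes as `encoding` and return the
# average model score of the resulting characters (0.0 if undecodable).
def score_as(raw, encoding, model)
  text = raw.dup.force_encoding(encoding)
  return 0.0 unless text.valid_encoding?

  utf8 = text.encode(Encoding::UTF_8)
  return 0.0 if utf8.empty?

  utf8.each_char.sum(0.0) { |char| model.fetch(char, 0.0) } / utf8.length
rescue EncodingError
  # Covers undefined conversions and missing converters.
  0.0
end

# Bytes of a German phrase stored as ISO-8859-1, with the encoding tag removed.
raw = 'Grüße und Küsse'.encode('ISO-8859-1').b

candidates = %w[UTF-8 ISO-8859-1 UTF-16LE Windows-1251]
scores     = candidates.map { |enc| [enc, score_as(raw, enc, TOY_MODEL)] }.to_h

# Normalize so that the best candidate gets 1.0.
max        = scores.values.max
normalized = scores.transform_values { |s| max.zero? ? 0.0 : s / max }
best       = normalized.max_by { |_enc, s| s }.first

normalized.each { |enc, s| puts format('%-13s %.3f', enc, s) }
puts "Best guess: #{best}"
```

Here the bytes are invalid as UTF-8 and as UTF-16LE, so both score zero. Windows-1251 decodes without errors, but the umlauts turn into Cyrillic letters that the toy model does not know, so ISO-8859-1 comes out ahead.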

Supported languages

Currently, the gem has support for 10 languages: English, German, French, Spanish, Russian, Portuguese, Greek, Turkish, Chinese and Arabic. The language profiles were generated from Wikipedia dumps. You can generate your own language profiles using the encest-gen tool. For more information on this tool, see above.

Supported encodings

The gem supports all encodings your Ruby implementation supports. Note, however, that every additional encoding in the list of encodings to test slows down the detection process.
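To see which encoding names your Ruby build knows, you can ask the built-in Encoding class. The sketch below uses only core Ruby; `supported_encoding?` is a hypothetical helper and `not-an-encoding` is a deliberately invalid name.

```ruby
# All encoding names (including aliases) known to this Ruby build.
puts "Known encoding names: #{Encoding.name_list.length}"

# Hypothetical helper: true if Ruby recognizes the encoding name.
def supported_encoding?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  # Encoding.find raises ArgumentError for unknown names.
  false
end

requested       = %w[iso-8859-1 utf-16le windows-1251 not-an-encoding]
usable, unknown = requested.partition { |name| supported_encoding?(name) }

puts "Usable:  #{usable.join(', ')}"
puts "Unknown: #{unknown.join(', ')}"
```

Filtering the candidate list this way keeps it short, which also helps with detection speed.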

License

The gem is available as open source under the terms of the MIT License.