Update README
fgaim committed Jun 26, 2022
1 parent 112db5a commit 9d4915b
Showing 1 changed file with 20 additions and 11 deletions.
31 changes: 20 additions & 11 deletions README.md
@@ -1,30 +1,39 @@
# Tigrinya Word Count

This compilation of over `0.95 million` unique Tigrinya words and their frequencies. Data collected from various sources across the web, including news websites, blogs, books, newspapers, magazines, etc.
This is a compilation of over `1.15 million` unique Tigrinya words and their frequencies. The source documents were collected from various sources across the web, including newspapers, websites, blogs, books, magazines, etc.

The source documents were deduplicated and the text was preprocessed (normalized and tokenized) before generating the word count. The statistics provided is a good represenation of the language characteristics. Due to Tigrinya's rich morphology, the vocabulary is expected to grow even furthher as more text becomes available for analysis. This repository will be updated from time to time.
The source documents were deduplicated and the text was preprocessed (normalized and tokenized) before generating the word count.
The statistics provided are believed to be a good representation of the language characteristics.

Due to Tigrinya's rich morphology, the vocabulary is expected to grow further as more text becomes available for analysis, and this repository will be updated accordingly.

## Format
* Content file `ti_word_count.txt`
* Each line formed as `<word>TAB<count>`
> From version 3.0.0 onwards, content from social media such as comments on YouTube is not included in the estimation. This is due to the very high ratio of noise observed in those texts.

## Content Format

- File: `ti_word_count.txt`
- Each line is formed as `<word>TAB<count>`
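
As a minimal sketch of reading this format (not part of the repository), the snippet below parses `<word>TAB<count>` lines into a counter. The function name `parse_word_counts`, the UTF-8 encoding, and the choice to skip malformed lines are assumptions made for illustration; the sample tokens are placeholders, not real entries from the file.

```python
from collections import Counter
from typing import Iterable


def parse_word_counts(lines: Iterable[str]) -> Counter:
    """Parse lines shaped like '<word>TAB<count>' into a Counter.

    Blank or malformed lines are skipped (an assumption; the real file
    may not contain any).
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the last tab so the count is always the final field.
        word, sep, count = line.rpartition("\t")
        if not sep or not word:
            continue
        try:
            counts[word] += int(count)
        except ValueError:
            continue  # the count field is not an integer
    return counts


# Placeholder sample lines; real data would come from ti_word_count.txt.
sample = ["word_a\t120\n", "word_b\t45\n", "\n", "no tab here\n"]
sample_counts = parse_word_counts(sample)
print(sample_counts.most_common(1))
```

To read the actual file, pass an open file handle instead of the sample list, for example `parse_word_counts(open("ti_word_count.txt", encoding="utf-8"))`, assuming the file is UTF-8 encoded.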


## Tigrinya Stop-words

A compilation of Tigrinya stop words can be found at `ti_stop_words.txt`.
These are manually currated functional words with over 10k frequency in the data sources.
These are manually curated functional words with over 10k frequency in the data sources.
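
One possible use of the stop-word list is filtering function words out of the word counts. The sketch below assumes `ti_stop_words.txt` holds one word per line, which the README does not state; `load_stop_words` and `remove_stop_words` are hypothetical helper names, and the sample tokens are placeholders.

```python
from typing import Dict, Iterable, Set


def load_stop_words(lines: Iterable[str]) -> Set[str]:
    """Collect stop words, assuming one word per line (an assumption)."""
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(counts: Dict[str, int], stop_words: Set[str]) -> Dict[str, int]:
    """Return a copy of counts without the entries that are stop words."""
    return {word: n for word, n in counts.items() if word not in stop_words}


# Placeholder data; real input would come from the two repository files.
stop_words = load_stop_words(["word_a\n", "\n", "word_c\n"])
counts = {"word_a": 15000, "word_b": 40, "word_c": 12000, "word_d": 7}
content_words = remove_stop_words(counts, stop_words)
print(content_words)
```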

## Stats
- Vocabulary: 953670
- Source tokens: 35216657

## Statistics

- Source tokens: 59,805,568
- Vocabulary: 1,158,871
- Stop words: 182
- See the Tigrinya word frequency and ranking distribution ![alt Zipf's](zipf.png)
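
The figures above can presumably be re-derived from the word counts: the sketch below assumes the vocabulary size is the number of distinct words and the source token count is the sum of all counts, which the README does not confirm. It also builds the rank and frequency pairs that a Zipf-style plot would use. The function names and sample values are illustrative only.

```python
from typing import Dict, List, Tuple


def summarize(counts: Dict[str, int]) -> Tuple[int, int]:
    """Return (vocabulary size, total tokens) under the stated assumptions."""
    return len(counts), sum(counts.values())


def rank_frequency(counts: Dict[str, int]) -> List[Tuple[int, int]]:
    """Pair each frequency with its rank, most frequent word first (rank 1)."""
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


# Placeholder counts, not taken from the data set.
example = {"word_a": 5, "word_b": 3, "word_c": 3}
vocabulary, source_tokens = summarize(example)
pairs = rank_frequency(example)
print(vocabulary, source_tokens, pairs)
```

Plotting the pairs on log-log axes should give a roughly straight line if the distribution follows Zipf's law.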


## Uses
* Licensed under the MIT License, it can be freely used for any purpose with proper attributions.
* If you use this resource in a published work, please cite as follows:

- Distributed under the MIT License. It can be freely used for any purpose with proper attribution to the author.
- If you use this resource in a published work, please cite as follows:

```
Fitsum Gaim, 2017, Tigrinya WordCount, https://github.com/fgaim/Tigrinya-WordCount
```
