[Feat] Allow custom corpus for statistics #205
Comments
{
'symbols': 100.002,
'digrams': 79.7713,
'trigrams': 60.166000000000004
}
based on this quick script:
#!/usr/bin/env python3
import json

with open('file.json') as handle:
    corpus = json.loads(handle.read())
print(corpus)
count = {}
for n in ["symbols", "digrams", "trigrams"]:
    count[n] = sum(corpus[n].values())
print(count)
I will write a new version of the script that is not standalone but called by kalamine.
I plan on using appdirs 1.4.4 to manage the data (corpus folder location), if that is OK with you; or do you have a better suggestion?
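For illustration, here is a minimal sketch of how a per-user corpus folder could be resolved with appdirs. The function name `corpus_dir`, the `"corpus"` subfolder and the fallback path are assumptions made for this example, not part of Kalamine; only `appdirs.user_data_dir` is an actual appdirs call.

```python
from pathlib import Path

try:
    import appdirs  # third-party: pip install appdirs
except ImportError:  # fallback so the sketch still runs without appdirs
    appdirs = None


def corpus_dir(app_name: str = "kalamine") -> Path:
    """Return a per-user folder for corpus files (hypothetical layout)."""
    if appdirs is not None:
        # Platform-specific data folder, e.g. under ~/.local/share on Linux.
        base = Path(appdirs.user_data_dir(app_name))
    else:
        # Assumed fallback for this sketch only.
        base = Path.home() / f".{app_name}"
    return base / "corpus"


print(corpus_dir())
```

The sketch only computes the path; creating the folder would be left to whichever command first writes a corpus.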
Parsing a txt file to get n-grams (here up to quadrigrams, but this can be tuned).
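As a rough sketch of that parsing step, character n-grams up to a tunable length can be counted with a sliding window. The names `count_ngrams` and `to_percentages` are hypothetical, and this is not the code from the fork; the actual corpus files may also filter or normalise characters differently.

```python
from collections import Counter


def count_ngrams(text: str, max_n: int = 4) -> dict:
    """Count all character n-grams of length 1..max_n in `text`."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[n][text[i:i + n]] += 1
    return counts


def to_percentages(counter: Counter) -> dict:
    """Convert raw counts to percentages of the total count."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {gram: 100 * c / total for gram, c in counter.items()}


counts = count_ngrams("hello world", max_n=4)
print(counts[1].most_common(1))  # [('l', 3)]
```

In practice the text would be read from the txt file first, and `max_n` would be the tuning knob mentioned above.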
I started work on my fork, on the corpus branch.
Future steps are
@fabi1cazenave wdyt? Also, any opinion on:
Hi @Ced-C, sorry for the lag, it’s been wild here. First thing, please create a first PR if you intend to change the JSON format for our corpora. This format is used by Kalamine of course, but also by DuckTypist and the Ergo‑L analyzer. In other words, it’s a breaking change, and this means we’ll have to publish Kalamine 0.39 before switching to this new file format. I find it a bit depressing that Python cannot store data to
Goal
Be able to add/rm corpus with kalamine.
These corpuses will then be available in the kalamine watch analyser. This will allow other languages to use the kalamine analyser.
Approach
kalamine corpus add foo.txt
kalamine watch server: a page, e.g. /corpus, to add/rm corpuses using files
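The command-line half of this approach could be sketched as follows. This is an illustration using the standard library's argparse, not Kalamine's actual CLI code; the `rm` subcommand, the `build_parser` and `run` helpers, and the choice to copy the file into a corpus folder are all assumptions for the example.

```python
import argparse
import shutil
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting `corpus add FILE` and `corpus rm NAME`."""
    parser = argparse.ArgumentParser(prog="kalamine")
    commands = parser.add_subparsers(dest="command", required=True)
    corpus = commands.add_parser("corpus", help="manage custom corpora")
    actions = corpus.add_subparsers(dest="action", required=True)
    add = actions.add_parser("add", help="register a text file as a corpus")
    add.add_argument("file", type=Path)
    rm = actions.add_parser("rm", help="remove a corpus by file name")
    rm.add_argument("name")
    return parser


def run(args: argparse.Namespace, corpus_dir: Path) -> Path:
    """Apply the parsed action inside `corpus_dir` and return the target path."""
    corpus_dir.mkdir(parents=True, exist_ok=True)
    if args.action == "add":
        target = corpus_dir / args.file.name
        shutil.copy(args.file, target)  # keep a private copy of the source text
        return target
    target = corpus_dir / args.name
    target.unlink(missing_ok=True)  # removing an absent corpus is not an error
    return target
```

With this sketch, `build_parser().parse_args(["corpus", "add", "foo.txt"])` yields the arguments for the `kalamine corpus add foo.txt` call proposed above; a page on the `kalamine watch` server could reuse the same `run` logic.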