[Feat] Allow custom corpus for statistics #205

Open
Ced-C opened this issue Dec 11, 2024 · 5 comments

@Ced-C
Contributor

Ced-C commented Dec 11, 2024

Goal

Be able to add/rm corpuses with kalamine.
These corpuses will then be available in the kalamine watch analyser.
This will allow other languages to use the kalamine analyser.

Approach

  1. add/rm a corpus from kalamine using the CLI as a PoC, e.g. kalamine corpus add foo.txt
  2. add a page to the kalamine watch server, e.g. /corpus, to add/rm corpuses using files
  3. add the possibility to paste a corpus into a text area on the /corpus page
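
A minimal sketch of what step 1 could look like on the command line. It uses argparse only to stay self-contained; the real kalamine CLI may be built with a different library, and the subcommand and argument names (`corpus`, `add`, `rm`, `path`, `name`) are assumptions taken from the example above, not an existing kalamine API.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `kalamine corpus add|rm` command line."""
    parser = argparse.ArgumentParser(prog="kalamine")
    commands = parser.add_subparsers(dest="command", required=True)

    # Hypothetical `corpus` command group with two actions.
    corpus = commands.add_parser("corpus", help="manage custom corpuses")
    actions = corpus.add_subparsers(dest="action", required=True)

    add = actions.add_parser("add", help="register a text file as a corpus")
    add.add_argument("path", help="text file to analyse, e.g. foo.txt")

    rm = actions.add_parser("rm", help="remove a registered corpus")
    rm.add_argument("name", help="name of the corpus to remove")

    return parser


# Parse the example invocation from step 1.
args = build_parser().parse_args(["corpus", "add", "foo.txt"])
print(args.command, args.action, args.path)
```

Nested subparsers keep add and rm under a single corpus command, so a later merge action would only need one more `add_parser` call.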
@Ced-C
Contributor Author

Ced-C commented Dec 11, 2024

The chardict.py stats seem a bit off: the frequencies do not all add up to 100 %.

{
   'symbols': 100.002,
   'digrams': 79.7713,
   'trigrams': 60.166000000000004
}

This is based on the following quick script:

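#!/usr/bin/env python3
import json

# Load the corpus statistics produced by chardict.py.
with open('file.json', encoding='utf-8') as handle:
    corpus = json.load(handle)
print(corpus)

# Sum the frequencies (in percent) of each n-gram category.
count = {}
for n in ["symbols", "digrams", "trigrams"]:
    count[n] = sum(corpus[n].values())

print(count)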

I will write a new version of the script that is not standalone but is called by kalamine.

@Ced-C
Contributor Author

Ced-C commented Dec 11, 2024

I plan on using appdirs 1.4.4 to manage the data (corpus folder location), if that is OK with you. Or do you have a better suggestion?

Ced-C pushed a commit to Ced-C/kalamine that referenced this issue Dec 12, 2024
Parsing txt file to get ngrams (here up to quadrigrams, but can be
tuned).
Ced-C pushed a commit to Ced-C/kalamine that referenced this issue Dec 12, 2024
@Ced-C
Contributor Author

Ced-C commented Dec 12, 2024

I started working on my fork, on the corpus branch.

Future steps are:

  • adding proper cli support (add, rm, merge)
  • adding std corpus to user_config_dir/corpuses by default at install
  • kalamine watch to look for corpus in user_config_dir/corpuses by default

@fabi1cazenave wdyt?

Also, any opinion on:

  • changing the corpus JSON format: is it worth updating it to:
    1. add a “count” section so that several corpuses can be shared and merged easily without having the original text?
    2. add a field to list all the sources in the corpus, e.g. ["Victor Hugo — les Misérables I : Fantine", "Bépo discord #général : 2023–2024", …]
    3. switch from [symbols, bigrams, trigrams] to ngram_freq {1: , 2:, 3:, } (might be controversial, and pointless if frequency analysis above trigrams proves to be useless)
  • I considered changing the IGNORE_CHARS blacklist to a whitelist, but it would be tricky to list all chars useful to all languages. Maybe the narrow/standard non-breaking spaces are simply missing from the list and should be added instead, but I think the larger question is what we want:
    • Tabs, for example, are traditionally assigned to the left pinky, so a↹ could be a pain to type in Ergo‑L / Qwerty. Do we want to reflect that?
    • Spaces, and more generally thumb keys, are not counted because they are “equal” for all layouts, so far. Do we want to keep this philosophy, or do we acknowledge that future work in the field might include better thumb use, and thus that we ought to treat thumbs like any other finger? Maybe a standard IGNORE_CHARS list, close to what we have today, that we can turn into a parameter in the future is a good idea. In that case, I would tend to write it in the corpus, to record what has been considered in this corpus, but then it could be tricky to merge corpuses… Any thoughts on the matter? Too complicated for now?
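
To make the IGNORE_CHARS question more concrete, here is a rough sketch of n-gram counting with the ignore list passed as a parameter and written into the result. Everything here is an assumption for discussion: `count_ngrams`, `DEFAULT_IGNORE_CHARS` and the output keys are hypothetical names, the default list is not the one from chardict.py, and skipping every n-gram that contains an ignored char is only one possible rule.

```python
from collections import Counter

# Hypothetical default; the real IGNORE_CHARS list in chardict.py may differ.
DEFAULT_IGNORE_CHARS = " \t\r\n"


def count_ngrams(text: str, max_n: int = 3,
                 ignore_chars: str = DEFAULT_IGNORE_CHARS) -> dict:
    """Count n-grams of length 1..max_n, skipping any that contain an ignored char."""
    ignored = set(ignore_chars)
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            ngram = text[i:i + n]
            # Assumption: one ignored char is enough to drop the whole n-gram.
            if ignored.isdisjoint(ngram):
                counts[n][ngram] += 1
    return {
        # Recording the list shows which chars were excluded from this corpus.
        "ignore_chars": ignore_chars,
        "count": {n: dict(c) for n, c in counts.items()},
    }


result = count_ngrams("ab ab", max_n=2)
print(result["count"])
```

Calling it with `ignore_chars=""` would count tabs and spaces like any other key. The merge problem is visible too: two results with different `ignore_chars` values were not counted under the same rules, so summing them would mix incompatible statistics.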

@Ced-C
Contributor Author

Ced-C commented Dec 17, 2024

The format change in the chardict JSON output has been implemented, to try it out. It simplifies corpus merging and does not remove any feature.
Step 1 in §Approach is now done with commits:
8e26a45 0f97d7e a5a5900
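
To illustrate why storing counts simplifies merging: percentages cannot be combined without knowing the size of each corpus, whereas raw counts can simply be summed and the frequencies derived again afterwards. The sketch below is not the format from the commits above; the `sources`, `count` and `freq` keys and the `merge_corpora` helper are hypothetical.

```python
from collections import Counter


def merge_corpora(*corpora: dict) -> dict:
    """Merge corpora that store raw n-gram counts, then derive frequencies in percent."""
    merged_counts = {}
    sources = []
    for corpus in corpora:
        sources.extend(corpus.get("sources", []))
        for n, ngrams in corpus["count"].items():
            # Counter.update() adds counts instead of overwriting them.
            merged_counts.setdefault(n, Counter()).update(ngrams)

    freq = {}
    for n, ngrams in merged_counts.items():
        total = sum(ngrams.values())
        freq[n] = {ngram: 100 * c / total for ngram, c in ngrams.items()}

    return {
        "sources": sources,
        "count": {n: dict(c) for n, c in merged_counts.items()},
        "freq": freq,
    }


# Two tiny made-up corpora, keyed by n-gram length as a string (as in JSON).
a = {"sources": ["text A"], "count": {"1": {"e": 3, "a": 1}}}
b = {"sources": ["text B"], "count": {"1": {"e": 1, "s": 3}}}
merged = merge_corpora(a, b)
print(merged["count"]["1"])
print(merged["freq"]["1"])
```

Because the frequencies are derived from the merged counts, they always sum to 100 % per n-gram length, up to floating-point rounding.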

Next step is the /corpus page creation.

@fabi1cazenave
Collaborator

Hi @Ced-C, sorry for the lag, it’s been wild here.

First thing, please create a first PR if you intend to change the JSON format for our corpora. This format is used by Kalamine of course, but also by DuckTypist and the Ergo‑L analyzer. In other words, it’s a breaking change, and this means we’ll have to publish Kalamine 0.39 before switching to this new file format.

I find it a bit depressing that Python cannot store data to $XDG_CONFIG_HOME without a separate package but if that’s the case… so be it.
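
For reference, a stdlib-only sketch of the lookup, assuming a Linux-style setup. The XDG Base Directory specification names the variable `XDG_CONFIG_HOME` and falls back to `~/.config`. The `user_config_dir` helper below is hypothetical and only mirrors the name used earlier in this thread; it does not cover the macOS or Windows conventions that a package such as appdirs handles.

```python
import os
from pathlib import Path


def user_config_dir(app_name: str) -> Path:
    """Return the per-user config directory for app_name, following the XDG convention."""
    # The spec requires an absolute path; otherwise fall back to ~/.config.
    base = os.environ.get("XDG_CONFIG_HOME", "")
    root = Path(base) if os.path.isabs(base) else Path.home() / ".config"
    return root / app_name


# Hypothetical corpus location, matching the folder name proposed above.
corpus_dir = user_config_dir("kalamine") / "corpuses"
print(corpus_dir)
```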
