Add multilanguage support #13
Are you planning to add all the languages from that repo? |
No, just the ones that I can easily find stemmers for. Right now that's the languages supported by https://github.com/CurrySoftware/rust-stemmers minus Hungarian since that one didn't match lunr-languages' output. There are a few more languages that could be added pretty easily by running the snowball compiler, but I don't think I'll go through the effort unless someone actually wants them. |
So if I would like to add support for the Polish language, do I need to add it to snowball first? |
You need a Rust implementation and a JavaScript implementation that are compatible with each other. The snowball compiler is one way to generate both implementations, but you could port an algorithm manually as well. |
Hi @mattico! I'd like to help make the multilanguage integration happen. Could you please provide some guidance? |
First, to be clear, multi-language means a search index that supports content written in multiple languages, that is, a single document that contains multiple languages. We already support searching many languages individually. Second, the main constraint of the implementation is that it must be compatible with the JavaScript implementation, so the starting point for any addition should be understanding how the JavaScript implementation works and converting it. The readme of elasticlunr.js says it can use https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js. Tests should be added which generate an index using the JavaScript implementation and compare it to an index generated using the Rust implementation. More specifically, it looks like lunr.multi.js takes a bunch of language pipelines as arguments and combines them into one. Language pipelines have a few distinct parts which are run sequentially:
These all get combined into a pipeline, which is just a list of functions which each get run sequentially on each input token to produce the output token. The |
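To make the "list of functions run sequentially on each token" idea concrete, here is a minimal Rust sketch. It is not the code from elasticlunr-rs or lunr.multi.js: the `Stage` type, the `Pipeline` struct, and all four stage functions are hypothetical names invented for illustration, and the "stemmer" is a toy that only strips a trailing `s`. It assumes a stage may either transform a token or drop it entirely.

```rust
/// A pipeline stage (hypothetical type): takes a token and returns the
/// transformed token, or `None` to drop it (for example, a stop word).
type Stage = fn(String) -> Option<String>;

/// A pipeline is just an ordered list of stages.
struct Pipeline {
    stages: Vec<Stage>,
}

impl Pipeline {
    /// Run every stage in order on a single token.
    /// Stops early if any stage drops the token.
    fn run_token(&self, token: String) -> Option<String> {
        let mut current = token;
        for stage in &self.stages {
            current = stage(current)?;
        }
        Some(current)
    }

    /// Run the pipeline over a list of tokens, keeping only survivors.
    fn run(&self, tokens: &[&str]) -> Vec<String> {
        tokens
            .iter()
            .filter_map(|t| self.run_token(t.to_string()))
            .collect()
    }
}

/// Illustrative stage: lowercase the token.
fn lowercase(token: String) -> Option<String> {
    Some(token.to_lowercase())
}

/// Illustrative stage: strip non-alphanumeric characters from both ends.
fn trim_punctuation(token: String) -> Option<String> {
    let trimmed = token.trim_matches(|c: char| !c.is_alphanumeric());
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed.to_string())
    }
}

/// Illustrative stage: drop a tiny, made-up stop word list.
fn drop_stop_words(token: String) -> Option<String> {
    const STOP_WORDS: [&str; 3] = ["the", "a", "of"];
    if STOP_WORDS.contains(&token.as_str()) {
        None
    } else {
        Some(token)
    }
}

/// Toy "stemmer": strips one trailing `s`. Not a real stemming algorithm.
fn toy_stem(token: String) -> Option<String> {
    if let Some(stem) = token.strip_suffix("s") {
        if stem.len() > 2 {
            return Some(stem.to_string());
        }
    }
    Some(token)
}

/// Build the example pipeline with the stages in a fixed order.
fn example_pipeline() -> Pipeline {
    Pipeline {
        stages: vec![lowercase, trim_punctuation, drop_stop_words, toy_stem],
    }
}
```

With these toy stages, `example_pipeline().run(&["The", "cats,", "of", "Rome"])` yields `["cat", "rome"]`. The order of stages matters: the stop word filter only matches here because lowercasing and trimming run before it.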
Thanks a lot! Everything seems to be clear :) |
Thank you for such a thorough answer, @mattico! I've managed to implement support for the Russian and English languages. Unfortunately, I neither made a universal solution for all possible combinations of languages nor covered it with tests. I hope I will find some spare time in the near future to implement universal support properly and send a PR. Btw, I've also encountered a weird issue with |
https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js