More optimizations #13
Hey, sorry about the late response. I was on parental leave. First thought: the reason the library is not using a simple pre-sized array is that we want to support unicode. (This is something the C-extension-based library doesn't support for Python 2.7.)
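To make the trade-off concrete, here is a minimal sketch (not ahocorapy's actual code) of why keying trie transitions by character in a dict supports arbitrary unicode, while a pre-sized array indexed by byte value tops out at 256 slots:

```python
# Illustrative sketch only: a dict-based trie accepts any unicode character
# as a transition key, with no fixed alphabet size.
def add_keyword(root, keyword):
    node = root
    for ch in keyword:
        node = node.setdefault(ch, {})
    node[None] = keyword  # sentinel marking a complete keyword

root = {}
add_keyword(root, "malware")
add_keyword(root, "恶意软件")  # works: dict keys are full unicode characters,
                              # not indices into a 256-entry array
```

A byte-indexed transition table would need `ord(ch)` to fit its size, which already fails for the first CJK character.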
I will close this, since there was no answer. I think the library would lose its unicode support (and support for any other sequences of arbitrary symbols) with these changes.
> As long as all registered strings are utf8, searching …

Sorry for the late reply.
As long as the number of registered strings is lower than 255, it could work. But UTF-8 can encode far more characters, so you'd already have a problem when a text mixes English and Chinese, for example. We use the library heavily in scenarios where far more than 255 different characters can come up.
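A tiny check illustrates the 255 limit: even a single Chinese character has a code point far above 255, so any transition table indexed by a one-byte value cannot represent it directly.

```python
# Minimal illustration: code points of a mixed English/Chinese string.
# A 256-slot array indexed by ord(ch) cannot hold the second transition.
for ch in "a中":
    print(ch, ord(ch))

assert ord("a") < 256      # fits a byte-indexed table
assert ord("中") > 255     # does not
```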
I implemented further optimizations, to the point where the library no longer looks much like ahocorapy.
First, a few benchmarks. I am using the same benchmark offered in this repository. All numbers are nanoseconds, measured on the latest stable PyPy. First, construction:
Then search:
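The benchmark tables themselves are not reproduced here. As a hedged sketch of how nanosecond timings like these can be gathered (the names `bench_ns` and `repeats` are illustrative, not the repository's benchmark code), one can take the best of several wall-clock runs:

```python
import time

def bench_ns(fn, repeats=5):
    """Run `fn` several times and return the best wall-clock
    duration in nanoseconds (best-of-N reduces scheduler noise,
    which matters especially for JIT runtimes like PyPy)."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        elapsed = time.perf_counter_ns() - start
        best = elapsed if best is None else min(best, elapsed)
    return best
```

On PyPy, a warm-up call before timing is also advisable so the JIT has compiled the hot loop.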
I added a step after finalization, called freeze; here is the code:
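(The code block from the issue is not reproduced here. As an illustration only, re-derived from the description and not the author's implementation, a freeze step over a dict-based trie could snapshot each mutable node into an immutable tuple after finalization, so the search loop works over fixed-shape data:)

```python
# Hypothetical sketch of a post-finalization "freeze" step. The node layout
# ({'transitions': {...}, 'keyword': ...}) and the name `freeze` are assumptions
# for illustration, not ahocorapy's API.
def freeze(node):
    """Return an immutable snapshot of a dict-based trie node:
    (sorted (char, frozen_child) pairs, keyword-or-None)."""
    transitions = tuple(
        (ch, freeze(child))
        for ch, child in sorted(node.get('transitions', {}).items())
    )
    return (transitions, node.get('keyword'))
```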
Search is changed accordingly; the algorithm itself is unchanged.
Edit: added pyahocorasick to the benchmarks.