
More optimizations #13

Closed
amirouche opened this issue Jul 5, 2021 · 4 comments

Comments

@amirouche

amirouche commented Jul 5, 2021

I implemented new optimizations, to the point that the library no longer looks much like it came from ahocorapy.

First, a few benchmarks. I am using the same benchmark offered in this repository. All numbers are nanoseconds; the PyPy rows use the latest stable PyPy. First, construction:

CPython-v3.9.2-default:/lorem-ipsum-construction/benchmarks/python-acdc.py   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.51 B
CPython-v3.9.2-default:/lorem-ipsum-construction/benchmarks/ahocorapy.py     : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.21 B
CPython-v3.9.2-default:/lorem-ipsum-construction/benchmarks/pyahocorasick.py : ▇ 30.91M
PyPy-v3.7.10-7.3.5+dfsg-2:/lorem-ipsum-construction/benchmarks/python-acdc.py: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 826.37M
PyPy-v3.7.10-7.3.5+dfsg-2:/lorem-ipsum-construction/benchmarks/ahocorapy.py  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 434.21M

Then search:

CPython-v3.9.2-default:/lorem-ipsum-search/benchmarks/python-acdc.py   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.23 M
CPython-v3.9.2-default:/lorem-ipsum-search/benchmarks/ahocorapy.py     : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.58 M
CPython-v3.9.2-default:/lorem-ipsum-search/benchmarks/pyahocorasick.py : ▇▇▇▇ 458.14K
PyPy-v3.7.10-7.3.5+dfsg-2:/lorem-ipsum-search/benchmarks/python-acdc.py: ▇▇▇▇▇ 636.67K
PyPy-v3.7.10-7.3.5+dfsg-2:/lorem-ipsum-search/benchmarks/ahocorapy.py  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.97 M

I added a step after finalization called freeze; here is the code:

    # last step: freeze!
    out = []
    for istate in states:
        # one slot per possible byte value (0-255), hence 256 entries
        transitions = [None] * 256
        for byte, state in istate.transitions.items():
            transitions[byte] = state.uid
        fstate = FState(
            istate.byte,
            istate.match[0],
            istate.longest_strict_suffix[0].uid,
            tuple(transitions)
        )
        out.append(fstate)
    return tuple(out)

Search is changed accordingly; the algorithm itself is unchanged.
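To make the idea concrete, here is a minimal self-contained sketch of the whole scheme, not the actual code from this issue: `FState`, `build`, and `search` are hypothetical simplifications that build a byte-level trie with suffix (failure) links, freeze it into 256-slot tuples, and run the byte-indexed search loop over UTF-8 input.

```python
from collections import deque, namedtuple

# Hypothetical frozen-state layout: 'match' is the matched keyword (or None),
# 'suffix' the uid of the longest strict suffix state, 'transitions' a
# 256-slot tuple of next-state uids indexed by byte value.
FState = namedtuple("FState", "match suffix transitions")

def build(keywords):
    # Mutable trie over UTF-8 bytes: list of dicts indexed by state uid.
    trie = [{"next": {}, "match": None, "suffix": 0}]
    for keyword in keywords:
        uid = 0
        for byte in keyword.encode("utf8"):
            if byte not in trie[uid]["next"]:
                trie.append({"next": {}, "match": None, "suffix": 0})
                trie[uid]["next"][byte] = len(trie) - 1
            uid = trie[uid]["next"][byte]
        trie[uid]["match"] = keyword
    # Breadth-first pass computes the suffix (failure) links.
    queue = deque(trie[0]["next"].values())
    while queue:
        uid = queue.popleft()
        for byte, child in trie[uid]["next"].items():
            queue.append(child)
            suffix = trie[uid]["suffix"]
            while suffix and byte not in trie[suffix]["next"]:
                suffix = trie[suffix]["suffix"]
            trie[child]["suffix"] = trie[suffix]["next"].get(byte, 0)
    # Freeze: byte-indexed tuples instead of dicts.
    frozen = []
    for node in trie:
        transitions = [None] * 256
        for byte, target in node["next"].items():
            transitions[byte] = target
        frozen.append(FState(node["match"], node["suffix"], tuple(transitions)))
    return tuple(frozen)

def search(frozen, haystack):
    matches = []
    state = 0
    for byte in haystack.encode("utf8"):
        # Fall back along suffix links until a transition exists (or root).
        while state and frozen[state].transitions[byte] is None:
            state = frozen[state].suffix
        nxt = frozen[state].transitions[byte]
        state = nxt if nxt is not None else 0
        # Walk the suffix chain to report every keyword ending here.
        report = state
        while report:
            if frozen[report].match is not None:
                matches.append(frozen[report].match)
            report = frozen[report].suffix
    return matches
```

For example, `search(build(["he", "she", "his"]), "ushers")` reports `["she", "he"]`. The point of freezing is that the inner loop becomes a tuple index instead of a dict lookup, which JIT compilers such as PyPy handle very well.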

Edit: added pyahocorasick to the benchmarks.

@FrederikP
Collaborator

Hey, sorry about the late response. I was on parental leave.

First thought:
How does this still support unicode (more than 255 chars)?

The reason the library does not use a simple pre-sized array is that we want to support Unicode. (This is something the C-extension-based library doesn't support on Python 2.7.)
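For illustration, a hypothetical minimal node (not the library's actual class) shows why dict-keyed transitions support arbitrary alphabets: any hashable symbol can be a key, so full Unicode characters work directly, with no fixed table size.

```python
# Sketch of dict-keyed transitions: the alphabet is open-ended,
# so characters far beyond any 256-slot table are accepted directly.
class Node:
    def __init__(self):
        self.transitions = {}  # symbol -> Node

root = Node()
node = root
for char in "中文":  # full Unicode characters as transition keys
    node = node.transitions.setdefault(char, Node())
```

The same structure also works for sequences of arbitrary hashable items, not just characters.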

@FrederikP
Collaborator

I will close this, since there was no answer. I think the library would lose its Unicode support (and support for sequences of arbitrary symbols in general) with these changes.

@amirouche
Author

The reason the library is not using a simple pre-sized array is that we want to support unicode.

As long as all registered strings are UTF-8, searching for "foobar" or "foobar".encode('utf8') is the same. If there were byte strings mixed from different encodings, that would be a different story.
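A small demonstration of this point (using an arbitrary sample string): once a Unicode string is encoded as UTF-8, it is a sequence of byte values in 0-255, so a 256-slot per-state table can still index it; non-ASCII characters simply span several consecutive transitions.

```python
# Any Unicode string, encoded as UTF-8, becomes byte values in 0-255.
text = "héllo 中文"
data = text.encode("utf8")

assert all(0 <= byte <= 255 for byte in data)
assert len(data) == 13          # 'é' takes 2 bytes, '中' and '文' take 3 each
assert data.decode("utf8") == text
```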

Sorry for the late reply.

@FrederikP
Collaborator

As long as the number of registered characters is lower than 255, it could work. But UTF-8 can encode far more characters than that. So you'd already have a problem when a text is a mix of English and Chinese, for example.

We use the library heavily for scenarios where a lot more than 255 different characters can come up.
