
More optimizations #13

Closed
amirouche opened this issue Jul 5, 2021 · 4 comments

Comments

@amirouche

amirouche commented Jul 5, 2021

I implemented new optimizations, to the point that the library no longer looks much like it came from ahocorapy.

First, a few benchmarks. I am using the same benchmark offered in this repository. All numbers are nanoseconds; the PyPy rows use the latest stable PyPy. First, construction:

CPython-v3.9.2-default:/lorem-ipsum-construction/benchmarks/python-acdc.py   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.51 B
CPython-v3.9.2-default:/lorem-ipsum-construction/benchmarks/ahocorapy.py     : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.21 B
CPython-v3.9.2-default:/lorem-ipsum-construction/benchmarks/pyahocorasick.py : ▇ 30.91M
PyPy-v3.7.10-7.3.5+dfsg-2:/lorem-ipsum-construction/benchmarks/python-acdc.py: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 826.37M
PyPy-v3.7.10-7.3.5+dfsg-2:/lorem-ipsum-construction/benchmarks/ahocorapy.py  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 434.21M

Then search:

CPython-v3.9.2-default:/lorem-ipsum-search/benchmarks/python-acdc.py   : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.23 M
CPython-v3.9.2-default:/lorem-ipsum-search/benchmarks/ahocorapy.py     : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.58 M
CPython-v3.9.2-default:/lorem-ipsum-search/benchmarks/pyahocorasick.py : ▇▇▇▇ 458.14K
PyPy-v3.7.10-7.3.5+dfsg-2:/lorem-ipsum-search/benchmarks/python-acdc.py: ▇▇▇▇▇ 636.67K
PyPy-v3.7.10-7.3.5+dfsg-2:/lorem-ipsum-search/benchmarks/ahocorapy.py  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.97 M

I added a step after finalization called freeze; here is the code:

    # last step: freeze!
    out = []
    for istate in states:
        # one slot per possible byte value (0-255), hence 256 entries
        transitions = [None] * 256
        for byte, state in istate.transitions.items():
            transitions[byte] = state.uid
        fstate = FState(
            istate.byte,
            istate.match[0],
            istate.longest_strict_suffix[0].uid,
            tuple(transitions)
        )
        out.append(fstate)
    return tuple(out)

Search is changed accordingly; the algorithm itself is unchanged.
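To make the idea concrete, here is a minimal self-contained sketch of the whole scheme, not the actual code from this issue: `FState`, `build`, and `search` are hypothetical simplifications that build a byte-level trie with suffix (failure) links, freeze it into 256-slot tuples, and run the byte-indexed search loop over UTF-8 input.

```python
from collections import deque, namedtuple

# Hypothetical frozen-state layout: 'match' is the matched keyword (or None),
# 'suffix' the uid of the longest strict suffix state, 'transitions' a
# 256-slot tuple of next-state uids indexed by byte value.
FState = namedtuple("FState", "match suffix transitions")

def build(keywords):
    # Mutable trie over UTF-8 bytes: list of dicts indexed by state uid.
    trie = [{"next": {}, "match": None, "suffix": 0}]
    for keyword in keywords:
        uid = 0
        for byte in keyword.encode("utf8"):
            if byte not in trie[uid]["next"]:
                trie.append({"next": {}, "match": None, "suffix": 0})
                trie[uid]["next"][byte] = len(trie) - 1
            uid = trie[uid]["next"][byte]
        trie[uid]["match"] = keyword
    # Breadth-first pass computes the suffix (failure) links.
    queue = deque(trie[0]["next"].values())
    while queue:
        uid = queue.popleft()
        for byte, child in trie[uid]["next"].items():
            queue.append(child)
            suffix = trie[uid]["suffix"]
            while suffix and byte not in trie[suffix]["next"]:
                suffix = trie[suffix]["suffix"]
            trie[child]["suffix"] = trie[suffix]["next"].get(byte, 0)
    # Freeze: byte-indexed tuples instead of dicts.
    frozen = []
    for node in trie:
        transitions = [None] * 256
        for byte, target in node["next"].items():
            transitions[byte] = target
        frozen.append(FState(node["match"], node["suffix"], tuple(transitions)))
    return tuple(frozen)

def search(frozen, haystack):
    matches = []
    state = 0
    for byte in haystack.encode("utf8"):
        # Fall back along suffix links until a transition exists (or root).
        while state and frozen[state].transitions[byte] is None:
            state = frozen[state].suffix
        nxt = frozen[state].transitions[byte]
        state = nxt if nxt is not None else 0
        # Walk the suffix chain to report every keyword ending here.
        report = state
        while report:
            if frozen[report].match is not None:
                matches.append(frozen[report].match)
            report = frozen[report].suffix
    return matches
```

For example, `search(build(["he", "she", "his"]), "ushers")` reports `["she", "he"]`. The point of freezing is that the inner loop becomes a tuple index instead of a dict lookup, which JIT compilers such as PyPy handle very well.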

Edit: added pyahocorasick to the benchmarks.

@FrederikP
Collaborator

Hey, sorry about the late response. I was on parental leave.

First thought:
How does this still support unicode (more than 255 chars)?

The reason the library does not use a simple pre-sized array is that we want to support Unicode. (This is something the C-extension-based library doesn't support on Python 2.7.)
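For illustration, a hypothetical minimal node (not the library's actual class) shows why dict-keyed transitions support arbitrary alphabets: any hashable symbol can be a key, so full Unicode characters work directly, with no fixed table size.

```python
# Sketch of dict-keyed transitions: the alphabet is open-ended,
# so characters far beyond any 256-slot table are accepted directly.
class Node:
    def __init__(self):
        self.transitions = {}  # symbol -> Node

root = Node()
node = root
for char in "中文":  # full Unicode characters as transition keys
    node = node.transitions.setdefault(char, Node())
```

The same structure also works for sequences of arbitrary hashable items, not just characters.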

@FrederikP
Collaborator

I will close this, since there was no answer. I think the library would lose its Unicode support (and support for sequences of arbitrary symbols in general) with these changes.

@amirouche
Author

The reason the library is not using a simple pre-sized array is that we want to support unicode.

As long as all registered strings are UTF-8, searching for "foobar" or "foobar".encode('utf8') is the same. If there were byte strings mixed from different encodings, that would be a different story.
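A small demonstration of this point (using an arbitrary sample string): once a Unicode string is encoded as UTF-8, it is a sequence of byte values in 0-255, so a 256-slot per-state table can still index it; non-ASCII characters simply span several consecutive transitions.

```python
# Any Unicode string, encoded as UTF-8, becomes byte values in 0-255.
text = "héllo 中文"
data = text.encode("utf8")

assert all(0 <= byte <= 255 for byte in data)
assert len(data) == 13          # 'é' takes 2 bytes, '中' and '文' take 3 each
assert data.decode("utf8") == text
```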

Sorry for the late reply.

@FrederikP
Collaborator

As long as the number of registered characters is lower than 255, it could work. But UTF-8 can encode far more characters than that. So you'd already have a problem when a text is a mix of English and Chinese, for example.

We use the library heavily for scenarios where a lot more than 255 different characters can come up.
