Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better datetime/date string parsing performance #2885

Merged
merged 1 commit into from
Dec 22, 2024
Merged

Conversation

texodus
Copy link
Member

@texodus texodus commented Dec 22, 2024

Tl;DR 7x performance improvement parsing non-ISO-8601 datetime strings in CSV or JSON formats.

I've been analyzing Perspective performance using this NYC Open Data 500k row CSV file, which loads fine but takes ~22 seconds on my computer to parse:

Screenshot 2024-12-22 at 4 50 23 PM

Looking into the profile analysis shows a lot of long calls out from WASM into JS to call strptime, which Perspective uses for anything not ISO 8601 formatted date or datetime strings. In Emscripten, strptime is implemented as a JavaScript foreign call which turns out to be quite expensive, as well as requiring unnecessary string copying.

Screenshot 2024-12-22 at 4 54 59 PM

Perspective ultimately callsstrptime through Arrow, which is used internally for CSV and any other string-to-datetime parsing (including in JSON formats). It turns out though, Arrow itself has a vendored C++ implementation of strptime which it uses for Windows builds. Patching Arrow to use its own vendored strptime implementation for Emscripten as well reduces the original runtime to ~3s:

Screenshot 2024-12-22 at 5 12 25 PM

This PR only applies this fix for Emscripten (WebAssembly) builds. I haven't yet tested Python, but I don't expect this fix to be applicable for Python as these implementations are likely identical in this context.

@texodus texodus changed the title Better datetime/date parsing fro string performance Better datetime/date string parsing performance Dec 22, 2024
@texodus texodus added the enhancement Feature requests or improvements label Dec 22, 2024
@texodus texodus marked this pull request as ready for review December 22, 2024 23:03
@texodus texodus merged commit c4666a3 into master Dec 22, 2024
13 checks passed
@texodus texodus deleted the strptime-wasm branch December 22, 2024 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests or improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant