Better datetime/date string parsing performance #2885
Merged
+177
−129
TL;DR: 7x performance improvement when parsing non-ISO-8601 datetime strings in CSV or JSON formats.
I've been analyzing Perspective performance using this 500k-row CSV file from NYC Open Data, which loads fine but takes ~22 seconds to parse on my computer.
Profiling shows many long calls out from WASM into JS to invoke `strptime`, which Perspective uses for any date or datetime string that is not ISO 8601 formatted. In Emscripten, `strptime` is implemented as a JavaScript foreign call, which turns out to be quite expensive and also requires unnecessary string copying.

Perspective ultimately calls `strptime` through Arrow, which is used internally for CSV and any other string-to-datetime parsing (including in JSON formats). It turns out, though, that Arrow has its own vendored C++ implementation of `strptime`, which it uses for Windows builds. Patching Arrow to use this vendored `strptime` for Emscripten as well reduces the original runtime to ~3s.

This PR only applies this fix for Emscripten (WebAssembly) builds. I haven't yet tested Python, but I don't expect this fix to be applicable there, as the implementations are likely identical in that context.
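
For readers who want to see the shape of the change, here is a minimal sketch of the compile-time switch described above. It is not Arrow's actual code: `vendored_parse` and `parse_datetime` are hypothetical names, the stand-in parser only understands the single format `%m/%d/%Y %H:%M:%S`, and the real patch may differ in detail. The point is only that on Windows and Emscripten the parse stays in compiled code, while other platforms keep calling the libc `strptime`.

```cpp
#include <cstdio>
#include <time.h>

// Hypothetical stand-in for a vendored, in-process strptime. It handles only
// "%m/%d/%Y %H:%M:%S" and returns a pointer just past the consumed input,
// or nullptr if the input does not match.
const char* vendored_parse(const char* s, struct tm* out) {
  int mon, day, year, hour, min, sec, consumed = 0;
  if (std::sscanf(s, "%2d/%2d/%4d %2d:%2d:%2d%n",
                  &mon, &day, &year, &hour, &min, &sec, &consumed) != 6) {
    return nullptr;
  }
  // Coarse range checks only; this sketch does not validate days per month.
  if (mon < 1 || mon > 12 || day < 1 || day > 31 || year < 0 ||
      hour < 0 || hour > 23 || min < 0 || min > 59 || sec < 0 || sec > 60) {
    return nullptr;
  }
  out->tm_mon = mon - 1;       // struct tm months are 0-based
  out->tm_mday = day;
  out->tm_year = year - 1900;  // struct tm years count from 1900
  out->tm_hour = hour;
  out->tm_min = min;
  out->tm_sec = sec;
  return s + consumed;
}

// Parses a non-ISO datetime string, choosing the implementation at compile
// time. Returns true only if the whole string was consumed.
bool parse_datetime(const char* s, struct tm* out) {
  *out = {};
#if defined(_WIN32) || defined(__EMSCRIPTEN__)
  // No libc strptime on Windows; on Emscripten it is a costly call into JS.
  const char* end = vendored_parse(s, out);
#else
  const char* end = strptime(s, "%m/%d/%Y %H:%M:%S", out);
#endif
  return end != nullptr && *end == '\0';
}
```

Calling `parse_datetime("03/15/2021 14:30:05", &t)` fills `t` the same way on either branch, so callers do not need to know which implementation was selected.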