vertexcodec: Implement enhanced delta encoding #820
Merged
When the new encoding is selected, in addition to storing the first vertex in the tail we also store channel data. Channel data will encode the transformation type for every 4 bytes of the vertex; for now we only need to encode a 2-bit value, but we reserve a full byte for simplicity - and we may need 2 bytes for this in the future to accommodate rotates. While varying the channel data per vertex block may be helpful, it increases the block overhead and switching encodings for adjacent blocks penalizes efficiency of LZ compressors. As such, we will only work with a single encoding across the entire data stream for every channel.
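As a rough illustration of the tail described above, here is a minimal sketch. It is not the library's code: `buildTail` is a hypothetical helper, and both the ordering (first vertex, then channel bytes) and the use of the low 2 bits of each channel byte are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends one channel byte per 4 bytes of the vertex
// after the first vertex. The ordering is an assumption for illustration.
std::vector<uint8_t> buildTail(const std::vector<uint8_t>& first_vertex,
                               const std::vector<uint8_t>& channels)
{
    assert(first_vertex.size() % 4 == 0);
    assert(channels.size() == first_vertex.size() / 4);

    std::vector<uint8_t> tail(first_vertex);
    for (uint8_t c : channels)
    {
        assert(c < 4); // only a 2-bit value is needed for now
        tail.push_back(c);
    }
    return tail;
}
```

With an 8-byte vertex this produces a 10-byte tail: 8 bytes of vertex data followed by 2 channel bytes.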
This still encodes just one byte at a time, but that simplifies the encode flow a little bit. We should be able to compute deltas using other bytes in the same channel based on the specified encoding type without changing the flow here.
When using the new format we can now select between byte, halfword, and word deltas, or switch to xor without zigzag encoding (which is width-invariant). Different types of deltas are optimal for different content; 2-byte deltas provide the most value, whereas 4-byte and XOR deltas are more experimental and may be removed in the future. Ideally it would be possible to select the optimal mode by looking at the source data without repeat encoding attempts; this will be investigated separately.
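To make the difference between the delta types concrete, here is a minimal sketch rather than the actual encoder. `DeltaMode`, `zigzag`, `encodeDelta` and `encodeChannel` are hypothetical names, and the channel is assumed to be 4 little-endian bytes encoded against the previous vertex.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mode names; the real bitstream values are not shown here.
enum DeltaMode { Delta8, Delta16, Delta32, DeltaXor };

// Zigzag maps signed deltas 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
// T must be an unsigned integer type.
template <typename T>
T zigzag(T v)
{
    return T((v << 1) ^ (T(0) - (v >> (sizeof(T) * 8 - 1))));
}

// Encodes one 4-byte channel as zigzagged deltas of width sizeof(T).
template <typename T>
std::array<uint8_t, 4> encodeDelta(const std::array<uint8_t, 4>& cur,
                                   const std::array<uint8_t, 4>& prev)
{
    std::array<uint8_t, 4> out = {};
    for (size_t k = 0; k < 4; k += sizeof(T))
    {
        // Assemble little-endian values of width sizeof(T).
        T c = 0, p = 0;
        for (size_t j = 0; j < sizeof(T); ++j)
        {
            c |= T(T(cur[k + j]) << (8 * j));
            p |= T(T(prev[k + j]) << (8 * j));
        }

        T d = zigzag(T(c - p));

        for (size_t j = 0; j < sizeof(T); ++j)
            out[k + j] = uint8_t(d >> (8 * j));
    }
    return out;
}

std::array<uint8_t, 4> encodeChannel(const std::array<uint8_t, 4>& cur,
                                     const std::array<uint8_t, 4>& prev,
                                     DeltaMode mode)
{
    switch (mode)
    {
    case Delta8: return encodeDelta<uint8_t>(cur, prev);
    case Delta16: return encodeDelta<uint16_t>(cur, prev);
    case Delta32: return encodeDelta<uint32_t>(cur, prev);
    default: break;
    }

    // XOR needs no zigzag and yields the same bytes at any width.
    assert(mode == DeltaXor);
    std::array<uint8_t, 4> out = {};
    for (size_t k = 0; k < 4; ++k)
        out[k] = uint8_t(cur[k] ^ prev[k]);
    return out;
}
```

For a 16-bit field that steps from 0x00FF to 0x0100, byte deltas produce two non-zero bytes (2, 2) while the halfword delta produces (2, 0), because the carry between bytes is absorbed into a single small delta.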
When using the new version, we can now pick the optimal channel encoding; for now, we do this in a fairly naive fashion: we encode deltas in all possible ways and count the number of bytes. This does not, strictly speaking, minimize the output size: for simplicity, we estimate the number of bytes using a specific fixed bit group layout, which seems pretty close in terms of its efficiency for any delta encoding type. This is still not very fast, but hopefully better, faster heuristics can be developed in the future.
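The brute-force selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the encoder's implementation: `measureChannel`, `pickMode` and `estimateGroup` are hypothetical names, the cost model (16-byte groups stored with 0, 2, 4 or 8 bits per byte, with one escape byte for each value that does not fit) is a simplified stand-in for the fixed bit group layout, and the first value is used as the delta baseline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum DeltaMode { Delta8, Delta16, DeltaXor }; // hypothetical names

const size_t kGroupSize = 16;

uint8_t zigzag8(uint8_t v) { return uint8_t((v << 1) ^ (0 - (v >> 7))); }
uint16_t zigzag16(uint16_t v) { return uint16_t((v << 1) ^ (0 - (v >> 15))); }

// Stand-in cost model (an assumption): estimated bytes for one 16-byte group.
size_t estimateGroup(const uint8_t* group)
{
    bool zero = true;
    for (size_t i = 0; i < kGroupSize; ++i)
        zero &= group[i] == 0;
    if (zero)
        return 0;

    size_t best = kGroupSize; // 8 bits per byte
    for (unsigned bits = 2; bits <= 4; bits *= 2)
    {
        unsigned sentinel = (1u << bits) - 1;
        size_t cost = kGroupSize * bits / 8;
        for (size_t i = 0; i < kGroupSize; ++i)
            cost += group[i] >= sentinel; // escaped values cost a full byte
        if (cost < best)
            best = cost;
    }
    return best;
}

// Encodes one 4-byte channel across all vertices and sums the estimate.
size_t measureChannel(const std::vector<uint32_t>& values, DeltaMode mode)
{
    assert(!values.empty());
    size_t padded = (values.size() + kGroupSize - 1) / kGroupSize * kGroupSize;
    std::vector<uint8_t> lanes(4 * padded, 0); // zero padding keeps groups aligned

    uint32_t prev = values[0];
    for (size_t i = 0; i < values.size(); ++i)
    {
        uint32_t cur = values[i], d = 0;
        if (mode == Delta8)
        {
            for (int k = 0; k < 4; ++k)
                d |= uint32_t(zigzag8(uint8_t((cur >> (8 * k)) - (prev >> (8 * k))))) << (8 * k);
        }
        else if (mode == Delta16)
        {
            for (int k = 0; k < 2; ++k)
                d |= uint32_t(zigzag16(uint16_t((cur >> (16 * k)) - (prev >> (16 * k))))) << (16 * k);
        }
        else
        {
            d = cur ^ prev;
        }

        // Transpose: lane k holds byte k of every vertex's delta.
        for (size_t k = 0; k < 4; ++k)
            lanes[k * padded + i] = uint8_t(d >> (8 * k));
        prev = cur;
    }

    size_t total = 0;
    for (size_t g = 0; g < lanes.size(); g += kGroupSize)
        total += estimateGroup(&lanes[g]);
    return total;
}

DeltaMode pickMode(const std::vector<uint32_t>& values)
{
    const DeltaMode modes[] = {Delta8, Delta16, DeltaXor};
    DeltaMode best = modes[0];
    size_t best_size = measureChannel(values, best);
    for (DeltaMode mode : modes)
    {
        size_t size = measureChannel(values, mode);
        if (size < best_size)
        {
            best = mode;
            best_size = size;
        }
    }
    return best;
}
```

Every candidate mode requires a full encoding pass over the channel, which is why this approach is slow; a heuristic that inspects the source data directly would avoid the repeated passes.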
This helps contextualize the encoding performance differences between v0/v1. For completeness, this is also added for index codecs although there are no plans for improving these.
In addition to accumulating the deltas via 16-bit or 32-bit adds (or xors), we need to perform zigzag decoding in larger units. This is more complicated, because for 8-bit data we do this before the transpose to fully utilize SIMD units; thus, 16 and 32-bit unzigzag needs to work on decomposed integer values. For 16-bit, it's easy enough to unpack and repack these; for 32-bit, repacking is more involved due to the lack of unsigned 32-bit packing instructions. So we opt for doing a byte-wise unzigzag instead, which just needs to propagate the bits when shifting the value by 1 across byte boundaries manually.
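The byte-wise unzigzag can be illustrated with a scalar sketch. The SIMD paths operate on registers of decomposed bytes instead, so this only shows the bit propagation. `unzigzagBytes` and `bytewiseMatchesReference16` are hypothetical names, and the bytes are assumed to be in little-endian order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reference: unzigzag on a whole 16-bit value.
uint16_t unzigzag16(uint16_t v)
{
    return uint16_t((v >> 1) ^ (0 - (v & 1)));
}

// Byte-wise variant over n little-endian bytes (n = 2 or 4 in practice).
// Shifting right by 1 moves the low bit of each higher byte into the top
// bit of the byte below it; the sign mask comes from bit 0 of the value.
void unzigzagBytes(uint8_t* b, size_t n)
{
    assert(n > 0);
    uint8_t sign = uint8_t(0 - (b[0] & 1)); // 0x00 or 0xFF
    for (size_t k = 0; k < n; ++k)
    {
        // b[k + 1] has not been overwritten yet, so in-place is safe.
        uint8_t carry = k + 1 < n ? uint8_t(b[k + 1] << 7) : uint8_t(0);
        b[k] = uint8_t(((b[k] >> 1) | carry) ^ sign);
    }
}

// Exhaustively compares the byte-wise version with the 16-bit reference.
bool bytewiseMatchesReference16()
{
    for (uint32_t v = 0; v < 65536; ++v)
    {
        uint8_t b[2] = {uint8_t(v), uint8_t(v >> 8)};
        unzigzagBytes(b, 2);
        uint16_t r = unzigzag16(uint16_t(v));
        if (b[0] != uint8_t(r) || b[1] != uint8_t(r >> 8))
            return false;
    }
    return true;
}
```

The same loop handles the 32-bit case with `n = 4`, which is what makes the byte-wise formulation attractive when no suitable packing instruction is available.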
To make this easier, scalar delta decoding is now also done 4 bytes at a time, using a template to share C++ code to the extent possible.
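A templated scalar decoder along these lines might look like the sketch below. It is an assumption-laden illustration, not the library's function: `decodeChannel` is a hypothetical name, the input is assumed to be 4 transposed byte lanes of `count` bytes each, and the output is assumed to be tightly packed 4-byte channel values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// T must be an unsigned integer type; 0 - x is used instead of unary negation.
template <typename T>
T unzigzag(T v)
{
    return T((v >> 1) ^ (T(0) - (v & 1)));
}

// lanes: lane k holds byte k of every vertex's delta (assumed layout).
// first: 4 bytes of the baseline value.
// Returns count * 4 decoded bytes, 4 per vertex.
template <typename T, bool Xor>
std::vector<uint8_t> decodeChannel(const std::vector<uint8_t>& lanes,
                                   size_t count, const uint8_t* first)
{
    assert(lanes.size() == count * 4);
    std::vector<uint8_t> out(count * 4);

    // Process the 4-byte channel in units of sizeof(T) bytes.
    for (size_t k = 0; k < 4; k += sizeof(T))
    {
        T p = 0;
        for (size_t j = 0; j < sizeof(T); ++j)
            p |= T(T(first[k + j]) << (8 * j));

        for (size_t i = 0; i < count; ++i)
        {
            // Gather the delta from the byte lanes.
            T v = 0;
            for (size_t j = 0; j < sizeof(T); ++j)
                v |= T(T(lanes[(k + j) * count + i]) << (8 * j));

            v = Xor ? T(v ^ p) : T(unzigzag(v) + p);

            for (size_t j = 0; j < sizeof(T); ++j)
                out[i * 4 + k + j] = uint8_t(v >> (8 * j));
            p = v;
        }
    }
    return out;
}
```

Instantiating it with `uint8_t`, `uint16_t` or `uint32_t` covers the three delta widths with one body, and the inner loops over `sizeof(T)` have a compile-time trip count that the compiler can unroll.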
The implementation mostly mirrors the SSE path; however, we use the same byte-wise approach for 16-bit deltas as well, where instead of unpacking and repacking the full value we opt for byte-wise processing. This works even better than it does for SSE since NEON supports native byte shifts.
The implementation mirrors the NEON approach exactly: we use simplified byte-wise processing for both 16-bit and 32-bit deltas; the 16-bit path is faster this way even when targeting x64 in the Wasm backend.
Instead of unary negation, use subtraction; also convert sizeof-based conditions in scalar decoder to a loop which the compiler should unroll anyway.
For unclear reasons, despite seemingly better codegen, unzigzag16 with unpacks is a little faster on clang compared to the byte-wise version. However, it's noticeably slower on MSVC in a way that affects top line decode bandwidth, so let's opt for using a better version to avoid surprises. Also switch to using _mm_andnot_si128 explicitly; some compilers can automatically transform this but it's easy to do ourselves for safety.
Using encodeBytesMeasure requires aligned byte groups, but we were not padding the last block with zeroes previously which triggered an assertion.
Instead of hardcoding each data width, we can implement this with the same approach we've used for decodeDeltas1, templating over the type and using loops over sizeof(T). This runs at the same speed but is shorter and hopefully easier to maintain.
For clarity we print dots for higher bytes within the same channel that are using the same type of delta. Also fix first trace line being hard to read in codectest.
Both xors and 32-bit deltas are not particularly profitable on real data; of the two, 32-bit deltas take more effort to support whereas xor is not only easier to support but also decodes faster. In the future we might keep xor mode to be able to select faster deltas for channels that do not compress at all anyway.
When using the new format we can now select between byte and halfword
deltas, or switch to xor without zigzag encoding (which is
width-invariant).
Different types of deltas are optimal for different content; 2-byte
deltas provide the most value, whereas 4-byte and XOR deltas are more
experimental. This PR initially contained support for 4-byte deltas as well,
but it was increasing complexity without providing value so it was removed.
Xor deltas could be used to optimize for decoding performance when they
don't change the ratio.
The delta encoding is stored as a channel type for every 4 bytes of the
vertex; for now we only need to encode a 2-bit value, but we reserve a full
byte for simplicity - this may change in the future to accommodate rotates.
While varying the channel data per vertex block may be helpful, it
increases the block overhead and switching encodings for adjacent blocks
penalizes efficiency of LZ compressors. As such, we will only work with
a single encoding across the entire data stream for every channel.
To select the delta encoding, we try to encode deltas in all possible ways
and count the estimated number of bytes using a specific control mode.
This is not technically optimal as it ignores other subtleties of byte group
encoding, but ends up close to optimal in terms of efficiency. This also
significantly slows down the encoding process, but that can be fixed once we
settle on the delta types we need to support.
With this, we're up to >5% gains overall, with no loss of decoding performance:
With just 1/2 byte deltas, we get:
As before, all of these changes are to an experimental version that has absolutely
no guarantees of bitstream compatibility until it gets renumbered to 1.
This contribution is sponsored by Valve.