vertexcodec: Implement enhanced delta encoding #820

zeux · 2024-12-12T14:44:39Z

When using the new format, we can now select between byte and halfword
deltas, or switch to xor without zigzag encoding (which is
width-invariant).

Different types of deltas are optimal for different content; 2-byte
deltas provide the most value, whereas 4-byte and XOR deltas are more
experimental. This PR initially contained support for 4-byte deltas as well,
but it was increasing complexity without providing value so it was removed.
Xor deltas could be used to optimize for decoding performance when they
don't change the ratio.
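To make the difference between the delta types concrete, here is a minimal sketch of zigzag and xor delta encoding templated over the element width. The names `DeltaKind`, `zigzag` and `encodeDeltas` are hypothetical and chosen for illustration; the codec itself works on transposed byte data and is structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; not the library's API.
enum class DeltaKind { Zigzag, Xor };

// Zigzag maps small signed differences to small unsigned codes:
// 0 -> 0, -1 -> 1, +1 -> 2, -2 -> 3, ...
// T is expected to be uint8_t, uint16_t or uint32_t.
template <typename T>
T zigzag(T v)
{
    const int bits = int(sizeof(T) * 8);
    // Subtraction from zero is used instead of unary negation on an unsigned value.
    return T((v << 1) ^ (T(0) - (v >> (bits - 1))));
}

// Encodes each value as a delta against the previous one (the first value
// is encoded against 0). Xor deltas skip the zigzag step entirely.
template <typename T>
std::vector<T> encodeDeltas(const std::vector<T>& values, DeltaKind kind)
{
    std::vector<T> out(values.size());
    T prev = 0;
    for (size_t i = 0; i < values.size(); ++i)
    {
        T v = values[i];
        out[i] = (kind == DeltaKind::Xor) ? T(v ^ prev) : zigzag(T(v - prev));
        prev = v;
    }
    return out;
}
```

For a 16-bit attribute that oscillates around a byte boundary, such as 0x0100, 0x00FF, 0x0100, 2-byte deltas produce 0x0200, 1, 2, so the high byte is zero after the first element. Encoding the high bytes 0x01, 0x00, 0x01 on their own with 1-byte deltas produces 2, 1, 2, which is never zero.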

The delta encoding is stored as a channel type for every 4 bytes of the
vertex; for now we only need to encode a 2-bit value, but we reserve a full
byte for simplicity - this may change in the future to accommodate rotates.

While varying the channel data per vertex block may be helpful, it
increases the block overhead and switching encodings for adjacent blocks
penalizes efficiency of LZ compressors. As such, we will only work with
a single encoding across the entire data stream for every channel.

To select the delta encoding, we try to encode deltas in all possible ways
and count the estimated number of bytes using a specific control mode.
This is not technically optimal as it ignores other subtleties of byte group
encoding, but ends up close to optimal in terms of efficiency. This also
significantly slows down the encoding process, but that can be fixed once we
settle on the delta types we need to support.
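The selection loop described above can be sketched as follows. The cost model here is a simplified stand-in and an assumption of this sketch: it charges each group of 16 bytes according to the smallest bit width from {0, 2, 4, 8} that fits the largest byte in the group, which is not the measurement the codec actually performs. `estimateBytes` and `pickEncoding` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in cost model (an assumption, not the codec's real estimate):
// a group of 16 bytes costs 16 * bits / 8 bytes, where bits is the smallest
// of {0, 2, 4, 8} that can hold the largest byte in the group.
// A partial last group is treated as if it were padded with zeroes.
size_t estimateBytes(const std::vector<uint8_t>& data)
{
    size_t total = 0;
    for (size_t i = 0; i < data.size(); i += 16)
    {
        uint8_t maxv = 0;
        for (size_t j = i; j < i + 16 && j < data.size(); ++j)
            if (data[j] > maxv)
                maxv = data[j];

        int bits = maxv == 0 ? 0 : maxv < 4 ? 2 : maxv < 16 ? 4 : 8;
        total += size_t(16 * bits / 8);
    }
    return total;
}

// candidates[k] holds the delta bytes of one channel produced by encoding k.
// Returns the index with the smallest estimate; ties keep the earlier entry.
// Requires at least one candidate.
size_t pickEncoding(const std::vector<std::vector<uint8_t>>& candidates)
{
    size_t best = 0;
    size_t bestSize = estimateBytes(candidates[0]);
    for (size_t k = 1; k < candidates.size(); ++k)
    {
        size_t size = estimateBytes(candidates[k]);
        if (size < bestSize)
        {
            best = k;
            bestSize = size;
        }
    }
    return best;
}
```

Because every candidate is encoded and measured in full, the cost grows with the number of delta types, which is consistent with the encoding slowdown noted above.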

With this, we're up to >5% gains overall, with no loss of decoding performance:

143 files: raw v1/v0 -5.24%, lz4 v1/v0 -4.21%, zstd v1/v0 -2.76%

With just 1-byte and 2-byte deltas, we get:

143 files: raw v1/v0 -5.20%, lz4 v1/v0 -4.21%, zstd v1/v0 -2.75%

As before, all of these changes are to an experimental version that has absolutely
no guarantees of bitstream compatibility until it gets renumbered to 1.

This contribution is sponsored by Valve.

When the new encoding is selected, in addition to storing the first vertex in the tail we also store channel data. Channel data will encode the transformation type for every 4 bytes of the vertex; for now we only need to encode a 2-bit value, but we reserve a full byte for simplicity - and we may need 2 bytes for this in the future to accommodate rotates. While varying the channel data per vertex block may be helpful, it increases the block overhead and switching encodings for adjacent blocks penalizes the efficiency of LZ compressors. As such, we will only work with a single encoding across the entire data stream for every channel.

This still encodes just one byte at a time, but that simplifies the encode flow a little bit. We should be able to compute deltas using other bytes in the same channel based on the specified encoding type without changing the flow here.

When using the new format, we can now select between byte, halfword, and word deltas, or switch to xor without zigzag encoding (which is width-invariant). Different types of deltas are optimal for different content; 2-byte deltas provide the most value, whereas 4-byte and XOR deltas are more experimental and may be removed in the future. Ideally it would be possible to select the optimal mode by looking at the source data without repeat encoding attempts; this will be investigated separately.

When using the new version, we can now pick the optimal channel encoding; for now, we do this in a fairly naive fashion: we encode deltas in all possible ways and count the number of bytes. This does not, strictly speaking, minimize the output size: for simplicity, we estimate the number of bytes using a specific fixed bit group layout, which seems pretty close in terms of its efficiency for any delta encoding type. This is still not very fast, but hopefully better, faster heuristics can be developed in the future.

This helps contextualize the encoding performance differences between v0/v1. For completeness, this is also added for index codecs although there are no plans for improving these.

In addition to accumulating the deltas via 16-bit or 32-bit adds (or xors), we need to perform zigzag decoding in larger units. This is more complicated, because for 8-bit data we do this before the transpose to fully utilize SIMD units; thus, 16-bit and 32-bit unzigzag need to work on decomposed integer values. For 16-bit, it's easy enough to unpack and repack these; for 32-bit, repacking is more involved due to the lack of unsigned 32-bit packing instructions. So we opt for doing a byte-wise unzigzag instead, which just needs to propagate the bits when shifting the value by 1 across byte boundaries manually.
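The byte-wise unzigzag can be illustrated with a scalar sketch that operates on the four bytes of one 32-bit code. This is an illustration of the bit propagation only, not the SIMD code from this change: in the SIMD paths each of the four bytes would live in a separate vector. The function names are hypothetical.

```cpp
#include <cstdint>

// Reference: unzigzag applied to a whole 32-bit code.
uint32_t unzigzag32(uint32_t v)
{
    return (v >> 1) ^ (0u - (v & 1));
}

// Byte-wise variant: b[0] is the least significant byte of the code.
// Each byte is shifted right by 1 and receives the lowest bit of the next
// more significant byte as its top bit; the sign mask comes from bit 0 of b[0].
void unzigzag32Bytes(uint8_t b[4])
{
    uint8_t mask = uint8_t(0 - (b[0] & 1)); // 0x00 or 0xFF
    for (int i = 0; i < 4; ++i)
    {
        // b[i + 1] has not been modified yet when it is read here.
        uint8_t carry = (i < 3) ? uint8_t(b[i + 1] << 7) : uint8_t(0);
        b[i] = uint8_t(((b[i] >> 1) | carry) ^ mask);
    }
}

// Helper for checking that both variants agree on a given code.
bool matchesReference(uint32_t v)
{
    uint8_t b[4] = {uint8_t(v), uint8_t(v >> 8), uint8_t(v >> 16), uint8_t(v >> 24)};
    unzigzag32Bytes(b);
    uint32_t r = uint32_t(b[0]) | (uint32_t(b[1]) << 8) | (uint32_t(b[2]) << 16) | (uint32_t(b[3]) << 24);
    return r == unzigzag32(v);
}
```

No packing or unpacking is needed in this form, since every step is a per-byte shift, or, and xor.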

To make this easier, scalar delta decoding is now also done 4 bytes at a time, using a template to share C++ code to the extent possible.
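A scalar decoder along these lines might look like the sketch below, which processes one 4-byte channel per vertex and uses loops over `sizeof(T)` rather than width-specific branches. The function name, the interleaved input layout and the `last` parameter are assumptions made for illustration; the actual decoder reads transposed data and differs in detail.

```cpp
#include <cstddef>
#include <cstdint>

// T is expected to be uint8_t, uint16_t or uint32_t.
template <typename T>
T unzigzag(T v)
{
    return T((v >> 1) ^ (T(0) - (v & 1)));
}

// Decodes one 4-byte channel for `count` vertices (hypothetical layout):
//   deltas: count * 4 bytes of little-endian delta codes, 4 per vertex
//   out:    count * 4 bytes of decoded little-endian values
//   last:   previous decoded bytes of this channel, updated in place
template <typename T, bool Xor>
void decodeChannel(const uint8_t* deltas, uint8_t* out, size_t count, uint8_t last[4])
{
    for (size_t i = 0; i < count; ++i)
    {
        for (size_t k = 0; k < 4; k += sizeof(T))
        {
            // Assemble the delta code and the previous value from bytes.
            T code = 0, prev = 0;
            for (size_t b = 0; b < sizeof(T); ++b)
            {
                code = T(code | (T(deltas[i * 4 + k + b]) << (8 * b)));
                prev = T(prev | (T(last[k + b]) << (8 * b)));
            }

            T value = Xor ? T(code ^ prev) : T(prev + unzigzag(code));

            // Store the result and remember it for the next vertex.
            for (size_t b = 0; b < sizeof(T); ++b)
                out[i * 4 + k + b] = last[k + b] = uint8_t(value >> (8 * b));
        }
    }
}
```

The same template covers byte, halfword and word deltas by instantiating it with `uint8_t`, `uint16_t` or `uint32_t`.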

The implementation mostly mirrors the SSE path; however, we use the same byte-wise approach for 16-bit deltas as for 32-bit ones: instead of unpacking and repacking the full value, we opt for byte-wise processing. This is even more efficient than it is on SSE since NEON supports native byte shifts.

The implementation mirrors the NEON approach exactly: we use simplified byte-wise processing for both 16-bit and 32-bit deltas; the 16-bit path is faster this way even when targeting x64 in the Wasm backend.

Instead of unary negation, use subtraction; also convert sizeof-based conditions in scalar decoder to a loop which the compiler should unroll anyway.

For unclear reasons, despite the seemingly better codegen of the byte-wise version, unzigzag16 with unpacks is a little faster on clang. However, the unpack version is noticeably slower on MSVC in a way that affects top line decode bandwidth, so let's opt for the byte-wise version to avoid surprises. Also switch to using _mm_andnot_si128 explicitly; some compilers can automatically transform this but it's easy to do ourselves for safety.

Using encodeBytesMeasure requires aligned byte groups, but we were not padding the last block with zeroes previously which triggered an assertion.

Instead of hardcoding each data width, we can implement this with the same approach we've used for decodeDeltas1, templating over the type and using loops over sizeof(T). This runs at the same speed but is shorter and hopefully easier to maintain.

For clarity we print dots for higher bytes within the same channel that are using the same type of delta. Also fix the first trace line being hard to read in codectest.

Both xors and 32-bit deltas are not particularly profitable on real data; of the two, 32-bit deltas take more effort to support whereas xor is not only easier to support but also decodes faster. In the future we might keep xor mode to be able to select faster deltas for channels that do not compress at all anyway.

zeux added 15 commits December 11, 2024 13:03

demo: Add encoding throughput metric to output

1e1dd2f

vertexcodec: Implement support for scalar delta decoding

55831a3

vertexcodec: Implement SIMD delta decoding support for Wasm

64c25da

vertexcodec: Fix MSVC warnings

ea7fff0

vertexcodec: Fix delta estimation for unaligned padding

e8021f8

vertexcodec: Expand TRACE output to include channels

34614da

zeux force-pushed the vcone-delc branch from e36ebc7 to 0e17ab0 Compare December 13, 2024 01:40

Adjust GHA workflows to skip Draco assets better

0c69838

zeux force-pushed the vcone-delc branch from a1b4787 to 0c69838 Compare December 13, 2024 01:53

zeux merged commit fbc7c93 into master Dec 13, 2024
12 checks passed

zeux deleted the vcone-delc branch December 13, 2024 02:01
