vertexcodec: Implement enhanced delta encoding #820

Merged
merged 16 commits into from
Dec 13, 2024

zeux
Owner

@zeux zeux commented Dec 12, 2024

When using the new format, we can now select between byte and halfword
deltas, or switch to XOR without zigzag encoding (which is
width-invariant).

Different types of deltas are optimal for different content; 2-byte
deltas provide the most value, whereas 4-byte and XOR deltas are more
experimental. This PR initially contained support for 4-byte deltas as well,
but they increased complexity without providing value, so they were removed.
XOR deltas could still be used to optimize decoding performance in cases
where they don't change the compression ratio.
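To make the difference between the modes concrete, here is a minimal sketch of the three delta flavors applied to one 4-byte channel. The names (`DeltaMode`, `encodeChannel`), the little-endian byte layout and the all-zero starting vertex are assumptions made for illustration; this is not the code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mode names for illustration; actual bitstream values are not shown here.
enum class DeltaMode { Byte, Halfword, Xor };

// Zigzag maps small signed deltas to small unsigned values: 0,-1,1,-2 -> 0,1,2,3.
// T is an unsigned type holding the two's complement delta.
template <typename T>
static T zigzag(T v)
{
    const int bits = int(sizeof(T) * 8);
    return T((v << 1) ^ (T(0) - (v >> (bits - 1))));
}

// Encodes one 4-byte channel (4 little-endian bytes per vertex) against the previous vertex.
std::vector<uint8_t> encodeChannel(const std::vector<uint8_t>& data, DeltaMode mode)
{
    assert(data.size() % 4 == 0);
    std::vector<uint8_t> out(data.size());
    uint8_t last[4] = {}; // assumption: the first vertex is delta-coded against zeros

    for (size_t i = 0; i < data.size(); i += 4)
    {
        const uint8_t* cur = &data[i];
        switch (mode)
        {
        case DeltaMode::Byte: // every byte is differenced independently
            for (int k = 0; k < 4; ++k)
                out[i + k] = zigzag(uint8_t(cur[k] - last[k]));
            break;
        case DeltaMode::Halfword: // bytes are paired so carries stay inside the delta
            for (int k = 0; k < 4; k += 2)
            {
                uint16_t c = uint16_t(cur[k] | (cur[k + 1] << 8));
                uint16_t l = uint16_t(last[k] | (last[k + 1] << 8));
                uint16_t d = zigzag(uint16_t(c - l));
                out[i + k] = uint8_t(d);
                out[i + k + 1] = uint8_t(d >> 8);
            }
            break;
        case DeltaMode::Xor: // no zigzag needed, and the result does not depend on width
            for (int k = 0; k < 4; ++k)
                out[i + k] = uint8_t(cur[k] ^ last[k]);
            break;
        }
        for (int k = 0; k < 4; ++k)
            last[k] = cur[k];
    }
    return out;
}
```

For a 16-bit component that steps from 255 to 256, the byte mode produces a nonzero delta in both bytes, while the halfword mode produces a single small value in the low byte and a zero in the high byte.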

The delta encoding is stored as a channel type for every 4 bytes of the
vertex; for now we only need to encode a 2-bit value, but we reserve a full
byte for simplicity - this may change in the future to accommodate rotates.

While varying the channel data per vertex block may be helpful, it
increases the block overhead and switching encodings for adjacent blocks
penalizes efficiency of LZ compressors. As such, we will only work with
a single encoding across the entire data stream for every channel.

To select the delta encoding, we try to encode deltas in all possible ways
and count the estimated number of bytes using a specific control mode.
This is not technically optimal as it ignores other subtleties of byte group
encoding, but ends up close to optimal in terms of efficiency. This also
significantly slows down the encoding process, but that can be fixed once we
settle on the delta types we need to support.
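The selection strategy described above can be sketched as "encode every candidate, estimate, keep the cheapest". The cost model below is a stand-in chosen for illustration (groups of 16 bytes, 0/2/4/8 bits per byte, 2 header bits per group); it is an assumption and not the measurement used by the encoder, and `estimateBits` / `pickCheapest` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in cost model (an assumption): split the bytes into groups of 16 and charge each
// group the smallest of 0/2/4/8 bits per byte that fits its largest value, plus 2 header bits.
size_t estimateBits(const std::vector<uint8_t>& bytes)
{
    size_t total = 0;
    for (size_t i = 0; i < bytes.size(); i += 16)
    {
        uint8_t maxv = 0;
        for (size_t j = i; j < bytes.size() && j < i + 16; ++j)
            if (bytes[j] > maxv)
                maxv = bytes[j];

        int bits = maxv == 0 ? 0 : maxv < 4 ? 2 : maxv < 16 ? 4 : 8;
        total += 2 + 16 * size_t(bits);
    }
    return total;
}

// Returns the index of the candidate with the lowest estimate; ties keep the earliest one.
size_t pickCheapest(const std::vector<std::vector<uint8_t>>& candidates)
{
    assert(!candidates.empty());
    size_t best = 0;
    size_t bestBits = estimateBits(candidates[0]);

    for (size_t i = 1; i < candidates.size(); ++i)
    {
        size_t bits = estimateBits(candidates[i]);
        if (bits < bestBits)
        {
            best = i;
            bestBits = bits;
        }
    }
    return best;
}
```

Each candidate here would be the same channel encoded with a different delta type. Because every candidate has to be encoded before it can be measured, the cost of selection grows with the number of delta types, which matches the slowdown noted above.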

With this, we're up to >5% gains overall, with no loss of decoding performance:

143 files: raw v1/v0 -5.24%, lz4 v1/v0 -4.21%, zstd v1/v0 -2.76%

With just 1-byte and 2-byte deltas, we get:

143 files: raw v1/v0 -5.20%, lz4 v1/v0 -4.21%, zstd v1/v0 -2.75%

As before, all of these changes are to an experimental version that has absolutely
no guarantees of bitstream compatibility until it gets renumbered to 1.

This contribution is sponsored by Valve.

zeux added 15 commits December 11, 2024 13:03
When the new encoding is selected, in addition to storing the first vertex
in the tail we also store channel data. Channel data will encode the
transformation type for every 4 bytes of the vertex; for now we only
need to encode a 2-bit value, but we reserve a full byte for simplicity
- and we may need 2 bytes for this in the future to accommodate rotates.

While varying the channel data per vertex block may be helpful, it
increases the block overhead and switching encodings for adjacent blocks
penalizes efficiency of LZ compressors. As such, we will only work with
a single encoding across the entire data stream for every channel.

This still encodes just one byte at a time, but that simplifies the
encode flow a little bit. We should be able to compute deltas using
other bytes in the same channel based on the specified encoding type
without changing the flow here.

When using the new format we can now select between byte, halfword, word
deltas, or switch to xor without zigzag encoding (which is
width-invariant).

Different types of deltas are optimal for different content; 2-byte
deltas provide the most value, whereas 4-byte and XOR deltas are more
experimental and may be removed in the future.

Ideally it would be possible to select the optimal mode by looking at
the source data without repeat encoding attempts; this will be
investigated separately.

When using the new version, we can now pick the optimal channel
encoding; for now, we do this in a fairly naive fashion: we encode
deltas in all possible ways and count the number of bytes.

This does not, strictly speaking, minimize the output size: for
simplicity, we estimate the number of bytes using a specific fixed bit
group layout, which seems pretty close in terms of its efficiency for
any delta encoding type. This is still not very fast, but hopefully
better, faster heuristics can be developed in the future.

This helps contextualize the encoding performance differences between
v0/v1. For completeness, this is also added for index codecs although
there are no plans for improving these.

In addition to accumulating the deltas via 16-bit or 32-bit adds (or
xors), we need to perform zigzag decoding in larger units. This is more
complicated, because for 8-bit data we do this before the transpose to
fully utilize SIMD units; thus, 16 and 32-bit unzigzag needs to work on
decomposed integer values.

For 16-bit, it's easy enough to unpack and repack these; for 32-bit,
repacking is more involved due to the lack of unsigned 32-bit packing
instructions. So we opt for doing a byte-wise unzigzag instead, which
just needs to propagate the bits when shifting the value by 1 across
byte boundaries manually.
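As a scalar illustration of the byte-wise unzigzag idea, the sketch below undoes zigzag on a 32-bit value that is held as four separate bytes, carrying the low bit of each higher byte into the top bit of the byte below it. The SIMD versions operate on whole registers of decomposed bytes; this per-value form and the names `unzigzag32` / `unzigzagBytes` are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Reference: unzigzag applied to a whole 32-bit value.
uint32_t unzigzag32(uint32_t v)
{
    return (v >> 1) ^ (0u - (v & 1));
}

// Byte-wise variant operating in place on decomposed bytes; b[0] is the least significant byte.
void unzigzagBytes(uint8_t b[4])
{
    // The sign lives in the lowest bit of the lowest byte; expand it to 0x00 or 0xFF.
    uint8_t mask = uint8_t(0 - (b[0] & 1));

    for (int k = 0; k < 4; ++k)
    {
        // Shifting right by 1 pulls the low bit of the next byte into bit 7 of this one.
        // b[k + 1] has not been modified yet when it is read here.
        uint8_t carry = k < 3 ? uint8_t(b[k + 1] << 7) : uint8_t(0);
        b[k] = uint8_t(((b[k] >> 1) | carry) ^ mask);
    }
}
```

Both functions compute the same result; the byte-wise form avoids reassembling the wide integer, which is the property that matters when the bytes of each value live in different SIMD registers.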
To make this easier, scalar delta decoding is now also done 4 bytes at a
time, using a template to share C++ code to the extent possible.

The implementation mostly mirrors the SSE path; however, we use the same
approach for 16-bit delta, where instead of unpacking and repacking the
full value we opt for bytewise processing. This is even more optimal
than it is for SSE since NEON supports native byte shifts.

The implementation mirrors the NEON approach exactly: we use simplified
byte-wise processing for both 16-bit and 32-bit deltas; the 16-bit path is
faster this way even when targeting x64 in the Wasm backend.

Instead of unary negation, use subtraction; also convert sizeof-based
conditions in scalar decoder to a loop which the compiler should unroll
anyway.

For unclear reasons, despite seemingly better codegen, unzigzag16 with
unpacks is a little faster on clang compared to the byte-wise version.
However, it's noticeably slower on MSVC in a way that affects top line
decode bandwidth, so let's opt for using a better version to avoid
surprises.

Also switch to using _mm_andnot_si128 explicitly; some compilers can
automatically transform this but it's easy to do ourselves for safety.

Using encodeBytesMeasure requires aligned byte groups, but we were not
padding the last block with zeroes previously which triggered an
assertion.

Instead of hardcoding each data width, we can implement this with the
same approach we've used for decodeDeltas1, templating over the type and
using loops over sizeof(T). This runs at the same speed but is shorter
and hopefully easier to maintain.
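The "template over the type, loop over sizeof(T)" pattern can be sketched on the decode side as follows. This is a simplified illustration under stated assumptions (4 little-endian bytes per vertex, an all-zero starting vertex, the hypothetical name `decodeChannel`); it is not the library's decodeDeltas1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decodes zigzag deltas for one 4-byte channel; T selects the delta width
// (uint8_t or uint16_t in this sketch). `deltas` holds 4 little-endian bytes per vertex.
template <typename T>
std::vector<uint8_t> decodeChannel(const std::vector<uint8_t>& deltas)
{
    assert(deltas.size() % 4 == 0);
    std::vector<uint8_t> out(deltas.size());
    T last[4 / sizeof(T)] = {}; // assumption: accumulation starts from zeros

    for (size_t i = 0; i < deltas.size(); i += 4)
        for (size_t k = 0; k < 4; k += sizeof(T))
        {
            // Assemble a T from sizeof(T) bytes; the trip count is a compile-time constant.
            T v = 0;
            for (size_t b = 0; b < sizeof(T); ++b)
                v = T(v | T(T(deltas[i + k + b]) << (8 * b)));

            // Unzigzag using subtraction rather than unary negation, then accumulate.
            T d = T((v >> 1) ^ (T(0) - T(v & 1)));
            T& acc = last[k / sizeof(T)];
            acc = T(acc + d);

            // Scatter the accumulated value back into bytes.
            for (size_t b = 0; b < sizeof(T); ++b)
                out[i + k + b] = uint8_t(acc >> (8 * b));
        }

    return out;
}
```

Instantiating this with `uint8_t` or `uint16_t` covers both delta widths with one body, which is the maintenance benefit the loop-based version aims for.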
For clarity we print dots for higher bytes within the same channel that
are using the same type of delta.

Also fix first trace line being hard to read in codectest.

Both xors and 32-bit deltas are not particularly profitable on real
data; of the two, 32-bit deltas take more effort to support whereas xor
is not only easier to support but also decodes faster.

In the future we might keep xor mode to be able to select faster deltas
for channels that do not compress at all anyway.
@zeux zeux merged commit fbc7c93 into master Dec 13, 2024
12 checks passed
@zeux zeux deleted the vcone-delc branch December 13, 2024 02:01