Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistency in the last checkpoint metadata field naming: snake_case used instead of camelCase #326

Closed
hackintoshrao opened this issue Sep 6, 2024 · 4 comments · Fixed by delta-io/delta-rs#2889

Comments

@hackintoshrao
Copy link
Contributor

hackintoshrao commented Sep 6, 2024

While investigating snapshot creation for earlier versions (PR #322), I uncovered an inconsistency in how checkpoint metadata fields are named in the _last_checkpoint file.

Current Behavior:
The _last_checkpoint file contains fields in snake_case format. For example:

{"size":8,"size_in_bytes":21857,"version":1}

Expected Behavior:
According to the Delta Protocol specification, these fields should be in camelCase. The expected format should be:

{"size":8,"sizeInBytes":21857,"version":1}

Impact:
This inconsistency causes issues with deserialization when the CheckpointMetadata struct is annotated with #[serde(rename_all = "camelCase")], leading to some fields (like size_in_bytes) being incorrectly set to None.

Current Workaround:
We've temporarily addressed this in PR #322 by adding an alias to the affected field:

#[serde(alias = "size_in_bytes")]
pub(crate) size_in_bytes: Option<i64>,

Proposed Long-term Solution:

  1. Investigate why delta-rs is writing checkpoint metadata in snake_case instead of camelCase.
  2. Update the checkpoint writing logic to use camelCase field names, aligning with the Delta protocol specification.
  3. Update the CheckpointMetadata struct to expect camelCase without needing aliases.
  4. If backwards compatibility is required, implement a more robust deserialization that can handle both camelCase and snake_case variants.

Questions to Address:

  1. Was the use of snake_case a deliberate choice? If so, what was the reasoning?
  2. Are there any backward compatibility concerns we need to consider when fixing this?
  3. Are there any other parts of the codebase affected by this inconsistency?
hackintoshrao added a commit to e6data/delta-kernel-rs that referenced this issue Sep 6, 2024
- Added serde alias for 'size_in_bytes' field in CheckpointMetadata struct
- This allows deserialization of both camelCase and snake_case variants
- Addresses issue with inconsistent field naming in _last_checkpoint file

This is a temporary workaround for the issue described in delta-io#326. The long-term
solution will involve aligning the checkpoint writing logic with the Delta
protocol specification to use camelCase field names consistently.

See delta-io#326 for full details.
@scovich
Copy link
Collaborator

scovich commented Sep 9, 2024

Investigate why delta-rs is writing checkpoint metadata in snake_case instead of camelCase.

My guess: rust clippy enforces snake_case for identifiers, and the json serde will automatically follow that convention unless specifically instructed to convert it to camelCase.

@scovich
Copy link
Collaborator

scovich commented Sep 9, 2024

Confirmed: https://github.com/delta-io/delta-rs/blob/main/crates/core/src/table/mod.rs#L42-L46

/// Metadata for a checkpoint file
#[derive(Serialize, Deserialize, Debug, Default, Clone, Copy)]
pub struct CheckPoint {
      ...
    #[serde(skip_serializing_if = "Option::is_none")]
    /// The number of bytes of the checkpoint. This field is optional.
    pub(crate) size_in_bytes: Option<i64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    /// The number of AddFile actions in the checkpoint. This field is optional.
    pub(crate) num_of_add_files: Option<i64>,
}

@feniljain
Copy link

feniljain commented Sep 20, 2024

Corresponding PR on the delta-rs is merged :)

@zachschuermann
Copy link
Collaborator

Given this is fixed in delta-rs now, and the alias never landed in kernel, seems we can close this issue and only read the field as camelCase. Given _last_checkpoint files are rather ephemeral I wouldn't think we need to keep around the ability to read the incorrectly-written field?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants