parse table metadata.configuration as TableProperties #453
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@ Coverage Diff @@
##           main    #453   +/- ##
==================================
  Coverage      ?  78.51%
==================================
  Files         ?      57
  Lines         ?   11914
  Branches      ?   11914
==================================
  Hits          ?    9354
  Misses        ?    2053
  Partials      ?     507
```

☔ View full report in Codecov by Sentry.
need to make all properties Options. why? for propagating unchanged in write. e.g. if appendOnly is omitted we want to omit in the write (not write its default)
kernel/src/scan/mod.rs
Outdated

```diff
@@ -95,7 +95,7 @@ impl ScanBuilder {
         let (all_fields, read_fields, have_partition_cols) = get_state_info(
             logical_schema.as_ref(),
             &self.snapshot.metadata().partition_columns,
-            self.snapshot.column_mapping_mode,
+            self.snapshot.table_properties().get_column_mapping_mode(),
```
We don't normally use `get_` for field accessors?

Suggested change:

```diff
-            self.snapshot.table_properties().get_column_mapping_mode(),
+            self.snapshot.table_properties().column_mapping_mode(),
```
(rust can handle structs with both `foo: Foo` and `fn foo(&self) -> &Foo` -- even if both are public)
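A minimal standalone illustration of that parenthetical (type names here are placeholders, not kernel types):

```rust
pub struct Foo;

pub struct Snapshot {
    // A public field...
    pub foo: Foo,
}

impl Snapshot {
    // ...and a public accessor with the same name. Field access (`s.foo`) and
    // method call (`s.foo()`) are distinct syntax, so both can coexist.
    pub fn foo(&self) -> &Foo {
        &self.foo
    }
}
```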
I'm curious to hear everyone's thoughts on this.. I think we need two things here (and the API above was attempting to answer the second part):

1. a struct to parse the table properties into (the TableProperties struct) which lets users (or engines) examine the table properties that are set
2. an API for checking table properties/feature enablement. This likely needs to check if table features are enabled in the protocol and check table properties being set and perhaps check on other dependent features. (i.e. we need somewhere that we can embed this logic)

this probably isn't the PR for tackling (2) - but would love to hear some thoughts so we could get started on a follow-up. For now I'll just do column_mapping_mode() to unblock this and we can iterate?
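For concreteness, a rough standalone sketch of what (2) might look like -- every name below is a hypothetical stand-in rather than an actual kernel API; the point is having a single place that combines the protocol check with the table-property check:

```rust
struct Protocol {
    writer_features: Vec<String>,
}

struct TableProperties {
    append_only: Option<bool>,
}

struct Snapshot {
    protocol: Protocol,
    table_properties: TableProperties,
}

impl Snapshot {
    // A feature is "in force" only if the protocol lists it AND the
    // corresponding table property is explicitly set.
    fn is_append_only(&self) -> bool {
        self.protocol.writer_features.iter().any(|f| f == "appendOnly")
            && self.table_properties.append_only.unwrap_or(false)
    }
}
```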
I think it makes a lot of sense. These are checks that are going to be made for every table feature, so it helps to have all of it in one place.
Yeah. Likely we don't need a method for each one and should rather check a struct of some sort. The API design is a bit subtle though, as some features can be set to multiple modes (i.e. column mapping) and others can just be on/off.
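A tiny sketch of that subtlety (field shapes follow this PR; the enum variants come from the Delta protocol's column mapping modes):

```rust
// Mode-valued feature: not just on/off.
enum ColumnMappingMode {
    None,
    Id,
    Name,
}

struct TableProperties {
    column_mapping_mode: Option<ColumnMappingMode>, // which mode, if any
    append_only: Option<bool>,                      // plain on/off
}
```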
"nanosecond" | "nanoseconds" => Duration::from_nanos(number), | ||
"microsecond" | "microseconds" => Duration::from_micros(number), | ||
"millisecond" | "milliseconds" => Duration::from_millis(number), | ||
"second" | "seconds" => Duration::from_secs(number), | ||
"minute" | "minutes" => Duration::from_secs(number * SECONDS_PER_MINUTE), | ||
"hour" | "hours" => Duration::from_secs(number * SECONDS_PER_HOUR), | ||
"day" | "days" => Duration::from_secs(number * SECONDS_PER_DAY), | ||
"week" | "weeks" => Duration::from_secs(number * SECONDS_PER_WEEK), |
Is this list coming from somewhere specific that we need to stay in sync with? (I wonder e.g. if we need to support nanos and micros.)

Probably worth documenting the connection.
this came from delta-rs (recycled from Robert's old PR) but I've documented that in the code and pointed out the differences between this and delta-spark's behavior. also created a follow-up to improve this but i think we can leave this here for now and improve in follow-up? #507
```rust
    }
    let mut it = value.split_whitespace();
    let _ = it.next(); // skip "interval"
    let number = parse_int(it.next().ok_or_else(not_an_interval)?)?;
```
is there a particular thing we're trying to catch by parsing as a signed number and then returning an error if it's negative? would it not make more sense to just parse as a u64 and fail if we fail to parse?

I guess the error message is a little nicer in this case. Not sure it's worth the extra code.
yea i made even nicer errors, lmk what you think
happy to delete if not worth it
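For reference, a standalone sketch of the trade-off under discussion -- parsing through i64 first lets the error distinguish "negative" from "not a number", which a direct u64 parse cannot (function and message names are illustrative, not the PR's):

```rust
fn parse_nonnegative(s: &str) -> Result<u64, String> {
    // Parse as signed first so a leading '-' still parses as a number...
    let n: i64 = s.parse().map_err(|_| format!("'{s}' is not an integer"))?;
    // ...then reject negatives with a more specific message.
    u64::try_from(n).map_err(|_| format!("count must be non-negative, got {n}"))
}

fn main() {
    assert_eq!(parse_nonnegative("5"), Ok(5));
    // A bare `s.parse::<u64>()` would report "-5" as simply unparseable.
    assert!(parse_nonnegative("-5").unwrap_err().contains("non-negative"));
}
```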
```rust
    /// existing records cannot be deleted, and existing values cannot be updated.
    #[serde(rename = "delta.appendOnly")]
    #[serde(deserialize_with = "deserialize_bool")]
    pub append_only: Option<bool>,
```
This is quite a bit of work, but it may be worth linking to the protocol directly here. We do it elsewhere in the codebase, and I think it's useful. Ex:
```rust
/// true for this Delta table to be append-only. If append-only,
/// existing records cannot be deleted, and existing values cannot be updated.
///
/// [Append-only Tables]: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#append-only-tables
```
still need to go through and do this.. will try to finish this in the morning
i'm leaving this for future work lol - it'll be a good first issue #519
kernel/src/actions/mod.rs
Outdated

```diff
@@ -113,6 +114,11 @@ impl Metadata {
     pub fn schema(&self) -> DeltaResult<StructType> {
         Ok(serde_json::from_str(&self.schema_string)?)
     }
 
+    /// Parse the metadata configuration HashMap<String, String> into a TableProperties struct.
+    pub fn parse_table_properties(&self) -> DeltaResult<TableProperties> {
```
I wonder if we should just eagerly parse TableProperties instead of doing the HashMap<String, String> to TableProperties conversion separately. Do we split it up because of the Schema derive?

```rust
#[derive(Debug, Default, Clone, PartialEq, Eq, Schema)]
pub struct Metadata {
```
yup exactly. for now this seemed like the best way to separate the schema (the Metadata struct) from the actual parsing of TableProperties. perhaps in the future we can look into unifying these? could just omit the derive and impl Schema ourselves (or add some new fancy mechanism that lets us annotate fields with #[derive(Schema)])
```rust
pub mod schema;
pub mod snapshot;
pub mod table;
pub mod table_properties;
```
only meaningful change is adding pub mod table_properties, other changes are shifting to colocate module declarations
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main     #453      +/- ##
========================================
+ Coverage   80.58%   81.02%   +0.44%
========================================
  Files          63       65       +2
  Lines       13762    14110     +348
  Branches    13762    14110     +348
========================================
+ Hits        11090    11433     +343
+ Misses       2113     2096      -17
- Partials      559      581      +22
```

☔ View full report in Codecov by Sentry.
acceptance/tests/dat_reader.rs
Outdated

```diff
@@ -12,6 +12,11 @@ fn reader_test(path: &Path) -> datatest_stable::Result<()> {
         path.parent().unwrap().to_str().unwrap()
     );
 
+    // TODO(zach): skip iceberg_compat_v1 test until DAT is fixed
```
lgtm! just please do fix how you skip the iceberg test, and use the existing stuff
Mostly nits, but the question about fallback behavior for properties that exist but fail to parse seems important?
Do these custom deserializers have a fallback for cases where the table property exists but cannot parse to the expected type? Ideally, those should land in the unknown_properties map along with everything else, with the parsed version being None.

Needed because many table properties are not necessarily in the Delta spec, and so different clients may use them differently?
oh I currently implemented that as failure: see the fail_known_keys test. You're saying the behavior should instead be something like this?

```rust
#[test]
fn known_key_unknown_val() {
    let properties = HashMap::from([("delta.appendOnly".to_string(), "wack".to_string())]);
    let de = MapDeserializer::<_, de::value::Error>::new(properties.clone().into_iter());
    let table_properties = TableProperties::deserialize(de).unwrap();
    let unknown_properties =
        HashMap::from([("delta.appendOnly".to_string(), "wack".to_string())]);
    let expected = TableProperties {
        unknown_properties,
        ..Default::default()
    };
    assert_eq!(table_properties, expected);
}
```
Yes, exactly. Otherwise I fear we'll eventually add yet another parsing failure ticket to the pile, along with all the CommitInfo issues.
ah yea fair point.. let me see how to do this :)
welp this was the straw that broke serde's back.. instead of implementing a custom Deserialize I just decided to do the naive thing and implement a big match by hand. It actually isn't that gross, and we can always try serde again or just write our own macro to make everything nice
```rust
    /// This value should be large enough to ensure that:
    ///
    /// * It is larger than the longest possible duration of a job if you run VACUUM when there are
    ///   concurrent readers or writers accessing the Delta table.
    /// * If you run a streaming query that reads from the table, that query does not stop for
    ///   longer than this value. Otherwise, the query may not be able to restart, as it must still
    ///   read old files.
```
aside: will we have a check that blocks too-small intervals? ie less than one day?
we could! is that in the protocol/implement in delta-spark? (i'll check too)
I think it's in the delta-spark impl? You have to "break glass" to vacuum with an interval less than X?
perhaps take this as a follow-up? (not implementing vacuum currently and other implementations that consume this could check themselves?)
created #522
```rust
                    .unknown_properties
                    .insert(k.as_ref().to_string(), v.as_ref().to_string());
```
now that I think about it maybe we could actually consume the original hashmap, removing the items we parse from it and storing the remainder in unknown_properties.. but gonna leave that for future work/probably not really worth optimizing
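A hypothetical sketch of that "consume the original" idea (field names mirror this PR, but the wiring is an assumption, not the actual implementation):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TableProperties {
    append_only: Option<bool>,
    unknown_properties: HashMap<String, String>,
}

fn parse_properties(mut config: HashMap<String, String>) -> TableProperties {
    let mut props = TableProperties::default();
    // Take each recognized key out of the map by value; no string cloning.
    if let Some(v) = config.remove("delta.appendOnly") {
        // (a fuller version would push unparseable values back into the remainder)
        props.append_only = v.parse::<bool>().ok();
    }
    // ...one `remove` per known key...
    props.unknown_properties = config; // whatever is left over is unknown
    props
}
```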
See other comment above -- consuming the original allows a pretty significant code simplification
restamping. lgtm
"delta.autoOptimize.optimizeWrite" => { | ||
parse_bool(v.as_ref()).map(|val| props.optimize_write = Some(val)) | ||
} | ||
"delta.checkpointInterval" => parse_positive_int(v.as_ref()) | ||
.map(|val| props.checkpoint_interval = Some(val)), |
aside: I'm confused.. how does cargo fmt decide to only sometimes use {} for these structurally identical match arms?
yea I noticed the same thing and was unhappy haha
```rust
        for (k, v) in unparsed {
            let parsed =
                match k.as_ref() {
                    "delta.appendOnly" => {
```
Does this loop have quadratic behavior because the match arms are checked sequentially?
Or is rust smart enough to build a perfect hash of some kind?
With 23 table properties so far, a linear string of comparisons translates to 23*22/2 = 253 expected string comparisons in the worst case where all the properties are set. And a bunch more are coming because almost every table feature has a table property that decides whether it's active.
Still tho... 256 short string comparisons would be a few microseconds at most, so prob not worth the trouble to be smarter.
Ah haha this is funny that I had the same thought and dismissed it as I felt like I was being too performance-concerned! Glad I'm not crazy :)

I think you're correct and the rust compiler doesn't (at least in 1.73 stable) implement anything fancy here - according to quick chatGPT inquiry

really i think our options are:

1. leverage custom Deserialize from serde for constant-time lookup (/linear-time loop).
   - pros: well-known efficient solution, elegant annotations on the struct for e.g. renaming the field for deserialization
   - cons: extra overhead from serde considering we have such a constrained case of only ever deserializing from a HashMap<String, String>, code complexity of custom Deserialize
2. leverage HashMap-based lookup instead of the match for constant-time. could even use a crate like phf to get compile-time generation of perfect hash functions for string matching.
   - pros: relatively well-known efficient solution, straightforward/understandable
   - cons: increased code size/complexity (similar to serde approach above)
3. leave it as-is
   - pros: simplest/fastest
   - cons: worst performance

given these options I propose we move forward as-is since this isn't really a blocker but we open an issue to implement option (1) or (2) in the future - this would be a great 'first-issue' for someone to tackle and ramp up on a little piece of the project.
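To make option (2) concrete, a hypothetical sketch using the phf crate -- the struct and the setter wiring below are illustrative stand-ins for this PR's code, not its actual implementation:

```rust
use std::collections::HashMap;

use phf::phf_map;

#[derive(Default)]
struct TableProperties {
    append_only: Option<bool>,
    unknown_properties: HashMap<String, String>,
}

type Setter = fn(&mut TableProperties, &str) -> Option<()>;

fn set_append_only(props: &mut TableProperties, v: &str) -> Option<()> {
    v.parse::<bool>().ok().map(|b| props.append_only = Some(b))
}

// Perfect hash generated at compile time: each lookup is O(1) instead of
// walking a linear chain of string comparisons.
static KNOWN_PROPERTIES: phf::Map<&'static str, Setter> = phf_map! {
    "delta.appendOnly" => set_append_only,
    // ...one entry per known property...
};

fn apply(props: &mut TableProperties, k: String, v: String) {
    match KNOWN_PROPERTIES.get(k.as_str()) {
        Some(set) if set(props, &v).is_some() => {} // known key, value parsed
        _ => {
            // unknown key, or a known key whose value failed to parse
            props.unknown_properties.insert(k, v);
        }
    }
}
```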
kernel/src/table_properties.rs
Outdated

```rust
    #[test]
    fn allow_unknown_keys() {
        let properties = [("some_random_unknown_key".to_string(), "test".to_string())];
```
Could be "fun" to do:

```diff
-        let properties = [("some_random_unknown_key".to_string(), "test".to_string())];
+        let properties = [("unknown_properties".to_string(), "messing with your mind".to_string())];
```
haha will add!
did something slightly shorter to keep on one line
Nice. Just a couple questions (but could be follow-up items if you want)
```diff
 /// Modes of column mapping a table can be in
-#[derive(Serialize, Deserialize, Debug, Copy, Clone, PartialEq)]
+#[derive(Debug, Default, EnumString, Serialize, Deserialize, Copy, Clone, PartialEq, Eq)]
+#[strum(serialize_all = "camelCase")]
 #[serde(rename_all = "camelCase")]
```
aside: Do these classes still need serde support?
unfortunately yea, since it is still a field in GlobalScanState (which derives Serialize/Deserialize)
```diff
@@ -99,7 +102,9 @@ pub(crate) fn parse_bool(s: &str) -> Option<bool> {
 /// Deserialize a comma-separated list of column names into an `Option<Vec<ColumnName>>`. Returns
 /// `Some` if successfully parses, and `None` otherwise.
 pub(crate) fn parse_column_names(s: &str) -> Option<Vec<ColumnName>> {
-    ColumnName::parse_column_name_list(s).ok()
+    ColumnName::parse_column_name_list(s)
+        .map_err(|e| warn!("column name list failed to parse: {e}"))
```
This looks like a job for inspect_err?
ah yep good catch thanks - I'll just squeeze that into the next small PR
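For reference, the inspect_err variant being suggested would look something like this (Result::inspect_err has been stable since Rust 1.76):

```rust
    ColumnName::parse_column_name_list(s)
        .inspect_err(|e| warn!("column name list failed to parse: {e}"))
        .ok()
```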
```rust
    fn test_parse_column_names() {
        assert_eq!(
            parse_column_names("`col 1`, col.a2,col3").unwrap(),
            vec![
                ColumnName::new(["col 1"]),
                column_name!("col.a2"),
                column_name!("col3")
            ]
        );
    }
```
test_parse_column_name_list in column_names.rs already covers this pretty thoroughly?
hm yea maybe that was unnecessary - i'll remove since this is basically just a pass-through to the column name parsing
What changes are proposed in this pull request?

New TableProperties struct parses the Metadata action's configuration into a strongly-typed configuration struct. Instead of previously manually querying specific keys (e.g. checking the ColumnMapping key) we can leverage the configuration to appropriately handle DVs, column mapping, etc. in a structured manner.

Pragmatically, the changes are:

1. New TableProperties struct along with a new Snapshot::get_table_properties() API
   a. Note that every field is optional, and additionally there is an unknown_properties field which includes a hashmap of all fields that did not parse to a known table property.
2. New table_properties module, including a deserialization submodule that implements functions to deserialize the fields of TableProperties

Future work

- TableProperties fields with protocol links #519
- TableProperties parsing performance #535

This PR affects the following public APIs

New TableProperties struct and Snapshot::get_table_properties API.

How was this change tested?

New unit tests.

Credit: @roeap for the original PR which this is based on: #222