Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds SQL support for Configurable Table Snapshot History #262

Merged
merged 10 commits into from
Dec 19, 2024

Conversation

Will-Lo
Copy link
Collaborator

@Will-Lo Will-Lo commented Dec 4, 2024

Summary

Adds Spark SQL support for configurable table snapshots, which controls the versioning of the Openhouse tables.
Syntax is similar to retention but is instead defined as HISTORY.

History configuration supports both MAX_AGE and VERSIONS, where we retain all table snapshots that live within MAX_AGE and within VERSIONS.

Example: A table with MAX_AGE = 1d will retain all snapshots that are within 1 day of when the snapshot retention job last ran.
A table with VERSIONS = 5 will retain the last 5 snapshots of the table without considering the age of the snapshots
If both MAX_AGE = 1d and VERSIONS = 5 is defined, keep the last 5 snapshots within the last day. Note: If there are less than 5 snapshots, then there were less than 5 commits done in the past day.

MAX_AGE and VERSIONS cannot be defined as less than 1.
The default maximums of MAX_AGE and VERSIONS defined in #259 are 3 days and 100 versions respectively.

Examples:

ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=24H VERSIONS=10)
ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=3D)
ALTER TABLE <db>.<table> SET POLICY (HISTORY VERSIONS=5)

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Tested on local docker running spark:
Tested setting both policies

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=3d VERSIONS=3 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+-----------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                            |
+--------+-----------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 3,
    "granularity": "DAY",
    "versions": 3
  }
}|
+--------+-----------------------------------------------------------------------------------------------------------------+

Setting only versions:

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY VERSIONS=20 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+----------------------------------------------------------------------------------------+
|key     |value                                                                                   |
+--------+----------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 0,
    "versions": 20
  }
}|
+--------+----------------------------------------------------------------------------------------+

Setting only max age

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=8h )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+------------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                             |
+--------+------------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 8,
    "granularity": "HOUR",
    "versions": 0
  }
}|
+--------+------------------------------------------------------------------------------------------------------------------+

Also tested negative cases (invalid numbers, past maximums defined in #259)
e.g.

scala> spark.sql("ALTER TABLE openhouse.db.tb SET POLICY ( HISTORY MAX_AGE=1h VERSIONS=2 )").show
org.apache.iceberg.exceptions.BadRequestException: 400 , {"status":"BAD_REQUEST","error":"Bad Request","message":" : History for the table [LocalHadoopCluster.db.tb] max age must be between 1 to 3 days","stacktrace":null,"cause":"Not Available"}


For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@Will-Lo Will-Lo marked this pull request as ready for review December 4, 2024 02:32
Copy link
Member

@abhisheknath2011 abhisheknath2011 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Will-Lo. Overall looks good. Added some comments.

@Will-Lo Will-Lo changed the title Adds SQL support for Configurable Table Snapshots Adds SQL support for Configurable Table Snapshot History Dec 4, 2024
@Will-Lo Will-Lo closed this Dec 6, 2024
@Will-Lo Will-Lo reopened this Dec 6, 2024
@Will-Lo Will-Lo force-pushed the snapshots_policy_sql branch from a2316b4 to 6b7999a Compare December 6, 2024 19:16
@Will-Lo Will-Lo closed this Dec 6, 2024
@Will-Lo Will-Lo reopened this Dec 6, 2024
@maluchari
Copy link
Collaborator

Summary

Adds Spark SQL support for configurable table snapshots, which controls the versioning of the Openhouse tables. Syntax is similar to retention but is instead defined as HISTORY.

Examples:

ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=24H MIN_VERSIONS=10)
ALTER TABLE <db>.<table> SET POLICY (HISTORY MAX_AGE=3D)
ALTER TABLE <db>.<table> SET POLICY (HISTORY MIN_VERSIONS=5)

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Tested on local docker running spark: Tested setting both policies

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=3d MIN_VERSIONS=3 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+-----------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                            |
+--------+-----------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 3,
    "granularity": "DAY",
    "minVersions": 3
  }
}|
+--------+-----------------------------------------------------------------------------------------------------------------+

Setting only versions:

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MIN_VERSIONS=20 )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+----------------------------------------------------------------------------------------+
|key     |value                                                                                   |
+--------+----------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 0,
    "minVersions": 20
  }
}|
+--------+----------------------------------------------------------------------------------------+

Setting only max age

scala> spark.sql("ALTER TABLE openhouse.db.tb2 SET POLICY ( HISTORY MAX_AGE=8h )").show
++
||
++
++


scala> spark.sql("SHOW TBLPROPERTIES openhouse.db.tb2 (policies)").show(truncate=false)
+--------+------------------------------------------------------------------------------------------------------------------+
|key     |value                                                                                                             |
+--------+------------------------------------------------------------------------------------------------------------------+
|policies|{
  "sharingEnabled": false,
  "history": {
    "maxAge": 8,
    "granularity": "HOUR",
    "minVersions": 0
  }
}|
+--------+------------------------------------------------------------------------------------------------------------------+

Also tested negative cases (invalid numbers, past maximums defined in #259)

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Can we please add details regarding the SQL API contract? If both versions and max age is present what is expected for example.

maluchari
maluchari previously approved these changes Dec 16, 2024
Copy link
Collaborator

@maluchari maluchari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments. But overall LGTM

Copy link
Member

@abhisheknath2011 abhisheknath2011 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring the tests and answering the questions. Added minor comment, otherwise LGTM!

autumnust
autumnust previously approved these changes Dec 16, 2024
Copy link
Collaborator

@autumnust autumnust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Will-Lo Will-Lo dismissed stale reviews from abhisheknath2011 and maluchari via 6039e0b December 17, 2024 19:53
@Will-Lo Will-Lo merged commit abccdc3 into linkedin:main Dec 19, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants