2024.09.17 F2F Editors Meeting
- Julian Morley (Stanford)
- Rosy Metz (Emory)
- Simeon Warner (Cornell)
- Andrew Woods (Harvard)
- Neil Jefferies (Oxford)
- Arrival in Ghent no later than Sunday Sept. 15th
- Do the workshop on Monday Sept. 16th
- Details TBD during the week, but we will have meetings Tuesday - Thursday
- Full day meeting on Friday Sept. 20th that will end no earlier than Friday at 5pm local time (i.e., your flights need to take this into consideration)
- Triage/create tickets/topics into spec or implementation notes
- Determine what tickets we can group together
- Address any low hanging fruit tickets during the meeting
- Identify a timeline for v2.0; consider:
  - what release candidate timeframe do we want for gathering feedback?
  - how will we support updating tooling?
  - how many validators do we need?
- Use Case Repository: Specifically the issues labeled Confirmed In-scope
- Spec Repository
- Draft Spec
- Extensions Repository
- OCFL Package Per Version Workgroup Notes
Must be done by 4pm for the 4:30pm DPC Awards Neil is nominated for.
Time | Topic |
---|---|
12pm - 2pm | OCFL Workshop at iPRES |
2:30pm - 3:30pm | Create Schedule for the Week |
Must be done by 3:30pm for a 4pm panel Neil is on.
Time | Topic |
---|---|
9am - 10:30am | Application Profiles: Use Case #50 |
11:00am - 12:30pm | Coffee with Jürgen |
12:30pm - 1:30pm | Lunch at the Conference |
1:30pm - 2:30pm | Application Profiles: Use Case #50 |
2:30pm - 3:30pm | File Deletion, Corruption, Loss: Use Cases #42 and #14 |
6pm - onwards | Conference Reception |
Use case: https://github.com/OCFL/Use-Cases/issues/50
Editors agree that application profiling will be handled through an extension and not through changes to the specification.
Extension 0008: Schema Registry may provide inspiration for addressing this use case. The extension could be as simple as pointing to documentation, or can be more complex like Jürgen's self-documenting and machine-actionable objects. The minimal storage root extension might be a profile directory in the extension directory that includes human readable description and/or links to external documentation. Refinement might allow use at the object level and possible reference from an object to profiles at the storage root level. Will create a draft extension for community discussion.
Handling file loss, file corruption or version collapse all change the assumption of version immutability. This is necessarily a version 2 concept so it can only apply in a version 2+ storage root. Even with a notionally immutable system, one can have corruption. Possible solutions without mutability would be to delete corrupted objects or just store a record of corrupted files outside of the system.
We agree on the `tombstones` idea described in use case #42.
Question from @Brian Wheeler https://github.com/OCFL/Use-Cases/issues/42#issuecomment-1949288221 : "If the file is gone then it would not appear in the `manifest`?". We agree that when a file is gone, the file would be shown in the `tombstones` block and not in the `manifest` block.
We will create a new version to record that a file has been deleted, vanished or corrupted. We will recommend that no other changes be made at the same time as the recording of deletion/corruption. The creation of a new version gives the chance to write a new version message with user/time/etc. and any other human readable information about why the change is occurring.
Why might a file be tombstoned?
- missing/vanished, removed/deleted -- spec will not distinguish these cases: file does not appear in `manifest`, appears with original digest in `tombstones`
- corrupted -- file appears with "new" digest in `manifest`, with original digest in `tombstones`
- name in file system but unreadable or not reliably readable -- file appears with empty digest string in `manifest`, with original digest in `tombstones` (an empty string is not a valid digest output in any digest format we know, and it is valid as a key in a JSON object whereas null etc. are not)
Use cases for corrupted and/or unreadable:
- write once storage where we can't delete
- corruption where we want to keep corrupted file for possible later analysis
We will add an extra parameter in `ocfl_layout.json` to flag use of mutability features such as `tombstones`, with the implication that tooling MUST check the latest inventory before trying to read any version.
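The "check latest inventory first" rule can be illustrated with a minimal sketch. It assumes the inventory shape from the tombstones proposal in use case #42; the helper `resolve_content_path` and the `TombstonedError` class are hypothetical names, not part of any OCFL tooling, and digests are truncated as elsewhere in these notes.

```python
class TombstonedError(Exception):
    """Raised when the requested file is only recorded in "tombstones"."""


def resolve_content_path(latest_inventory, version, logical_path):
    """Resolve a logical path in a version against the LATEST inventory.

    Hypothetical helper: the manifest is consulted first so that deduplicated
    content with a surviving copy still resolves; only then are tombstones checked.
    """
    state = latest_inventory["versions"][version]["state"]
    digest = next((d for d, paths in state.items() if logical_path in paths), None)
    if digest is None:
        raise KeyError(f"{logical_path} is not in the state of {version}")
    if digest in latest_inventory.get("manifest", {}):
        return latest_inventory["manifest"][digest][0]
    if digest in latest_inventory.get("tombstones", {}):
        raise TombstonedError(f"{logical_path} in {version} was tombstoned")
    raise KeyError(f"digest for {logical_path} is in neither manifest nor tombstones")


# Illustrative inventory: file2.txt was corrupted after v1 was written.
latest_inventory = {
    "manifest": {"aaa143...79a": ["v1/content/file2.txt"]},
    "tombstones": {"fe4512...e47": ["v1/content/file2.txt"]},
    "versions": {
        "v1": {"state": {"fe4512...e47": ["file2.txt"]}},
        "v2": {"state": {"aaa143...79a": ["file2_corrupted.txt"]}},
    },
}
```

Reading `file2.txt` from v1 with only the v1 inventory would appear to succeed; with the latest inventory the same request raises `TombstonedError`, which is the behaviour the flag is meant to force.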
Implementation notes must:
- account for deduped files
- talk about read errors and inconsistency
- talk about corruption characteristics of different storage types
- talk about need for documentation in new version
- impact on other V2 features - packages and content-linking
- validation strategies
Example of file deletion (unchanged from 2023-09-23 comment)
```json
{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": { },
  "tombstones": {
    "7545b8...f67": [ "v1/content/file.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2020-10-12T01:00:00Z",
      "message": "One file",
      "state": {
        "7545b8...f67": [ "file.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2021-01-00T02:00:00Z",
      "message": "The one file had to be deleted entirely for legal reasons",
      "state": { },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}
```
Example of an unreadable file (empty digest string) and a corrupted file:

```json
{
  "digestAlgorithm": "sha512",
  "head": "v2",
  "id": "http://example.org/minimal_deletion",
  "manifest": {
    "": [ "v1/content/file1.txt" ],
    "aaa143...79a": [ "v1/content/file2.txt" ]
  },
  "tombstones": {
    "7545b8...f67": [ "v1/content/file1.txt" ],
    "fe4512...e47": [ "v1/content/file2.txt" ]
  },
  "type": "https://ocfl.io/1.1/spec/#inventory",
  "versions": {
    "v1": {
      "created": "2023-10-02T12:00:00Z",
      "message": "Two files",
      "state": {
        "7545b8...f67": [ "file1.txt" ],
        "fe4512...e47": [ "file2.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2024-09-20T10:09:00Z",
      "message": "File 1 vanished or cannot be read reliably, exclude. File 2 is corrupted with a different checksum, change name in state",
      "state": {
        "aaa143...79a": [ "file2_corrupted.txt" ]
      },
      "user": {
        "address": "mailto:[email protected]",
        "name": "Alice"
      }
    }
  }
}
```
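The three cases listed under "Why might a file be tombstoned?" can be told apart mechanically by comparing the two blocks. The following is a minimal sketch, not a proposed validator: `classify_tombstones` and its three labels are hypothetical names, and the digests are truncated exactly as in the examples above.

```python
def classify_tombstones(inventory):
    """Map each tombstoned content path to 'removed', 'unreadable' or 'corrupted'."""
    # Invert the manifest: content path -> digest currently recorded for it.
    current = {
        path: digest
        for digest, paths in inventory.get("manifest", {}).items()
        for path in paths
    }
    result = {}
    for paths in inventory.get("tombstones", {}).values():
        for path in paths:
            if path not in current:
                result[path] = "removed"      # missing/vanished or deleted
            elif current[path] == "":
                result[path] = "unreadable"   # empty digest string in manifest
            else:
                result[path] = "corrupted"    # "new" digest in manifest
    return result


deletion_example = {
    "manifest": {},
    "tombstones": {"7545b8...f67": ["v1/content/file.txt"]},
}
corruption_example = {
    "manifest": {
        "": ["v1/content/file1.txt"],
        "aaa143...79a": ["v1/content/file2.txt"],
    },
    "tombstones": {
        "7545b8...f67": ["v1/content/file1.txt"],
        "fe4512...e47": ["v1/content/file2.txt"],
    },
}
```

The classification needs only `manifest` and `tombstones`, so it does not depend on the version messages, which remain the place for the human readable explanation.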
Time | Topic |
---|---|
9am - 10:30am | Keynote for iPRES |
11am - 12:30pm | Conference Presentations |
12:30pm - 1:30pm | Lunch at the Conference |
1:30pm - 5pm | Package Per Version: Use Case #33 |
5:30pm - 6:30pm | Boat Trip to Dinner |
6:30pm - onward | Conference Dinner |
See Thursday's notes.
Must be done by 2:30pm for Neil's BOF from 3pm - 4pm.
Time | Topic |
---|---|
9am - 10:30am | Package Per Version: Use Case #33 |
11am - 12:30pm | Conference Presentations |
12:30pm - 1:30pm | Conference Lunch |
1:30pm - 3:30pm | Conference Presentations |
These notes reference the Package Per Version Use Case, which is Use Case #33. It potentially addresses the issue of lots of small files as well as splitting large files.
- If a version is packaged, then the ENTIRE contents (inventory, sidecar and content) of the version directory are stored in the package.
- Depending on the packaging format the package may comprise one or more files
- Packages are stored in the version directory. If the packages were stored in the root directory, then the user might end up with many package files in the root directory.
- example: Stanford has files with 100+ packages per version, and cases where there is 50TB data in 10,000 5GB chunks
- The version directory for a packaged version MUST contain no other files than the package file(s)
- Users MUST choose one package type per version.
- The package file(s) are enumerated in packages.json stored in the object root. Therefore there is no particular naming convention required for packages - Implementation notes will recommend a simple and systematic approach.
- Do we include the version directory within the packages?
  - Users cannot expand the package in place as it writes another version directory.
    - i.e., you end up with v1 > v1 > all the content files, the `inventory.json` file, and its sidecar file.
    - If you're recovering to a different location, and all the packages are in the same folder then you can expand all the package files and end up with the complete object.
- Do we include a flag at the storage root level to indicate the use of packages throughout the storage root?
- This might allow for better tooling.
- Manage version packages in a `packages.json` file maintained at the object root.
  - Should not have any significant scaling issues
  - `inventory.json` does not change format
  - Separation of concerns between packaging and object versioning
  - Repackaging does not constitute a versioning event (implementation notes will discuss in further detail)
  - A given package contains all the files of a specific version so the package digest can be validated in lieu of the files in that version.
- All paths in the `packages.json` file are relative to the object root
- There is a metadata block per version that provides information about the package with an array of key/values.
  - The key values in the `"metadata"` block must include `"format"` and `"formatVersion"` (avoiding `"type"` and `"version"` because of namespace collisions)
  - An optional key is `"extension"` that points to an extension in the object’s extension folder that allows an organization to store other information about the package.
    - See the information in The archiveInformation Block that the Library of Norway created in their Workgroup Notes: Proposal Medium Impact
- Outline the best method for rewriting packages...
  - rewrite the packages,
  - rewrite the `packages.json`
  - and update the sidecar file.
- We aren’t going to version the `packages.json` file.
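Validating a package digest in lieu of the files it contains might look like the sketch below. It assumes a parsed `packages.json` with a `digestAlgorithm` key and a `manifest` that maps digests to package paths relative to the object root; `file_digest` and `verify_packages` are hypothetical helper names, and only Python's standard `hashlib` and `pathlib` are used.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha512", chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte packages are not read into memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_packages(object_root, packages):
    """Return package paths that are missing or whose digest does not match."""
    failures = []
    for expected, paths in packages["manifest"].items():
        for rel in paths:
            path = Path(object_root) / rel
            if not path.is_file():
                failures.append(rel)
            elif file_digest(path, packages["digestAlgorithm"]) != expected:
                failures.append(rel)
    return failures
```

An empty list means every package file matches its recorded digest; the files inside each package are not opened at all, which is the saving the notes describe.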
This strategy replicates the manifest block of the `inventory.json` file, i.e. `"digest": [ "filename" ]`, and then just specifies the order in the packages list for each version:
```json
{
  "digestAlgorithm": "sha512",
  "type": "https://ocfl.io/1.1/spec/#packages",
  "manifest": {
    "abc..123": [ "v1/v1.zip" ],
    "cde..123": [ "v3/v3.z01" ],
    "ade..789": [ "v3/v3.z02" ],
    "ces..229": [ "v3/v3.zip" ]
  },
  "versions": {
    "v1": {
      "metadata": {
        "format": "zip",
        "formatVersion": "6.3.10",
        "extension": "[extension-name-ref]"
      },
      "packages": ["v1/v1.zip"]
    },
    "v3": {
      "metadata": {
        "format": "zip",
        "formatVersion": "6.3.10",
        "extension": "[extension-name-ref]"
      },
      "packages": ["v3/v3.z01", "v3/v3.z02", "v3/v3.zip"]
    }
  }
}
```
- `"manifest"` - lists all package files for all versions; this is done in an array to match the `inventory.json` file.
- `"format"` - lists a handful of defined package types, and also links to a controlled vocabulary extension similar to the digest algorithm extension which is optional.
- `"formatVersion"` - the precise meaning of version may be dependent on the format used to package up the content.
- `"extension"` - an optional extension used to include more information about the package files; the extension must be a local extension in the object, and the additional information goes in the extension directory.
- `"packages"` - the list of package files for that version, in the order in which they should be unpackaged.
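The field descriptions above imply a few internal consistency rules that tooling could check before touching any package file. This is a minimal sketch under that reading of the notes: `check_packages_json` is a hypothetical function, and the rule that package paths start with their version directory is inferred from the note that packages are stored in the version directory.

```python
REQUIRED_METADATA_KEYS = {"format", "formatVersion"}


def check_packages_json(packages):
    """Return a list of problems found in a parsed packages.json (empty if none)."""
    problems = []
    # Every package path that the manifest knows a digest for.
    listed = {
        path
        for paths in packages.get("manifest", {}).values()
        for path in paths
    }
    for version, entry in packages.get("versions", {}).items():
        missing = REQUIRED_METADATA_KEYS - set(entry.get("metadata", {}))
        if missing:
            problems.append(f"{version}: metadata lacks {sorted(missing)}")
        for path in entry.get("packages", []):
            if path not in listed:
                problems.append(f"{version}: {path} is not in the manifest")
            if not path.startswith(f"{version}/"):
                problems.append(f"{version}: {path} is outside the version directory")
    return problems


example = {
    "digestAlgorithm": "sha512",
    "manifest": {
        "abc..123": ["v1/v1.zip"],
        "cde..123": ["v3/v3.z01"],
        "ade..789": ["v3/v3.z02"],
        "ces..229": ["v3/v3.zip"],
    },
    "versions": {
        "v1": {
            "metadata": {"format": "zip", "formatVersion": "6.3.10"},
            "packages": ["v1/v1.zip"],
        },
        "v3": {
            "metadata": {"format": "zip", "formatVersion": "6.3.10"},
            "packages": ["v3/v3.z01", "v3/v3.z02", "v3/v3.zip"],
        },
    },
}
```

Returning a list of problems rather than raising on the first one keeps the check usable as a building block for a validator that reports everything at once.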
The `inventory.json` file remains unchanged. Below is how the above `packages.json` example would appear on disk:
```
[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── v1
│   └── v1.zip
├── v2
│   ├── inventory.json
│   ├── inventory.json.sha512
│   └── new-file.txt
└── v3
    ├── v3.zip
    ├── v3.z02
    └── v3.z01
```
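The rule that a packaged version directory MUST contain no other files than the package file(s) could be checked against a layout like the one above. The sketch below assumes a parsed `packages.json` whose `versions` entries each carry a `packages` list; `unexpected_files` is a hypothetical name, and unpackaged versions such as v2 are ignored because they have no entry.

```python
from pathlib import Path


def unexpected_files(object_root, packages):
    """Return files in packaged version directories that packages.json does not list."""
    extras = []
    for version, entry in packages["versions"].items():
        expected = set(entry["packages"])
        version_dir = Path(object_root) / version
        for path in sorted(version_dir.rglob("*")):
            # Paths are compared in the object-root-relative form used by packages.json.
            rel = path.relative_to(object_root).as_posix()
            if path.is_file() and rel not in expected:
                extras.append(rel)
    return extras
```

A stray `inventory.json` left next to `v1.zip`, for example, would be reported as `v1/inventory.json`.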
Must be done by 5pm for Neil's meeting at 6pm.
Time | Topic |
---|---|