2024.09.17 F2F Editors Meeting
- Julian Morley (Stanford)
- Rosy Metz (Emory)
- Simeon Warner (Cornell)
- Andrew Woods (Harvard)
- Neil Jefferies (Oxford)
- Arrival in Ghent no later than Sunday Sept. 15th
- Do the workshop on Monday Sept. 16th
- Details TBD during the week, but we will have meetings Tuesday - Thursday
- Full day meeting on Friday Sept. 20th that will end no earlier than Friday at 5pm local time (i.e., your flights need to take this into consideration)
- Triage/create tickets/topics into spec or implementation notes
- Determine what tickets we can group together
- Address any low hanging fruit tickets during the meeting
- Identify a timeline for v2.0; consider:
  - What do we want for a release candidate timeframe to gather feedback?
  - How will we support updating tooling?
  - How many validators do we need?
- Use Case Repository: Specifically the issues labeled Confirmed In-scope
- Spec Repository
- Draft Spec
- Extensions Repository
- OCFL Package Per Version Workgroup Notes
Must be done by 4pm for the 4:30pm DPC Awards Neil is nominated for.
Time | Topic |
---|---|
12pm - 2pm | OCFL Workshop at iPRES |
2:30pm - 3:30pm | Create Schedule for the Week |
Must be done by 3:30pm for a 4pm panel Neil is on.
Time | Topic |
---|---|
9am - 10:30am | Application Profiles: Use Case #50 |
11:00am - 12:30pm | Coffee with Jürgen |
12:30pm - 1:30pm | Lunch at the Conference |
1:30pm - 2:30pm | Application Profiles: Use Case #50 |
2:30pm - 3:30pm | File Deletion, Corruption, Loss: Use Cases #42 and #14. |
6pm - onwards | Conference Reception |
Use case: https://github.com/OCFL/Use-Cases/issues/50
Editors agree that application profiling will be handled through an extension and not through changes to the specification.
Extension 0008: Schema Registry may provide inspiration for addressing this use case. The extension could be as simple as pointing to documentation, or can be more complex like Jürgen's self-documenting and machine-actionable objects. The minimal storage root extension might be a profile directory in the extension directory that includes human readable description and/or links to external documentation. Refinement might allow use at the object level and possible reference from an object to profiles at the storage root level. Will create a draft extension for community discussion.
Handling file loss, file corruption, or version collapse changes the assumption of version immutability. This is necessarily a version 2 concept, so it can only apply in a version 2+ storage root. Even with a notionally immutable system, one can have corruption. Possible solutions without mutability would be to delete corrupted objects or just store a record of corrupted files outside of the system.
We agree on the `tombstones` idea described in use case #42.
Question from @Brian Wheeler https://github.com/OCFL/Use-Cases/issues/42#issuecomment-1949288221 : "If the file is gone then it would not appear in the `manifest`?". We agree that when a file is gone, the file would be shown in the `tombstones` block and not in the `manifest` block.
We will create a new version to record that a file has been deleted, vanished or corrupted. We will recommend that no other changes be made at the same time as the recording of deletion/corruption. The creation of a new version gives the chance to write a new version message with user/time/etc. and any other human-readable information about why the change is occurring.
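As a rough illustration of recording a loss in a new version, the sketch below moves a content path from `manifest` to `tombstones` and appends a version whose state no longer references the lost digest. The function name `tombstone_file` and its parameters are hypothetical, the version naming assumes unpadded `vN` directories, and none of this is agreed specification text.

```python
import copy


def tombstone_file(inventory, content_path, message, user, created):
    """Return a copy of `inventory` with a new head version recording the loss.

    Hypothetical helper: assumes unpadded version names such as "v1", "v2".
    """
    inv = copy.deepcopy(inventory)
    manifest = inv["manifest"]

    # Find the digest under which the content path is currently listed.
    digest = next((d for d, paths in manifest.items() if content_path in paths), None)
    if digest is None:
        raise KeyError(f"{content_path} is not listed in the manifest")

    # Move the content path from the manifest block to the tombstones block.
    manifest[digest].remove(content_path)
    if not manifest[digest]:
        del manifest[digest]
    inv.setdefault("tombstones", {}).setdefault(digest, []).append(content_path)

    # New version: previous state, minus the digest if no other copy remains
    # (a deduplicated file may still be available under another content path).
    old_head = inv["head"]
    new_head = f"v{int(old_head[1:]) + 1}"
    state = copy.deepcopy(inv["versions"][old_head]["state"])
    if digest not in manifest:
        state.pop(digest, None)

    inv["versions"][new_head] = {
        "created": created,
        "message": message,
        "state": state,
        "user": user,
    }
    inv["head"] = new_head
    return inv
```

Keeping the tombstone in its own version, with nothing else changed, matches the recommendation above and leaves the version message free to explain the reason.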
Why might a file be tombstoned?
- missing/vanished, removed/deleted -- spec will not distinguish these cases: file does not appear in `manifest`, appears with original digest in `tombstones`
- corrupted -- file appears with "new" digest in `manifest`, with original digest in `tombstones`
- name in file system but unreadable or not reliably readable -- file appears with empty digest string in `manifest`, with original digest in `tombstones` (an empty string is not a valid digest output in any digest format we know, and it is valid as a key in a JSON object, whereas null etc. are not)
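The three cases above can be told apart mechanically from the `manifest` and `tombstones` blocks. The following is a minimal sketch; the function name and the returned labels are illustrative assumptions, not proposed vocabulary.

```python
def tombstone_status(inventory, content_path):
    """Classify a content path using the manifest and tombstones blocks.

    Hypothetical helper; the labels returned here are illustrative only.
    """
    manifest = inventory.get("manifest", {})
    tombstones = inventory.get("tombstones", {})

    tombstoned = any(content_path in paths for paths in tombstones.values())
    current = [d for d, paths in manifest.items() if content_path in paths]

    if not tombstoned:
        return "intact" if current else "unknown"
    if not current:
        # Missing/vanished and removed/deleted are deliberately not distinguished.
        return "missing-or-deleted"
    if current[0] == "":
        # Empty digest string: the name exists but cannot be read reliably.
        return "unreadable"
    # Present with a digest that differs from the tombstoned original.
    return "corrupted"
```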
Use cases for corrupted and/or unreadable:
- write once storage where we can't delete
- corruption where we want to keep corrupted file for possible later analysis
We will add an extra parameter in `ocfl_layout.json` to flag use of mutability features such as `tombstones`, with the implication that tooling MUST check the latest inventory before trying to read any version.
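To show why the latest inventory has to be consulted even when reading an old version, here is a small sketch that resolves a logical path in any version against the head inventory and refuses to return a path whose digest has been tombstoned. The names `TombstonedError` and `content_path_for_read` are hypothetical; the name of the flag in `ocfl_layout.json` has not been decided, so it is not modelled here.

```python
class TombstonedError(Exception):
    """Raised when the requested content has been tombstoned (hypothetical)."""


def content_path_for_read(head_inventory, version, logical_path):
    """Resolve a logical path in `version` using the latest (head) inventory."""
    state = head_inventory["versions"][version]["state"]
    digest = next((d for d, paths in state.items() if logical_path in paths), None)
    if digest is None:
        raise FileNotFoundError(f"{logical_path} is not part of {version}")

    # The head manifest is authoritative: an older inventory copy stored in a
    # version directory may still list content that has since been tombstoned.
    paths = head_inventory.get("manifest", {}).get(digest, [])
    if paths:
        return paths[0]
    if digest in head_inventory.get("tombstones", {}):
        raise TombstonedError(f"{logical_path} ({version}) has been tombstoned")
    raise KeyError(f"digest for {logical_path} is in neither manifest nor tombstones")
```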
Implementation notes must:
- account for deduped files
- talk about read errors and inconsistency
- talk about corruption characteristics of different storage types
- talk about need for documentation in new version
- impact on other V2 features - packages and content-linking
- validation strategies
Example of file deletion (unchanged from 2023-09-23 comment)
{
"digestAlgorithm": "sha512",
"head": "v2",
"id": "http://example.org/minimal_deletion",
"manifest": { },
"tombstones": {
"7545b8...f67": [ "v1/content/file.txt" ]
},
"type": "https://ocfl.io/1.1/spec/#inventory",
"versions": {
"v1": {
"created": "2020-10-12T01:00:00Z",
"message": "One file",
"state": {
"7545b8...f67": [ "file.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
},
"v2": {
"created": "2021-01-00T02:00:00Z",
"message": "The one file had to be deleted entirely for legal reasons",
"state": { },
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
}
}
}
{
"digestAlgorithm": "sha512",
"head": "v2",
"id": "http://example.org/minimal_deletion",
"manifest": {
"": [ "v1/content/file1.txt" ],
"aaa143...79a": [ "v1/content/file2.txt" ]
},
"tombstones": {
"7545b8...f67": [ "v1/content/file1.txt" ],
"fe4512...e47": [ "v1/content/file2.txt" ]
},
"type": "https://ocfl.io/1.1/spec/#inventory",
"versions": {
"v1": {
"created": "2023-10-02T12:00:00Z",
"message": "Two files",
"state": {
"7545b8...f67": [ "file1.txt" ],
"fe4512...e47": [ "file2.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
},
"v2": {
"created": "2024-09-20T10:09:00Z",
"message": "File 1 vanished or cannot be read reliably, exclude. File 2 is corrupted with a different checksum, change name in state",
"state": {
"aaa143...79a": [ "file2_corrupted.txt" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
}
}
}
Time | Topic |
---|---|
9am - 10:30am | Keynote for iPRES |
11am - 12:30pm | Conference Presentations |
12:30pm - 1:30pm | Lunch at the Conference |
1:30pm - 5pm | Package Per Version: Use Case #33 |
5:30pm - 6:30pm | Boat Trip to Dinner |
6:30pm - onward | Conference Dinner |
See Thursday's notes.
Must be done by 2:30pm for Neil's BOF from 3pm - 4pm.
Time | Topic |
---|---|
9am - 10:30am | Package Per Version: Use Case #33 |
11am - 12:30pm | Conference Presentations |
12:30pm - 1:30pm | Conference Lunch |
1:30pm - 3:30pm | Conference Presentations |
These notes reference the Package Per Version Use Case, which is Use Case #33. It potentially addresses the issue of lots of small files as well as splitting large files.
- If a version is packaged, then the ENTIRE contents (inventory, sidecar and content) of the version directory are stored in the package.
- Depending on the packaging format the package may comprise one or more files
- Packages are stored in the version directory. If the packages were stored in the root directory, then the user might end up with many package files in the root directory.
- example: Stanford has files with 100+ packages per version, and cases where there is 50TB data in 10,000 5GB chunks
- The version directory for a packaged version MUST contain no other files than the package file(s)
- Users MUST choose one package type per version.
- The package file(s) are enumerated in packages.json stored in the object root. Therefore there is no particular naming convention required for packages - Implementation notes will recommend a simple and systematic approach.
- Do we include the version directory within the packages?
  - Users cannot expand the package in place as it writes another version directory.
    - i.e., you end up with v1 > v1 > all the content files, the `inventory.json` file, and its sidecar file.
  - If you're recovering to a different location, and all the packages are in the same folder then you can expand all the package files and end up with the complete object.
- Do we include a flag at the storage root level to indicate the use of packages throughout the storage root?
- This might allow for better tooling.
- Manage version packages in a `packages.json` file maintained at the object root.
  - Should not have any significant scaling issues
  - `inventory.json` does not change format
  - Separation of concerns between packaging and object versioning
  - Repackaging does not constitute a versioning event (implementation notes will discuss in further detail)
  - A given package contains all the files of a specific version so the package digest can be validated in lieu of the files in that version.
- All paths in the `packages.json` file are relative to the object root
- There is a metadata block per version that provides information about the package with an array of key/values.
  - The key values in the "metadata" block must include "format" and "formatVersion" (avoiding "type" and "version" because of namespace collisions)
  - An optional key is "extension" that points to an extension in the object’s extension folder that allows an organization to store other information about the package.
    - See the information in The archiveInformation Block that the Library of Norway created in their Workgroup Notes: Proposal Medium Impact
- Outline the best method for rewriting packages...
  - rewrite the packages,
  - rewrite the `packages.json`
  - and update the sidecar file.
- We aren’t going to version the `packages.json` file.
This strategy replicates the manifest block of the `inventory.json` file, i.e. "digest": [ "filename" ], and then just specifies the order in the packages list for each version:
{
"digestAlgorithm": "sha512",
"type": "https://ocfl.io/1.1/spec/#packages",
"manifest": {
"abc..123": [ "v1/v1.zip" ],
"cde..123": [ "v3/v3.z01" ],
"ade..789": [ "v3/v3.z02" ],
"ces..229": [ "v3/v3.zip" ]
},
"versions": {
"v1": {
"metadata": {
"format": "zip",
"formatVersion": "6.3.10",
"extension": "[extension-name-ref]"
},
"packages": ["v1/v1.zip"]
},
"v3": {
"metadata": {
"format": "zip",
"formatVersion": "6.3.10",
"extension": "[extension-name-ref]"
},
"packages": ["v3/v3.z01", "v3/v3.z02", "v3/v3.zip"]
}
}
}
- "manifest" - lists all package files for all versions; this is done in an array to match the `inventory.json` file.
- "format" - lists a handful of defined package types, and also links to a controlled vocabulary extension, similar to the digest algorithm extension, which is optional.
- "formatVersion" - the precise meaning of version may be dependent on the format used to package up the content.
- "extension" - an optional extension used to include more information about the package files; the extension must be a local extension in the object, and the additional information goes in the extension directory.
- "packages" - the list of packages in the version in the order in which they should be unpacked.
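The key descriptions above suggest a few structural checks a validator could run on a parsed `packages.json` document. The sketch below is one possible reading; the function name `check_packages_doc` and the error wording are assumptions, and the draft format itself may still change.

```python
def check_packages_doc(doc):
    """Return a list of structural problems found in a parsed packages.json.

    Hypothetical check based on the draft keys described in these notes.
    """
    errors = []

    # Every package path listed anywhere in the manifest block.
    listed = {path for paths in doc.get("manifest", {}).values() for path in paths}

    for version, entry in doc.get("versions", {}).items():
        metadata = entry.get("metadata", {})
        # "format" and "formatVersion" are required; "extension" is optional.
        for key in ("format", "formatVersion"):
            if key not in metadata:
                errors.append(f"{version}: metadata is missing {key!r}")

        packages = entry.get("packages", [])
        if not packages:
            errors.append(f"{version}: no packages listed")
        for path in packages:
            if path not in listed:
                errors.append(f"{version}: {path} is not in the manifest")

    return errors
```

An empty list means the document is structurally consistent under these assumptions; it says nothing about whether the package files exist or match their digests.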
The `inventory.json` file remains unchanged. The files corresponding to the above `packages.json` example would appear on disk as follows:
[object root]
├── 0=ocfl_object_1.1
├── inventory.json
├── inventory.json.sha512
├── packages.json
├── packages.json.sha512
├── v1
│ └── v1.zip
├── v2
│ ├── inventory.json
│ ├── inventory.json.sha512
│ └── new-file.txt
└── v3
    ├── v3.zip
    ├── v3.z02
    └── v3.z01
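Two of the rules in these notes lend themselves to an on-disk check: a packaged version directory must contain nothing but its package file(s), and the package digest can be validated in lieu of the individual files. A minimal sketch follows, assuming the draft `packages.json` shape shown above; `check_packages_on_disk` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def check_packages_on_disk(object_root, packages_doc):
    """Compare packaged version directories with a parsed packages.json."""
    root = Path(object_root)
    problems = []
    algorithm = packages_doc["digestAlgorithm"]

    # Invert the manifest: package path -> expected digest.
    expected = {
        path: digest
        for digest, paths in packages_doc["manifest"].items()
        for path in paths
    }

    for version, entry in packages_doc["versions"].items():
        listed = set(entry["packages"])
        vdir = root / version
        found = set()
        if vdir.is_dir():
            found = {
                p.relative_to(root).as_posix()
                for p in vdir.rglob("*")
                if p.is_file()
            }

        for path in sorted(found - listed):
            problems.append(f"{path}: not a listed package file")
        for path in sorted(listed - found):
            problems.append(f"{path}: package file is missing")
        for path in sorted(listed & found):
            h = hashlib.new(algorithm)
            # Reads the whole file for brevity; stream large packages in practice.
            h.update((root / path).read_bytes())
            if h.hexdigest() != expected.get(path):
                problems.append(f"{path}: digest mismatch")

    return problems
```

Unpackaged versions (such as v2 in the listing above) do not appear in `packages.json` and are left to ordinary inventory validation.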
Must be done by 5pm for Neil's meeting at 6pm.
Time | Topic |
---|---|
These notes reference the Object Forking Use Case, which is Use Case #44. The use case is supported via content-addressable storage. This introduces the concepts of parent (the original object) and child (the object that is forked from the original object).
- We support this by inserting one or more pointers to one or more files in one or more parent objects. These pointers are placed in the `manifest` block of the inheriting child object.
- The version `state` block lists the logical path as normal, allowing users to change the file name when inherited from a parent.
- A child object can inherit arbitrary files from multiple parent objects; it's not limited to the set of files of a single version from a single parent object. This is an implementation detail.
- However, an implementer may choose to limit this feature to all files in a specific version of a single parent object, if desired. This is also an implementation detail.
- A child object can only inherit files from parent objects in the same storage root. OCFL has no mechanism for referencing files outside of the current storage root.
- Inherited files cannot be included in the child's `fixity` block, and the verifier must look up the parent object.
- The child object must use the same "digestAlgorithm" as all parent objects.
- File inheritance MUST NOT inherit a file from a grandparent, i.e., the act of creating a file link involves verifying with the parent object that the file exists in that object, and is not itself a pointer to another object's file.
- There is no benefit to inheriting a file from a grandparent, it only creates complexity and the specification aims for simplicity.
- To prevent recursion loops, validators must only check to one level of recursion when validating any object.
When a parent object is deleted:
- In a storage root that supports file inheritance a flag MUST be placed in the `ocfl_layout.json` file.
- If you delete an object, you MUST check whether another object inherits files from that object. Implementation notes will address how to do this.
- We will create an extension as part of version 2 allowing you to document the child objects that depend on files in the parent object.
- Verification of child objects will fail with a descriptive error (parent object no longer exists).
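One naive way to perform the pre-deletion check is to scan every inventory in the storage root for manifest pointers to the object being deleted. The sketch below assumes pointers take the `{ "objectid": ... }` shape used in the example inventory in these notes; `find_dependents` is a hypothetical name, and a real implementation would probably rely on the planned dependency extension rather than a full scan.

```python
def find_dependents(inventories, parent_id):
    """Return ids of objects whose manifest points at files in `parent_id`.

    `inventories` is an iterable of parsed inventory.json documents from the
    same storage root (how they are gathered is left to the implementation).
    """
    dependents = []
    for inv in inventories:
        for entries in inv.get("manifest", {}).values():
            # Local content paths are strings; inherited files are pointer objects.
            if any(isinstance(e, dict) and e.get("objectid") == parent_id for e in entries):
                dependents.append(inv["id"])
                break
    return dependents
```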
When a referenced file is deleted in a parent object:
- Tombstoning will be propagated via the verification process of the child object (i.e., the file has been deleted in parent object).
- A soft delete or rename in the parent object does not impact the child object in any way, as the original bitstream remains on disk in the parent's content directory and referenced in the parent's inventory.
- A child object is invalid if its current `state` block references a deleted file in a parent object.
Question:
- Should the tombstones get placed in the `inventory.json` of the child object?
- Or do the implementation notes address the use of tombstoning in a parent, as it may make a child object invalid?
When a file is corrupted in a parent object:
- The verification process should flag it the same way as in the parent object (i.e., the file is corrupted in the parent object).
{
"digestAlgorithm": "sha512",
"head": "v3",
"id": "ark:/12345/bcd987",
"manifest": {
"4d27c8...b53": [ "v2/content/foo/bar.xml" ],
"7dcc35...c31": [ { "objectid": "ark:/67890/fgh123" } ],
"df83e1...a3e": [ { "objectid": "ark:/67890/fgh123" } ],
"ffccf6...62e": [ { "objectid": "ark:/67890/fgh123" } ]
},
"type": "https://ocfl.io/1.1/spec/#inventory",
"versions": {
"v1": {
"created": "2018-01-01T01:01:01Z",
"message": "Initial import. bar.xml, bigdata.dat and image.tiff are inherited from a parent object.",
"state": {
"7dcc35...c31": [ "foo/bar.xml" ],
"df83e1...a3e": [ "bigdata.dat" ],
"ffccf6...62e": [ "image.tiff" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Alice"
}
},
"v2": {
"created": "2018-02-02T02:02:02Z",
"message": "Fix bar.xml replacing import with a local edit, remove image.tiff",
"state": {
"4d27c8...b53": [ "foo/bar.xml" ],
"df83e1...a3e": [ "bigdata.dat" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Bob"
}
},
"v3": {
"created": "2018-03-03T03:03:03Z",
"message": "Reinstate image.tiff",
"state": {
"4d27c8...b53": [ "foo/bar.xml" ],
"df83e1...a3e": [ "bigdata.dat" ],
"ffccf6...62e": [ "image.tiff" ]
},
"user": {
"address": "mailto:[email protected]",
"name": "Cecilia"
}
}
}
}
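Resolving an inherited file from an inventory like the one above could look roughly like this. The sketch looks exactly one level up, so grandparent pointers are rejected rather than followed, and it reports a deleted parent, a tombstoned file, or a digest algorithm mismatch. `InheritanceError` and `resolve_manifest_entry` are hypothetical names, and the pointer shape is taken from the example rather than from any agreed specification text.

```python
class InheritanceError(Exception):
    """Raised when an inherited file cannot be resolved (hypothetical)."""


def resolve_manifest_entry(child, digest, inventories_by_id):
    """Return (object id, content path) pairs for one digest in a child manifest."""
    resolved = []
    for entry in child["manifest"][digest]:
        if isinstance(entry, str):
            # Ordinary content path stored in the child itself.
            resolved.append((child["id"], entry))
            continue

        parent = inventories_by_id.get(entry["objectid"])
        if parent is None:
            raise InheritanceError("parent object no longer exists")
        if parent["digestAlgorithm"] != child["digestAlgorithm"]:
            raise InheritanceError("parent uses a different digestAlgorithm")

        # Only content the parent stores itself counts; a pointer in the
        # parent's manifest would be a grandparent link, which is not allowed.
        local = [p for p in parent.get("manifest", {}).get(digest, []) if isinstance(p, str)]
        if not local:
            if digest in parent.get("tombstones", {}):
                raise InheritanceError("file has been deleted in parent object")
            raise InheritanceError("parent does not hold the file itself")
        resolved.append((parent["id"], local[0]))
    return resolved
```

Because the lookup never recurses, it cannot loop, which is the same reason validators are limited to one level of recursion.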
https://github.com/OCFL/Use-Cases/issues/46
Two distinct scenarios have different solutions:
- The desire to collapse versions and delete intermediate file revisions where there is no need or desire to keep the details of the historical changes because they are curatorially insignificant. If versions have been created rather than using an approach such as mutable head (see https://ocfl.github.io/extensions/0005-mutable-head.html) then such changes unavoidably mutate the object. We think this is best handled by rewriting the object with selected versions removed, taking care to keep all necessary/interesting changes. This approach needs no new specification support, but it does need additional implementation notes.
- The desire to delete intermediate files (perhaps to save storage) but to retain the history of versions. This is handled by the Support Physical File-Deletion use case.
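For the first scenario, a tool rewriting an object needs to know which stored files stop being referenced once the chosen versions are removed. The sketch below only identifies those candidates; it does not renumber versions or move content, and `unreferenced_after_collapse` is a hypothetical name.

```python
def unreferenced_after_collapse(inventory, keep_versions):
    """Return manifest entries that no kept version's state refers to."""
    kept_digests = set()
    for version in keep_versions:
        kept_digests.update(inventory["versions"][version]["state"])

    # Anything in the manifest that no kept version needs could be left out
    # when the object is rewritten.
    return {
        digest: paths
        for digest, paths in inventory["manifest"].items()
        if digest not in kept_digests
    }
```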