cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer #214
Comments
In the log I see lots of similar Java stacktraces:
Does anyone know what that means?
Looks like a known (and old) issue: iipc/openwayback#14
Might be related (maybe not, just capturing it here). When I run the command manually, at the end I get:
Perhaps for dev discussion (or a specific planning meeting on WAS), but I think the items with the above issues are the result of corrupt source files. If I try to manually unzip the file from above:
But if I unzip a different file:
That worked fine. So I suspect we'll need a way to re-download/restart the source file. Just a hunch at this point.
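For anyone who wants to repeat the manual unzip check across many files, here is a minimal sketch in Python. It is not part of the workflow code, and the function name `is_readable_archive` is hypothetical. It assumes the source files are either ZIP archives or gzip-compressed files (for example `.warc.gz`); anything else is reported as unreadable.

```python
import gzip
import zipfile
import zlib
from pathlib import Path


def is_readable_archive(path: Path) -> bool:
    """Return True if the archive at `path` can be read end to end.

    Hypothetical helper: assumes the file is either a ZIP archive or a
    gzip-compressed file. Any other format is reported as unreadable.
    """
    try:
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as zf:
                # testzip() returns the name of the first bad member, or None.
                return zf.testzip() is None
        # Not a ZIP: assume gzip and stream through the whole file so that
        # truncation or corruption near the end is also detected.
        with gzip.open(path, "rb") as fh:
            while fh.read(1024 * 1024):
                pass
        return True
    except (zipfile.BadZipFile, EOFError, OSError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError, so it is covered here.
        return False
```

Calling `is_readable_archive(Path("some-file"))` on each source file would separate the files that fail like the one above from the ones that unzip cleanly.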
Adding a note that this error affects the Stanford University Websites collection (specifically druid cv292vs5727), which is a collection that we're hoping to actively accession in the near future.
Just some continued investigation: The zip files adjacent to the one I believe is corrupt are fine:
Also of note, it is much smaller than all of the other
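Since the suspect file is noticeably smaller than its neighbours, a size comparison might be a cheap first pass before trying to unzip everything. A rough sketch, where the function name `unusually_small` and the `ratio` threshold of 0.5 are assumptions picked for illustration, not values taken from the workflow:

```python
from pathlib import Path
from statistics import median


def unusually_small(paths, ratio=0.5):
    """Return the paths whose size is below `ratio` times the median size.

    Hypothetical helper: `ratio` is an arbitrary illustrative cutoff.
    """
    sizes = {Path(p): Path(p).stat().st_size for p in paths}
    if not sizes:
        return []
    cutoff = ratio * median(sizes.values())
    return sorted(p for p, size in sizes.items() if size < cutoff)
```

A small file is only a hint, not proof of corruption, so anything this flags would still need the unzip test.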
79 web archive crawl objects have errors in the wasCrawlDisseminationWF of the form:
Some of these errors date back at least 4 years. I tried resetting the step for some items but they return to an error state.
Link to all current items with errors:
https://argo.stanford.edu/catalog?f%5Bwf_wps_ssim%5D%5B%5D=wasCrawlDisseminationWF%3Acdx-generator%3Aerror&per_page=100
Link to the one item with this error that was accessioned within the past year:
https://argo.stanford.edu/view/druid:cv292vs5727