cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer #214

Open
andrewjbtw opened this issue Feb 27, 2020 · 6 comments

Comments

@andrewjbtw
Collaborator

andrewjbtw commented Feb 27, 2020

79 web archive crawl objects have errors in the wasCrawlDisseminationWF of the form:

cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer /web-archiving-stacks/data/collections/kh149kf8484/bj/330/fg/0526/CDL-20100320000124-00037-oriole.ucop.edu-00131663.arc.gz /web-archiving-stacks/data/indices/cdx_working//druid:bj330fg0526/CDL-20100320000124-00037-oriole.ucop.edu-00131663.cdx 2>> log/cdx_indexer.log pid 959 exit 1

Some of these errors date back at least 4 years. I tried resetting the step for some items but they return to an error state.

Link to all current items with errors:
https://argo.stanford.edu/catalog?f%5Bwf_wps_ssim%5D%5B%5D=wasCrawlDisseminationWF%3Acdx-generator%3Aerror&per_page=100

Link to the one item with this error that was accessioned within the past year:
https://argo.stanford.edu/view/druid:cv292vs5727
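
To triage the affected objects, the archive path can be pulled out of each error message so that the underlying file can be inspected directly. The following is a minimal sketch, assuming the messages all follow the shape quoted above; `parse_cdx_error` and `ERROR_PATTERN` are hypothetical names and not part of the workflow code.

```python
import re

# Assumed message shape, based only on the single example quoted in this issue.
ERROR_PATTERN = re.compile(
    r"cdx-indexer\s+(?P<archive>\S+)\s+(?P<cdx>\S+)\s+2>>\s+(?P<log>\S+)"
    r"\s+pid\s+(?P<pid>\d+)\s+exit\s+(?P<exit>\d+)"
)

SAMPLE = (
    "cdx-generator : Error in extracting CDX with command: "
    "jar/openwayback/bin/cdx-indexer "
    "/web-archiving-stacks/data/collections/kh149kf8484/bj/330/fg/0526/"
    "CDL-20100320000124-00037-oriole.ucop.edu-00131663.arc.gz "
    "/web-archiving-stacks/data/indices/cdx_working//druid:bj330fg0526/"
    "CDL-20100320000124-00037-oriole.ucop.edu-00131663.cdx "
    "2>> log/cdx_indexer.log pid 959 exit 1"
)


def parse_cdx_error(message):
    """Return the archive path, CDX path, log path, pid and exit code,
    or None if the message does not match the assumed shape."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["exit"] = int(fields["exit"])
    return fields


parsed = parse_cdx_error(SAMPLE)
print(parsed["archive"])
```

The `archive` field is the (W)ARC file handed to `cdx-indexer`, which is the file worth testing for corruption.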

@andrewjbtw andrewjbtw changed the title Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer Feb 28, 2020
@jcoyne
Contributor

jcoyne commented Apr 21, 2020

In the log I see lots of similar Java stacktraces:

java.io.IOException: Failed parse of http status line.
        at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:294)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
        at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
java.io.IOException: Failed parse of http status line.
        at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:294)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)

Does anyone know what that means?

@aaron-collier aaron-collier self-assigned this May 4, 2020
@aaron-collier
Contributor

Looks like a known (and old) issue: iipc/openwayback#14

@aaron-collier
Contributor

Might be related (maybe not, just capturing it here). When I run the command manually, at the end I get:

WARNING: Trying skip of failed record cleanup of {WARC-Type=response, reader-identifier=/web-archiving-stacks/data/collections/xs048zp7815/cv/292/vs/5727/ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz, WARC-Date=2016-12-11T13:29:18Z, absolute-offset=471946099, Content-Length=19248744, WARC-Record-ID=<urn:uuid:f5405d28-65d0-4754-a1e5-b2175ede1bbc>, WARC-Payload-Digest=sha1:FXV6QEH674CTRXLMLVR5YQLEQCXS6D4X, WARC-IP-Address=54.230.141.103, WARC-Target-URI=http://dr6lcqo3bxtwa.cloudfront.net/binary/2016/12/9/23/1437582013143-p5k4ma/20161208b_CSPAN_BoxerFilibuster-1481328291115.mp4, Content-Type=application/http; msgtype=response}: invalid block type
java.util.zip.ZipException: invalid block type
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:262)
	at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:248)
	at org.archive.io.ArchiveRecord.close(ArchiveRecord.java:172)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:175)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.hasNext(ArchiveReaderCloseableIterator.java:37)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

May 04, 2020 4:06:44 PM org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
WARNING: Trying skip of failed record cleanup of {WARC-Type=response, reader-identifier=/web-archiving-stacks/data/collections/xs048zp7815/cv/292/vs/5727/ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz, WARC-Date=2016-12-11T13:29:18Z, absolute-offset=471946099, Content-Length=19248744, WARC-Record-ID=<urn:uuid:f5405d28-65d0-4754-a1e5-b2175ede1bbc>, WARC-Payload-Digest=sha1:FXV6QEH674CTRXLMLVR5YQLEQCXS6D4X, WARC-IP-Address=54.230.141.103, WARC-Target-URI=http://dr6lcqo3bxtwa.cloudfront.net/binary/2016/12/9/23/1437582013143-p5k4ma/20161208b_CSPAN_BoxerFilibuster-1481328291115.mp4, Content-Type=application/http; msgtype=response}: invalid block type
java.util.zip.ZipException: invalid block type
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:262)
	at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:248)
	at org.archive.io.ArchiveRecord.close(ArchiveRecord.java:172)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:175)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:501)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

May 04, 2020 4:06:44 PM org.archive.io.ArchiveReader$ArchiveRecordIterator next
WARNING: Bad Record. Trying skip (Record start 471946099): invalid block type
Exception in thread "main" java.lang.RuntimeException: After retry (Offset 471946099)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:512)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
Caused by: java.util.zip.ZipException: invalid block type
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:126)
	at org.archive.util.LaxHttpParser.readRawLine(LaxHttpParser.java:84)
	at org.archive.util.LaxHttpParser.readLine(LaxHttpParser.java:112)
	at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:113)
	at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:90)
	at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:94)
	at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader$1.innerNext(WARCReaderFactory.java:290)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:505)
	... 7 more

@aaron-collier
Contributor

Perhaps for dev discussion (or a specific planning meeting on WAS), but I think the items with the above issues are the result of corrupt source files.

If I try to manually unzip the file from above:

[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz

gzip: ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz: invalid compressed data--format violated

But if I unzip a different file:

[was@was-robots1-prod testing]$ gunzip webrecorder-cidr-20160512212029480-00000-6-5d875e632943.warc.gz
[was@was-robots1-prod testing]$ ls
ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz  webrecorder-cidr-20160512212029480-00000-6-5d875e632943.warc

That worked fine. So I suspect we'll need a way to re-download/restart the source file. Just a hunch at this point.
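
The same check can be run without unpacking each file to disk. This is a minimal sketch, assuming Python 3 is available on the host; `gzip_is_intact` is a hypothetical helper name. It streams through the file, discards the decompressed bytes, and reports the first decompression error it hits.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return (True, None) if the whole gzip file decompresses cleanly,
    otherwise (False, reason)."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read and discard the output so a large WARC is never held in memory.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers a bad gzip header, EOFError a truncated stream,
        # zlib.error corrupt deflate data.
        return False, f"{type(exc).__name__}: {exc}"
    return True, None
```

Run against the two files above, the expectation is that the ARCHIVEIT `-00051` file is reported as failing and the webrecorder file as intact, mirroring the `gunzip` results. The exact error text may differ from what `gunzip` prints.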

@aaron-collier aaron-collier transferred this issue from sul-dlss/was-registrar-app May 5, 2020
@andrewjbtw
Collaborator Author

Adding a note that this error affects the Stanford University Websites collection (specifically druid cv292vs5727), which is a collection that we're hoping to actively accession in the near future.

@aaron-collier
Contributor

Just some continued investigation:

The gzip files adjacent to the one I believe is corrupt are fine:

[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz

gzip: ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz: invalid compressed data--format violated
[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211133805177-00052.warc.gz
[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211135951009-00053.warc.gz

Also of note, the corrupt file (00051) is much smaller than the adjacent QUARTERLY files:

-rw-r--r-- 1 was was 1089840585 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211135951009-00053.warc.gz
-rw-r--r-- 1 was was 1336199543 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211133805177-00052.warc.gz
-rw-r--r-- 1 was was  483459072 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz
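
The numbers already in this thread look consistent with a truncated file rather than damage in the middle of it. The sketch below is only a rough heuristic: `absolute-offset` is a position in the compressed file, while `Content-Length` is the uncompressed record size, so the comparison is approximate (the payload is an mp4, which usually compresses very little). `truncation_hints` is a hypothetical helper, and the 64 KiB block size is an assumption.

```python
# Values copied from the log output and the `ls -l` listing in this thread.
FILE_SIZE = 483_459_072       # bytes on disk for the ...-00051.warc.gz file
RECORD_OFFSET = 471_946_099   # absolute-offset of the record that fails
RECORD_LENGTH = 19_248_744    # Content-Length of that record (uncompressed)


def truncation_hints(file_size, record_offset, record_length, block=64 * 1024):
    """Return rough indicators that a file was cut short mid-record."""
    expected_end = record_offset + record_length
    return {
        "expected_end": expected_end,
        # How far past the end of the file the record would run (0 if it fits).
        "shortfall": max(0, expected_end - file_size),
        # An exact multiple of the block size hints at an interrupted copy.
        "size_is_block_multiple": file_size % block == 0,
    }


hints = truncation_hints(FILE_SIZE, RECORD_OFFSET, RECORD_LENGTH)
print(hints)
```

With these values the failing record would end roughly 7.7 MB past the end of the file, and the file size is an exact multiple of 64 KiB. Both would fit a transfer that stopped partway through, though that remains a guess until the file is compared against the original source.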

@aaron-collier aaron-collier removed their assignment Oct 12, 2023