cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer #214

andrewjbtw · 2020-02-27T19:34:08Z

79 web archive crawl objects have errors in the wasCrawlDisseminationWF of the form:

cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer /web-archiving-stacks/data/collections/kh149kf8484/bj/330/fg/0526/CDL-20100320000124-00037-oriole.ucop.edu-00131663.arc.gz /web-archiving-stacks/data/indices/cdx_working//druid:bj330fg0526/CDL-20100320000124-00037-oriole.ucop.edu-00131663.cdx 2>> log/cdx_indexer.log pid 959 exit 1

Some of these errors date back at least 4 years. I tried resetting the step for some items but they return to an error state.

Link to all current items with errors:
https://argo.stanford.edu/catalog?f%5Bwf_wps_ssim%5D%5B%5D=wasCrawlDisseminationWF%3Acdx-generator%3Aerror&per_page=100

Link to the one item with this error that was accessioned within the past year:
https://argo.stanford.edu/view/druid:cv292vs5727


jcoyne · 2020-04-21T21:14:57Z

In the log I see lots of similar Java stacktraces:

java.io.IOException: Failed parse of http status line.
        at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:294)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
        at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
        at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
java.io.IOException: Failed parse of http status line.
        at org.archive.io.RecoverableIOException.<init>(RecoverableIOException.java:36)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:294)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
        at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)

Does anyone know what that means?

aaron-collier · 2020-05-04T22:48:11Z

Looks like a known (and old) issue: iipc/openwayback#14

aaron-collier · 2020-05-04T23:09:17Z

Might be related (maybe not, just capturing it here). When I run the command manually, at the end I get:

WARNING: Trying skip of failed record cleanup of {WARC-Type=response, reader-identifier=/web-archiving-stacks/data/collections/xs048zp7815/cv/292/vs/5727/ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz, WARC-Date=2016-12-11T13:29:18Z, absolute-offset=471946099, Content-Length=19248744, WARC-Record-ID=<urn:uuid:f5405d28-65d0-4754-a1e5-b2175ede1bbc>, WARC-Payload-Digest=sha1:FXV6QEH674CTRXLMLVR5YQLEQCXS6D4X, WARC-IP-Address=54.230.141.103, WARC-Target-URI=http://dr6lcqo3bxtwa.cloudfront.net/binary/2016/12/9/23/1437582013143-p5k4ma/20161208b_CSPAN_BoxerFilibuster-1481328291115.mp4, Content-Type=application/http; msgtype=response}: invalid block type
java.util.zip.ZipException: invalid block type
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:262)
	at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:248)
	at org.archive.io.ArchiveRecord.close(ArchiveRecord.java:172)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:175)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.hasNext(ArchiveReaderCloseableIterator.java:37)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

May 04, 2020 4:06:44 PM org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
WARNING: Trying skip of failed record cleanup of {WARC-Type=response, reader-identifier=/web-archiving-stacks/data/collections/xs048zp7815/cv/292/vs/5727/ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz, WARC-Date=2016-12-11T13:29:18Z, absolute-offset=471946099, Content-Length=19248744, WARC-Record-ID=<urn:uuid:f5405d28-65d0-4754-a1e5-b2175ede1bbc>, WARC-Payload-Digest=sha1:FXV6QEH674CTRXLMLVR5YQLEQCXS6D4X, WARC-IP-Address=54.230.141.103, WARC-Target-URI=http://dr6lcqo3bxtwa.cloudfront.net/binary/2016/12/9/23/1437582013143-p5k4ma/20161208b_CSPAN_BoxerFilibuster-1481328291115.mp4, Content-Type=application/http; msgtype=response}: invalid block type
java.util.zip.ZipException: invalid block type
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
	at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:262)
	at org.archive.io.ArchiveRecord.skip(ArchiveRecord.java:248)
	at org.archive.io.ArchiveRecord.close(ArchiveRecord.java:172)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:175)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:501)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

May 04, 2020 4:06:44 PM org.archive.io.ArchiveReader$ArchiveRecordIterator next
WARNING: Bad Record. Trying skip (Record start 471946099): invalid block type
Exception in thread "main" java.lang.RuntimeException: After retry (Offset 471946099)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:512)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
Caused by: java.util.zip.ZipException: invalid block type
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
	at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
	at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
	at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:126)
	at org.archive.util.LaxHttpParser.readRawLine(LaxHttpParser.java:84)
	at org.archive.util.LaxHttpParser.readLine(LaxHttpParser.java:112)
	at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:113)
	at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:90)
	at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:94)
	at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader$1.innerNext(WARCReaderFactory.java:290)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:505)
	... 7 more
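The WARNING lines above name a specific record start (absolute-offset=471946099). A rough way to check that one record outside of cdx-indexer is sketched below. It assumes the offset is a byte position in the compressed .warc.gz file and that each WARC record is stored as its own gzip member, which is the usual layout but is not confirmed in this thread. `probe_gzip_member` is a hypothetical helper name, not part of OpenWayback.

```python
import zlib


def probe_gzip_member(path, offset, chunk_size=1 << 20):
    """Try to inflate the single gzip member that starts at `offset`.

    Returns (ok, bytes_inflated, message). Hypothetical helper for
    illustration only.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
    inflated = 0
    with open(path, "rb") as fh:
        fh.seek(offset)
        # `eof` becomes True once the member's trailer has been read.
        while not inflater.eof:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False, inflated, "file ended before the gzip member was complete"
            try:
                inflated += len(inflater.decompress(chunk))
            except zlib.error as exc:
                return False, inflated, f"zlib error: {exc}"
    return True, inflated, "member inflated cleanly"
```

Calling it as `probe_gzip_member(path_to_warc_gz, 471946099)` would show whether that record fails on its own, independently of the Java reader.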

aaron-collier · 2020-05-04T23:44:41Z

Perhaps for dev discussion (or a specific planning meeting on WAS), but I think the items with the above issues are the result of corrupt source files.

If I try to manually unzip the file from above:

[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz

gzip: ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz: invalid compressed data--format violated

But if I unzip a different file:

[was@was-robots1-prod testing]$ gunzip webrecorder-cidr-20160512212029480-00000-6-5d875e632943.warc.gz
[was@was-robots1-prod testing]$ ls
ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz  webrecorder-cidr-20160512212029480-00000-6-5d875e632943.warc

That worked fine. So I suspect we'll need a way to re-download/restart the source file. Just a hunch at this point.
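Since 79 objects are in this error state, the same gunzip test could be run across many files without unpacking them to disk. The sketch below is one way to do that with Python's standard library. `gzip_is_readable` is a hypothetical name, and the check only shows whether the gzip layer is readable, not whether the WARC records inside are valid.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1 << 20):
    """Return (True, None) if every gzip member in `path` inflates,
    otherwise (False, reason). Nothing is written to disk."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard; we only care whether an error is raised.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers gzip.BadGzipFile (bad header or CRC),
        # EOFError covers truncated files,
        # zlib.error covers invalid compressed data.
        return False, f"{type(exc).__name__}: {exc}"
    return True, None
```

Looping this over the files of each errored druid would give a list of files that fail at the gzip level, which may help separate corrupt source files from indexer problems.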

andrewjbtw · 2020-05-15T06:24:16Z

Adding a note that this error affects the Stanford University Websites collection (specifically druid cv292vs5727), which is a collection that we're hoping to actively accession in the near future.

aaron-collier · 2020-05-18T23:19:17Z

Just some continued investigation:

The zip files adjacent to the one I believe is corrupt are fine:

[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz

gzip: ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz: invalid compressed data--format violated
[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211133805177-00052.warc.gz
[was@was-robots1-prod testing]$ gunzip ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211135951009-00053.warc.gz

Also of note, it is much smaller than all of the other QUARTERLY zips:

-rw-r--r-- 1 was was 1089840585 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211135951009-00053.warc.gz
-rw-r--r-- 1 was was 1336199543 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211133805177-00052.warc.gz
-rw-r--r-- 1 was was  483459072 Oct 30  2019 ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz
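The size comparison can be made a bit more concrete. The sketch below flags files well under the median size of their siblings, using the three sizes from the listing above. The 0.5 ratio is an arbitrary assumption and `flag_small_files` is a hypothetical name. It also adds the record start and Content-Length from the WARNING earlier in this thread. Content-Length is the uncompressed record size, so this second comparison is only suggestive, on the assumption that an .mp4 payload barely compresses.

```python
from statistics import median

# Sizes in bytes, copied from the `ls -l` listing above.
sizes = {
    "ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211135951009-00053.warc.gz": 1089840585,
    "ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211133805177-00052.warc.gz": 1336199543,
    "ARCHIVEIT-5591-QUARTERLY-JOB253992-20161211132256193-00051.warc.gz": 483459072,
}


def flag_small_files(sizes, ratio=0.5):
    """Return names whose size is below `ratio` times the median size."""
    cutoff = median(sizes.values()) * ratio
    return sorted(name for name, size in sizes.items() if size < cutoff)


suspects = flag_small_files(sizes)
print(suspects)  # only the -00051 file is below half the median

# Figures from the cdx-indexer WARNING for the failing record.
record_start = 471946099
content_length = 19248744
file_size = sizes[suspects[0]]

# 471946099 + 19248744 = 491194843, which is past the 483459072-byte file end.
print(record_start + content_length > file_size)  # True
```

If the payload really is close to incompressible, the failing record would need to extend beyond the end of the file as it exists on disk, which would be consistent with (though not proof of) an incomplete copy of the source file.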

andrewjbtw changed the title ~~Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer~~ cdx-generator : Error in extracting CDX with command: jar/openwayback/bin/cdx-indexer Feb 28, 2020

aaron-collier self-assigned this May 4, 2020

aaron-collier transferred this issue from sul-dlss/was-registrar-app May 5, 2020

aaron-collier removed their assignment Oct 12, 2023
