avoid clobbering existing warc #137

traverseda · 2019-07-05T20:08:16Z

#first run
1.7M	warc_cache/warcs/book.pythontips.com.warc.gz
#Second run, exact same code
516K	warc_cache/warcs/book.pythontips.com.warc.gz
#Deleted dedupe but not warc file
1.7M	warc_cache/warcs/book.pythontips.com.warc.gz

It looks like the dedupe file is used again, but the warc file is being created from scratch. That's definitely not what I would expect. Is that how it's supposed to work? If you're recreating the warc file, shouldn't you be recreating the DB as well?

traverseda · 2019-07-05T20:24:37Z

Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that?

nlevitt · 2019-07-05T21:03:18Z

> It looks like the dedupe file is used again, but the warc file is being created from scratch. That's definitely not what I would expect. Is that how it's supposed to work? If you're recreating the warc file, shouldn't you be recreating the DB as well?

I'm not sure I understand these questions? Warcprox has no conception of "recreating the warc file".

> Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that?

See https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit

Closing as there doesn't seem to be an issue reported here.

traverseda · 2019-07-05T21:09:19Z

#first run
1.7M	warc_cache/warcs/book.pythontips.com.warc.gz
#Second run, exact same code
516K	warc_cache/warcs/book.pythontips.com.warc.gz

It is the same code being run twice. Can you explain why the files are different sizes? And why the warc file is smaller the second time?

nlevitt · 2019-07-05T21:16:22Z

Deduplication, presumably. The fact that it went back to the original size after you deleted the dedup db strongly corroborates this. You can also look in the warcs to see what's inside there...
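
One way to look inside, as a minimal sketch using only the Python standard library: tally the `WARC-Type` header values in each file. The function name is made up for illustration, and the line scan is naive (a payload line that happens to start with `WARC-Type:` would be miscounted), so treat the numbers as a quick check rather than a real parse.

```python
import gzip
from collections import Counter


def count_warc_record_types(path):
    """Tally WARC-Type header values in a gzipped WARC file (naive line scan)."""
    counts = Counter()
    # gzip.open reads through concatenated gzip members, which is how
    # record-at-a-time compressed WARCs are usually laid out.
    with gzip.open(path, "rb") as f:
        for line in f:
            if line.startswith(b"WARC-Type:"):
                value = line.split(b":", 1)[1].strip()
                counts[value.decode("ascii", "replace")] += 1
    return counts


if __name__ == "__main__":
    import sys
    for warc_path in sys.argv[1:]:
        print(warc_path, dict(count_warc_record_types(warc_path)))
```

If deduplication is the cause, the smaller file should show `revisit` records where the larger one has `response` records.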

traverseda · 2019-07-05T21:21:17Z

Notice how it is actually smaller on the second run. I've confirmed that that isn't because of compression.

Is the deduplication run out-of-band? How can the file become smaller the second time I run the command?

nlevitt · 2019-07-05T21:23:19Z

Deduplication means you don't save a second copy of something if you already have it. The second warc being smaller is the whole point.
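
A rough sketch of that idea, assuming a digest-keyed index: `DedupIndex`, the in-memory dict, and the hex SHA-1 digest format are all illustrative stand-ins, not warcprox's actual code or db layout.

```python
import hashlib


class DedupIndex:
    """Hypothetical stand-in for a persistent dedup db."""

    def __init__(self):
        self._seen = {}  # payload digest -> URL of the first capture

    def classify(self, url, payload):
        """Return (record_type, digest, original_url) for a captured payload."""
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        if digest in self._seen:
            # Payload already archived: a small revisit record pointing at
            # the earlier capture is enough, so the body is not stored again.
            return ("revisit", digest, self._seen[digest])
        self._seen[digest] = url
        return ("response", digest, None)


index = DedupIndex()
first = index.classify("http://example.com/", b"<html>same body</html>")
second = index.classify("http://example.com/", b"<html>same body</html>")
print(first[0], second[0])
```

With the index kept between runs, the second crawl of the same pages yields mostly revisit records, hence the smaller file; with the index deleted, every payload looks new again and the full responses are written.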

traverseda · 2019-07-05T21:24:16Z

So yes, it is deleting the original warc file and creating a new one, instead of appending the new results on to the end of the old one.

traverseda · 2019-07-05T21:25:25Z

That's the intended behavior?

Perhaps it could copy the old warc file to mywarc.0.warc or something? That behavior is not explicit and I found it to be very confusing. I had presumed it was a bug, and it took me a while to track down the issue.

nlevitt · 2019-07-05T21:45:40Z

Uhhh. Oh, now I understand the confusion. I had assumed that you had renamed your warcs from warcprox's default naming scheme to book.pythontips.com.warc.gz. Normally warcprox names its warcs such that it basically guarantees uniqueness. But I guess you are using --warc-filename and not using any of the {variables}. The bug you're reporting is that in case of a filename collision, which you can reproduce easily using --warc-filename, the old file gets clobbered. Ok, that's a legitimate bug.

Warcprox is not designed to write to a single warc. It rolls over to a new warc when the active warc reaches a configurable size, or a configurable time since the last write has elapsed.
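
The rollover rule can be sketched roughly as below. The function names, parameters, and the serial-number naming are assumptions made for illustration; this is not warcprox's implementation or its real naming scheme.

```python
def should_roll_over(size_bytes, seconds_since_last_write,
                     max_size_bytes, max_idle_seconds):
    """Return True when the active warc should be closed and a new one started."""
    return (size_bytes >= max_size_bytes
            or seconds_since_last_write >= max_idle_seconds)


def next_warc_name(prefix, serial):
    """Build a name with a zero-padded serial so successive warcs never collide."""
    return f"{prefix}-{serial:05d}.warc.gz"


# Hypothetical thresholds: roll at 1 GB or after 60 idle seconds.
print(should_roll_over(2_000_000, 75.0, 1_000_000_000, 60.0))
print(next_warc_name("mycrawl", 1))
```

Because every rollover needs a fresh name, a fixed filename with no variables in it is bound to collide sooner or later.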

I'm thinking we should rename --warc-filename to --warc-filename-template and require at least one of {timestamp17} and {serialno}, and probably panic and die in case of a filename collision.
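
A minimal sketch of what that could look like, assuming the check is a simple substring test and that collisions are caught at open time. `validate_template` and `open_new_warc` are hypothetical helpers, and `--warc-filename-template` is only the name proposed above, not an existing option.

```python
REQUIRED_VARIABLES = ("{timestamp17}", "{serialno}")


def validate_template(template):
    """Reject templates that cannot produce unique filenames."""
    if not any(var in template for var in REQUIRED_VARIABLES):
        raise ValueError(
            "warc filename template must contain at least one of "
            + " or ".join(REQUIRED_VARIABLES)
        )
    return template


def open_new_warc(path):
    """Create a new warc file, refusing to touch an existing one."""
    # Mode "x" raises FileExistsError instead of truncating, so a filename
    # collision stops the program rather than clobbering the old file.
    return open(path, "xb")
```

Under this sketch a template like `book.pythontips.com.warc.gz` is rejected up front, and even a valid template fails loudly if the resolved name already exists on disk.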

nlevitt · 2019-07-05T21:48:39Z

> I'm thinking we should rename --warc-filename to --warc-filename-template and require at least one of {timestamp17} and {serialno}, and probably panic and die in case of a filename collision.

@vbanos since --warc-filename is your feature, do you have time to implement this improvement? 😃

traverseda · 2019-07-05T21:49:52Z

Yeah, that would be a lot less confusing.

Panicking and dying in the case of a filename collision would probably be fine; that would have forced me to read the docs more. I was operating under the assumption that the warc files were essentially an append-only log.

My bad, but it took an embarrassingly long time to notice there was a problem while I was busy dealing with selenium issues.

nlevitt closed this as completed Jul 5, 2019

nlevitt changed the title ~~Different sized warc on second run?~~ avoid clobbering existing warc Jul 5, 2019

nlevitt reopened this Jul 5, 2019
