avoid clobbering existing warc #137

Open
traverseda opened this issue Jul 5, 2019 · 11 comments

@traverseda

#first run
1.7M	warc_cache/warcs/book.pythontips.com.warc.gz
#Second run, exact same code
516K	warc_cache/warcs/book.pythontips.com.warc.gz
#Deleted dedupe but not warc file
1.7M	warc_cache/warcs/book.pythontips.com.warc.gz

It looks like the dedupe file is used again, but the warc file is being created from scratch. That's definitely not what I would expect. Is that how it's supposed to work? If you're recreating the warc file, shouldn't you be recreating the DB as well?

@traverseda
Author

Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that?

@nlevitt
Contributor

nlevitt commented Jul 5, 2019

> It looks like the dedupe file is used again, but the warc file is being created from scratch. That's definitely not what I would expect. Is that how it's supposed to work? If you're recreating the warc file, shouldn't you be recreating the DB as well?

I'm not sure I understand these questions? Warcprox has no conception of "recreating the warc file".

> Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that?

See https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit

Closing as there doesn't seem to be an issue reported here.

@nlevitt nlevitt closed this as completed Jul 5, 2019
@traverseda
Author

traverseda commented Jul 5, 2019

#first run
1.7M	warc_cache/warcs/book.pythontips.com.warc.gz
#Second run, exact same code
516K	warc_cache/warcs/book.pythontips.com.warc.gz

It is the same code being run twice. Can you explain why the files are different sizes? And why the warc file is smaller the second time?

@nlevitt
Contributor

nlevitt commented Jul 5, 2019

Deduplication, presumably. The fact that it went back to the original size after you deleted the dedup db strongly corroborates this. You can also look in the warcs to see what's inside there...
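
One rough way to look inside, sketched with only the Python standard library: tally the `WARC-Type` header lines in the file. This is a heuristic, not a real WARC parser (a payload that happens to contain such a line would be miscounted), and `count_record_types` is a hypothetical helper, not part of warcprox. If deduplication is the cause, the second run would be expected to show `revisit` records where the first run had `response` records.

```python
import gzip
from collections import Counter

def count_record_types(path):
    """Tally WARC-Type header values in a .warc.gz file (rough heuristic)."""
    counts = Counter()
    # gzip.open reads concatenated gzip members transparently, which is how
    # .warc.gz files are typically laid out (one member per record).
    with gzip.open(path, "rb") as f:
        for line in f:
            if line.startswith(b"WARC-Type:"):
                value = line.split(b":", 1)[1].strip().decode("ascii")
                counts[value] += 1
    return counts
```

Calling `count_record_types("warc_cache/warcs/book.pythontips.com.warc.gz")` after each run and comparing the tallies should show whether the size difference comes from revisit records.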

@traverseda
Author

Notice how it is actually smaller on the second run. I've confirmed that that isn't because of compression.

Is the deduplication run out-of-band? How can the file become smaller the second time I run the command?

@nlevitt
Contributor

nlevitt commented Jul 5, 2019

Deduplication means you don't save a second copy of something if you already have it. The second warc being smaller is the whole point.
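
A minimal sketch of that idea, assuming a dedup index keyed by payload digest. This is an illustration only, not warcprox's actual implementation; `archive` and the dict-based records are hypothetical stand-ins for the real WARC writer and dedup db.

```python
import hashlib

def archive(payload, url, dedup_index):
    """Return the record that would be written for this capture."""
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    if digest in dedup_index:
        # Payload already archived: write a small revisit record that
        # points back at the original capture instead of storing it again.
        return {"type": "revisit", "url": url, "digest": digest,
                "refers_to": dedup_index[digest], "body": b""}
    dedup_index[digest] = url
    return {"type": "response", "url": url, "digest": digest, "body": payload}

index = {}  # stands in for the persistent dedup db
first = archive(b"<html>hello</html>", "http://example.com/", index)
second = archive(b"<html>hello</html>", "http://example.com/", index)
print(first["type"], len(first["body"]))    # response 18
print(second["type"], len(second["body"]))  # revisit 0
```

With the index kept between runs, the second capture stores no payload, so its output is smaller. Starting again with an empty index (as when the dedup db is deleted) makes the same capture full-sized again.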

@traverseda
Author

So yes, it is deleting the original warc file and creating a new one, instead of appending the new results on to the end of the old one.

@traverseda
Author

traverseda commented Jul 5, 2019

That's the intended behavior?

Perhaps it could copy the old warc file to mywarc.0.warc or something? That behavior is not explicit and I found it to be very confusing. I had presumed it was a bug, and it took me a while to track down the issue.

@nlevitt
Contributor

nlevitt commented Jul 5, 2019

Uhhh. Oh, now I understand the confusion. I had assumed that you had renamed your warcs from warcprox's default naming scheme to book.pythontips.com.warc.gz. Normally warcprox names its warcs such that it basically guarantees uniqueness. But I guess you are using --warc-filename and not using any of the {variables}. The bug you're reporting is that in case of a filename collision, which you can reproduce easily using --warc-filename, the old file gets clobbered. Ok, that's a legitimate bug.

Warcprox is not designed to write to a single warc. It rolls over to a new warc when the active warc reaches a configurable size, or a configurable time since the last write has elapsed.
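
The rollover rule described above can be sketched roughly as follows. The function and parameter names are hypothetical, chosen for illustration, and are not taken from warcprox's code.

```python
import time

def should_roll_over(current_size, last_write_time, max_size, idle_timeout, now=None):
    """Decide whether the active WARC should be closed and a new one started.

    Sizes are in bytes, times in seconds since the epoch.
    """
    now = time.time() if now is None else now
    size_reached = current_size >= max_size
    idle_too_long = (now - last_write_time) >= idle_timeout
    return size_reached or idle_too_long
```

Under this rule a crawl produces a series of files, which is why each one needs a distinct name.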

I'm thinking we should rename --warc-filename to --warc-filename-template and require at least one of {timestamp17} and {serialno}, and probably panic and die in case of a filename collision.
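
A minimal sketch of what that proposal could look like, assuming the template uses Python `str.format`-style fields. `validate_template` and `open_new_warc` are hypothetical names, not existing warcprox functions.

```python
import string

REQUIRED_FIELDS = {"timestamp17", "serialno"}

def validate_template(template):
    """Raise ValueError unless the template uses {timestamp17} or {serialno}."""
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion).
    fields = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    if not fields & REQUIRED_FIELDS:
        raise ValueError(
            "template must contain at least one of: "
            + ", ".join(sorted(REQUIRED_FIELDS)))
    return template

def open_new_warc(path):
    """Create the file, raising FileExistsError rather than clobbering it."""
    # Mode "x" is exclusive creation: open() fails if the path already exists.
    return open(path, "xb")
```

Here a fixed name such as `book.pythontips.com` is rejected up front, and any collision that still occurs fails loudly at open time instead of overwriting the earlier file.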

@nlevitt nlevitt changed the title from "Different sized warc on second run?" to "avoid clobbering existing warc" Jul 5, 2019
@nlevitt nlevitt reopened this Jul 5, 2019
@nlevitt
Contributor

nlevitt commented Jul 5, 2019

> I'm thinking we should rename --warc-filename to --warc-filename-template and require at least one of {timestamp17} and {serialno}, and probably panic and die in case of a filename collision.

@vbanos since --warc-filename is your feature, do you have time to implement this improvement? 😃

@traverseda
Author

Yeah, that would be a lot less confusing.

Panicking and dying in the case of a filename collision would probably be fine; that would have forced me to read the docs more. I was operating under the assumption that the warc files were essentially an append-only log.

My bad, but it took an embarrassingly long time to notice there was a problem while I was busy dealing with selenium issues.
