avoid clobbering existing warc #137
Comments
Also, this implies to me that even if two files are identical and have identical URLs, a dedupe record is still written? Why is that? |
I'm not sure I understand these questions? Warcprox has no conception of "recreating the warc file".
See https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit Closing as there doesn't seem to be an issue reported here. |
It is the same code being run twice. Can you explain why the files are different sizes? And why the warc file is smaller the second time? |
Deduplication, presumably. The fact that it went back to the original size after you deleted the dedup db strongly corroborates this. You can also look in the warcs to see what's inside there... |
Notice how it is actually smaller on the second run. I've confirmed that that isn't because of compression. Is the deduplication run out-of-band? How can the file become smaller the second time I run the command? |
Deduplication means you don't save a second copy of something if you already have it. The second warc being smaller is the whole point. |
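To make the size difference concrete, here is a toy model of the idea, not warcprox's actual code. It assumes deduplication is keyed on a digest of the payload, and that a repeated payload produces a small revisit-style record with no body. The names `record_for`, `run_size`, and `seen_digests` are hypothetical, and the set stands in for the dedup db.

```python
import hashlib


def record_for(payload, seen_digests):
    """Return a toy record for payload, skipping the body if it was seen before."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen_digests:
        # Already archived: note the repeat, but do not store the body again.
        return {"type": "revisit", "digest": digest, "body": b""}
    seen_digests.add(digest)
    return {"type": "response", "digest": digest, "body": payload}


def run_size(payloads, seen_digests):
    """Total number of body bytes one crawl run would write."""
    return sum(len(record_for(p, seen_digests)["body"]) for p in payloads)
```

With the same set of digests carried over between runs, the second run writes only the revisit notes, so it comes out smaller. Starting from an empty set, which is roughly what deleting the dedup db does, brings the size back to that of the first run.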
So yes, it is deleting the original warc file and creating a new one, instead of appending the new results on to the end of the old one. |
That's the intended behavior? Perhaps it could copy the old warc file to |
Uhhh. Oh, now I understand the confusion. I had assumed that you had renamed your warcs from warcprox's default naming scheme to book.pythontips.com.warc.gz. Normally warcprox names its warcs such that it basically guarantees uniqueness. But I guess you are using Warcprox is not designed to write to a single warc. It rolls over to a new warc when the active warc reaches a configurable size, or a configurable time since the last write has elapsed. I'm thinking we should rename |
@vbanos since --warc-filename is your feature, do you have time to implement this improvement? 😃 |
Yeah, that would be a lot less confusing. Panicking and dying in the case of filename collision would probably be fine, that would have forced me to read the docs more. I was operating under the assumption that the warc files were essentially an append-only log. My bad, but it took an embarrassingly long time to notice there was a problem while I was busy dealing with selenium issues. |
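For illustration only, failing on a filename collision could look something like the sketch below. This is not how warcprox is implemented; `open_warc_exclusive` and `demo` are hypothetical names, and the sketch relies only on Python's standard exclusive-creation mode (`"x"`), which raises `FileExistsError` instead of truncating an existing file.

```python
import os
import tempfile


def open_warc_exclusive(path):
    """Open path for binary writing, failing if the file already exists."""
    # Mode "xb" is exclusive creation: it raises FileExistsError
    # rather than silently truncating an existing file.
    return open(path, "xb")


def demo():
    """Write a file once, then show that a second open attempt is refused."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.warc.gz")
        with open_warc_exclusive(path) as f:
            f.write(b"first run")
        try:
            open_warc_exclusive(path)
        except FileExistsError:
            collided = True
        else:
            collided = False
        # The data from the first run is still intact.
        with open(path, "rb") as f:
            contents = f.read()
    return collided, contents
```

The second attempt fails loudly and the first run's data is left untouched, which is the "panic on collision" behavior described above.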
It looks like the dedupe file is used again, but the warc file is being created from scratch. That's definitely not what I would expect. Is that how it's supposed to work? If you're recreating the warc file, shouldn't you be recreating the DB as well?