pywb record inserting domain and collection name into recorded URL on specific sites #886

Open
JakeBickUKGWA opened this issue Feb 22, 2024 · 2 comments

JakeBickUKGWA commented Feb 22, 2024

Describe the bug

When recording specific sites, pywb record appears to duplicate content in the recorded URL, and the same happens in playback. The original target page seems to be captured correctly, but when you navigate away from it and try to return, you get a 404. I have also tried this with ArchiveWeb.page and see similar behaviour.

Steps to reproduce the bug

  1. Attempt to record one of the affected pages with pywb record. The recording URL will look something like this:

[pywb instance URL]/[collection]/record/https://teesvalley-ca.gov.uk/about/leadership/cabinet-boards-committees/meetings/local-enterprise-partnership/

  2. The page loads normally for a second, but then the URL changes to:

[pywb instance URL]/[collection]/record/https://teesvalley-ca.gov.uk/[collection]/record/mp_/https://teesvalley-ca.gov.uk/about/leadership/cabinet-boards-committees/meetings/local-enterprise-partnership/

  3. If you try to play back the page, e.g. by going to:

[pywb instance URL]/[collection]/20240222092319/https://teesvalley-ca.gov.uk/about/leadership/cabinet-boards-committees/meetings/local-enterprise-partnership/

The page renders, but the URL changes to:

[pywb instance URL]/[collection]/20240222092319/https://teesvalley-ca.gov.uk/[collection]/20240222092319mp_/https://teesvalley-ca.gov.uk/about/leadership/cabinet-boards-committees/meetings/local-enterprise-partnership/

  4. If you click on another captured page linked from this page and then try to go back, you get a 'URL not found' error.
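
To make the symptom in the steps above easier to spot in logs or tests, here is a minimal sketch that flags a doubled URL and recovers the innermost target. The function name `unwrapDoubledUrl`, the `prefix` parameter, and the `pywb.example` host are hypothetical and are not part of pywb or wombat. It assumes the real target is whatever follows the last embedded `http://` or `https://`, so it could misfire on pages that legitimately carry a full URL in their path or query string.

```javascript
// Hypothetical helper for illustration only; not a pywb or wombat API.
// `prefix` is the record/replay prefix, e.g. "https://pywb.example/coll/record/".
function unwrapDoubledUrl(url, prefix) {
  if (!url.startsWith(prefix)) {
    return { doubled: false, target: null };
  }
  const rest = url.slice(prefix.length);
  // Find the last embedded absolute URL after the prefix.
  const last = Math.max(rest.lastIndexOf("https://"), rest.lastIndexOf("http://"));
  if (last <= 0) {
    // Only one absolute URL (at position 0), or none: nothing was doubled.
    return { doubled: false, target: rest };
  }
  // Something (here: host + collection + mode) was inserted before the real target.
  return { doubled: true, target: rest.slice(last) };
}

const prefix = "https://pywb.example/coll/record/";
const target = "https://teesvalley-ca.gov.uk/business/tees-valley-business-board/";
const doubledUrl = prefix + "https://teesvalley-ca.gov.uk/coll/record/mp_/" + target;

console.log(unwrapDoubledUrl(doubledUrl, prefix));
console.log(unwrapDoubledUrl(prefix + target, prefix));
```

The same check would apply to the replay case by passing the timestamped prefix instead of the record prefix.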

Expected behavior

I would expect it not to insert the additional information in the URLs and to play back normally.

Screenshots

How the page looks after the URL has changed
image

Similar issue with ArchiveWeb.page playback
image

Environment

We have just updated to the latest version of pywb. I can try to find more specific version information if required.
I am using v0.11.3 of ArchiveWeb.page.

Additional context

This only seems to occur on this site; other sites seem to be capturing as normal. The specific pages I have tried are:
https://teesvalley-ca.gov.uk/business/tees-valley-business-board/
https://teesvalley-ca.gov.uk/about/leadership/cabinet-boards-committees/meetings/local-enterprise-partnership
https://teesvalley-ca.gov.uk/about/leadership/cabinet-boards-committees/meetings/local-enterprise-partnership/local-enterprise-partnership-archive/

Not sure if this is related, but there also appear to be some minor layout differences between the captured versions and the live web (e.g. the title text is left-aligned instead of centred in the captured version).

ikreymer added a commit to webrecorder/wombat that referenced this issue Feb 23, 2024
…ewritten, to be more flexible with rewriting issues caused elsewhere

(fixes webrecorder/pywb#886)
bump to 3.7.2
@ikreymer
Member

There are a few issues with this site:

  • It loads a script from a base64 data: URL, e.g.: <script type="text/javascript" src="data:text/javascript;base64,dmFyIHJlbGV2YW5zc2l... This causes the rewriting not to be applied properly to that script, though fortunately it's possible to detect this in the history intercept.
  • It also overrides self with var self, which conflicts with how the rewriting works, since the rewriting overrides self with a let self assignment.
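
As a rough illustration of the first point, a script whose src is a base64 data: URL carries its source inside the attribute itself, so it has to be decoded before its contents can be inspected or rewritten. The sketch below is not how pywb, wombat, or wabac.js handle this; `decodeBase64ScriptSrc` is a hypothetical helper, the payload is a made-up stand-in rather than the site's actual script, and `Buffer` assumes a Node.js environment.

```javascript
// Hypothetical helper for illustration; not a pywb/wombat/wabac.js API.
// Returns the decoded script text for a base64 JavaScript data: URL, else null.
function decodeBase64ScriptSrc(src) {
  const match = /^data:(?:text|application)\/javascript;base64,([A-Za-z0-9+/=]*)$/.exec(src);
  if (!match) {
    return null; // ordinary URL or a non-base64 data: URL
  }
  return Buffer.from(match[1], "base64").toString("utf8");
}

// Made-up inline script, encoded the way the page embeds its own.
const inlineSource = "var self = window;";
const src =
  "data:text/javascript;base64," + Buffer.from(inlineSource, "utf8").toString("base64");

console.log(decodeBase64ScriptSrc(src));
console.log(decodeBase64ScriptSrc("https://example.com/app.js"));
```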

The history fix needs to be done in wombat, while the other fixes need to be done in pywb / wabac.js.
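
The var self / let self conflict can be reproduced in isolation. This is a minimal sketch under the assumption that the injected override is a let declaration placed in a block around the page's script; the strings below are simplified stand-ins, not the actual output of the rewriter, and `compileError` is a hypothetical helper.

```javascript
// Compiles `source` as a function body and returns the error name, or null on success.
function compileError(source) {
  try {
    new Function(source);
    return null;
  } catch (err) {
    return err.name;
  }
}

// Simplified stand-ins (assumptions), not real rewriter output.
const pageScript = "var self = this;";
const wrapped = "{ let self = {}; " + pageScript + " }";

// The page script is valid on its own.
console.log(compileError(pageScript)); // null

// `var self` hoists out of the block and collides with the block-scoped
// `let self`, so the combined script fails to parse.
console.log(compileError(wrapped)); // "SyntaxError"
```

Because the failure is a parse-time error, none of the wrapped script runs, which is consistent with rewriting not taking effect for that script.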

ikreymer added a commit to webrecorder/wombat that referenced this issue Feb 24, 2024
…ewritten, to be more flexible with rewriting issues caused elsewhere (#138)

(fixes most significant issue in webrecorder/pywb#886)
bump to 3.7.2
@JakeBickUKGWA
Author

Thanks Ilya, really appreciate you looking at this!
