index url computation fails on some wikis #445
Comments
As an aside, shouldn't the
Thanks for your report. This is a bug but we have workarounds.
There are various ways of scraping it in different versions; the canonical URL is whatever https://www.ssbwiki.com/api.php?action=query&meta=siteinfo&siprop=general says. The API seems to be correct as it says /index.php and https://www.ssbwiki.com/index.php?title=Special%3AExport&pages=Main_Page ; so ideally we should pick that up, not sure why it got rewritten to a (wrong) scraped URL. In general, however, you cannot expect the URL guesser to know all potential non-standard short URL formats. Have you tried passing the
No, because the index could be index.html (as in, the webserver's default page). What we need is the location of the index.php script (to which URL parameters for Special:Export can be appended). Depending on the webserver's rewrite rules, the URL could be anything, including https://www.ssbwiki.com/Main_Page .
The reason dumpgenerator is tricked into believing that the Main Page is the index.php is that every page in this wiki behaves like it's index.php: you can append any URL parameter to any URL, except the title parameter. For example, the title parameter doesn't do anything here: https://www.ssbwiki.com/Special:Export?pages=Main+Page&curonly=1&templates=1&wpDownload=1&wpEditToken=%2B%5C&title=Special%3AListUsers . This is a very confusing and bad webserver configuration, which I don't recommend and which I'm not sure we should support. But if the API gives us a good result we should use it, and if the user gives us a good index.php URL we must use it.
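To make that precedence concrete, here is a minimal sketch, not the actual dumpgenerator code. It assumes the `general` block of the siteinfo reply has already been parsed into a dict and that it carries MediaWiki's `server` and `script` fields; the function names `index_from_siteinfo` and `choose_index` are hypothetical.

```python
from urllib.parse import urljoin


def index_from_siteinfo(api_url, general):
    """Build the index.php URL from the 'general' block of a siteinfo reply.

    Hypothetical helper: assumes 'server' and 'script' are both present.
    """
    # 'server' may be protocol-relative (e.g. "//example.org"); resolving it
    # against the API URL fills in the missing scheme.
    return urljoin(api_url, general["server"] + general["script"])


def choose_index(user_index, api_index, scraped_index):
    """Prefer the user's --index value, then the API's answer, then the scrape."""
    for candidate in (user_index, api_index, scraped_index):
        if candidate:
            return candidate
    return None
```

With a `script` value of /index.php, as the API reports for this wiki, and a `server` value of https://www.ssbwiki.com (assumed here), the first helper yields https://www.ssbwiki.com/index.php, and `choose_index` would only fall back to the scraped https://www.ssbwiki.com/Main_Page if neither the user nor the API supplied anything.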
Ah, I see, I misunderstood.
Yes, it does seem to work.
That's what I figured but I wasn't sure, thanks for confirming. I'll try to see if we can use the correct value from the API (maybe run a full dump too) and get back to you. Thanks for the help!
The two potential issues are:
(I originally opened an issue on the wikiteam3 fork, so you can take a look here if you want, but I'll sum everything up here so you don't need to)
On some wikis, such as https://www.ssbwiki.com/ and https://www.mariowiki.com, wikiteam grabs the wrong index url and then the export fails with a misleading error.
mwGetAPIAndIndex in api.py parses the HTML of the main page to get the index url, using the view source button as reference. For ssbwiki and mariowiki, the view source button sends you to https://www.ssbwiki.com/Main_Page?action=edit and https://www.mariowiki.com/Main_Page?action=edit. This is odd because the Main page button sends you to https://www.ssbwiki.com/ and https://www.mariowiki.com/. The /Main_Page urls automatically redirect you to this raw/shortened form.
For comparison, the way the Archiveteam wiki works is that the Main page button sends you to https://wiki.archiveteam.org/index.php/Main_Page while the view source button sends you to https://wiki.archiveteam.org/index.php?title=Main_Page&action=edit, so the index variable is set to https://wiki.archiveteam.org/index.php. There are no redirects, although the raw https://wiki.archiveteam.org/ url and the https://wiki.archiveteam.org/index.php url work just as well as https://wiki.archiveteam.org/index.php/Main_Page.
In order for the rest of the program to work, the index variable should have actually been set to https://www.ssbwiki.com/index.php.
Indeed, we then use the index variable to construct https://www.ssbwiki.com/Main_Page?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1 , which does not return XML, which causes the program to crash (https://www.ssbwiki.com?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1 does not work either). The correct link is https://www.ssbwiki.com/index.php?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1.
The correct index url can be found elsewhere on the parsed HTML. Unfortunately I don't know if there's a canonical location - the closest I can think of would be the log in button, but that might break other wikis if we changed this behavior.
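For reference, the export URLs above can be reproduced with a few lines of Python. This is only a sketch of the idea, not wikiteam's actual code: `build_export_url` and `looks_like_export_xml` are hypothetical names, and the XML check is a rough heuristic that assumes a real export starts with a `<mediawiki>` root element near the top of the response.

```python
from urllib.parse import urlencode


def build_export_url(index, page="Main_Page"):
    """Append the Special:Export parameters to a candidate index URL."""
    params = {
        "title": "Special:Export",
        "pages": page,
        "action": "submit",
        "curonly": "1",
        "limit": "1",
    }
    return index + "?" + urlencode(params)


def looks_like_export_xml(body):
    """Heuristic: an export begins with a <mediawiki> root element,
    while a wrongly guessed index returns an ordinary HTML page."""
    return "<mediawiki" in body[:500]
```

Fetching `build_export_url(index)` and passing the response text to `looks_like_export_xml` would let the index check reject https://www.ssbwiki.com/Main_Page instead of reporting it as OK.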
I'm aware that the README states "If the script can't find itself the API and/or index.php paths, then you can provide them", but the error message does not make it obvious that this is the issue. In fact, wikiteam prints "Checking index.php... https://www.ssbwiki.com/Main_Page index.php is OK".
I don't know enough about wikis to understand if this is something that can or should be fixed. Perhaps we could at the very least handle the error more gracefully and suggest the user manually adds the --index argument instead of simply saying "XML export on this wiki is broken", which isn't exactly accurate. But then again, I'm not sure how to instruct them to find the proper index url.
I am willing to attempt a PR if you could perhaps point me in the right direction!
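As a rough illustration of what a friendlier failure message could look like, here is a sketch. The function name and the wording are hypothetical, not anything wikiteam currently prints, and the ".php" test is only a heuristic, since an index script is not guaranteed to carry that name.

```python
def export_failure_hint(index):
    """Compose a hint for the case where Special:Export did not return XML."""
    lines = ["Special:Export did not return XML when requested through " + index]
    # Heuristic only: a guessed short URL such as .../Main_Page has no .php script name.
    if not index.split("?")[0].endswith(".php"):
        lines.append(
            "This URL does not look like an index.php script; "
            "it may have been guessed from a short URL."
        )
    lines.append(
        "Try passing the script location explicitly with --index. "
        "The wiki's api.php?action=query&meta=siteinfo&siprop=general "
        "output reports the script path."
    )
    return "\n".join(lines)
```

Printing `export_failure_hint("https://www.ssbwiki.com/Main_Page")` in place of the generic "XML export on this wiki is broken" message would point the user at both the likely cause and the place to look up the right value.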