Takes two lists of URLs and outputs a mapping that assigns each entry in list 1 an item from list 2 along with a score that indicates how likely the two refer to the same thing.
This script was created to automatically generate a map of redirects when migrating a website. The input lists would be the sitemaps of the old and the new website, both plain text files containing one URL per line. The URLs are required to be "pretty", meaning not just /post.php?id=123 but rather something like /blog/why-wordpress-sucks, and should ideally have their protocol and domain parts removed.
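If your sitemaps contain absolute URLs, the protocol and domain can be stripped before matching. The following is a minimal sketch using only the standard library; the function names `strip_origin` and `normalise_lines` are made up for illustration and are not part of this repository. It also drops query strings and fragments, which is an assumption that only makes sense for "pretty" URLs.

```python
from urllib.parse import urlsplit


def strip_origin(url: str) -> str:
    """Return only the path of a URL, dropping scheme, host, query and fragment.

    Hypothetical helper, not part of map.py.
    """
    path = urlsplit(url.strip()).path
    # A bare domain such as https://example.com has an empty path.
    return path or "/"


def normalise_lines(lines):
    """Yield the stripped path for every non-empty line."""
    for line in lines:
        if line.strip():
            yield strip_origin(line)


# Usage idea (file names are examples):
#   with open("old_sitemap.txt") as f:
#       paths = list(normalise_lines(f))
```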
It can of course be used as a generic tool to fuzzy match two sets of strings. It uses the Levenshtein distance metric as implemented by python-Levenshtein.
Warning: Always check the results manually. Never trust the output of the script blindly. It will assign each item in list 1 one item from list 2, even if it's a really bad match.
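To illustrate the idea behind the matching, and why a bad match can still be returned, here is a rough pure-Python sketch. It is not the code used by map.py: the script relies on python-Levenshtein, and the normalisation of the distance into a score shown here (`similarity`) is an assumption that may differ from the script's actual scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to a 0..1 score (ASSUMED formula, 1.0 = identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(item, candidates):
    """Return (candidate, score) with the highest score, however low it is."""
    return max(((c, similarity(item, c)) for c in candidates),
               key=lambda pair: pair[1])
```

Note that `best_match` always returns a candidate as long as the candidate list is not empty, which is exactly why low-scoring results need a manual check.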
- Clone this repository
git clone https://github.com/jsphpl/redirect-mapper
- Enter it
cd redirect-mapper
- Install dependencies
python setup.py install
- Use it:
$ python map.py [-h] [-t VALUE] [-c PATH] [-d] list1 list2
Generates a redirect map from two sitemaps for website migration.
By default, all matches are dumped on the standard output. If an item
from list1 is exactly contained in list2, it will be assigned right
away, without calculating distance or checking for ambiguity.
Issues & Documentation: https://github.com/jsphpl/redirect-mapper
positional arguments:
list1 List of target items for which to find matches. (1 item per line)
list2 List of search items on which to search for matches. (1 item per line)
optional arguments:
-h, --help show this help message and exit
-t VALUE, --threshold VALUE
Range within which two scores are considered equal. (default: 0.05)
-c PATH, --csv PATH If specified, the output will be formatted as CSV and written to PATH
-d, --drop-exact If specified, exact matches will be omitted from the output
Say you want to know where to redirect all the URLs from old_sitemap.txt. Pass it as the first argument, like so:
python map.py old_sitemap.txt new_sitemap.txt
To influence the level at which two matches are considered equally good, use the -t VALUE argument (default: 0.05).
python map.py -t 0.1 old_sitemap.txt new_sitemap.txt
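One way to read the threshold: all candidates whose score lies within VALUE of the best score count as equally good, so the match is ambiguous. The sketch below shows that interpretation only; how map.py handles ambiguity internally is not documented here, and `ambiguous_candidates` is a hypothetical name.

```python
def ambiguous_candidates(scores, threshold=0.05):
    """Return (candidate, score) pairs within `threshold` of the best score.

    `scores` maps candidate strings to similarity scores between 0 and 1.
    ASSUMPTION: "considered equal" means best_score - score <= threshold.
    """
    if not scores:
        return []
    best = max(scores.values())
    tied = [(c, s) for c, s in scores.items() if best - s <= threshold]
    # Best candidates first.
    return sorted(tied, key=lambda pair: pair[1], reverse=True)
```

With a larger threshold, more candidates end up in the "equally good" group, so more matches are flagged as ambiguous.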
If the results are used to set up 301 redirects on the new website to catch all traffic arriving at old URLs, exact matches can be omitted. They will be handled by the actual pages existing on the new site (list2). Use the -d flag here.
python map.py -d old_sitemap.txt new_sitemap.txt
To write the results to a CSV file, specify the output filename with -c PATH.
python map.py -c results.csv old_sitemap.txt new_sitemap.txt
A helper script, aggregate.py, crawls an XML sitemap and outputs a flat list of URLs in the format that map.py requires as input. Together with that tool, the whole process of generating a redirect map could look like the following. After that, you would of course manually check results.csv, taking special care of matches with a low score (≤0.8).
python aggregate.py https://old-website.com/sitemap.xml > old.txt
python aggregate.py https://new-website.com/sitemap.xml > new.txt
python map.py --drop-exact --csv results.csv old.txt new.txt
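The manual check can be made a little easier by listing the low-scoring rows first. The following sketch assumes that the score is stored in the last column of each CSV row; the actual column layout of results.csv is not documented here, so adjust `score_column` as needed. The function name `low_score_rows` is hypothetical.

```python
import csv


def low_score_rows(csv_file, cutoff=0.8, score_column=-1):
    """Return all rows whose score is at or below `cutoff`.

    ASSUMPTION: the score sits in the last column. Rows where that column
    is not a number (for example a header row) are skipped.
    """
    flagged = []
    for row in csv.reader(csv_file):
        if not row:
            continue
        try:
            score = float(row[score_column])
        except ValueError:
            continue
        if score <= cutoff:
            flagged.append(row)
    return flagged


# Usage idea:
#   with open("results.csv", newline="") as f:
#       for row in low_score_rows(f):
#           print(row)
```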
$ python aggregate.py [-h] URL/PATH
Aggregates URLs from a set of XML sitemaps listed under the entry path.
This script processes the XML file at given path, opens all sitemaps
listed inside, and prints all URLs inside those maps to stdout.
It should support most sitemaps that comply with the spec at
https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd.
It was tested with sitemaps generated by the following WP plugins:
- (Google XML Sitemaps)[https://wordpress.org/plugins/google-sitemap-generator/]
- (XML Sitemap & Google News feeds)[https://wordpress.org/plugins/xml-sitemap-feed/]
- (Yoast SEO)[https://wordpress.org/plugins/wordpress-seo/]
Issues & Documentation: https://github.com/jsphpl/redirect-mapper
positional arguments:
URL/PATH Path or URL of the root sitemap.
optional arguments:
-h, --help show this help message and exit
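The aggregation described above can be sketched roughly as follows, using only the standard library. This is not the code of aggregate.py, just an illustration of walking a sitemap index and collecting the URLs of the sitemaps it lists. The names `extract_locs` and `aggregate` are hypothetical, and `fetch` is a caller-supplied function (an assumption) that returns the XML text for a given URL or path.

```python
import xml.etree.ElementTree as ET

# Namespace used by documents following the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text):
    """Return (kind, locations), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    tag = root.tag.rsplit("}", 1)[-1]  # strip the "{namespace}" prefix
    if tag == "sitemapindex":
        kind, locs = "index", root.findall("sm:sitemap/sm:loc", NS)
    elif tag == "urlset":
        kind, locs = "urlset", root.findall("sm:url/sm:loc", NS)
    else:
        raise ValueError("not a sitemap document: <%s>" % tag)
    return kind, [loc.text.strip() for loc in locs if loc.text]


def aggregate(entry, fetch):
    """Collect all page URLs reachable from the sitemap at `entry`.

    `fetch` maps a URL or path to XML text; it is supplied by the caller.
    """
    kind, locs = extract_locs(fetch(entry))
    if kind == "urlset":
        return locs
    urls = []
    for child in locs:  # each entry of an index is itself a sitemap
        urls.extend(aggregate(child, fetch))
    return urls
```

Passing `fetch` in from the outside keeps the sketch independent of how the XML is obtained (local file, HTTP request, test fixture).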