Duplicate datasets on catalog #3919
Comments
Even though we utilized #3918, there are still duplicate datasets from the old dump. Plugging through those now, starting with DOI... https://gsa-tts.slack.com/archives/C2N85536E/p1660231323964859
Kicked this off again yesterday after being out for a week; we are about 35-40% done (but errors occur often).
Leaving this here: the script got stuck on 42 datasets that existed in Solr but not in the DB. CKAN can't update datasets that don't exist in the DB, and there's no way via the CKAN API to manage these datasets other than finding them via search. They had to be removed via the CKAN CLI.
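For context, a minimal sketch of how such Solr-only orphans can be detected over the public API (the base URL, pagination, and the 404 check are assumptions of this sketch; the actual removal was done with the CKAN CLI):

```python
# Sketch: find datasets that the search index (Solr) returns but that
# 404 against the database. Illustrative only.
import requests

BASE = "https://catalog.data.gov/api/3/action"

def solr_orphans(rows=100):
    """Yield dataset names that package_search returns but package_show can't load."""
    start = 0
    while True:
        results = requests.get(
            f"{BASE}/package_search",
            params={"q": "*:*", "rows": rows, "start": start},
        ).json()["result"]["results"]
        if not results:
            break
        for pkg in results:
            # package_show reads from the database; a 404 here for a
            # package that search just returned means it exists only in Solr.
            r = requests.get(f"{BASE}/package_show", params={"id": pkg["name"]})
            if r.status_code == 404:
                yield pkg["name"]
        start += rows

for name in solr_orphans():
    print(name)
```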
The script continued to run last night, and it crashed on a 404 error (similar to the other 42 errors). Before it restarted, it had 1,500 left; after the restart, it had 11K. I confirmed that DOI had a harvest run yesterday afternoon, and duplicates were created.
Through log analysis, I can confirm that the gather process is creating a new harvest object for datasets that already exist on the website. Consider a single identifier as an example: the copies have the same harvest source information, the same unique identifier, the same title, and the same source hash, but each has a unique name and a unique harvest object. The logs show that the existing-dataset logic in the gather code isn't taking effect.
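Conceptually, the check that should prevent this is of the following shape. This is a hypothetical sketch with illustrative names only, not the actual ckanext-datajson code:

```python
# Hypothetical sketch of a gather-stage de-duplication check.
# All names are illustrative, not the real ckanext-datajson code.
import hashlib
import json

def hash_of(dataset):
    # Stand-in for the source-hash computation.
    return hashlib.sha1(json.dumps(dataset, sort_keys=True).encode()).hexdigest()

def gather_stage(remote_datasets, existing_packages):
    """remote_datasets: parsed data.json entries.
    existing_packages: packages already in the DB for this source, keyed by identifier."""
    new_objects = []
    for dataset in remote_datasets:
        identifier = dataset["identifier"]
        existing = existing_packages.get(identifier)
        if existing and existing["source_hash"] == hash_of(dataset):
            continue  # unchanged: no harvest object needed
        # If this lookup misses for a dataset that *is* in the DB (say the
        # identifier map was built from a stale or partial query), the object
        # is created without a package_id and a second package is born with
        # a new name -- the duplication described above.
        new_objects.append({
            "identifier": identifier,
            "package_id": existing["id"] if existing else None,
        })
    return new_objects
```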
We are currently harvesting DOI on dev and locally to see whether the problem is reproducible. We do test the re-harvest logic, both at the ckanext-datajson extension level and at the catalog.data.gov level, and all tests currently pass with no duplication.
Locally, I was able to harvest DOI (a bigger source would probably choke my machine). Somehow, on the first harvest, it duplicated all the datasets (minus a few). However, the harvest source only reports harvesting a single set of records, 28,667, because that's how many harvest objects the gather process created to be harvested. I ran this using multiple catalog-fetch commands. I did see one error message pop up a few times; still not sure if that's meaningful. I'm going to clear the harvest and restart with just one job, and see if the duplicates persist. If not, I'm going to recommend that we move to a single harvest source for now. When I reharvest, no new duplicates are created; just the normal additions, edits, etc. 🤷
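One hedged hypothesis for how parallel catalog-fetch workers could duplicate on a first harvest is a check-then-create race: each worker checks for an existing package before either has committed one. A toy illustration of the failure mode (not CKAN code):

```python
# Toy demonstration of a check-then-create race between parallel
# fetch workers. Not CKAN code; it only illustrates the failure mode.
import threading
import time

packages = []  # stands in for the package table

def fetch_worker(identifier):
    exists = identifier in packages   # check...
    time.sleep(0.1)                   # ...window between check and create...
    if not exists:
        packages.append(identifier)   # ...create

workers = [threading.Thread(target=fetch_worker, args=("doi-123",)) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(packages)  # ['doi-123', 'doi-123']: both workers passed the check
```

Running a single fetch job closes the window, which would be consistent with the reharvest result above; a uniqueness constraint on (source, identifier) would close it at the database level.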
Was able to repurpose much of the datagov-dedupe code to add functionality for reporting duplicates across organizations, plus a GitHub Action to run the check on demand. See the run for the current summary: https://github.com/GSA/datagov-dedupe/actions/runs/2952078797
Development harvested cleanly, and re-harvested fine without duplicates. I've also discovered that the harvest-run job (which determines whether a harvest job is complete) actually re-queues items that hard-failed and are in an incomplete state, retrying each up to 5 times. This is why DOI takes so long to complete: there are generally 20+ hard failures, and if each takes 1-20 minutes to recover and restart, and each is retried 5 times, that's up to roughly 20 × 5 × 20 minutes, about 33 hours, of retries alone. It takes FOREVER.
@jbrown-xentity we are restarting gather every 30 mins. https://github.com/GSA/catalog.data.gov/blob/main/.github/workflows/restart.yml#L76 |
After clearing and reharvesting, about 200 datasets were still duplicated; not on the level of complete duplication at this point. A re-harvest should occur on Wednesday, and we'll know more after that. Link to check the number of duplicates: https://catalog.data.gov/api/3/action/package_search?fq=organization:doi-gov%20AND%20type:dataset&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0
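That link can also be scripted for monitoring. A small sketch using the same query parameters as the URL above:

```python
# Count duplicated identifiers for the DOI org via package_search facets,
# mirroring the link above: facet on "identifier" with mincount=2.
import requests

resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={
        "fq": "organization:doi-gov AND type:dataset",
        "facet.field": '["identifier"]',
        "facet.limit": -1,
        "facet.mincount": 2,
        "rows": 0,
    },
).json()

dupes = resp["result"]["search_facets"]["identifier"]["items"]
print(f"{len(dupes)} identifiers appear more than once")
for item in dupes:
    print(item["name"], item["count"])
```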
An initial analysis of the duplicate list for DOI is interesting; we already see multiple duplicate types. Note that there may be other types of duplicates, but finding them may require a more detailed analysis.
We will re-evaluate the duplicate count and items after a re-harvest, to determine whether these situations are reproducible and/or whether different duplicate types arise once the initial data is already in place.
The duplicate count is unchanged at 163. Attached is the list of duplicate IDs for DOI. We will consider this research complete. I will make 2 tickets, one for each duplicate type we have found, to investigate further how it may be occurring.
Catalog created a bunch of duplicate datasets via harvest. Needs to be corrected.
How to reproduce
Expected behavior
No duplicates
Actual behavior
Lots of duplicates
Sketch
Update code so no more duplicates occur: GSA/ckanext-datajson#120
Now we need to run de-dupe on all organizations.
Tested and confirmed on GSA org.
Can step through other organizations that have seen major changes from here: https://catalog.data.gov/api/action/package_search?q=metadata_modified:[2022-08-04T00:00:00Z+TO+NOW]&sort=metadata_modified%20desc&facet.field=[%22organization%22]
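A sketch of pulling that organization list programmatically, using the same facet query as the link (the date is the one from the URL above):

```python
# List organizations with datasets modified since 2022-08-04, using the
# facet query linked above, sorted by how many datasets changed.
import requests

BASE = "https://catalog.data.gov/api/action"

orgs = requests.get(
    f"{BASE}/package_search",
    params={
        "q": "metadata_modified:[2022-08-04T00:00:00Z TO NOW]",
        "facet.field": '["organization"]',
        "facet.limit": -1,
        "rows": 0,
    },
).json()["result"]["search_facets"]["organization"]["items"]

for org in sorted(orgs, key=lambda o: -o["count"]):
    print(org["name"], org["count"])
```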
Eventually want to see no duplicates across the platform.