
Duplicate datasets on catalog #3919

Closed · jbrown-xentity opened this issue Aug 9, 2022 · 13 comments
Labels: bug (Software defect or bug), harvest-duplicates (Issues related to Duplicated Datasets)

@jbrown-xentity (Contributor)

Catalog created a large number of duplicate datasets via harvest. This needs to be corrected.

How to reproduce

  1. Check for dupes: https://github.com/GSA/datagov-dedupe

Expected behavior

No duplicates

Actual behavior

Lots of duplicates

Sketch

Update code so no more duplicates occur: GSA/ckanext-datajson#120

Now we need to run de-dupe on all organizations.

Tested and confirmed on GSA org.

We can step through other organizations that have seen major changes from here: https://catalog.data.gov/api/action/package_search?q=metadata_modified:[2022-08-04T00:00:00Z+TO+NOW]&sort=metadata_modified%20desc&facet.field=[%22organization%22] (see the sketch below).
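
For convenience, here is a minimal Python sketch of that facet query, assuming the public API and the `requests` library (not part of the original tooling):

```python
# Sketch: list organizations with recently modified datasets, using the
# same package_search facet query linked above.
import requests

result = requests.get(
    "https://catalog.data.gov/api/action/package_search",
    params={
        "q": "metadata_modified:[2022-08-04T00:00:00Z TO NOW]",
        "sort": "metadata_modified desc",
        "facet.field": '["organization"]',
        "rows": 0,
    },
).json()["result"]

# "facets" maps each organization to its count of recently changed datasets
for org, count in sorted(result["facets"]["organization"].items(),
                         key=lambda kv: -kv[1]):
    print(f"{count:6d}  {org}")
```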

Eventually we want to see no duplicates across the platform.

@jbrown-xentity (Contributor Author)

Even though we utilized #3918, there are still duplicate datasets from the old dump. Working through those now, starting with DOI: https://gsa-tts.slack.com/archives/C2N85536E/p1660231323964859

@jbrown-xentity (Contributor Author)

Kicked this off again yesterday after being out for a week; we are about 35-40% done (but errors occur often).

@jbrown-xentity (Contributor Author)

Leaving this here: the script got stuck on 42 datasets that existed in Solr but not in the DB. CKAN can't update datasets that don't exist in the DB, and there is no way via the CKAN API to manage these datasets other than finding them via search. They had to be removed via the CKAN CLI, using the search-index clear functionality (which is dangerous: if no dataset is specified, it can clear the whole index). Now that these few are gone, the de-dupe process is clearing ~8 datasets per minute, and the DOI de-dupe should finish by the end of today.
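
As an illustration (a sketch, not the actual script), here is one way to spot those index-only ghosts from the outside: `package_search` is served from Solr, while `package_show` reads the database, so an ID that appears in search but 404s on show exists only in the index:

```python
# Sketch: flag datasets present in the Solr index (package_search) but
# missing from the database (package_show returns 404).
import requests

BASE = "https://catalog.data.gov/api/3/action"

def solr_only_dataset_ids(fq="organization:doi-gov", rows=100):
    """Yield ids of datasets indexed in Solr whose DB record is gone."""
    start = 0
    while True:
        found = requests.get(f"{BASE}/package_search", params={
            "fq": fq, "rows": rows, "start": start,
        }).json()["result"]["results"]
        if not found:
            break
        for pkg in found:
            show = requests.get(f"{BASE}/package_show",
                                params={"id": pkg["id"]})
            if show.status_code == 404:
                # Index-only ghost; removable via the CKAN CLI, e.g.
                # `ckan search-index clear <dataset-id>` (omitting the id
                # clears the entire index, hence the danger noted above).
                yield pkg["id"]
        start += rows

for ghost in solr_only_dataset_ids():
    print(ghost)
```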

@jbrown-xentity (Contributor Author)

The script continued to run last night and crashed on a 404 error (similar to the other 42 errors). Before it restarted, it had 1,500 datasets left; after the restart, it had 11K. I confirmed that DOI had a harvest run yesterday afternoon and that duplicates were created.
We already have a test case for this, and the tests pass. That means something is unusual about how these datasets are stored; somehow the duplicate issue is recurring. I don't know whether it's a special case related to DOI data, or whether the fetch jobs are stepping on each other (see the race sketch below) and we need to reconsider GSA/ckanext-datajson#94. I think we need to create an investigation ticket for this, to discover how widespread the problem is and to design a repro case so we can find where the bug occurs.
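
To make the "fetch jobs stepping on each other" hypothesis concrete, here is an illustrative race sketch (toy code, not the harvester): two workers that each run a non-atomic "does this identifier exist yet?" check can both answer no and both import the record:

```python
# Toy demonstration of a check-then-create race between two fetch workers.
import threading
import time

packages = []  # stand-in for the CKAN package table

def fetch_worker(identifier):
    exists = identifier in packages   # check...
    time.sleep(0.01)                  # simulated import latency
    if not exists:
        packages.append(identifier)   # ...then create

workers = [threading.Thread(target=fetch_worker, args=("example-identifier",))
           for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(packages)  # typically prints the same identifier twice
```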

@jbrown-xentity (Contributor Author)

Through log analysis, I can confirm that the gather process is creating a new harvest object for datasets that already exist on the website. Consider identifier 02b6c78d-945b-4517-b685-9060f0bf0e31:

They have the same harvest source information, the same unique identifier, the same title, and the same source hash, but a unique name and a unique harvest object.

The logs show that the existing-dataset logic isn't working here:

```
2022-08-26 08:05:16,108 INFO  [ckanext.datajson.datajson_ckan_28] Check existing dataset: 02b6c78d-945b-4517-b685-9060f0bf0e31
2022-08-26 08:05:16,109 INFO  [ckanext.datajson.datajson_ckan_28] Datajson creates a HO: 02b6c78d-945b-4517-b685-9060f0bf0e31
```

We are currently harvesting DOI on dev and locally to see whether the problem is reproducible. We do attempt to test the re-harvest logic, both at the ckanext-datajson extension level and at the catalog.data.gov level, and all tests currently pass with no duplication.
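
For orientation only, here is a rough sketch of what the gather-stage check around those two log lines presumably intends. The function and variable names (`gather_dataset`, `make_harvest_object`, `existing_by_identifier`) are invented for illustration; this is not the real ckanext-datajson code:

```python
# Hypothetical outline of the gather-stage dedupe check. The logs above
# suggest the harvester logged the existing-dataset check and then still
# created a fresh harvest object instead of linking the existing package.
def make_harvest_object(identifier, package_id):
    """Stub standing in for ckanext-harvest's HarvestObject creation."""
    return {"identifier": identifier, "package_id": package_id}

def gather_dataset(identifier, existing_by_identifier, log):
    log.info("Check existing dataset: %s", identifier)
    existing = existing_by_identifier.get(identifier)
    if existing is None:
        # Genuinely new dataset: create a harvest object with no package.
        log.info("Datajson creates a HO: %s", identifier)
        return make_harvest_object(identifier, package_id=None)
    # Already harvested: the harvest object should reference the existing
    # package so the import stage updates it rather than creating a twin.
    return make_harvest_object(identifier, package_id=existing["id"])
```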

@jbrown-xentity (Contributor Author)

Locally, I was able to harvest DOI (I think if I tried a bigger source locally, my machine would choke). Somehow, on the first harvest, it duplicated all the datasets (minus a few). However, the harvest source only reports harvesting a single set of records, 28,667, because that is how many harvest objects the gather process created to be harvested. I ran this using multiple catalog-fetch commands. I did see this error message pop up a few times; I'm still not sure whether that's meaningful.

I'm going to clear the harvest and restart with just one job, and see if the duplicates persist. If not, I'm going to recommend that we move to a single harvest source for now.

When I re-harvest, no new duplicates are created. Just the normal additions, edits, etc. 🤷

@jbrown-xentity (Contributor Author)

I was able to repurpose much of the datagov-dedupe code to add functionality that reports duplicates across organizations, and to run a GitHub Action that checks this on-demand. See https://github.com/GSA/datagov-dedupe/actions/runs/2952078797. A sketch of the underlying query follows the summary below.

Current summary:

  • 33 organizations have duplicates
  • 141,085 duplicate records exist (out of 369,687 total), or 38% of all records
  • 113K of the duplicate records come from just DOI and NOAA. DOI uses the DCAT-US harvester, while NOAA uses ISO WAF; these share almost no code in the gather and fetch processes (deciding what should be imported, and actually importing it)
  • DOI has more duplicates (56,435) than regular records (29,629)
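
A rough sketch of the per-organization duplicate check behind that report, assuming the identifier-facet query linked later in this thread (with `requests` standing in for the real tooling):

```python
# Sketch: count duplicate records in one organization by faceting on
# `identifier` with mincount=2 (each facet bucket is an identifier
# shared by two or more datasets).
import requests

BASE = "https://catalog.data.gov/api/3/action/package_search"

def duplicate_count(org):
    result = requests.get(BASE, params={
        "fq": f"organization:{org} AND type:dataset",
        "facet.field": '["identifier"]',
        "facet.limit": -1,
        "facet.mincount": 2,
        "rows": 0,
    }).json()["result"]
    counts = result["facets"]["identifier"]
    # An identifier with n copies contributes n - 1 surplus records.
    return sum(n - 1 for n in counts.values())

print(duplicate_count("doi-gov"))
```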

@jbrown-xentity (Contributor Author)

Development harvested cleanly, and re-harvested fine without duplicates.
Investigating DOI specifically and its logs, I was able to find that the gather process ran twice for the same job, once at 2022-08-25T19:58:00 and again at 2022-08-26T07:57:03. The logs seem to show catalog-gather restarting often, as much as every half hour. This is suspicious, as we are not regularly restarting it. I see exit statuses of 0, 1, 143, and 137, but there may be others. There are 7 exits/crashes between those two gather log statements, most of which record multiple exit codes at nearly the same time (the first example has records of 143, then 0, then 137, all within 1 second of each other).
Code 137 is 128 + SIGKILL, which typically indicates the process was killed for running out of memory (143 is 128 + SIGTERM, an ordinary termination request); it's possible we need to bump from 3G to 4G to make it more stable.

I've also discovered that the harvest-run job (which determines whether a harvest job is complete) actually re-requests items that hard-failed and are in an incomplete state, retrying them up to 5 times. This is why DOI takes so long to complete: there are generally around 20+ hard failures, and if each takes 1-20 minutes to recover and restart, five rounds of retries add anywhere from under 2 hours to over 30 hours.

@FuhuXia (Member) commented Aug 30, 2022

@jbrown-xentity (Contributor Author)

After clearing and re-harvesting, about 200 datasets were still duplicated; not complete duplication at this point. A re-harvest should occur on Wednesday, and we'll know more after that. Link to check the number of duplicates: https://catalog.data.gov/api/3/action/package_search?fq=organization:doi-gov%20AND%20type:dataset&facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0

@jbrown-xentity (Contributor Author)

An initial analysis of the duplicate list for DOI is interesting. We have already found multiple types:

Please note that there may be other types of duplicates; finding them may require a more detailed analysis.

@jbrown-xentity (Contributor Author)

We will re-evaluate the duplicate count and items after a re-harvest, to determine whether these situations are replicable and/or whether different duplicate types arise after the initial data is already there.

@jbrown-xentity (Contributor Author)

The duplicate count is the same, at 163. Attached is the list of duplicate IDs for DOI.

We will consider this research complete. I will make 2 tickets, 1 for each duplicate type we have found, to investigate further how it may be occurring.

doi-duplicates.txt

@jbrown-xentity jbrown-xentity moved this from In Progress [8] to Done in data.gov team board Sep 7, 2022
@btylerburton btylerburton added the harvest-duplicates Issues related to Duplicated Datasets label Dec 21, 2023
@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Dec 26, 2023