duplicate dataset comes back after dedupe #5016

Open
FuhuXia opened this issue Dec 16, 2024 · 2 comments
Labels: bug (Software defect or bug), O&M (Operations and maintenance tasks for the Data.gov platform)

Comments

@FuhuXia
Member

FuhuXia commented Dec 16, 2024

After running dedupe, a duplicate of https://catalog.data.gov/dataset/national-settlement-service-data comes back after the next harvest job.

How to reproduce

  1. Harvest https://catalog.data.gov/harvest/federal-reserve
  2. Run the dedupe script on the org https://catalog.data.gov/organization/board-of-governors-of-the-federal-reserve-system
  3. Reharvest the same source

Expected behavior

The number of datasets should not change.

Actual behavior

One duplicate is created for the dataset https://catalog.data.gov/dataset/national-settlement-service-data
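
One way to check the expected behavior is to snapshot the organization's dataset names before the dedupe/reharvest cycle and again after it, then compare the two. A minimal sketch, assuming the name lists were collected separately (for example through the CKAN API); `diff_snapshots` is a hypothetical helper, not part of any Data.gov tooling:

```python
def diff_snapshots(before, after):
    """Compare two lists of dataset names taken before and after a reharvest.

    Hypothetical helper for illustration only.
    Returns the names that appeared and the names that disappeared.
    """
    before_set, after_set = set(before), set(after)
    return {
        "added": sorted(after_set - before_set),
        "removed": sorted(before_set - after_set),
    }
```

If the reharvest behaves as expected, both lists come back empty. A non-empty `added` list after step 3 would point at the recreated duplicate.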

Sketch

Three possible approaches to fixing the issue:

  1. Clear the harvest source, then reharvest. This loses the tracking stats for all datasets in this source.
  2. Examine the state of the affected dataset in the DB and Solr to figure out why the duplicate occurs. It could be a new bug in ckanext-datajson.
  3. It could be a bug in the dedupe process, where an edge case is not handled well.
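
For approach 2, a first look at the indexed state could go through the CKAN action API rather than the database. The sketch below pages through `package_search` for the organization and groups dataset names that share the same key. It assumes each harvested dataset carries an `identifier` entry in its `extras` list and falls back to the title otherwise; that assumption, and both helper names, are illustrative and not confirmed by this thread.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

CKAN_URL = "https://catalog.data.gov"
ORG = "board-of-governors-of-the-federal-reserve-system"


def fetch_org_packages(org, base_url=CKAN_URL, page_size=100):
    """Page through package_search for one organization (makes network calls)."""
    packages, start = [], 0
    while True:
        query = urllib.parse.urlencode(
            {"fq": f"organization:{org}", "rows": page_size, "start": start}
        )
        url = f"{base_url}/api/3/action/package_search?{query}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            result = json.load(resp)["result"]
        packages.extend(result["results"])
        start += page_size
        if not result["results"] or start >= result["count"]:
            return packages


def group_duplicates(packages):
    """Group dataset names by identifier (assumed extras key), else by title.

    Only groups holding more than one dataset name are returned.
    """
    groups = defaultdict(list)
    for pkg in packages:
        extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
        key = extras.get("identifier") or pkg.get("title")
        groups[key].append(pkg["name"])
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

Calling `group_duplicates(fetch_org_packages(ORG))` after a reharvest would list any names that share a key. Comparing the full records of such a pair may help narrow down whether the harvester or the dedupe step is at fault.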
@FuhuXia FuhuXia added the bug Software defect or bug label Dec 16, 2024
@FuhuXia FuhuXia added the O&M Operations and maintenance tasks for the Data.gov platform label Dec 18, 2024
@FuhuXia FuhuXia moved this to 🏗 In Progress [8] in data.gov team board Dec 18, 2024
@FuhuXia FuhuXia self-assigned this Dec 18, 2024
@FuhuXia
Member Author

FuhuXia commented Dec 18, 2024

Cleared the harvest source and reharvested. The issue is back, but the duplicate is now a different dataset:
international-summary-statistics
international-summary-statistics-2ddc8

Trying to replicate it in other environments.

@FuhuXia
Member Author

FuhuXia commented Dec 18, 2024

Could not replicate the issue on develop or staging.
