CDC data duplicated on U.S. Department of Health & Human Services #4073

Closed
jbrown-xentity opened this issue Nov 21, 2022 · 9 comments
Labels: bug (Software defect or bug), harvest-duplicates (Issues related to Duplicated Datasets)

Comments

@jbrown-xentity
Contributor

The total list of site-wide duplicates contains 804 records; all but 10 are CDC records.
CDC has their own organization: https://catalog.data.gov/organization/centers-for-disease-control-and-prevention
However, it looks like all the CDC data is replicated in the U.S. Department of Health & Human Services organization: https://catalog.data.gov/organization/hhs-gov
In fact, it looks like the U.S. Department of Health & Human Services has more records: https://catalog.data.gov/dataset/?q=%22https%3A%2F%2Fdata.cdc.gov%2Fapi%2Fviews%2F%22&sort=views_recent+desc&ext_location=&ext_bbox=&ext_prev_extent=-150.46875%2C-80.17871349622823%2C151.875%2C80.17871349622823

Expected behavior

Department harvest sources are cleaned so that they don't produce duplicates.

Actual behavior

Multiple harvest sources (one from the agency, one from the department) are causing duplicates.
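For illustration, the kind of duplication described above could be detected with a sketch like the following. It assumes each harvested record has already been reduced to a dict with an `identifier` and a `harvest_source` key; those key names, the function name, and the sample records are hypothetical and are not taken from the catalog's code or data.

```python
from collections import defaultdict


def find_cross_source_duplicates(records):
    """Map each identifier seen in more than one harvest source to those sources."""
    sources_by_id = defaultdict(set)
    for rec in records:
        sources_by_id[rec["identifier"]].add(rec["harvest_source"])
    # Keep only identifiers that were harvested from two or more sources.
    return {
        ident: sorted(sources)
        for ident, sources in sources_by_id.items()
        if len(sources) > 1
    }


# Made-up sample records, not real catalog data.
sample = [
    {"identifier": "abc-123", "harvest_source": "cdc-data-json"},
    {"identifier": "abc-123", "harvest_source": "hhs-data-json"},
    {"identifier": "xyz-789", "harvest_source": "hhs-data-json"},
]
duplicates = find_cross_source_duplicates(sample)
print(duplicates)
```

Grouping by identifier only works if both sources publish the same identifier for the same dataset, which would need to be confirmed against the real feeds.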

Sketch

Validate with CDC that the Department of Health is getting their updated data, then remove the CDC harvest source.
https://catalog.data.gov/harvest/https-data-cdc-gov-data-json
Both harvest sources are being harvested daily...
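One way to approach the validation step is to compare the two data.json catalogs before removing anything. The sketch below assumes both files follow the DCAT-US (Project Open Data) layout, with a top-level `dataset` array whose entries carry an `identifier`, and that both catalogs use the same identifier for the same dataset. The function names and sample catalogs are hypothetical.

```python
def dataset_identifiers(catalog):
    """Collect the identifiers from a parsed data.json catalog."""
    return {d["identifier"] for d in catalog.get("dataset", []) if "identifier" in d}


def missing_from_department(agency_catalog, department_catalog):
    """Return agency identifiers that the department catalog does not contain."""
    agency_ids = dataset_identifiers(agency_catalog)
    department_ids = dataset_identifiers(department_catalog)
    return sorted(agency_ids - department_ids)


# Made-up catalogs standing in for the parsed agency and department data.json files.
cdc = {"dataset": [{"identifier": "a"}, {"identifier": "b"}]}
hhs = {"dataset": [{"identifier": "a"}, {"identifier": "b"}, {"identifier": "c"}]}

missing = missing_from_department(cdc, hhs)
print(missing)  # an empty list means the department catalog covers the agency catalog
```

An empty result would suggest the agency source is redundant; a non-empty one lists the datasets that would be lost by deleting it.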

@hkdctol
Contributor

hkdctol commented Nov 22, 2022

Reaching out to HHS to track down some current contacts to work on this.

@hkdctol
Contributor

hkdctol commented Dec 1, 2022

Pinged HHS again; moving to blocked for now

@hkdctol hkdctol moved this from 🏗 In Progress [8] to 📡 Blocked in data.gov team board Dec 1, 2022
@hkdctol
Contributor

hkdctol commented Dec 7, 2022

Still blocked but HHS confirms they are investigating the issue

@hkdctol
Contributor

hkdctol commented Dec 23, 2022

Pinging HHS again to confirm that we can delete CDC because the HHS data.json covers it.

@hkdctol
Contributor

hkdctol commented Dec 23, 2022

HHS confirms that the CDC harvest source and organization can be deleted, but a harvest job is running right now, so I will delete them next week.

@hkdctol
Contributor

hkdctol commented Dec 27, 2022

I tried to clear CDC data.json as a harvest source, but got an error:

[Image: screenshot of the error]

@hkdctol hkdctol moved this from 📡 Blocked to 🏗 In Progress [8] in data.gov team board Dec 27, 2022
@FuhuXia
Member

FuhuXia commented Dec 27, 2022

The error is from a web server timeout. The clearing process finished regardless. I see CDC has 0 datasets.
https://catalog.data.gov/organization/centers-for-disease-control-and-prevention
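Since the web UI timed out, the dataset count could also be checked through the CKAN action API rather than the organization page. This is a minimal sketch that assumes the catalog exposes the standard CKAN `package_search` action; the helper names and the sample response are hypothetical, and no request is actually sent here.

```python
from urllib.parse import urlencode

CATALOG = "https://catalog.data.gov"


def org_count_url(org_slug, base=CATALOG):
    """Build a package_search URL that returns only the count for one organization."""
    query = urlencode({"fq": f"organization:{org_slug}", "rows": 0})
    return f"{base}/api/3/action/package_search?{query}"


def dataset_count(response):
    """Extract the dataset count from a parsed package_search response."""
    if not response.get("success"):
        raise ValueError("CKAN reported a failed request")
    return response["result"]["count"]


url = org_count_url("centers-for-disease-control-and-prevention")

# Made-up response in the shape CKAN returns for package_search.
sample_response = {"success": True, "result": {"count": 0, "results": []}}
count = dataset_count(sample_response)
print(url)
print(count)
```

Setting `rows` to 0 keeps the response small, since only the count matters for confirming that the clear completed.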

@FuhuXia
Member

FuhuXia commented Dec 27, 2022

CDC harvest source is cleared and deleted.

@hkdctol
Contributor

hkdctol commented Dec 27, 2022

Organization deleted.

@hkdctol hkdctol moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Dec 27, 2022
@btylerburton btylerburton added the harvest-duplicates Issues related to Duplicated Datasets label Dec 21, 2023
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Dec 21, 2023