Collections
An email to VA explaining collections, July 15, 2019:
I believe everything is in order, despite the odd counts. I've checked the counts against your data.json and they tally up. The confusion comes from a quirky feature called Collections. Currently, I see 3793 datasets in the VA data.json. In catalog, I see 1527 non-collection members and 2266 collection members, totaling 3793. I'll explain:
catalog.data.gov groups some datasets together into what we call Collections. When we group datasets, the total count appears lower, because we essentially count all the datasets in a collection as a single dataset. This is a historical decision to avoid inflating the dataset counts on catalog with datasets that are very similar.
VA can specify collections in their data.json. We read the isPartOf field on each dataset. The identifier specified with isPartOf is what we call the parent dataset. The parent dataset is what appears on the catalog (this is why you see only 1527 datasets in the VA organization). The collection members are the datasets that have an isPartOf attribute and are accessible from the parent dataset in the catalog. So datasets are either collection members (having isPartOf) or non-collection members (not having an isPartOf).
Here's an example of a collection: VA Veterans Health Administration Access Data, which appears as a single dataset.
You can view all 197 collection members from here: https://catalog.data.gov/dataset?collection_package_id=f02ac089-2b1f-47e8-9b1d-71317c488724
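The split between collection members and non-collection members described above can be sketched in a few lines of Python. This is only an illustration, not the harvester's actual code: it assumes the data.json follows the usual layout with a top-level "dataset" array, and the function name and sample identifiers are made up for the example.

```python
from collections import Counter
from typing import Any


def split_by_collection(catalog: dict[str, Any]) -> tuple[list[dict], list[dict]]:
    """Partition data.json datasets by whether they carry an isPartOf value.

    Hypothetical helper for illustration; assumes a top-level "dataset" list.
    """
    members, non_members = [], []
    for dataset in catalog.get("dataset", []):
        if dataset.get("isPartOf"):
            members.append(dataset)      # collection member (points at a parent)
        else:
            non_members.append(dataset)  # shown on the catalog as its own entry
    return members, non_members


# Made-up sample: one parent with two members, plus one standalone dataset.
sample_catalog = {
    "dataset": [
        {"identifier": "parent-1", "title": "Parent"},
        {"identifier": "child-1", "title": "Child 1", "isPartOf": "parent-1"},
        {"identifier": "child-2", "title": "Child 2", "isPartOf": "parent-1"},
        {"identifier": "standalone-1", "title": "Standalone"},
    ]
}

members, non_members = split_by_collection(sample_catalog)

# Number of members per parent identifier.
members_per_parent = Counter(d["isPartOf"] for d in members)

print(len(members), len(non_members), len(sample_catalog["dataset"]))
print(dict(members_per_parent))
```

Run against a real data.json, the two list lengths should add up to the total number of datasets, which is the same check as 1527 + 2266 = 3793 in the email above.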
Collections have a special visual marker in the catalog (screenshot: Screenshot from 2019-07-15 16-39-24.png).
Unfortunately, it's not easy to get all the collection member counts through the UI. I actually got the counts above through the catalog API and documented the technical details. I'm happy to follow up on any questions; I know this feature isn't very intuitive.
You can get an organization's entire list of datasets that belong to a collection via the API: https://catalog.data.gov/api/action/package_search?fq=collection_package_id:*%20AND%20organization:gsa-gov&rows=1000
This works across harvest types. For example, USGS datasets (which are harvested via geospatial metadata and WAFs) can be seen the same way: https://catalog.data.gov/api/action/package_search?fq=collection_package_id:*%20AND%20organization:usgs-gov&rows=1000
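The API queries above can also be built and run from a script. The following is a rough sketch using only the Python standard library; the function names are invented for this example, and it assumes the response has the standard CKAN action API shape (a "success" flag and a "result" object holding "count" and "results").

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoint taken from the URLs above.
PACKAGE_SEARCH_URL = "https://catalog.data.gov/api/action/package_search"


def build_collection_search_url(organization: str, rows: int = 1000) -> str:
    """Build the package_search URL for all collection members of an organization."""
    query = {
        "fq": f"collection_package_id:* AND organization:{organization}",
        "rows": rows,
    }
    # quote_via=quote encodes spaces as %20; ':' and '*' are kept readable.
    return f"{PACKAGE_SEARCH_URL}?{urlencode(query, safe=':*', quote_via=quote)}"


def fetch_collection_members(organization: str, rows: int = 1000,
                             timeout: float = 30.0) -> dict:
    """Run the query and return the 'result' object (assumed CKAN response shape)."""
    url = build_collection_search_url(organization, rows)
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    if not payload.get("success"):
        raise RuntimeError(f"package_search failed for {organization!r}")
    return payload["result"]


print(build_collection_search_url("gsa-gov"))
```

Calling fetch_collection_members("usgs-gov") performs the network request. Under the assumed response shape, result["count"] is the total number of matching collection members and result["results"] holds up to rows of them.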
Each harvester might implement collections in its own way, and not all of them do.
TODO