Integrate MDTranslator into Datagov Harvesting Logic #4565

Open
9 tasks
btylerburton opened this issue Dec 19, 2023 · 3 comments
Labels
H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 H2.0/Harvest-Transform Transform Logic for Harvesting 2.0

Comments

@btylerburton
Contributor

btylerburton commented Dec 19, 2023

User Story

In order to transform datasets from one schema to another, datagovteam would like to use the MDTranslator library, exposed through a Rails application.

As an interim step, while the DCAT-US writer is still in active development, datagovteam would like to transform an FGDC/CSDGM source into ISO, in order to validate that MDTranslator is functioning correctly.

Depends on:

Acceptance Criteria

  • GIVEN a harvest source with an FGDC CSDGM schema, stored in a WAF file format
    WHEN that source has been loaded into a variation of the Airflow ETL Pipeline, which does not include the validate and load steps
    THEN datagov-harvesting-logic will use the MDTranslator Rails application to transform the source into valid ISO 19115 format.
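
A minimal sketch of what the transform call in this criterion might look like, assuming the Rails application accepts a JSON body naming the source document, a reader, and a writer. The URL, the `file`/`reader`/`writer` payload keys, the `writerOutput` response key, and every function name here are assumptions for illustration and should be checked against the deployed MDTranslator application.

```python
import json
from urllib import request

# Hypothetical local endpoint; the real host, port, and path depend on deployment.
MDTRANSLATOR_URL = "http://localhost:3000/translates"


def build_payload(document, reader="fgdc", writer="iso19115_2"):
    """Bundle a source document with the reader and writer names (assumed keys)."""
    return {"file": document, "reader": reader, "writer": writer}


def http_post_json(url, payload):
    """POST a JSON body and decode the JSON response, standard library only."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def transform(document, post=http_post_json, url=MDTRANSLATOR_URL):
    """Send one FGDC/CSDGM document for translation and return the ISO output.

    `post` is injectable so the function can be exercised without a running
    translator service.
    """
    response = post(url, build_payload(document))
    output = response.get("writerOutput")  # assumed response field
    if not output:
        raise ValueError(f"transform produced no output: {response}")
    return output
```

Passing the HTTP call in as a parameter keeps the transform step testable in a pipeline that skips the validate and load steps.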

Background

This will require work in the datagov-harvesting-logic repo to integrate the MDTranslator Rails application, and may also require work in datagov-harvester (our Airflow / orchestration repo) to allow for deferred operations.

Support for deferred operations is not guaranteed to be part of this work; it could instead be a future enhancement that unblocks pipeline workers while the transform is processing.

Security Considerations (required)

None. All work will reside within the Cloud.gov boundary and no external routes should be necessary.

Sketch

  • Integrate the MDTranslator Rails application into a local Docker Compose container network
  • Test that local integration fully before proceeding to Cloud.gov integration
  • Implement the same locally tested solution in Cloud.gov
    • Note that this will involve the creation of new internal routes and network mappings, and it is encouraged to keep these considerations in mind while developing the local service
  • Configure a WAF Harvest source using FGDC/CSDGM standard metadata
  • Extract that WAF using datagov-harvesting-logic WAF extract task
  • Transform datasets into ISO 19115 using the MDTranslator application
  • Check for errors
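
The extract, transform, and error-check steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: it treats a WAF as an HTML directory listing whose metadata records are linked `.xml` files, and `list_waf_documents`, `run_transforms`, and the `transform` argument are hypothetical names, not the datagov-harvesting-logic API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    """Collect the href of every anchor tag in a directory listing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def list_waf_documents(index_html, base_url):
    """Return absolute URLs for the .xml records linked from a WAF index page."""
    parser = _LinkParser()
    parser.feed(index_html)
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if href.lower().endswith(".xml")
    ]


def run_transforms(documents, transform):
    """Transform each document, keeping failures separate from successes.

    `documents` maps a record URL to its body; `transform` is any callable
    that returns the translated record or raises on failure.
    """
    results, errors = {}, {}
    for url, body in documents.items():
        try:
            results[url] = transform(body)
        except Exception as exc:  # record the failure and keep going
            errors[url] = str(exc)
    return results, errors
```

Collecting errors per record, rather than stopping at the first failure, makes the final "check for errors" step a simple inspection of the returned `errors` mapping.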
@btylerburton btylerburton added the H2.0/Harvest-Transform Transform Logic for Harvesting 2.0 label Dec 19, 2023
@rshewitt rshewitt self-assigned this Dec 19, 2023
@rshewitt rshewitt moved this to 🏗 In Progress [8] in data.gov team board Dec 19, 2023
@rshewitt rshewitt moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Dec 21, 2023
@rshewitt rshewitt moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Dec 26, 2023
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Jan 4, 2024
@btylerburton btylerburton added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label Jan 10, 2024
@btylerburton btylerburton moved this from 🗄 Closed to 📔 Product Backlog in data.gov team board Feb 16, 2024
@btylerburton btylerburton moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Feb 29, 2024
@btylerburton btylerburton moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Feb 29, 2024
@btylerburton btylerburton moved this from 📔 Product Backlog to Harvester 2.0 in data.gov team board May 2, 2024
@btylerburton btylerburton moved this from H2.0 Backlog to 📥 Queue in data.gov team board Oct 10, 2024
@rshewitt
Contributor

#4940 reminded me to configure the rails app for production ( e.g. rails server -e production ). i'll look into this later.

@rshewitt
Contributor

spoke with james and tyler and we've decided to update the values for schema_type & source_type. schema_type can be "iso19115_1", "iso19115_2", "csdgm", "dcatus" and source_type can be "document", "waf".
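
One way these allowed values could be enforced is with a pair of enums, sketched below. The class and function names are hypothetical; only the string values come from the comment above.

```python
from enum import Enum


class SchemaType(str, Enum):
    ISO19115_1 = "iso19115_1"
    ISO19115_2 = "iso19115_2"
    CSDGM = "csdgm"
    DCATUS = "dcatus"


class SourceType(str, Enum):
    DOCUMENT = "document"
    WAF = "waf"


def validate_source_config(schema_type, source_type):
    """Return the matching enum members, raising ValueError on unknown values."""
    return SchemaType(schema_type), SourceType(source_type)
```

For example, `validate_source_config("csdgm", "waf")` returns `(SchemaType.CSDGM, SourceType.WAF)`, while an unrecognized value such as `"fgdc"` raises `ValueError`.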

@Bagesary

Moved to Sprint backlog to focus on non-harvester stories

Projects
Status: 📟 Sprint Backlog [7]
Development

No branches or pull requests

3 participants