Scripts for managing HEAL CDEs

This repository contains several scripts for working with HEAL CDEs. Primarily, they convert the Excel representation of these HEAL CDEs into a JSON representation based on the data model used by the NIH CDE Repository (see JSON Schemas for Data Elements and Forms), and then convert these JSON files into other formats for use in downstream tools.

How to use

Getting started

We use venv to keep the Python environment isolated; the required packages are listed in requirements.txt.

$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Generating JSON files

The script generators/excel2cde.py recursively converts Excel files in the expected format into JSON files, which are written to the output directory (output/json by default).

$ python generators/excel2cde.py [input-directory] [--output output_directory]
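
The recursive conversion described above can be pictured as a directory walk that mirrors the input tree under the output directory. The following is a minimal sketch of that idea only, not the code in generators/excel2cde.py: the function name plan_conversions, the .xlsx filter, and the choice to mirror subdirectories are all assumptions made for illustration.

```python
from pathlib import Path


def plan_conversions(input_dir, output_dir="output/json"):
    """Map every .xlsx file under input_dir to a mirrored .json path.

    Hypothetical helper: the real script may name or arrange its
    outputs differently.
    """
    input_root = Path(input_dir)
    output_root = Path(output_dir)
    plan = []
    # rglob walks the directory tree recursively; sorting keeps the order stable.
    for excel_path in sorted(input_root.rglob("*.xlsx")):
        # Skip Excel lock files such as "~$form.xlsx".
        if excel_path.name.startswith("~$"):
            continue
        relative = excel_path.relative_to(input_root)
        plan.append((excel_path, output_root / relative.with_suffix(".json")))
    return plan
```

Each returned pair is (input Excel file, planned JSON path), so a caller could print the plan as a dry run before converting anything.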

Converting JSON files to Excel templates

Excel template generation is configured through the input/cde-template-locations.yaml file. Note in particular the template variable, which should be set to the location of the XLSX template (input/cde-template.xlsx by default). You should then run:

$ python exporters/xlsx-exporter.py -c input/cde-template-locations.yaml -o output/xlsx output/json

Annotating JSON files

Annotation generally requires sending the HEAL CDE text content to an online annotation process, followed by using the Translator Node Normalization service to filter and standardize the resulting annotations. This reliance on online services introduces several possible points of failure. To mitigate this, the annotation workflow is intended to be run through a Rakefile. The Rakefile in this repo contains instructions for building the annotated KGX output into the annotated/ directory.

$ rake
$ python validators/check_annotated.py annotated
$ mv annotated annotated/year-month-day
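
One reason a build tool such as Rake can help with unreliable online services is that file-based tasks skip targets that already exist, so a run interrupted by a failing service can be restarted without redoing finished work. The sketch below illustrates that general idea in Python. It is not a translation of this repository's Rakefile: the names build_missing and annotate, the ".annotated.json" suffix, and the one-output-per-input layout are all assumptions made for illustration.

```python
from pathlib import Path


def build_missing(inputs, output_dir, annotate):
    """Run annotate(input_path) only for inputs whose output is missing.

    annotate is a caller-supplied function that returns the text to write;
    it stands in for a (hypothetical) call to an online annotation service.
    Returns (built, skipped, failed) lists of input paths.
    """
    output_root = Path(output_dir)
    output_root.mkdir(parents=True, exist_ok=True)
    built, skipped, failed = [], [], []
    for input_path in map(Path, inputs):
        target = output_root / (input_path.stem + ".annotated.json")
        if target.exists():
            # Already built in an earlier run, so do not call the service again.
            skipped.append(input_path)
            continue
        try:
            result = annotate(input_path)
        except Exception:
            # Nothing is written on failure, so a rerun retries this input.
            failed.append(input_path)
            continue
        target.write_text(result, encoding="utf-8")
        built.append(input_path)
    return built, skipped, failed
```

Calling build_missing a second time with the same arguments only retries the inputs that failed, which is the property that makes an interrupted run cheap to resume.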