This is the ingest component of the CMR system. It is responsible for collaborating with the Metadata DB and Indexer components of the CMR system to maintain the lifecycle of concepts coming into the system.
Database operations can be performed in two ways: through Leiningen commands for local development, or by running the built uberjar.
- Create the user
lein create-user
- Run the migration scripts
lein migrate
You can use lein migrate -version version to restore the database to a given version. lein migrate -version 0 will clean the database completely.
- Remove the user
lein drop-user
- Create the user
CMR_DB_URL=thin:@localhost:1521:orcl CMR_INGEST_PASSWORD=****** java -cp target/cmr-ingest-app-0.1.0-SNAPSHOT-standalone.jar cmr.db create-user
- Run db migration
CMR_DB_URL=thin:@localhost:1521:orcl CMR_INGEST_PASSWORD=****** java -cp target/cmr-ingest-app-0.1.0-SNAPSHOT-standalone.jar cmr.db migrate
You can provide additional arguments to migrate the database to a given version, as with lein migrate.
- Remove the user
CMR_DB_URL=thin:@localhost:1521:orcl CMR_INGEST_PASSWORD=****** java -cp target/cmr-ingest-app-0.1.0-SNAPSHOT-standalone.jar cmr.db drop-user
The ingest application will publish messages to a fanout exchange for the indexer application and other consumers. Other applications setup their own queues and bind to the ingest exchange.
Publishing a message can fail because queueing the message times out, RabbitMQ has surpassed its configured limit for requests, or RabbitMQ is unavailable. Ingest treats this as an internal error and returns the error to the provider. If this happens the data will still be left in Metadata DB, but it won't be indexed. It's possible, though unlikely, that this newer version of a granule could be returned from the CMR.
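The behavior above can be sketched as follows. This is an illustrative Python model, not CMR's actual Clojure code; the function and exception names are hypothetical.

```python
# Illustrative sketch: ingest persists the concept to Metadata DB first,
# then publishes an indexing message to the fanout exchange. If publishing
# fails, the concept remains stored but unindexed, and an internal error
# is returned to the provider.

class QueueUnavailable(Exception):
    """Raised when RabbitMQ times out, is over its request limit, or is down."""

def ingest_concept(concept, metadata_db, publish):
    metadata_db.append(concept)          # concept is persisted first
    try:
        publish(concept)                 # fanout to indexer and other consumers
    except QueueUnavailable as err:
        # Treated as an internal error; the concept stays in Metadata DB
        # but will not be indexed.
        return {"status": 500, "errors": [str(err)]}
    return {"status": 201}

# Usage: a publisher that always fails simulates RabbitMQ being unavailable.
db = []
def failing_publish(concept):
    raise QueueUnavailable("RabbitMQ unavailable")

result = ingest_concept({"concept-id": "G1-PROV1"}, db, failing_publish)
```

Note that the concept survives in `db` even though the request reports an error, which is why the data can later be indexed without re-ingesting it.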
- /providers
- /providers/<provider-id>
- /jobs
- POST /jobs/pause - Pauses all jobs
- POST /jobs/resume - Resumes all jobs
- GET /jobs/status - Gets pause/resume state of jobs
- POST /jobs/reindex-collection-permitted-groups - Runs the reindex collection permitted groups job.
- POST /jobs/reindex-all-collections - Runs the job to reindex all collections.
- POST /jobs/cleanup-expired-collections - Runs the job to remove expired collections.
- /caches
- /db-migrate
- /health
The providers that exist in the CMR are administered through the Ingest API. A provider consists of the following fields:
- provider-id - The alphanumeric, upper-case string identifying the provider. The maximum length of provider-id is 10 characters. See provider id.
- short-name - A unique identifier of the provider. It is similar to provider-id, but more descriptive; it allows spaces and other special characters. The maximum length of short-name is 128 characters. short-name defaults to provider-id.
- cmr-only - True or false value that indicates whether this provider ingests directly through the CMR Ingest API rather than the legacy ECHO Catalog REST Ingest API. A CMR-only provider will still have ACLs configured in ECHO and support ordering through ECHO. A CMR-only provider may even still have data in Catalog REST, but it will not be kept in sync with the CMR. cmr-only defaults to false.
- small - True or false value that indicates whether this provider has a small amount of data; its collections and granules will be ingested into the SMALL_PROV tables. small defaults to false.
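The field rules above can be sketched as a small validation helper. This is a hypothetical illustration, not CMR's API; the function name, error messages, and the exact allowed character set for provider-id (assumed here to be upper-case letters, digits, and underscore) are assumptions.

```python
# Hypothetical sketch of the provider field rules described above.
import re

def normalize_provider(fields):
    provider_id = fields["provider-id"]
    # Assumption: upper-case alphanumeric (underscore allowed), max 10 chars.
    if not re.fullmatch(r"[A-Z0-9_]{1,10}", provider_id):
        raise ValueError("provider-id must be upper-case alphanumeric, max 10 chars")
    # short-name defaults to provider-id and may be at most 128 chars.
    short_name = fields.get("short-name", provider_id)
    if len(short_name) > 128:
        raise ValueError("short-name max length is 128 characters")
    return {
        "provider-id": provider_id,
        "short-name": short_name,
        "cmr-only": fields.get("cmr-only", False),  # defaults to false
        "small": fields.get("small", False),        # defaults to false
    }
```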
The provider API only supports requests and responses in JSON.
Returns a list of the configured providers in the CMR.
curl %CMR-ENDPOINT%/providers
[{"provider-id":"PROV2","short-name":"Another Test Provider","cmr-only":true,"small":false},{"provider-id":"PROV1","short-name":"Test Provider","cmr-only":false,"small":false}]
Creates a provider in the CMR. The provider id specified should match that of a provider configured in ECHO.
curl -i -XPOST -H "Content-Type: application/json" -H "Echo-Token: XXXX" %CMR-ENDPOINT%/providers -d \
'{"provider-id": "PROV1", "short-name": "Test Provider", "cmr-only": false, "small":false}'
Updates the attributes of a provider in the CMR. The small attribute cannot be changed during update.
curl -i -XPUT -H "Content-Type: application/json" -H "Echo-Token: XXXX" %CMR-ENDPOINT%/providers/PROV1 -d \
'{"provider-id": "PROV1", "short-name": "Test Provider", "cmr-only":true, "small":false}'
Removes a provider from the CMR. Deletes all data for the provider in Metadata DB and unindexes all data in Elasticsearch.
curl -i -XDELETE -H "Echo-Token: XXXX" %CMR-ENDPOINT%/providers/PROV1
The caches of the ingest application can be queried to help debug cache issues in the system. Endpoints are provided for querying the contents of the various caches used by the application.
The following curl will return the list of caches:
curl -i %CMR-ENDPOINT%/caches
The following curl will return the keys for a specific cache:
curl -i %CMR-ENDPOINT%/caches/<cache-name>
This curl will return the value for a specific key in the named cache:
curl -i %CMR-ENDPOINT%/caches/<cache-name>/<cache-key>
This will report the current health of the application. It checks all resources and services used by the application and reports their health status in the response body in JSON format. The report includes an overall "ok?" status, an "ok?" status for each resource with a "problem" field when it is unhealthy, and nested health reports for each service's dependencies. It returns HTTP status code 200 when the application is healthy, meaning all of its interfacing resources and services are healthy, or 503 when any of them is not. It also takes a pretty parameter for pretty-printing the response.
curl -i -XGET %CMR-ENDPOINT%/health?pretty=true
Example healthy response body:
{
"oracle" : {
"ok?" : true
},
"echo" : {
"ok?" : true
},
"metadata-db" : {
"ok?" : true,
"dependencies" : {
"oracle" : {
"ok?" : true
},
"echo" : {
"ok?" : true
}
}
},
"indexer" : {
"ok?" : true,
"dependencies" : {
"elastic_search" : {
"ok?" : true
},
"echo" : {
"ok?" : true
},
"metadata-db" : {
"ok?" : true,
"dependencies" : {
"oracle" : {
"ok?" : true
},
"echo" : {
"ok?" : true
}
}
},
"index-set" : {
"ok?" : true,
"dependencies" : {
"elastic_search" : {
"ok?" : true
},
"echo" : {
"ok?" : true
}
}
}
}
}
}
Example unhealthy response body:
{
"oracle" : {
"ok?" : false,
"problem" : "Exception occurred while getting connection: oracle.ucp.UniversalConnectionPoolException: Cannot get Connection from Datasource: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection"
},
"echo" : {
"ok?" : true
},
"metadata-db" : {
"ok?" : true,
"dependencies" : {
"oracle" : {
"ok?" : true
},
"echo" : {
"ok?" : true
}
}
},
"indexer" : {
"ok?" : true,
"dependencies" : {
"elastic_search" : {
"ok?" : true
},
"echo" : {
"ok?" : true
},
"metadata-db" : {
"ok?" : true,
"dependencies" : {
"oracle" : {
"ok?" : true
},
"echo" : {
"ok?" : true
}
}
},
"index-set" : {
"ok?" : true,
"dependencies" : {
"elastic_search" : {
"ok?" : true
},
"echo" : {
"ok?" : true
}
}
}
}
}
}
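The overall status logic shown in the examples above can be sketched as follows. This is an illustrative model of how a client might interpret the report, not the ingest application's actual implementation.

```python
# Sketch: the service is healthy only when every component and every
# nested dependency in the report has "ok?" true, matching the JSON
# examples above. A healthy report maps to HTTP 200, otherwise 503.

def healthy(report):
    """Recursively check a health report of {name: {"ok?": ..., "dependencies": ...}}."""
    for component in report.values():
        if not component.get("ok?", False):
            return False
        if not healthy(component.get("dependencies", {})):
            return False
    return True

def http_status(report):
    return 200 if healthy(report) else 503
```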
Ingest has internal jobs that run. They can be run manually and controlled through the Jobs API.
curl -i -XPOST -H "Echo-Token: XXXX" %CMR-ENDPOINT%/jobs/pause
curl -i -XPOST -H "Echo-Token: XXXX" %CMR-ENDPOINT%/jobs/resume
Collections whose ACLs have changed can be reindexed by sending the following request.
curl -i -XPOST -H "Echo-Token: XXXX" %CMR-ENDPOINT%/jobs/reindex-collection-permitted-groups
Reindexes every collection in every provider.
curl -i -XPOST -H "Echo-Token: XXXX" %CMR-ENDPOINT%/jobs/reindex-all-collections
It accepts an optional parameter force_version=true. If this option is specified then Elasticsearch will be reindexed with the force version type instead of the normal external_gte (see https://www.elastic.co/guide/en/elasticsearch/reference/2.2/docs-index_.html#_version_types). This causes all data in the database to overwrite the Elasticsearch index even if there's a newer version in Elasticsearch. It can be used to fix issues where a newer revision was force deleted or, as in CMR-2673, where collections were indexed with a larger version that was then changed at the database level. There is a race condition when this is run: if a collection is ingested during indexing, the reindexing could overwrite that data in Elasticsearch with an older revision of the collection. The race condition can be corrected by running reindex all collections again without force_version=true, which will index any revisions with larger transaction ids over top of the older data.
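The difference between the two version types can be modeled as follows. This is an illustrative sketch of the semantics, not Elasticsearch's implementation: external_gte rejects writes whose version is lower than the stored one, while force always overwrites, which is why force can resurrect older revisions during the race described above.

```python
# Illustrative model of Elasticsearch version types used above.
# external_gte: accept a write only if its version is >= the stored version.
# force: always accept the write, regardless of version.

def index_document(index, doc_id, doc, version, version_type="external_gte"):
    current = index.get(doc_id)
    if version_type == "external_gte" and current and version < current["version"]:
        return False  # stale revision rejected; index unchanged
    index[doc_id] = {"version": version, "doc": doc}
    return True
```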
Looks for collections that have a delete date in the past and removes them.
curl -i -XPOST -H "Echo-Token: XXXX" %CMR-ENDPOINT%/jobs/cleanup-expired-collections
The collection granule aggregate cache is used to cache information about all the granules within a collection that are indexed with that collection. That's currently limited to the granule temporal minimum and maximum. The cache is refreshed by a periodic job. The cache is located in the indexer but refresh scheduling is handled by Ingest so that singleton jobs can be used.
There are two kinds of cache refreshes that can be triggered. The full cache refresh will refresh the entire cache. Collections must be manually reindexed after the cache has been refreshed to get the latest data indexed.
curl -i -XPOST http://localhost:3002/jobs/trigger-full-collection-granule-aggregate-cache-refresh?token=XXXX
The partial cache refresh will look for granules ingested over the last trigger period (configurable) and expand the collection granule aggregate temporal times to cover any new data that was ingested. The collections that had changes will automatically be queued for reindexing after this runs.
curl -i -XPOST http://localhost:3002/jobs/trigger-partial-collection-granule-aggregate-cache-refresh?token=XXXX
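The extent-expansion step of the partial refresh can be sketched as follows. This is a hypothetical illustration of the behavior described above, not the actual cache code; the data shapes are assumptions.

```python
# Sketch: widen a collection's cached granule temporal extent (min, max)
# to cover newly ingested granules. The returned changed flag indicates
# whether the collection should be queued for reindexing.
from datetime import datetime

def expand_temporal_extent(extent, new_granules):
    start, end = extent
    changed = False
    for g_start, g_end in new_granules:
        if g_start < start:
            start, changed = g_start, True
        if g_end > end:
            end, changed = g_end, True
    return (start, end), changed
```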
Migrate database to the latest schema version:
curl -v -XPOST -H "Echo-Token: XXXX" http://localhost:3002/db-migrate
Migrate database to a specific schema version (e.g. 3):
curl -v -XPOST -H "Echo-Token: XXXX" http://localhost:3002/db-migrate?version=3
Copyright © 2014-2015 NASA