Some components of the bdbag software are configured via JSON-formatted configuration files.
There are two global configuration files: bdbag.json and keychain.json. Skeleton versions of these files with simple default values are automatically created in the current user's home directory the first time a bag is created or opened.
Additionally, three JSON-formatted configuration files can be passed as arguments to bdbag in order to supply input for certain bag creation and update functions. These files are known as metadata, ro metadata and remote-file-manifest configurations.
The file bdbag.json is a global configuration file that allows the user to specify a set of parameters to be used as defaults when performing various bag manipulation functions.
The format of bdbag.json is a single JSON object containing a set of JSON child objects (used as configuration sub-sections) which control various default behaviors of the software.
This is the parent object for the entire configuration.
Parameter | Description |
---|---|
bdbag_config_version | The version number of the configuration file. In general, it matches the release version number of bdbag. |
bag_config | This object contains all bag-related configuration parameters. |
fetch_config | This object contains all fetch-related configuration parameters. |
resolver_config | This object contains all implementation-specific resolver configuration parameters. |
identifier_resolvers | This is a global list of identifier "meta" resolvers. It can be overridden on a per-resolver basis via the individual configuration blocks for each resolver in resolver_config. |
This object contains all bag-related configuration parameters.
Parameter | Description |
---|---|
bag_algorithms | This is an array of strings representing the default checksum algorithms to use for bag manifests, if not otherwise specified. Valid values are "md5", "sha1", "sha256", and "sha512". |
bag_archiver | This is a string representing the default archiving format to use if not otherwise specified. Valid values are "zip", "tar", and "tgz". |
bag_metadata | This is a list of simple JSON key-value pairs that will be written as-is to bag-info.txt. |
bag_processes | This is a numeric value representing the default number of concurrent processes to use when calculating checksums. |
bagit_spec_version | The version of the bagit specification that created bags will conform to. Valid values are "0.97" or "1.0". |
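As a rough illustration of the constraints listed in the table above, the following sketch checks a bag_config object against the documented valid values. The function name check_bag_config is hypothetical and not part of bdbag, and bdbag itself may validate these values differently (or not at all).

```python
VALID_ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}
VALID_ARCHIVERS = {"zip", "tar", "tgz"}
VALID_SPEC_VERSIONS = {"0.97", "1.0"}


def check_bag_config(bag_config):
    """Return a list of problems found in a bag_config object (empty if none).

    Hypothetical helper: it only mirrors the valid values documented above.
    """
    problems = []
    for algorithm in bag_config.get("bag_algorithms", []):
        if algorithm not in VALID_ALGORITHMS:
            problems.append("unsupported checksum algorithm: %r" % (algorithm,))
    archiver = bag_config.get("bag_archiver")
    if archiver is not None and archiver not in VALID_ARCHIVERS:
        problems.append("unsupported archiver: %r" % (archiver,))
    version = bag_config.get("bagit_spec_version")
    if version is not None and version not in VALID_SPEC_VERSIONS:
        problems.append("unsupported bagit spec version: %r" % (version,))
    processes = bag_config.get("bag_processes")
    if processes is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(processes, bool) or not isinstance(processes, int) or processes < 1:
            problems.append("bag_processes must be a positive integer")
    return problems
```

An empty list means the object uses only documented values; unknown keys are deliberately ignored by this sketch.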
The fetch_config object contains a set of child objects, each keyed by the scheme of the transport protocol, that contain the transport handler configuration parameters.
There is a default set of transport handlers installed with bdbag. In addition, bdbag supports externally implemented transport handlers that can be plugged in (i.e., declared as run-time imports) via the fetch_config configuration object in the bdbag.json config file.
This requires developers to perform some integration tasks.
Developers should create a class deriving from bdbag.fetch.transports.base_transport.BaseFetchTransport and implement three required functions:
- __init__(self, config, keychain, **kwargs): The class constructor. Derived classes should first call super(<derived class name>, self).__init__(config, keychain, **kwargs), which sets the config, keychain, and kwargs variables as class member variables with the same names.
- fetch(self, url, output_path, **kwargs): This method should implement the logic required to transfer a file referenced by url to the local path referenced by output_path. The **kwargs argument is an extensible argument dictionary that the framework may populate with extra data. For example, an integer argument size may be present (if it can be found in fetch.txt for a given fetch entry), representing the expected size of the remote file in bytes.
- cleanup(self): This method should implement any transport-specific release of resources. Note that this function will be called only once per transport at the end of an entire bag fetch, and not once per file.
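The three methods described above can be sketched as follows. To keep the example runnable without bdbag installed, it defines a small stand-in for the base class; real code would derive from bdbag.fetch.transports.base_transport.BaseFetchTransport instead. The class name LocalCopyTransport, the localcopy URL scheme, the files_fetched counter, and the choice to return the output path from fetch are all assumptions made for illustration.

```python
import os
import shutil
from urllib.parse import urlparse


class BaseFetchTransport(object):
    """Stand-in for bdbag's base class (assumption: it stores its arguments)."""

    def __init__(self, config, keychain, **kwargs):
        self.config = config
        self.keychain = keychain
        self.kwargs = kwargs


class LocalCopyTransport(BaseFetchTransport):
    """Hypothetical handler that 'fetches' by copying a file on the local disk."""

    def __init__(self, config, keychain, **kwargs):
        # Call the base constructor first, as the documentation requires.
        super(LocalCopyTransport, self).__init__(config, keychain, **kwargs)
        self.files_fetched = 0

    def fetch(self, url, output_path, **kwargs):
        # Treat the path component of the URL as the source file location.
        source = urlparse(url).path
        os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)
        shutil.copyfile(source, output_path)
        # 'size' is optional extra data the framework may supply from fetch.txt.
        expected = kwargs.get("size")
        if expected is not None and os.path.getsize(output_path) != int(expected):
            raise IOError("size mismatch for %s" % url)
        self.files_fetched += 1
        return output_path

    def cleanup(self):
        # Called once per transport at the end of a bag fetch, not once per file.
        self.files_fetched = 0
```

Keeping per-fetch state on the instance and releasing it in cleanup matches the documented lifecycle, where one handler instance serves every matching URL in a bag.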
Configure the usage of the external transport via the fetch_config object of the bdbag.json configuration file. The fetch_config object is composed of child configuration objects keyed by a lowercase string value representing the URL protocol scheme that is being configured. When configuring an external handler, the following applies:
- There is a single required top-level string parameter with the key name handler, which maps to the fully-qualified name of the class implementing the required methods of the bdbag.fetch.transports.base_transport.BaseFetchTransport base class. At runtime, the bdbag fetch framework code will attempt to load this class via the importlib.import_module machinery. If loading succeeds, the class is instantiated, returned to the bdbag fetch framework code, and the instance is cached for the duration of the bag fetch operation. Subsequently, whenever a URL is encountered in fetch.txt with a protocol scheme matching that of the installed handler, that handler's fetch method will be invoked.
- There is also an optional parameter, allow_keychain, which must be present and evaluate to true in order to toggle the propagation of the bdbag keychain into the handler code during the __init__ call. If the allow_keychain parameter is missing or set to any value that cannot be evaluated as a Python boolean True, then the value of the keychain variable passed to the __init__ call will be None. In general, if the custom handler code has its own mechanism for managing credentials, then this parameter may be omitted. If the handler intends to make use of the bdbag keychain that is currently in context for the current user and fetch operation, then this parameter must be present and evaluate to True.
- The remainder of the protocol scheme handler configuration object can consist of any valid JSON; the entire object value assigned to the scheme key will be passed as the config parameter to the __init__ method of the custom handler.
For example, given the following fetch_config
section:
{
"fetch_config": {
"s3": {
"handler":"my.custom.S3Transport",
"max_read_retries": 5,
"read_chunk_size": 10485760,
"read_timeout_seconds": 120
},
"foo": {
"handler":"my.custom.FooTransport",
"allow_keychain": true,
"my_foo_complex_config": {
"bar":[
"a","b","c"
],
"baz":{
"xyz":123
}
}
}
}
}
For the scheme foo
, the following object will be passed as the config
parameter to the __init__
method of my.custom.FooTransport
upon class instantiation:
{
"handler":"my.custom.FooTransport",
"allow_keychain": true,
"my_foo_complex_config": {
"bar":[
"a","b","c"
],
"baz":{
"xyz":123
}
}
}
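The loading behavior described above can be approximated with the sketch below. It is not bdbag's actual implementation: the function names load_handler and build_handlers are hypothetical, and the sketch assumes that only the JSON literal true enables keychain propagation, whereas bdbag's own evaluation of allow_keychain may be more lenient.

```python
import importlib


def load_handler(scheme_config, keychain):
    """Instantiate the class named by scheme_config["handler"] (hypothetical helper)."""
    module_name, _, class_name = scheme_config["handler"].rpartition(".")
    handler_class = getattr(importlib.import_module(module_name), class_name)
    # Assumption: only a literal true switches keychain propagation on.
    allow_keychain = scheme_config.get("allow_keychain") is True
    # The whole scheme object, including "handler", is passed as config.
    return handler_class(scheme_config, keychain if allow_keychain else None)


def build_handlers(fetch_config, keychain):
    """Return one cached handler instance per scheme that declares a handler."""
    handlers = {}
    for scheme, scheme_config in fetch_config.items():
        if "handler" in scheme_config:
            handlers[scheme.lower()] = load_handler(scheme_config, keychain)
    return handlers
```

With the example configuration above, a handler built for the foo scheme would receive the keychain, while one built for s3 would receive None because allow_keychain is absent.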
Currently, only the default http
, https
and s3
transport handlers have configuration objects that control their behavior.
Parameter | Description |
---|---|
http | Configuration for the http fetch handler. |
https | Configuration for the https fetch handler. |
s3 | Configuration for the s3 fetch handler. |
This object contains configuration parameters for the http
fetch handler.
Parameter | Description |
---|---|
session_config | Session configuration parameters for the requests HTTP client library. The parameters mainly control retry logic. |
http_cookies | Configuration parameters for automatic loading and merging of HTTP cookie files. |
allow_redirects | A boolean indicating whether redirects should be followed automatically. |
redirect_status_codes | An array of integers representing the HTTP status codes used for determining redirection. Defaults to [301, 302, 303, 307, 308]. |
Session configuration parameters for the requests HTTP client library. The parameters mainly control retry logic. The retry logic is provided via the urllib3 library, wrapped by requests.
For more information, see this external documentation.
Parameter | Description |
---|---|
retry_backoff_factor | The exponential backoff factor for all retry attempts. Defaults to 1.0. |
retry_connect | The number of connect attempts to retry. Defaults to 5. |
retry_read | The number of read attempts to retry. Defaults to 5. |
retry_status_forcelist | A list of HTTP response codes that will force an automatic retry. Defaults to [500, 502, 503, 504]. |
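To make the relationship to urllib3 more concrete, the sketch below merges a session_config object with the documented defaults and renames the keys to the keyword arguments accepted by urllib3's Retry class (connect, read, backoff_factor, status_forcelist). The function name retry_kwargs is hypothetical, and the one-to-one mapping is an assumption about how bdbag applies these values, not a description of its code.

```python
DEFAULT_SESSION_CONFIG = {
    "retry_backoff_factor": 1.0,
    "retry_connect": 5,
    "retry_read": 5,
    "retry_status_forcelist": [500, 502, 503, 504],
}


def retry_kwargs(session_config=None):
    """Translate session_config keys into urllib3 Retry keyword arguments."""
    merged = dict(DEFAULT_SESSION_CONFIG)
    merged.update(session_config or {})
    # Assumed mapping: each bdbag key corresponds to one Retry argument.
    return {
        "connect": merged["retry_connect"],
        "read": merged["retry_read"],
        "backoff_factor": merged["retry_backoff_factor"],
        "status_forcelist": list(merged["retry_status_forcelist"]),
    }
```

If requests and urllib3 are available, the result could be used as Retry(**retry_kwargs(cfg)) and mounted on a requests session through requests.adapters.HTTPAdapter(max_retries=...).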
Configuration parameters for automatic loading and merging of HTTP cookie files. These cookie files must follow the Mozilla/Netscape/CURL/WGET format as described here.
Parameter | Description |
---|---|
scan_for_cookie_files | A boolean value that enables/disables the cookie scan feature globally. Defaults to True (enabled). |
search_paths | An array of base directory paths from which to recursively search with search_paths_filter for file_names to use as input. Defaults to the system-dependent expansion of ~. |
search_paths_filter | An fnmatch.filter pattern that can be used to filter specific subdirectories of each path specified in search_paths. Defaults to .bdbag. |
file_names | An array of input cookie filenames or fnmatch.filter patterns to match cookie filenames against. Defaults to [*cookies.txt]. |
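One plausible reading of how these four parameters interact is sketched below. The function name find_cookie_files is hypothetical, and the traversal rule (only files directly inside a directory whose name matches search_paths_filter are considered) is an assumption; bdbag's actual scan may differ in detail.

```python
import fnmatch
import os


def find_cookie_files(http_cookies):
    """Return cookie file paths selected by an http_cookies object (sketch)."""
    if not http_cookies.get("scan_for_cookie_files", True):
        return []
    search_paths = http_cookies.get("search_paths", [os.path.expanduser("~")])
    path_filter = http_cookies.get("search_paths_filter", ".bdbag")
    patterns = http_cookies.get("file_names", ["*cookies.txt"])
    found = set()
    for base in search_paths:
        for dirpath, _dirnames, filenames in os.walk(base):
            # Assumption: only directories whose own name matches the filter count.
            if not fnmatch.filter([os.path.basename(dirpath)], path_filter):
                continue
            for pattern in patterns:
                for name in fnmatch.filter(filenames, pattern):
                    found.add(os.path.join(dirpath, name))
    return sorted(found)
```

Note that a recursive walk of a large home directory can be slow, which is one reason the scan can be disabled globally with scan_for_cookie_files.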
This object contains configuration parameters for the https
fetch handler. The https
fetch handler configuration is
identical to the http
fetch handler configuration, with the following exceptions:
Parameter | Description |
---|---|
bypass_ssl_cert_verification | Either the boolean value true or false, or an array of string values consisting of URL patterns to be used in a simple substring match against the target URLs found in a bag's fetch.txt file. For example, "bypass_ssl_cert_verification": ["https://raw.githubusercontent.com/fair-research/bdbag/"] will match a fetch.txt entry with a URL of "https://raw.githubusercontent.com/fair-research/bdbag/master/test/test-data/test-http/test-fetch-http.txt". Defaults to false. |
Setting bypass_ssl_cert_verification: true is NOT RECOMMENDED, as it will bypass SSL certificate validation for ALL HTTPS requests. This will accept any TLS certificate presented by a remote server and will ignore hostname mismatches and/or expired certificates, which will make the application vulnerable to man-in-the-middle (MitM) attacks.
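The two accepted forms of bypass_ssl_cert_verification (a boolean, or an array of URL substrings) can be illustrated with the short sketch below. The function name should_bypass_cert_verification is hypothetical; treating any other value type as false is an assumption chosen to fail safe.

```python
def should_bypass_cert_verification(setting, url):
    """Decide whether certificate verification is skipped for one URL (sketch)."""
    if isinstance(setting, bool):
        # true disables verification for every HTTPS request; false for none.
        return setting
    if isinstance(setting, (list, tuple)):
        # Simple substring match of each configured pattern against the URL.
        return any(pattern in url for pattern in setting)
    # Assumption: anything else is treated as the default (false).
    return False
```

Preferring the array form limits the bypass to the specific hosts or paths that need it, rather than weakening every HTTPS request.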
This object contains configuration parameters for the s3
fetch handler.
Parameter | Description |
---|---|
max_read_retries | Maximum number of socket read retries. Defaults to 5. |
read_chunk_size | Number of bytes to consume per read attempt. Defaults to 10485760 bytes (10MB). |
read_timeout_seconds | Timeout in seconds per read attempt. Defaults to 120. |
This object contains all implementation-specific resolver configuration parameters, keyed by resolver scheme. The current default handler schemes are: ark, minid, doi, and ga4ghdos.
Each scheme can have multiple resolver configuration blocks in an array, where each block can be mapped to a different resolver namespace prefix.
Parameter | Description |
---|---|
handler | This is the fully-qualified Python class name of a class derived from bdbag.fetch.resolvers.base_resolver.BaseResolverHandler and implementing the required functions. The bdbag resolver code will attempt to locate and instantiate this class at runtime. |
prefix | This is an optional parameter that maps the handler resolution to only instances that contain the specific prefix found in the identifier. |
identifier_resolvers | This is the same parameter as the global identifier_resolvers array. If found at this level, it will override the global setting for this scheme/prefix combination. |
Below is a sample bdbag.json
file:
{
"bag_config": {
"bag_algorithms": [
"md5",
"sha256"
],
"bag_metadata": {
"BagIt-Profile-Identifier": "https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
"Contact-Name": "mdarcy",
"Contact-Orcid": "0000-0003-2280-917X"
},
"bag_processes": 1,
"bagit_spec_version": "0.97"
},
"bdbag_config_version": "1.5.0",
"fetch_config": {
"http": {
"session_config": {
"retry_backoff_factor": 1.0,
"retry_connect": 5,
"retry_read": 5,
"retry_status_forcelist": [
500,
502,
503,
504
]
},
"http_cookies": {
"file_names": [
"*cookies.txt"
],
"scan_for_cookie_files": true,
"search_paths": [
"/home/mdarcy"
],
"search_paths_filter": ".bdbag"
}
},
"s3": {
"max_read_retries": 5,
"read_chunk_size": 10485760,
"read_timeout_seconds": 120
}
},
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"resolver_config": {
"ark": [
{
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"prefix": null
},
{
"handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"prefix": "57799"
},
{
"handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"prefix": "99999/fk4"
}
],
"doi": [
{
"handler": "bdbag.fetch.resolvers.doi_resolver.DOIResolverHandler",
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
],
"prefix": "10.23725/"
}
],
"ga4ghdos": [
{
"handler": "bdbag.fetch.resolvers.dataguid_resolver.DataGUIDResolverHandler",
"identifier_resolvers": [
"n2t.net"
],
"prefix": "dg.4503/"
}
],
"minid": [
{
"handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
"identifier_resolvers": [
"n2t.net",
"identifiers.org"
]
}
]
}
}
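To illustrate how a scheme can carry several resolver blocks distinguished by prefix, the sketch below picks a block from resolver_config for a given identifier. The function name select_resolver_block is hypothetical, and the matching rule (first block whose prefix occurs in the identifier, otherwise the first block without a prefix) is an assumption rather than bdbag's documented algorithm.

```python
def select_resolver_block(resolver_config, identifier):
    """Pick the resolver block for an identifier such as 'ark:/57799/b9j01d'."""
    scheme, _, remainder = identifier.partition(":")
    fallback = None
    for block in resolver_config.get(scheme.lower(), []):
        prefix = block.get("prefix")
        if prefix is None:
            # Remember the first prefix-less block as the scheme-wide default.
            if fallback is None:
                fallback = block
        elif prefix in remainder:
            return block
    return fallback
```

Under this reading, the sample file above would send ark:/57799/... identifiers to the block with prefix 57799 and any other ark identifier to the block whose prefix is null.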
The file keychain.json is used to specify the authentication mechanisms and credentials for the various URLs that might be encountered while trying to resolve (download) the files listed in a bag's fetch.txt file.
The format of keychain.json is a JSON array containing a list of JSON objects, each of which specifies a set of parameters used to configure the authentication method and credentials to use for a specified base URL.
Parameter | Description |
---|---|
uri | This is the base URI used to specify when authentication should be used. When a URI reference is encountered in fetch.txt, an attempt will be made to match it against all base URIs specified in keychain.json and if a match is found, the request will be authenticated before file retrieval is attempted. |
auth_uri | This is the authentication URI used to establish an authenticated session for the specified uri. This is currently assumed to be an HTTP(s) protocol URL. |
auth_type | This is the authentication type used by the server specified by uri or auth_uri (if present). |
auth_params | This is a child object containing authentication-type specific parameters used in session establishment. It will generally contain credential information such as a username and password, a cookie value, or client certificate parameters. It can also contain other parameters required for authentication with the given auth_type mechanism; for example the HTTP method (i.e., GET or POST) to use with HTTP Basic Auth. |
Below is a sample keychain.json
file:
[
{
"uri": "https://some.host.com/somefiles/",
"auth_uri": "https://some.host.com/authenticate",
"auth_type": "http-form",
"auth_params": {
"username": "me",
"password": "mypassword",
"username_field": "username",
"password_field": "password"
}
},
{
"uri": "https://some.host.com/somefiles/",
"auth_uri": "https://some.host.com/authenticate",
"auth_type": "http-basic",
"auth_params": {
"auth_method":"POST",
"username": "me",
"password": "mypassword"
}
},
{
"uri": "https://some.host.com/somefiles/",
"auth_type": "cookie",
"auth_params": {
"cookies": [ "a_cookie_name=zxyfw1231_secret"]
}
},
{
"uri": "https://some.host.com/somefiles/",
"auth_type": "bearer-token",
"auth_params": {
"token": "<token>",
"allow_redirects_with_token": true
}
},
{
"uri": "ftp://some.host.com/somefiles/",
"auth_type": "ftp-basic",
"auth_params": {
"username": "anonymous",
"password": "[email protected]"
}
},
{
"uri": "s3://mybucket",
"auth_type": "aws-credentials",
"auth_params": {
"key": "foo",
"secret": "bar"
}
},
{
"uri": "gs://gcs-bdbag-integration-testing/",
"auth_type": "gcs-credentials",
"auth_params": {
"project_id": "bdbag-204999",
"allow_requester_pays": true
}
},
{
"uri": "gs://bdbag-dev/",
"auth_type": "gcs-credentials",
"auth_params": {
"project_id": "bdbag-204999",
"service_account_credentials_file": "/home/bdbag/bdbag-204400-41babdd46e24.json"
}
},
{
"uri": "globus://my_endpoint/my_files/",
"auth_type": "globus_transfer",
"auth_params": {
"local_endpoint": "b06c5a10-0b17-11e7-a73f-22000bf2d559",
"transfer_token": "AQBXNMizAAAAAAADPIg9SoyPk_dm0BOFcWT7pe-52fQKv2Je6zi-hEvJ5xkfXw8rLaL9mVg8RtOY-vy4qrQd"
}
}
]
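The matching of fetch.txt URLs against keychain base URIs can be pictured with the sketch below. The function name find_keychain_entries is hypothetical, and the rule used here (a case-insensitive "starts with" comparison against each entry's uri) is an assumption; the documentation only states that a match against the base URIs is attempted.

```python
def find_keychain_entries(url, keychain):
    """Return the keychain entries whose base uri is a prefix of url (sketch)."""
    lowered = url.lower()
    matches = []
    for entry in keychain:
        base = entry.get("uri")
        # Skip entries without a uri so an empty prefix never matches everything.
        if base and lowered.startswith(base.lower()):
            matches.append(entry)
    return matches
```

Returning every match rather than only the first reflects the sample above, where several entries share the same uri but use different auth_type values.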
A remote-file-manifest configuration file is used by bdbag during bag creation and update as a way to include files in a bag that are not necessarily present on the local system, and therefore cannot be hashed.
The file is processed by bdbag and the data is used to generate both payload manifest entries and fetch.txt entries in the resulting bag.
The remote-file-manifest is structured as a JSON array containing a list of JSON objects that have the following attributes:
- url: The URL where the file can be located or dereferenced from. This value MUST be present.
- length: The length of the file in bytes. This value MUST be present.
- filename: The filename (or path), relative to the bag 'data' directory as it will be referenced in the bag manifest(s) and fetch.txt files. This value MUST be present.
- AT LEAST one (and ONLY one of each) of the following algorithm:checksum key-value pairs:
  - md5:<md5 hex value>
  - sha1:<sha1 hex value>
  - sha256:<sha256 hex value>
  - sha512:<sha512 hex value>
- Other legal JSON keys and values of arbitrary complexity MAY be included, as long as the basic requirements of the structure (as described above) are fulfilled.
Below is a sample remote-file-manifest
configuration file:
[
{
"url":"https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
"length":699,
"filename":"bdbag-profile.json",
"sha256":"eb42cbc9682e953a03fe83c5297093d95eec045e814517a4e891437b9b993139"
},
{
"url":"ark:/88120/r8059v",
"length": 632860,
"filename": "minid_v0.1_Nov_2015.pdf",
"sha256": "cacc1abf711425d3c554277a5989df269cefaa906d27f1aaa72205d30224ed5f"
}
]
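The structural requirements listed above lend themselves to a small pre-flight check before a manifest is handed to bdbag. The sketch below is illustrative only: check_manifest_entry is a hypothetical helper, and the additional rule that length be a non-negative integer is an assumption derived from "length of the file in bytes".

```python
REQUIRED_KEYS = ("url", "length", "filename")
CHECKSUM_KEYS = ("md5", "sha1", "sha256", "sha512")


def check_manifest_entry(entry):
    """Return a list of problems for one remote-file-manifest object (sketch)."""
    problems = ["missing required key: %s" % key for key in REQUIRED_KEYS if key not in entry]
    if "length" in entry:
        length = entry["length"]
        # Assumption: a byte length is a non-negative integer (bool excluded).
        if isinstance(length, bool) or not isinstance(length, int) or length < 0:
            problems.append("length must be a non-negative integer")
    if not any(key in entry for key in CHECKSUM_KEYS):
        problems.append("at least one checksum (md5, sha1, sha256, sha512) is required")
    return problems
```

Extra keys are ignored on purpose, since the format allows other legal JSON keys and values alongside the required ones.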
A bag-info metadata configuration file consists of a single JSON object containing a set of JSON key-value pairs that will be written as-is to the bag's bag-info.txt file. NOTE: per the bagit specification, strings are the only supported value type in bag-info.txt.
Below is a sample bag-info
metadata configuration file:
{
"BagIt-Profile-Identifier": "https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
"External-Description": "Simple bdbag test",
"Arbitrary-Metadata-Field": "This is completely arbitrary"
}
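Because only string values are supported in bag-info.txt, it can be useful to flag non-string values before creating a bag. The helper below is a hypothetical sketch, not part of bdbag, and says nothing about how bdbag itself reacts to such values.

```python
def non_string_fields(bag_info):
    """Return the sorted names of bag-info fields whose values are not strings."""
    return sorted(key for key, value in bag_info.items() if not isinstance(value, str))
```

For the sample file above the result would be an empty list, while a field such as "Bag-Count": 3 would be reported.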
A Research Object metadata configuration file consists of a single JSON object containing a set of JSON key-object pairs where the key is a / delimited relative file path and the object is any arbitrarily complex JSON content. This format allows bdbag to process all RO metadata as an aggregation which can then be serialized into individual JSON file components relative to the bag's metadata directory.
NOTE: while this documentation refers to this configuration file as a ro
metadata file,
the contents of this configuration file only have to conform to the bagit-ro
conventions if bagit-ro
compatibility is the goal. Otherwise, this mechanism can be used as a generic way to create any number of
arbitrary JSON (or JSON-LD) metadata files as bagit
tagfiles.
Below is a sample ro metadata configuration file:
{
"manifest.json": {
"@context": [ "https://w3id.org/bundle/context" ],
"@id": "../",
"createdOn": "2018-02-08T12:23:00Z",
"aggregates": [
{ "uri": "../data/CTD_chem_gene_ixn_types.csv",
"mediatype": "text/csv"
},
{ "uri": "../data/CTD_chemicals.csv",
"mediatype": "text/csv"
},
{ "uri": "../data/CTD_pathways.csv",
"mediatype": "text/csv"
}
],
"annotations": [
{ "about": "../data/CTD_chem_gene_ixn_types.csv",
"content": "annotations/CTD_chem_gene_ixn_types.csv.jsonld"
}
]
},
"annotations/CTD_chem_gene_ixn_types.csv.jsonld": {
"@context": {
"schema": "http://schema.org/",
"object": "schema:object",
"TypeName": {
"@type": "schema:name",
"@id": "schema:name"
},
"Code": {
"@type": "schema:code",
"@id": "schema:code"
},
"Description": {
"@type": "schema:description",
"@id": "schema:description"
},
"ParentCode": {
"@type": "schema:code",
"@id": "schema:parentItem"
},
"results": {
"@id": "schema:object",
"@type": "schema:object",
"@container": "@set"
}
}
}
}
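The aggregation format described above (one key per output file, one JSON object per key) can be illustrated by the sketch below, which writes each entry to its own file under a target directory. The function name write_ro_metadata and the metadata_dir parameter are hypothetical, and the rejection of ".." path segments is a defensive assumption; this is not bdbag's serialization code.

```python
import json
import os


def write_ro_metadata(ro_metadata, metadata_dir):
    """Write each key of an RO metadata object to metadata_dir/<key> (sketch)."""
    written = []
    for relative_path, content in ro_metadata.items():
        # Keys are "/"-delimited relative paths; split them portably.
        parts = [part for part in relative_path.split("/") if part]
        if not parts or ".." in parts:
            raise ValueError("invalid relative path: %r" % (relative_path,))
        target = os.path.join(metadata_dir, *parts)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w", encoding="utf-8") as handle:
            json.dump(content, handle, indent=4)
        written.append(target)
    return written
```

Applied to the sample above, this would produce manifest.json and annotations/CTD_chem_gene_ixn_types.csv.jsonld beneath the chosen directory.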