OpenGHG Inversions is a Python package that is being developed as part of the OpenGHG project, with the aim of merging the data-processing and simulation modelling capabilities of OpenGHG with the atmospheric Bayesian inverse models developed by the Atmospheric Chemistry Research Group (ACRG) at the University of Bristol, UK.
Currently, OpenGHG Inversions includes the following regional inversion models:
- Hierarchical Bayesian Markov Chain Monte Carlo (HBMCMC) model (as described in Ganesan et al., 2014, ACP)
As OpenGHG Inversions depends on OpenGHG, please ensure that when running locally you are using Python 3.10 or later on Linux or macOS. Please see the OpenGHG project for instructions on installing OpenGHG and setting up an object store.
Check that you have Python 3.10 or greater:
python --version
(Note for Bristol ACRG group: if you are on Blue Pebble, the default Anaconda module `lang/python/anaconda` is Python 3.9. Use `module avail` to list other options; `lang/python/miniconda/3.10.10.cuda-12` or `lang/python/miniconda/3.12.2.inc-perl-5.30.0` will work.)
Make a virtual environment:
python -m venv openghg_inv
Next, activate the environment:
source openghg_inv/bin/activate
First, you'll need to clone the repository:
git clone https://github.com/openghg/openghg_inversions.git
Next, make sure `pip` and related install tools are up to date, and then install OpenGHG Inversions using the editable install flag (`-e`):
pip install --upgrade pip setuptools wheel
pip install -e openghg_inversions
Optionally, install the developer requirements (there is more information about this in the "Contributing" section below):
pip install -r requirements-dev.txt
At this point, run:
python -c "import pymc"
This should run without printing any messages.
If you receive a message about `pymc` or `pytensor` using the `numpy` C-API, then your inversions might run slowly because the fast linear algebra libraries used by `numpy` haven't been found.
Solutions to this are:
- try `python -m pip install numpy` after upgrading `pip`, `setuptools`, and `wheel`
- create a `conda` env, install `numpy` using `conda`, then use `pip` to upgrade `pip`, `setuptools`, and `wheel` and install `openghg_inversions`
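One way to check which linear algebra libraries `numpy` was built against (this is a general `numpy` feature, not something specific to OpenGHG Inversions) is:

```python
import numpy as np

# Print numpy's build configuration: look for OpenBLAS, MKL, or Accelerate.
# If no optimised BLAS/LAPACK appears, sampling is likely to be slow.
np.show_config()
```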
For an overview of OpenGHG inversions, see this primer.
Keyword arguments are propagated as follows:
- any key-value pair in an `ini` file or passed via the `--kwargs` flag is passed to the MCMC function as a keyword argument. (Currently, `fixedbasisMCMC` is the only available MCMC function.)
- any keyword argument not recognised by the MCMC function (i.e. `fixedbasisMCMC`) is passed to the function `inferpymc` in `hbmcmc.inversion_pymc`, which is the function that creates and samples from the RHIME model.
Thus you can pass arguments to either `fixedbasisMCMC` or `inferpymc`, but all of these arguments will be specified in the `ini` file (or on the command line).
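Conceptually, the splitting works like the following sketch; this is illustrative only and not the actual `openghg_inversions` code:

```python
import inspect

def split_kwargs(mcmc_func, **kwargs):
    """Illustrative only: separate kwargs accepted by the MCMC function
    from those that would be forwarded to the inner sampling function."""
    accepted = set(inspect.signature(mcmc_func).parameters)
    own = {k: v for k, v in kwargs.items() if k in accepted}
    forwarded = {k: v for k, v in kwargs.items() if k not in accepted}
    return own, forwarded
```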
Let's look at these two steps in detail.
Extra options can be added to an `ini` file in almost any location. The template `ini` file puts these options under the heading `MCMC.OPTIONS`:
[MCMC.OPTIONS]
averaging_error = True
min_error = 20.0
fixed_basis_outer_regions = False
use_bc = True
nuts_sampler = "numpyro"
save_trace = True
compute_min_error = True
no_model_error = False
These will be passed to the MCMC function (e.g. `fixedbasisMCMC`) as keyword arguments. Any argument in `fixedbasisMCMC` can be specified in an `ini` file this way.
When running inversions using the script `run_hbmcmc.py`, you must specify the start and end date of the inversion period, and you pass an `ini` file using the flag `-c`.
In addition, you can pass the output path using the flag `--output-path`; this is useful if your SLURM script uses different output locations for different array jobs.
You can also pass arbitrary keyword arguments to `run_hbmcmc.py` using the `--kwargs` flag.
For instance:
python run_hbmcmc.py "2019-01-01" "2019-02-01" -c "example.ini" --kwargs '{"averaging_error": true, "min_error": 20.0, "nuts_sampler": "numpyro"}'
It is crucial that you enclose the dictionary in single quotes; otherwise, the shell will split the dictionary on whitespace.
Again, this can be used to change the arguments passed to an inversion on the fly (say, in a SLURM script).
The format of the dictionary inside single quotes must be JSON, because the value of `kwargs` is parsed using `json.loads`.
Python translates JSON according to this table. In particular, `true` in JSON translates to `True` in Python (but `"True"` will be translated as a string).
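You can check how a `--kwargs` string will be parsed by running it through `json.loads` yourself, for example:

```python
import json

parsed = json.loads('{"averaging_error": true, "min_error": 20.0, "nuts_sampler": "numpyro"}')
print(parsed)  # {'averaging_error': True, 'min_error': 20.0, 'nuts_sampler': 'numpyro'}

# A quoted "True" remains a string after parsing:
print(json.loads('{"averaging_error": "True"}'))  # {'averaging_error': 'True'}
```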
The parsing in our `ini` files is more flexible; in particular, values that are Python statements will be translated to Python, so you don't need to worry about translation.
The following sections detail some parameters that enable/specify optional behaviour in the inversion.
This is not a comprehensive list (see the docstring for `fixedbasisMCMC` in the `hbmcmc` module for more arguments).
Arguments affecting the data used in the inversion:
- `sites`: a list of the sites to use in the inversion. Other information applied on a site-by-site basis that is presented in lists must be in the same order as the `sites` list.
- `inlet`: a list of inlets for each site. If only one inlet is available for a given site and species, then `None` may be used as the value for that site. If there is a range of inlet heights at a single site, and these should correspond to a single footprint release height, then you may use, for instance, `slice(140, 160)` to combine inlet heights between 140 and 160 meters into a single timeseries of observations.
- `instrument`, `fp_height`, `obs_data_level`, and `met_model` must either be lists of the same length as `sites`, or a single value may be supplied, which will be converted to a list of the correct length (see the sketch below).
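As a minimal sketch of that broadcasting behaviour (the site names and value here are hypothetical, and this is not the exact code used internally):

```python
sites = ["MHD", "TAC", "RGL"]  # hypothetical site list
fp_height = "10magl"           # a single value supplied for all sites

# A single value is expanded to a list matching the number of sites
if not isinstance(fp_height, list):
    fp_height = [fp_height] * len(sites)

print(fp_height)  # ['10magl', '10magl', '10magl']
```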
Arguments affecting the inverse model:
- `averaging_error`: if `True`, the error from resampling to the given `averaging_period` will be added to the observation's error.
- `use_bc`: defaults to `True`. If `False`, no boundary conditions will be used in the inversion. This implicitly assumes that contributions from the boundary have been subtracted from the observations.
- `fix_basis_outer_regions`:
  - Default value is `False`.
  - If `True`, the "outer regions" of the (`EUROPE`) domain use basis regions specified by a file provided by the Met Office (from their "InTem" model), and the "inner region", which includes the UK, is fit using our basis algorithms.
  - This option is currently only available for the `EUROPE` domain.
- `calculate_min_error`: calculate `min_error` (see below) on the fly using the "residual error method" or a method based on percentiles of the observations. Available arguments:
  - `"residual"`: use the "residual error method"
  - `"percentile"`: use the method based on percentiles
  - `None`: in this case, you should pass in a value directly using (for instance) `min_error = 12.3`
- `min_error_options`: additional parameters to pass to the function that computes the min. error. This should be a dictionary, and the available options depend on the function used. (The functions that compute the min. model error are in `model_error.py`.)
  - If `calculate_min_error = "residual"`, then, for instance, you could use `min_error_options = {"robust": False, "by_site": True}`. (By default, `robust` is `False`; this is just to show the possibilities.)
- `filters`: filters to apply to the data (after it is resampled and aligned).
  - `filters = None` will skip filtering.
  - If `filters` is a list of filters (or a string containing a single filter name), those filters will be applied to all sites.
  - If `filters` is a dictionary with site codes as keys and lists of filters as values, then each site will have filters applied individually according to this dictionary. All sites must be supplied; to skip a site, pass `None` instead of a list (or omit that site from the dictionary). For instance: `filters = {"MHD": ["pblh_inlet_diff", "pblh_min"], "JFJ": None}`.
  - The list of available filters can be found in the `filtering` function in the utils module.
- `xprior` and `bcprior`: these should be a dictionary containing `"pdf": <distribution>` and the arguments that should be passed to the PyMC distribution with that name (see the sketch after this list).
- Further parameters affecting the model are described in the next subsection: they are passed to `inferpymc`.
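As a sketch of how such a prior specification could be turned into a PyMC distribution (the dictionary contents and the name lookup here are illustrative; the exact mapping used by `inferpymc` may differ):

```python
import pymc as pm

# Hypothetical prior specification, as it might appear in an ini file
xprior = {"pdf": "lognormal", "mu": 0.0, "sigma": 1.0}

# Illustrative mapping from the "pdf" name to a PyMC distribution class
distributions = {"lognormal": pm.LogNormal, "truncatednormal": pm.TruncatedNormal}

with pm.Model():
    params = {k: v for k, v in xprior.items() if k != "pdf"}
    x = distributions[xprior["pdf"]]("x", **params)
```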
Arguments affecting the output of the inversion:
- `save_trace`:
  - The default value is `False`.
  - If `True`, the arviz `InferenceData` output from sampling will be saved to the output path of the inversion, with a file name of the form `f"{outputname}{start_date}_trace.nc"`. To load this trace into arviz, you need to use `InferenceData.from_netcdf` (see the example below).
  - Alternatively, you can pass a path (including filename), and that path will be used.
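For example, to load a saved trace into arviz (the file name here is hypothetical, following the pattern described above):

```python
import arviz as az

# Hypothetical file name, following the pattern described above
trace = az.InferenceData.from_netcdf("MY_RUN_2019-01-01_trace.nc")
print(trace.posterior)
```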
As mentioned above, any keyword argument passed to `fixedbasisMCMC` (either via an `ini` file or from `--kwargs` on the command line) that is not recognised by `fixedbasisMCMC` is passed on to `inferpymc`.
These parameters include:
- `min_error`: a non-negative float value specifying a lower bound for the model-measurement mismatch error (i.e. the error on `(y - y_mod)`).
- `nuts_sampler`: a string, which defaults to `"pymc"`. The other option is `"numpyro"`, which will use the JAX-accelerated sampler from Numpyro; this tends to be significantly faster than the NUTS sampler built into PyMC.
- `pollution_events_from_obs`: determines whether the model error is calculated as a fraction of:
  - the measured enhancement above the modelled baseline (if `True`)
  - the prior modelled enhancement (if `False`)
- `no_model_error`: if `True`, only use the obs. error in the likelihood (omitting the min. model error and the model error from scaling pollution events).
- `reparameterise_log_normal`: if `True`, then log-normal priors will be sampled by transforming samples from a standard normal random variable into samples from the appropriate log-normal distribution (see the sketch below).
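The idea behind this reparameterisation is sketched below; this is a minimal illustration, not the model code used by `inferpymc`:

```python
import pymc as pm

mu, sigma = 0.0, 0.5  # illustrative log-normal parameters

with pm.Model():
    # Sample a standard normal random variable...
    z = pm.Normal("z", mu=0.0, sigma=1.0)
    # ...and transform it into a draw from LogNormal(mu, sigma).
    x = pm.Deterministic("x", pm.math.exp(mu + sigma * z))
```

Sampling on the unconstrained standard-normal scale and then transforming can make NUTS sampling of skewed priors easier.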
The results of an inversion are returned as an xarray `Dataset`.
The dimension `nmeasure` consists of the time for each observation stacked into a single 1D array.
TODO: complete this part
- `Yerror`: obs. error used in the inversion; if `add_averaging` is `True`, this will contain the combined "repeatability" and "variability"; otherwise, it will just contain "repeatability", if it is available, or "variability".
- `Yerror_repeatability`: obs. repeatability. If repeatability isn't available for some sites, then this is filled with zeros.
- `Yerror_variability`: obs. variability.
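The output can be inspected like any other xarray dataset, for example (with a hypothetical output file name):

```python
import xarray as xr

# Hypothetical output file produced by an inversion run
ds = xr.open_dataset("MY_RUN_2019-01-01.nc")

print(ds.sizes["nmeasure"])  # number of stacked observation times
print(ds["Yerror"])          # obs. error used in the inversion
```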
To contribute to `openghg_inversions`, you should also install the developer packages:
pip install -r requirements-dev.txt
This will install the packages `flake8`, `pytest`, and `black`.
We use `black` to format our code. To check if your code needs reformatting, run:
black --check openghg_inversions
in your `openghg_inversions` repository (with your virtual environment activated).
If you replace the flag `--check` with `--diff`, you can see what will be changed.
To make these changes, run:
black openghg_inversions
We also recommend using `flake8` to check for code style issues, which you can run with:
flake8 openghg_inversions
You can run the tests using:
pytest
in the `openghg_inversions` repository. (Make sure your virtual environment is activated.)
To contribute new code, make a branch off of the `devel` branch.
When your code is ready to be added, push it to GitHub (`origin`).
You can then open a pull request on GitHub and request a code review.
It's helpful to write a description of the changes made in your PR, as well as to link to any relevant issues.
Your code must pass the tests and be reviewed before it can be merged. After this, you can merge your branch and delete it (it can always be recovered later if necessary).
Ganesan et al. (2014), ACP;
Western et al. (2021), Environ. Sci. Technol. Lett.