---
title: Interpretable ML for biodiversity
subtitle: An introduction using species distribution models
author: Timothée Poisot
institute: Université de Montréal
date: \today
---
- How do we produce a model?
- How do we convey that it works?
- How do we talk about how it makes predictions?
Why think of SDMs as ML problems? : Because they are! We want to learn a predictive algorithm from data
Why the focus on explainability? : We cannot ask people to trust the model: we must convince and explain
- Image recognition
- Sound recognition
- Generative AI
- ML basics
- cross-validation
- hyper-parameter tuning
- bagging and ensembles
- Pitfalls
- data leakage
- overfitting
- Explainable ML
- partial responses
- Shapley values
- a similar example fully worked out usually takes me 21 hours of class time
- this is an overview
- don't care about the output, care about the \alert{process}!
We have information about a species, taking the form of georeferenced occurrence records (presences only)
Using this information, we can extract a suite of environmental variables for the locations where the species was observed
We can do the same thing for locations where the species was not observed
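A sketch of this extraction step (assuming the layers are local GeoTIFF files read with rasterio; the file name and the coordinates below are placeholders):

```python
import numpy as np
import rasterio

# Placeholders: two observation coordinates (longitude, latitude) and one layer
coordinates = [(7.44, 46.95), (8.55, 47.37)]
layer_path = "CHELSA_bio1.tif"  # hypothetical local copy of one BioClim layer

with rasterio.open(layer_path) as layer:
    # sample() yields one array of band values per coordinate pair
    values = np.array([v[0] for v in layer.sample(coordinates)])

print(values)  # one environmental value per observation
```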
\alert{Where could we observe this species}?
We have a series of labels $\mathbf{y}_n \in \mathbb{B}$, and features $\mathbf{X}_{m,n} \in \mathbb{R}$
We want to find an algorithm $f$ that predicts the label from the features, $\hat{y}_n = f(\mathbf{x}_n)$
An algorithm that does this job well is generalizable (we can apply it on data it has not been trained on) and makes credible predictions
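In the abstract, training picks the algorithm that minimizes some loss $\ell$ over the observations (a generic formulation; each model below optimizes its own criterion):

$$f : \mathbb{R}^m \to \mathbb{B}, \qquad \hat{f} = \underset{f}{\mathrm{arg\,min}} \; \sum_{n} \ell\!\left(y_n, f(\mathbf{x}_n)\right)$$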
We will use data on observations of *Turdus torquatus* in Switzerland, downloaded from the copy of the eBird dataset on GBIF
Two series of environmental layers
- CHELSA2 BioClim variables (19)
- EarthEnv land cover variables (12)
Now is not the time to make assumptions about which are relevant!
We want absences as well, but the data only give us presences
We generate \alert{pseudo}-absences with the following rules (a sketch follows this list):
- Locations further away from a presence are more likely to be selected
- Locations less than 6 km away from a presence are ruled out
- Pseudo-absences are twice as common as presences
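A minimal sketch of these rules (the inputs are hypothetical: projected coordinates in metres for the presences, and a pool of candidate background cells):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Hypothetical inputs: presence coordinates and candidate background cells,
# both projected, in metres
presences = rng.uniform(0, 100_000, size=(50, 2))
candidates = rng.uniform(0, 100_000, size=(5_000, 2))

# Distance from every candidate cell to its nearest presence
nearest = cdist(candidates, presences).min(axis=1)

# Rule 2: cells closer than 6 km to any presence are ruled out
valid = nearest > 6_000.0

# Rule 1: farther cells are more likely to be drawn (weights proportional to distance)
weights = np.where(valid, nearest, 0.0)
weights /= weights.sum()

# Rule 3: twice as many pseudo-absences as presences
n_absences = 2 * len(presences)
picked = rng.choice(len(candidates), size=n_absences, replace=False, p=weights)
pseudo_absences = candidates[picked]
```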
Decision trees recursively split observations by picking the best variable and value.
Given enough depth, they can \alert{overfit} the training data (we'll get back to this).
We need an \alert{initial} model to get started: what if we use all the variables?
We shouldn't use all the variables.
But! It is a good baseline. A good baseline is important.
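A sketch of this baseline (scikit-learn as a stand-in for the actual implementation, on synthetic data mimicking the 19 + 12 layers):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the presence/pseudo-absence dataset
# (31 features mimic the 19 BioClim + 12 land cover layers)
X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

# Baseline: an unconstrained tree, fed every available variable.
# Without a depth limit it is free to overfit -- fine for a reference point.
baseline = DecisionTreeClassifier(random_state=0).fit(X, y)
print(baseline.get_depth(), baseline.get_n_leaves())
```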
Can we train the model?
More specifically -- if we train the model, how well can we expect it to perform?
The way we answer this question is: in many parallel universes with slightly less data, is the model good?
What if the model guessed based on chance only?
What is \alert{chance only}?
Guessing 50/50, guessing based on prevalence, or always giving the same answer
The null classifiers tell us what we need to beat in order to perform \alert{better than chance}.
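These baselines can be computed directly; below is a sketch with scikit-learn's dummy classifiers on synthetic stand-in data (the strategies are one possible mapping onto the rows of the next table):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

# Three flavours of "chance only"
baselines = {
    "coin flip": DummyClassifier(strategy="uniform", random_state=0),
    "no skill (prevalence)": DummyClassifier(strategy="stratified", random_state=0),
    "always +": DummyClassifier(strategy="constant", constant=1),
}
for name, clf in baselines.items():
    yhat = clf.fit(X, y).predict(X)
    print(name, round(matthews_corrcoef(y, yhat), 2), round(accuracy_score(y, yhat), 2))
```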
Model | MCC | PPV | NPV | DOR | Accuracy |
---|---|---|---|---|---|
No skill | -0.00 | 0.34 | 0.66 | 1.00 | 0.55 |
Coin flip | -0.32 | 0.34 | 0.34 | 0.26 | 0.34 |
Always + | 0.00 | 0.34 | 0.34 | | |
Always - | 0.00 | 0.66 | 0.66 | | |
In practice, the no-skill classifier is the most informative: what if we \alert{only} know the positive class prevalence?
- k-fold cross-validation (a sketch follows this list)
- no separate testing data is held out at this step
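A sketch of k-fold cross-validation for the baseline tree (scikit-learn on synthetic stand-in data, with MCC as the score):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

# k = 10: each fold takes a turn as the validation set,
# and the model is retrained from scratch on the other nine
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=10, scoring="matthews_corrcoef",
)
print(scores.mean().round(2), scores.std().round(2))
```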
Model | MCC | PPV | NPV | DOR | Accuracy |
---|---|---|---|---|---|
No skill | -0.00 | 0.34 | 0.66 | 1.00 | 0.55 |
Dec. tree (val.) | 0.80 | 0.83 | 0.96 | 210.06 | 0.91 |
Dec. tree (tr.) | 0.84 | 0.86 | 0.97 | 202.00 | 0.93 |
We \alert{train it}!
This training is done using the full dataset: there is no need to cross-validate, as we already know what to expect from the previous steps.
- \alert{variable selection}
- data transformation (we use PCA here, but there are many others)
- \alert{hyper-parameter tuning} (see the sketch after this list)
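A sketch of how transformation and tuning can be chained and searched jointly (scikit-learn pipeline and grid search on synthetic stand-in data; the grid values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

pipeline = Pipeline([
    ("pca", PCA()),                                    # data transformation
    ("tree", DecisionTreeClassifier(random_state=0)),  # the classifier itself
])

# Illustrative grid: how many components to keep, how deep the tree may grow
grid = {"pca__n_components": [3, 5, 10], "tree__max_depth": [3, 5, 7, None]}

search = GridSearchCV(pipeline, grid, cv=10, scoring="matthews_corrcoef").fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```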
We predict a presence when $P(+) > P(-)$, which is the same thing as $P(+) > 0.5$. Is it, though?
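One way to challenge the 0.5 default is to scan candidate thresholds on validation predictions and keep the one maximizing MCC (a sketch on synthetic stand-in data; the grid of thresholds is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(Xt, yt)
p_plus = model.predict_proba(Xv)[:, 1]  # P(+) for every validation point

# Scan candidate thresholds and keep the one maximizing MCC on validation data
thresholds = np.linspace(0.05, 0.95, 19)
mcc = [matthews_corrcoef(yv, p_plus >= t) for t in thresholds]
print("best threshold:", thresholds[int(np.argmax(mcc))])
```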
Model | MCC | PPV | NPV | DOR | Accuracy |
---|---|---|---|---|---|
No skill | -0.00 | 0.34 | 0.66 | 1.00 | 0.55 |
Dec. tree (val.) | 0.80 | 0.83 | 0.96 | 210.06 | 0.91 |
Dec. tree (tr.) | 0.84 | 0.86 | 0.97 | 202.00 | 0.93 |
Tuned tree (val.) | 0.83 | 0.85 | 0.96 | 198.33 | 0.92 |
Tuned tree (tr.) | 0.84 | 0.85 | 0.97 | 174.94 | 0.92 |
Decision trees overfit: if we pick a maximum depth of 8 splits, how many nodes can we use? Up to $2^9 - 1 = 511$ ($2^8 = 256$ of them leaves), each one an opportunity to memorize the training data.
- it's a single model, my dudes
- different subsets of the training data may have different signal
- do we need all the variables all the time?
- bias v. variance tradeoff
- fewer variables make it harder to overfit
- bootstrap the training \alert{instances} (32 samples for speed)
- randomly sample $\lceil \sqrt{n} \rceil$ variables (see the sketch after this list)
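These two ingredients are what a random forest does (a scikit-learn sketch on synthetic stand-in data; 32 trees for the 32 bootstrap samples, $\lceil \sqrt{n} \rceil$ variables considered at each split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

# ceil(sqrt(n)) variables considered at every split, to match the rule above
n_vars = int(np.ceil(np.sqrt(X.shape[1])))

forest = RandomForestClassifier(
    n_estimators=32,      # 32 bootstrap samples of the training instances
    bootstrap=True,
    max_features=n_vars,  # random subset of variables at each split
    random_state=0,
).fit(X, y)
print(forest.score(X, y))
```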
Model | MCC | PPV | NPV | DOR | Accuracy |
---|---|---|---|---|---|
No skill | -0.00 | 0.34 | 0.66 | 1.00 | 0.55 |
Dec. tree (val.) | 0.80 | 0.83 | 0.96 | 210.06 | 0.91 |
Dec. tree (tr.) | 0.84 | 0.86 | 0.97 | 202.00 | 0.93 |
Tuned tree (val.) | 0.83 | 0.85 | 0.96 | 198.33 | 0.92 |
Tuned tree (tr.) | 0.84 | 0.85 | 0.97 | 174.94 | 0.92 |
Forest (val.) | 0.77 | 0.79 | 0.96 | 111.07 | 0.89 |
Forest (tr.) | 0.77 | 0.79 | 0.95 | 78.34 | 0.89 |
Short answer: no
Long answer: maybe? Let's talk it through!
- what if we had a little less data (it's conceptually close to cross-validation!)
- uncertainty about locations, not predictions
Do we expect the model predictions to change at this location when we add more training data?
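One way to answer this is to look at how much the trees of the forest disagree at each location (a sketch on synthetic stand-in data; with rasters, the same per-tree predictions would be computed cell by cell):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)
forest = RandomForestClassifier(n_estimators=32, random_state=0).fit(X, y)

# One prediction per tree and per location: rows are trees, columns are locations
votes = np.stack([tree.predict(X) for tree in forest.estimators_])

# Share of trees predicting a presence, and how much the trees disagree
p_presence = votes.mean(axis=0)
disagreement = votes.std(axis=0)
print(disagreement.max().round(2))
```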
Layer | Variable | Importance |
---|---|---|
10 | BIO10 | 0.282 |
5 | BIO5 | 0.254 |
6 | BIO6 | 0.174 |
13 | BIO13 | 0.083 |
15 | BIO15 | 0.080 |
26 | Cultivated and Managed Vegetation | 0.079 |
12 | BIO12 | 0.045 |
29 | Snow/Ice | 0.003 |
If we assume that all the variables except one take their average value, what prediction does the model give as the remaining variable changes?
Equivalent to a mean-field approximation
Averaging the variables is \alert{masking a lot of variability}!
Alternative solution:
- Generate a grid for all the variables
- For every combination in this grid, use it as the stand-in for the variables being replaced
In practice: Monte-Carlo on a reasonable number of samples.
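Both flavours can be sketched for a single variable (synthetic stand-in data and an arbitrary variable index; the mean-field version pins every other variable to its average, the Monte-Carlo version keeps the joint structure of observed rows):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)
model = RandomForestClassifier(n_estimators=32, random_state=0).fit(X, y)

j = 0                                                # variable of interest (arbitrary)
grid = np.linspace(X[:, j].min(), X[:, j].max(), 50)

# Mean-field partial response: every other variable pinned to its average
pinned = np.tile(X.mean(axis=0), (grid.size, 1))
pinned[:, j] = grid
mean_field = model.predict_proba(pinned)[:, 1]

# Monte-Carlo version: the other variables keep the structure of observed rows
rng = np.random.default_rng(0)
rows = X[rng.choice(X.shape[0], size=100, replace=False)]
monte_carlo = np.empty(grid.size)
for k, value in enumerate(grid):
    stand_in = rows.copy()
    stand_in[:, j] = value
    monte_carlo[k] = model.predict_proba(stand_in)[:, 1].mean()
```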
- partial responses can only generate model-level information
- they break the structure of values for all predictors at the scale of a single observation
- their interpretation is unclear
- how much is the \alert{average prediction} modified by a specific variable having a specific value?
- it's based on game theory (but it's not actually game theory)
- many highly desirable properties! (see the sketch after this list)
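A sketch using the shap package on the forest (this assumes shap is installed; the shape of its output depends on the version, which the code accounts for). The mean absolute Shapley value per variable is one common model-level summary, comparable in spirit to the Shapley importance column in the next table:

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)
model = RandomForestClassifier(n_estimators=32, random_state=0).fit(X, y)

# TreeExplainer is the efficient explainer for tree ensembles
explainer = shap.TreeExplainer(model)
values = explainer.shap_values(X)

# For binary classifiers, shap returns either one array per class (a list)
# or a single array with a trailing class axis, depending on the version
values = values[1] if isinstance(values, list) else values[..., 1]

# Model-level summary: mean absolute Shapley value per variable
shapley_importance = np.abs(values).mean(axis=0)
print(np.argsort(shapley_importance)[::-1][:5])  # five most important variables
```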
Layer | Variable | Importance | Shapley importance |
---|---|---|---|
10 | BIO10 | 0.282 | 0.310 |
5 | BIO5 | 0.254 | 0.253 |
6 | BIO6 | 0.174 | 0.215 |
13 | BIO13 | 0.083 | 0.064 |
26 | Cultivated and Managed Vegetation | 0.079 | 0.063 |
12 | BIO12 | 0.045 | 0.037 |
15 | BIO15 | 0.080 | 0.032 |
29 | Snow/Ice | 0.003 | 0.026 |
- models we can train
- parameters can (should!) be tuned automatically
- we can use tools from explainable ML to give more clarity