---
title: Interpretable ML for biodiversity
subtitle: An introduction using species distribution models
author: Timothée Poisot
institute: Université de Montréal
date: \today
---
- How do we produce a model?
- How do we convey that it works?
- How do we talk about how it makes predictions?
Why think of SDMs as ML problems? : Because they are! We want to learn a predictive algorithm from data
Why the focus on explainability? : We cannot ask people to trust the model: we must convince and explain
- Image recognition
- Sound recognition
- Generative AI
- ML basics
- cross-validation
- hyper-parameter tuning
- bagging and ensembles
- Pitfalls
- data leakage
- overfitting
- Explainable ML
- partial responses
- Shapley values
- a similar example fully worked out usually takes me 21 hours of class time
- this is an overview
- don't care about the output, care about the \alert{process}!
We have information about a species, taking the form of georeferenced occurrence records (presences only)
Using this information, we can extract a suite of environmental variables for the locations where the species was observed
We can do the same thing for locations where the species was not observed
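A sketch of this extraction step (assuming the layers are local GeoTIFF files read with rasterio; the file name and the coordinates below are placeholders):

```python
import numpy as np
import rasterio

# Placeholders: two observation coordinates (longitude, latitude) and one layer
coordinates = [(7.44, 46.95), (8.55, 47.37)]
layer_path = "CHELSA_bio1.tif"  # hypothetical local copy of one BioClim layer

with rasterio.open(layer_path) as layer:
    # sample() yields one array of band values per coordinate pair
    values = np.array([v[0] for v in layer.sample(coordinates)])

print(values)  # one environmental value per observation
```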
\alert{Where could we observe this species}?
We have a series of labels $\mathbf{y}_n \in \mathbb{B}$, and features $\mathbf{X}_{m,n} \in \mathbb{R}$
We want to find an algorithm $f$ that predicts the label from the features, $\hat{y}_n = f(\mathbf{x}_n)$
An algorithm that does this job well is generalizable (we can apply it on data it has not been trained on) and makes credible predictions
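In the abstract, training picks the algorithm that minimizes some loss $\ell$ over the observations (a generic formulation; each model below optimizes its own criterion):

$$f : \mathbb{R}^m \to \mathbb{B}, \qquad \hat{f} = \underset{f}{\mathrm{arg\,min}} \; \sum_{n} \ell\!\left(y_n, f(\mathbf{x}_n)\right)$$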
We will use data on observations of *Turdus torquatus* in Switzerland, downloaded from the copy of the eBird dataset on GBIF
Two series of environmental layers
- CHELSA2 BioClim variables (19)
- EarthEnv land cover variables (12)
Now is not the time to make assumptions about which are relevant!
We want absences as well, but the data only give us presences
We generate \alert{pseudo}-absences with the following rules (a sketch follows this list):
- Locations further away from a presence are more likely to be selected
- Locations less than 6 km away from a presence are ruled out
- Pseudo-absences are twice as common as presences
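A minimal sketch of these rules (the inputs are hypothetical: projected coordinates in metres for the presences, and a pool of candidate background cells):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Hypothetical inputs: presence coordinates and candidate background cells,
# both projected, in metres
presences = rng.uniform(0, 100_000, size=(50, 2))
candidates = rng.uniform(0, 100_000, size=(5_000, 2))

# Distance from every candidate cell to its nearest presence
nearest = cdist(candidates, presences).min(axis=1)

# Rule 2: cells closer than 6 km to any presence are ruled out
valid = nearest > 6_000.0

# Rule 1: farther cells are more likely to be drawn (weights proportional to distance)
weights = np.where(valid, nearest, 0.0)
weights /= weights.sum()

# Rule 3: twice as many pseudo-absences as presences
n_absences = 2 * len(presences)
picked = rng.choice(len(candidates), size=n_absences, replace=False, p=weights)
pseudo_absences = candidates[picked]
```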
Decision trees recursively split observations by picking the best variable and value.
Given enough depth, they can \alert{overfit} the training data (we'll get back to this).
We need an \alert{initial} model to get started: what if we use all the variables?
We shouldn't use all the variables.
But! It is a good baseline. A good baseline is important.
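A sketch of this baseline (scikit-learn as a stand-in for the actual implementation, on synthetic data mimicking the 19 + 12 layers):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the presence/pseudo-absence dataset
# (31 features mimic the 19 BioClim + 12 land cover layers)
X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

# Baseline: an unconstrained tree, fed every available variable.
# Without a depth limit it is free to overfit -- fine for a reference point.
baseline = DecisionTreeClassifier(random_state=0).fit(X, y)
print(baseline.get_depth(), baseline.get_n_leaves())
```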
Can we train the model?
More specifically -- if we train the model, how well can we expect it to perform?
The way we answer this question is: in many parallel universes with slightly less data, is the model good?
What if the model guessed based on chance only?
What is \alert{chance only}?
Guessing 50/50, guessing based on prevalence, or always giving the same answer
The null classifiers tell us what we need to beat in order to perform \alert{better than chance}.
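These baselines can be computed directly; below is a sketch with scikit-learn's dummy classifiers on synthetic stand-in data (the strategies are one possible mapping onto the rows of the next table):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

# Three flavours of "chance only"
baselines = {
    "coin flip": DummyClassifier(strategy="uniform", random_state=0),
    "no skill (prevalence)": DummyClassifier(strategy="stratified", random_state=0),
    "always +": DummyClassifier(strategy="constant", constant=1),
}
for name, clf in baselines.items():
    yhat = clf.fit(X, y).predict(X)
    print(name, round(matthews_corrcoef(y, yhat), 2), round(accuracy_score(y, yhat), 2))
```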
Model | MCC | PPV | NPV | DOR | Accuracy |
---|---|---|---|---|---|
No skill | -0.00 | 0.34 | 0.66 | 1.00 | 0.55 |
Coin flip | -0.32 | 0.34 | 0.34 | 0.26 | 0.34 |
Always + | 0.00 | 0.34 | 0.34 | | |
Always - | 0.00 | 0.66 | 0.66 | | |
In practice, the no-skill classifier is the most informative: what if we \alert{only} know the positive class prevalence?
- k-fold cross-validation (a sketch follows this list)
- no separate testing data is held out at this step
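A sketch of k-fold cross-validation for the baseline tree (scikit-learn on synthetic stand-in data, with MCC as the score):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

# k = 10: each fold takes a turn as the validation set,
# and the model is retrained from scratch on the other nine
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=10, scoring="matthews_corrcoef",
)
print(scores.mean().round(2), scores.std().round(2))
```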
Model | MCC | PPV | NPV | DOR | Accuracy |
---|---|---|---|---|---|
No skill | -0.00 | 0.34 | 0.66 | 1.00 | 0.55 |
Dec. tree (val.) | 0.80 | 0.83 | 0.96 | 210.06 | 0.91 |
Dec. tree (tr.) | 0.84 | 0.86 | 0.97 | 202.00 | 0.93 |
We \alert{train it}!
This training is done using the full dataset: there is no need to cross-validate, as we already know what to expect from the previous steps.
- \alert{variable selection}
- data transformation (we use PCA here, but there are many others)
- \alert{hyper-parameter tuning} (see the sketch after this list)
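A sketch of how transformation and tuning can be chained and searched jointly (scikit-learn pipeline and grid search on synthetic stand-in data; the grid values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

pipeline = Pipeline([
    ("pca", PCA()),                                    # data transformation
    ("tree", DecisionTreeClassifier(random_state=0)),  # the classifier itself
])

# Illustrative grid: how many components to keep, how deep the tree may grow
grid = {"pca__n_components": [3, 5, 10], "tree__max_depth": [3, 5, 7, None]}

search = GridSearchCV(pipeline, grid, cv=10, scoring="matthews_corrcoef").fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```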
We predict a presence when $P(+) > P(-)$, which is the same thing as $P(+) > 0.5$. Is it, though?
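One way to challenge the 0.5 default is to scan candidate thresholds on validation predictions and keep the one maximizing MCC (a sketch on synthetic stand-in data; the grid of thresholds is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(Xt, yt)
p_plus = model.predict_proba(Xv)[:, 1]  # P(+) for every validation point

# Scan candidate thresholds and keep the one maximizing MCC on validation data
thresholds = np.linspace(0.05, 0.95, 19)
mcc = [matthews_corrcoef(yv, p_plus >= t) for t in thresholds]
print("best threshold:", thresholds[int(np.argmax(mcc))])
```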
Model | MCC | PPV | NPV | DOR | Accuracy |
---|---|---|---|---|---|
No skill | -0.00 | 0.34 | 0.66 | 1.00 | 0.55 |
Dec. tree (val.) | 0.80 | 0.83 | 0.96 | 210.06 | 0.91 |
Dec. tree (tr.) | 0.84 | 0.86 | 0.97 | 202.00 | 0.93 |
Tuned tree (val.) | 0.83 | 0.85 | 0.96 | 198.33 | 0.92 |
Tuned tree (tr.) | 0.84 | 0.85 | 0.97 | 174.94 | 0.92 |
Decision trees overfit: if we pick a maximum depth of 8 splits, how many nodes can we use? Up to $2^9 - 1 = 511$ ($2^8 = 256$ of them leaves), each one an opportunity to memorize the training data.
- it's a single model, my dudes
- different subsets of the training data may have different signal
- do we need all the variables all the time?
- bias v. variance tradeoff
- fewer variables make it harder to overfit
- bootstrap the training \alert{instances} (32 samples for speed)
- randomly sample $\lceil \sqrt{n} \rceil$ variables (see the sketch after this list)
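These two ingredients are what a random forest does (a scikit-learn sketch on synthetic stand-in data; 32 trees for the 32 bootstrap samples, $\lceil \sqrt{n} \rceil$ variables considered at each split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)

# ceil(sqrt(n)) variables considered at every split, to match the rule above
n_vars = int(np.ceil(np.sqrt(X.shape[1])))

forest = RandomForestClassifier(
    n_estimators=32,      # 32 bootstrap samples of the training instances
    bootstrap=True,
    max_features=n_vars,  # random subset of variables at each split
    random_state=0,
).fit(X, y)
print(forest.score(X, y))
```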
Model | MCC | PPV | NPV | DOR | Accuracy |
---|---|---|---|---|---|
No skill | -0.00 | 0.34 | 0.66 | 1.00 | 0.55 |
Dec. tree (val.) | 0.80 | 0.83 | 0.96 | 210.06 | 0.91 |
Dec. tree (tr.) | 0.84 | 0.86 | 0.97 | 202.00 | 0.93 |
Tuned tree (val.) | 0.83 | 0.85 | 0.96 | 198.33 | 0.92 |
Tuned tree (tr.) | 0.84 | 0.85 | 0.97 | 174.94 | 0.92 |
Forest (val.) | 0.77 | 0.79 | 0.96 | 111.07 | 0.89 |
Forest (tr.) | 0.77 | 0.79 | 0.95 | 78.34 | 0.89 |
Short answer: no
Long answer: maybe? Let's talk it through!
- what if we had a little less data (it's conceptually close to cross-validation!)
- uncertainty about locations, not predictions
Do we expect the model predictions to change at this location when we add more training data?
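One way to answer this is to look at how much the trees of the forest disagree at each location (a sketch on synthetic stand-in data; with rasters, the same per-tree predictions would be computed cell by cell):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)
forest = RandomForestClassifier(n_estimators=32, random_state=0).fit(X, y)

# One prediction per tree and per location: rows are trees, columns are locations
votes = np.stack([tree.predict(X) for tree in forest.estimators_])

# Share of trees predicting a presence, and how much the trees disagree
p_presence = votes.mean(axis=0)
disagreement = votes.std(axis=0)
print(disagreement.max().round(2))
```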
Layer | Variable | Importance |
---|---|---|
10 | BIO10 | 0.282 |
5 | BIO5 | 0.254 |
6 | BIO6 | 0.174 |
13 | BIO13 | 0.083 |
15 | BIO15 | 0.080 |
26 | Cultivated and Managed Vegetation | 0.079 |
12 | BIO12 | 0.045 |
29 | Snow/Ice | 0.003 |
If we assume that all the variables except one take their average value, what prediction does the model give as the remaining variable changes?
Equivalent to a mean-field approximation
Averaging the variables is \alert{masking a lot of variability}!
Alternative solution:
- Generate a grid for all the variables
- For every combination in this grid, use it as the stand-in for the variables being replaced
In practice: Monte-Carlo on a reasonable number of samples.
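Both flavours can be sketched for a single variable (synthetic stand-in data and an arbitrary variable index; the mean-field version pins every other variable to its average, the Monte-Carlo version keeps the joint structure of observed rows):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)
model = RandomForestClassifier(n_estimators=32, random_state=0).fit(X, y)

j = 0                                                # variable of interest (arbitrary)
grid = np.linspace(X[:, j].min(), X[:, j].max(), 50)

# Mean-field partial response: every other variable pinned to its average
pinned = np.tile(X.mean(axis=0), (grid.size, 1))
pinned[:, j] = grid
mean_field = model.predict_proba(pinned)[:, 1]

# Monte-Carlo version: the other variables keep the structure of observed rows
rng = np.random.default_rng(0)
rows = X[rng.choice(X.shape[0], size=100, replace=False)]
monte_carlo = np.empty(grid.size)
for k, value in enumerate(grid):
    stand_in = rows.copy()
    stand_in[:, j] = value
    monte_carlo[k] = model.predict_proba(stand_in)[:, 1].mean()
```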
- partial responses can only generate model-level information
- they break the structure of values for all predictors at the scale of a single observation
- their interpretation is unclear
- how much is the \alert{average prediction} modified by a specific variable having a specific value?
- it's based on game theory (but it's not actually game theory)
- many highly desirable properties! (see the sketch after this list)
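A sketch using the shap package on the forest (this assumes shap is installed; the shape of its output depends on the version, which the code accounts for). The mean absolute Shapley value per variable is one common model-level summary, comparable in spirit to the Shapley importance column in the next table:

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=31, weights=[0.66], random_state=0)
model = RandomForestClassifier(n_estimators=32, random_state=0).fit(X, y)

# TreeExplainer is the efficient explainer for tree ensembles
explainer = shap.TreeExplainer(model)
values = explainer.shap_values(X)

# For binary classifiers, shap returns either one array per class (a list)
# or a single array with a trailing class axis, depending on the version
values = values[1] if isinstance(values, list) else values[..., 1]

# Model-level summary: mean absolute Shapley value per variable
shapley_importance = np.abs(values).mean(axis=0)
print(np.argsort(shapley_importance)[::-1][:5])  # five most important variables
```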
Layer | Variable | Importance | Shapley importance |
---|---|---|---|
10 | BIO10 | 0.282 | 0.310 |
5 | BIO5 | 0.254 | 0.253 |
6 | BIO6 | 0.174 | 0.215 |
13 | BIO13 | 0.083 | 0.064 |
26 | Cultivated and Managed Vegetation | 0.079 | 0.063 |
12 | BIO12 | 0.045 | 0.037 |
15 | BIO15 | 0.080 | 0.032 |
29 | Snow/Ice | 0.003 | 0.026 |
- models we can train
- parameters can (should!) be tuned automatically
- we can use tools from explainable ML to give more clarity