Commit 67f623e: beta version ready

rtavenar committed May 17, 2020
1 parent 852a737
Showing 14 changed files with 92 additions and 88 deletions.
6 changes: 3 additions & 3 deletions _config.yml
@@ -64,7 +64,7 @@ jupyterhub_url : "" # The URL for your JupyterHub. If no URL,
jupyterhub_interact_text : "Interact" # The text that interact buttons will contain.

# Binder link settings
-use_binder_button : true # If 'true', add a binder button for interactive links
+use_binder_button : false # If 'true', add a binder button for interactive links
binderhub_url : "https://mybinder.org" # The URL for your BinderHub. If no URL, use ""
binder_repo_base : "https://github.com/" # The site on which the textbook repository is hosted
binder_repo_org : "rtavenar" # The username or organization that owns this repository
@@ -73,12 +73,12 @@ binder_repo_branch : "gh-pages" # The branch on which your textbo
binderhub_interact_text : "Interact" # The text that interact buttons will contain.

# Thebelab settings
-use_thebelab_button : true # If 'true', display a button to allow in-page running code cells with Thebelab
+use_thebelab_button : false # If 'true', display a button to allow in-page running code cells with Thebelab
thebelab_button_text : "Thebelab" # The text to display inside the Thebelab initialization button
codemirror_theme : "abcdef" # Theme for codemirror cells, for options see https://codemirror.net/doc/manual.html#config

# nbinteract settings
-use_show_widgets_button : true # If 'true', display a button to allow in-page running code cells with nbinteract
+use_show_widgets_button : false # If 'true', display a button to allow in-page running code cells with nbinteract

# Download settings
use_download_button : true # If 'true', display a button to download a zip file for the notebook
6 changes: 3 additions & 3 deletions content/parts/01/dtw.md
@@ -15,7 +15,7 @@ jupyter:

# Dynamic Time Warping

-This section covers my works related to Dynamic Time Warping for time series.
+This section covers works related to Dynamic Time Warping for time series.

<!-- #region {"tags": ["popout"]} -->
**Note.** In ``tslearn``, such time series would be represented as arrays of
@@ -48,8 +48,8 @@ optimization problem:

\begin{equation}
DTW(\mathbf{x}, \mathbf{x}^\prime) =
-    \sqrt{ \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)}
-        \sum_{(i, j) \in \pi} d(x_i, x^\prime_j)^2 }
+    \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)}
+        \sqrt{ \sum_{(i, j) \in \pi} d(x_i, x^\prime_j)^2 }
\label{eq:dtw}
\end{equation}

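A minimal dynamic-programming sketch of this quantity for univariate series (``tslearn.metrics.dtw`` computes the same score; names here are illustrative):

```python
import numpy

def dtw_distance(x, x_prime):
    """DTW as defined above: square root of the minimal sum of squared
    ground distances over all admissible alignment paths."""
    n, m = len(x), len(x_prime)
    cost = numpy.full((n + 1, m + 1), numpy.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d2 = (x[i - 1] - x_prime[j - 1]) ** 2
            # admissible moves: repeat x_i, repeat x'_j, or advance both
            cost[i, j] = d2 + min(cost[i - 1, j],
                                  cost[i, j - 1],
                                  cost[i - 1, j - 1])
    return numpy.sqrt(cost[n, m])
```
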
12 changes: 4 additions & 8 deletions content/parts/01/dtw/dtw_da.md
@@ -34,7 +34,7 @@ Optimal Transport for Domain Adaptation {% cite courty:hal-02112785 %}.
One significant difference however is that we rely on a reference modality for
alignment, which is guided by our application context.

-## Use case
+## Motivating use case

Phosphorus (P) transfer during storm events represents a significant part of
annual P loads in streams and contributes to eutrophication in downstream water
@@ -58,7 +58,7 @@ limit and test its ability to compare seasonal variability of P storm dynamics
in two headwater watersheds. Both watersheds are ca. 5 km², have similar
climate and geology, but differ in land use and P pressure intensity.

-## Method
+## Alignment-based resampling method

In the above-described setting, we have access to one modality (discharge,
commonly denoted $Q$) that is representative of the evolution of the flood.
Expand All @@ -69,8 +69,8 @@ Indeed, time series may have
1. different starting times due to the discharge threshold at which the
autosamplers were triggered,
2. different lengths and
-3. differences in phase that yield different positions of the discharge peak
-   and of concentration data points relative to the hydrograph.
+3. differences in phase that yield different temporal localization of the
+   discharge peak.

To align time series, we use the path associated to DTW.
This matching path can be viewed as the optimal way to perform point-wise
@@ -83,10 +83,6 @@ The reference discharge time series used in this study is chosen
as a storm event with full coverage of flow rise and flow recession phases.
Alternatively, one could choose a synthetic idealized storm hydrograph.

-As stated above, the continuity condition imposed on admissible paths results
-in each element of reference time series $\mathbf{x}^\text{ref}_\text{Q}$ being
-matched with at least one element in each discharge time series from the
-dataset.
We then use barycentric mapping based on obtained matches to realign other
modalities to the timestamps of the reference time series, as shown in the
following Figures:
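A minimal sketch of this alignment-based resampling, assuming ``tslearn``'s ``dtw_path`` and simple barycentric averaging of matched samples (variable names are illustrative):

```python
import numpy
from tslearn.metrics import dtw_path

def realign_to_reference(q_ref, q_event, concentrations):
    """Resample a modality observed along `q_event` (e.g. P concentrations)
    onto the timestamps of the reference discharge series `q_ref`."""
    path, _ = dtw_path(q_ref, q_event)  # matched index pairs (i, j)
    aligned = numpy.zeros(len(q_ref))
    counts = numpy.zeros(len(q_ref))
    for i, j in path:
        aligned[i] += concentrations[j]
        counts[i] += 1
    # path continuity ensures every reference timestamp is matched at least once
    return aligned / counts
```
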
16 changes: 8 additions & 8 deletions content/parts/01/dtw/dtw_gi.md
@@ -16,8 +16,8 @@ jupyter:
# DTW with Global Invariances

<!-- #region {"tags": ["popout"]} -->
-**Note.** This work was part of Titouan Vayer's PhD thesis.
-We were co-supervising Titouan together with Laetitia Chapel and Nicolas Courty.
+**Note.** This work is part of Titouan Vayer's PhD thesis.
+We are co-supervising Titouan together with Laetitia Chapel and Nicolas Courty.
<!-- #endregion -->

In this work we address the problem of comparing time series while taking
@@ -43,10 +43,8 @@ lie. More formally, we define Dynamic Time Warping with Global Invariances

\begin{equation}
\text{DTW-GI}(\mathbf{x}, \mathbf{x^\prime}) =
-    \sqrt{
-        \min_{f \in \mathcal{F}, \pi \in \mathcal{A}(\mathbf{x}, \mathbf{x^\prime})}
-        \sum_{(i, j) \in \pi} d(x_i, f(x^\prime_j))^2
-    } ,
+    \min_{f \in \mathcal{F}, \pi \in \mathcal{A}(\mathbf{x}, \mathbf{x^\prime})}
+        \sqrt{ \sum_{(i, j) \in \pi} d(x_i, f(x^\prime_j))^2 } \, ,
\label{eq:dtwgi}
\end{equation}

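When $\mathcal{F}$ is a family of orthogonal maps, this joint problem lends itself to block-coordinate descent alternating between DTW alignment and an orthogonal Procrustes update. A minimal sketch (assuming ``tslearn``'s ``dtw_path``; the actual solver may differ):

```python
import numpy
from tslearn.metrics import dtw_path

def dtw_gi(x, x_prime, n_iter=20):
    """Alternate (i) DTW for a fixed map f and (ii) a Procrustes update
    of f for a fixed alignment; x and x_prime have shape (sz, d)."""
    d = x.shape[1]
    rot = numpy.eye(d)  # current estimate of f (orthogonal map)
    for _ in range(n_iter):
        path, _ = dtw_path(x, x_prime @ rot.T)
        m = sum(numpy.outer(x[i], x_prime[j]) for i, j in path)
        u, _, vt = numpy.linalg.svd(m)
        rot = u @ vt  # best orthogonal map given current matches
    cost = numpy.sqrt(sum(numpy.linalg.norm(x[i] - rot @ x_prime[j]) ** 2
                          for i, j in path))
    return rot, path, cost
```
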
@@ -699,13 +697,15 @@ for idx_dataset, dataset_fun in enumerate(list_dataset_generators):

We also introduce soft counterparts following the definition of softDTW from
{% cite cuturi2017soft %}.
+In this case, optimization relies on gradient descent and a wider variety of
+feature space transformation families can be considered.

-We validate the utility of this new metric on real world
+We validate the utility of these similarity measures on real world
datasets on the tasks of human motion prediction (where motion is captured under
different points of view) and cover song identification (where song similarity
is defined up to a key transposition).
In both these settings, we observe that joint optimization on feature space
-transformation and temporal alignment improves over standard techniques that
+transformation and temporal alignment improves over standard approaches that
consider these as two independent steps.

## References
32 changes: 16 additions & 16 deletions content/parts/01/ot.md
@@ -22,8 +22,8 @@ distance that interpolates between Wasserstein distance between node feature
distributions and Gromov-Wasserstein distance between structures.

<!-- #region {"tags": ["popout"]} -->
-**Note.** This work was part of Titouan Vayer's PhD thesis.
-We were co-supervising Titouan together with Laetitia Chapel and Nicolas Courty.
+**Note.** This work is part of Titouan Vayer's PhD thesis.
+We are co-supervising Titouan together with Laetitia Chapel and Nicolas Courty.
<!-- #endregion -->

Here, we first introduce both Wasserstein and Gromov-Wasserstein distances and
@@ -49,9 +49,8 @@ beginning to end).
<!-- #endregion -->

\begin{equation}
-W_p(\mu, \mu') = \left(
-    \min_{\pi \in \Pi(\mu, \mu^\prime)}
-    \sum_{i,j} d(x_i, x^\prime_j)^p \pi_{i,j} \right)^{\frac{1}{p}}
+W_p(\mu, \mu') = \min_{\pi \in \Pi(\mu, \mu^\prime)}
+    \left(\sum_{i,j} d(x_i, x^\prime_j)^p \pi_{i,j} \right)^{\frac{1}{p}}
\label{eq:wass}
\end{equation}

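A minimal numerical sketch of this distance between two discrete distributions, assuming the POT library (``ot.dist`` / ``ot.emd2``; values are toy data):

```python
import numpy
import ot  # POT: Python Optimal Transport

x = numpy.array([[0.0], [1.0], [2.0]])  # support of mu
x_prime = numpy.array([[0.5], [1.5]])   # support of mu'
mu = numpy.ones(3) / 3                  # uniform weights
mu_prime = numpy.ones(2) / 2

p = 2
costs = ot.dist(x, x_prime, metric="euclidean") ** p  # d(x_i, x'_j)^p
w_p = ot.emd2(mu, mu_prime, costs) ** (1.0 / p)       # Wasserstein-p distance
```
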
@@ -72,8 +71,8 @@ distances, as illustrated below:
The corresponding distance is the Gromov-Wasserstein distance, defined as:

\begin{equation}
-GW_p(\mu, \mu') = \left(
-    \min_{\pi \in \Pi(\mu, \mu^\prime)}
+GW_p(\mu, \mu') = \min_{\pi \in \Pi(\mu, \mu^\prime)}
+    \left(
\sum_{i,j,k,l}
\left| d_\mu(x_i, x_k) - d_{\mu'}(x^\prime_j, x^\prime_l) \right|^p
\pi_{i,j} \pi_{k,l}
@@ -112,13 +111,13 @@ More formally, we consider undirected labeled graphs as tuples of the form $\mat
$(\mathcal{V},\mathcal{E})$ are the set of vertices and edges of the graph.
$\ell_f: \mathcal{V} \rightarrow \Omega_f$ is a labelling function which
associates each vertex $v_{i} \in \mathcal{V}$ with a feature
-$a_{i}\stackrel{\text{def}}{=}\ell_f(v_{i})$ in some feature metric space
+$a_{i} = \ell_f(v_{i})$ in some feature metric space
$(\Omega_f,d)$.
We will denote by _feature information_ the set of all the features
$\{a_{i}\}_{i}$ of the graph.
Similarly, $\ell_s: \mathcal{V} \rightarrow \Omega_s$ maps a vertex $v_i$ from
the graph to its structure representation
-$x_{i} \stackrel{\text{def}}{=} \ell_s(v_{i})$ in some structure space
+$x_{i} = \ell_s(v_{i})$ in some structure space
$(\Omega_s,C)$ specific to each graph.
$C : \Omega_s \times \Omega_s \rightarrow \mathbb{R_{+}}$ is a symmetric
application which aims at measuring the similarity between the nodes in the
@@ -178,7 +177,7 @@ E_{q}(\mathcal{G}, \mathcal{G}', \pi) =

The FGW distance looks for the coupling $\pi$ between vertices of the
graphs that minimizes the cost $E_{q}$ which is a linear combination of a cost
-$d(a_{i},a^\prime_j)$ of transporting one feature $a_{i}$ to a feature $a^\prime_j$
+$d(a_{i},a^\prime_j)$ of transporting feature $a_{i}$ to $a^\prime_j$
and a cost $|C(i,k)-C'(j,l)|$ of transporting pairs of nodes in each structure.
As such, the optimal coupling tends to associate pairs of feature and
structure points with similar distances within each structure pair and with
Expand All @@ -200,14 +199,14 @@ between the structures;
We also define a continuous counterpart for FGW which comes with a
concentration inequality in {% cite vayer:hal-02174316 %}.

-We have presented a Conditional Gradient algorithm for optimization on the
+We present a Conditional Gradient algorithm for optimization on the
above-defined loss.
-We have also exposed a Block Coordinate Descent algorithm to compute graph
+We also provide a Block Coordinate Descent algorithm to compute graph
barycenters _w.r.t._ FGW.

### Results

-We show that FGW allows to extract meaningful barycenters:
+We have shown that FGW allows the extraction of meaningful barycenters:

<!-- #region {"tags": ["popout"]} -->
**Note.** The code provided here uses integration of FGW provided by the
@@ -337,9 +336,10 @@ draw_graph(barycenter)
plt.title('FGW Barycenter');
```

-We also show that these barycenters can be used for graph clustering.
-Finally, we exhibit classification results for FGW embedded in a Gaussian kernel
-SVM which leads to state-of-the-art performance (even outperforming graph
+These barycenters can be used for graph clustering.
+Finally, we have exhibited classification results for FGW embedded in a
+Gaussian kernel SVM, which leads to state-of-the-art performance
+(even outperforming graph
neural network approaches) on a wide range of graph classification problems.

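As a minimal usage sketch, FGW between two toy labeled graphs can be computed with POT (assuming its ``ot.gromov.fused_gromov_wasserstein2`` routine; all values below are illustrative):

```python
import numpy
import ot

# structure matrices (e.g. shortest-path distances) and node features
C1 = numpy.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
C2 = numpy.array([[0.0, 1.0], [1.0, 0.0]])
a1 = numpy.array([[0.0], [1.0], [2.0]])
a2 = numpy.array([[0.5], [1.5]])

M = ot.dist(a1, a2)      # feature transport costs d(a_i, a'_j)
h1 = numpy.ones(3) / 3   # uniform node weights
h2 = numpy.ones(2) / 2

fgw_value = ot.gromov.fused_gromov_wasserstein2(M, C1, C2, h1, h2, alpha=0.5)
```
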
## References
15 changes: 9 additions & 6 deletions content/parts/01/temporal_kernel.md
@@ -42,8 +42,11 @@ between feature sets embedded in the Reproducing Kernel Hilbert Space (RKHS)
associated with $K$:

\begin{equation}
-SQFD(\mathbf{x}, \mathbf{x}^\prime)^2 = K(\mathbf{x}, \mathbf{x}) +
-K(\mathbf{x}^\prime, \mathbf{x}^\prime) - 2 K(\mathbf{x}, \mathbf{x}^\prime).
+SQFD(\mathbf{x}, \mathbf{x}^\prime) =
+    \sqrt{K(\mathbf{x}, \mathbf{x})
+          + K(\mathbf{x}^\prime, \mathbf{x}^\prime)
+          - 2 K(\mathbf{x}, \mathbf{x}^\prime)}
+    \, .
\end{equation}

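A minimal sketch of this computation, under the assumption that $K(\mathbf{x}, \mathbf{x}^\prime)$ averages an RBF minikernel over all pairs of set elements (the exact aggregation is the one defined for $K$ above):

```python
import numpy

def mean_pairwise_kernel(feats_a, feats_b, gamma=1.0):
    """Average RBF minikernel over all pairs of set elements."""
    sq_dists = ((feats_a[:, None, :] - feats_b[None, :, :]) ** 2).sum(axis=-1)
    return numpy.exp(-gamma * sq_dists).mean()

def sqfd(feats_a, feats_b, gamma=1.0):
    return numpy.sqrt(mean_pairwise_kernel(feats_a, feats_a, gamma)
                      + mean_pairwise_kernel(feats_b, feats_b, gamma)
                      - 2.0 * mean_pairwise_kernel(feats_a, feats_b, gamma))
```
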
## Local temporal kernel
@@ -166,7 +169,7 @@ ax_s_y.plot(- s_y1, numpy.arange(s_y1.shape[0])[::-1],
"b-", linewidth=3.);
```

-$k_t$ is then a RBF kernel itself, and kernel approximation techniques such as
+$k_t$ is then an RBF kernel itself, and
Random Fourier Features {% cite NIPS2007_3182 %} can be
used to approximate it with a linear kernel.
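A minimal sketch of Random Fourier Features for an RBF kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$, such that a dot product of the maps approximates the kernel value:

```python
import numpy

def random_fourier_features(x, n_features=128, gamma=1.0, seed=0):
    """phi(x) with <phi(x), phi(y)> ~ exp(-gamma * ||x - y||^2)."""
    rng = numpy.random.default_rng(seed)
    x = numpy.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)  # scalar inputs, e.g. timestamps
    d = x.shape[1]
    # frequencies drawn from the spectral distribution of the RBF kernel
    w = rng.normal(scale=numpy.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * numpy.pi, size=n_features)
    return numpy.sqrt(2.0 / n_features) * numpy.cos(x @ w + b)
```
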

@@ -193,7 +196,7 @@ computation $b_\phi(\cdot)$ in the feature space (which can be done offline)
followed by (ii) a Euclidean distance computation in $O(D)$ time, where $D$ is
the dimension of the feature map $\phi(x)$.
Overall, we have a distance between timestamped feature sets whose
-complexity can be tuned via the map dimensionality $D$.
+precision / complexity tradeoff can be tuned via the map dimensionality $D$.

## Evaluation

@@ -205,9 +208,9 @@ computer vision community at the time of this work.
However, in our small data context, they proved useful for the task at hand.
<!-- #endregion -->

-In order to evaluate the classifier presented above, we used the UCR Time
+In order to evaluate the method presented above, we have used the UCR Time
Series Classification archive, which, at the time, was made of monodimensional
-data only.
+time series only.
We decided not to work on raw data but rather extract local features to
describe our time series.
We chose to rely on temporal SIFT features, that we had introduced in
3 changes: 2 additions & 1 deletion content/parts/01_metrics.md
@@ -16,4 +16,5 @@ Second, in [Sec. 1.2](01/dtw.html), time series are treated as sequences, which
means that only ordering is of importance (time delay between observations
is ignored) and variants of the Dynamic Time Warping algorithm are used.
Finally, in [Sec. 1.3](01/ot.html), undirected labeled graphs are seen as
-discrete distributions over the feature-structure product space.
+discrete distributions over the feature-structure product space and we rely on
+optimal transport distances.
15 changes: 7 additions & 8 deletions content/parts/02/early.md
@@ -39,8 +39,7 @@ The cost function is of the following form:
\label{eq:loss_early}
\end{equation}

-where $\hat{y}$ is the class predicted by the model,
-$\mathcal{L}_c(\cdot,\cdot,\cdot)$ is a
+where $\mathcal{L}_c(\cdot,\cdot,\cdot)$ is a
classification loss and $t$ is the timestamp at which a
decision is triggered by the system.
In this setting, $\alpha$ drives the tradeoff between accuracy and earliness
@@ -85,8 +84,8 @@ We are co-supervising François together with Laetitia Chapel and Chloé Friguet

Relying on Equation \eqref{eq:dachraoui} to decide prediction time can be
tricky. We show in the following that in some cases (related to specific
-configurations of training time confusion matrices), such an approach will lead
-to undesirable behaviors.
+configurations of the training time confusion matrices), such an approach will
+lead to undesirable behaviors.

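For reference, the cost-based triggering rule at stake can be sketched as follows, assuming expected costs for future timestamps have already been estimated (their estimation is precisely what is discussed here):

```python
import numpy

def should_trigger(expected_costs):
    """Trigger a prediction now iff no future timestamp has a lower
    expected cost; `expected_costs[tau]` estimates the cost of deciding
    at time t + tau given the series observed up to time t."""
    return int(numpy.argmin(expected_costs)) == 0
```
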
Using Bayes rule, Equation \eqref{eq:dachraoui} can be re-written

@@ -136,8 +135,8 @@ cost of a larger computational complexity.

We also showed that in order to limit inference time complexity, one could
learn a _decision triggering classifier_ that, based on the time series
-$\mathbf{x}_{\rightarrow t}$
-observed up to time $t$ predicts whether a decision should be triggered or not.
+$\mathbf{x}_{\rightarrow t}$, predicts whether a decision should be triggered
+or not.
In this setting, the target values $\gamma_t$ used to train this
_decision triggering classifier_
were computed from expected costs $f_\tau$ presented above:
@@ -180,9 +179,9 @@ We have hence proposed a representation learning framework that
covers these three limitations {% cite ruwurm:hal-02174314 %}.

In more detail, we rely on a feature extraction module (that can either be
-made of convolutional or recurrent submodules) to extract a fixed-sized
+made of causal convolutions or recurrent submodules) to extract a fixed-sized
representation $h_t$ from an incoming time series $\mathbf{x}_{\rightarrow t}$.
-An important point here is that this feature extractor can operate on time
+An important point here is that this feature extractor should operate on time
series whatever their length (and hence a different feature extractor need not
be learned for each time series length).
Then, this feature is provided as input to two different heads, as shown in the
22 changes: 11 additions & 11 deletions content/parts/02/shapelets_cnn.md
@@ -16,27 +16,27 @@ jupyter:
# Shapelet-based Representations and Convolutional Models

In this section, we will cover works that either relate to the Shapelet
-representation for time series or to the family of (1d) Convolutional Neural
+representation for time series or to the family of (1D) Convolutional Neural
Networks, since these two families of methods are very similar in spirit
{% cite lods:hal-01565207 %}.

## Data Augmentation for Time Series Classification

<!-- #region {"tags": ["popout"]} -->
-**Note.** This work is part of Arthur Le Guennec's Master internship.
+**Note.** This work was part of Arthur Le Guennec's Master internship.
We were co-supervising Arthur together with Simon Malinowski.
<!-- #endregion -->

We have shown in {% cite leguennec:halshs-01357973 %} that augmenting time
series classification datasets was an efficient way to improve generalization
-for shallow Convolutional Neural Networks.
+for Convolutional Neural Networks.
The data augmentation strategies investigated in this work are
local warping and window slicing, both of which lead to improvements.
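Minimal sketches of these two strategies on a univariate series (parameter values are illustrative, not those used in the paper):

```python
import numpy

def window_slice(series, ratio=0.9, seed=0):
    """Crop a random window covering `ratio` of the series length."""
    rng = numpy.random.default_rng(seed)
    sz = int(ratio * len(series))
    start = rng.integers(0, len(series) - sz + 1)
    return series[start:start + sz]

def local_warp(series, segment_ratio=0.1, warp_factor=2.0, seed=0):
    """Linearly stretch a randomly chosen segment by `warp_factor`."""
    rng = numpy.random.default_rng(seed)
    seg_sz = max(2, int(segment_ratio * len(series)))
    start = rng.integers(0, len(series) - seg_sz)
    segment = series[start:start + seg_sz]
    warped = numpy.interp(
        numpy.linspace(0, seg_sz - 1, int(seg_sz * warp_factor)),
        numpy.arange(seg_sz), segment)
    return numpy.concatenate([series[:start], warped,
                              series[start + seg_sz:]])
```
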

## Learning to Mimic a Target Distance

<!-- #region {"tags": ["popout"]} -->
-**Note.** This work is part of Arnaud Lods' Master internship.
+**Note.** This work was part of Arnaud Lods' Master internship.
We were co-supervising Arnaud together with Simon Malinowski.
<!-- #endregion -->

Expand All @@ -50,9 +50,9 @@ used similarity measure for time series.
However, it suffers from its non differentiability and the fact that it does
not satisfy metric properties.
Our goal in {% cite lods:hal-01565207 %} was to introduce a Shapelet model that
-extracts latent representations such that Euclidean distance in the latent
-space is as close as possible to Dynamic Time Warping between original time
-series.
+extracts latent representations such that Euclidean distance between latent
+representations is as close as possible to Dynamic Time Warping between original
+time series.
The resulting model is an instance of a Siamese Network:

![](../../images/siamese_ldps.png)
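
The training criterion can be sketched as follows, assuming a shared encoder has produced latent representations for the two series of a pair and their DTW value has been precomputed (a sketch, not the exact implementation from the paper):

```python
import numpy

def siamese_dtw_loss(z_a, z_b, dtw_target):
    """Squared gap between latent Euclidean distance and target DTW."""
    return (numpy.linalg.norm(z_a - z_b) - dtw_target) ** 2
```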
@@ -433,8 +433,8 @@ Semi-Sparse Group Lasso) loss that allows to enforce sparsity on some individual
variables only:

\begin{equation}
-\mathcal{L}^{\mathrm{SSGL}}(y, \hat{y}, \boldsymbol{\theta}) =
-    \mathcal{L}(y, \hat{y}, \boldsymbol{\theta})
+\mathcal{L}^{\mathrm{SSGL}}(\mathbf{x}, y, \boldsymbol{\theta}) =
+    \mathcal{L}(\mathbf{x}, y, \boldsymbol{\theta})
+ \alpha \lambda
\left\| \mathbf{M}_\text{ind} \boldsymbol{\beta} \right\|_1
+ (1-\alpha) \lambda \sum_{k=1}^{K} \sqrt{p_k}
@@ -447,7 +447,7 @@ features in our random shapelet case), $\boldsymbol{\theta}$ is the set of
all model weights, including weights $\boldsymbol{\beta}$ that are directly
connected to the features (_ie._ these are weights from the first layer), that
are organized in groups $\boldsymbol{\beta}^{(k)}$ of size $p_k$ ($p_k=2$ in the
-random shapelet context), each group corresponding to a different shapelet.
+random shapelet context, each group corresponding to a different shapelet).

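A minimal sketch of this penalty, assuming $\mathbf{M}_\text{ind}$ is encoded as a 0/1 mask over individual variables and groups are given as index arrays:

```python
import numpy

def ssgl_penalty(beta, groups, ind_mask, alpha, lam):
    """l1 on selected individual variables plus a group-lasso term."""
    l1_term = numpy.abs(ind_mask * beta).sum()
    group_term = sum(numpy.sqrt(len(idx)) * numpy.linalg.norm(beta[idx])
                     for idx in groups)
    return alpha * lam * l1_term + (1.0 - alpha) * lam * group_term
```
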
```python tags=["hide_input"]
%config InlineBackend.figure_format = 'svg'
@@ -610,7 +610,7 @@ terms of both Mean Squared Error (MSE) and estimation of zero coefficients.

When applied to the specific case of random shapelets, we have shown that this
led to improved accuracy as soon as datasets are large enough for coefficients
-to be properly estimated.
+to be estimated properly.

## Learning Shapelets that Look Like Time Series Snippets

