Commit 67f623e: beta version ready

rtavenar committed May 17, 2020
1 parent 852a737
Showing 14 changed files with 92 additions and 88 deletions.
6 changes: 3 additions & 3 deletions _config.yml
@@ -64,7 +64,7 @@ jupyterhub_url : "" # The URL for your JupyterHub. If no URL,
jupyterhub_interact_text : "Interact" # The text that interact buttons will contain.

# Binder link settings
-use_binder_button : true # If 'true', add a binder button for interactive links
+use_binder_button : false # If 'true', add a binder button for interactive links
binderhub_url : "https://mybinder.org" # The URL for your BinderHub. If no URL, use ""
binder_repo_base : "https://github.com/" # The site on which the textbook repository is hosted
binder_repo_org : "rtavenar" # The username or organization that owns this repository
@@ -73,12 +73,12 @@ binder_repo_branch : "gh-pages" # The branch on which your textbo
binderhub_interact_text : "Interact" # The text that interact buttons will contain.

# Thebelab settings
-use_thebelab_button : true # If 'true', display a button to allow in-page running code cells with Thebelab
+use_thebelab_button : false # If 'true', display a button to allow in-page running code cells with Thebelab
thebelab_button_text : "Thebelab" # The text to display inside the Thebelab initialization button
codemirror_theme : "abcdef" # Theme for codemirror cells, for options see https://codemirror.net/doc/manual.html#config

# nbinteract settings
-use_show_widgets_button : true # If 'true', display a button to allow in-page running code cells with nbinteract
+use_show_widgets_button : false # If 'true', display a button to allow in-page running code cells with nbinteract

# Download settings
use_download_button : true # If 'true', display a button to download a zip file for the notebook
6 changes: 3 additions & 3 deletions content/parts/01/dtw.md
@@ -15,7 +15,7 @@ jupyter:

# Dynamic Time Warping

-This section covers my works related to Dynamic Time Warping for time series.
+This section covers works related to Dynamic Time Warping for time series.

<!-- #region {"tags": ["popout"]} -->
**Note.** In ``tslearn``, such time series would be represented as arrays of
@@ -48,8 +48,8 @@ optimization problem:

\begin{equation}
DTW(\mathbf{x}, \mathbf{x}^\prime) =
-    \sqrt{ \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)}
-        \sum_{(i, j) \in \pi} d(x_i, x^\prime_j)^2 }
+    \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)}
+        \sqrt{ \sum_{(i, j) \in \pi} d(x_i, x^\prime_j)^2 }
\label{eq:dtw}
\end{equation}

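A minimal dynamic-programming sketch of this quantity for univariate series (``tslearn.metrics.dtw`` computes the same score; names here are illustrative):

```python
import numpy

def dtw_distance(x, x_prime):
    """DTW as defined above: square root of the minimal sum of squared
    ground distances over all admissible alignment paths."""
    n, m = len(x), len(x_prime)
    cost = numpy.full((n + 1, m + 1), numpy.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d2 = (x[i - 1] - x_prime[j - 1]) ** 2
            # admissible moves: repeat x_i, repeat x'_j, or advance both
            cost[i, j] = d2 + min(cost[i - 1, j],
                                  cost[i, j - 1],
                                  cost[i - 1, j - 1])
    return numpy.sqrt(cost[n, m])
```
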
12 changes: 4 additions & 8 deletions content/parts/01/dtw/dtw_da.md
@@ -34,7 +34,7 @@ Optimal Transport for Domain Adaptation {% cite courty:hal-02112785 %}.
One significant difference however is that we rely on a reference modality for
alignment, which is guided by our application context.

-## Use case
+## Motivating use case

Phosphorus (P) transfer during storm events represents a significant part of
annual P loads in streams and contributes to eutrophication in downstream water
@@ -58,7 +58,7 @@ limit and test its ability to compare seasonal variability of P storm dynamics
in two headwater watersheds. Both watersheds are ca. 5 km², have similar
climate and geology, but differ in land use and P pressure intensity.

-## Method
+## Alignment-based resampling method

In the above-described setting, we have access to one modality (discharge,
commonly denoted $Q$) that is representative of the evolution of the flood.
Expand All @@ -69,8 +69,8 @@ Indeed, time series may have
1. different starting times due to the discharge threshold at which the
autosamplers were triggered,
2. different lengths and
-3. differences in phase that yield different positions of the discharge peak
-   and of concentration data points relative to the hydrograph.
+3. differences in phase that yield different temporal localization of the
+   discharge peak.

To align time series, we use the path associated to DTW.
This matching path can be viewed as the optimal way to perform point-wise
@@ -83,10 +83,6 @@ The reference discharge time series used in this study is chosen
as a storm event with full coverage of flow rise and flow recession phases.
Alternatively, one could choose a synthetic idealized storm hydrograph.

-As stated above, the continuity condition imposed on admissible paths results
-in each element of reference time series $\mathbf{x}^\text{ref}_\text{Q}$ being
-matched with at least one element in each discharge time series from the
-dataset.
We then use barycentric mapping based on obtained matches to realign other
modalities to the timestamps of the reference time series, as shown in the
following Figures:
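A minimal sketch of this alignment-based resampling, assuming ``tslearn``'s ``dtw_path`` and simple barycentric averaging of matched samples (variable names are illustrative):

```python
import numpy
from tslearn.metrics import dtw_path

def realign_to_reference(q_ref, q_event, concentrations):
    """Resample a modality observed along `q_event` (e.g. P concentrations)
    onto the timestamps of the reference discharge series `q_ref`."""
    path, _ = dtw_path(q_ref, q_event)  # matched index pairs (i, j)
    aligned = numpy.zeros(len(q_ref))
    counts = numpy.zeros(len(q_ref))
    for i, j in path:
        aligned[i] += concentrations[j]
        counts[i] += 1
    # path continuity ensures every reference timestamp is matched at least once
    return aligned / counts
```
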
16 changes: 8 additions & 8 deletions content/parts/01/dtw/dtw_gi.md
@@ -16,8 +16,8 @@ jupyter:
# DTW with Global Invariances

<!-- #region {"tags": ["popout"]} -->
-**Note.** This work was part of Titouan Vayer's PhD thesis.
-We were co-supervising Titouan together with Laetitia Chapel and Nicolas Courty.
+**Note.** This work is part of Titouan Vayer's PhD thesis.
+We are co-supervising Titouan together with Laetitia Chapel and Nicolas Courty.
<!-- #endregion -->

In this work we address the problem of comparing time series while taking
@@ -43,10 +43,8 @@ lie. More formally, we define Dynamic Time Warping with Global Invariances

\begin{equation}
\text{DTW-GI}(\mathbf{x}, \mathbf{x^\prime}) =
-    \sqrt{
-        \min_{f \in \mathcal{F}, \pi \in \mathcal{A}(\mathbf{x}, \mathbf{x^\prime})}
-        \sum_{(i, j) \in \pi} d(x_i, f(x^\prime_j))^2
-    } ,
+    \min_{f \in \mathcal{F}, \pi \in \mathcal{A}(\mathbf{x}, \mathbf{x^\prime})}
+        \sqrt{ \sum_{(i, j) \in \pi} d(x_i, f(x^\prime_j))^2 } \, ,
\label{eq:dtwgi}
\end{equation}

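When $\mathcal{F}$ is a family of orthogonal maps, this joint problem lends itself to block-coordinate descent alternating between DTW alignment and an orthogonal Procrustes update. A minimal sketch (assuming ``tslearn``'s ``dtw_path``; the actual solver may differ):

```python
import numpy
from tslearn.metrics import dtw_path

def dtw_gi(x, x_prime, n_iter=20):
    """Alternate (i) DTW for a fixed map f and (ii) a Procrustes update
    of f for a fixed alignment; x and x_prime have shape (sz, d)."""
    d = x.shape[1]
    rot = numpy.eye(d)  # current estimate of f (orthogonal map)
    for _ in range(n_iter):
        path, _ = dtw_path(x, x_prime @ rot.T)
        m = sum(numpy.outer(x[i], x_prime[j]) for i, j in path)
        u, _, vt = numpy.linalg.svd(m)
        rot = u @ vt  # best orthogonal map given current matches
    cost = numpy.sqrt(sum(numpy.linalg.norm(x[i] - rot @ x_prime[j]) ** 2
                          for i, j in path))
    return rot, path, cost
```
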
@@ -699,13 +697,15 @@ for idx_dataset, dataset_fun in enumerate(list_dataset_generators):

We also introduce soft counterparts following the definition of softDTW from
{% cite cuturi2017soft %}.
+In this case, optimization relies on gradient descent and a wider variety of
+feature space transformation families can be considered.

-We validate the utility of this new metric on real world
+We validate the utility of these similarity measures on real world
datasets on the tasks of human motion prediction (where motion is captured under
different points of view) and cover song identification (where song similarity
is defined up to a key transposition).
In both these settings, we observe that joint optimization on feature space
-transformation and temporal alignment improves over standard techniques that
+transformation and temporal alignment improves over standard approaches that
consider these as two independent steps.

## References
32 changes: 16 additions & 16 deletions content/parts/01/ot.md
@@ -22,8 +22,8 @@ distance that interpolates between Wasserstein distance between node feature
distributions and Gromov-Wasserstein distance between structures.

<!-- #region {"tags": ["popout"]} -->
-**Note.** This work was part of Titouan Vayer's PhD thesis.
-We were co-supervising Titouan together with Laetitia Chapel and Nicolas Courty.
+**Note.** This work is part of Titouan Vayer's PhD thesis.
+We are co-supervising Titouan together with Laetitia Chapel and Nicolas Courty.
<!-- #endregion -->

Here, we first introduce both Wasserstein and Gromov-Wasserstein distances and
@@ -49,9 +49,8 @@ beginning to end).
<!-- #endregion -->

\begin{equation}
-W_p(\mu, \mu') = \left(
-    \min_{\pi \in \Pi(\mu, \mu^\prime)}
-    \sum_{i,j} d(x_i, x^\prime_j)^p \pi_{i,j} \right)^{\frac{1}{p}}
+W_p(\mu, \mu') = \min_{\pi \in \Pi(\mu, \mu^\prime)}
+    \left(\sum_{i,j} d(x_i, x^\prime_j)^p \pi_{i,j} \right)^{\frac{1}{p}}
\label{eq:wass}
\end{equation}

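A minimal numerical sketch of this distance between two discrete distributions, assuming the POT library (``ot.dist`` / ``ot.emd2``; values are toy data):

```python
import numpy
import ot  # POT: Python Optimal Transport

x = numpy.array([[0.0], [1.0], [2.0]])  # support of mu
x_prime = numpy.array([[0.5], [1.5]])   # support of mu'
mu = numpy.ones(3) / 3                  # uniform weights
mu_prime = numpy.ones(2) / 2

p = 2
costs = ot.dist(x, x_prime, metric="euclidean") ** p  # d(x_i, x'_j)^p
w_p = ot.emd2(mu, mu_prime, costs) ** (1.0 / p)       # Wasserstein-p distance
```
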
@@ -72,8 +71,8 @@ distances, as illustrated below:
The corresponding distance is the Gromov-Wasserstein distance, defined as:

\begin{equation}
-GW_p(\mu, \mu') = \left(
-    \min_{\pi \in \Pi(\mu, \mu^\prime)}
+GW_p(\mu, \mu') = \min_{\pi \in \Pi(\mu, \mu^\prime)}
+    \left(
\sum_{i,j,k,l}
\left| d_\mu(x_i, x_k) - d_{\mu'}(x^\prime_j, x^\prime_l) \right|^p
\pi_{i,j} \pi_{k,l}
@@ -112,13 +111,13 @@ More formally, we consider undirected labeled graphs as tuples of the form $\mat
$(\mathcal{V},\mathcal{E})$ are the set of vertices and edges of the graph.
$\ell_f: \mathcal{V} \rightarrow \Omega_f$ is a labelling function which
associates each vertex $v_{i} \in \mathcal{V}$ with a feature
-$a_{i}\stackrel{\text{def}}{=}\ell_f(v_{i})$ in some feature metric space
+$a_{i} = \ell_f(v_{i})$ in some feature metric space
$(\Omega_f,d)$.
We will denote by _feature information_ the set of all the features
$\{a_{i}\}_{i}$ of the graph.
Similarly, $\ell_s: \mathcal{V} \rightarrow \Omega_s$ maps a vertex $v_i$ from
the graph to its structure representation
-$x_{i} \stackrel{\text{def}}{=} \ell_s(v_{i})$ in some structure space
+$x_{i} = \ell_s(v_{i})$ in some structure space
$(\Omega_s,C)$ specific to each graph.
$C : \Omega_s \times \Omega_s \rightarrow \mathbb{R_{+}}$ is a symmetric
application which aims at measuring the similarity between the nodes in the
@@ -178,7 +177,7 @@ E_{q}(\mathcal{G}, \mathcal{G}', \pi) =

The FGW distance looks for the coupling $\pi$ between vertices of the
graphs that minimizes the cost $E_{q}$ which is a linear combination of a cost
-$d(a_{i},a^\prime_j)$ of transporting one feature $a_{i}$ to a feature $a^\prime_j$
+$d(a_{i},a^\prime_j)$ of transporting feature $a_{i}$ to $a^\prime_j$
and a cost $|C(i,k)-C'(j,l)|$ of transporting pairs of nodes in each structure.
As such, the optimal coupling tends to associate pairs of feature and
structure points with similar distances within each structure pair and with
Expand All @@ -200,14 +199,14 @@ between the structures;
We also define a continuous counterpart for FGW which comes with a
concentration inequality in {% cite vayer:hal-02174316 %}.

-We have presented a Conditional Gradient algorithm for optimization on the
+We present a Conditional Gradient algorithm for optimization on the
above-defined loss.
-We have also exposed a Block Coordinate Descent algorithm to compute graph
+We also provide a Block Coordinate Descent algorithm to compute graph
barycenters _w.r.t._ FGW.

### Results

-We show that FGW allows to extract meaningful barycenters:
+We have shown that FGW allows the extraction of meaningful barycenters:

<!-- #region {"tags": ["popout"]} -->
**Note.** The code provided here uses integration of FGW provided by the
@@ -337,9 +336,10 @@ draw_graph(barycenter)
plt.title('FGW Barycenter');
```

-We also show that these barycenters can be used for graph clustering.
-Finally, we exhibit classification results for FGW embedded in a Gaussian kernel
-SVM which leads to state-of-the-art performance (even outperforming graph
+These barycenters can be used for graph clustering.
+Finally, we have exhibited classification results for FGW embedded in a
+Gaussian kernel SVM, which leads to state-of-the-art performance
+(even outperforming graph
neural network approaches) on a wide range of graph classification problems.

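As a minimal usage sketch, FGW between two toy labeled graphs can be computed with POT (assuming its ``ot.gromov.fused_gromov_wasserstein2`` routine; all values below are illustrative):

```python
import numpy
import ot

# structure matrices (e.g. shortest-path distances) and node features
C1 = numpy.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
C2 = numpy.array([[0.0, 1.0], [1.0, 0.0]])
a1 = numpy.array([[0.0], [1.0], [2.0]])
a2 = numpy.array([[0.5], [1.5]])

M = ot.dist(a1, a2)      # feature transport costs d(a_i, a'_j)
h1 = numpy.ones(3) / 3   # uniform node weights
h2 = numpy.ones(2) / 2

fgw_value = ot.gromov.fused_gromov_wasserstein2(M, C1, C2, h1, h2, alpha=0.5)
```
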
## References
15 changes: 9 additions & 6 deletions content/parts/01/temporal_kernel.md
@@ -42,8 +42,11 @@ between feature sets embedded in the Reproducing Kernel Hilbert Space (RKHS)
associated with $K$:

\begin{equation}
-SQFD(\mathbf{x}, \mathbf{x}^\prime)^2 = K(\mathbf{x}, \mathbf{x}) +
-K(\mathbf{x}^\prime, \mathbf{x}^\prime) - 2 K(\mathbf{x}, \mathbf{x}^\prime).
+SQFD(\mathbf{x}, \mathbf{x}^\prime) =
+    \sqrt{K(\mathbf{x}, \mathbf{x})
+          + K(\mathbf{x}^\prime, \mathbf{x}^\prime)
+          - 2 K(\mathbf{x}, \mathbf{x}^\prime)}
+    \, .
\end{equation}

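A minimal sketch of this computation, under the assumption that $K(\mathbf{x}, \mathbf{x}^\prime)$ averages an RBF minikernel over all pairs of set elements (the exact aggregation is the one defined for $K$ above):

```python
import numpy

def mean_pairwise_kernel(feats_a, feats_b, gamma=1.0):
    """Average RBF minikernel over all pairs of set elements."""
    sq_dists = ((feats_a[:, None, :] - feats_b[None, :, :]) ** 2).sum(axis=-1)
    return numpy.exp(-gamma * sq_dists).mean()

def sqfd(feats_a, feats_b, gamma=1.0):
    return numpy.sqrt(mean_pairwise_kernel(feats_a, feats_a, gamma)
                      + mean_pairwise_kernel(feats_b, feats_b, gamma)
                      - 2.0 * mean_pairwise_kernel(feats_a, feats_b, gamma))
```
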
## Local temporal kernel
@@ -166,7 +169,7 @@ ax_s_y.plot(- s_y1, numpy.arange(s_y1.shape[0])[::-1],
"b-", linewidth=3.);
```

-$k_t$ is then a RBF kernel itself, and kernel approximation techniques such as
+$k_t$ is then an RBF kernel itself, and
Random Fourier Features {% cite NIPS2007_3182 %} can be
used to approximate it with a linear kernel.
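A minimal sketch of Random Fourier Features for an RBF kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$, such that a dot product of the maps approximates the kernel value:

```python
import numpy

def random_fourier_features(x, n_features=128, gamma=1.0, seed=0):
    """phi(x) with <phi(x), phi(y)> ~ exp(-gamma * ||x - y||^2)."""
    rng = numpy.random.default_rng(seed)
    x = numpy.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x.reshape(-1, 1)  # scalar inputs, e.g. timestamps
    d = x.shape[1]
    # frequencies drawn from the spectral distribution of the RBF kernel
    w = rng.normal(scale=numpy.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * numpy.pi, size=n_features)
    return numpy.sqrt(2.0 / n_features) * numpy.cos(x @ w + b)
```
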

@@ -193,7 +196,7 @@ computation $b_\phi(\cdot)$ in the feature space (which can be done offline)
followed by (ii) a Euclidean distance computation in $O(D)$ time, where $D$ is
the dimension of the feature map $\phi(x)$.
Overall, we have a distance between timestamped feature sets whose
-complexity can be tuned via the map dimensionality $D$.
+precision / complexity tradeoff can be tuned via the map dimensionality $D$.

## Evaluation

@@ -205,9 +208,9 @@ computer vision community at the time of this work.
However, in our small data context, they proved useful for the task at hand.
<!-- #endregion -->

-In order to evaluate the classifier presented above, we used the UCR Time
+In order to evaluate the method presented above, we have used the UCR Time
Series Classification archive, which, at the time, was made of monodimensional
-data only.
+time series only.
We decided not to work on raw data but rather extract local features to
describe our time series.
We chose to rely on temporal SIFT features, that we had introduced in
3 changes: 2 additions & 1 deletion content/parts/01_metrics.md
@@ -16,4 +16,5 @@ Second, in [Sec. 1.2](01/dtw.html), time series are treated as sequences, which
means that only ordering is of importance (time delay between observations
is ignored) and variants of the Dynamic Time Warping algorithm are used.
Finally, in [Sec. 1.3](01/ot.html), undirected labeled graphs are seen as
-discrete distributions over the feature-structure product space.
+discrete distributions over the feature-structure product space and we rely on
+optimal transport distances.
15 changes: 7 additions & 8 deletions content/parts/02/early.md
@@ -39,8 +39,7 @@ The cost function is of the following form:
\label{eq:loss_early}
\end{equation}

-where $\hat{y}$ is the class predicted by the model,
-$\mathcal{L}_c(\cdot,\cdot,\cdot)$ is a
+where $\mathcal{L}_c(\cdot,\cdot,\cdot)$ is a
classification loss and $t$ is the timestamp at which a
decision is triggered by the system.
In this setting, $\alpha$ drives the tradeoff between accuracy and earliness
@@ -85,8 +84,8 @@ We are co-supervising François together with Laetitia Chapel and Chloé Friguet

Relying on Equation \eqref{eq:dachraoui} to decide prediction time can be
tricky. We show in the following that in some cases (related to specific
-configurations of training time confusion matrices), such an approach will lead
-to undesirable behaviors.
+configurations of the training time confusion matrices), such an approach will
+lead to undesirable behaviors.

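For reference, the cost-based triggering rule at stake can be sketched as follows, assuming expected costs for future timestamps have already been estimated (their estimation is precisely what is discussed here):

```python
import numpy

def should_trigger(expected_costs):
    """Trigger a prediction now iff no future timestamp has a lower
    expected cost; `expected_costs[tau]` estimates the cost of deciding
    at time t + tau given the series observed up to time t."""
    return int(numpy.argmin(expected_costs)) == 0
```
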
Using Bayes rule, Equation \eqref{eq:dachraoui} can be re-written

@@ -136,8 +135,8 @@ cost of a larger computational complexity.

We also showed that in order to limit inference time complexity, one could
learn a _decision triggering classifier_ that, based on the time series
-$\mathbf{x}_{\rightarrow t}$
-observed up to time $t$ predicts whether a decision should be triggered or not.
+$\mathbf{x}_{\rightarrow t}$, predicts whether a decision should be triggered
+or not.
In this setting, the target values $\gamma_t$ used to train this
_decision triggering classifier_
were computed from expected costs $f_\tau$ presented above:
@@ -180,9 +179,9 @@ We have hence proposed a representation learning framework that
covers these three limitations {% cite ruwurm:hal-02174314 %}.

In more detail, we rely on a feature extraction module (that can either be
-made of convolutional or recurrent submodules) to extract a fixed-sized
+made of causal convolutions or recurrent submodules) to extract a fixed-sized
representation $h_t$ from an incoming time series $\mathbf{x}_{\rightarrow t}$.
-An important point here is that this feature extractor can operate on time
+An important point here is that this feature extractor should operate on time
series whatever their length (and hence a different feature extractor need not
be learned for each time series length).
Then, this feature is provided as input to two different heads, as shown in the
22 changes: 11 additions & 11 deletions content/parts/02/shapelets_cnn.md
@@ -16,27 +16,27 @@ jupyter:
# Shapelet-based Representations and Convolutional Models

In this section, we will cover works that either relate to the Shapelet
-representation for time series or to the family of (1d) Convolutional Neural
+representation for time series or to the family of (1D) Convolutional Neural
Networks, since these two families of methods are very similar in spirit
{% cite lods:hal-01565207 %}.

## Data Augmentation for Time Series Classification

<!-- #region {"tags": ["popout"]} -->
-**Note.** This work is part of Arthur Le Guennec's Master internship.
+**Note.** This work was part of Arthur Le Guennec's Master internship.
We were co-supervising Arthur together with Simon Malinowski.
<!-- #endregion -->

We have shown in {% cite leguennec:halshs-01357973 %} that augmenting time
series classification datasets was an efficient way to improve generalization
-for shallow Convolutional Neural Networks.
+for Convolutional Neural Networks.
The data augmentation strategies investigated in this work are
local warping and window slicing, both of which lead to improvements.
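Minimal sketches of these two strategies on a univariate series (parameter values are illustrative, not those used in the paper):

```python
import numpy

def window_slice(series, ratio=0.9, seed=0):
    """Crop a random window covering `ratio` of the series length."""
    rng = numpy.random.default_rng(seed)
    sz = int(ratio * len(series))
    start = rng.integers(0, len(series) - sz + 1)
    return series[start:start + sz]

def local_warp(series, segment_ratio=0.1, warp_factor=2.0, seed=0):
    """Linearly stretch a randomly chosen segment by `warp_factor`."""
    rng = numpy.random.default_rng(seed)
    seg_sz = max(2, int(segment_ratio * len(series)))
    start = rng.integers(0, len(series) - seg_sz)
    segment = series[start:start + seg_sz]
    warped = numpy.interp(
        numpy.linspace(0, seg_sz - 1, int(seg_sz * warp_factor)),
        numpy.arange(seg_sz), segment)
    return numpy.concatenate([series[:start], warped,
                              series[start + seg_sz:]])
```
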

## Learning to Mimic a Target Distance

<!-- #region {"tags": ["popout"]} -->
-**Note.** This work is part of Arnaud Lods' Master internship.
+**Note.** This work was part of Arnaud Lods' Master internship.
We were co-supervising Arnaud together with Simon Malinowski.
<!-- #endregion -->

Expand All @@ -50,9 +50,9 @@ used similarity measure for time series.
However, it suffers from its non differentiability and the fact that it does
not satisfy metric properties.
Our goal in {% cite lods:hal-01565207 %} was to introduce a Shapelet model that
-extracts latent representations such that Euclidean distance in the latent
-space is as close as possible to Dynamic Time Warping between original time
-series.
+extracts latent representations such that Euclidean distance between latent
+representations is as close as possible to Dynamic Time Warping between original
+time series.
The resulting model is an instance of a Siamese Network:

![](../../images/siamese_ldps.png)
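
The training criterion can be sketched as follows, assuming a shared encoder has produced latent representations for the two series of a pair and their DTW value has been precomputed (a sketch, not the exact implementation from the paper):

```python
import numpy

def siamese_dtw_loss(z_a, z_b, dtw_target):
    """Squared gap between latent Euclidean distance and target DTW."""
    return (numpy.linalg.norm(z_a - z_b) - dtw_target) ** 2
```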
@@ -433,8 +433,8 @@ Semi-Sparse Group Lasso) loss that allows to enforce sparsity on some individual
variables only:

\begin{equation}
-\mathcal{L}^{\mathrm{SSGL}}(y, \hat{y}, \boldsymbol{\theta}) =
-    \mathcal{L}(y, \hat{y}, \boldsymbol{\theta})
+\mathcal{L}^{\mathrm{SSGL}}(\mathbf{x}, y, \boldsymbol{\theta}) =
+    \mathcal{L}(\mathbf{x}, y, \boldsymbol{\theta})
+ \alpha \lambda
\left\| \mathbf{M}_\text{ind} \boldsymbol{\beta} \right\|_1
+ (1-\alpha) \lambda \sum_{k=1}^{K} \sqrt{p_k}
@@ -447,7 +447,7 @@ features in our random shapelet case), $\boldsymbol{\theta}$ is the set of
all model weights, including weights $\boldsymbol{\beta}$ that are directly
connected to the features (_ie._ these are weights from the first layer), that
are organized in groups $\boldsymbol{\beta}^{(k)}$ of size $p_k$ ($p_k=2$ in the
-random shapelet context), each group corresponding to a different shapelet.
+random shapelet context, each group corresponding to a different shapelet).

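A minimal sketch of this penalty, assuming $\mathbf{M}_\text{ind}$ is encoded as a 0/1 mask over individual variables and groups are given as index arrays:

```python
import numpy

def ssgl_penalty(beta, groups, ind_mask, alpha, lam):
    """l1 on selected individual variables plus a group-lasso term."""
    l1_term = numpy.abs(ind_mask * beta).sum()
    group_term = sum(numpy.sqrt(len(idx)) * numpy.linalg.norm(beta[idx])
                     for idx in groups)
    return alpha * lam * l1_term + (1.0 - alpha) * lam * group_term
```
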
```python tags=["hide_input"]
%config InlineBackend.figure_format = 'svg'
@@ -610,7 +610,7 @@ terms of both Mean Squared Error (MSE) and estimation of zero coefficients.

When applied to the specific case of random shapelets, we have shown that this
led to improved accuracy as soon as datasets are large enough for coefficients
-to be properly estimated.
+to be estimated properly.

## Learning Shapelets that Look Like Time Series Snippets

