You can re-use the content in this post at will, as long as you cite this page
using the following BibTeX entry:
@misc{tavenard.blog.softdtw,
author="Romain Tavenard",
title="Differentiability of DTW and the case of soft-DTW",
year=2021,
howpublished="\url{https://rtavenar.github.io/blog/softdtw.html}"
}
We have seen in a previous blog post how one can use Dynamic Time Warping (DTW) as a shift-invariant similarity measure between time series. In this new post, we will study some aspects related to the differentiability of DTW. One of the reasons why we focus on differentiability is that this property is key in modern machine learning approaches.
[CuBl17] provide a nice example setting in which differentiability is desirable. Suppose we are given a forecasting task (that is, a task in which we are given the beginning of a time series and the goal is to predict its future behavior) in which the exact temporal localization of the motifs to be predicted is less important than their overall shape. In such a setting, it would make sense to use a shift-invariant similarity measure in order to assess whether a prediction made by the model is close enough to the ground truth. Hence, a rather reasonable approach could be to tune the parameters of a neural network in order to minimize such a loss. Since optimization for this family of models heavily relies on gradient descent, having access to a differentiable shift-invariant similarity measure between time series is a key ingredient of this approach.
Differentiability of DTW
Let us start by having a look at the differentiability of Dynamic Time Warping. To do so, we will rely on the following theorem from [BoSh98]:
Let $X$ be a metric space, $U$ be a normed space, and $C$ be a compact subset of $U$. Given a function $f : X \times U \to \mathbb{R}$, let us define the optimal value function as:

$$v(x) = \min_{u \in C} f(x, u) .$$

Suppose that:

1. for all $u \in C$, the function $x \mapsto f(x, u)$ is differentiable;
2. $f$ and its derivative $\nabla_x f$ are continuous on $X \times C$.

If, for $x_0 \in X$, $f(x_0, \cdot)$ has a unique minimizer $u_0$ over $C$, then $v$ is differentiable at $x_0$ and $\nabla v(x_0) = \nabla_x f(x_0, u_0)$.
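As a minimal illustration (a toy example of ours, not taken from [BoSh98]), take $f(x, u) = x \cdot u$ with $X = \mathbb{R}$ and $C = \{-1, 1\}$:

$$v(x) = \min(x, -x) = -|x| ,$$

which is differentiable everywhere except at $x = 0$, precisely the point where the two minimizers $u = -1$ and $u = 1$ coexist.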
Let us come back to Dynamic Time Warping, and suppose we are given a reference time series $\mathbf{x}^{\text{ref}}$. We would like to study the differentiability of the map $\mathbf{x} \mapsto DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})$. If we write DTW as

$$DTW(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} \sqrt{\sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2} ,$$

where $\mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})$ is the (finite, hence compact) set of admissible alignment paths, then the previous Theorem tells us that $DTW(\cdot, \mathbf{x}^{\text{ref}})$ is differentiable everywhere except when:

1. the cost of the optimal path is zero (i.e. $\mathbf{x}$ matches $\mathbf{x}^{\text{ref}}$ exactly along that path) since, in this case, the non-differentiability of the square root function at $0$ breaks condition 1 of the Theorem above;
2. there exist several optimal paths for the DTW problem.
This second condition is illustrated in the Figure below, in which we vary the value of a single element in one of the time series (for visualization purposes) and study the evolution of $DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})$ as a function of this value:
Note the sudden change in slope at the position marked by a vertical dashed line, which corresponds to a case where (at least) two distinct optimal alignment paths coexist.
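To get a feel for this numerically, here is a minimal NumPy sketch of a similar experiment (the series and the varied index are toy choices of ours, not those used in the Figure above):

```python
import numpy as np

def dtw(x, x_ref):
    """Plain DTW: square root of the optimal sum of squared differences,
    computed by dynamic programming."""
    n, m = len(x), len(x_ref)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - x_ref[j - 1]) ** 2
            R[i, j] = cost + min(R[i - 1, j], R[i, j - 1], R[i - 1, j - 1])
    return np.sqrt(R[n, m])

# Vary a single element of x and track DTW(x, x_ref): the resulting curve is
# piecewise smooth, with slope changes where several optimal paths coexist.
x_ref = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
for value in np.linspace(-1.0, 3.0, 9):
    x_mod = x.copy()
    x_mod[2] = value  # the element we vary
    print(f"x[2] = {value:+.1f} -> DTW = {dtw(x_mod, x_ref):.3f}")
```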
Soft-DTW and variants
Soft-DTW [CuBl17] has been introduced as a way to mitigate this limitation. The formal definition for soft-DTW is the following:

$$\text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \operatorname{min}^{\gamma}_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} \sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2$$

where $\operatorname{min}^{\gamma}$ is the soft-min operator parametrized by a smoothing factor $\gamma$.
A note on soft-min
The soft-min operator is defined as:

$$\operatorname{min}^{\gamma}(a_1, \dots, a_n) = -\gamma \log \sum_{i=1}^{n} e^{-a_i / \gamma} .$$

Note that when $\gamma$ tends to $0$, the term corresponding to the lowest value dominates the other terms in the sum, and the soft-min then tends to the hard minimum, as illustrated below:
As a consequence, we have:

$$\lim_{\gamma \to 0^+} \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})^2 .$$
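Here is a quick numerical check of this convergence, with a minimal NumPy sketch of the soft-min operator (arbitrary input values; a log-sum-exp trick is used for numerical stability):

```python
import numpy as np

def softmin(values, gamma):
    """Soft-min operator: -gamma * log(sum_i exp(-a_i / gamma)),
    computed with a log-sum-exp trick for numerical stability."""
    z = -np.asarray(values, dtype=float) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

a = [3.0, 1.0, 2.5]
print("hard min:", min(a))
for gamma in [10.0, 1.0, 0.1, 0.01]:
    print(f"gamma = {gamma:<5} soft-min = {softmin(a, gamma):.4f}")
# As gamma decreases, the soft-min gets closer and closer to min(a) = 1.0,
# always approaching it from below.
```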
However, contrary to DTW, soft-DTW is differentiable everywhere for strictly positive $\gamma$, even if, for small $\gamma$ values, sudden changes can still occur in the loss landscape, as seen in the Figure below:
Note that the recurrence relation we had in Equation (2) of the post on DTW still holds with this formulation:

$$R_{i, j} = d(x_i, x^{\text{ref}}_j)^2 + \operatorname{min}^{\gamma}\left(R_{i-1, j}, R_{i, j-1}, R_{i-1, j-1}\right)$$

where $R_{i, j}$ is the soft-DTW score between the first $i$ elements of $\mathbf{x}$ and the first $j$ elements of $\mathbf{x}^{\text{ref}}$. As a consequence, the dynamic programming algorithm used for DTW is still valid here, with the hard min replaced by the soft-min, as sketched below.
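Here is a minimal NumPy sketch of this forward pass, assuming univariate series and a squared Euclidean cost (an illustration, not an optimized implementation):

```python
import numpy as np

def softmin(values, gamma):
    """Soft-min operator (see the sketch above), with a stable log-sum-exp."""
    z = -np.asarray(values, dtype=float) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(x, x_ref, gamma=1.0):
    """Forward dynamic programming pass for soft-DTW
    (univariate series, squared Euclidean cost)."""
    n, m = len(x), len(x_ref)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - x_ref[j - 1]) ** 2
            R[i, j] = cost + softmin([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
x_ref = np.array([0.0, 0.0, 1.0, 2.0, 0.0])
for gamma in [1.0, 0.1, 0.01]:
    print(f"gamma = {gamma:<4} soft-DTW = {soft_dtw(x, x_ref, gamma):.4f}")
# For small gamma, the value gets close to the squared DTW between x and x_ref.
```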
Soft-Alignment Path
It is shown in [MeBl18] that soft-DTW can be re-written:

$$\text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \min_{p \in \mathcal{P}_{\mathcal{A}}} \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} p(\pi) \left[ \sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2 \right] - \gamma H(p)$$

where $\mathcal{P}_{\mathcal{A}}$ is the set of probability distributions over paths in $\mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})$ and $H(p)$ is the entropy of a given probability distribution $p$.
A (very short) note on entropy
For a discrete probability distribution $p$, entropy (also known as Shannon entropy) is defined as

$$H(p) = - \sum_{i} p_i \log p_i$$
and is maximized by the uniform distribution, as seen below:
The optimal distribution in the formulation above is then the Gibbs distribution that assigns, to each path $\pi$, the probability:

$$p^{\star}(\pi) = \frac{e^{- \sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2 / \gamma}}{k_{\text{GA}}^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}})}$$

where $k_{\text{GA}}^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} e^{- \sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2 / \gamma}$ is the Global Alignment kernel [CVBM07] that acts as a normalization factor here.
This formulation leads to the following definition for the soft-alignment matrix $A_{\gamma}$:

$$A_{\gamma} = \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} p^{\star}(\pi) \, A_{\pi} \qquad (1)$$

where $A_{\pi}$ is the binary matrix representation of path $\pi$ (its $(i, j)$-th entry is $1$ if $(i, j) \in \pi$ and $0$ otherwise). $A_{\gamma}$ is a matrix that informs, for each pair $(i, j)$, how much it will be taken into account in the matching.
Note that when $\gamma$ tends toward $+\infty$, the weights $p^{\star}(\pi)$ tend to the uniform distribution, hence the averaging operates over all alignments with equal weights, and the corresponding matrix $A_{\gamma}$ tends to favor diagonal matches, regardless of the content of the series $\mathbf{x}$ and $\mathbf{x}^{\text{ref}}$.
However, the sum in Equation (1) is intractable due to the very large number of paths in $\mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})$. Fortunately, once soft-DTW has been computed, $A_{\gamma}$ can be obtained through a backward dynamic programming pass with quadratic complexity (see more details in [CuBl17]).
Computing this matrix is especially useful since it is directly related to the gradients of the soft-DTW similarity measure:

$$\nabla_{\mathbf{x}} \, \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \left( \frac{\partial \Delta(\mathbf{x}, \mathbf{x}^{\text{ref}})}{\partial \mathbf{x}} \right)^{\top} A_{\gamma}$$

where $\Delta(\mathbf{x}, \mathbf{x}^{\text{ref}})$ is the matrix of pairwise costs $d(x_i, x^{\text{ref}}_j)^2$.
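To make Equation (1) and this gradient formula concrete, here is a brute-force NumPy sketch on a tiny example: it enumerates all paths (only feasible for very short series, as noted above), builds the soft-alignment matrix, and checks the gradient formula against finite differences of the soft_dtw sketch given earlier (univariate series and squared Euclidean cost assumed):

```python
import numpy as np

def alignment_paths(n, m):
    """Recursively enumerate all monotonic alignment paths from (0, 0) to (n-1, m-1)."""
    def _extend(path):
        i, j = path[-1]
        if (i, j) == (n - 1, m - 1):
            yield path
            return
        for di, dj in [(1, 0), (0, 1), (1, 1)]:
            if i + di < n and j + dj < m:
                yield from _extend(path + [(i + di, j + dj)])
    yield from _extend([(0, 0)])

def soft_alignment_matrix(x, x_ref, gamma):
    """Brute-force version of Equation (1): Gibbs-weighted average of path matrices."""
    n, m = len(x), len(x_ref)
    costs, mats = [], []
    for path in alignment_paths(n, m):
        A_pi = np.zeros((n, m))
        for i, j in path:
            A_pi[i, j] = 1.0
        mats.append(A_pi)
        costs.append(sum((x[i] - x_ref[j]) ** 2 for i, j in path))
    weights = np.exp(-np.array(costs) / gamma)
    weights /= weights.sum()  # Gibbs distribution p*(pi) over paths
    return sum(w * A for w, A in zip(weights, mats))

x = np.array([0.0, 1.3, 2.1, 0.4])
x_ref = np.array([0.1, 1.0, 2.0, 0.0])
gamma = 0.5
A_gamma = soft_alignment_matrix(x, x_ref, gamma)

# Gradient of soft-DTW w.r.t. x, as given by the formula above
# (univariate, squared Euclidean cost): 2 * sum_j A_gamma[i, j] * (x_i - x_ref_j)
grad = 2.0 * (A_gamma * (x[:, None] - x_ref[None, :])).sum(axis=1)

# Finite-difference check, using the soft_dtw sketch from earlier
eps = 1e-5
fd = np.array([
    (soft_dtw(x + eps * np.eye(len(x))[i], x_ref, gamma)
     - soft_dtw(x - eps * np.eye(len(x))[i], x_ref, gamma)) / (2 * eps)
    for i in range(len(x))
])
print(np.allclose(grad, fd, atol=1e-4))  # expected: True
```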
Properties
As discussed in [JaCuGr20], soft-DTW is not invariant to time shifts, as DTW is. Suppose $\mathbf{x}$ is a time series that is constant except for a motif that occurs at some point in the series, and let us denote by $\mathbf{x}_{+k}$ a copy of $\mathbf{x}$ in which the motif is temporally shifted by $k$ timestamps. Then the quantity

$$\text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}_{+k})$$
grows linearly with the shift $k$:
The reason behind this sensitivity to time shifts is that soft-DTW provides a weighted average similarity score across all alignment paths (where stronger weights are assigned to better paths), instead of focusing on the single best alignment as DTW does.
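To get a feel for this behaviour, here is a small experiment sketch on toy series of ours, reusing the dtw and soft_dtw functions sketched earlier in this post:

```python
import numpy as np

def series_with_motif(length, start):
    """A constant (zero) series with a short bump starting at index `start`."""
    x = np.zeros(length)
    x[start:start + 3] = [1.0, 2.0, 1.0]
    return x

x = series_with_motif(30, start=5)
for k in range(0, 16, 5):
    x_shifted = series_with_motif(30, start=5 + k)
    # dtw and soft_dtw are the sketches given earlier in this post
    print(f"k = {k:2d}  DTW = {dtw(x_shifted, x):.3f}  "
          f"soft-DTW = {soft_dtw(x_shifted, x, gamma=1.0):.3f}")
# DTW stays at 0 whatever the shift, while soft-DTW grows with k
# (soft-DTW values can be negative; what matters here is their growth).
```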
Another important property of soft-DTW is its “denoising effect,” in the sense that, for a given time series $\mathbf{x}^{\text{ref}}$, the minimizer of $\mathbf{x} \mapsto \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}})$ is not $\mathbf{x}^{\text{ref}}$ itself but rather a smoothed version of it:
Finally, as seen in Figure 2, $\operatorname{min}^{\gamma}$ lower bounds the min operator. As a result, soft-DTW lower bounds the squared DTW. Another way to see it is by taking a closer look at the entropy-regularized formulation for soft-DTW and observing that a distribution that would have a probability of $1$ for the best path and $0$ for all other paths is an element of $\mathcal{P}_{\mathcal{A}}$ whose cost is equal to $DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})^2$ (its entropy being $0$). Since soft-DTW is a minimum over all probability distributions in $\mathcal{P}_{\mathcal{A}}$, it hence has to be lower than or equal to $DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})^2$. Contrary to DTW, soft-DTW is not bounded below by zero, and we even have:

$$\lim_{\gamma \to +\infty} \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = -\infty .$$
Related Similarity Measures
In [BlMeVe21], new similarity measures that rely on soft-DTW are defined. In particular, the soft-DTW divergence is introduced to counteract the non-positivity of soft-DTW:

$$D^{\gamma}(\mathbf{x}, \mathbf{x}^{\prime}) = \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\prime}) - \frac{1}{2} \left( \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}) + \text{soft-}DTW^{\gamma}(\mathbf{x}^{\prime}, \mathbf{x}^{\prime}) \right) .$$

This divergence has the advantage of being minimized for $\mathbf{x} = \mathbf{x}^{\prime}$ and being exactly $0$ in that case.
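Reusing the soft_dtw sketch from earlier, this divergence could be computed as follows (an illustration, not an optimized implementation):

```python
import numpy as np

def soft_dtw_divergence(x, x_prime, gamma=1.0):
    """Soft-DTW divergence of [BlMeVe21], built on the soft_dtw sketch above."""
    return (soft_dtw(x, x_prime, gamma)
            - 0.5 * (soft_dtw(x, x, gamma) + soft_dtw(x_prime, x_prime, gamma)))

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
print(soft_dtw_divergence(x, x, gamma=1.0))        # exactly 0 for identical series
print(soft_dtw_divergence(x, x + 0.3, gamma=1.0))  # non-negative for distinct series
```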
Also, in [HaDeJe21], a variant of $\operatorname{min}^{\gamma}$, called smoothMin, is used in the recurrence formula. Contrary to $\operatorname{min}^{\gamma}$, this operator upper bounds the min operator:
As a consequence, the resulting similarity measure upper bounds DTW. Note also that [HaDeJe21] suggest that the DTW variants presented in these posts are not fully suited for representation learning and additional contrastive losses should be used to help learn useful representations.
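As an aside, one generic example of a smoothed operator that upper bounds the hard minimum is a softmax-weighted average of its inputs; the sketch below uses this form purely for illustration and is not necessarily the exact operator used in [HaDeJe21]:

```python
import numpy as np

def softmax_weighted_min(values, gamma):
    """A softmax-weighted average of the inputs: a smooth operator that
    upper bounds the hard minimum (weights favour the smallest values)."""
    a = np.asarray(values, dtype=float)
    w = np.exp(-(a - a.min()) / gamma)
    return float((w * a).sum() / w.sum())

a = [3.0, 1.0, 2.5]
for gamma in [1.0, 0.1, 0.01]:
    print(f"gamma = {gamma}: {softmax_weighted_min(a, gamma):.4f} >= min(a) = {min(a)}")
```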
Conclusion
We have seen in this post that DTW is not differentiable everywhere, and that there exist alternatives that basically replace the min operator with a differentiable counterpart in order to get a differentiable similarity measure that can later be used as a loss in gradient-based optimization.
References
[BlMeVe21]
Mathieu Blondel, Arthur Mensch & Jean-Philippe Vert. Differentiable divergences between time series. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2021. Link
[BoSh98]
J. Frédéric Bonnans & Alexander Shapiro. Optimization problems with perturbations: A guided tour. SIAM Review, 1998. Link
[CuBl17]
Marco Cuturi & Mathieu Blondel. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the International Conference on Machine Learning, 2017. Link
[CVBM07]
Marco Cuturi, Jean-Philippe Vert, Oystein Birkenes & Tomoko Matsui. A kernel for time series based on global alignments. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007. Link
[HaDeJe21]
Isma Hadji, Konstantinos G. Derpanis & Allan D. Jepson. Representation learning via global temporal alignment and cycle-consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. Link
[JaCuGr20]
Hicham Janati, Marco Cuturi & Alexandre Gramfort. Spatio-temporal alignments: Optimal transport through space and time. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2020. Link
[MeBl18]
Arthur Mensch & Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. In Proceedings of the International Conference on Machine Learning, 2018. Link