You can re-use the content in this post at will, as long as you cite this page
using the following BibTeX entry:
@misc{tavenard.blog.softdtw,
author="Romain Tavenard",
title="Differentiability of DTW and the case of soft-DTW",
year=2021,
howpublished="\url{https://rtavenar.github.io/blog/softdtw.html}"
}
We have seen in a previous blog post how one can use Dynamic Time Warping (DTW) as a shift-invariant similarity measure between time series. In this new post, we will study some aspects related to the differentiability of DTW. One of the reasons why we focus on differentiability is that this property is key in modern machine learning approaches.
[CuBl17] provide a nice example setting in which differentiability is desirable. Suppose we are given a forecasting task (that is, a task in which we are given the beginning of a time series and the goal is to predict its future behavior) in which the exact temporal localization of the motifs to be predicted is less important than their overall shape. In such a setting, it would make sense to use a shift-invariant similarity measure in order to assess whether a prediction made by the model is close enough to the ground truth. Hence, a rather reasonable approach could be to tune the parameters of a neural network in order to minimize such a loss. Since optimization for this family of models heavily relies on gradient descent, having access to a differentiable shift-invariant similarity measure between time series is a key ingredient of this approach.
Differentiability of DTW
Let us start by having a look at the differentiability of Dynamic Time Warping. To do so, we will rely on the following theorem from [BoSh98]:
Let $X$ be a metric space, $U$ be a normed space, and $C$ be a compact subset of $U$. Given a function $f : X \times U \to \mathbb{R}$, let us define the optimal value function as:

$$v(x) = \min_{u \in C} f(x, u) .$$

Suppose that:

1. for all $u \in C$, the function $x \mapsto f(x, u)$ is differentiable;
2. $f$ and its derivative $\nabla_x f$ are continuous on $X \times C$.

If, for $x_0 \in X$, $f(x_0, \cdot)$ has a unique minimizer $u_0$ over $C$, then $v$ is differentiable at $x_0$ and $\nabla v(x_0) = \nabla_x f(x_0, u_0)$.
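As a minimal illustration (a toy example of ours, not taken from [BoSh98]), take $f(x, u) = x \cdot u$ with $X = \mathbb{R}$ and $C = \{-1, 1\}$:

$$v(x) = \min(x, -x) = -|x| ,$$

which is differentiable everywhere except at $x = 0$, precisely the point where the two minimizers $u = -1$ and $u = 1$ coexist.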
Let us come back to Dynamic Time Warping, and suppose we are given a reference time series $\mathbf{x}^{\text{ref}}$. We would like to study the differentiability of the map $\mathbf{x} \mapsto DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})$. If we write DTW as

$$DTW(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} \sqrt{\sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2} ,$$

where $\mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})$ is the (finite, hence compact) set of admissible alignment paths, then the previous Theorem tells us that $DTW(\cdot, \mathbf{x}^{\text{ref}})$ is differentiable everywhere except when:

1. the cost of the optimal path is zero (i.e. $\mathbf{x}$ matches $\mathbf{x}^{\text{ref}}$ exactly along that path) since, in this case, the non-differentiability of the square root function at $0$ breaks condition 1 of the Theorem above;
2. there exist several optimal paths for the DTW problem.
This second condition is illustrated in the Figure below, in which we vary the value of a single element in one of the time series (for visualization purposes) and study the evolution of $DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})$ as a function of this value:
Note the sudden change in slope at the position marked by a vertical dashed line, which corresponds to a case where (at least) two distinct optimal alignment paths coexist.
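To get a feel for this numerically, here is a minimal NumPy sketch of a similar experiment (the series and the varied index are toy choices of ours, not those used in the Figure above):

```python
import numpy as np

def dtw(x, x_ref):
    """Plain DTW: square root of the optimal sum of squared differences,
    computed by dynamic programming."""
    n, m = len(x), len(x_ref)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - x_ref[j - 1]) ** 2
            R[i, j] = cost + min(R[i - 1, j], R[i, j - 1], R[i - 1, j - 1])
    return np.sqrt(R[n, m])

# Vary a single element of x and track DTW(x, x_ref): the resulting curve is
# piecewise smooth, with slope changes where several optimal paths coexist.
x_ref = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
for value in np.linspace(-1.0, 3.0, 9):
    x_mod = x.copy()
    x_mod[2] = value  # the element we vary
    print(f"x[2] = {value:+.1f} -> DTW = {dtw(x_mod, x_ref):.3f}")
```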
Soft-DTW and variants
Soft-DTW [CuBl17] has been introduced as a way to mitigate this limitation. The formal definition for soft-DTW is the following:

$$\text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \operatorname{min}^{\gamma}_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} \sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2$$

where $\operatorname{min}^{\gamma}$ is the soft-min operator parametrized by a smoothing factor $\gamma$.
A note on soft-min
The soft-min operator is defined as:

$$\operatorname{min}^{\gamma}(a_1, \dots, a_n) = -\gamma \log \sum_{i=1}^{n} e^{-a_i / \gamma} .$$

Note that when $\gamma$ tends to $0$, the term corresponding to the lowest value dominates the other terms in the sum, and the soft-min then tends to the hard minimum, as illustrated below:
As a consequence, we have:

$$\lim_{\gamma \to 0^+} \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})^2 .$$
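Here is a quick numerical check of this convergence, with a minimal NumPy sketch of the soft-min operator (arbitrary input values; a log-sum-exp trick is used for numerical stability):

```python
import numpy as np

def softmin(values, gamma):
    """Soft-min operator: -gamma * log(sum_i exp(-a_i / gamma)),
    computed with a log-sum-exp trick for numerical stability."""
    z = -np.asarray(values, dtype=float) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

a = [3.0, 1.0, 2.5]
print("hard min:", min(a))
for gamma in [10.0, 1.0, 0.1, 0.01]:
    print(f"gamma = {gamma:<5} soft-min = {softmin(a, gamma):.4f}")
# As gamma decreases, the soft-min gets closer and closer to min(a) = 1.0,
# always approaching it from below.
```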
However, contrary to DTW, soft-DTW is differentiable everywhere for strictly positive $\gamma$, even if, for small $\gamma$ values, sudden changes can still occur in the loss landscape, as seen in the Figure below:
Note that the recurrence relation we had in Equation (2) of the post on DTW still holds with this formulation:

$$R_{i, j} = d(x_i, x^{\text{ref}}_j)^2 + \operatorname{min}^{\gamma}\left(R_{i-1, j}, R_{i, j-1}, R_{i-1, j-1}\right)$$

where $R_{i, j}$ is the soft-DTW score between the first $i$ elements of $\mathbf{x}$ and the first $j$ elements of $\mathbf{x}^{\text{ref}}$. As a consequence, the dynamic programming algorithm used for DTW is still valid here, with the hard min replaced by the soft-min, as sketched below.
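Here is a minimal NumPy sketch of this forward pass, assuming univariate series and a squared Euclidean cost (an illustration, not an optimized implementation):

```python
import numpy as np

def softmin(values, gamma):
    """Soft-min operator (see the sketch above), with a stable log-sum-exp."""
    z = -np.asarray(values, dtype=float) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(x, x_ref, gamma=1.0):
    """Forward dynamic programming pass for soft-DTW
    (univariate series, squared Euclidean cost)."""
    n, m = len(x), len(x_ref)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - x_ref[j - 1]) ** 2
            R[i, j] = cost + softmin([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
x_ref = np.array([0.0, 0.0, 1.0, 2.0, 0.0])
for gamma in [1.0, 0.1, 0.01]:
    print(f"gamma = {gamma:<4} soft-DTW = {soft_dtw(x, x_ref, gamma):.4f}")
# For small gamma, the value gets close to the squared DTW between x and x_ref.
```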
Soft-Alignment Path
It is shown in [MeBl18] that soft-DTW can be re-written:

$$\text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \min_{p \in \mathcal{P}_{\mathcal{A}}} \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} p(\pi) \left[ \sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2 \right] - \gamma H(p)$$

where $\mathcal{P}_{\mathcal{A}}$ is the set of probability distributions over paths in $\mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})$ and $H(p)$ is the entropy of a given probability distribution $p$.
A (very short) note on entropy
For a discrete probability distribution $p$, entropy (also known as Shannon entropy) is defined as

$$H(p) = - \sum_{i} p_i \log p_i$$
and is maximized by the uniform distribution, as seen below:
The optimal distribution in the formulation above is then the Gibbs distribution that assigns, to each path $\pi$, the probability:

$$p^{\star}(\pi) = \frac{e^{- \sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2 / \gamma}}{k_{\text{GA}}^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}})}$$

where $k_{\text{GA}}^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} e^{- \sum_{(i, j) \in \pi} d(x_i, x^{\text{ref}}_j)^2 / \gamma}$ is the Global Alignment kernel [CVBM07] that acts as a normalization factor here.
This formulation leads to the following definition for the soft-alignment matrix $A_{\gamma}$:

$$A_{\gamma} = \sum_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})} p^{\star}(\pi) \, A_{\pi} \qquad (1)$$

where $A_{\pi}$ is the binary matrix representation of path $\pi$ (its $(i, j)$-th entry is $1$ if $(i, j) \in \pi$ and $0$ otherwise). $A_{\gamma}$ is a matrix that informs, for each pair $(i, j)$, how much it will be taken into account in the matching.
Note that when $\gamma$ tends toward $+\infty$, the weights $p^{\star}(\pi)$ tend to the uniform distribution, hence the averaging operates over all alignments with equal weights, and the corresponding matrix $A_{\gamma}$ tends to favor diagonal matches, regardless of the content of the series $\mathbf{x}$ and $\mathbf{x}^{\text{ref}}$.
However, the sum in Equation (1) is intractable due to the very large number of paths in $\mathcal{A}(\mathbf{x}, \mathbf{x}^{\text{ref}})$. Fortunately, once soft-DTW has been computed, $A_{\gamma}$ can be obtained through a backward dynamic programming pass with quadratic complexity (see more details in [CuBl17]).
Computing this matrix is especially useful since it is directly related to the gradients of the soft-DTW similarity measure:

$$\nabla_{\mathbf{x}} \, \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \left( \frac{\partial \Delta(\mathbf{x}, \mathbf{x}^{\text{ref}})}{\partial \mathbf{x}} \right)^{\top} A_{\gamma}$$

where $\Delta(\mathbf{x}, \mathbf{x}^{\text{ref}})$ is the matrix of pairwise costs $d(x_i, x^{\text{ref}}_j)^2$.
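To make Equation (1) and this gradient formula concrete, here is a brute-force NumPy sketch on a tiny example: it enumerates all paths (only feasible for very short series, as noted above), builds the soft-alignment matrix, and checks the gradient formula against finite differences of the soft_dtw sketch given earlier (univariate series and squared Euclidean cost assumed):

```python
import numpy as np

def alignment_paths(n, m):
    """Recursively enumerate all monotonic alignment paths from (0, 0) to (n-1, m-1)."""
    def _extend(path):
        i, j = path[-1]
        if (i, j) == (n - 1, m - 1):
            yield path
            return
        for di, dj in [(1, 0), (0, 1), (1, 1)]:
            if i + di < n and j + dj < m:
                yield from _extend(path + [(i + di, j + dj)])
    yield from _extend([(0, 0)])

def soft_alignment_matrix(x, x_ref, gamma):
    """Brute-force version of Equation (1): Gibbs-weighted average of path matrices."""
    n, m = len(x), len(x_ref)
    costs, mats = [], []
    for path in alignment_paths(n, m):
        A_pi = np.zeros((n, m))
        for i, j in path:
            A_pi[i, j] = 1.0
        mats.append(A_pi)
        costs.append(sum((x[i] - x_ref[j]) ** 2 for i, j in path))
    weights = np.exp(-np.array(costs) / gamma)
    weights /= weights.sum()  # Gibbs distribution p*(pi) over paths
    return sum(w * A for w, A in zip(weights, mats))

x = np.array([0.0, 1.3, 2.1, 0.4])
x_ref = np.array([0.1, 1.0, 2.0, 0.0])
gamma = 0.5
A_gamma = soft_alignment_matrix(x, x_ref, gamma)

# Gradient of soft-DTW w.r.t. x, as given by the formula above
# (univariate, squared Euclidean cost): 2 * sum_j A_gamma[i, j] * (x_i - x_ref_j)
grad = 2.0 * (A_gamma * (x[:, None] - x_ref[None, :])).sum(axis=1)

# Finite-difference check, using the soft_dtw sketch from earlier
eps = 1e-5
fd = np.array([
    (soft_dtw(x + eps * np.eye(len(x))[i], x_ref, gamma)
     - soft_dtw(x - eps * np.eye(len(x))[i], x_ref, gamma)) / (2 * eps)
    for i in range(len(x))
])
print(np.allclose(grad, fd, atol=1e-4))  # expected: True
```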
Properties
As discussed in [JaCuGr20], soft-DTW is not invariant to time shifts, as DTW is. Suppose $\mathbf{x}$ is a time series that is constant except for a motif that occurs at some point in the series, and let us denote by $\mathbf{x}_{+k}$ a copy of $\mathbf{x}$ in which the motif is temporally shifted by $k$ timestamps. Then the quantity

$$\text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}_{+k})$$
grows linearly with the shift $k$:
The reason behind this sensitivity to time shifts is that soft-DTW provides a weighted average similarity score across all alignment paths (where stronger weights are assigned to better paths), instead of focusing on the single best alignment as DTW does.
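To get a feel for this behaviour, here is a small experiment sketch on toy series of ours, reusing the dtw and soft_dtw functions sketched earlier in this post:

```python
import numpy as np

def series_with_motif(length, start):
    """A constant (zero) series with a short bump starting at index `start`."""
    x = np.zeros(length)
    x[start:start + 3] = [1.0, 2.0, 1.0]
    return x

x = series_with_motif(30, start=5)
for k in range(0, 16, 5):
    x_shifted = series_with_motif(30, start=5 + k)
    # dtw and soft_dtw are the sketches given earlier in this post
    print(f"k = {k:2d}  DTW = {dtw(x_shifted, x):.3f}  "
          f"soft-DTW = {soft_dtw(x_shifted, x, gamma=1.0):.3f}")
# DTW stays at 0 whatever the shift, while soft-DTW grows with k
# (soft-DTW values can be negative; what matters here is their growth).
```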
Another important property of soft-DTW is its “denoising effect,” in the sense that, for a given time series $\mathbf{x}^{\text{ref}}$, the minimizer of $\mathbf{x} \mapsto \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}})$ is not $\mathbf{x}^{\text{ref}}$ itself but rather a smoothed version of it:
Finally, as seen in Figure 2, $\operatorname{min}^{\gamma}$ lower bounds the min operator. As a result, soft-DTW lower bounds the squared DTW. Another way to see it is by taking a closer look at the entropy-regularized formulation for soft-DTW and observing that a distribution that would have a probability of $1$ for the best path and $0$ for all other paths is an element of $\mathcal{P}_{\mathcal{A}}$ whose cost is equal to $DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})^2$ (its entropy being $0$). Since soft-DTW is a minimum over all probability distributions in $\mathcal{P}_{\mathcal{A}}$, it hence has to be lower than or equal to $DTW(\mathbf{x}, \mathbf{x}^{\text{ref}})^2$. Contrary to DTW, soft-DTW is not bounded below by zero, and we even have:

$$\lim_{\gamma \to +\infty} \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = -\infty .$$
Related Similarity Measures
In [BlMeVe21], new similarity measures that rely on soft-DTW are defined. In particular, the soft-DTW divergence is introduced to counteract the non-positivity of soft-DTW:

$$D^{\gamma}(\mathbf{x}, \mathbf{x}^{\prime}) = \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}^{\prime}) - \frac{1}{2} \left( \text{soft-}DTW^{\gamma}(\mathbf{x}, \mathbf{x}) + \text{soft-}DTW^{\gamma}(\mathbf{x}^{\prime}, \mathbf{x}^{\prime}) \right) .$$

This divergence has the advantage of being minimized for $\mathbf{x} = \mathbf{x}^{\prime}$ and being exactly $0$ in that case.
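Reusing the soft_dtw sketch from earlier, this divergence could be computed as follows (an illustration, not an optimized implementation):

```python
import numpy as np

def soft_dtw_divergence(x, x_prime, gamma=1.0):
    """Soft-DTW divergence of [BlMeVe21], built on the soft_dtw sketch above."""
    return (soft_dtw(x, x_prime, gamma)
            - 0.5 * (soft_dtw(x, x, gamma) + soft_dtw(x_prime, x_prime, gamma)))

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
print(soft_dtw_divergence(x, x, gamma=1.0))        # exactly 0 for identical series
print(soft_dtw_divergence(x, x + 0.3, gamma=1.0))  # non-negative for distinct series
```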
Also, in [HaDeJe21], a variant of $\operatorname{min}^{\gamma}$, called smoothMin, is used in the recurrence formula. Contrary to $\operatorname{min}^{\gamma}$, this operator upper bounds the min operator:
As a consequence, the resulting similarity measure upper bounds DTW. Note also that [HaDeJe21] suggest that the DTW variants presented in these posts are not fully suited for representation learning and additional contrastive losses should be used to help learn useful representations.
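As an aside, one generic example of a smoothed operator that upper bounds the hard minimum is a softmax-weighted average of its inputs; the sketch below uses this form purely for illustration and is not necessarily the exact operator used in [HaDeJe21]:

```python
import numpy as np

def softmax_weighted_min(values, gamma):
    """A softmax-weighted average of the inputs: a smooth operator that
    upper bounds the hard minimum (weights favour the smallest values)."""
    a = np.asarray(values, dtype=float)
    w = np.exp(-(a - a.min()) / gamma)
    return float((w * a).sum() / w.sum())

a = [3.0, 1.0, 2.5]
for gamma in [1.0, 0.1, 0.01]:
    print(f"gamma = {gamma}: {softmax_weighted_min(a, gamma):.4f} >= min(a) = {min(a)}")
```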
Conclusion
We have seen in this post that DTW is not differentiable everywhere, and that there exist alternatives that basically replace the min operator with a differentiable counterpart in order to get a differentiable similarity measure that can later be used as a loss in gradient-based optimization.
References
[BlMeVe21]
Mathieu Blondel, Arthur Mensch & Jean-Philippe Vert. Differentiable divergences between time series. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2021. Link
[BoSh98]
J. Frédéric Bonnans & Alexander Shapiro. Optimization problems with perturbations: A guided tour. SIAM Review, 1998. Link
[CuBl17]
Marco Cuturi & Mathieu Blondel. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the International Conference on Machine Learning, 2017. Link
[CVBM07]
Marco Cuturi, Jean-Philippe Vert, Oystein Birkenes & Tomoko Matsui. A kernel for time series based on global alignments. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007. Link
[HaDeJe21]
Isma Hadji, Konstantinos G. Derpanis & Allan D. Jepson. Representation learning via global temporal alignment and cycle-consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021. Link
[JaCuGr20]
Hicham Janati, Marco Cuturi & Alexandre Gramfort. Spatio-temporal alignments: Optimal transport through space and time. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2020. Link
[MeBl18]
Arthur Mensch & Mathieu Blondel. Differentiable dynamic programming for structured prediction and attention. In Proceedings of the International Conference on Machine Learning, 2018. Link