You can re-use the content in this post as you wish, as long as you cite this page using the following BibTeX entry:
@misc{tavenard.blog.dtw,
  author="Romain Tavenard",
  title="An introduction to Dynamic Time Warping",
  year=2021,
  howpublished="\url{https://rtavenar.github.io/blog/dtw.html}"
}
Code to reproduce figures is available here.


Alignment-based metrics

This post is the first in a series about assessing similarity between time series. More specifically, we will be interested in alignment-based metrics (we use the term “metrics” here in a rather informal manner, as an equivalent of “similarity measure”), that is, metrics that rely on a temporal alignment of the series in order to assess their similarity.

Before going into more detail about these metrics, let us define our base objects: time series. In the following, a time series is a sequence of features: \(x = (x_0, \dots, x_{n-1})\). All features from a time series lie in the same space \(\mathbb{R}^p\). Below is an example of a univariate time series (a time series is said to be univariate if all its feature vectors are one-dimensional, i.e. \(p=1\)):

Example univariate time series. The horizontal axis is the time axis and the vertical one is dedicated to (univariate) feature values.

Let us now illustrate the typical behavior of alignment-based metrics with an example.

DTW vs Euclidean distance
Comparison between DTW and Euclidean distance. Note that, for the sake of visualization, time series are shifted vertically, but one should imagine that feature value ranges (y-axis values) match.

Here, we are computing similarity between two time series using either Euclidean distance (left) or Dynamic Time Warping (DTW, right), which is an instance of alignment-based metric that we will present in more detail later in this post. In both cases, the returned similarity is the sum of distances between matched features. Here, matches are represented by gray lines, and the distance associated with a match between the \(i\)-th feature in time series \(x\) and the \(j\)-th feature in time series \(x^\prime\) is \(d(x_i, x^\prime_j)\). Note how DTW matches distinctive patterns of the time series, which is likely to result in a sounder similarity assessment than Euclidean distance, which matches timestamps regardless of the feature values.
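As a first hands-on sketch of this comparison, one could use the tslearn library (assumed installed; its dtw function implements the alignment-based measure presented later in this post). The series values below are arbitrary toy data:

import numpy as np
from tslearn.metrics import dtw  # assumes tslearn is installed

# Two univariate series sharing the same pattern, slightly shifted in time
x = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
x_prime = np.array([0., 0., 0., 1., 2., 1., 0., 0.])

# Euclidean distance matches timestamps one-to-one
euclidean_sim = np.linalg.norm(x - x_prime)

# DTW matches features through a temporal alignment
dtw_sim = dtw(x, x_prime)

print(euclidean_sim, dtw_sim)  # DTW is expected to be much smaller here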

Now let us see how this property translates in a machine learning setting. Suppose we are given the following unlabelled time series dataset:

A time series dataset.

If you look carefully at this dataset, you might notice that there are three families of series in it. Let us see if a classical clustering algorithm can detect these three typical shapes. To do so, we will use the \(k\)-means algorithm, which aims at forming clusters that are as compact as possible with respect to a given similarity metric. By default, this metric is the Euclidean distance, which gives, in our example:

$k$-means clustering with Euclidean distance
Euclidean \(k\)-means clustering of the dataset presented in Figure 3. Each subfigure represents series from a given cluster and their centroid (in orange).

As one can see, this is not fully satisfactory. First, Cluster 2 mixes two distinct time series shapes. Second, the barycenters for each cluster are not especially representative of the time series gathered in the clusters. Even Cluster 1, which seems to be the “purest” one, suffers from this last pitfall, since the local oscillations that are observed towards the end of the series have a lower magnitude in the reconstructed barycenter than in the series themselves.

Let us now switch to Dynamic Time Warping as the core metric for our \(k\)-means algorithm. The resulting clusters are this time closer to what one could expect:

$k$-means clustering with DTW
DTW \(k\)-means clustering of the dataset presented in Figure 3. Each subfigure represents series from a given cluster and their centroid (in orange).

This is because time series in each group are very similar up to a time shift, and invariance to time shifts is a known property of Dynamic Time Warping, as we will see.
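For reference, here is a minimal sketch of how such a comparison could be run with the tslearn library (assumed installed); the dataset contents and number of clusters below are placeholders, not the data used for the figures:

import numpy as np
from tslearn.clustering import TimeSeriesKMeans  # assumes tslearn is installed

# X is assumed to be an array of shape (n_series, series_length)
# holding the univariate time series dataset shown above
X = np.random.randn(30, 50)  # placeholder data for illustration

# k-means with Euclidean distance (the default metric)
km_euc = TimeSeriesKMeans(n_clusters=3, metric="euclidean", random_state=0)
labels_euc = km_euc.fit_predict(X)

# k-means with DTW as the core metric
km_dtw = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
labels_dtw = km_dtw.fit_predict(X)

# Cluster centroids are available as km_dtw.cluster_centers_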

Dynamic Time Warping

We will now review Dynamic Time Warping (DTW) in more detail. DTW is a similarity measure between time series that was introduced independently in the literature by [Vint68] and [SaCh78], in both cases for speech applications. Note that, in this series of posts, we will stick to the formalism from [SaCh78], which is more standard in the literature.

Let us consider two time series \(x\) and \({x}^\prime\) of respective lengths \(n\) and \(m\). Here, all elements \(x_i\) and \(x^\prime_j\) are assumed to lie in the same \(p\)-dimensional space and the exact timestamps at which observations occur are disregarded: only their ordering matters.

Dynamic Time Warping seeks the temporal alignment (that is, a matching between time indexes of the two time series) that minimizes the Euclidean distance between aligned series, as illustrated in the Figure below:

Dynamic Time Warping is equivalent to minimizing the Euclidean distance between aligned time series over all admissible temporal alignments. Cyan dots correspond to repetitions of time series elements induced by the optimal temporal alignment retrieved by DTW. Note how, once these repeated elements are inserted, all matchings become vertical, which is the typical behavior of Euclidean distance, as shown in Figure 2.

Problem formulation

More formally, the optimization problem writes:

\[DTW_q({x}, {x}^\prime) = \min_{\pi \in \mathcal{A}({x}, {x}^\prime)} \left( \sum_{(i, j) \in \pi} d(x_i, x^\prime_j)^q \right)^{\frac{1}{q}} \](1)

Here, an alignment path \(\pi\) of length \(K\) is a sequence of \(K\) index pairs \(\left((i_0, j_0), \dots , (i_{K-1}, j_{K-1})\right)\) and \(\mathcal{A}({x}, {x}^\prime)\) is the set of all admissible paths. In order to be considered admissible, a path should satisfy the following conditions:

- the beginning and end of the time series are matched to each other, that is \(\pi_0 = (0, 0)\) and \(\pi_{K-1} = (n-1, m-1)\);
- the sequence is monotonically increasing in both \(i\) and \(j\), and all time series indexes appear at least once, which can be written \(i_{k-1} \leq i_k \leq i_{k-1} + 1\) and \(j_{k-1} \leq j_k \leq j_{k-1} + 1\).
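As an illustration, here is a minimal sketch (plain numpy, with arbitrary toy series and an arbitrarily chosen admissible path) of how the cost of a given path from Equation (1) could be computed:

import numpy as np

def path_cost(x, x_prime, path, q=2):
  # Cost of a given alignment path, as in Equation (1):
  # sum d(x_i, x'_j) ** q over matched pairs, then take the 1/q power
  return sum(
    np.linalg.norm(x[i] - x_prime[j]) ** q for (i, j) in path
  ) ** (1. / q)

x = np.array([0., 1., 2., 1., 0.])
x_prime = np.array([0., 0., 1., 2., 1.])

# One admissible path: starts at (0, 0), ends at (n-1, m-1),
# and moves by steps in {(0, 1), (1, 0), (1, 1)}
path = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 4)]
print(path_cost(x, x_prime, path))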

Dot product notation

Another way to represent a DTW path is to use a binary matrix whose non-zero entries are those corresponding to a matching between time series elements. This representation is related to the index sequence representation used above through:

\[\begin{equation} (A_\pi)_{i,j} = \left\{ \begin{array}{rl} 1 & \text{ if } (i, j) \in \pi \\ 0 & \text{ otherwise} \end{array} \right. \,\,\,\,\,\,\, . \end{equation}\]

This is illustrated in the Figure below where nonzero entries in the binary matrix are represented as dots and the equivalent sequence of matchings is produced on the right:

DTW path as a matrix
Dynamic Time Warping path represented as a binary matrix. (Left) Each dot on the path indicates a nonzero entry in \(A_\pi\), hence the matching of an element in \(x\) with an element in \(x^\prime\). Note the correspondence between this representation and the one on the right.

Using matrix notation, Dynamic Time Warping can be written as the minimization of a dot product between matrices:

\[\begin{equation*} DTW_q({x}, {x}^\prime) = \min_{\pi \in \mathcal{A}({x}, {x}^\prime)} \left\langle A_\pi, D_q({x}, {x}^\prime) \right\rangle^{\frac{1}{q}} \end{equation*}\]

where \(D_q({x}, {x}^\prime)\) stores distances \(d(x_i, x^\prime_j)\) raised to the power \(q\).
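Here is a minimal sketch of this matrix formulation (plain numpy, same toy data and path as above): build \(A_\pi\) from the path and \(D_q\) from pairwise distances, then take their inner product:

import numpy as np

x = np.array([0., 1., 2., 1., 0.])
x_prime = np.array([0., 0., 1., 2., 1.])
path = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 4)]
q = 2

# Binary matrix A_pi: nonzero entries correspond to matched index pairs
A_pi = np.zeros((len(x), len(x_prime)))
for (i, j) in path:
  A_pi[i, j] = 1.

# D_q stores pairwise distances raised to the power q
D_q = np.abs(x[:, None] - x_prime[None, :]) ** q

# Cost of the path as an inner product between matrices
cost = np.sum(A_pi * D_q) ** (1. / q)
print(cost)  # same value as the sum-over-pairs formulation of Equation (1)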

Algorithmic Solution

Though the optimization problem in Equation (1) is a minimization over a finite set, the number of admissible paths (known as the Delannoy number) becomes very large even for moderate time series lengths. Assuming \(m\) and \(n\) are of the same order, there exist \(O\left(\frac{(3 + 2\sqrt{2})^n}{\sqrt{n}}\right)\) different paths in \(\mathcal{A}(x, x^\prime)\), which makes it intractable to list all paths sequentially in order to compute the minimum.
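To give a feel for this growth, here is a small sketch (function name and lengths are arbitrary) that counts admissible paths via the standard Delannoy recurrence, since every path ends with a step coming from the left, from above, or from the diagonal:

from functools import lru_cache

@lru_cache(maxsize=None)
def n_paths(n, m):
  # Number of admissible alignment paths between series of lengths n and m
  if n == 1 or m == 1:
    return 1
  return n_paths(n - 1, m) + n_paths(n, m - 1) + n_paths(n - 1, m - 1)

for length in [5, 10, 50, 100]:
  print(length, n_paths(length, length))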

Fortunately, an exact solution to this optimization problem can be found using dynamic programming. Dynamic programming relies on recurrence, which consists in linking the solution of a given problem to solutions of (easier) sub-problems. Once this link is known, the dynamic programming approach solves the original problem by recursively solving required sub-problems and storing their solutions for later use (so as not to re-compute subproblems several times).

In the case of DTW, we need to rely on the following quantity:

\[ R_{i,j} = DTW_q({x}_{\rightarrow i}, {x}^\prime_{\rightarrow j})^q \]

where the notation \({x}_{\rightarrow i}\) denotes time series \({x}\) observed up to timestamp \(i\) (inclusive). Then, we can observe that:

\[ \begin{aligned} R_{i,j} &= \min_{\pi \in \mathcal{A}({x}_{\rightarrow i}, {x}^\prime_{\rightarrow j})} \sum_{(k, l) \in \pi} d(x_k, x^\prime_l)^q \\ &\stackrel{*}{=} d(x_i, x^\prime_j)^q + \min_{\pi \in \mathcal{A}({x}_{\rightarrow i}, {x}^\prime_{\rightarrow j})} \sum_{(k, l) \in \pi[:-1]} d(x_k, x^\prime_l)^q \\ &\stackrel{**}{=} d(x_i, x^\prime_j)^q + \min ({\color{MidnightBlue}R_{i-1, j}}, {\color{Red}R_{i, j-1}}, {\color{ForestGreen}R_{i-1, j-1}}) \end{aligned} \](2)

\((*)\) comes from the constraints on admissible paths \(\pi\): the last element on an admissible path needs to match the last elements of the series. Also, \((**)\) results from the contiguity conditions on the admissible paths. Indeed, a path that aligns time series \({x}_{\rightarrow i}\) and \({x}^\prime_{\rightarrow j}\) necessarily encapsulates either:

- a path that aligns \({x}_{\rightarrow i-1}\) and \({x}^\prime_{\rightarrow j}\) (with cost \({\color{MidnightBlue}R_{i-1, j}}\)),
- a path that aligns \({x}_{\rightarrow i}\) and \({x}^\prime_{\rightarrow j-1}\) (with cost \({\color{Red}R_{i, j-1}}\)), or
- a path that aligns \({x}_{\rightarrow i-1}\) and \({x}^\prime_{\rightarrow j-1}\) (with cost \({\color{ForestGreen}R_{i-1, j-1}}\)),

as illustrated in the Figure below:

DTW transitions
Valid DTW transitions.

This implies that filling a matrix that stores the \(R_{i,j}\) terms row-by-row (in practice, the matrix could be filled column-by-column too; the important part is that the terms \(R_{i-1, j}\), \(R_{i, j-1}\) and \(R_{i-1, j-1}\) are accessible when computing \(R_{i, j}\); when vectorizing code is of importance, an even better strategy is to compute the \(R\) terms one anti-diagonal at a time [TrDe20]) is sufficient to retrieve \(DTW_q({x}, {x}^\prime)\) as \({R_{n-1, m-1}}^{\frac{1}{q}}\).

These observations result in the following \(O(mn)\) algorithm to compute the exact optimum for DTW (assuming computation of \(d(\cdot,\cdot)\) is \(O(1)\)):

  
import numpy as np

def d(xi, xj_prime):
  # Ground distance between feature vectors (here, the Euclidean norm)
  return np.linalg.norm(np.atleast_1d(xi) - np.atleast_1d(xj_prime))

def dtw(x, x_prime, q=2):
  # R[i, j] stores DTW_q(x_{->i}, x'_{->j}) ** q
  R = np.zeros((len(x), len(x_prime)))
  for i in range(len(x)):
    for j in range(len(x_prime)):
      R[i, j] = d(x[i], x_prime[j]) ** q
      if i > 0 or j > 0:
        R[i, j] += min(
          R[i-1, j  ] if i > 0             else np.inf,
          R[i  , j-1] if j > 0             else np.inf,
          R[i-1, j-1] if (i > 0 and j > 0) else np.inf
          # Note that these 3 terms cannot all be
          # inf if we have (i > 0 or j > 0)
        )

  return R[-1, -1] ** (1. / q)
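As a quick usage sketch (toy values, not the series from the figures above), this function can then be called on two series of possibly different lengths:

x = np.array([0., 0., 1., 2., 1., 0.])
x_prime = np.array([0., 1., 2., 2., 1., 0., 0.])
print(dtw(x, x_prime))       # DTW with q=2 (the default)
print(dtw(x, x_prime, q=1))  # DTW with q=1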
  

Properties

Dynamic Time Warping holds a few of the basic metric properties, such as:

- non-negativity: \(DTW_q({x}, {x}^\prime) \geq 0\) for all \({x}, {x}^\prime\);
- symmetry: \(DTW_q({x}, {x}^\prime) = DTW_q({x}^\prime, {x})\);
- \(DTW_q({x}, {x}) = 0\) for any time series \({x}\).

However, mathematically speaking, DTW is not a valid metric since it satisfies neither the triangle inequality nor the identity of indiscernibles. More specifically, DTW is invariant to time shifts. In other words, if \({x}\) is a time series that is constant except for a motif that occurs at some point in the series, and if \({x}_{+k}\) is a copy of \({x}\) in which the motif is temporally shifted by \(k\) timestamps, then \(DTW_q({x}, {x}_{+k}) = 0\), as illustrated below:

Contrary to Euclidean distance, DTW is invariant to time shifts between series.
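This invariance can be checked numerically, for instance with the dtw function defined above (the motif and shift values below are arbitrary toy data):

# A constant series with a motif, and a copy in which the motif is shifted
x = np.array([0., 0., 1., 2., 1., 0., 0., 0., 0., 0.])
x_shifted = np.array([0., 0., 0., 0., 0., 1., 2., 1., 0., 0.])

print(dtw(x, x_shifted))              # 0.0: DTW is blind to the time shift
print(np.linalg.norm(x - x_shifted))  # > 0: Euclidean distance is not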

Setting Additional Constraints

As we have seen, Dynamic Time Warping is invariant to time shifts, whatever their temporal span. In order to allow invariances to local deformations only, one can impose additional constraints on the set of admissible paths.

Such constraints typically translate into enforcing nonzero entries in \(A_\pi\) to stay close to the diagonal. The Sakoe-Chiba band [SaCh78] is a constant-width band parametrized by a radius \(r\) (sometimes also called the warping window size). Another standard global constraint is the Itakura parallelogram [Itak75], which sets a maximum slope \(s\) for alignment paths and leads to a parallelogram-shaped constraint. The impact of these parameters is illustrated in the Figure below:

Visualization of DTW global constraints: Sakoe-Chiba band (left) and Itakura parallelogram (right). Here, each colored cell corresponds to an index pair \((i, j)\) that is valid under the considered constraint.

In practice, global constraints on admissible DTW paths restrict the set of possible matches for each element in a time series. The number of possible matches for an element is always \(2r+1\) for Sakoe-Chiba constraints (except for border elements), while it varies depending on the time index for Itakura parallelograms:

Valid matches for several global constraint schemes. For each position in the upper time series, all valid matches in the lower time series are represented through gray connections.

As stated above, setting such constraints leads to restricting the shift invariance to local shifts only. Typically, DTW with a Sakoe-Chiba band constraint of radius \(r\) is invariant to time shifts of magnitude up to \(r\), but is no longer invariant to longer time shifts:

Impact of time shifts on a DTW with a Sakoe-Chiba band constraint of radius \(r\).
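In practice, such constrained variants are available in most DTW implementations. Below is a minimal sketch using the tslearn library (assumed installed), with arbitrary series and parameter values:

import numpy as np
from tslearn.metrics import dtw  # assumes tslearn is installed

x = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
x_prime = np.array([0., 1., 2., 1., 0., 0., 0., 0.])

# Unconstrained DTW
print(dtw(x, x_prime))

# DTW restricted to a Sakoe-Chiba band of radius r=1
print(dtw(x, x_prime, global_constraint="sakoe_chiba", sakoe_chiba_radius=1))

# DTW restricted to an Itakura parallelogram of maximum slope s=2
print(dtw(x, x_prime, global_constraint="itakura", itakura_max_slope=2.))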

Conclusion

We have seen in this post how alignment-based metrics can prove useful when dealing with temporally shifted time series. We have presented in more detail the most common of these metrics, Dynamic Time Warping (DTW). If you enjoyed this post, check out this other one on the specific topic of the differentiability of DTW.

References

[Itak75]
Fumitada Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 1975. https://www.ee.columbia.edu/~dpwe/papers/Itak75-lpcasr.pdf
[SaCh78]
Hiroaki Sakoe & Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 1978. https://www.irit.fr/~Julien.Pinquier/Docs/TP_MABS/res/dtw-sakoe-chiba78.pdf
[TrDe20]
Christopher Tralie & Elizabeth Dempsey. Exact, parallelizable dynamic time warping alignment with linear memory. In Proceedings of the International Society for Music Information Retrieval Conference, 2020. https://arxiv.org/abs/2008.02734
[Vint68]
Taras K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 1968. https://link.springer.com/content/pdf/10.1007/BF01074755.pdf