The method presented in this section defines a kernel between sets of timestamped objects (typically local features). In particular, this makes it possible to handle irregularly sampled time series.
Note. This work was part of Adeline Bailly's PhD thesis. We co-supervised Adeline together with Laetitia Chapel.
Match kernel and Signature Quadratic Form Distance
Our method relies on a kernel $k(\cdot, \cdot)$ between local features. Based on this local kernel, one can compute the match kernel (Bo & Sminchisescu, 2009) between sets of local features as:

$$K(x, x') = \sum_i \sum_j k(x_i, x'_j)$$

and the Signature Quadratic Form Distance (SQFD; Beecks, Uysal, & Seidl, 2009) is the distance between feature sets $x$ and $x'$ embedded in the Reproducing Kernel Hilbert Space (RKHS) associated with $K$:

$$\mathrm{SQFD}(x, x') = \sqrt{K(x, x) + K(x', x') - 2 K(x, x')}.$$
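To make these definitions concrete, here is a minimal sketch of exact match kernel and SQFD computations, assuming an RBF local kernel and hypothetical feature sets:

import numpy
from scipy.spatial.distance import cdist

def match_kernel(x1, x2, gamma_f):
    # Sum the local RBF kernel values over all pairs of features
    return numpy.sum(numpy.exp(-gamma_f * cdist(x1, x2, "sqeuclidean")))

def sqfd(x1, x2, gamma_f):
    # Distance between the two sets embedded in the RKHS associated with K
    return numpy.sqrt(
        match_kernel(x1, x1, gamma_f)
        + match_kernel(x2, x2, gamma_f)
        - 2. * match_kernel(x1, x2, gamma_f)
    )

# Hypothetical feature sets: 20 and 30 local features of dimension 8
rng = numpy.random.RandomState(0)
x, x_prime = rng.randn(20, 8), rng.randn(30, 8)
print(sqfd(x, x_prime, gamma_f=10.))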
Local Temporal Kernel
We introduce a time-sensitive local kernel defined as:
$$k_t\left((x_i, t_i), (x'_j, t'_j)\right) = e^{-\gamma_t (t'_j - t_i)^2} k(x_i, x'_j).$$

This kernel is positive semi-definite (psd), as the product of two psd kernels, and, if $k$ is the radial basis function (RBF) kernel $k(x, x') = e^{-\gamma_f \| x - x' \|^2}$, it can be written as:

$$k_t\left((x_i, t_i), (x'_j, t'_j)\right) = k\left(g(x_i, t_i), g(x'_j, t'_j)\right)$$

with

$$g(x_i, t_i) = \left(x_{i, 0}, \dots, x_{i, d-1}, \sqrt{\frac{\gamma_t}{\gamma_f}} \, t_i\right)$$

where $x_{i, l}$ denotes the $l$-th feature of the $i$-th observation in $x$. Indeed, the extra coordinate of $g$ contributes $\gamma_f \cdot \frac{\gamma_t}{\gamma_f} (t'_j - t_i)^2 = \gamma_t (t'_j - t_i)^2$ to the exponent of the RBF kernel.
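As a quick sanity check, the following snippet (with hypothetical feature and timestamp values) verifies this identity numerically:

import numpy

gamma_f, gamma_t = 10., 50.
ratio = numpy.sqrt(gamma_t / gamma_f)

# Hypothetical local features and their timestamps
x_i, x_j = numpy.array([.3, 1.2]), numpy.array([.5, .9])
t_i, t_j = .2, .35

# Direct computation: temporal weighting factor times the local RBF kernel
k_t = (numpy.exp(-gamma_t * (t_j - t_i) ** 2)
       * numpy.exp(-gamma_f * numpy.sum((x_i - x_j) ** 2)))

# Same value, computed through the augmented representation g
g_i = numpy.hstack((x_i, ratio * t_i))
g_j = numpy.hstack((x_j, ratio * t_j))
k_t_aug = numpy.exp(-gamma_f * numpy.sum((g_i - g_j) ** 2))

assert numpy.isclose(k_t, k_t_aug)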
The code below illustrates the impact of the ratio $\sqrt{\gamma_t / \gamma_f}$ on the kernel matrix (a larger $\gamma_t$ leads to paying less attention to off-diagonal elements):
import numpy
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

def g(x, ratio):
    # Append a scaled time dimension (linearly spaced in [0, 1]) to each observation
    sz = x.shape[0]
    return numpy.hstack(
        (
            x,
            ratio * numpy.linspace(0., 1., sz).reshape((-1, 1))
        )
    )

# Hypothetical input series (any univariate series, as column vectors, would do)
s_y1 = numpy.sin(numpy.linspace(0., 4. * numpy.pi, 100)).reshape((-1, 1))
s_y2 = numpy.sin(numpy.linspace(0., 4. * numpy.pi, 120) + 1.).reshape((-1, 1))

gamma_f = 10.
gamma_t = 50.
ratio = numpy.sqrt(gamma_t / gamma_f)
# Build augmented representations ($g$ function)
s_y1_t = g(s_y1, ratio)
s_y2_t = g(s_y2, ratio)
# Plotting stuff: axes rectangles in figure coordinates (hypothetical layout)
rect_gram = [.25, .25, .7, .7]
rect_s_x = [.25, .05, .7, .15]
rect_s_y = [.05, .25, .15, .7]
plt.figure(figsize=(5, 5))
ax_gram = plt.axes(rect_gram)
ax_s_x = plt.axes(rect_s_x)
ax_s_y = plt.axes(rect_s_y)
ax_gram.axis("off")
ax_s_x.axis("off")
ax_s_x.set_xlim((0, s_y2.shape[0] - 1))
ax_s_y.axis("off")
ax_s_y.set_ylim((0, s_y1.shape[0] - 1))
# Show kernel matrix and series on the same plot
gram = numpy.exp(-gamma_f * cdist(s_y1_t, s_y2_t, "sqeuclidean"))
ax_gram.imshow(gram)
ax_s_x.plot(numpy.arange(s_y2.shape[0]), s_y2, "b-", linewidth=3.)
ax_s_y.plot(-s_y1, numpy.arange(s_y1.shape[0])[::-1], "b-", linewidth=3.)
plt.show()
$k_t$ is then an RBF kernel itself, and Random Fourier Features (Rahimi & Recht, 2008) can be used to approximate it with a linear kernel.
If $\phi$ is a feature map such that

$$k_t\left((x_i, t_i), (x'_j, t'_j)\right) \approx \left\langle \phi(g(x_i, t_i)), \phi(g(x'_j, t'_j)) \right\rangle,$$

then

$$\mathrm{SQFD}(x, x') \approx \left\| b_\phi(x) - b_\phi(x') \right\|$$

where $b_\phi(x)$ is the barycenter of the maps $\phi(g(x_i, t_i))$ of the features in $x$. In other words, once feature sets are embedded in this finite-dimensional space, approximate SQFD computation is performed through (i) a barycenter computation $b_\phi(\cdot)$ in the feature space (which can be performed offline) followed by (ii) a Euclidean distance computation with a time complexity of $O(D)$, where $D$ is the dimension of the feature map $\phi$. Overall, we get a distance between timestamped feature sets whose precision / complexity tradeoff can be tuned via the map dimensionality $D$.
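As an illustration, here is a minimal sketch of this approximate computation, with a hand-rolled Random Fourier Feature map. All names, set sizes, and parameter values are hypothetical, and the barycenter is taken as a uniform mean over the set:

import numpy

def rff_map(X, gamma, D, random_state=0):
    # Random Fourier Features: with W drawn from N(0, 2 * gamma * I) and
    # uniform phases b, sqrt(2 / D) * cos(X W + b) approximates the RBF kernel
    rng = numpy.random.RandomState(random_state)
    W = rng.normal(scale=numpy.sqrt(2. * gamma), size=(X.shape[1], D))
    b = rng.uniform(0., 2. * numpy.pi, size=D)
    return numpy.sqrt(2. / D) * numpy.cos(X.dot(W) + b)

def barycenter_embedding(features, timestamps, gamma_f, gamma_t, D):
    # Augment each feature with its scaled timestamp (the g map above), then
    # average the feature maps: this is b_phi, computable offline for each set
    ratio = numpy.sqrt(gamma_t / gamma_f)
    augmented = numpy.hstack((features, ratio * timestamps.reshape((-1, 1))))
    return rff_map(augmented, gamma=gamma_f, D=D).mean(axis=0)

# Hypothetical timestamped feature sets of different sizes
rng = numpy.random.RandomState(0)
feats1, times1 = rng.randn(20, 8), numpy.linspace(0., 1., 20)
feats2, times2 = rng.randn(30, 8), numpy.linspace(0., 1., 30)

# The same random map (same seed and dimension D) must be shared across sets
b1 = barycenter_embedding(feats1, times1, gamma_f=10., gamma_t=50., D=128)
b2 = barycenter_embedding(feats2, times2, gamma_f=10., gamma_t=50., D=128)
approx_sqfd = numpy.linalg.norm(b1 - b2)  # O(D) once b1 and b2 are precomputed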
Evaluation
In order to evaluate the method presented above, we used the UCR Time Series Classification archive (Bagnall, Lines, Vickers, & Keogh, 2018), which, at the time, was made of monodimensional time series only. We decided not to work on raw data but rather to extract local features to describe our time series. We chose to rely on temporal SIFT features, which we had introduced in (Bailly, Malinowski, Tavenard, Guyet, & Chapel, 2015; Bailly, Malinowski, Tavenard, Chapel, & Guyet, 2016). These features are straightforward 1D adaptations of the Scale-Invariant Feature Transform (SIFT) framework introduced in computer vision (Lowe, 2004). Note that the use of such handcrafted features was already outdated in the computer vision community at the time of this work; however, in our small-data context, they proved useful for the task at hand.
We show in (Tavenard et al., 2017) that kernel approximation leads to better trade-offs between computational complexity and approximation quality than a pre-processing of the feature sets relying on k-means clustering. We also show that the obtained distance, once embedded in a Support Vector Machine with a Gaussian kernel, leads to classification performance that is competitive with the state of the art.
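For completeness, here is a minimal sketch of this last step, assuming pairwise (approximate) SQFD matrices have already been computed; all data below are hypothetical placeholders. Since the approximate SQFD is a Euclidean distance between barycenter embeddings, the Gaussian kernel built on top of it is a valid psd kernel:

import numpy
from sklearn.svm import SVC

# Stand-ins for real barycenter embeddings and labels (hypothetical data)
rng = numpy.random.RandomState(0)
emb_train, y_train = rng.randn(40, 128), rng.randint(0, 2, 40)
emb_test = rng.randn(10, 128)

# Pairwise approximate SQFD values (Euclidean distances between embeddings)
dists_train = numpy.linalg.norm(emb_train[:, None] - emb_train[None], axis=-1)
dists_test = numpy.linalg.norm(emb_test[:, None] - emb_train[None], axis=-1)

# Gaussian kernel on top of the distance, fed to an SVM as a precomputed kernel
gamma_svm = .01  # bandwidth, typically tuned by cross-validation
clf = SVC(kernel="precomputed")
clf.fit(numpy.exp(-gamma_svm * dists_train ** 2), y_train)
y_pred = clf.predict(numpy.exp(-gamma_svm * dists_test ** 2))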
References
- Bo, L., & Sminchisescu, C. (2009). Efficient Match Kernel between Sets of Features for Visual Recognition. In Neural Information Processing Systems (pp. 135–143).
- Beecks, C., Uysal, M. S., & Seidl, T. (2009). Signature Quadratic Form Distances for Content-Based Similarity. In Proceedings of the ACM International Conference on Multimedia (pp. 697–700).
- Rahimi, A., & Recht, B. (2008). Random Features for Large-Scale Kernel Machines. In Neural Information Processing Systems (pp. 1177–1184).
- Bagnall, A., Lines, J., Vickers, W., & Keogh, E. (2018). The UEA & UCR Time Series Classification Repository. Retrieved from www.timeseriesclassification.com
- Bailly, A., Malinowski, S., Tavenard, R., Guyet, T., & Chapel, L. (2015). Bag-of-Temporal-SIFT-Words for Time Series Classification. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data. Porto, Portugal.
- Bailly, A., Malinowski, S., Tavenard, R., Chapel, L., & Guyet, T. (2016). Dense Bag-of-Temporal-SIFT-Words for Time Series Classification. In Advanced Analysis and Learning on Temporal Data. Springer. https://doi.org/10.1007/978-3-319-44412-3_2
- Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2), 91–110.
- Tavenard, R., Malinowski, S., Chapel, L., Bailly, A., Sanchez, H., & Bustos, B. (2017). Efficient Temporal Kernels between Feature Sets for Time Series Classification. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery.