Hi all, I am the creator of STUMPY and wanted to thank you for your interest. Please feel free to post questions on our GitHub issues and we'll try to assist where we can.
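For anyone new to it, STUMPY computes the matrix profile of a time series. As a rough illustration of what that object is — not STUMPY's actual algorithm, which is far more efficient and uses z-normalized distances — here is a naive O(n^2) pure-Python sketch:

```python
import math

def matrix_profile(T, m):
    # Naive sketch of the matrix-profile idea: for each length-m subsequence
    # of T, the Euclidean distance to its nearest neighbor, excluding
    # trivial (overlapping) matches via an exclusion zone of width m.
    # Assumes T is long enough that every i has at least one candidate j.
    n = len(T) - m + 1
    subs = [T[i:i + m] for i in range(n)]
    mp = []
    for i in range(n):
        best = min(
            math.dist(subs[i], subs[j])
            for j in range(n)
            if abs(i - j) >= m  # skip self and overlapping neighbors
        )
        mp.append(best)
    return mp
```

A low value in the profile marks a subsequence that recurs almost exactly elsewhere (a motif); a high value marks a discord.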
But this is not about clustering. It's about figuring out to what extent a certain subclass of features, namely the 'shapelets', is statistically significantly associated with a pre-defined binary outcome.
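One common way to frame that question — a sketch of the general idea, not the specific method under discussion — is to turn each shapelet into a per-series feature (the minimum distance between the shapelet and any subsequence of the series) and then test that feature's association with the binary outcome, e.g. via a permutation test. All function names below are my own:

```python
import math
import random

def min_dist(T, S):
    # Smallest Euclidean distance between shapelet S and any
    # length-len(S) subsequence of series T.
    m = len(S)
    return min(math.dist(T[i:i + m], S) for i in range(len(T) - m + 1))

def permutation_pvalue(features, labels, n_perm=2000, seed=0):
    # Permutation test for association between a scalar feature and a
    # binary label: is the observed between-class mean gap unusually
    # large compared to gaps under randomly shuffled labels?
    rng = random.Random(seed)

    def gap(lab):
        g0 = [f for f, y in zip(features, lab) if y == 0]
        g1 = [f for f, y in zip(features, lab) if y == 1]
        return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

    observed = gap(labels)
    hits = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)  # shuffling preserves class counts
        hits += gap(perm) >= observed
    return (hits + 1) / (n_perm + 1)  # add-one correction
```

A small p-value then supports the claim that the shapelet distance is associated with the outcome, though proper multiple-testing correction would be needed when many candidate shapelets are screened.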
The paper you mentioned is interesting, though, because it shows an issue that many algorithms are prone to: if the number of samples/features gets too large, at some point you are only comparing _means_.
(We are working on a paper to show the issues of this when it comes to time series classification.)
The math in their description of the data is in error: they need to state that the T_i (T with a subscript i), for i = 0, 1, 2, ..., n, are distinct.
More standard would be a function d: {0, 1, ..., n} --> R^{1 x m} x {0, 1}, mapping each index to a (time series, label) pair.
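To make the distinctness requirement concrete, here is a hypothetical encoding of such a function d as a Python mapping from index to a (series, label) pair, plus a check that the T_i are distinct (names and data are illustrative only):

```python
# Hypothetical encoding of d: {0, ..., n} -> R^{1 x m} x {0, 1}
# as a mapping from index i to the pair (T_i, y_i).
dataset = {
    0: ([0.1, 0.5, 0.2], 1),
    1: ([0.0, 0.0, 0.1], 0),
    2: ([0.4, 0.6, 0.3], 1),
}

def all_distinct(d):
    # True iff no two indices map to the same time series T_i.
    seen = set()
    for series, _label in d.values():
        key = tuple(series)
        if key in seen:
            return False
        seen.add(key)
    return True
```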
Seems to be standard terminology for time series classification to me, to be honest. I think the approach would also work if there are duplicates in the data, although the estimate would be overly optimistic, right?
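On the duplicates point, here is a toy illustration (my own construction, in pure Python) of why exact duplicates make resampling-style estimates optimistic: with copies in the data, each copy's nearest neighbor is its twin with the same label, so leave-one-out 1-NN accuracy is inflated regardless of any real signal.

```python
import math

def loo_1nn_accuracy(X, y):
    # Leave-one-out 1-nearest-neighbor accuracy: each point is
    # classified by the label of its nearest other point.
    correct = 0
    for i in range(len(X)):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: math.dist(X[i], X[k]))
        correct += y[j] == y[i]
    return correct / len(X)
```

On the XOR configuration below, 1-NN gets every point wrong; duplicating the dataset jumps the estimate to 100% even though no generalizable structure was added.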
With their notation they have not specified that the T's are unique. So a first fix-up would be just to state that the T's are distinct. It would also help to be explicit that i = 0, 1, 2, ... corresponds to increasing time. Moreover, is the data equally spaced in time? Likely yes, and in that case, clearly say so.