Article

On the need for time series data mining benchmarks: a survey and empirical demonstration

Authors:
Eamonn Keogh

University of California - Riverside, Riverside, CA

Shruti Kasetty

University of California - Riverside, Riverside, CA


KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 2002, Pages 102–111, https://doi.org/10.1145/775047.775062

Published: 23 July 2002

ABSTRACT

In the last decade there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim: much of this work has very little utility because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offers an amount of "improvement" that would have been completely dwarfed by the variance that would have been observed by testing on many real-world datasets, or the variance that would have been observed by changing minor (unstated) implementation details.

To illustrate our point, we have undertaken the most exhaustive set of time series experiments ever attempted, re-implementing the contribution of more than two dozen papers, and testing them on 50 real-world, highly diverse datasets. Our empirical results strongly support our assertion, and suggest the need for a set of time series benchmarks and more careful empirical evaluation in the data mining community.
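
The abstract's central argument, that a reported improvement can be smaller than the variation seen across datasets, can be illustrated with a minimal sketch. This is not the authors' experimental procedure: the function name `improvement_vs_spread` is hypothetical, and the error rates are invented purely to exercise the code, not taken from the paper.

```python
from statistics import mean, stdev


def improvement_vs_spread(baseline_errors, proposed_errors):
    """Compare the mean gain of a method against its per-dataset spread.

    Both arguments are error rates measured on the same datasets, in the
    same order. Returns (mean_gain, spread, wins), where a positive gain
    means the proposed method had the lower error on that dataset.
    """
    if len(baseline_errors) != len(proposed_errors):
        raise ValueError("both methods must be tested on the same datasets")
    if len(baseline_errors) < 2:
        raise ValueError("at least two datasets are needed to estimate spread")

    # Per-dataset improvement: baseline error minus proposed error.
    gains = [b - p for b, p in zip(baseline_errors, proposed_errors)]
    mean_gain = mean(gains)
    spread = stdev(gains)  # sample standard deviation across datasets
    wins = sum(g > 0 for g in gains)
    return mean_gain, spread, wins


# Invented error rates on five hypothetical datasets (illustration only).
baseline = [0.20, 0.35, 0.10, 0.42, 0.27]
proposed = [0.15, 0.39, 0.09, 0.45, 0.21]

mean_gain, spread, wins = improvement_vs_spread(baseline, proposed)
print(f"mean gain: {mean_gain:.3f}, spread across datasets: {spread:.3f}")
print(f"proposed method wins on {wins} of {len(baseline)} datasets")
```

With these made-up numbers the average gain is several times smaller than the spread across datasets, which is the kind of situation the abstract warns about: a result reported on one or two favourable datasets would say little about the method in general.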

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN: 158113567X
DOI: 10.1145/775047
Conference Chair: Osmar R. Zaïane (University of Alberta, Canada)
General Chair: Randy Goebel (University of Alberta, Canada)
Program Chairs: David Hand (Imperial College, UK), Daniel Keim (AT&T), Raymond Ng (University of British Columbia, Canada)
Copyright © 2002 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data mining
experimental evaluation
time series
Qualifiers
- Article
Conference

Acceptance Rates
KDD '02 Paper Acceptance Rate: 44 of 307 submissions, 14%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%