We are witnessing an unprecedented growth in the amount of data that is being collected and made available for data mining. While the availability of large-scale datasets presents exciting opportunities for advancing sciences, healthcare, understanding of human behavior etc., mining the data set for useful information becomes a computationally challenging task. We are in an era where the volume of data is growing faster than the rate at which available computing power is growing, thereby creating a dire need for computationally efficient algorithms for data mining. From an algorithmic complexity standpoint, we are transitioning from a mindset where algorithms with linear complexity in the size of the data set were considered efficient to an era when algorithms with linear complexity have become infeasible owing to the large size of the datasets. This necessitates the creation of algorithms with sub-linear time complexity tailored to big data.

One of the most fundamental data analytics tasks is that of querying a data set to see if a particular pattern of symbols appears in the data set either exactly or approximately. We consider the version of the problem where sketches of the original signal can be computed offline and stored and the objective is to minimize the sketching complexity and the complexity of answering queries. After providing a brief overview of existing algorithms for fast substring matching, I will describe our algorithm for substring matching which leverages the sparse Fourier transform computation-based approach introduced by Pawar and Ramchandran. We show that our algorithm can find matches with high probability (asymptotically in the size of the dataset and the query) with sub-linear time complexity. Under some conditions, our algorithm outperforms many other substring matching algorithms in the literature.

Potential applications of this work include text matching, audio/image matching, DNA matching in genomics, metabolomics, radio astronomy, searching for signatures of events within large databases, detecting viruses within binary executable files.