Many scientific applications can benefit from an efficient clustering algorithm of massively large high dimensional datasets. However most of the developed algorithms are impractical to use when the amount of data is very large. Given N objects each defined by an M-dimensional feature vector, any clustering technique for handling very large datasets in high dimensional space should run in time O(MN) at best, and O(MN log N) in the worst case, using no more than O(NM) storage, for it to be practical. We introduce a hybrid algorithm, called HyCeltyc, for clustering massively large high dimensional datasets in O(MN) time which is linear in the size of the data. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, selection of significant features on which to cluster the data and a grid-based clustering algorithm that is linear in the data size.
Bibliographical noteFunding Information:
This work is supported by the Director, Office of Laboratory Policy and Infrastructure Management of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. This research used resources of the National Energy Research Scientific Computing (NERSC), which is supported by the Office of Science of the U.S. Department of Energy. We express our thanks to Yudong Tan for running and generating the graphs that compare the qualities and running times of some of the clustering algorithms mentioned.
All Science Journal Classification (ASJC) codes
- Information Systems
- Hardware and Architecture
- Computer Networks and Communications
- Artificial Intelligence