
Diluvian Clustering: A Fast, Effective Algorithm for Clustering Compositional and Other Data

Published online by Cambridge University Press:  24 August 2015

Nicholas W. M. Ritchie*
Affiliation:
Materials Measurement Science Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD 20899-8372, USA
*Corresponding author. nicholas.ritchie@nist.gov

Abstract

Diluvian Clustering is an unsupervised grid-based clustering algorithm well suited to interpreting large sets of noisy compositional data. The algorithm is notable for its ability to identify clusters that are either compact or diffuse and clusters that have either many or few members. Diluvian Clustering differs fundamentally from most algorithms previously applied to cluster compositional data in that its implementation does not depend upon a metric. In two dimensions, the algorithm reduces to a case for which there is an intuitive, real-world parallel. Furthermore, the algorithm has few tunable parameters, and these parameters have intuitive interpretations. Eliminating the dependence on an explicit metric makes it possible to derive reasonable clusters with disparate variances like those in real-world compositional data sets. The algorithm is computationally efficient: while the worst case scales as O(N²), most cases are closer to O(N), where N is the number of discrete data points. On a mid-range 2014-vintage computer, a typical 20,000-particle, 30-element data set can be clustered in a fraction of a second.
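The abstract does not spell out the steps of Diluvian Clustering, so the sketch below is not the published algorithm. It is a generic illustration of the grid-based idea only: points are binned into cells, and occupied cells that share a face are merged into one cluster, without computing any point-to-point distance. The function name `grid_cluster` and the parameters `cell_size` and `min_count` are hypothetical and do not come from the paper.

```python
from collections import defaultdict, deque
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, ...]


def grid_cluster(points: Sequence[Sequence[float]],
                 cell_size: float,
                 min_count: int = 1) -> List[int]:
    """Generic grid-based clustering sketch (NOT the paper's algorithm).

    Returns one label per point; points in cells holding fewer than
    min_count points get the label -1.
    """
    # Bin every point into an integer grid cell in a single pass over the data.
    cells: Dict[Cell, List[int]] = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple(int(x // cell_size) for x in p)].append(i)

    # Keep only cells that hold at least min_count points.
    dense = {c for c, members in cells.items() if len(members) >= min_count}

    labels = [-1] * len(points)
    seen = set()
    next_label = 0
    # Sorting the cells makes the label numbering deterministic.
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        # Breadth-first search over face-adjacent dense cells.
        while queue:
            cell = queue.popleft()
            for i in cells[cell]:
                labels[i] = next_label
            # Each cell has 2 * d face neighbours in d dimensions.
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        next_label += 1
    return labels
```

In this sketch the work is one pass to bin the points plus a traversal of the occupied cells, so the cost grows roughly linearly with the number of points. The scaling figures quoted in the abstract refer to the published algorithm, not to this illustration.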

Type
Equipment and Software Development
Copyright
© Microscopy Society of America 2015 


Footnotes

a

Official contribution of the National Institute of Standards and Technology; not subject to copyright in the United States.
