Finding Groups in Data: An Introduction to Cluster Analysis

Leonard Kaufman and Peter J. Rousseeuw

Summary

Cluster analysis is the art of finding groups in data. It is applied in many domains, sometimes under other names such as numerical taxonomy or automatic data classification. Finding Groups in Data: An Introduction to Cluster Analysis presents a small set of clustering methods that have many applications.

The authors show the general user, who may not have a mathematical or statistical background, how to use this powerful tool. The first chapter discusses the various types of data (including interval-scaled and binary variables, as well as similarity data) and shows how to choose an appropriate clustering method. The remaining six chapters cover six different clustering methods and can be read independently of one another. Each chapter follows a common format. The first three sections give a short description of the clustering method, explain how to use it, and analyze a set of examples. The following two sections (which may be skipped without loss of understanding) discuss the algorithm and its implementation, and some related methods in the literature.

The authors present three partitioning methods and three hierarchical techniques. These six procedures have been chosen for their robustness, consistency, and general applicability. Some of the methods are new, such as the approach for partitioning large data sets, and the L1 method for fuzzy clustering. Also, the clusterings are accompanied by graphical displays and corresponding quality coefficients, which help the user to select the number of clusters and to see whether the method has found groups that were actually present in the data.
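To make the idea of a partition built around representative objects, together with a quality coefficient, more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions and not the book's programs: the function names and the toy data are hypothetical, the medoids are found by brute-force search (feasible only for very small data sets) rather than by the algorithm used in PAM, and the average silhouette width is used as one example of a quality coefficient, since the summary does not name a specific one.

```python
from itertools import combinations


def best_medoids(diss, k):
    """Brute force: choose the k objects (medoids) that minimise the total
    dissimilarity of every object to its nearest medoid.
    Illustration only; this is not the PAM algorithm."""
    n = len(diss)

    def cost(meds):
        return sum(min(diss[i][m] for m in meds) for i in range(n))

    return min(combinations(range(n), k), key=cost)


def assign(diss, medoids):
    """Label each object with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: diss[i][medoids[c]])
            for i in range(len(diss))]


def average_silhouette(diss, labels):
    """Average silhouette width of a partition with at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes s(i) = 0
        # a: average dissimilarity to the other members of the own cluster
        a = sum(diss[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest average dissimilarity to any other cluster
        b = min(sum(diss[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Hypothetical toy data: six one-dimensional measurements in two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
diss = [[abs(p - q) for q in points] for p in points]

medoids2 = best_medoids(diss, 2)
labels2 = assign(diss, medoids2)
sil2 = average_silhouette(diss, labels2)
sil3 = average_silhouette(diss, assign(diss, best_medoids(diss, 3)))
```

On this toy data the partition into two clusters has the higher average silhouette width, which is the kind of comparison a user might make when selecting the number of clusters.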

The programs described here are for the IBM PC, but the source code is very portable and has been run on several types of mainframes. The programs, together with their sources and the data sets used in the book, can be obtained on floppy disks by writing to the authors.

Finding Groups in Data: An Introduction to Cluster Analysis should prove useful to applied statisticians, students, and anyone using quantitative methods.


Table of contents

  1. Introduction
    1. Types of Data and How to Handle Them
      1. Interval-Scaled Variables
      2. Dissimilarities
      3. Similarities
      4. Binary Variables
      5. Nominal, Ordinal, and Ratio Variables
      6. Mixed Variables
    2. Which Clustering Algorithm to Choose
      1. Partitioning Methods
      2. Hierarchical Methods
    3. A Schematic Overview of Our Programs
    4. Computing Dissimilarities with the Program DAISY
    Exercises and Problems

  2. Partitioning Around Medoids (Program PAM)
    1. Short Description of the Method
    2. How to Use the Program PAM
      1. Interactive Use and Input
      2. Output
      3. Missing Values
    3. Examples
    4. More on the Algorithm and the Program
      1. Description of the Algorithm
      2. Structure of the Program
    5. Related Methods and References
      1. The k-Medoid Method and Optimal Plant Location
      2. Other Methods Based on the Selection of Representative Objects
      3. Methods Based on the Construction of Central Points
      4. Some Other Nonhierarchical Methods
      5. Why Did We Choose the k-Medoid Method?
      6. Graphical Displays
    Exercises and Problems

  3. Clustering Large Applications (Program CLARA)
    1. Short Description of the Method
    2. How to Use the Program CLARA
      1. Interactive Use and Input
      2. Output
      3. Missing Values
    3. An Example
    4. More on the Algorithm and the Program
      1. Description of the Algorithm
      2. Structure of the Program
      3. Limitations and Special Messages
      4. Modifications and Extensions of CLARA
    5. Related Methods and References
      1. Partitioning Methods for Large Data Sets
      2. Hierarchical Methods for Large Data Sets
      3. Implementing CLARA on a Parallel Computer
    Exercises and Problems

  4. Fuzzy Analysis (Program FANNY)
    1. The Purpose of Fuzzy Clustering
    2. How to Use the Program FANNY
      1. Interactive Use and Input
      2. Output
    3. Examples
    4. More on the Algorithm and the Program
      1. Description of the Algorithm
      2. Structure of the Program
    5. Related Methods and References
      1. Fuzzy k-Means and the MND2 Method
      2. Why Did We Choose FANNY?
      3. Measuring the Amount of Fuzziness
      4. A Graphical Display of Fuzzy Memberships
    Exercises and Problems

  5. Agglomerative Nesting (Program AGNES)
    1. Short Description of the Method
    2. How to Use the Program AGNES
      1. Interactive Use and Input
      2. Output
    3. Examples
    4. More on the Algorithm and the Program
      1. Description of the Algorithm
      2. Structure of the Program
    5. Related Methods and References
      1. Other Agglomerative Clustering Methods
      2. Comparing Their Properties
      3. Graphical Displays
    Exercises and Problems

  6. Divisive Analysis (Program DIANA)
    1. Short Description of the Method
    2. How to Use the Program DIANA
    3. Examples
    4. More on the Algorithm and the Program
      1. Description of the Algorithm
      2. Structure of the Program
    5. Related Methods and References
      1. Variants of the Selected Method
      2. Other Divisive Techniques
    Exercises and Problems

  7. Monothetic Analysis (Program MONA)
    1. Short Description of the Method
    2. How to Use the Program MONA
      1. Interactive Use and Input
      2. Output
    3. Examples
    4. More on the Algorithm and the Program
      1. Description of the Algorithm
      2. Structure of the Program
    5. Related Methods and References
      1. Association Analysis
      2. Other Monothetic Divisive Algorithms for Binary Data
      3. Some Other Divisive Clustering Methods
    Exercises and Problems

Appendix
  1. Implementation and Structure of the Programs
  2. Running the Programs
  3. Adapting the Programs to Your Needs
  4. The Program CLUSPLOT
References

Author Index

Subject Index


Book details

Wiley-Interscience, New York (Series in Applied Probability and Statistics), 342 pages.
ISBN 0-471-87876-6.
Third printing, 10 reviews.


Software incorporation

CLUSFIND (cluster analysis)

  • in S-PLUS Version 3.4 and later (as the functions daisy, pam, clara, fanny, agnes, diana and mona)


Program

Program CLUSFIND - Datasets CLUSFIND



Antwerp Group on Robust & Applied Statistics
Department of Mathematics and Computer Sciences
University of Antwerp (UA)
Middelheimlaan 1, B-2020 Antwerpen, Belgium
agoras@mail.win.ua.ac.be
http://www.agoras.ua.ac.be/