Clustering

Anand Rajaraman; Jeffrey David Ullman

doi:10.1017/CBO9781139058452.008


7 - Clustering


Anand Rajaraman: WalmartLabs
Jeffrey David Ullman: Stanford University, California


Summary

Clustering is the process of examining a collection of “points” and grouping the points into “clusters” according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another. A suggestion of what clusters might look like was seen in Fig. 1.1. There, however, the intent was to show three clusters around three different road intersections, but two of the clusters blended into one another because they were not sufficiently separated.
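The goal stated above can be checked numerically on a small example. The following is a minimal sketch, not taken from the chapter: it assumes Euclidean distance, and the toy points, the labels, and the helper names `euclidean` and `mean_intra_inter` are all hypothetical. It compares the average distance between points that share a cluster with the average distance between points in different clusters.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean (L2) distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_intra_inter(points, labels, dist=euclidean):
    """Return (mean distance within clusters, mean distance across clusters).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs with the same label are intra-cluster; all others are inter-cluster.
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = mean_intra_inter(points, labels)
print(f"mean intra-cluster distance: {intra:.3f}")
print(f"mean inter-cluster distance: {inter:.3f}")
```

For a good clustering the first number should be clearly smaller than the second. Because `dist` is a parameter, the same comparison can be run with any other distance measure, which matters when the space is not Euclidean.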

Our goal in this chapter is to offer methods for discovering clusters in data. We are particularly interested in situations where the data is very large, where the space is high-dimensional, or where the space is not Euclidean at all. We shall therefore discuss several algorithms that assume the data does not fit in main memory. However, we begin with the basics: the two general approaches to clustering and the methods for dealing with clusters in a non-Euclidean space.

Introduction to Clustering Techniques

We begin by reviewing the notions of distance measures and spaces. The two major approaches to clustering – hierarchical (agglomerative) and point-assignment – are defined. We then turn to a discussion of the “curse of dimensionality,” which makes clustering in high-dimensional spaces difficult, but also, as we shall see, enables some simplifications if used correctly in a clustering algorithm.
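One commonly cited symptom of the “curse of dimensionality” is that pairwise distances between random points become nearly equal as the dimension grows, so “near” and “far” lose much of their meaning. The following is a small illustrative sketch, not taken from the chapter: it assumes points drawn uniformly from the unit cube and Euclidean distance, and the function name `relative_spread`, the sample size, and the seed are arbitrary choices made for the demonstration.

```python
import math
import random
from itertools import combinations


def relative_spread(dim, n_points=50, seed=0):
    """(max - min) / mean of the pairwise Euclidean distances among
    n_points random points drawn uniformly from the unit cube [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same points
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in combinations(pts, 2)]
    return (max(dists) - min(dists)) / (sum(dists) / len(dists))


# The relative spread of distances shrinks as the dimension increases.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_spread(dim), 3))
```

A small relative spread means that the closest and the farthest pairs are almost the same distance apart, which is what makes distance-based grouping hard in high-dimensional spaces. The sketch requires Python 3.8 or later for `math.dist`.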

Type: Chapter
Information: Mining of Massive Datasets, pp. 213–251

DOI: https://doi.org/10.1017/CBO9781139058452.008

Publisher: Cambridge University Press

Print publication year: 2011
