Mining Data Streams

Anand Rajaraman; Jeffrey David Ullman

doi:10.1017/CBO9781139058452.005


4 - Mining Data Streams


Anand Rajaraman: WalmartLabs
Jeffrey David Ullman: Stanford University, California


Summary

Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make another assumption: data arrives in a stream or streams, and if it is not processed immediately or stored, then it is lost forever. Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database), and then interact with it at the time of our choosing.

The algorithms for processing streams each involve summarization of the stream in some way. We shall start by considering how to make a useful sample of a stream and how to filter a stream to eliminate most of the “undesirable” elements. We then show how to estimate the number of different elements in a stream using much less storage than would be required if we listed all the elements we have seen.
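As an illustration of the last idea, estimating the number of distinct elements without listing them, here is a minimal sketch in the style of the Flajolet-Martin approach. The summary does not name a specific method, so the choice of technique is an assumption, as are the function names (`trailing_zeros`, `hash32`, `estimate_distinct`), the use of SHA-256 as the hash, and the parameters `num_hashes` and `group_size`. The only state kept is one small integer per hash function, regardless of stream length.

```python
import hashlib
import statistics


def trailing_zeros(x: int, width: int = 32) -> int:
    """Return the number of trailing zero bits in x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def hash32(item: str, seed: int) -> int:
    """Hypothetical helper: a seeded 32-bit hash built from SHA-256."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def estimate_distinct(stream, num_hashes: int = 64, group_size: int = 8) -> float:
    """Estimate the number of distinct elements seen in `stream`.

    For each hash function, remember only the longest run of trailing
    zeros seen so far; 2**run is a rough estimate of the distinct count.
    Estimates are averaged within groups, and the median of the group
    averages is returned to dampen outliers.
    """
    max_tail = [0] * num_hashes
    for item in stream:
        key = str(item)
        for seed in range(num_hashes):
            tail = trailing_zeros(hash32(key, seed))
            if tail > max_tail[seed]:
                max_tail[seed] = tail

    estimates = [2 ** tail for tail in max_tail]
    group_means = []
    for start in range(0, num_hashes, group_size):
        group = estimates[start:start + group_size]
        group_means.append(sum(group) / len(group))
    return statistics.median(group_means)
```

Because only the maximum tail length per hash function is stored, repeating an element never changes the result, which is what lets the estimate track distinct elements rather than total stream length. The estimate is coarse; more hash functions trade extra work per element for lower variance.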

Another approach to summarizing a stream is to look at only a fixed-length “window” consisting of the last n elements for some (typically large) n. We then query the window as if it were a relation in a database. If there are many streams and/or n is large, we may not be able to store the entire window for every stream, so we need to summarize even the windows.
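The fixed-length window can be sketched as a bounded buffer that discards the oldest element when a new one arrives. This is only the naive, exact version: it stores all n elements, which is precisely what the text says may be infeasible when n is large or there are many streams. The class name `SlidingWindow` and its methods are hypothetical, chosen for illustration.

```python
from collections import deque


class SlidingWindow:
    """Keep the last n elements of a stream and answer queries over them."""

    def __init__(self, n: int):
        if n <= 0:
            raise ValueError("window size n must be positive")
        # deque with maxlen drops the oldest element automatically.
        self._items = deque(maxlen=n)

    def push(self, item) -> None:
        """Record the next stream element."""
        self._items.append(item)

    def count_where(self, predicate) -> int:
        """Count window elements satisfying predicate (a simple 'query')."""
        return sum(1 for item in self._items if predicate(item))

    def contents(self) -> list:
        """Return the current window, oldest element first."""
        return list(self._items)
```

For example, after pushing 1, 2, 3, 4, 5 into a window with n = 3, the window holds 3, 4, 5, and a query counting odd elements returns 2. Summarizing the window, as the text suggests, would replace the stored elements with a smaller structure that answers such queries approximately.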

Type: Chapter
Information: Mining of Massive Datasets, pp. 108-138

DOI: https://doi.org/10.1017/CBO9781139058452.005

Publisher: Cambridge University Press

Print publication year: 2011
