
Machine Learning for Text, by Charu C. Aggarwal, New York, Springer, 2018. ISBN 9783319735306. XXIII + 493 pages.

Published online by Cambridge University Press:  12 January 2021

Xiaolei Lu*
Affiliation:
Assistant Professor, College of Foreign Languages and Cultures, Xiamen University, No. 422 South Siming Rd, Siming district, Xiamen, China Email: luxiaolei@xmu.edu.cn

Type
Book Review
Copyright
© The Author(s), 2021. Published by Cambridge University Press

This book focuses on machine learning for text, integrating knowledge in the fields of machine learning, information retrieval, and natural language processing. This is the first (text)book on text analytics that covers the intersections of data mining, language modeling, and deep learning in a holistic way.

The book is organized into 14 chapters. Each chapter adopts a fixed outline, starting with an introduction of the topic and the organization of the chapter, and ending with a summary, bibliographic notes (including software resources) and exercises. These chapters provide a comprehensive survey of text analytics from three aspects: Classical Machine Learning Algorithms and Models (Chapters 1–8); Information Retrieval and Ranking (Chapter 9); and Sequence-centric Text Mining (Chapters 10–14).

Part I—Classical Machine Learning Algorithms and Models

Chapters 1 to 8 discuss data preparation and some basic text analytical models, such as text clustering and text classification algorithms. Following a brief introduction in Chapter 1, Chapter 2 covers text preprocessing and similarity computation. In the next chapter, the author moves on to dimensionality reduction and topic modeling, with the main focus on document-term matrices. In Chapter 4, Aggarwal describes several aspects of unsupervised text clustering, namely feature engineering, clustering models, internal and external clustering evaluation, and two common evaluation mistakes that practitioners easily make.
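For readers new to the topic, the similarity computation mentioned above can be illustrated with a minimal sketch. It is not taken from the book: it assumes whitespace tokenization and the unsmoothed weighting tf × log(N/df), and the function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # the two cat sentences share terms
sim_unrelated = cosine(vecs[0], vecs[2])  # no shared terms
```

Here the two sentences about the cat receive a positive similarity, whereas the unrelated third sentence shares no terms with the first and scores zero.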

Chapters 5 to 7 all deal with text classification. Chapter 5 introduces four basic models for text classification: the naive Bayes classifier, the nearest-neighbor classifier, decision trees, and rule-based classifiers. In Chapter 6, the author moves on to more complex algorithms, namely linear classification and regression for text, with a closer examination of least squares, support vector machines, and logistic regression. This chapter also addresses how linear models can be generalized to nonlinear settings using kernel transformations. Chapter 7 is devoted to classifier evaluation and performance improvement. The author examines methods of evaluating the accuracy of a classifier, such as holdout and cross-validation. In addition, he explains how ensemble learning can reduce bias (boosting) and/or variance (bagging), thereby enhancing classifier performance.
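As an illustration of the simplest of these models, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is an assumption-laden toy rather than the book's formulation: the training data, the function names `train_nb` and `predict_nb`, and the choice to ignore unseen terms are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns the model counts."""
    class_counts = Counter()            # number of documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest smoothed log-posterior score."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # assumption: terms never seen in training are ignored
            # Add-one smoothing over the training vocabulary.
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_nb(train)
label_a = predict_nb(model, "cheap watches now".split())
label_b = predict_nb(model, "monday project agenda".split())
```

In a holdout evaluation of the kind discussed in Chapter 7, the test documents would be kept out of `train` and only passed to `predict_nb`.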

The final chapter of part I shifts to text mining methods for heterogeneous data, including shared matrix factorization, factorization machines, joint probabilistic modeling, and relationship graph techniques. Instead of introducing algorithms for specific types of heterogeneous settings, the author points out the common threads that can be applied to all sorts of heterogeneous data.

Part II—Information Retrieval and Ranking

Many aspects of information retrieval are closely related to text mining. Chapter 9 examines Web-specific textual queries for information retrieval. Beginning with indexing and query processing, the author presents three aspects of Web-based information retrieval: two common types of data structures for retrieval applications; crawling and query processing for Web-centric data mining; and evaluation of Web documents.

First, the author discusses two common types of data structures used in keyword-centric queries: the dictionary and the inverted index. Hash tables and binary trees are introduced as dictionary data structures. For the inverted index, the author presents a linear-time indexing method that avoids memory exhaustion. In addition, practical solutions, such as caching and compression tricks, are presented to speed up query processing.
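The interplay between the dictionary and the inverted index can be sketched as follows. This is a minimal in-memory illustration, not the memory-aware construction method described in the book: a Python `dict` stands in for the hash-table dictionary, and the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The dict plays the role of the dictionary; values are posting lists.
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(p1, p2):
    """Merge two sorted posting lists in time linear in their total length."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Return the ids of documents that contain every query term."""
    # Start from the shortest posting list to keep intermediate results small.
    lists = sorted((index.get(t.lower(), []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

docs = [
    "web search engines crawl the web",
    "search engines rank pages",
    "crawlers fetch pages",
]
index = build_inverted_index(docs)
hits = and_query(index, ["search", "engines"])
```

Processing the shortest posting list first is a common heuristic; it bounds the size of every intermediate result by the rarest query term.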

Crawling and search engine queries are also introduced in this part. The author introduces universal crawlers and preferential (focused and topical) crawlers, as well as ways to avoid spider traps and the use of shingling for near-duplicate detection. Unlike the traditional information retrieval methods discussed earlier in this chapter, query processing in search engines must also relieve the heavy load that a large number of queries places on Web servers.
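The idea behind shingling can be shown in a few lines. The sketch below is illustrative only and assumes word-level shingles of length 3 and a plain Jaccard comparison; the threshold value and function names are hypothetical, and large-scale systems typically compare hashed or sampled shingles rather than full sets.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word k-grams) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(text_a, text_b, k=3, threshold=0.9):
    """Flag two texts whose shingle sets overlap above an assumed threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold

page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"
overlap = jaccard(shingles(page_a), shingles(page_b))
```

Changing a single word alters every shingle that spans it, so the two pages above share only 4 of their 10 distinct shingles even though they differ in one word; smaller values of `k` make the measure more tolerant of such edits.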

Part II also covers the evaluation of Web documents. After a brief introduction of the scoring process, the author discusses some basic scoring models, such as vector space models with tf-idf, the binary independence model, the BM25 model, and query likelihood models. In addition, the author introduces link-based algorithms such as PageRank (query-independent) and HITS (query-specific) measurement to rank web pages.
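To give a flavor of the query-independent, link-based ranking mentioned above, the following is a minimal power-iteration sketch of PageRank. It rests on several assumptions not drawn from the book: a damping factor of 0.85, a fixed number of iterations, and uniform redistribution of the rank held by pages without out-links. The toy graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread uniformly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank along every out-link.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        rank = new_rank
    return rank

graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(graph)
top_page = max(ranks, key=ranks.get)
```

In this toy graph, page `c` is linked from three other pages and ends up with the highest score, while `d`, which has no incoming links, retains only the baseline share.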

The first part has eight chapters, whereas part II has only one, which makes the parts seem rather unevenly balanced. Also, while the first part is logically connected with the third, the second part, which is more about information retrieval and its generalization to search engines than about machine learning for text, seems a little out of place between the other two.

Part III—Sequence-centric Text Mining

Starting with text sequence modeling and deep learning, part III is mainly about sequence-centric text mining. Aggarwal discusses some advanced applications such as text summarization, information extraction, opinion mining and sentiment analysis, text segmentation, streaming clustering, and event detection.

When introducing sequence-centric text summarization, the author covers extractive and abstractive methods, both for single-document and multiple-document text summarization. Various methods for extractive summarization are introduced, such as topic-word methods, latent methods, and machine learning techniques. However, the author devotes relatively scant attention to abstractive summarization.

With respect to information extraction, the author mainly addresses two subtasks: named-entity recognition and relationship extraction. He introduces several named-entity recognition methods, namely rule-based methods, transformation to token-level classification, hidden Markov models, maximum entropy Markov models, and conditional random fields. In addition, he discusses how relationship extraction tasks can be transformed into classification tasks and how relationships can be predicted using explicit feature engineering and kernel methods.

Aggarwal then shifts to opinion mining and sentiment analysis, which can be carried out at the phrase, sentence, document, and aspect levels. In this part, he revisits methodologies covered in previous chapters, such as feature engineering, entity extraction, and classification. In addition, he employs spam detection to improve the quality of opinion mining and briefly introduces text opinion summarization.

The final chapter focuses on text segmentation, streaming text clustering, and event detection. The author introduces unsupervised (TextTiling and C99) and supervised text segmentation, as well as k-means-based text stream clustering. He also discusses unsupervised and supervised event detection, suggesting that supervised event detection can be recast as a supervised segmentation problem and that event detection can be treated as an information extraction problem.

Written in an accessible style and with many sophisticated algorithms and derivations of formulae, this book provides an up-to-date and comprehensive introduction to text analytics. Integrating intersecting research, synthetic algorithms, and layering approaches in the fields of machine learning, information retrieval, and Natural Language Processing (NLP), this book is a step toward further establishing text mining as a distinct academic field.

However, this book is more about traditional statistical machine learning than about deep learning and neural networks. Some deep learning methods that were state of the art when the book was written, such as the attention mechanism, are absent. In addition, some exercises in the book could have been slightly better designed, and some programming exercises for text mining applications would have been useful.

Despite these few issues, this textbook is highly recommended for all those involved in the field of text mining who are looking for clear guidance and a comprehensive survey of text analytics, including graduate students, researchers, professors, and industrial practitioners. This will be a very important reference for machine learning in text for years to come.

Acknowledgments

The writing of this review was supported by the project of Humanities and Social Sciences, Ministry of Education, People’s Republic of China (Grant No. 18YJCZH117) and the project of Education Reform, Department of Education, Fujian province (Grant No. JZ180061).