Author: Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze Publisher: Cambridge University Press, 2008 ISBN: 978-0-521-86571-5R, \$51.52 (amazon.com)
Reviewer: John S. Griffin
Overview
The most visible Information Retrieval applications today are the ubiquitous World Wide Web search engines we use every day. They aid in wading through the flood of information that bombards each and every one of us. These search engines utilize various Information Retrieval methodologies to accomplish what they do so well. Information retrieval (IR) itself is the science (art?) of searching for documents, and for information within documents. Searching databases also would fall into this category.
In discussions of this subject you will notice that there is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval. For purposes of this review (the reviewer is admittedly making a big leap here) these can be considered one and the same.
IR utilizes a multitude of technologies such as computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, statistics and physics. Many universities and public libraries use IR systems to provide access to books, journals and other documents.
The reader should have completed a Discrete Mathematics course as midway through the book you become quickly immersed in advanced concepts.
Summary of Contents
The first eight chapters of the book are devoted to the basics of information retrieval. The indexes and queries considered in Chapters 1 through 5 only deal with Boolean retrieval, in which a document either matches a query or does not. Chapters 6 through 8 deal with document ranking methods or \textit{weighting the results returned from queries and evaluating its effectiveness. Chapters 9 through 21 build on the foundation of the first eight chapters to cover a variety of more advanced topics. Chapters 11 and 12 invoke probability theory to compute scores for documents on queries. Chapters 13 through 15 treat the problem of classifying documents into a set of known categories, given a set of documents along with the classes they belong to. Chapters 16, 17, and 18 consider the problem of inducing clusters of related documents from a collection. Finally, chapters 19 through 21 discuss the problem associated with web search.
A somewhat detailed discussion of contents chapter by chapter follows.
Chapter 1 introduces inverted indexes and shows how simple Boolean queries can be processed using such indexes.
Chapter 2 builds on this introduction by detailing the manner in which documents are pre-processed before indexing and by discussing how inverted indexes are augmented in various ways for functionality and speed. Tokenization, stop words, normalization and stemming are introduced
Chapter 3 discusses search structures for dictionaries and how to process queries that have spelling errors and other imprecise matches to the vocabulary in the document collection being searched. Wildcard queries, spelling correction and k-gram indexes are discussed.
Chapter 4 describes a number of algorithms for constructing the inverted index from a text collection with particular attention to highly scalable and distributed algorithms that can be applied to very large collections. Distributed indexing concepts are covered here. The Reuters-RCV1 collection is introduced as the standard document collection for use throughout the book.
Chapter 5 covers techniques for compressing dictionaries and inverted indexes. These techniques are critical for achieving sub-second response times to user queries in large search engines. Heap's and Ziph's laws are introduced along with \gamma-codes.
Chapters 6 and 7 A desire to measure the extent to which a document matches a query, or the score of a document for a query, motivates the development of term weighting and the computation of scores in , leading to the idea of a list of documents that are rank-ordered for a query. Term frequency, inverse document frequency and most importantly the Vector Space Model is introduced and discussed in detail.
Chapter 8 focuses on the evaluation of an information retrieval system based on the relevance of the documents it retrieves, allowing us to compare the relative performances of different systems on benchmark document collections and queries. A list of standard document collections is presented. The underlying concepts in this chapter deal with Precision, Recall and the F-factor.
Chapter 9 discusses methods by which retrieval can be enhanced through the use of techniques like relevance feedback and query expansion, which aim at increasing the likelihood of retrieving relevant documents. Pseudo (blind) relevance feedback is also discussed. The very important Rocchio algorithm is presented.
Chapter 10 considers IR from documents that are structured with markup languages like XML and HTML. This chapter treats structured retrieval by reducing it to the vector space scoring methods from Chapter 6.
Chapter 11 develops traditional probabilistic IR, which provides a framework for computing the probability of relevance of a document, given a set of query terms. This probability may then be used as a score in ranking. The last few pages of the chapter are dedicated to an appraisal of the probabilistic methods presented here.
Chapter 12 illustrates an alternative, wherein, for each document in a collection, a language model is built from which one can estimate a probability that the language model generates a given query. This probability is another quantity to rank-order documents.
Chapter 13 motivates statistical classification as one of the key technologies needed for a successful search engine; introduces Naive Bayes, a conceptually simple and efficient text classification method; and outlines the standard methodology for evaluating text classifiers. This is one of the longest chapters in the book and delves into much detail.
Chapter 14 employs the vector space model from Chapter 6 and introduces two classification methods, Rocchio and k nearest neighbor (kNN), that operate on document vectors. It also presents a long, detailed discussion of the bias-variance trade off as an important characterization of learning problems that provides criteria for selecting an appropriate method for a text classification problem.
Chapter 15 introduces support vector machines, which many researchers currently view as the most effective text classification method. Also developed are connections in this chapter between the problem of classification and seemingly disparate topics such as the induction of scoring functions from a set of training examples. A discussion of issues related to the classification of text documents is given. This chapter closes with a presentation on machine learning.
Chapter 16 first gives an overview of a number of important applications of clustering in IR. Two flat clustering algorithms are then given: the K-means algorithm, an efficient and widely used document clustering method, and the expectation-maximization algorithm, which is computationally more expensive, but also more flexible.
Chapter 17 motivates the need for hierarchically structured clusterings (instead of flat clusterings) in many applications in IR and introduces a number of clustering algorithms that produce a hierarchy of clusters. These include Group-average and \textit{Centroid. The chapter also addresses the difficult problem of automatically computing labels for clusters.
Chapter 18 develops methods from linear algebra that constitute an extension of clustering and also offer intriguing prospects for algebraic methods in IR, which have been pursued in the approach of latent semantic indexing.
Chapter 19 gives a summary of the basic challenges in web search, Spam, etc. Also, a set of techniques that are pervasive in web information retrieval are discussed.
Chapter 20 describes the architecture and requirements of a basic web crawler.
Chapter 21 considers the power of link analysis in web search, using in the process several methods from linear algebra and advanced probability theory.
Opinion
This book has been available for review and comment ever since the authors thought it was at the point of being called a "`beta"' release (spoken from a software engineering perspective). This reviewer actually referenced it as he was writing his own book. It ia still available today at http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html. Along with an HTML version there are two PDF versions as well, one for printing and one for on-line viewing. Along with these viewable documents there are slides, exercises (both interactive and traditional), an errata list and even a link to a discussion forum for the book. The availability of electronic copies along with the modest (in this reviewer's opinion) cost of the printed version are points in its favor as far as students are concerned.
The book is quite well organized and presented. Chapters 1 through 5 are the core to any course on information retrieval and they do an admirable job. Along with this, the depth and breath of additional coverage leaves little unturned. At the time of its printing this was, to this reviewer's knowledge, the first book on Information Retrieval to contain a chapter concerning the World Wide Web, let alone three. This topic is absolutely necessary in these times and it's not expected to decrease in importance anytime soon.
Every chapter concludes with a References and further reading section that is admirably comprehensive. These references would make it easy for additional research or study (just a suggestion).
Overall, this reviewer highly recommends this book. It is an excellent introduction to the tactics of information retrieval and its methodologies and intricacies. |