rediff.com

Archive for August, 2010

Donna Harman on the Evolution of Search Technology

Listening to Donna Harman’s account of the evolution of Search technology at the SIGIR 2010 Conference in Geneva was a rare pleasure, as I doubt whether many others are as well qualified as she is to give such an account.

She titled her talk ‘Is the Cranfield Paradigm Outdated?’, alluding to the seminal work on Search done at the College of Aeronautics at Cranfield in the UK between 1958 and 1960.

The work at Cranfield involved a comparison of four indexing schemes: 18,000 papers on aerodynamics were manually indexed under each of the four schemes. The authors of the papers were asked what basic problems their papers addressed, and manual searches were then done. It turned out that it made no difference which indexing scheme was used. The real issue, it was discovered, was the descriptors that were used.

This work was extended in the 1962-66 period (referred to in the literature as Cranfield 2). The goal here was to retrieve all relevant documents from a collection of 1400 papers on aeronautical engineering, based on four indexes of 31, 25 and 13 descriptors each, all of which were built manually. Again, the authors of the papers were asked about the basic problems addressed in these papers. Five levels of relevance assessment were used: Complete, High, Useful, Minimal Value, No Interest. It was in this experiment that crucial breakthroughs in Search were made: recall and precision as metrics for judging the efficacy of a search, and the discovery that the words in the documents themselves could be used for indexing.

This, says Donna Harman, was the Cranfield Paradigm: real questions were asked, there was a large enough collection of documents, the collection was built before the questions were framed, and intuitive metrics were used.

The arena then shifted across the Atlantic when Mike Keen from Cranfield spent time at Cornell in 1967-68. The SMART project, as it was called, used the Cranfield collection of documents plus Medical Abstracts.

But things really got going only after the US Defense Department, through DARPA, asked NIST, the US National Institute of Standards and Technology (something like the Indian Standards Institution that we have here in India), for help with an intelligence analysis project in 1990. NIST was Donna Harman’s perch, and her involvement with Search started here. These were the so-called TREC studies.

As Donna describes it, the ‘user model’ for this project involved intelligence analysts searching through a full-text collection of diverse documents with a goal of high recall, i.e. all possible documents relevant to a query needed to be returned, without worrying too much about the precision of the search. The document collection was made up of the Wall Street Journal (1987-89, 1990-92), the Associated Press (1988, 1989), the Federal Register, Ziff-Davis Computer Abstracts, and DOE Abstracts.
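To make the recall-versus-precision trade-off concrete, here is a minimal sketch in Python. It is my own illustration, not something from the talk: the document ids and the two imaginary systems are made up, and the function name `precision_recall` is just a label I chose.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant by assessors
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what fraction of the returned documents are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant documents were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: four documents are relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A cautious system returns few documents; a generous one returns many.
cautious = ["d1", "d2"]
generous = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

print(precision_recall(cautious, relevant))  # (1.0, 0.5)
print(precision_recall(generous, relevant))  # (0.5, 1.0)
```

An analyst who cannot afford to miss anything would prefer the second system, even though half of what it returns is noise.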

The TREC studies went through several phases and practically all the technology in use in today’s Search industry evolved out of this project.

The Cranfield Paradigm of (a) modeling a real user application, (b) having a large enough collection of documents, (c) building the collection before the queries were formulated, and (d) one query producing one answer and, conversely, the same query always returning the same answer, still held throughout these TREC experiments.

…which really brings us to the question that Donna Harman posed at SIGIR 2010 in Geneva: does the Cranfield Paradigm hold in an era where we know that user models have evolved in many new directions? For example, one-query-one-answer is not really the user model that reigns today. Users today run a sequence of queries, and each query is often based on the answers returned by the earlier one. Or take the example of a hotel search, where the search provider has large financial incentives to take you down a path you may not want to go, or a user model such as Amazon’s book search, where they want to persuade you to buy. Or consider that users usually look at only the top few results, which flies in the face of the Cranfield paradigm of re-usable collections. This last piece is an important consideration because, says Donna Harman, many of the benefits of Cranfield (and TREC) came from the re-usability of their document collections.

So, is the Cranfield Paradigm outdated? That was the question she posed to all of us in the audience at the University of Geneva (incidentally founded by Calvin, the Protestant Reformer).

