World Class: In Our Backyard


It was late at night, at the end of a long, long day at work. Even the lights in the fishing boats anchored in Bombay harbour for the night were out. I was fishing in another ocean that night, trawling the internet for any research paper that would give us a method to detect India-related content on the web.

There is a brute force way to do that: employ hundreds of people to check every major website in the world. But every time the content changed, you’d have to go back and redo the work. Anyway, hiring that many people would cost too much. There had to be a more elegant way. For example, write a computer program to patiently check sites throughout the world and send back snippets of information that were of interest to Indians.

We understood the first steps in doing this. For instance, you can easily get a computer program to tag an article as ‘India interest’ if it encounters the word “Gandhi” in it. Generalizing from this rule, you could make the program compare every word it encounters against a list of Indian proper nouns to detect ‘India interest’ content. Even more generally, you could make the program compare the words it encounters with a corpus known to be relevant to Indians, for example the last five years of Business Standard articles.
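
A minimal sketch of these three rules in Python might look like the following. It is only an illustration: the word list, the sample sentences, and the function names are all made up, and the corpus rule is deliberately crude (common words such as "the" would match almost any English corpus, so a real system would have to discount them).

```python
import re
from collections import Counter

# Hypothetical list of India-related proper nouns, for illustration only.
INDIA_TERMS = {"gandhi", "mumbai", "bombay", "delhi", "rupee"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def has_india_term(text, terms=INDIA_TERMS):
    """Rules 1 and 2: tag the article if any of its words is on the list."""
    return any(word in terms for word in tokenize(text))

def corpus_overlap(text, corpus_counts):
    """Rule 3: fraction of the article's words that also occur in a reference corpus."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(1 for word in words if word in corpus_counts) / len(words)

# A one-sentence stand-in for a real reference corpus (assumed data).
reference = Counter(tokenize("Sensex rises as rupee steadies; Mumbai markets cheer"))
```

Here `has_india_term` covers the single-word and word-list rules, while `corpus_overlap` returns a score between 0 and 1 that could be compared against a threshold chosen by experiment.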

This much we had figured out on our own. The trick was how to do this economically. Here again, the brute force method is to crawl every single English language site in the world and look for words to compare with our chosen corpus. But the elegant way would be to devise a method of inspired guessing as to where to look first and where to look next.
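
One generic way to picture inspired guessing is a best-first crawl: keep a priority queue of pages not yet visited and always visit the most promising one next. The sketch below is an assumption-laden toy, not the method of any particular paper. It uses an in-memory link graph instead of real web fetches, a made-up relevance table instead of a real classifier, and hypothetical names throughout.

```python
import heapq

def focused_crawl(seed, links, relevance, budget):
    """Visit up to `budget` pages, always expanding the most promising known page next.

    links: dict mapping a page to the pages it links to (stand-in for fetching).
    relevance: dict mapping a page to a score in [0, 1] (stand-in for a classifier).
    """
    # heapq is a min-heap, so priorities are negated to pop the highest first.
    frontier = [(-1.0, seed)]
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        score = relevance.get(page, 0.0)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                # The guess: links found on a relevant page probably lead to relevant pages.
                heapq.heappush(frontier, (-score, target))
    return visited

# Toy data (assumed): a portal linking to a relevant and an irrelevant section.
links = {"portal": ["cricket", "weather"], "cricket": ["ipl"], "weather": ["tides"]}
relevance = {"portal": 0.5, "cricket": 0.9, "weather": 0.1, "ipl": 0.8, "tides": 0.0}
```

With a budget of three pages, `focused_crawl("portal", links, relevance, 3)` follows the link out of the relevant page before it ever visits the irrelevant one, whereas a plain breadth-first crawl would spend its third visit on the irrelevant page.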

That night I was trawling the internet for research papers that described methods of inspired guessing.

Here was one! “Accelerated Focused Crawling through Online Relevance Feedback.” I skimmed through the paper; it pretty much dealt with the problem I had in mind.

It was past midnight in India, a good time to call the author and ask if he was willing to consult for us, I said to myself.

I ran my eyes back up to the start of the paper to check which American university the paper had come out of. Then came the shock! The research paper was from our own backyard: IIT Bombay. And the lead author was Soumen Chakrabarti, a Professor of Computer Science there. I was astounded. Anything to do with web crawling was hot; that’s the stuff search engines are built on. There was a person from India publishing on such a hot topic in a prestigious international journal?

I couldn’t wait for the sun to rise to call him. The next day, some colleagues and I trudged to IIT Bombay to meet the professor. He was sitting at his computer in an ice-cold room in a remote corner of the Computer Science Department, which itself was in a remote corner of the IIT Bombay campus. He was very helpful and immediately gave us the computer code that we needed.

I was curious about how he got interested in this topic. He pulled out a book from the stack in his room: Mining the Web: Discovering Knowledge from Hypertext Data.

“This is a textbook I have just finished writing for US computer science students,” he said, nonchalantly tossing it back into the stack. “It will be out later this year in the US.”

It turned out that he had been at Stanford at the same time as Brin and Page, the Google founders. They went and found a venture capitalist to fund their search engine efforts, and Soumen came back to India to work at IIT Bombay, “because my mother is old and does not want to leave India”.

Landing the Soumen catch turned out to be the easy part. Getting IIT Bombay to engage in a commercial relationship was to be a near-impossible task. The process for such an engagement is uncharted territory for Indian academic institutions. We settled on a compromise: we hired two of his star graduate students (or, more accurately, he persuaded them to join us instead of doing what all their classmates did, which was to emigrate to America). Since then, we have been happily working together; whenever we run into a really tough computer science problem, we can get to Soumen through his students.

Ever since, I have felt mildly guilty about this arrangement, which gave us so much know-how for so little payment. That is, until I encountered the head of Sarnoff Labs at a conference in Beijing. Sarnoff Labs, based in Princeton, New Jersey, has been a hallowed center of innovation since the 1940s and is responsible for bringing colour television and the electron microscope, among other inventions, to the world. He shocked the Beijing audience, and me, by asserting that “big corporations in the world maintain R&D departments mainly to impress Wall Street.” He said that the era of corporations maintaining large central research labs staffed with Nobel Prize winners (Bell Labs symbolized this) is long gone. The locus of innovation has almost entirely moved to start-ups. These start-ups staff themselves with recent graduates of top institutions and with alumni who maintain close consultative links to their university professors.

I was much relieved to hear this. So the arrangement we stumbled into, hiring a professor’s top students and getting the professor for free, seems to be the way R&D is done today. And the irony was that we found it in our own backyard.







