19 December 2010
As you probably know because a zillion people have posted about this, Google has released a new feature that allows you to examine the relative frequency of words and phrases in the Google Books corpus. Here’s a description of the Ngram feature by one of the engineers.
But is the feature useful for linguistic research? Well, there is at least one serious study that uses Google’s corpus. A team led by Jean-Baptiste Michel and Erez Lieberman-Aiden has published “Quantitative Analysis of Culture Using Millions of Digitized Books” in Science (subscription required). Linguist Geoffrey Nunberg has a more accessible article about this research here. But this work does not use the publicly available Ngram tool. Can the Ngram tool, as opposed to the underlying corpus, be used for real work, or is it just an amusing diversion?
My impression is that with appropriate caution it can be useful as a quick tool to verify or refute an impression or statement, but, like counts of Google hits on the web, it is too crude to be definitive. There are simply too many uncertainties and uncontrolled factors underlying the numbers it provides. Any results it returns have to be verified by a much better controlled search of the corpus than this publicly available tool provides.
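For those who want to go beyond the graphing tool, Google has also released the raw ngram count files underlying it, and the relative frequencies the Viewer plots can be recomputed from those counts directly. Here is a minimal Python sketch of that calculation. The file names and the exact tab-separated layouts are assumptions for illustration (the actual released files differ in details); the point is just that a year's relative frequency is the word's match count divided by the total token count for that year.

```python
# Sketch: recomputing a word's relative frequency per year from ngram
# count files, instead of relying on the Ngram Viewer graph alone.
# File paths and column layouts here are hypothetical simplifications.
import csv
from collections import defaultdict

def load_total_counts(path):
    """Assumed layout of a 'total counts' file: one tab-separated row
    per year, with the year and the total number of tokens that year."""
    totals = {}
    with open(path) as f:
        for row in csv.reader(f, delimiter="\t"):
            totals[int(row[0])] = int(row[1])
    return totals

def relative_frequency(ngram_path, word, totals):
    """Assumed layout of a 1-gram file: ngram, year, match_count, ...
    Relative frequency = the word's matches in a year divided by the
    total tokens for that year."""
    counts = defaultdict(int)
    with open(ngram_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if row[0] == word:
                counts[int(row[1])] += int(row[2])
    return {y: c / totals[y] for y, c in counts.items() if y in totals}
```

Even this simple computation makes the underlying uncertainties concrete: the result is only as good as the OCR and the metadata that produced the counts in the first place.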