Google as a Research Tool

24 July 2011

Google web search is essentially useless as a research tool, as illustrated by this Language Log post which points out that the search engine conflates Freud and Darwin.

Google search results are optimized to help you find a particular site, not to provide information about language use or other data. The combination of personalized search results, based on things like your location and search history, plus the algorithms that automatically amend your search parameters means that one person’s results are not going to be the same as another’s. There was a time when you could rely on the number of raw hits as a rough approximation of a term’s popularity on the web, but with Google’s constant updating of the algorithms that power the site, this is no longer the case.

This isn’t necessarily a bad thing. Google has gotten a lot better at serving up exactly what you want. Nowadays it is rare for me to go beyond the top ten results to find the site I need. The other day I did a Google search on “running shoes Toronto” and the number one result was a store right around the corner that I did not know existed. I promptly walked over and bought a pair. (Google obviously knew where I was, thanks to my use of Google Maps, which can be a bit unsettling, but was damn useful.) It used to be the case that any Google search for a hotel turned up hundreds of supersaver travel sites, offering low prices on generic hotel rooms. Finding the website of an actual hotel was next to impossible. But nowadays, a search for hotels actually turns up the hotel websites, with the sites offering “great deals” buried down below. This commercial convenience does mean, however, that trying to make any type of objective observation based on Google results is fruitless.

English-Only Signs in NYC

21 July 2011

Linguist Dennis Baron has a post on a New York City ordinance, dating from 1933, that requires the name of a store “to be publicly revealed and prominently and legibly displayed in the English language either upon a window...or upon a sign conspicuously placed upon the exterior of the building” (General Business Laws, Sec. 9-b, Art. 131). I’m not sure that I agree with Dr. Baron’s assessment that the law is discriminatory and serves no other useful purpose.

The arguments for such signs are that they help combat fraud and fly-by-night stores (this was the purpose of the law when it was drafted) and that they aid first responders. Dr. Baron points out that people can lie in any language and GPS is better than signs for first responders. Those rebuttals don’t cut the mustard. Yes, a recognizable name does not stop fraud, but it helps in identifying the location. And people who dial 911 don’t necessarily have GPS coordinates available when they give their location. Besides, as the first responders are driving up the street, a sign is much more useful in accurately pinpointing the location than GPS, which can be off by tens of meters. (Not to mention that line of sight to GPS satellites is often not available in the canyons of New York City.)

But he does have a point about discrimination, particularly against Asian and Arab businesses.

Perhaps what is needed is not a law that requires English, but one that requires the name be prominently displayed in the Latin alphabet. The point is not to dictate what language is used, but to ensure the sign can be read by the vast majority of people in the city. The law should also clarify that it is not “Latin characters only,” and that “prominently displayed” does not mean it has to be the largest writing on the sign, only that it can be clearly read from a reasonable distance. (Given the street grid and architecture of NYC, it would be very difficult to specify a distance, say twenty-five meters, in advance. If you can’t see the store front from twenty-five meters away, it would be absurd to require that large a sign.) Enforcement of the law would also have to be monitored to ensure that particular groups were not unfairly targeted.

And there is something to be said for making life welcoming and convenient for everyone who lives in the city. Such a law might actually help these businesses, by making them more welcoming to a larger customer base. It would also help, in a small way, in reducing the insularity of ethnic enclaves in the city. We don’t want to eliminate these enclaves; they are a vital and vibrant part of city life. But making the names of the businesses readable to the majority of residents would help foster ties to and communication with the city at large.

Typos and Digital Publishing

20 July 2011

There’s a recent opinion piece in the New York Times Online about how digital publishing has created a boom in typos and bad spelling making their way through to appear in the final versions of books and other publications. It’s interesting, but I don’t buy the arguments put forth by Virginia Heffernan, the article’s author. The cause isn’t digital technology, it’s corporate economics.

As Ms. Heffernan points out, there have always been bad spellers, and the ability to spell correctly does not correlate with excellence in writing. She rightly gives the example of F. Scott Fitzgerald, who was a notoriously incompetent speller. There has always been pressure to publish quickly. What has changed? Ms. Heffernan says:

Before digital technology unsettled both the economics and the routines of book publishing, they explained, most publishers employed battalions of full-time copy editors and proofreaders to filter out an author’s mistakes. Now, they are gone.

To which I reply, post hoc ergo propter hoc. Digital technology did not fire the copy editors, management did. The cause is the surge in Wall Street mergers and acquisitions that began in the 1980s. Publishing was always a low-profit enterprise, with expected returns of 5–8%. But once assembled into huge media conglomerates, book divisions had to compete with more profitable divisions and owners demanded returns of 20–25%. To accomplish this, publishers churned out more product and cut overhead—all those copy editors and their princely salaries.

(The same economics are what is killing newspapers. Yes, the decline in ad revenue is unsustainable in the long term, but for the moment, most newspapers are still highly profitable. They’re cash machines. What is killing newspapers is the enormous debt they have accumulated in acquiring other newspapers and becoming media conglomerates.)

Ms. Heffernan also blames “writerly inattention.” Manuscripts are longer and more carelessly assembled on the word processor, says Ms. Heffernan. I find this hard to believe. The same economic incentives that kept published books to a certain length in the typewriter era still obtain, and successful online writers know that brevity is important to retaining a reader’s attention, perhaps more so than in print. And I can’t believe that manuscripts were better organized and structured in the days of the typewriter. Word processors allow a writer to edit and structure a text much more easily and consistently than was ever possible with a typewriter. It may be true, though, that the apparent ease with which authorial editing can be done electronically encourages sloppy writers who would have been daunted by the prospect of doing it in the typewriter era. So editors may be seeing more terribly constructed manuscripts in their slush piles.

There is one area where digital technology does make a difference in spelling, but Ms. Heffernan doesn’t touch on it. That is poorly scanned e-books. Amazon and Apple’s iStore are filled with cheap editions of public domain books that have been hastily scanned and converted to text with optical character recognition software. While OCR software has gotten pretty good, its error rate is still considerable, and any OCR’d text needs a thorough proofreading before it is worthy of publication. But again, economics steps in, and the low prices these e-books command, typically around one dollar, don’t make it feasible to hire proofreaders. As a result, these texts are rife with errors, some to the point that they are unreadable. Even here, it is economics and not technology that is creating the problem.

(Hat Tip: Barbara Need)

Anachronistic Scientists

18 July 2011

In a post that makes some nice points about semantic change, Pete Langman, writing in the Guardian’s “Mind Your Language” blog, makes a whopping error:

Old words evolve, too, by stepping out of the dictionary and back into oral culture. Johnson’s use of the word “science” perfectly illustrates his point. “Science” now means a specific mode of inquiry; indeed, it presents a certain type of knowledge guaranteed, so to speak, by the method that underpins it. It was first used in the modern sense in 1834. But when Johnson used it, he meant scientia, or knowledge in the broader sense. The use of the word “scientist” to describe anyone before 1834 is not only anachronistic, but erroneous.

Now the main point of this paragraph is absolutely correct and worth saying. We have to be careful when applying current definitions to works written in the past. But the example is problematic, and the final sentence is ludicrous.

First, Langman conflates the words science and scientist. It is scientist that is first attested in 1834, not science. Second, the invention of the modern concept of science cannot be pinned down to any particular date. The OED entry for science is problematic; it’s one of those entries that has been haphazardly updated over the decades and needs a thorough scrubbing, and you can’t tell from it when the modern sense of the word emerged. I’ll define the modern concept using Merriam-Webster: “knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method [i.e., systematic use of empirical and controlled observation].” There are plenty of examples of excellent science from before 1834. The names Galileo, Kepler, Newton, Jenner, and Davy spring immediately to mind. The Royal Society was founded in 1660. I can even point to flashes of scientific methodology by the seventh/eighth-century Bede and the tenth/eleventh-century Ælfric. Now it is true that most of the modern institutions of science-as-we-know-it-today (e.g., journals, professional/university laboratories) did not come into being until the first half of the nineteenth century, but that’s a historical or sociological issue, not a linguistic one, and it does not mean that there weren’t earlier examples of good science.

Langman seems to be saying that we shouldn’t call anyone who worked before 1834 a scientist because the word didn’t exist then. He isn’t the first to make this claim. I’ve heard others make it, but it is just patently absurd. We can certainly apply words anachronistically. Just because Humphry Davy (d. 1829) didn’t have the convenient label to denote his profession doesn’t mean that we can’t look back today and describe him as a superb scientist. Are we to say that there is no such thing as the Middle Ages because the term wasn’t coined until 1605, and no one living in the period would have described it as such? Now, people will point to the fact that Newton toyed with mysticism and alchemy in addition to his scientific pursuits, but Linus Pauling (1901–94), who is one of only two people to win Nobel Prizes in two different fields (the other is Marie Curie), also engaged in crank medical research involving vitamins. Great scientists are not immune to bad ideas and don’t always apply the scientific method to everything they do.

Now Langman is correct that we should be careful when applying terms like science and scientist to pre-modern eras. I mentioned Bede and Ælfric; while I can point to examples of them using the empirical method in their work, I would by no stretch of the imagination label these medieval monks as scientists or describe their general approach to discovery as scientific. Which is why I wrote “flashes of scientific methodology” and not “science.” Use of modern senses of such words is indeed anachronistic, but that does not mean they are “erroneous.”

[Edit: Upon rereading Mr. Langman’s post and engaging in a lengthy discussion with him on the forums here, I realize my initial assessment of the piece is a bit unfair. So I’ve changed the opening paragraph above. — dw]