ChatGPT vs Wordorigins.org Smackdown

Drawing of a computer casing with the label “HAL 9000” and a red camera lens

Depiction of HAL 9000, the artificially intelligent computer in Stanley Kubrick’s 1968 film 2001: A Space Odyssey

What happens when an AI tries to do etymology?
21 December 2022

A new large-language-model, artificial-intelligence (AI) engine, ChatGPT, has debuted. When one types a query into the AI, it returns a response in grammatically correct and sensical English. It has caused a bit of a stir. For one thing, there are those who are foretelling the end of human writing, or at the least the end of the first-year university essay. (Mark Liberman over at Language Log has a non-sensational summary of the debate, including links to various articles opining on the AI and its impact, and an assessment of the ability to detect whether or not an essay was written by the AI.) From my playing around with it, ChatGPT is indeed an impressive step forward in machine-produced natural language, but the fears, moral panic, and click-bait headlines it has inspired are overblown, at least for the immediate future.

The AI is really quite remarkable. It produces intelligible, grammatically and (mostly) idiomatically correct English sentences and paragraphs. For instance, I typed How are you doing? into the engine and got this response:

As an artificial intelligence, I don't experience emotions in the same way that humans do. I am a computer program designed to assist with tasks and answer questions to the best of my ability. Is there something specific you would like to know or discuss?

While this response would pass the Turing test, many other responses, while correct English, are problematic. I asked the AI a series of questions about the origins of various words and phrases. Most of the responses would earn the grade of D in my first-year composition course. A typical response to a question about the origin of a term is:

The phrase “an exception that proves the rule” is a common saying that is used to indicate that an exception to a rule or generalization actually serves to confirm the rule or generalization. The phrase is often used to emphasize that a particular exception does not invalidate a rule, but rather serves to demonstrate its validity.

The phrase “an exception that proves the rule” is believed to have originated in the late 16th century, and it is thought to have been derived from the Latin phrase “exceptio probat regulam,” which means “the exception confirms the rule.” The phrase was originally used in the context of legal reasoning, and it was used to indicate that an exception to a rule actually served to confirm the rule's validity.

Over time, the phrase “an exception that proves the rule” has come to be used more broadly in everyday language to refer to any exception that serves to confirm a rule or generalization.

The response is very general and does not contain citations for the sources of information. The AI’s responses are often structured in the three-paragraph “hamburger” essay format of introduction, body paragraph, and conclusion that is favored by high school and first-year university students. In this case, the introductory paragraph is merely a literal reading of the phrase. The body contains three facts, the date, the Latin origin, and the origin in legal contexts. The conclusion is simply a restatement of the introduction. This one is remarkable only in that the facts are basically correct, if superficial and cursory.

In the questions I asked, the AI occasionally got the origin correct. In addition to the exception that proves the rule, it also correctly sourced big bang to astronomer Fred Hoyle in the 1940s. And the response to my question about gaslight correctly flagged Patrick Hamilton’s play as the source, later popularized by the 1944 movie starring Ingrid Bergman. But correct facts and specificity were not usually the case.

In some cases, the AI plumped for a long-debunked false etymology. And it did so on my very first query when it gave the port out, starboard home origin for posh. When asked about the origin of the whole nine yards, the AI came up with the following:

The phrase “the whole nine yards” is thought to have originated in the United States in the mid-20th century. There are a few different theories about its origin, but the most widely accepted one is that it refers to the amount of fabric used in a military uniform. According to this theory, a standard U.S. Army uniform required about nine yards of fabric, so if someone had “the whole nine yards,” they would have a complete and fully outfitted uniform.

Another theory is that the phrase refers to the length of a football field, which is 100 yards. If someone goes “the whole nine yards,” they are going all out or giving their all.

It is also possible that the phrase “the whole nine yards” is simply a colorful and vivid way of saying “everything” or “all of it,” without any specific meaning or reference to a particular quantity.

First, the dating is wrong. I’ve seen many different fabric-related hypotheses, but the military uniform version is a new one on me. And as the AI phrases it, the football hypothesis makes no sense—how is going nine yards giving one’s all on a hundred-yard field? The hedging, as in thought to have and is also possible, is also typical of the answers generated, and also of first-year university essays. If it had emotions, I would say the AI lacks confidence in its answers.

In a few instances, the AI pointed to supposed Old English words as the origin. It came close with threshold, saying it comes from:

“þresceold,” which is a combination of the words “þresc,” meaning “threshold,” and “ceold,” meaning “cold.”

While the spelling is wrong, þerscold is indeed an Old English word and the source of the Present-Day English word. But the breakdown of the syllables is wrong.

In other instances, the AI gave fictitious, and sometimes laughable, Old English roots. For instance, of fuck it said:

The word “fuck” is believed to be derived from the Old English word “fōc,” which means “goat.”

Goat?! Where did that come from? The Old English word for goat is gat. Fōc is not a word found in the extant Old English corpus.

Sometimes it gets the origin mostly correct but makes other errors, as in the origin of the name of the element americium:

Americium is a synthetic chemical element with the atomic number 95 and the symbol Am. It was discovered in 1944 by a team of scientists at the University of California, Berkeley, led by Glenn T. Seaborg. The element was named “americium” in honor of the United States, where the element was discovered and first produced.

The name “americium” is derived from the Latin word “America,” which was the name of the region of the world now known as the Americas. The name “America” was chosen by the Italian explorer Amerigo Vespucci, who explored the region in the early 16th century and claimed it for Spain. The name “americium” was chosen for the new element because it was discovered and first produced in the United States, which is located in the Americas.

Vespucci did not name the continents; they were named for him. And there is also weird usage in the Latin word “America,” which was the name of the region of the world now known as the Americas; the Latin name did not exist until Europeans made their way there and named the region for Vespucci. Also, the naming of the element was more nuanced than simply naming it for the United States.

Or there is the case of the political sense of lobby, which the AI correctly traces to nineteenth-century Britain without repeating the false story of an origin in Washington, DC hotel lobbies. But then it says lobby comes from the Old French word “lob,” which means “a place where one can lounge,” which is just wrong.

Or the linguistic term snowclone, which the AI correctly dates and credits to Geoffrey Pullum, but about which it also says:

The term is a play on the word “snowclone,” which refers to a type of fossilized snowflake that is preserved in ice or permafrost.

Now this sense of snowclone might very well be an obscure technical definition (although I’ve never heard of it), but it is definitely not the inspiration for the linguistic term.

But giving factually incorrect and long-disproven information is something that undoubtedly will be corrected as the AI improves. The AI does not search the internet; rather it relies on training data that has been fed to it by the developers, and garbage-in/garbage-out, as the saying goes. With better training data, the AI will make such mistakes less often. At least in writing aimed at a general audience. I’m more skeptical of its ability to produce acceptable writing aimed at experts in a field.

While I can easily see a future version of this AI producing solid, Wikipedia-style responses, its writing lacks style and affect. It’s bland and boring. I suspect this will be a harder problem to solve than just getting the facts right. Perhaps it will eventually be able to produce something that someone wants to read. At present, the AI seems to be programmed to avoid writing creative pieces. I asked it to write me a poem about Christmas and it spit out Clement Moore’s A Visit from St. Nicholas (without crediting Moore). Others, however, report that the AI does produce original, albeit bad, poetry.

Crediting sources will also be a tricky problem for the AI to solve. It’s one thing to give Hamilton the credit for Gaslight or Hoyle for big bang, but to actually follow a trail of scholarly works and properly credit the ideas within that discourse is quite another.

To be sure, the AI poses a problem for teachers of writing in that students will undoubtedly use it as a vehicle for plagiarism, but in its present incarnation it is not all that scary and does not presage the “end of the college essay” as some have predicted. The AI’s answers are very superficial, and a bit more care in phrasing essay prompts will go a long way toward defeating any attempts at plagiarism. The AI also gives the same response each time a question is posed, and the developers have produced a tool to help identify texts that have been produced by the AI (I have not tested this tool). And teachers can embrace it as an in-class tool to help students identify and edit sub-standard writing and superficial arguments. With a little bit of ingenuity on the part of composition instructors, the AI as it exists today and in the immediate future will be no more a threat to academic integrity than the myriad human-based essay mills already are.

All problems aside, it's an impressive achievement, and it will get better with time, but the doomsaying is overblown.



From Barflies to Bitcoin: How Words Map Our Modern Era

Cover of From Barflies to Bitcoin: How Words Map Our Modern Era

by David Wilton, cover and illustrations by Ania Wilton

Did you know that the word musher, meaning a person who travels the Arctic by dog sled, makes its English language debut in 1900, at the height of the Klondike gold rush? And in the 2010s people were engaging in a different kind of treasure hunt by swiping right and inviting someone over to Netflix and chill? The words that enter the language in any given year can be a key to the zeitgeist[*] of the era.

From Barflies to Bitcoin: How Words Map Our Modern Era[†] is a linguistic treasure hunt of sorts, examining each year of the twentieth century, plus the opening decades of the twenty-first, for the neologisms that make their appearance during them. In so doing, the book provides a chronological history of the period through the words that were coined during it. For example, here are some words that appear in various years:

1901: dognapper, jai alai, motorism, Ms, pizzeria, pointillism

1950: aqualung, brainwashing, LSD, microfiche, Orwellian, Scrabble

2001: Blu-ray, bromance, CRISPR, Homeland Security, iPod, sext

Word lovers will enjoy combing through it for odd terms or words that appear surprisingly early or late, while history buffs will see events unfold before them, and looking up the particular years that the reader has lived through will evoke feelings of nostalgia and bring to mind the fads and fashions of that earlier era.

The book is written for the general reader in an accessible style, organized into short, annual sections that make it appropriate reading for a commute. Yet it is meticulously researched, with a notes section and bibliography at the end that provide references for those readers who want to follow up on the research. The primary reference is the Oxford English Dictionary, and I used the online version of that dictionary to assemble the preliminary lists of words for each year. I then went on to other reference works, such as Green’s Dictionary of Slang, the “Among the New Words” column of American Speech, and other sources to see where antedatings are available and modified the lists accordingly. The selection of words to represent each year is subjective, based on how they represent historical and cultural trends, whether they appear surprisingly early or late, or whether they are simply odd and interesting in and of themselves.

From Barflies to Bitcoin is available from Amazon.com.

 

[*] English borrowed Zeitgeist, literally “spirit of the times,” from German in 1848, a year that was racked with revolutions across Europe and during which Karl Marx and Friedrich Engels published the Communist Manifesto when “a spectre [was] haunting Europe.”

[†] Barfly (1906) is a habitual patron of a drinking establishment. Bitcoin (2008) is a form of cryptocurrency.

To Oxford or Not to Oxford? That is the question.

13 February 2022

Meme claiming the sentence After beating the Steelers, Tim Tebow thanked his parents, God and Ms. Trunchbull is confusing


A bunch of comments about the use, or lack thereof, of the Oxford comma have been appearing in my Twitter feed as of late, so I thought I’d put my thoughts on the subject here, rather than responding to a multitude of threads.

First, let me state for the record that I prefer to use the Oxford comma. I use it in my writing.

That said, unless I’m working as a copy editor where the house style is to use it, I do not insist that others do so. And I never think that someone who opts not to use it is a lesser writer. That is because whether or not to use the Oxford comma is a style choice. Ultimately, it makes no difference.

For those who are uninitiated into the sacred rites of punctuation, the Oxford, or serial, comma is the comma that appears before the final item in a list. Example: the phrase bacon, ham, and eggs uses the Oxford comma; the phrase bacon, ham and eggs does not. It is called the Oxford comma because the house style of Oxford University Press is to use it. Most of the publishers in Britain do not call for its use (which is why Oxford is singled out), while most of those in the United States do call for its use in the material they publish.

Those who insist on its use, as well as those who insist it not be used, should take note of this last fact. The vast majority of people in Britain and the United States get along just fine without a single, hard-and-fast rule. If it truly made a difference, there would be a unified rule.

That said, there are individual instances where the use or lack thereof of the Oxford comma can create confusion. Hardcore advocates love to point these out, but most of the examples they choose to highlight are not really confusing. The instances where genuine confusion can arise are rare.

Meme claiming the sentence I had eggs, toast and orange juice is confusing


One meme that has circulated claimed that the following sentence is confusing:

After beating the Steelers, Tim Tebow thanked his parents, God and Ms. Trunchbull.

The problem with this example is that absolutely no one would seriously entertain the idea that Tim Tebow was the offspring of the supreme deity and Ms. Trunchbull. There is nothing confusing about this sentence.

Another even more absurd example is a meme that claims the following sentence is confusing:

I had eggs, toast and orange juice.

The meme says that this sentence is a claim about eating eggs that is addressed to toast and orange juice. Again, the idea that a person would be speaking to breakfast food is silly. There is nothing confusing about this sentence.

A somewhat more astute example can be seen in a meme that claims that the following phrase is confusing:

the US president, a racist and a misogynist

Meme stating that the phrase the US president, a racist and a misogynist is confusing


If one is referring to Jimmy Carter or Barack Obama, there is no confusion. No serious person is going to label either of these men as racist or misogynist; the sentence refers to three people. But if the president in question is, say, Woodrow Wilson or Donald Trump, then a reader may legitimately be confused as to what is meant. But note, the confusion results whether or not the Oxford comma is used. Using an Oxford comma in no way guarantees clarity of expression.

The confusion results not because of the presence or absence of a comma but because a comma is too weak to support the weight of the meaning. The sentence needs to be recast to take the semantic weight off the comma.

A more mundane, and probably more likely, example is the following pair of sentences:

Alice, Betty’s grandmother and Charles attended the wedding.

Alice, Betty’s grandmother, and Charles attended the wedding.

Both of these sentences can be confusing if one does not know the relationship between Alice and Betty. There is no confusion if one knows whether or not Alice is Betty’s grandmother. What is missing here is context, specifically who the intended reader is. If one is writing to someone who knows Alice and Betty, then the sentence is perfectly clear. If not, then the sentence or passage needs to be rewritten to make the relationship clear.

Go ahead and use the Oxford comma or not, whichever is your preference. Ideally, you should be consistent in your style choice, but consistency is no guarantee of clarity. You can’t expect your readers to remember whether or not your practice is to use the Oxford comma. Unless you’re writing the next Finnegans Wake, it’s your job as writer to make things clear for your readers.

And whichever style you choose, do not admonish others for making a different choice. It’s wrong, pedantic and annoying.


American Dialect Society 2021 Word of the Year

9 January 2022

On Friday, the American Dialect Society selected insurrection as its Word of the Year for 2021; the word won out over the combining form vax/vaxx in a runoff. The ADS has been selecting words of the year for 32 years, making it the longest running of these endeavors. The ADS is made up of linguists, lexicographers, editors, and others who study language either professionally or as serious amateurs. But despite the organization’s professional orientation, the WOTY is open to the public and is undertaken in a spirit of fun. The ADS does not claim that this is any kind of scholarly endeavor.

For the second year in a row, the WOTY selection was conducted online, and over three hundred people actively participated via Zoom. The day prior to the public meeting, a smaller, invited group of about fifty met online to nominate words in the various categories. Nominations “from the floor” were also accepted in the public meeting. I participated in both meetings. The full list of nominees and vote totals is in the ADS press release. You can also watch a recording of the proceedings on YouTube.

Other organizations take a more data-driven approach to their WOTY selections. For instance, both Merriam-Webster and Oxford University Press base their selection on the word that received the greatest spike in lookups on their online dictionaries, after filtering out the “evergreen” words that are looked up year after year. Both of these dictionaries selected variations on vaccine as their WOTY, vaccine for Merriam-Webster and vax for Oxford. My own take on WOTY 2021 is here.

I’m not going to give a comprehensive run-down of the ADS selections as I’ve done in past years. You can read the ADS press release for that. Instead, I’m offering some observations on the process and the choices.

When ADS kicked off the WOTY phenomenon over thirty years ago, it was intended as a fun exercise that might generate some publicity for the organization. As to the latter, it has succeeded beyond all expectations of that original group. It was never intended as a scholarly exercise, and that needs to be kept in mind. The ADS process is highly subjective, reflecting the tastes, biases, and agendas of the participants.

This year, I paid attention to the chat that was happening in Zoom alongside the audio/video feed. For the first time, I got a glimpse into what a large number of the participants were thinking. Very few were thinking as linguists, as professional scholars of language. That’s an observation, not a criticism. The process isn’t one of scholarly rigor, so why should people pretend that it is? Besides, it was clear that everyone was having a lot of fun. (The discussion of glizzy gobbling had me literally laughing out loud.)

For instance, one person said that while CRT/critical race theory was clearly an important term, they wouldn’t be voting for it as political term of the year because they didn’t want to highlight yet again the racist appropriation of the term. That’s an opinion I share and support, but it’s an opinion based on political rather than linguistic factors. CRT did not win, not even making it into the runoff. The winner in that category was insurrection in a runoff with Big Lie. Both of these are better choices on both political and linguistic grounds.

Another voiced the opinion that they were voting for variant because the experience with COVID-19 variants made it easier for them to explain the concept of linguistic variation to their students. I’m not sure how the imprimatur of the ADS would make the teaching any easier, but okay, that’s as good a reason as any.

Another surprise to me was the groundswell of support for Great Resignation and antiwork. While I was aware of the growing opposition to exploitative labor practices, I did not expect so much vocal support for these terms. I was unfamiliar with antiwork as a term (but not as a concept) before Friday, and it was nominated for WOTY “from the floor.” It probably shouldn’t have been a surprise though. ADS, like many organizations made up of university professors and grad students, skews more liberal than the population as a whole, plus many of those participating were younger than me.

But as an old fart, the biggest shock was the informal (in past years slang) category. I had never heard of any of the nominees. I guess I’m just cheugy.

Another new term to me was hard pants, but I fully support it. It’s simultaneously useful and silly. Ditto for copium, useful and playful.

There were no separate hashtag or emoji categories this year, another good change. I don’t object to hashtags or emojis. To the contrary, I like seeing them in contention as regular “words.” No emojis were nominated this year, but #FreeBritney won the digital word category, a worthy win.

I had my own biases and blind spots, and I’m sure every other participant had the same. I guess my point is that the WOTY should be taken as a bit of fun and a launching point for discussion of the past year and the vocabulary it generated and not as a serious pronouncement by experts from on high.
