This article is from the Guardian.

You can find the current article at its original source at http://www.guardian.co.uk/news/datablog/2013/feb/04/calculating-crossword-answers


Calculating 'crosswordiness' of answers: how to do it and what it shows us
Every budding crossword puzzler quickly learns that crosswords have their own strange vernacular. The need to interweave lots of words seamlessly puts vowels, short words, and unusual letter combinations at a premium, and puzzle constructors don't want to make things too easy by sticking to the familiar. The end result is that archaic, technical, and just plain exotic terms that would never come up in conversation routinely show up in crossword puzzles.
Every serious devotee of American-style puzzles is an expert in certain Finnish architects (EERO), Great Lakes (ERIE), World War II battlegrounds (STLO), church altars (APSE), butter substitutes (OLEO), sons of Isaac (ESAU), and a whole lot more.
When you start to see unfamiliar words pop up repeatedly in your daily crossword, it's hard to know which ones are genuinely obscure and which ones are just new to you; everyone has their own idiolect. Fortunately, we can apply some data to that question, now that Michael Donohoe of Quartz has published a set of New York Times crossword clues and answers spanning all puzzles from 1996-2012.
To figure out which words really are the most peculiar to crosswords, we need to look at two things for each word: how often it shows up as a crossword answer and how often it shows up in other usage.
'Other usage' could be defined in lots of ways, but one of the most comprehensive and accessible measures is a Google Book N-Gram, which gives the percentage of all words in books scanned by Google (over 20 million books to date) that a given word represents. For example, about 2.3% of all words in books since 1996 are the word AND, whereas only about 0.00001472% of them are the word AERIE (an eagle's nest: another crossword favorite). Calculating the ratio between an answer's crossword frequency and its n-gram frequency thus gives us a rough idea of a word's 'crosswordiness', or how disproportionately often it's used in crossword puzzles.
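The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the method actually used for the analysis: the function name `crosswordiness`, its parameters, the answer count and the answer total are all hypothetical, and it assumes crossword frequency is measured as a percentage of all answers. Only the AERIE n-gram figure (0.00001472%) comes from the article.

```python
def crosswordiness(answer_count, total_answers, ngram_percent):
    """Ratio of an answer's share of crossword answers to its share of book words.

    Parameter names are illustrative assumptions, not taken from the analysis.
    Both shares are expressed as percentages so the ratio is unitless.
    """
    if ngram_percent <= 0:
        raise ValueError("n-gram percentage must be positive")
    crossword_percent = 100.0 * answer_count / total_answers
    return crossword_percent / ngram_percent


# Hypothetical inputs: 100 appearances among 500,000 answers in total.
# Only the n-gram percentage for AERIE is a figure quoted in the article.
score = crosswordiness(answer_count=100, total_answers=500_000,
                       ngram_percent=0.00001472)
print(f"AERIE crosswordiness (illustrative): {score:,.0f}")
```

A score well above 1 means the word turns up far more often in crossword grids than in books; a very common word such as AND, at about 2.3% of all book words, would score low even if it appeared in many puzzles.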
For example, the most common crossword answer, ERA, shows up as an answer 323 times, or about once in every 18 puzzles, but it doesn't even crack the top 500 for crosswordiness because it's relatively common in other usage. The top ten crosswordiest answers are listed in the table below.
Note that this analysis is limited to recognized English dictionary words, because Google Book N-Grams are case sensitive and there's no consistent way to parse unrecognized words. The seemingly simple question, "what is a word?" can be quite complicated for even standard language analysis, and crossword puzzles are a particularly thorny dataset, full of abbreviations, slang, and devilish wordplay (Is it a person's name? A foreign phrase? A portmanteau? Sometimes all of the above!).
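One way to picture that restriction is a filter that keeps only answers whose lowercase form appears in a word list. This is a rough sketch under stated assumptions: the article does not say which dictionary was used, so the `dictionary` set here is a toy stand-in, and the function name `dictionary_answers` is hypothetical.

```python
def dictionary_answers(answers, dictionary):
    """Keep only answers whose lowercase form is a recognized dictionary word."""
    known = {word.lower() for word in dictionary}
    return [answer for answer in answers if answer.lower() in known]


# Toy data (assumption): crossword answers are stored in uppercase, while the
# word list is lowercase. STLO (a place name) and EERO (a personal name) are
# not in this toy word list, so they are dropped.
answers = ["ERA", "AERIE", "STLO", "EERO", "OLEO"]
dictionary = {"era", "aerie", "oleo"}

kept = dictionary_answers(answers, dictionary)
print(kept)  # ['ERA', 'AERIE', 'OLEO']
```

Normalizing to lowercase before the lookup sidesteps the case-sensitivity problem for ordinary words, at the cost of excluding names and abbreviations that only appear capitalized.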
The tables below show the top ten 'crosswordiest' answers, as well as the ten most common answers.
For more details, including analysis of links between answers and clue keywords, visit Noah Veltman's website: http://noahveltman.com/crossword/
Noah Veltman is a web developer and 2013 Knight-Mozilla OpenNews Fellow currently working with the BBC in London.
Data summary
Crosswordiest words (1996-2012, minimum 50 appearances)
Source: Noah Veltman
Most common answers (1996-2012)
Source: Noah Veltman