Finding Lena Headley

January 2nd, 2009

Chris Wilson thinks Microsoft need to fix their spelling checker:

On April 30, 2007, with all the usual fanfare that accompanies a software update, Microsoft added Barack and Obama to Office's dictionary. […]

Of course, by April 2007 Obama was already a figure of some renown. He'd announced his bid for the Democratic nomination in mid-January and had been an object of intense fascination since his July 2004 speech at the Democratic Convention. But escaping the shackles of Microsoft Word's red corrugated line is no small feat, and the list of those who've made the cut can seem arbitrary: Why does it recognize the surnames of Matthew Broderick and Susan Sarandon but trip over DiCaprio and Blanchett? They've heard of Friendster, but not Facebook? Does Microsoft really want to start something with Mark Wahlberg? (Or, speaking of Entourage, with Jeremy Piven?)

There's no reason why spell-check dictionaries need to be so behind the times. All the technology to build a relevant, timely spelling database already exists in search engines like Google and Microsoft's own Live Search, which have a vast vocabulary of words and names and update their dictionaries in near real time. Microsoft Word may not have heard of Marky Mark, but a Live Search or a Google query for Mark Walberg includes results for the actor, who has an "h" in his last name. […]

For proper names, is this really a big problem? No off-line dictionary will ever hold every name you might want to type, but all you need to do the first time the spelling checker flags a name it doesn't know is add it to the dictionary. Problem solved.

My big problem with the idea that spelling checker dictionaries should adopt the approach taken by search engines is that the two apparently similar pieces of software are actually doing very different things. When a search engine attempts to suggest a word or phrase it's trying to suggest alternative ways to find the content you want, not to tell you that this is the correct way to spell a name or phrase or whatever. It's saying, in essence, look at this other bunch of content that might also relate to whatever it is you're looking for.

For example, after watching an episode of Terminator: The Sarah Connor Chronicles I might want to search for information on an actress by the name of "Lena Headley". Google's response shows a mix of results for 'Lena Headey' (which is the correct spelling of her surname) and 'Lena Headley'.[1]

All the results on the first page link to content relating to the actress I was looking for, because clearly her surname is often misspelled. I'd imagine that Google's algorithm is, in effect, noticing that despite some of the content using a different spelling of the surname, all the pages it finds tend to mention the same films, TV shows, character names and co-stars. That's pretty much what I'm doing when I consider whether a given page is useful. The fact that someone who puts up a page talking about Lena Headey gets her surname wrong might well make me mildly sceptical about whether their content includes other factual errors, but such a common spelling mistake wouldn't of itself cause me to reject their page out of hand as a useful source. The result of Google's fuzzy search is therefore still useful to me.

However, if I were writing about Lena Headey and her name was flagged by my spelling checker[2] I'd expect it to come up with the correct spelling: to my mind, the function of a spelling checker's suggestion is to provide the right answer, not to point me to an answer lots of other people use that still might be useful. I'd rather have a spelling checker that admits it doesn't know the answer than one that uses algorithms intended to solve a different problem and so will sometimes give me the wrong answer.[3]
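To make the distinction concrete, here is a minimal sketch in Python. It is not a description of how Google or any real spelling checker works: the function names and the vetted name list are invented for illustration, and the fuzzy matching is simply the standard library's difflib. The page counts are the ones Google reported to me for the two spellings.

```python
import difflib

# Hypothetical vetted dictionary: spellings someone has confirmed as correct.
VETTED_NAMES = {"Headey", "Broderick", "Sarandon"}

# Hypothetical web-style index: whatever spellings people actually publish,
# mapped to the result counts Google reported for each spelling.
WEB_SPELLINGS = {"Headey": 993_000, "Headley": 27_300}


def spellcheck(word, dictionary=VETTED_NAMES):
    """Return (known, suggestions). Suggestions come only from vetted words,
    so an unknown name with no close vetted match yields an empty list."""
    if word in dictionary:
        return True, []
    close = difflib.get_close_matches(word, sorted(dictionary), n=3, cutoff=0.8)
    return False, close


def search_variants(word, index=WEB_SPELLINGS):
    """Return every indexed spelling close to the query, most common first.
    Misspellings are kept, because the pages using them may still be useful."""
    close = difflib.get_close_matches(word, sorted(index), n=5, cutoff=0.8)
    return sorted(close, key=lambda w: index[w], reverse=True)


if __name__ == "__main__":
    print(spellcheck("Headley"))       # flagged, with a vetted suggestion
    print(spellcheck("Wahlberg"))      # flagged, and no suggestion offered
    print(search_variants("Headley"))  # both spellings, misspelling included
```

The search-style function is happy to hand back 'Headley' alongside 'Headey' because both lead to relevant pages. The checker either offers a spelling it has reason to trust or admits it has nothing to offer.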

  1. It's worth noting that, unlike some Google searches where a misspelling is involved, this one doesn't bring up a "Did you mean 'Lena Headey'?" link at the top of the results. Google informs me that 'Lena Headley' returns 27,300 results, whereas 'Lena Headey' returns 'about 993,000 results'. I'm not sure how big the imbalance between spellings has to be before Google decides to offer a suggestion that I try a different spelling, but I'd have hoped that a 36:1 ratio might have been a sign that something was amiss with the spelling I tried.
  2. As it is being as I type this: I haven't had reason to write about her before, so I've never taught the OS X spelling checker her name.
  3. At this point I fully expect that Sod's Law will kick in and someone will post a comment pointing out some elementary spelling mistake I've made in this post. Bring it on…

This entry was posted on Friday, January 2nd, 2009 at 00:28.