Wednesday 7 January 2015

Frequency freak

Way back at the start of the year, I mentioned Zipf's law for the word frequencies in normal written English.  I don't think, despite the primacy of E and the vanishing rareness of JXQ&Z, that it works at the level of letters, which seem quite a bit flatter in distribution:
I've snagged the numbers from Wikipedia's estimate of letter frequencies, which puts C higher than U and so nullifies the famous etaoinshrdlu that compositors of linotype machines used to lash out when they wanted to run a test.  These commonest letters were down the left-hand side of the machine, so that operatives could type more words faster than if the letters were arranged either alphabetically or in the deliberately inefficient qwerty form familiar on English keyboards.  And don't get up on your grands chevaux, French people: your azerty keyboard is clearly a knock-off, and no more efficient.  Wikipedia has borrowed from Pavel Mička's website, where the frequencies are given to 5 significant figures but without source.
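For anyone who wants to check the "flatter than Zipf" hunch on their own text, here is a minimal Python sketch. It assumes the simplest form of Zipf's law, where the r-th commonest item turns up 1/r as often as the commonest. The function names letter_counts and zipf_ratios are mine, made up for illustration, and the two-letter tally at the bottom is a toy, not real data.

```python
from collections import Counter


def letter_counts(text):
    """Count the letters A-Z in text, ignoring case and everything else."""
    return Counter(ch for ch in text.upper() if "A" <= ch <= "Z")


def zipf_ratios(counts):
    """Return (letter, observed / Zipf-predicted) pairs in rank order.

    Assumes the strict 1/rank form of Zipf's law: the r-th commonest
    letter is predicted to occur top_count / r times. A ratio of 1.0
    matches the law; ratios above 1.0 mean a flatter distribution.
    """
    ranked = counts.most_common()
    top_count = ranked[0][1]
    return [
        (letter, n / (top_count / rank))
        for rank, (letter, n) in enumerate(ranked, start=1)
    ]


# Toy tally (hypothetical numbers): the runner-up is 1.5x what Zipf predicts.
print(zipf_ratios(Counter({"E": 12, "T": 9})))  # [('E', 1.0), ('T', 1.5)]
```

Feed letter_counts a decent chunk of prose and pass the result to zipf_ratios; if the ratios climb well above 1 as you go down the ranks, letters are indeed flatter than the simple law predicts.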

That sets the scene. A year or two ago, I bought Books 1-4 of the Times2 Jumbo Crosswords when they were on offer at Alidli.  These are General Knowledge puzzles that The Times (of London) puts out on bank-holiday weekends and they make a rather good fit with my expensive education.  So much so that I'm sure, if I didn't go to school with the composer, he (surely a He!) was born +/-5 years and <100km from me, because I get almost all the references!  The puzzles make a way of passing the time which is marginally more stimulating than staring vacantly at the ceiling. When Dau.II comes home she regards doing crosswords with the parents as mildly entertaining and she's not going to stare at our ceiling when she has a perfectly good one to stare at in Cork.  As a way of ringing the changes a few weeks ago, I started filling in only the letters that appeared at the intersections in the crossword rather than writing the whole word. This seemed to annoy other people more than it warranted, but I "completed" 4 or 5 of the puzzles in this ink-saving way.

At Christmas, The Beloved was given Book 8 in the Times2 series and I got Book 9 from an ardent puzzle solver whom I shall here call McCantab. The puzzles in these more recent books are not markedly more up to the present than the others we had, so we have been motoring through some of the new ones. Then I noticed a strange thing: it seemed to me that the frequency of vowels was rather lower in my intersection-only solutions than in a) the completed puzzles or b) the baseline letter-frequency in English.  I imagined there might be an unconscious bias in word choice under the constraint that these letters had to fit two words. I had a null hypothesis: that the letter-frequencies would be the same at intersections as in the whole universe of words - and I meant to test it scientifically.

Accordingly I tallied up the letter-count for 650+ letters from the intersections of 4 partially completed puzzles, calculated the % and plotted that against a cryptography dataset from Cornell based on 40,000 words of English prose; and here are the results:
This Cornell dataset is really close to the Wikipedia numbers but reverses C and U, to give standard shrdlu rather than disparate shrdlc. I think I can shelve my hypothesis about intersectional vowel-scarcity [IVS] as the frequencies seem to track each other rather well and the scatter is probably due to the small sample that I had patience to count up. FH&T are less and CIA&S more common at intersections but not in a way that reaches statistical significance. A negative result is still a result and I'm publishing it here lest some poor girl from Toulouse or Montpellier has a similar idea and wastes time counting letters.  I'm glad I've laid it to rest, in the same way as I calculated the odds of getting Klondike patience/solitaire "out" by playing the game 100x and counting.
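If you'd rather put a number on "statistical significance" than eyeball a scatter, one common choice is Pearson's chi-squared goodness-of-fit statistic. To be clear, this is a sketch of one way to do it and not a record of what I did above. The function name chi_squared and the three-letter tallies are hypothetical, invented to show the arithmetic.

```python
def chi_squared(observed, baseline_pct):
    """Pearson chi-squared statistic of observed counts vs baseline percentages.

    observed:     dict of letter -> count (e.g. the intersection tally)
    baseline_pct: dict of letter -> percentage in ordinary English
    Only letters present in baseline_pct are considered; the baseline is
    rescaled so that expected counts add up to the observed total.
    """
    total = sum(observed.get(letter, 0) for letter in baseline_pct)
    pct_sum = sum(baseline_pct.values())
    stat = 0.0
    for letter, pct in baseline_pct.items():
        expected = total * pct / pct_sum
        stat += (observed.get(letter, 0) - expected) ** 2 / expected
    return stat


# Hypothetical three-letter example, NOT the crossword data:
toy_observed = {"E": 30, "T": 50, "A": 20}
toy_baseline = {"E": 50, "T": 30, "A": 20}
print(chi_squared(toy_observed, toy_baseline))
```

With all 26 letters there are 25 degrees of freedom, and the 5% critical value is about 37.65: a statistic below that gives no grounds to reject the null hypothesis. One caveat is that with only 650-odd letters the expected counts for the likes of J, Q, X and Z fall well below 5, so those letters would normally be pooled before trusting the test.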


