Written by Bong Hyun-Shik
In this article I discuss a complex yet intriguing topic: understanding and analyzing commonly used English words and their pronunciations. I have tried to use as many visualizations as possible. So let's get started!
+ more (less common) ones
This is a commonly-cited example of English spelling being bad. Learning to pronounce the same set of letters differently requires lots of experience, and is frustrating when learning new words.
Another example of this is the ‘correct’ way to spell potato.
Although this pushes it to the extreme, it’s still a valid example of this phenomenon. And it gets you thinking.
It’d be way better if every sound we could make had only one way of spelling, wouldn’t it?
Enter the International Phonetic Alphabet.
This alphabet attempts to standardize “representation of speech sounds in written form”, which sounds great! Finally, no ambiguity!
The trouble is, the number of distinct sounds that humans can make is immense.
Is it really worth having unique letters for the 'th' sounds in then and thin? In fact, there’s a continuous spectrum between those sounds; where does one end and the other begin?
The IPA Pulmonic Consonant Chart charts all possible consonant sounds a human can make, by identifying the manner and place of the sound | Screenshot from Wikipedia
Diacritics (little notations around each letter) attempt to represent even more subtle differences in pronunciation. The result of this is well over a hundred different letters for representing sounds.
English has too few letters for the number of word sounds used, which forces letters to stand for multiple sounds, making learning the language difficult.
The International Phonetic Alphabet has many more letters to represent word sounds, resulting in incredibly bloated written text (which to be fair isn’t why it exists).
Visualizing the Problem
A section of the International Speech Lexicon dictionary (ISLE) | Screenshot by Author
As part of the development of Bad Spelling Generator, I came up with a way of breaking down spellings and pronunciations into syllables, providing a consistent link between written text and speech.
For example, persuade has the pronunciation p ə ɹ . s w ei d, so I break the word into per and suade.
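To make the persuade example concrete, here is a minimal sketch of the pairing step. It assumes the pronunciation arrives as a string with `.` between syllables (as written above) and that the spelling has already been split by hand; the function names are hypothetical, and this is not the actual Bad Spelling Generator code.

```python
def split_pronunciation(pron: str) -> list[str]:
    """Split a dot-delimited pronunciation into syllables, dropping the spaces between phones."""
    return ["".join(syllable.split()) for syllable in pron.split(".")]


def pair_syllables(spelled: list[str], pron: str) -> list[tuple[str, str]]:
    """Pair each written syllable with its spoken syllable, in order."""
    sounds = split_pronunciation(pron)
    if len(spelled) != len(sounds):
        raise ValueError("spelling and pronunciation have different syllable counts")
    return list(zip(spelled, sounds))


# The spelling split ["per", "suade"] is supplied by hand here (assumption):
# how to find that split automatically is the hard part and is not shown.
pairs = pair_syllables(["per", "suade"], "p ə ɹ . s w ei d")
print(pairs)  # [('per', 'pəɹ'), ('suade', 'sweid')]
```

Each (spelling, sound) pair is one row of data, and every chart later in the article can be thought of as a different way of counting rows like these.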
The data source here is the International Speech Lexicon dictionary, taken from the Python package Pysle. It contains hundreds of thousands of words and their pronunciations.
Because there is a single data source, the words heavily favour American pronunciations. Be aware of this bias during the data analysis.
I ran some Python scripts to create databases of information about syllable pronunciation and spelling from over 80,000 English words. Let’s explore the data together and see what we find.
Most Variably-Pronounced Syllables
In the same way that ough has many pronunciations, let’s see what group of letters has the most pronunciations.
How many ways of pronouncing a syllable are there? | Image by Author
17 unique pronunciations for the syllables su and cha!? And I thought ough was bad!
Looking at the data for su, we can get a sense of what this actually means. This is the print-out of each pronunciation and an example of a word it’s found in:
Pronunciations for the letters 'su':
0 | ʃə | censurable
1 | sə | capsular
2 | zə | resurrect
3 | ʃu | censual
4 | ʒə | clausula
5 | su | basuto
6 | ʃʊ | assurance
7 | ʒu | audiovisual
8 | zu | jesu
9 | sʌ | bloodsucker
10 | ʒʊ | caesura
11 | sjʊ | consular
12 | zjə | chasuble
13 | sju | disunite
14 | sʊ | esurient
15 | zʊ | usufruct
16 | zju | unpresuming
I was amazed at the variety. It’s not something you think about in daily life as a native English speaker, but you’d be able to pronounce almost all of those words correctly from experience.
Admittedly some are very similar, but try pronouncing su in each of the words the same as bloodsucker and you’ll realize what a big difference a subtle change makes.
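For the curious, a count like this can be sketched in a few lines of Python. The rows below are a handful of entries copied from the su print-out; the grouping function and its name are illustrative assumptions, not the scripts actually used for the chart.

```python
from collections import defaultdict

# (spelling syllable, pronunciation, example word): a small subset of the 'su' print-out
rows = [
    ("su", "ʃə", "censurable"),
    ("su", "sə", "capsular"),
    ("su", "zə", "resurrect"),
    ("su", "ʒu", "audiovisual"),
    ("su", "sʌ", "bloodsucker"),
]


def group_pronunciations(rows):
    """Map each spelling to {pronunciation: first example word seen}."""
    grouped = defaultdict(dict)
    for spelling, pron, word in rows:
        grouped[spelling].setdefault(pron, word)
    return grouped


grouped = group_pronunciations(rows)

# Rank spellings by how many distinct pronunciations they have
ranked = sorted(grouped.items(), key=lambda item: len(item[1]), reverse=True)
for spelling, prons in ranked:
    print(f"Pronunciations for the letters '{spelling}':")
    for index, (pron, word) in enumerate(prons.items()):
        print(f"{index} | {pron} | {word}")
```

Using a dictionary keyed by pronunciation means a repeated (spelling, pronunciation) pair is only counted once, which is what "unique pronunciations" needs.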
Most Variably-Spelt Syllables
Flipping the previous chart on its head, let’s have a look at what pronunciations have many spellings.
How many ways of spelling each sound are there? | Image by Author
The pronunciation si (pronounced see) has over 40 different spellings? That seems unreasonable. Let’s have a look at the data.
Here’s the list of the syllable spellings and an example word in which it can be found.
Spellings for pronunciation 'si':
0 | c | c
1 | cae | caecilian
2 | ce | abecedarian
3 | cea | ceases
4 | cee | divorcee
5 | cei | ceiling
6 | ceip | receiptor
7 | cey | pricey
8 | ci | acierate
9 | cie | facie
10 | cis | precis
11 | coe | biocoenosis
12 | cy | abbacy
13 | pse | psephological
14 | sai | saiva
15 | scae | muscae
16 | sce | ascesis
17 | scei | transceiver
18 | sci | biosciences
19 | scie | bioscience
20 | se | antiserum
21 | sea | battersea
22 | see | endorsee
23 | sei | caseinogen
24 | seig | seigneur
25 | seu | transeunt
26 | sey | anglesey
27 | si | albigensian
28 | sie | besieging
29 | sig | monsignor
30 | ssae | fossae
31 | sse | colosseum
32 | ssee | addressee
33 | ssey | odyssey
34 | ssi | aglossia
35 | ssie | aussie
36 | ssis | chassis
37 | ssy | bessy
38 | sy | apostasy
39 | ze | yangtze
40 | zi | ritziest
41 | zy | chintzy
Apart from 18 and 19 (which are a bit suspicious), I’d say they’re all valid.
Would you say having over 40 different ways to write a sound is reasonable? English seems to think so.
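Counting spellings per sound is the same bookkeeping with the key flipped. Here is a rough sketch using a few rows copied from the si and su print-outs; the variable names are my own illustrative choices, and the real analysis ran over the full word list.

```python
from collections import defaultdict

# (spelling syllable, sound, example word): a few rows from the print-outs in this article
pairs = [
    ("cei", "si", "ceiling"),
    ("sea", "si", "battersea"),
    ("see", "si", "endorsee"),
    ("sy", "si", "apostasy"),
    ("zy", "si", "chintzy"),
    ("su", "sʌ", "bloodsucker"),
]

# Group by sound this time, collecting the distinct spellings for each one
spellings_by_sound = defaultdict(set)
for spelling, sound, _word in pairs:
    spellings_by_sound[sound].add(spelling)

counts = {sound: len(spellings) for sound, spellings in spellings_by_sound.items()}
most_variable = max(counts, key=counts.get)

print(most_variable, sorted(spellings_by_sound[most_variable]))
# si ['cei', 'sea', 'see', 'sy', 'zy']
```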
Most Common Syllables
As a bit of an aside, I was interested in seeing what syllable sounds appear most often in the word list I was using.
The trouble with simply counting the number of words in which each syllable is used is that it doesn’t give a fair representation of the most ‘common’ sounds. This is because we have no information about how often each word is actually used in day-to-day life.
With this in mind, have a look at the graph below.
What are the top 20 most common syllable sounds? | Image by Author
ə, as in abandon, appears in the most words, followed closely by ri (ree) and li (lee). After the first three, there’s a big drop-off.
Let’s zoom out to see the top 200 syllables.
What are the top 200 most common syllables? | Image by Author
By around the 100th most common syllable, there are fewer than 500 words in which that syllable appears (out of 80,000+), which suggests that the vast majority of words are built from the same few sounds.
This reminded me of the Pareto Principle, otherwise known as the 80/20 rule, where 20% of the causes make up 80% of the effects. Here, the causes are the sounds and the effects are the words in which they’re used.
The following is the ‘Pareto plot’ for this data.
What weight do the most common syllables have in language, compared to the less common ones? | Image by Author
In this case, the effect is even more skewed. About 20% of the syllable sounds make up 90% of all the sounds we make.
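A Pareto curve like this is just a sorted running total. Below is a minimal sketch; the counts in `toy_counts` are made-up numbers for illustration only (they are not taken from the ISLE data), and the function names are hypothetical.

```python
from itertools import accumulate


def pareto_curve(counts: dict[str, int]) -> list[tuple[str, float]]:
    """Return (syllable, cumulative share of all occurrences), most common first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    running = accumulate(count for _, count in ordered)
    return [(syllable, cumulative / total) for (syllable, _), cumulative in zip(ordered, running)]


def share_needed(counts: dict[str, int], target: float) -> float:
    """Fraction of distinct syllables needed to cover `target` of all occurrences."""
    curve = pareto_curve(counts)
    for rank, (_, cumulative_share) in enumerate(curve, start=1):
        if cumulative_share >= target:
            return rank / len(curve)
    return 1.0


# Made-up counts, purely to show the mechanics (assumption, not real data)
toy_counts = {"ə": 60, "ɹi": 25, "li": 10, "si": 3, "di": 2}

print(share_needed(toy_counts, 0.9))  # 0.6: three of the five toy syllables cover 90%
```

On the real data, the same calculation is what gives the "about 20% of the sounds" figure quoted above.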
The Network Of Speech
The next question I wanted to answer was how these syllables link together. How likely is each syllable to appear in a word, given that another syllable is known to be in it?
For example, if I knew a word began with the sound si, what sounds are the most likely to come next? The resulting network graph is one of my favorite visualizations I’ve produced.
I calculated how often each of the top 20 most common syllable sounds appears with every other. The node size represents the overall number of words the syllable appears in.
What syllables are frequently found together? | Image by Author
The strong links between ə and ɹi suggest that they’re commonly found in words together, and the very thin links between di and tɪ show those syllables are rarely found in words together.
Looking generally, it seems that the more similar two syllables are, the weaker the link is between them.
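The numbers behind a network like this can be sketched as a co-occurrence count: for every word, count each pair of distinct syllables once. The syllable lists below are invented placeholders rather than real dictionary entries, and the function is an illustrative assumption, not the code used for the graph.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(words_as_syllables: list[list[str]]) -> tuple[Counter, Counter]:
    """Return (node sizes, link weights) for a syllable network.

    Node size: number of words containing the syllable.
    Link weight: number of words containing both syllables of a pair.
    """
    node_sizes = Counter()
    link_weights = Counter()
    for syllables in words_as_syllables:
        unique = sorted(set(syllables))  # sort so each pair has one canonical order
        node_sizes.update(unique)
        link_weights.update(combinations(unique, 2))
    return node_sizes, link_weights


# Placeholder syllable lists standing in for real words (assumption)
toy_words = [
    ["ə", "ɹi"],
    ["ə", "ɹi", "li"],
    ["di", "ə"],
    ["li"],
]

node_sizes, link_weights = cooccurrence(toy_words)
print(node_sizes["ə"], link_weights[("ə", "ɹi")])  # 3 2

# Share of words containing ə that also contain ɹi
print(link_weights[("ə", "ɹi")] / node_sizes["ə"])
```

Dividing a link weight by a node size, as in the last line, gives the "given one syllable, how likely is the other" reading of the graph.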
The Absurdity of Vowels
Finally, let’s remember how we started this article.
By comparing syllable sounds and spelling, we were able to identify spellings with massively varying pronunciations, as well as sounds that can come from a huge variety of spellings.
We’ve been focussing on syllables, but it’s vowels that are some of the worst offenders in this world of inconsistent pronunciation.
Let’s look at all the different sounds each vowel can make. In the heatmap below, a dark color represents a large number of words with that letter/sound combination.
What sounds come from the five vowels? | Image by Author
What we see is that there are certain sounds that can be made by most, if not all, vowels. The sounds i, ʌ, ɑ, ə and ɪ can all be made by 3 or more vowels!
And this isn’t even taking into account vowel combinations (like ai), and different accents and dialects, which would make vowel sounds even more confusing!
What I’m trying to say is that the vowel used in a word doesn’t unambiguously identify the sound it represents, and that could be frustrating for anyone trying to learn the language.
If this article were to be summarised in a sentence it would be: Written language provides very little intuitive understanding of spoken language.
Sure, it can point you in the right direction, but only through years of experience. Even then the result is ambiguous.
If you want to see this phenomenon in action, check out Bad Spelling Generator, which re-spells words by swapping out equivalently-pronounced syllables from other words. The results can be absurd. This kind of analysis could also feed into training Machine Learning and Natural Language Processing models, for example for sentiment analysis of text.