A Deep Analysis of Over 80 Thousand English Words And Their Pronunciations

Written by Bong Hyun-Shik

In this article I discuss a complex yet intriguing topic: understanding and analyzing commonly used English words and their pronunciations. I have tried to use as many visualizations as possible. So let's get started!

We all know that the word-segment ‘ough’ can be pronounced in many different ways.

  • Cough

  • Dough

  • Plough

  • Thought

  • Through

  • + more (less common) ones

This is a commonly-cited example of English spelling being bad. Learning to pronounce the same set of letters differently requires lots of experience, and is frustrating when learning new words.

Another example of this is the ‘correct’ way to spell potato.

Although this pushes it to the extreme, it’s still a valid example of this phenomenon. And it gets you thinking.

It’d be way better if every sound we could make had only one way of spelling, wouldn’t it?

Enter the International Phonetic Alphabet.

This alphabet attempts to standardize “representation of speech sounds in written form”, which sounds great! Finally, no ambiguity!

The trouble is, the number of distinct sounds that humans can make is immense.

Is it really worth having unique letters for the 'th' sounds in then and thin? In fact, there’s a continuous spectrum between those sounds; where does one end and the other begin?

The IPA Pulmonic Consonant Chart charts all possible consonant sounds a human can make, by identifying the manner and place of the sound | Screenshot from Wikipedia

Diacritics (little notations around each letter) attempt to represent even more subtle differences in pronunciation. The result of this is well over a hundred different letters for representing sounds.

In Summary

English has too few letters for the number of word sounds used, which forces letters to stand for multiple sounds, making learning the language difficult.

The International Phonetic Alphabet has many more letters to represent word sounds, resulting in incredibly bloated written text (which to be fair isn’t why it exists).

Visualizing the Problem

A section of the International Speech Lexicon dictionary (ISLE) | Screenshot by Author

As part of the development of Bad Spelling Generator, I came up with a way of breaking down spellings and pronunciations into syllables, providing a consistent link between written text and speech.

For example, persuade has the pronunciation p ə ɹ . s w ei d, so I break the word into per and suade.
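To make the pronunciation side of that split concrete, here is a minimal sketch. It assumes the dot-delimited, space-separated format shown in the persuade example; the function name split_syllables is made up for illustration and this is not the code behind Bad Spelling Generator. Matching each sound syllable back to its letters (per, suade) is the harder part and is not attempted here.

```python
def split_syllables(pronunciation: str) -> list[str]:
    """Split a dot-delimited pronunciation into one string per syllable."""
    syllables = []
    for chunk in pronunciation.split("."):
        phones = chunk.split()  # individual sounds, e.g. ['p', 'ə', 'ɹ']
        if phones:              # skip empty chunks from stray dots
            syllables.append("".join(phones))
    return syllables


print(split_syllables("p ə ɹ . s w ei d"))
```

Each returned string is one spoken syllable, which is the unit the rest of the analysis counts and compares.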

The data source here is the International Speech Lexicon dictionary, taken from the Python package Pysle. It contains hundreds of thousands of words and their pronunciations.

Because there is only a single data source, the words heavily favour American pronunciations. Be aware of this bias during the data analysis.

I ran some Python scripts to create databases of information about syllable pronunciation and spelling from over 80,000 English words. Let’s explore the data together and see what we find.

Most Variably-Pronounced Syllables

In the same way that ough has many pronunciations, let’s see what group of letters has the most pronunciations.

How many ways of pronouncing a syllable are there? | Image by Author

17 unique pronunciations for the syllables su and cha!? And I thought ough was bad!

Looking at the data for su, we can get a sense of what this actually means. This is the print-out of each pronunciation and an example of a word it’s found in:

Pronunciations for the letters 'su':
0 	| ʃə 	| censurable
1 	|| capsular
2 	|| resurrect
3 	| ʃu 	| censual
4 	| ʒə 	| clausula
5 	| su 	| basuto
6 	| ʃʊ 	| assurance
7 	| ʒu 	| audiovisual
8 	| zu 	| jesu
9 	|| bloodsucker
10 	| ʒʊ 	| caesura
11 	| sjʊ 	| consular
12 	| zjə 	| chasuble
13 	| sju 	| disunite
14 	|| esurient
15 	|| usufruct
16 	| zju 	| unpresuming

I was amazed at the variety. It’s not something you think about in daily life as a native English speaker, but from experience you’d be able to pronounce almost all of those words correctly.

Admittedly some are very similar, but try pronouncing su in each of the words the same as bloodsucker and you’ll realize what a big difference a subtle change makes.
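A print-out like the one above boils down to grouping pronunciations by spelling. The sketch below shows one way that grouping might be done, assuming the lexicon has already been reduced to (spelled syllable, pronounced syllable, example word) rows. The rows are a handful copied from the lists in this article, and the function name is hypothetical.

```python
from collections import defaultdict

# (spelled syllable, pronounced syllable, example word)
rows = [
    ("su", "ʃə", "censurable"),
    ("su", "ʃu", "censual"),
    ("su", "zu", "jesu"),
    ("cy", "si", "abbacy"),
    ("sy", "si", "apostasy"),
]


def pronunciations_by_spelling(rows):
    """Map each spelling to {pronunciation: first example word seen}."""
    table = defaultdict(dict)
    for spelling, sound, word in rows:
        table[spelling].setdefault(sound, word)
    return table


table = pronunciations_by_spelling(rows)
# Spellings with the most distinct pronunciations come first.
ranked = sorted(table, key=lambda s: len(table[s]), reverse=True)

for i, (sound, word) in enumerate(table[ranked[0]].items()):
    print(f"{i}\t| {sound}\t| {word}")
```

Swapping the roles of the first two fields would group spellings by sound instead.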

Most Variably-Spelt Syllables

Flipping the previous chart on its head, let’s have a look at what pronunciations have many spellings.

How many ways of spelling each sound are there? | Image by Author

The pronunciation si (pronounced see) has over 40 different spellings? That seems unreasonable. Let’s have a look at the data.

Here’s the list of the syllable spellings and an example word in which it can be found.

Spellings for pronunciation 'si':
0 	| c 	| c
1 	| cae 	| caecilian
2 	| ce 	| abecedarian
3 	| cea 	| ceases
4 	| cee 	| divorcee
5 	| cei 	| ceiling
6 	| ceip 	| receiptor
7 	| cey 	| pricey
8 	| ci 	| acierate
9 	| cie 	| facie
10 	| cis 	| precis
11 	| coe 	| biocoenosis
12 	| cy 	| abbacy
13 	| pse 	| psephological
14 	| sai 	| saiva
15 	| scae 	| muscae
16 	| sce 	| ascesis
17 	| scei 	| transceiver
18 	| sci 	| biosciences
19 	| scie 	| bioscience
20 	| se 	| antiserum
21 	| sea 	| battersea
22 	| see 	| endorsee
23 	| sei 	| caseinogen
24 	| seig 	| seigneur
25 	| seu 	| transeunt
26 	| sey 	| anglesey
27 	| si 	| albigensian
28 	| sie 	| besieging
29 	| sig 	| monsignor
30 	| ssae 	| fossae
31 	| sse 	| colosseum
32 	| ssee 	| addressee
33 	| ssey 	| odyssey
34 	| ssi 	| aglossia
35 	| ssie 	| aussie
36 	| ssis 	| chassis
37 	| ssy 	| bessy
38 	| sy 	| apostasy
39 	| ze 	| yangtze
40 	| zi 	| ritziest
41 	| zy 	| chintzy

Apart from 18 and 19 (which are a bit suspicious), I’d say they’re all valid.

Would you say having over 40 different ways to write a sound is reasonable? English seems to think so.

Most Common Syllables

As a bit of an aside, I was interested in seeing what syllable sounds appear most often in the word list I was using.

The trouble with simply counting the number of words in which each syllable is used is that it doesn’t give a fair representation of the most ‘common’ sounds. This is because we have no information about the relative frequencies with which each word is used in day-to-day life.

With this in mind, have a look at the graph below.

What are the top 20 most common syllable sounds? | Image by Author

ə, as in abandon, appears in the most words, followed closely by ri (ree) and li (lee). After the first three, there’s a big drop-off.
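Counting of this kind can be sketched in a few lines. The mini-lexicon below is toy data with rough, hand-written syllable splits (it is not taken from ISLE), and each syllable is counted once per word, which matches "number of words in which each syllable is used" rather than real-world usage frequency.

```python
from collections import Counter

# Toy lexicon: word -> syllable sounds (illustrative, hand-written splits)
lexicon = {
    "abandon": ["ə", "bæn", "dən"],
    "agree": ["ə", "gɹi"],
    "about": ["ə", "baʊt"],
    "really": ["ɹi", "li"],
    "freely": ["fɹi", "li"],
}

words_per_syllable = Counter()
for word, syllables in lexicon.items():
    # set() so a syllable repeated inside one word still counts as one word
    words_per_syllable.update(set(syllables))

for sound, n_words in words_per_syllable.most_common(2):
    print(sound, n_words)
```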

Let’s zoom out to see the top 200 syllables.

What are the top 200 most common syllables? | Image by Author

By around the 100th most common syllable, there are fewer than 500 words in which that syllable appears (out of 80,000+), which seems to suggest that the vast majority of words use the same few sounds.

This reminded me of the Pareto Principle, otherwise known as the 80/20 rule, where 20% of the causes make up 80% of the effects. Here, the causes are the sounds and the effects are the words in which they’re used.

The following is the ‘Pareto plot’ for this data.

What weight do the most common syllables have in language, compared to the less common ones? | Image by Author

In this case, the effect is even more skewed. About 20% of the syllable sounds account for 90% of all syllable occurrences in the word list.
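The number behind a Pareto plot is a cumulative share: sort the counts from largest to smallest and see how far down the list you have to go to cover a given fraction of the total. Here is a minimal sketch with made-up counts; share_needed and toy_counts are illustrative names, not part of the original scripts.

```python
def share_needed(counts, target=0.9):
    """Fraction of items (most common first) needed to reach `target` of the total."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        if running >= target * total:
            return rank / len(ordered)
    return 1.0


# Made-up counts for ten syllables, totalling 100 occurrences
toy_counts = [50, 41, 2, 1, 1, 1, 1, 1, 1, 1]
print(share_needed(toy_counts))  # fraction of syllables covering 90% of occurrences
```

With these toy numbers, two of the ten syllables already cover more than 90% of the occurrences, which is the same kind of skew described above.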

The Network Of Speech

The next question I wanted answered was how these syllables link together. How likely is each syllable to appear in a word, given that another syllable is known to be in it?

For example, if I knew a word began with the sound si, what sounds are the most likely to come next? The resulting network graph is one of my favorite visualizations I’ve produced.

I calculated how often each of the top 20 most common syllable sounds appears with every other. The node size represents the overall number of words the syllable appears in.

What syllables are frequently found together? | Image by Author

The strong links between ə and ɹi suggest that they’re commonly found in words together, and the very thin links between di and tI show those syllables are rarely found in words together.

Looking generally, it seems that the more similar two syllables sound, the thinner the link between them.
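The link weights in a graph like this can be thought of as co-occurrence counts: for every word, take each pair of distinct syllables it contains and add one to that pair's tally. The sketch below assumes each word is already a list of syllable sounds; the three toy words are invented for illustration and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(words):
    """Count, for each unordered syllable pair, the words containing both."""
    pairs = Counter()
    for syllables in words:
        # sorted(set(...)) gives each unordered pair one canonical key per word
        for a, b in combinations(sorted(set(syllables)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy words, each given as its list of syllable sounds
toy_words = [["ə", "ɹi"], ["ə", "ɹi", "li"], ["di", "li"]]
links = cooccurrence(toy_words)
print(links[("ə", "ɹi")], links[("di", "li")])
```

Dividing a pair's count by the number of words containing one of its syllables would give the "how likely, given the other is present" figure.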

The Absurdity of Vowels

Finally, let’s remember how we started this article.

By comparing syllable sounds and spelling, we were able to identify spellings with massively varying pronunciations, as well as sounds that can come from a huge variety of spellings.

We’ve been focussing on syllables, but it’s vowels that are some of the worst offenders in this world of inconsistent pronunciation.

Let’s look at all the different sounds each vowel can make. In the heatmap below, a dark color represents a large number of words with that letter/sound combination.

What sounds come from the five vowels? | Image by Author

What we see is that there are certain sounds that can be made by most, if not all, vowels. The sounds i, ʌ, ɑ, ə, I can all be made by 3 or more vowels!

And this isn’t even taking into account vowel combinations (like ai), and different accents and dialects, which would make vowel sounds even more confusing!

What I’m trying to say is that the vowel used in a word doesn’t unambiguously identify the sound it represents, and that could be frustrating for anyone trying to learn the language.
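The data behind a heatmap like the one above is just a count for every (vowel letter, sound) combination. Here is a small sketch, assuming letter-to-sound pairs have already been extracted; the pairs below are hand-picked textbook examples (the source word is noted in each comment), not rows from the real dataset.

```python
from collections import Counter

# (vowel letter, sound) pairs, hand-picked for illustration
observations = [
    ("a", "ə"),  # about
    ("e", "ə"),  # taken
    ("i", "ə"),  # pencil
    ("o", "ə"),  # lemon
    ("u", "ə"),  # supply
    ("a", "ə"),  # sofa
    ("e", "i"),  # be
    ("i", "i"),  # ski
    ("a", "æ"),  # cat
]

heat = Counter(observations)  # cell value for each letter/sound combination

letters_per_sound = {}
for letter, sound in heat:
    letters_per_sound.setdefault(sound, set()).add(letter)

# Sounds that three or more different vowel letters can make
shared = {s for s, letters in letters_per_sound.items() if len(letters) >= 3}
print(shared)
```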

Key Takeaways

If this article were to be summarised in a sentence it would be: Written language provides very little intuitive understanding of spoken language.

Sure, it can point you in the right direction, but only through years of experience. Even then the result is ambiguous.

If you want to see this phenomenon in action, check out Bad Spelling Generator, which re-spells words by swapping out equivalently-pronounced syllables from other words. The results can be absurd. This analysis could also be taken further to help train machine learning and natural language processing models, for example for sentiment analysis of text.

If you liked this article, do share and show your love! (This took a while to write :D)