Information Theory and Writing


I was thinking about what we mean by “wordiness.” We don’t mean having “too many” words; if we did, we would just say “long.” We mean having words that don’t do much.

High-entropy writing

In 1948, Claude Shannon published “A Mathematical Theory of Communication,” an essay (or very short book) that’s surprisingly quick and easy to read for something with such profound mathematical content. It’s one of the three cornerstones of science, along with Euclid’s Elements and Newton’s Principia. It provided equations to measure how much information words convey. Let me repeat that, shouting this time, because the implications surely didn’t sink in the first time: It provided EQUATIONS to measure HOW MUCH INFORMATION WORDS CONVEY.

These measurements turn out to be isomorphic (that’s a big word, but it has a precise meaning that is precisely what I mean) to the concept of thermodynamic entropy. The exact method Shannon used to measure information per letter in English is crude, but it’s probably within 20% of the correct answer most of the time. The important point is that, for a given text and a given reader, there is a correct answer.
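Shannon’s crudest estimate, from letter frequencies alone, is easy to reproduce. Here’s a minimal sketch (my own illustration, not Shannon’s exact procedure) that measures bits per character from single-character frequencies:

```python
import math
from collections import Counter

def entropy_per_char(text):
    """Empirical Shannon entropy in bits per character, treating each
    character as drawn independently from the text's own frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Eight equally likely symbols cost a full 3 bits each;
# a string that is almost all one letter costs far less.
print(entropy_per_char("abcdefgh"))  # 3.0
print(entropy_per_char("aaaaaaab"))  # roughly 0.54
```

Single-character frequencies put English around four bits per letter; Shannon’s fuller estimates, which use longer context and human guessing, bring it down to roughly one bit per letter.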

The implications of being able to measure information are hard to take in without thinking about it for a few decades [1]. For writers, one implication is that the question “Is this story wordy?” has an answer. I could write a simple program that would analyze a story and say how wordy it was.

The caveat is simple, subtle, and enormous: A given text conveys a well-defined amount of information to a given reader, assuming infinite computational resources [2]. Without infinite computational resources, it depends on the algorithms you use to predict what’s coming next, and there are probably an infinite number of possible algorithms. I could easily compute the information content of a story by predicting the next word of each sentence based on the previous two words. This would warn a writer if their style were cliched or vague. But it would miss all the information provided by genre expectations, our understanding of story structure and theme, psychology, and many other things critical in a story.
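That simple program could look something like the sketch below (the function names and training text are my own, and it uses just one previous word rather than two): train word-pair counts on some text, then charge each word of a phrase by how surprising it is given the word before it. Clichés come cheap; fresh phrases cost more bits.

```python
import math
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, curr in zip(words, words[1:]):
        counts[prev][curr] += 1
    return counts, set(words)

def avg_surprisal(counts, vocab, phrase):
    """Average bits per word of `phrase`, predicting each word from the
    one before it, with add-one smoothing so unseen pairs still get a
    small probability (a crude open-vocabulary hack, not a real model)."""
    words = phrase.lower().split()
    bits = 0.0
    for prev, curr in zip(words, words[1:]):
        c = counts[prev]
        p = (c[curr] + 1) / (sum(c.values()) + len(vocab))
        bits += -math.log2(p)
    return bits / (len(words) - 1)

counts, vocab = train_bigrams(
    "he was fit as a fiddle and fit as a fiddle he stayed"
)
cliche = avg_surprisal(counts, vocab, "fit as a fiddle")
fresh = avg_surprisal(counts, vocab, "fit as a hammer")
print(cliche, "<", fresh)  # the expected word costs fewer bits
```

This is exactly the two-word-window critic described above: it would flag clichés and pet phrases, but it knows nothing about genre, structure, or theme.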

But you can be aware of the information content of your story without writing that program or understanding how to measure entropy. One simple way is to be aware of the information content of the words you use. Writers say to use precise words and avoid vague ones. Maybe better advice is, use high-entropy words. A high-entropy word is one that can’t be easily predicted from what came before it. The word “fiddle” is usually unexpected, but is expected if you just said “fit as a”.

High-entropy writing can simply mean putting things together that don’t usually go together:

The ships hung in the sky in much the same way that bricks don’t.

— Douglas Adams, The Hitchhiker’s Guide to the Galaxy

An AMERICAN wearing a jungle hat with a large Peace Sign on it, wearing war paint, bends TOWARD US, reaching down TOWARD US with a large knife, preparing to scalp the dead.

— From a 1975 draft of the screenplay for Apocalypse Now by John Milius and Francis Ford Coppola

When you use a word that’s true and unexpected, it’s poetry. When you tell a story that’s true and unexpected, it’s literature [3]. So aim for the unexpected plot and the unexpected word.

Meaning-dense writing

This is taken a bit too far in modernist poetry, which has very high entropy:

               dead every enourmous [sic] piece

of nonsense which itself must call

a state submicroscopic is-

compared with pitying terrible

some alive individual

— E.E. Cummings, dead every enourmous piece

The problem with measuring information content is that you would produce the most-unpredictable sequence of words by choosing words at random. Meaningless text has maximum information density.

What you want to measure is true, or, better, meaningful, information [4]. Writers often use words and tell stories that are technically low-entropy (the words aren’t unexpected). But whenever they do, if it’s done well, it’s because they convey a lot of extra, meaningful information that isn’t measured by entropy.

To convey a mood or a metaphor, you choose a host of words (and maybe even punctuation) associated with that mood. That makes that cluster of words appear to be low-entropy: They all go together, and seeing one makes you expect the others.

The sky above the port was the color of television, turned to a dead channel.

— William Gibson, Neuromancer

All the world’s a stage, and all the men and women merely players;

They have their exits and their entrances;

And one man in his time plays many parts,

His acts being seven ages.

— William Shakespeare, As You Like It

In a metaphor or a mood, the words convey more information than you see at first glance. That someone would compare the sky to a television channel, and that the world’s channel is dead, tell you a lot about Gibson’s world. That men and women are “merely players” conveys a philosophy. An extended metaphor doesn’t just tell you the information in its sentences. It points out which parts of the two things being compared are like each other, in a way that lets you figure out the different similarities from just a few words. That is extra meaning that isn’t measured by entropy (but would be by Kolmogorov complexity). It may be low-entropy, but it’s meaning-dense.

Rhyme greatly decreases the entropy of the rhyming words. Knowing that you need to end on a word that rhymes with “Frog” reduces the number of possible final words for this poem to a handful. Yet it’s still surprising—not which word Dickinson picked, but all the things it meant when she suddenly compared public society to a …

How dreary—to be—Somebody!

How public—like a Frog—

To tell one’s name—the livelong June—

To an admiring Bog!

— Emily Dickinson, I’m Nobody! Who are You?

Sometimes you use repetition to connect parts of a story:

        ‘Twas the day before Christmas, and a nameless horror had taken residence in John’s chimney. Again.

        ‘Twas the day before Christmas, and a nameless horror had taken residence in Jack’s chimney.

… or to focus the reader’s attention on the theme:

                    “It’s just that I’ve plans for Christmas and—”

… “Don’t you worry about me. I’ve plans for this Christmas.”

… “Indeed, Your Excellency. I’ve plans for Christmas.”

… “Yes. I am. Now go. I’ll keep. Don’t you worry. I’ve plans for Christmas.”

… He had plans this Christmas.

… or to make a contrast:

              Smash down the cities.

Knock the walls to pieces.

Break the factories and cathedrals, warehouses and homes

Into loose piles of stone and lumber and black burnt wood:

You are the soldiers and we command you.

Build up the cities.

Set up the walls again.

Put together once more the factories and cathedrals, warehouses and homes

Into buildings for life and labor:

You are workmen and citizens all: We command you.

— Carl Sandburg, And They Obey

That’s okay. The repetition is deliberate and is itself telling you something more than the sum of what the repeated parts would say by themselves.

Predictable words are no better than vague words

Some words have lots of meaning, yet convey little information because we’re always expecting someone to say them.

What words do I mean? I refer you to (Samsonovic & Ascoli 2010). These gentlemen used energy-minimization (one use of thermodynamics and information theory) to find the first three principal dimensions of human language. They threw words into a ten-dimensional space, then pushed them around in a way that put similar words close together [5]. Then they contrasted the words at the different ends of each dimension, to figure out what each dimension meant.
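A toy version of that procedure looks like the sketch below. It is my own reconstruction in two dimensions with made-up word lists, not the authors’ algorithm or data: synonym pairs attract, antonym pairs repel, and opposed meanings drift toward opposite ends of the space.

```python
import random

# Toy sketch of the synonym/antonym energy-minimization idea:
# scatter words in a low-dimensional space, then repeatedly pull
# synonym pairs together and push antonym pairs apart.
synonyms = [("good", "fine"), ("bad", "awful")]
antonyms = [("good", "bad"), ("fine", "awful")]

random.seed(0)
DIMS = 2
words = {w for pair in synonyms + antonyms for w in pair}
pos = {w: [random.uniform(-1, 1) for _ in range(DIMS)] for w in words}

def nudge(a, b, sign, rate=0.05):
    """Move a and b toward (sign=+1) or away from (sign=-1) each other,
    clamping coordinates to [-1, 1] so repulsion can't run off to infinity."""
    for d in range(DIMS):
        delta = pos[b][d] - pos[a][d]
        pos[a][d] = max(-1.0, min(1.0, pos[a][d] + sign * rate * delta))
        pos[b][d] = max(-1.0, min(1.0, pos[b][d] - sign * rate * delta))

for _ in range(300):
    for a, b in synonyms:
        nudge(a, b, +1)
    for a, b in antonyms:
        nudge(a, b, -1)

def dist(a, b):
    return sum((pos[a][d] - pos[b][d]) ** 2 for d in range(DIMS)) ** 0.5

# After settling, "good" sits near "fine" and far from "bad":
# the good/bad axis has emerged from nothing but attract/repel forces.
print(dist("good", "fine"), "<", dist("good", "bad"))
```

With thousands of words and a real thesaurus instead of four words and two pairs, the dominant axes that emerge are the dimensions the paper reports.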

They found, in English, French, German, and Spanish, that the first three dimensions are valence (good/bad), arousal (calm/excited), and freedom (open/closed). That means there are a whole lot of words with connotations along those dimensions, and because they are so common, they seldom surprise us. Read an emotional, badly-written text—a bad romance novel or a political tract will do—and you’ll find a lot of words that mostly tell you that something is good or bad, exciting or boring, and freeing or constrictive. Words like “wonderful”, “exciting”, “loving”, “courageous”, “care-free”, or “boring”. Read a badly-written polemic or philosophy paper, and you’ll find related words: “commendable”, “insipid”, “bourgeois”, “unforgivable”, “ineffable”. These are words that express judgements. Your story might lead a reader toward a particular judgement, but stating it outright is as irritating and self-defeating as laughing at your own jokes.

Our most-sacred words, like “justice”, “love”, “freedom”, “good”, “evil”, and “sacred”, are these types of words. They are reifications of concepts that we’ve formed from thousands of more-specific cases. But by themselves, they mean little. They’re only appropriate when they’re inappropriate: People use the words “just” or “evil” when they can’t provide a specific example of how something is just or evil.

Avoid these words. Don’t describe a character as an “evil sorceress”; show her doing something evil. Sometimes they’re the right words, but most of the time they’re a sign that you’re thinking abstractly rather than concretely. More on this in a later post.

It’s meaningful for characters to be vague!

The flip side is, have your characters use these words to highlight their faulty thinking! A character may refer to someone as an evil sorceress to show that they are jumping to conclusions. A character might call things “boring” to show that she’s just expressing her prejudices and isn’t open to certain kinds of experience.

[1] Seventy years later, bioinformatics is crippled because biologists still won’t read that book and don’t understand that when you want to compare different methods for inferring information about a protein, there is EXACTLY ONE CORRECT WAY to do it. Which no one ever uses. Same for linguistics. Most experts don’t want to develop the understanding of their field to the point where it can be automated. They get upset and defensive if you tell them that some of their questions have a single mathematically-precise answer. They would rather be high priests, with their expertise more art and poetry than science, free to indulge their whimsies without being held accountable to reality by meddling mathematicians.
[2] And assuming some more abstruse philosophical claims, such as that Quine’s thesis of ontological relativity is false (which this seems, coincidentally, to prove).
[3] When you tell a story that’s false and expected, it’s profitable.
[4] The best way I know to define how much meaning a string of text has is to use Kolmogorov complexity. The Kolmogorov complexity of a text is the number of bits of information needed to specify a computer program that would produce that text as output. But this still fails completely to penalize random strings for being random. A specific random sequence still has Kolmogorov complexity equal to its length if you need to re-produce it. But you don’t need to reproduce it. There’s nothing special about it. The amount of meaning in a text is the amount of information (suitably compressed) that is required to produce that text, or one sufficiently like it for your purposes. For any purpose you can have for a random text, there are a vast number of other random texts that will serve just as well; the length of a computer program to produce a suitably random text is short.
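Kolmogorov complexity is uncomputable, but compressed size gives a computable upper bound on it, and it makes the point above concrete: a patterned text compresses down to a short description, while a specific random string is essentially incompressible, even though any other random string would serve a given purpose just as well. A sketch:

```python
import random
import zlib

# Compressed size as a rough, computable stand-in for Kolmogorov
# complexity (an upper bound; the true quantity is uncomputable).
repetitive = b"the cat sat on the mat " * 40

random.seed(42)  # deterministic "random" bytes for the comparison
random_ish = bytes(random.randrange(256) for _ in range(len(repetitive)))

rep_size = len(zlib.compress(repetitive, 9))
rnd_size = len(zlib.compress(random_ish, 9))

# The patterned text shrinks dramatically; the random bytes barely budge.
print(rep_size, "<", rnd_size)
```

By this raw measure the random string is maximally “complex”; only by asking for *a* sufficiently random string, rather than *that* one, does its required description collapse to almost nothing.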
[5] People usually do this by putting words close to each other that are often used in the same context (the same surrounding words), so that “pleasant” and “enjoy” are close together, as are “car” and “truck”. This work instead took antonyms and synonyms from a thesaurus, and pushed synonyms towards each other and pulled antonyms apart from each other.

Alexei V. Samsonovic & Giorgio A. Ascoli (2010). Principal Semantic Components of Language and the Measurement of Meaning. PLoS ONE 5(6): e10921.



