The Principal Dimensions of English

I want to talk about the principal dimensions of theories of art, but to do that, I must explain what I mean by “principal dimensions”. Besides, you should learn this stuff anyway.

How to Recommend Stories if You’re a Computer

Suppose you want to decide, after reading the first thousand words of a story on your Kindle, whether to recommend that Bob read it. Also suppose you’re a computer, so you must summarize the story in numbers. You can list some numbers for each story: its number of words; if it’s a series, whether it’s complete (1) or incomplete (0); whether it’s listed in the romance genre, the action/adventure genre, etc.; how good the style is on a scale from 0 to 10; and how often the words “kiss”, “snuggle”, “sesquipedalian”, and “bloody” appear.

You end up with tens of thousands of numbers about the story. What do you do with them?

First, you make up terminology. Let N  be the number of numbers (say, 23724). Call each of the things being counted a “dimension”, and the whole set of N numbers a “point” in “N-dimensional space”. (Don’t worry that you can’t picture N-dimensional space. Nobody can.)

Next, you need to get similar sets of N numbers for a bunch of other stories. Then, you need to know which of those stories Bob likes. Then, you recommend Bob read the story if its values for those numbers are similar to those for the stories Bob likes.
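To make this concrete, here is a toy sketch of the pipeline so far (the feature list, weights, and similarity threshold are all made up for illustration; a real system would track tens of thousands of numbers):

```python
from collections import Counter

# Toy featurizer: a handful of the numbers a real system would track.
TRACKED_WORDS = ["kiss", "snuggle", "sesquipedalian", "bloody"]

def featurize(text, is_romance=0, is_complete=0):
    """Turn a story's text and metadata into a point in N-space."""
    counts = Counter(text.lower().split())
    return ([len(text.split()), is_complete, is_romance]
            + [counts[w] for w in TRACKED_WORDS])

def distance(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def recommend(candidate, liked_points, threshold=5.0):
    """Recommend the story if its point is near any story Bob liked."""
    return any(distance(candidate, p) < threshold for p in liked_points)
```

With only a hundred liked stories and tens of thousands of raw dimensions, though, this direct comparison is hopeless, which is exactly the problem the rest of this section addresses.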

But you might only know 100 stories that Bob likes. Hardly enough to determine exactly how he feels about the word “sesquipedalian”.

What you could do is look at the other stories, and group together counted things that tend to go together most of the time. So, you might notice that stories that use the word “snuggle” a lot also use “kiss” more than the others, but use “wart” and “angst” less often. So you make a single new, fake dimension, which for the benefit of any humans reading this I will call Kissiness, like this:

Kissiness = 30*Romance – 10*Dark + 3*count(kiss) + 3*count(cuddle) + count(close) + count(smooth) … – count(angst) – 2*count(wart)

You make some other fake dimensions that group together words that seem more or less likely to be found together:

Sexiness = 30*Sex + 10*Mature – 300*Everyone + count(grope) + count(hard) + count(soft) + 3*count(erect) + …

Violence = 10*Gore + 5*Mature + 3*count(bloody) + 5*count(battle) + count(hard) – count(soft) – count(fuzzy)

Second_Person = 10*count(you) + 5*count(your) – 10*count(I)

Superheroics = 5*count(power) + 2*count(mighty) + 3*count(evil) + count(cape) + 3*count(costume) + 2*count(mask)

Obscenity = count(@$!) + 5*count(@#$@#$) – count(fudge) – count(hay)

Bigwordiness = count(assiduous) + count(voracious) + count(punctilious) + …

(I’ve listed several words each, but more realistically there would be hundreds making up each fake dimension.)
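In code, a fake dimension is just a weighted sum over the real dimensions. A minimal sketch, using a few of the Kissiness weights from the formula above (by convention here, uppercase names are genre flags and lowercase names are word counts):

```python
from collections import Counter

# Illustrative weights from the Kissiness formula above (elided terms omitted).
KISSINESS_WEIGHTS = {
    "ROMANCE": 30, "DARK": -10,
    "kiss": 3, "cuddle": 3, "close": 1, "smooth": 1,
    "angst": -1, "wart": -2,
}

def fake_dimension(weights, genre_flags, text):
    """Project a story onto one fake dimension: a weighted sum of
    genre flags and word counts."""
    counts = Counter(text.lower().split())
    score = sum(w * genre_flags.get(name, 0)
                for name, w in weights.items() if name.isupper())
    score += sum(w * counts[name]
                 for name, w in weights.items() if not name.isupper())
    return score
```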

Then you can build categories using just the fake dimensions.  In fact, I think this is very similar to what your brain does for you.

Principal Component Analysis (PCA)

There’s a signal-processing technique called Principal Component Analysis (PCA) which is one of the Deep Insights into Everything that philosophy students should study instead of Plato’s forms. It does all this for you automatically, optimally, for any category of things described by points in an N-dimensional space. It looks at a whole bunch of such things, then figures out the one single best fake dimension that gives the data the widest spread [0]. Then it removes that dimension from the data, and does the same thing again, figuring out the second-best summary dimension. Do that 10 times, and you get 10 summary dimensions. Compute the values along those 10 dimensions for all the points in your N-dimensional space, throw away the original points, and you’ll still have most of the information that was in the original N dimensions. [1]
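Here is a bare-bones sketch of just the first step: finding the single fake dimension with the widest spread, by power iteration on the covariance matrix. (A real program would call a library routine; removing this dimension and repeating gives the later ones.)

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def first_component(points, iters=200):
    """Power iteration: find the direction of widest spread (largest
    variance) of the mean-centered data."""
    n = len(points[0])
    means = [sum(col) / len(points) for col in zip(*points)]
    centered = [[x - m for x, m in zip(p, means)] for p in points]
    # Covariance matrix (up to a constant factor): C = X^T X
    cols = transpose(centered)
    cov = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols]
           for ci in cols]
    v = [1.0] * n
    for _ in range(iters):
        v = mat_vec(cov, v)          # repeatedly apply C...
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]    # ...and renormalize
    return v
```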

Then, predict that the probability that Bob will like a story is the probability that he likes other stories near it in 10-space.
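A minimal nearest-neighbor version of that prediction (the choice of k and the toy points are made up):

```python
def p_bob_likes(candidate, rated, k=5):
    """Estimate P(Bob likes it) as the fraction of liked stories among
    the k nearest rated stories in the reduced space.
    `rated` is a list of (point, liked) pairs with liked in {0, 1}."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    nearest = sorted(rated, key=lambda pl: dist(candidate, pl[0]))[:k]
    return sum(liked for _, liked in nearest) / len(nearest)
```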

This is the technique that won over all other approaches in the $1,000,000 Netflix contest, which was probably the biggest experiment in predicting ratings ever. The key innovation in the contest was using a fast way to approximate PCA [2], a way which, incidentally, can be done by neurons.

Plus, once you’ve got the things you’re dealing with down to 10 dimensions or so, you can use logic or computation on them. You can have complex rules like, “If each line of the story has a similar repeated pattern of stresses, it’s probably poetry” if your analysis has discovered dimensions corresponding to “trochee” and “spondee” [3]. (Which it might, if your training stories had a lot of poetry in them.)

A lot of what cultures do, and what your brain does, is basically PCA followed by categorization and then thinking with those categories. All this crazy high-dimensional stuff happens, and people try to come up with concepts to simplify and explain it. The intermediate-level concepts produced, like “pretty”, “harmonious”, and “cruel”, are not real things. They are fake summary dimensions, each a sum (or function) of lots of real dimensions, that capture a lot of the differences between real things.

Then people build more concepts out of that smaller number of intermediate concepts. Because there are fewer of them, they can use more-powerful ways to combine them, like logical rules or lambda functions, to say whether something is “just”, “virtuous”, “beautiful”, or “sublime”. [4]

Finding Data Points in the Real World

If you don’t have the N-dimensional data for all your objects, don’t worry. You don’t need it. If you can take any 2 objects and say how similar or different they are, or even just whether they’re similar or different, you can jump straight to the lower-dimensional space that PCA would produce. Call it M-dimensional space, M << N. Here’s how:

Compare a bunch of object pairs. Say, for instance, difference(kind, compassionate) = 1 and difference(kind, hurtful) = 6. Then use the differences between them as distances in a low-dimensional space. Start each of the objects at a random point in M-dimensional space (a popular choice is to distribute them on the surface of an (M-1)-dimensional sphere around the origin). Then repeatedly push pairs apart if they’re too close, and pull them nearer each other if they’re too far (keeping them on the surface of that sphere if you’re doing it that way), until most pairs are about as far apart in your M-dimensional space as their difference scores say they should be.

(How do you choose M? You make it up. Everything left over gets mashed together in the Mth dimension, so if you want 10 meaningful dimensions, set M = 11.)
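A crude sketch of that push-and-pull procedure (a form of multidimensional scaling; the learning rate and step count are arbitrary, and this version skips the sphere constraint):

```python
import random

def embed(names, diff, M=2, steps=2000, lr=0.05, seed=0):
    """Place objects in M-dimensional space so pairwise distances roughly
    match the given difference scores: push pairs apart when too close,
    pull them together when too far."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1) for _ in range(M)] for n in names}
    pairs = list(diff.items())
    for _ in range(steps):
        (a, b), target = rng.choice(pairs)
        pa, pb = pos[a], pos[b]
        d = sum((x - y) ** 2 for x, y in zip(pa, pb)) ** 0.5 or 1e-9
        err = (d - target) / d   # positive: too far, pull; negative: push
        for i in range(M):
            step = lr * err * (pa[i] - pb[i])
            pa[i] -= step
            pb[i] += step
    return pos
```

Fed the example differences above, this should place “kind” much nearer “compassionate” than “hurtful”.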

The principal dimensions of English

We can just do this with the English language, using lists of synonyms and antonyms, and see what our M summary dimensions are. In fact, somebody already did. Given some reasonable assumptions and one particular thesaurus, the 10 most-important dimensions of the English language are, roughly:

1. Good/increase vs. bad/diminish

2. Easy vs. hard

3. Start/open/free vs. finish/close/constrain

4. Peripheral vs. central

5. Essential vs. extra

6. Pull vs. push (sort of)

7. Above vs. below

8. Old vs. young

9. Normal vs. strange

10. Specific vs. general

I call these the principal dimensions. If we were doing PCA, we’d call them the principal components. Same thing. [5]

By contrast, if you do the same thing for French with a French thesaurus, these are the first 3 dimensions:

1. Good/increase vs. bad/diminish

2. Easy vs. hard

3. Start/open vs. finish/close

Whoops! Did I say by contrast? They’re the same. That’s because the dimensions that fall out of this analysis aren’t accidents of language. Languages develop to express how humans think. And that’s how humans think, or at least how Western Europeans do. [6, 7, 8]

…but you said this had something to do with art

Here’s how all this is relevant to art: I want to claim I’ve discovered the first principal dimension of theories of art. I’m going to show (hopefully) that the position of different cultures on this dimension predicts something important about what type of art they value. But you need to understand what I mean by their position on this dimension, and what I mean by a type of art.

A type of art is like a mental disease. You diagnose it by noting that it contains, say, any 5 out of a list of 12 symptoms. The art type, or disease type, is a category. Its “symptoms” are measurements on summary (principal) dimensions. The actual data for a culture are going to be things like the degree to which power and wealth are centralized, the level of external threats, the heterogeneity of social roles, and the education level. [9] The principal dimension I’m going to talk about is not a real thing-in-the-world, though it is real. It’s determined by a statistical correlation between actual things in the world.


[0] Technically, the largest variance.

[1] There are many ways of doing PCA, and many related dimension-reduction techniques like “non-linear PCA” and factor analysis. Backpropagation neural networks are doing non-orthogonal PCA, though this wasn’t realized for many years after their invention.

[2] Except that they didn’t technically do PCA because they didn’t have the N-dimensional points. They assumed that each movie was described by an N-dimensional point, and that each user had an N-dimensional preference vector saying how much he liked high values on each dimension, and that their ratings were the dot products of these two vectors. Then they used singular value decomposition (SVD) to construct low-dimensional approximations to both kinds of vectors. So they ended up with the low-dimensional points without ever knowing the “real” original N-dimensional points. If anyone understands how to do this with PCA, please tell me.
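A stripped-down sketch of that kind of incremental factorization (toy data; the actual Netflix entries added regularization, per-user and per-movie biases, and much more):

```python
import random

def factorize(ratings, n_users, n_movies, k=2, steps=3000, lr=0.05, seed=1):
    """Learn a k-dimensional vector per user and per movie so that their
    dot products approximate the observed ratings: the gradient-descent
    approximation to SVD described above.
    `ratings` is a list of (user, movie, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_movies)]
    for _ in range(steps):
        u, m, r = rng.choice(ratings)
        pred = sum(a * b for a, b in zip(U[u], V[m]))
        err = r - pred
        for i in range(k):
            # Nudge both vectors to shrink the prediction error.
            U[u][i], V[m][i] = (U[u][i] + lr * err * V[m][i],
                                V[m][i] + lr * err * U[u][i])
    return U, V
```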

[3] If all you want to do is recommend stories to Bob, it turns out it isn’t helpful for a computer program to construct the final genre categories.  It’s already got the point in N-space for a story; saying which genre that point lies within just throws away information. Just do your PCA and predict whether Bob likes that point in N-space. (Reference: The Netflix contest winners and losers.) But if you want to use logic to reason about genres (say, what themes are common in which genres), then you’ll have to categorize them.

[4] Many of the supposed proofs that meaning cannot be compositional (compositional: a term can be defined without reference to the entire dictionary) stem from the fact that philosophers don’t understand that first-order logic is strictly weaker than a Turing machine (lambda functions). “Logic” is a weak form of reason compared to computation.

[5] The fact that you can reconstruct these dimensions, and will get the same answer every time even with significant changes in the data, refutes the cornerstone of post-modern philosophy, which is that scientific theories, social structures, and especially language, are underdetermined by the world. That is, they claim that any one of an approximately infinite number of other ways of doing things, or categorizing things, or thinking about things, would work equally well, and the real world underlying the things we say can never be known. But in fact, casual experimentation proves that language is astronomically overdetermined. (The number of constraints we get from how linguistic terms relate to each other and to sense data is much larger than the number of degrees of freedom in the system.)

[6] Contrary to what Ferdinand de Saussure said, and post-modern philosophers after him assented to, thought came first and language second. We can excuse him for making this mistake, because he was writing before Darwin’s theories were well-known, except oops, no he wasn’t.

[7] Some “synonyms” are words that are opposite on one dimension, and the same on all the others, allowing people to invert a particular dimension. Examples: challenge / obstacle and abundant / gratuitous (differ on good/bad); tug / yank (on easy/hard); funny / peculiar (on normal/strange).

[8] If you feed the algorithm radically different data, you’ll come up with different dimensions after the first few dimensions, as I suppose they did for French in that study.

So what happens if two people had different life experiences, and their brains came up with different principal dimensions?

It turns out we have a word for this, an old word that predates the math needed to understand it this way: we say they have different paradigms. They classify things in the world using a different set of dimensions. When they think about things, they come up with different answers. When they talk to each other, they each think the other is stupid. This is why political debates rarely change anyone’s mind; the people on opposite sides literally cannot understand each other. Their brains automatically compress their opponents’ statements into dimensions in which the distinctions they’re making are lost.

This is, I think, the correct interpretation of Thomas Kuhn’s observation that scientists using different paradigms can’t seem to communicate with each other.  It doesn’t mean that the choice of paradigm is arbitrary. Different paradigms are better at making distinctions in different data sets. Someone who’s grown up with one data set can’t easily switch to a different one; she would have to re-learn everything. But, given agreement on what the data to explain is, paradigms can be compared objectively.

[9] Yes, it turns out I’m a literary Marxist. Sorry.


Linguistic Puzzle for the Day


To me:

“He wasn’t worse than many people” means there were not many people that he was worse than. That is, he was a very good person.

“He wasn’t any worse than many people” means there are many, but still a minority, of people that he wasn’t worse than. That is, he was a pretty bad person.

Do they sound that way to you? Why the difference?