Let’s say you’re in a room, alone.

You look around. The room doesn’t have any windows or doors, just a small slot in one of the walls, and shelves heaped with books and tomes. They’re filled with strange symbols and accompanied by English instructions that help you convert them into new, equally strange ones.

Eventually, a scrap of paper falls through the slot. It contains a string of symbols which you don’t understand, similar to the ones you found in the books. You go through all the pages of all the books until you find the matching characters, then follow the instructions, which tell you what to write below the original message and to slip the paper back through the slot. So you do.

Those aren’t meaningless symbols scribbled on a piece of paper. On the other side of the wall, there are native Chinese speakers, who have just received a perfect answer to the question they asked. They naturally conclude there’s a native speaker of Chinese in the room. Only we know this isn’t the case.

This is the Chinese Room, a much-debated thought experiment proposed in 1980 by philosopher John Searle. In the experiment, the books aren’t Chinese-English dictionaries: at no point do they make any attempt to explain the meaning of the symbols. They merely give you instructions on how to take an input, manipulate the characters based on their relations to one another, and provide an output. The books act like a computer program.

Searle argued that the Turing test wasn’t a reliable test for machine intelligence. Any effective AI program could eventually learn the rules that govern a particular language and give the illusion that it has the same understanding as a native speaker.

These interactions between computers and human languages are at the center of the field of Natural Language Processing, or NLP. Speech recognition, text summarization, sentiment analysis, machine translation: NLP is everywhere. And it has seen massive improvements since the 80s. Though there’s still enough humorous machine translation to go around, we’ve come a long way since the days of BabelFish. We (myself especially, a monolingual Yankee living in Lisbon) take access to fast, fairly accurate translations for granted.

Have you ever wondered how computer programs are able to process words?

After all, even the simplest sentences can be semantic minefields, littered with connotations and grammatical nuance that only native speakers can instantly make sense of. Take, for example, the following sentence: “Yesterday, I went to the bank and ran into my friend.” How might a computer translate this into French? The crudest way, perhaps, would be a word-for-word replacement with a bilingual dictionary. We would probably end up with the translation, “Hier, je allé à le banque et couru dans mon ami.” For starters, there are problems of verb conjugation and article-noun gender agreement (“le banque” should be “la banque”). But if we really wanted to translate this way, we could devise a set of rules that would take care of these grammar snafus. After implementing them, we might end up with the following translation: “Hier, je suis allé à la banque et j’ai couru dans mon ami.” Definitely an improvement — but, in this version, I’m still charging my friend like a defensive tackle.
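To see why the crude approach falls short, here is a minimal sketch of word-for-word replacement. The dictionary `EN_FR` and the function `word_for_word` are hypothetical, reconstructed from the crude translation quoted above rather than taken from any real translation system.

```python
import re

# Hypothetical bilingual dictionary, reconstructed from the crude
# translation quoted in the text. Each word maps to exactly one word.
EN_FR = {
    "Yesterday": "Hier", "I": "je", "went": "allé", "to": "à", "the": "le",
    "bank": "banque", "and": "et", "ran": "couru", "into": "dans",
    "my": "mon", "friend": "ami",
}

def word_for_word(sentence, dictionary):
    """Replace each word independently; unknown words are left untouched."""
    return re.sub(
        r"[A-Za-z]+",
        lambda match: dictionary.get(match.group(0), match.group(0)),
        sentence,
    )

translation = word_for_word(
    "Yesterday, I went to the bank and ran into my friend.", EN_FR
)
print(translation)  # Hier, je allé à le banque et couru dans mon ami.
```

Because every word is looked up in isolation, nothing in this sketch can notice that “ran into” is an idiom or that “banque” is feminine.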

At time of writing, Google Translate — which isn’t exactly known for perfect translation — gives us this:  “Hier, je suis allé à la banque et suis tombé sur mon ami.” It correctly translates the idiomatic expression meaning to meet someone unexpectedly. This can’t be the work of word-for-word replacement alone.

So what’s happening here? First we should consider the way we humans learn to sort out verbal ambiguity. When we experience enough examples as children, we begin to assign semantic value to words, and abstract and extrapolate these semantic values given combinations of words. In simpler terms, without explicitly being taught, we understand what words mean and how they are affected by their context. Referring to the prior English to French example, our first instinct is to think that we met our friend and not that we collided with him. We have experiences of the physical world and a lifetime of linguistic input to help us put things in context.

What do words and phrases mean to a computer, which can only understand zeroes and ones?

Although computers cannot actually “understand” language in a human sense, training them to generate useful information from text isn’t all that different from our own experience of language acquisition. Show a computer enough examples, and it will begin to recognize patterns. But what’s the substitute for human understanding? Word embeddings, the fundamental units of any natural language processing task.

A word embedding is essentially a sequence of numbers—a vector—that stores information about the word’s meaning.

The goal of creating word embeddings is twofold: to improve other NLP tasks, such as machine translation, and to analyze similarities between words and groups of words.

Due to a number of breakthroughs in word embedding methods, 2018 has been hailed as the golden age of NLP. These new methods yielded significant improvements in our ability to model language, which should soon manifest themselves in consumer products and businesses alike.

Let’s look at a simple example to gain a feeling for what a word embedding is.

Suppose we want to create 2-dimensional word embeddings (that is, each word is represented by a set of two numbers) for certain animals: hippopotamus, snake, butterfly, platypus. Further, let’s suppose our two dimensions represent two characteristics animals may exhibit to varying degrees: “dangerousness” and “furriness.”

Fake word embeddings for these animals might be the following:

Animal       Dangerous   Furry
Hippo          0.85       0.13
Snake          0.88      -0.97
Butterfly     -0.91      -0.86
Platypus       0.61       0.79

In this example, “hippo” is represented by the vector [0.85, 0.13], snake by [0.88, -0.97], and so on. We now have a numerical representation, albeit oversimplified, of each of these animals in terms of these two characteristics. All sorts of mathematical operations can be performed on these vectors to give us new information.

A frequently cited example highlighting the power of word embedding operations is king - man + woman = queen. In this diagram, the blue arrows could be said to represent gender, and the orange arrows royalty.
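The king - man + woman arithmetic can be sketched with toy vectors. The 2-dimensional values below are invented for illustration (first number for gender, second for royalty) and do not come from a trained model; the helper names `cosine` and `analogy` are likewise hypothetical.

```python
import math

# Invented 2-D embeddings: (gender, royalty). Not from a trained model.
vectors = {
    "man":    [1.0, 0.0],
    "woman":  [-1.0, 0.0],
    "king":   [1.0, 1.0],
    "queen":  [-1.0, 1.0],
    "prince": [0.9, 0.8],
    "girl":   [-0.9, -0.1],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is a common convention in analogy tests; without it, the nearest vector is often one of the inputs.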

One such operation is comparison. How similar is one word to another? In our example with the animals, we can imagine the numerical representation of “hippo,” “snake,” “butterfly,” and “platypus” in a 2D graph, as a line stretching from the origin and passing through the points indicated by the numbers. The similarity of these words can therefore be determined by the angle between their vectors (we call this cosine similarity). Essentially, words that are separated by a 90° angle have no semantic relation, while words that are separated by a 180° angle are exact opposites.

In this example, the angle between hippo and platypus is around 44°, while the angle between hippo and butterfly is around 145°.
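For readers who want to check the geometry, the angle follows from the cosine similarity, cos θ = (u · v) / (|u| |v|). The sketch below computes it directly from the values in the animal table; the function names are illustrative.

```python
import math

# Values copied from the animal table: (dangerous, furry).
animals = {
    "hippo":     (0.85, 0.13),
    "snake":     (0.88, -0.97),
    "butterfly": (-0.91, -0.86),
    "platypus":  (0.61, 0.79),
}

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.degrees(math.acos(cos))

for other in ("platypus", "butterfly"):
    angle = angle_degrees(animals["hippo"], animals[other])
    print(f"hippo vs {other}: {angle:.1f} degrees")
```

Computed from the table values, this gives roughly 44° for hippo and platypus and roughly 145° for hippo and butterfly.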

Of course this is just a fun, hypothetical example, merely highlighting what word embeddings are and beginning to describe how they might be useful. In practice, much higher dimensional vectors are used (typically in the hundreds), and trying to assign semantic fields such as “dangerous” and “furry” to these dimensions would be difficult as well as unfair, since the algorithm does not truly know the words’ meaning. Furthermore, very large corpora on the order of tens of millions of words are needed to “learn” satisfactory word embeddings.

But first, we need to quickly touch on a field of linguistics called distributional semantics.

Word embeddings efficiently capture something called the “distributional hypothesis,” aptly summarized by British linguist John Rupert Firth in his 1957 work, A synopsis of linguistic theory:

“You shall know a word by the company it keeps.”

The field of distributional semantics posits that words and phrases that occur in similar contexts—similar distributions—have similar meanings.

For example, suppose we have a corpus of several sentences:

  • He petted the fuzzy dog.
  • He petted the fuzzy cat.
  • He played fetch with the dog.

For a human, it is immediately apparent that cats and dogs are related (both pets), while “petted” and “played fetch” are related (both pet activities). But it’s also immediately apparent that you can’t get a cat to play fetch, though not for lack of trying!

Computational word embedding methods take advantage of this idea — that the context of a word can help tell us what the word means if we have seen a sufficient number and variety of examples.
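A rough way to see the distributional hypothesis at work on the three sentences above is to count which words appear near each word. In this sketch, the window of two words on each side and the function name `context_counts` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

corpus = [
    "He petted the fuzzy dog.",
    "He petted the fuzzy cat.",
    "He played fetch with the dog.",
]

def context_counts(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().rstrip(".").split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts

contexts = context_counts(corpus)
shared = set(contexts["dog"]) & set(contexts["cat"])
print(sorted(shared))  # ['fuzzy', 'the']
```

Even in this tiny corpus, “dog” and “cat” share contexts, which is the signal that embedding methods exploit at a much larger scale.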

This finally brings us to the actual algorithms for calculating word embeddings, such as Word2Vec, introduced by Tomas Mikolov and his fellow researchers at Google, in 2013. The main idea behind the algorithm is to predict, for any given word, neighboring terms, using a lot of text as training data.

Let’s go back to a slight variation of our original sentence: “Yesterday, I went to the bank and I read the newspaper.”

With Word2Vec, we first want to define a context window, the number of words to look at on either side of the current word; let’s say two. So, for every word in our training data, we will look at the two words before and the two words after it and create pairs with the current word and each of the four context words. The center word is the input and the context word is the output, where we want to use the input word to predict the output word.

For example, if our center word is “bank,” the training pairs are: (bank, to), (bank, the), (bank, and), (bank, I).

This process of creating pairs is repeated for every word in the sentence, and using the input word to predict the output word of every pair is what eventually generates the word embeddings.
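The pair-generation step can be sketched in a few lines. The function name `skipgram_pairs` and the simple regular-expression tokenizer are assumptions for illustration, not part of the original Word2Vec code.

```python
import re

def skipgram_pairs(tokens, window=2):
    """Pair every center word with each word within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "Yesterday, I went to the bank and I read the newspaper."
tokens = re.findall(r"[A-Za-z]+", sentence)  # drop punctuation, keep words
pairs = skipgram_pairs(tokens)

bank_pairs = [pair for pair in pairs if pair[0] == "bank"]
print(bank_pairs)
# [('bank', 'to'), ('bank', 'the'), ('bank', 'and'), ('bank', 'I')]
```

Words near the start or end of the sentence simply get fewer pairs, since part of their window falls outside the sentence.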

If you can imagine that we have millions of lines of text to train on, more common pairings will occur more frequently in these examples, which will make the model more likely to learn these combinations. Even in this one sentence we can see that (bank, the) / (newspaper, the) and (went, I) / (read, I) are similar pairs, as they follow the paradigms noun-article and verb-subject pronoun, respectively.

The next step is to transform these words into distributional vectors, as in the example with the animals, where the numbers carry meaning in relation to each other.

To do this, we go through each training pair we’ve created and apply the following procedure. We first initialize the input word’s embedding as a vector of random numbers. Then a series of mathematical functions are applied to the vector. The result of these operations is another vector that in turn represents a word, which we want to be the output word of our training pair. If the predicted word is not our output word, we adjust the numbers of the input word embedding slightly so the result of the operations looks closer to the vector of the output word.

In this example, our training pair is (bank, the). After applying the output functions to the word embedding for “bank” (in orange), we look at the predicted word vector (green) to see if it corresponds to the target word, “the.” Finally, we modify the orange embedding of “bank” so that the green predicted vector is closer to “the.”

We update these word embeddings to maximize the likelihood that, given an input word, the produced output word frequently appears as a context word in the data. Similar words will have similar contexts and therefore similar word embeddings.
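The update described above can be sketched for a single training pair. This is a simplified illustration, not the original implementation: it scores every word in a five-word vocabulary with a full softmax and adjusts both the input embedding and the output weights, whereas the Word2Vec papers rely on faster approximations such as hierarchical softmax and negative sampling. The vocabulary, embedding size, learning rate, and function names are all assumptions.

```python
import math
import random

random.seed(0)

vocab = ["to", "the", "bank", "and", "I"]
index = {word: i for i, word in enumerate(vocab)}
dim = 4  # toy embedding size; real models use hundreds of dimensions

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = random_matrix(len(vocab), dim)      # one input vector per word
output_weights = random_matrix(len(vocab), dim)  # one output vector per word

def predict(center):
    """Softmax probabilities over the vocabulary, given a center word."""
    v = embeddings[index[center]]
    scores = [sum(w_k * v_k for w_k, v_k in zip(w, v)) for w in output_weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(center, context, lr=0.1):
    """One gradient step on the cross-entropy loss for a (center, context) pair."""
    probs = predict(center)
    v = embeddings[index[center]]
    # Error signal: predicted probability minus the one-hot target.
    errors = [p - (1.0 if i == index[context] else 0.0) for i, p in enumerate(probs)]
    # Gradient for the input embedding, computed before the weights change.
    grad_v = [
        sum(errors[i] * output_weights[i][k] for i in range(len(vocab)))
        for k in range(dim)
    ]
    for i in range(len(vocab)):
        for k in range(dim):
            output_weights[i][k] -= lr * errors[i] * v[k]
    for k in range(dim):
        v[k] -= lr * grad_v[k]  # nudge the embedding of the center word

before = predict("bank")[index["the"]]
for _ in range(50):
    train_step("bank", "the")
after = predict("bank")[index["the"]]
print(f"P(the | bank): {before:.3f} -> {after:.3f}")
```

After repeated updates on the pair (bank, the), the model assigns a higher probability to “the” as a context word of “bank,” which is the behavior the text describes.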

To highlight the robustness of these models, here are some examples from the original paper introducing Word2Vec.

Relationship          Example 1            Example 2            Example 3
France: Paris         Italy: Rome          Japan: Tokyo         Florida: Tallahassee
Einstein: scientist   Messi: midfielder    Mozart: violinist    Picasso: painter
Microsoft: Ballmer    Google: Yahoo        IBM: McNealy         Apple: Jobs

Note: As it can be seen, accuracy is quite good, although there is clearly a lot of room for further improvements.

Through word embeddings, we can ask the model to perform analogies such as “France is to Paris as Italy is to ___?” While the computer still doesn’t understand, it’s as if it knows that Rome is the capital of Italy and that Picasso was a painter. Based on the distributional hypothesis, it would translate the name “Steve Jobs” as “Steve Jobs” in a tech article, but “jobs” as “emplois” in an economic report. It would detect that I didn’t tackle my friend to the ground; I merely ran into her at the bank.

Using this kind of word embedding technique, we have seen amazing results on different tasks, such as sentiment analysis, text generation, and, most importantly for Unbabel, machine translation.

Still, AI’s language problem isn’t solved — we are far from getting computers to reach true natural language understanding. After all, language is an inherently human activity, one that sets our intelligence apart. It takes a blend of abstraction and association, interaction with the physical world around us, and perhaps a bit of human ingenuity, something we have in abundance but AI lacks.

In the quest for truly intelligent machines, it’s hard to imagine any complex AI system that doesn’t have language at its core. But despite what some companies and headlines would have you believe, that isn’t going to happen anytime soon.

Sources:

Herbelot, A. Distributional semantics: a light introduction. https://aurelieherbelot.net/research/distributional-semantics-intro/.

McCormick, C. (2016, April 19). Word2Vec Tutorial – The Skip-Gram Model. Retrieved from http://www.mccormickml.com

Mikolov, T. et al. Efficient Estimation of Word Representations in Vector Space. 2013. https://arxiv.org/pdf/1301.3781.pdf.

Mikolov, T. et al. Distributed Representations of Words and Phrases and their Compositionality. 2013. https://arxiv.org/pdf/1310.4546.pdf.