Natural Language Processing, or NLP for short, is all the rage in artificial intelligence nowadays. Everyone wants to do it, but very few really know what it entails. One of the first things to come to grips with is that computers don’t use words; they communicate in numbers, specifically 0’s and 1’s. So if computers can only speak in 0’s and 1’s, how in the world do you turn words into numbers? Let’s talk about it…
Text is everywhere and we want to use the latest machine learning techniques to understand, analyze, manipulate, and even generate more of it. But first, we’ve got to get that text into a format a computer can deal with. That means turning it into a series of numbers, something called a vector. Then, once each word is represented by a vector, the computer can do what it really excels at… crunching numbers at very high speed to make something useful.
The first thing we need to do with our text is tokenize it. That’s just fancy speak for breaking it down into its core parts, namely words. Then we need to do some house-cleaning before we even think about turning those words into numbers. We need to strip out punctuation and any odd special characters; make everything the same case (lower is preferred); filter out words that add little value but a lot of noise (what we call stop words in the business); and finally bring words back to their root form using techniques such as stemming and lemmatization. That way words like “am”, “are”, and “is” all collapse to the same root: “be”.
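Here’s a minimal sketch of that cleanup step in plain Python. The tiny stop-word list and the `preprocess` helper are just illustrative placeholders; in practice you’d probably reach for a library like NLTK or spaCy, which also handle the stemming and lemmatization piece.

```python
import re

# A tiny illustrative stop-word list; real lists (e.g., NLTK's) are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "am", "and", "or", "to", "of", "in"}

def preprocess(text):
    """Tokenize and clean text: lowercase it, strip punctuation, drop stop words."""
    text = text.lower()                    # make everything the same case
    tokens = re.findall(r"[a-z']+", text)  # keep only word-like chunks
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat is sitting on the mat, and the dog is barking!"))
# ['cat', 'sitting', 'on', 'mat', 'dog', 'barking']
```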
Now that we’ve got a clean corpus, or collection, of words to work with, we can start turning them into numbers. The naive approach is a simple letter-number substitution cipher: assign a number to each of the 26 letters in the alphabet and turn each word into a vector based on its letters’ positions. “Hello” goes from h-e-l-l-o to 8-5-12-12-15. Unfortunately, all context and interpretability go out the window and you’re left with a lot of seemingly random digits.
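For what it’s worth, that naive substitution fits in a couple of lines; the helper name here is just for illustration.

```python
def letters_to_numbers(word):
    """Map each letter to its position in the alphabet (a=1 ... z=26)."""
    return [ord(ch) - ord("a") + 1 for ch in word.lower() if ch.isalpha()]

print(letters_to_numbers("Hello"))  # [8, 5, 12, 12, 15]
```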
Instead of going the letter-by-letter route, let’s do some counting. The “bag-of-words” idea is based on measuring the presence (or absence) of a word in a certain piece of text. We can do this with a simple yes-or-no (the word either is or isn’t there), or we can actually count how many times it appears. Believe it or not, this simple approach produces pretty startling results. Think about it: words have meaning, and their presence (or absence) influences the meaning of the document they came from. An e-mail with words like “congratulations” or “Nigerian prince” is probably spam, while one with words like “corporate balance sheet” or “analysis” probably is not. Oh, and no one said you have to look at just one word at a time. You could look at pairs, triplets, or n-sized groups of words (called “n-grams”). By accounting for adjacent words we’re starting to move beyond individual meaning and into the realm of context.
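One way to sketch a bag-of-words with n-grams is scikit-learn’s CountVectorizer (assuming scikit-learn is installed; the two toy documents below are made up for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "congratulations you have won a prize from a nigerian prince",
    "please review the corporate balance sheet analysis before the meeting",
]

# Unigrams and bigrams (n-grams with n = 1 and 2); each document becomes
# a vector of raw counts over the shared vocabulary.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out()[:5])  # a few vocabulary entries
print(counts.toarray()[0])                     # count vector for the first doc
```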
Speaking of context, why not count the relative frequency of words rather than their absolute frequency? You see, the problem with raw counts is that some words are simply used more than others. When was the last time you used the word juxtaposition? But if a particular word shows up in a particular document far more often than it does across documents in general, well, then you have some pretty powerful information. Dare I say, the juxtaposition of such a unique word among other common words will grant you more knowledge about the content of the document it was found in. Welcome to the notion of Term Frequency-Inverse Document Frequency, or TF-IDF. Using this technique, we weigh how frequently a word appears in a document against how frequently it appears across all the documents we’re looking at. TF-IDF is a powerful method that’s still used in search engine scoring, text summarization, and document cleaning, just to name a few.
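Here’s a rough sketch of that weighting using scikit-learn’s TfidfVectorizer, again on a handful of made-up documents. The point to notice is the inverse-document-frequency part: a word that appears in every document earns a low weight, while a word confined to one document earns a high one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the statistician reported the median of the sample",
    "the median of the highway separates oncoming traffic",
    "the quarterly report covers the corporate balance sheet",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # one TF-IDF vector per document

# Inverse document frequency: "the" (in every doc) scores lowest,
# "median" (in two docs) sits in between, and "statistician"
# (in just one doc) scores highest.
for word in ["the", "median", "statistician"]:
    idx = vectorizer.vocabulary_[word]
    print(word, round(vectorizer.idf_[idx], 3))
```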
TF-IDF starts to give us some context around the numbers we’re turning our words into, but the granddaddy of all these contextual methods involves a little more math and the magic of neural networks. Back in 2013, Tomas Mikolov and his team at Google came up with the groundbreaking idea of turning every word into a vector whose value is based on the words that tend to appear around it. Their word2vec algorithm revolutionized Natural Language Processing and paved the way for all kinds of other breakthroughs; voice assistants like Siri would be far less capable without this line of work.
The essence of word2vec is the idea that words appearing in similar contexts tend to have similar meanings. Thanks to the incredible increase in computing power and the availability of huge samples of written text, word2vec had all the data and processing horsepower it needed to build a very good numerical representation of the bulk of the English language. And just like that, all those words were turned into vectors that actually meant something and carried context. Now that each word was a vector, the computer could use those vectors in mathematical expressions. For example, you could use the vectors to build analogies such as “man is to woman as boy is to… girl”, or “Yen is to Japan as Ruble is to… Russia”. The original word2vec model was built around a simple neural network with a single hidden layer, but since then all kinds of improvements have been made that can tease apart nuances in words that even humans have trouble with. Consider the word median. In the context of driving, it’s the strip dividing the center of a highway. But its meaning is something different from a statistical standpoint, where it represents the value with half the data below it and half above it. Both mean “middle”, but in very different contexts.
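You can try the analogy trick yourself with gensim’s downloader (assuming gensim is installed and can fetch a pretrained model). I’m using a small pretrained GloVe model here because it’s a lightweight download; it’s a closely related embedding technique, and the original word2vec Google News vectors work the same way, just with a much larger file. Exact results depend on the model you load.

```python
import gensim.downloader as api

# Downloads a small set of pretrained word vectors (~66 MB) on first use.
vectors = api.load("glove-wiki-gigaword-50")

# "man is to woman as boy is to ...?"
# Computed as: vector(boy) - vector(man) + vector(woman), then find the nearest word.
print(vectors.most_similar(positive=["woman", "boy"], negative=["man"], topn=1))

# Plain similarity also works: nearby vectors mean related words.
print(vectors.similarity("japan", "yen"))
```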
So, turning words into numbers isn’t that mysterious after all. And now that you’ve turned those words into something a computer can work with, we can really rock and roll. We can summarize huge documents, categorize articles into topics, or even generate new text of our own. Oh, the places we could go… as always… happy analyzing.