Demystifying AI: How Artificial Intelligence Understands Our Language
Exploring the World of Word Embedding and Attention
How would you explain the concept of happiness to someone who speaks an entirely different language with which you have no words in common?
Unlike humans, machines do not understand words, emotions, or context innately. But luckily, they understand numbers.
In artificial intelligence, particularly machine learning and natural language processing (NLP), we need a way to translate our rich, complex language into a format that computers can process.
This is where the concept of Word Embedding comes in. Word Embedding is essentially the conversion of words into numbers, but not just any numbers – these are numbers that capture words' essence, context, and relationships.
Before we embark on exploring the principles of Word Embedding, don’t forget to check out the previous article in the Demystifying AI series if you haven’t yet:
Step 1: The Naive Approach
A straightforward approach we can employ to represent words for computer processing is One-Hot Encoding.
Imagine you have a very limited vocabulary — let’s say, five words: Apple, Banana, Lemon, Orange, and Strawberry. In one-hot encoding, each word is represented as a binary vector corresponding to its position in our vocabulary.
For example, “Apple” might be [1, 0, 0, 0, 0], “Banana” would be [0, 1, 0, 0, 0], and so on.
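To make this concrete, here is a tiny sketch of one-hot encoding in plain Python, using just the five example words above as the vocabulary:

```python
# A minimal one-hot encoding sketch for the five-word example vocabulary.
vocabulary = ["Apple", "Banana", "Lemon", "Orange", "Strawberry"]

def one_hot(word):
    # A vector of zeros with a single 1 at the word's position in the vocabulary.
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("Apple"))   # [1, 0, 0, 0, 0]
print(one_hot("Banana"))  # [0, 1, 0, 0, 0]
```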
While simple, this method has significant drawbacks:
Firstly, it treats each word as entirely independent from others, ignoring any form of relationship or similarity. This limitation becomes particularly problematic in machine learning tasks that depend on understanding context and relationships, such as language translation, sentiment analysis, or content recommendation.
Secondly, as the vocabulary grows, these vectors become excessively long and inefficient. The Oxford English Dictionary lists 171,476 English words in current use, which means each word would be represented by a vector containing 171,475 zeros and a single one.
Step 2: The Evolution into Word Embedding
Given the limitations of one-hot encoding, we need a more sophisticated technique: Word Embedding.
At the heart of word embedding lies the concept of vector spaces. In this space, each vector represents a word, and the geometry of the space reflects the meanings of words and the relationships between them.
The choice of the vector space’s dimensionality (typically between 50 and 300 dimensions in practice) balances capturing enough of a word’s meaning against computational efficiency. Higher dimensions can capture more subtle relationships but require more data and computational power.
Our goal is to position similar words closer together. For instance, “Apple” and “Banana” would have vectors that are more similar to each other than either is to the vector representing the word “Electricity.”
How do we measure the similarity between them?
We need to determine how closely their vectors are aligned in the vector space. We can do this by calculating the cosine of the angle between each pair of vectors, a method called Cosine Similarity. A cosine similarity of 1 means two vectors point in the same direction, 0 indicates no similarity (orthogonality), and -1 means they point in opposite directions.
Mathematical Foundations (Optional)
For our example, let’s represent three words (Apple, Banana, and Electricity) in 2-dimensional space, which makes plotting easier.
The equation for Cosine Similarity between two vectors A and B is:

cos(θ) = (A · B) / (||A|| × ||B||)

where A · B is the dot product of the vectors and ||A|| and ||B|| are their lengths (magnitudes).
Even though the expression looks slightly intimidating, it’s pretty easy if we make it more visual.
Let’s assign the following points (coordinates) to the words:
Apple: [25, 8]
Banana: [22, 12]
Electricity: [10, 20]
and their respective vectors start from the origin [0, 0]. Plugging these coordinates into the equation, we get (approximately):

Cosine Similarity (Apple, Banana) ≈ 0.98
Cosine Similarity (Apple, Electricity) ≈ 0.70
Cosine Similarity (Banana, Electricity) ≈ 0.82
As a result, we can see that the words “Apple” and “Banana” are more similar to each other than either is to “Electricity”, which also shows up on the graph as the smaller angle between their vectors.
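If you’d like to verify the numbers yourself, here is a quick sketch using NumPy (an assumption on my part; any linear-algebra library would do). The 2-D coordinates are the illustrative ones from the example above, not learned embeddings:

```python
# Verifying the example cosine similarities with NumPy.
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vectors' lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

apple = np.array([25, 8])
banana = np.array([22, 12])
electricity = np.array([10, 20])

print(round(cosine_similarity(apple, banana), 2))        # ~0.98
print(round(cosine_similarity(apple, electricity), 2))   # ~0.70
print(round(cosine_similarity(banana, electricity), 2))  # ~0.82
```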
But how did we get the vectors in the first place? We’ll look into that in the next section. Before we do, if you are enjoying the post, please spread the word and share the Enginuity publication with others :)
Step 3: Neural Networks and Word2Vec
This is where machine learning comes in. By analyzing vast amounts of text, algorithms like Word2Vec learn the context of words, how they are used, what words they appear next to, and so on.
Let’s examine how Word2Vec efficiently captures semantic relationships between words. It's built on two architecturally distinct models: Continuous Bag of Words (CBOW) and Skip-Gram. Each is suited to different tasks, but both work with the context of surrounding words.
Continuous Bag of Words (CBOW)
The CBOW model aims to predict a target word given a set of context words surrounding it. Think of it as trying to fill in the blanks in a sentence.
Given the sentence “The cat __ on the mat.”, the model aims to predict that the missing word is likely “sits” or “lies”.
While the CBOW neural network structure is not complex, especially compared to deep neural networks used in other areas of AI, it effectively captures complex relationships between words thanks to the large amount of contextual data it processes.
CBOW’s Hidden Layer is a fully connected layer with a predefined number of neurons, equal to the desired dimensionality of the word vectors. No activation function is applied in this layer, which makes the architecture more straightforward than deep learning networks. Training adjusts the network’s weights, and once training is done, the weights connecting the input layer to the hidden layer for a particular word serve as that word’s dense vector representation.
The Output Layer is a SoftMax layer predicting the likelihood of each word in the vocabulary being present in a given context. The output layer has as many neurons as words in the vocabulary and outputs a probability distribution over all words.
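To make the architecture tangible, here is a minimal CBOW sketch in PyTorch. This is an illustrative assumption on my part rather than the original Word2Vec implementation: an embedding layer holds the input-to-hidden weights, the context vectors are averaged, and a linear layer feeds the SoftMax over the vocabulary (applied inside the loss function):

```python
# A minimal CBOW sketch in PyTorch (illustrative, not the original Word2Vec code).
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100):
        super().__init__()
        # Input-to-hidden weights; each row becomes a word's dense vector after training.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Hidden-to-output weights; one score per word in the vocabulary.
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the surrounding words.
        hidden = self.embeddings(context_ids).mean(dim=1)  # average the context vectors
        return self.output(hidden)  # raw scores; SoftMax is applied inside the loss

# Usage sketch: predict the middle word from four context words.
model = CBOW(vocab_size=5000, embedding_dim=100)
loss_fn = nn.CrossEntropyLoss()            # applies SoftMax internally
context = torch.randint(0, 5000, (8, 4))   # a batch of 8 training examples
target = torch.randint(0, 5000, (8,))      # the words to predict
loss = loss_fn(model(context), target)
loss.backward()                            # gradients for one training step
```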
Skip-Gram Model
The Skip-Gram model flips CBOW on its head. Instead of predicting a word based on its context, it predicts the context, that is, the surrounding words, based on a single word. Given the word “sits”, it would aim to predict surrounding words such as “cat”, “on”, “the”, and “mat”.
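In practice, you rarely implement either model by hand. Here is a short usage sketch with the gensim library (assuming gensim 4.x; the two-sentence corpus is a toy, so the resulting vectors won’t be meaningful). The sg flag switches between CBOW and Skip-Gram:

```python
# Training a toy Word2Vec model with gensim (assumes gensim 4.x is installed).
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "lies", "on", "the", "rug"],
]

# sg=0 trains CBOW, sg=1 trains Skip-Gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]                # the learned 50-dimensional embedding
similar = model.wv.most_similar("cat")  # nearest words by cosine similarity
print(vector.shape, similar[:3])
```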
Step 4: The Introduction of Attention Mechanisms
While word embedding is a significant leap forward, it still has limitations. The challenge is that a word's meaning can change depending on the other words around it. The word “bank”, for example, can mean the edge of a river or a financial institution, depending on the context.
Attention is a more advanced concept that allows models to weigh the importance of different words in a sentence. Just as some words stand out more than others when you read a book because of their relevance, attention mechanisms enable AI to give more “attention” to certain words based on the context.
At its core, the attention mechanism involves three main components: Queries, Keys, and Values, typically represented as vectors:
Queries correspond to the element for which we are trying to compute the output, such as the word currently being generated in a translation.
Keys are associated with the elements from the input sequence, such as words in the original sentence that we want to translate.
Values are also associated with the input elements and contain the actual content from the input.
The attention mechanism then works in three steps (a small code sketch follows the list):
Compute a score for each input element by comparing the Query with every Key to determine its relevance.
Pass these scores through a SoftMax function to convert them into probabilities (attention weights).
Use the weights to determine how much attention each corresponding Value receives when producing the output, typically as a weighted sum of the Values.
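Here is a small sketch of these three steps as scaled dot-product attention in NumPy. The Query, Key, and Value matrices are random stand-ins for a trained model’s projections, and the scaling by the square root of the dimension follows the convention from the Transformer paper:

```python
# Scaled dot-product attention with NumPy; Q, K, V are random stand-ins.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # step 1: compare each Query with all Keys
    weights = softmax(scores)        # step 2: turn scores into probabilities
    return weights @ V, weights      # step 3: weighted sum of the Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))  # 3 output positions, dimension 8
K = rng.normal(size=(5, 8))  # 5 input positions
V = rng.normal(size=(5, 8))

output, weights = attention(Q, K, V)
print(weights.round(2))  # each row sums to 1: how strongly each input word is attended to
```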
The introduction of attention mechanisms has led to the development of models like the Transformer, which relies entirely on attention to process text. This has paved the way for highly successful language models like GPT (Generative Pre-trained Transformer) and others, making complicated tasks such as text translation, summarization, question-answering, and sentiment analysis possible.
Top picks this week
— A Deep Dive into Amazon DynamoDB Architecture
— Lightweight ADR
— The “10x engineer”: 50 years ago and now

Let me know
If you enjoyed reading this article (I know I really enjoyed writing it), let me know what other topics you’d like to see in the “Demystifying AI” series.
Also, don’t forget to share this post if you’ve found it helpful. I’ll really appreciate it :)