The Transformer architecture doesn’t process a sequence one token at a time; it processes every position in parallel, which lets models capture context more effectively than earlier sequential architectures.
Let’s see how this plays out with a simplified example. Imagine we have a sentence: "The cat sat on the mat." A traditional Recurrent Neural Network (RNN) would process this word by word, maintaining a hidden state that gets updated sequentially.
Input: "The" -> Hidden State 1
Hidden State 1 + "cat" -> Hidden State 2
Hidden State 2 + "sat" -> Hidden State 3
...and so on.
This sequential processing is slow, because each step has to wait for the previous one, and it struggles with long-range dependencies. If the sentence were "The cat, which was fluffy and white, sat on the mat," an RNN might have lost most of the information about "cat" by the time it reaches "mat."
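To make the sequential bottleneck concrete, here is a minimal sketch of a plain tanh RNN in NumPy. The function name `rnn_forward`, the weight names, and the random stand-in word vectors are all assumptions made for illustration, not part of any particular library.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a plain tanh RNN over a sequence, one step at a time."""
    hidden = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:  # step t cannot start until step t-1 has finished
        hidden = np.tanh(x_t @ W_xh + hidden @ W_hh + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))       # 6 stand-in word vectors of size 4
W_xh = rng.normal(size=(4, 3)) * 0.5   # input-to-hidden weights (hypothetical)
W_hh = rng.normal(size=(3, 3)) * 0.5   # hidden-to-hidden weights (hypothetical)
b_h = np.zeros(3)

hidden_states = rnn_forward(tokens, W_xh, W_hh, b_h)
print(hidden_states.shape)  # (6, 3): one hidden state per word
```

The `for` loop is the point: everything the model knows about earlier words has to be squeezed through the single `hidden` vector, and the loop cannot be parallelized across positions.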
The Transformer, however, uses a mechanism called Self-Attention. Instead of a sequential hidden state, it calculates an "attention score" for every word in relation to every other word in the input sequence, simultaneously.
Consider the word "sat" in our example. Self-attention allows "sat" to directly "look at" and weigh the importance of "cat" and "mat" (and all other words) when determining its own representation. This happens for every word in the sequence, all at once.
Here’s a peek at what the core of the Transformer, the multi-head self-attention layer, looks like conceptually. For each input word embedding (a vector representing the word), we derive three vectors: a Query (Q), a Key (K), and a Value (V).
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I actually carry?
The attention score between two words is the dot product of one word's Query vector with the other word's Key vector. The scores for a given Query are divided by the square root of the Key dimension and passed through a softmax function, which turns them into attention weights that sum to 1. These weights are then used to form a weighted sum of the Value vectors from all words.
# Simplified (and conceptual) attention calculation for a single word (word 1):
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))  # stand-in input embeddings: one row per word
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))  # learnable weight matrices (random stand-ins here)
Q_word1 = X[0] @ W_Q  # Query vector for word 1
K = X @ W_K  # Key vectors for every word, one per row
V = X @ W_V  # Value vectors for every word, one per row
# Calculate attention scores between word 1 and every word (including itself)
scores_word1 = K @ Q_word1 / np.sqrt(K.shape[-1])
# Apply softmax to get weights across all words for word1's query
exp_scores = np.exp(scores_word1 - scores_word1.max())
attention_weights_word1 = exp_scores / exp_scores.sum()
# Output representation for word1 is a weighted sum of all Value vectors
output_embedding_word1 = attention_weights_word1 @ V
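The per-word calculation above is usually written in matrix form so that every word's output is computed at once. The following is a minimal NumPy sketch of that idea; the helper names, sizes, and random stand-in embeddings and weights are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for all positions in one shot."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_words, n_words): every query against every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_words, d_model, d_k = 6, 8, 4          # illustrative sizes
X = rng.normal(size=(n_words, d_model))  # stand-in word embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape, weights.shape)  # (6, 4) (6, 6)
```

Row `i` of `weights` holds the attention that word `i` pays to every word in the sentence, which is why no loop over positions is needed.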
"Multi-head" means we do this Q, K, V projection and attention calculation multiple times in parallel with different learned weight matrices. Each "head" can focus on different aspects of the relationships between words. The outputs from all heads are then concatenated and linearly transformed.
This allows the model to capture various types of dependencies. One head might focus on subject-verb agreement, another on pronoun resolution, and yet another on semantic relatedness. The Transformer then stacks these self-attention layers, along with feed-forward networks, in an encoder-decoder structure (or in an encoder-only or decoder-only stack for specific tasks).
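The multi-head computation described above can be sketched in a few lines of NumPy. This version assumes the common arrangement where one large projection per Q, K, and V is split into `num_heads` equal slices; the function name, the output matrix `W_O`, and all sizes are hypothetical choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)                   # separate weights per head
    heads = weights @ V                                  # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return concat @ W_O                                  # final linear transform

rng = np.random.default_rng(0)
n_words, d_model, num_heads = 6, 8, 2
X = rng.normal(size=(n_words, d_model))  # stand-in word embeddings
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4))

mha_output = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(mha_output.shape)  # (6, 8)
```

Each head works in a smaller `d_head`-dimensional subspace, so the total cost stays close to that of a single full-width head while allowing the heads to learn different attention patterns.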
The "encoder" processes the input sequence, building rich contextual representations. The "decoder" uses these representations (and its own previously generated output) to produce the output sequence, often in a language translation or text generation task. Crucially, the decoder also uses "masked self-attention" to ensure it only attends to positions it has already generated, preventing it from "cheating" by looking ahead.
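Masked self-attention is typically implemented by setting the scores for future positions to negative infinity before the softmax, so they receive zero weight. Here is a minimal sketch assuming a square score matrix where row `i` is the query at position `i`; the function name is made up for this example.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores with every future position masked out."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)          # future positions get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)                                # exp(-inf) == 0
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # equal raw scores, so only the mask shapes the result
causal_weights = causal_attention_weights(scores)
print(np.round(causal_weights, 2))
# Row i spreads its weight evenly over positions 0..i and puts 0 on later ones.
```

Because the diagonal is never masked, every row has at least one finite score, so the softmax is always well defined.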
The positional encoding is a critical component. Self-attention on its own has no notion of word order: permuting the input words simply permutes the outputs in the same way. So we inject positional information into the input embeddings before they enter the first attention layer. In the original Transformer this is done by adding fixed sine and cosine values, computed at different frequencies from the position of each word.
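# Conceptual positional encoding addition
import numpy as np
sequence_length, embedding_dim = 6, 8  # example sizes; embedding_dim must be even here
input_embedding = np.zeros((sequence_length, embedding_dim))  # placeholder for real word embeddings
position = np.arange(sequence_length)[:, np.newaxis]  # column of positions 0..sequence_length-1
div_term = np.exp(np.arange(0, embedding_dim, 2) * -(np.log(10000.0) / embedding_dim))
pe = np.zeros((sequence_length, embedding_dim))
pe[:, 0::2] = np.sin(position * div_term)  # even dimensions use sine
pe[:, 1::2] = np.cos(position * div_term)  # odd dimensions use cosine
input_with_pos_encoding = input_embedding + pe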
This allows the model to learn to use word order as part of its understanding, even though the attention mechanism itself is order-agnostic. The genius is in how these position-aware embeddings are then processed by the attention layers, enabling them to distinguish between "The cat sat on the mat" and "The mat sat on the cat."
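A small experiment can illustrate the point. The sketch below (random stand-in embeddings and weights, hypothetical helper names) swaps the first and last word of a six-word input and compares the attention outputs with and without sinusoidal positional encodings.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def sinusoidal_pe(sequence_length, embedding_dim):
    position = np.arange(sequence_length)[:, np.newaxis]
    div_term = np.exp(np.arange(0, embedding_dim, 2) * -(np.log(10000.0) / embedding_dim))
    pe = np.zeros((sequence_length, embedding_dim))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))  # stand-in word embeddings
W_Q, W_K, W_V = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
perm = np.array([5, 1, 2, 3, 4, 0])  # swap the first and last word

# Without positions: reordering the words only reorders the outputs.
out = self_attention(X, W_Q, W_K, W_V)
out_swapped = self_attention(X[perm], W_Q, W_K, W_V)
same_without_pe = np.allclose(out[perm], out_swapped)

# With positions: each word now carries its slot, so the outputs change.
pe = sinusoidal_pe(n, d)
out_pe = self_attention(X + pe, W_Q, W_K, W_V)
out_pe_swapped = self_attention(X[perm] + pe, W_Q, W_K, W_V)
same_with_pe = np.allclose(out_pe[perm], out_pe_swapped)

print(same_without_pe, same_with_pe)
```

Without positional encodings, the swapped sentence produces exactly the same set of output vectors in a different order. With them, the same word yields a different representation depending on where it sits.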
The residual connections and layer normalization are also vital. They help stabilize training for deep networks by allowing gradients to flow more easily and keeping activations within a reasonable range, which mitigates the vanishing/exploding gradient problem common in very deep neural nets.
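As a rough sketch of how these two pieces fit around a sublayer, the code below uses the "post-norm" arrangement of the original Transformer, `LayerNorm(x + sublayer(x))`. Many later models apply the normalization before the sublayer instead. The function names and the toy linear sublayer are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_sublayer(x, sublayer, gamma, beta):
    """Post-norm wrapper: add the sublayer output to its input, then normalize."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8)) * 50.0       # deliberately large activations
W = rng.normal(size=(8, 8))              # toy stand-in for attention or a feed-forward net
gamma, beta = np.ones(8), np.zeros(8)    # learnable scale and shift, at their usual initial values

normed = residual_sublayer(x, lambda h: h @ W, gamma, beta)
print(normed.mean(axis=-1))  # each entry is approximately 0
print(normed.std(axis=-1))   # each entry is approximately 1
```

The `x + sublayer(x)` term gives gradients a direct path around the sublayer, while `layer_norm` rescales each position's features no matter how large the incoming activations were.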
The power of the Transformer lies in its ability to parallelize computation, its capacity for capturing long-range dependencies via self-attention, and its flexible, modular design that has spawned countless variations.
The next major hurdle is understanding how the feed-forward networks within each Transformer block contribute to the transformation of the attention-weighted representations.