This notebook is a code implementation of https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention.

In [12]:
import torch
torch.__version__

'2.6.0+cu124'

# Embedding an Input Sentence
For simplicity, our dictionary (named `vocabulary` in the code below) is restricted to the words that occur in the input sentence. In a real-world application, we would consider all words in the training dataset (typical vocabulary sizes range from 30k to 50k entries).



In [13]:
sentence = "Life is short eat dessert first"
vocabulary = {s:i for i, s in enumerate(sorted(sentence.split()))}
vocabulary

{'Life': 0, 'dessert': 1, 'eat': 2, 'first': 3, 'is': 4, 'short': 5}

In [14]:
sentence_integer_encoding = torch.tensor([vocabulary[word] for word in sentence.split()])
sentence_integer_encoding

tensor([0, 4, 5, 2, 1, 3])
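
As a quick sanity check on the encoding above, the mapping can be inverted to recover the sentence from its indices. This is a minimal sketch in plain Python; `index_to_word` is a hypothetical helper that is not part of the original notebook. Note that Python's default string sort places uppercase letters before lowercase ones, which is why 'Life' receives index 0.

```python
sentence = "Life is short eat dessert first"

# Same word-to-index mapping as in the cell above.
vocabulary = {s: i for i, s in enumerate(sorted(sentence.split()))}

# Hypothetical helper (not in the original notebook): index-to-word lookup.
index_to_word = {i: s for s, i in vocabulary.items()}

# Encode the sentence, then decode it again.
token_ids = [vocabulary[word] for word in sentence.split()]
decoded = " ".join(index_to_word[i] for i in token_ids)

print(token_ids)  # [0, 4, 5, 2, 1, 3]
print(decoded)    # Life is short eat dessert first
```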

Now, using the integer-vector representation of the input sentence, we can use an embedding layer to encode the inputs into real-valued vector embeddings. Here, we will use a tiny 3-dimensional embedding such that each input word is represented by a 3-dimensional vector.

Note that embedding sizes typically range from hundreds to thousands of dimensions. For instance, the 7B Llama 2 model uses an embedding size of 4,096. We use 3-dimensional embeddings here purely for illustration, which lets us examine the individual vectors without filling the entire page with numbers.

Since the sentence consists of 6 words, this will result in a 6×3-dimensional embedding:

In [15]:
vocab_size = 50000  # sized like a realistic vocabulary; only indices 0-5 are used here
torch.manual_seed(123)
embed = torch.nn.Embedding(vocab_size, 3)
embedded_sentence = embed(sentence_integer_encoding).detach()
embedded_sentence.shape, embedded_sentence

(torch.Size([6, 3]),
 tensor([[ 0.3374, -0.1778, -0.3035],
 [ 0.1794, 1.8951, 0.4954],
 [ 0.2692, -0.0770, -1.0205],
 [-0.2196, -0.3792, 0.7671],
 [-0.5880, 0.3486, 0.6603],
 [-1.1925, 0.6984, -1.4097]]))
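
It may help to see that an embedding layer is essentially a lookup table: each index selects one row of the layer's weight matrix. The sketch below illustrates this under an assumption that differs from the notebook, namely a hypothetical vocabulary of only 10 entries to keep the table small, so the numbers will not match the output above.

```python
import torch

torch.manual_seed(123)

# Assumption: a small 10-entry vocabulary instead of the notebook's 50000.
embed = torch.nn.Embedding(10, 3)
token_ids = torch.tensor([0, 4, 5, 2, 1, 3])

# Calling the layer and indexing its weight matrix return the same rows.
via_layer = embed(token_ids).detach()
via_indexing = embed.weight[token_ids].detach()

print(via_layer.shape)                         # torch.Size([6, 3])
print(torch.equal(via_layer, via_indexing))    # True
```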

# Defining the Weight Matrices
Now, let's discuss the widely used self-attention mechanism known as scaled dot-product attention, which is an integral part of the transformer architecture.

Self-attention uses three weight matrices, referred to as $W_q$, $W_k$, and $W_v$, which are adjusted as model parameters during training. These matrices project the inputs into the query, key, and value components of the sequence, respectively.

The respective query, key, and value sequences are obtained via matrix multiplication between the weight matrices $W$ and the embedded inputs $x$, where $i$ is the position of a token in the input sequence of length $T$:

Query sequence: $q^{(i)} = x^{(i)} W_q$ for $i = 1, \dots, T$

Key sequence: $k^{(i)} = x^{(i)} W_k$ for $i = 1, \dots, T$

Value sequence: $v^{(i)} = x^{(i)} W_v$ for $i = 1, \dots, T$

Here, both $q^{(i)}$ and $k^{(i)}$ are vectors of dimension $d_k$. The projection matrices $W_q$ and $W_k$ have shape $d \times d_k$, while $W_v$ has shape $d \times d_v$.

(It's important to note that $d$ represents the size of each word vector, $x$.)

Since we are computing the dot product between the query and key vectors, these two vectors have to contain the same number of elements ($d_q = d_k$). In many LLMs, we use the same size for the value vectors such that $d_q = d_k = d_v$. However, the number of elements in the value vector $v^{(i)}$, which determines the size of the resulting context vector, can be arbitrary.

So, for the following code walkthrough, we will set $d_q = d_k = 2$ and use $d_v = 4$, initializing the projection matrices as follows:



In [16]:
torch.manual_seed(123)

d = embedded_sentence.shape[1]
d_q, d_k, d_v = 2, 2, 4

W_query = torch.nn.Parameter(torch.rand(d, d_q))
W_key = torch.nn.Parameter(torch.rand(d, d_k))
W_value = torch.nn.Parameter(torch.rand(d, d_v))
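
The size constraints described above (queries and keys must match, while the value size is free) can be checked directly. The following is a minimal sketch that uses random stand-in tensors rather than the notebook's projected embeddings, so the values themselves carry no meaning; only the shapes matter.

```python
import torch

torch.manual_seed(123)
d_q, d_k, d_v = 2, 2, 4
T = 6  # sequence length, as in the example sentence

# Assumption: random stand-ins for already projected vectors.
query = torch.rand(d_q)
keys = torch.rand(T, d_k)
values = torch.rand(T, d_v)

# Query and keys share the same size, so every dot product is defined.
scores = keys @ query
print(scores.shape)  # torch.Size([6])

# A key of a different size cannot be compared with the query.
try:
    query.dot(torch.rand(d_k + 1))
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True
print(mismatch_raised)  # True

# d_v is free: a weighted sum of the value vectors has d_v elements.
weights = torch.softmax(scores, dim=0)
context = weights @ values
print(context.shape)  # torch.Size([4])
```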

# Computing the Unnormalized Attention Weights
Now, suppose we want to compute the attention vector for the second input element. The second input element acts as the query here:


In [17]:
x_2 = embedded_sentence[1]
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

print(query_2.shape)
print(key_2.shape)
print(value_2.shape)

torch.Size([2])
torch.Size([2])
torch.Size([4])


In [18]:
keys = embedded_sentence @ W_key
values = embedded_sentence @ W_value
print(keys.shape, values.shape)

torch.Size([6, 2]) torch.Size([6, 4])


With the keys and values for all six inputs in place, along with the query for the second input, we can move on to computing the unnormalized attention weights. The query remains the second element throughout. First, we compute the unnormalized attention weight between this query and the 5th input element (index 4 in the code), which indicates how much attention element 2 should pay to element 5.

In [20]:
omega_24 = query_2.dot(keys[4])
print(omega_24)

tensor(1.2903, grad_fn=<DotBackward0>)


In [None]:
# Transpose keys so the matrix product yields one dot product per key.
# The result holds the unnormalized attention scores (logits) for query 2.
omega_2 = query_2 @ keys.T
print(omega_2)

tensor([-0.6004, 3.4707, -1.5023, 0.4991, 1.2903, -1.3374],
 grad_fn=<SqueezeBackward4>)
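
The computation above handles a single query. The same matrix product extends to all queries at once, giving a T×T matrix of unnormalized scores in which row i holds the scores for query i. The sketch below assumes a random stand-in `X` in place of `embedded_sentence`, so its numbers differ from the notebook's output.

```python
import torch

torch.manual_seed(123)
T, d, d_k = 6, 3, 2

# Assumption: random stand-in for the embedded sentence and the weights.
X = torch.randn(T, d)
W_query = torch.rand(d, d_k)
W_key = torch.rand(d, d_k)

queries = X @ W_query  # one query per input element, shape (T, d_k)
keys = X @ W_key       # one key per input element, shape (T, d_k)

# omega[i, j] is the dot product between query i and key j.
omega = queries @ keys.T
print(omega.shape)  # torch.Size([6, 6])

# Row 1 matches the single-query computation for the second element.
omega_2 = queries[1] @ keys.T
print(torch.allclose(omega[1], omega_2, atol=1e-6))  # True
```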


In [24]:
import torch.nn.functional as F
# Scale the logits by 1 / sqrt(d_k) so their magnitude does not grow with the key
# dimension, which keeps the softmax from saturating. The softmax then turns the
# scaled logits into attention weights that sum to 1.
attention_weights_2 = F.softmax(omega_2 / d_k ** 0.5, dim=0)
print(attention_weights_2)

tensor([0.0386, 0.6870, 0.0204, 0.0840, 0.1470, 0.0229],
 grad_fn=<SoftmaxBackward0>)
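
To see what the scaling does in practice, the softmax can be applied to the same logits with and without the 1 / sqrt(d_k) factor. This sketch copies the logits from the printed output above (rounded to four decimals), so the results agree with the notebook only up to rounding.

```python
import torch
import torch.nn.functional as F

# Logits copied from the printed omega_2 output above.
omega_2 = torch.tensor([-0.6004, 3.4707, -1.5023, 0.4991, 1.2903, -1.3374])
d_k = 2

unscaled = F.softmax(omega_2, dim=0)
scaled = F.softmax(omega_2 / d_k ** 0.5, dim=0)

print(unscaled)
print(scaled)

# Both are valid probability distributions over the six inputs.
print(unscaled.sum(), scaled.sum())
```

Without scaling, the largest weight is noticeably closer to 1, meaning the distribution is more sharply peaked on a single input. With d_k = 2 the effect is modest; it becomes more pronounced as d_k grows.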
