A deep learning mechanism called self-attention

Hiya 👋 I'm Pranav. I'm a recent computer science grad who loves punching keys, napping while coding and lifting weights. This space is a collection of my journey of active learning from blogs, books and papers.

Self-attention

Self-attention is a deep learning mechanism, used widely in natural language processing (NLP) and computer vision, that lets a model weigh the importance of different parts of the input when making predictions. It's the fundamental building block of the Transformer, the architecture behind most modern NLP models.

How self-attention works

Input Representation

An input sequence (e.g., a sentence in NLP) is first transformed into a sequence of vectors, with each element (e.g., a word or pixel) represented by one vector.
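As a rough illustration of this step, here is a minimal NumPy sketch that maps words to vectors with a lookup table. The vocabulary, the embedding size `d_model`, and the random table values are all made up for the example; in a real model the table is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding size, chosen for illustration.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4

# One row per vocabulary entry. Random here; learned in a real model.
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "cat", "sat"]
token_ids = np.array([vocab[t] for t in tokens])

# Indexing the table yields one vector per element of the sequence.
X = embedding_table[token_ids]
print(X.shape)  # (3, 4)
```

The result `X` has one row per word, which is the shape the rest of the mechanism works on.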

Query, Key, and Value

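For each element in the sequence, three vectors are derived: Query, Key, and Value. Each one is obtained by multiplying the element's input vector by a separate learned weight matrix.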

Query (Q)

This represents the element currently being processed. It's compared against the keys of all elements to find out how relevant each one is.

Key (K)

These vectors are what each element exposes for matching. A query is compared against every key, and a close match means a strong relationship between the two elements.

Value (V)

These vectors contain the content of each element. They are the actual information that is propagated through the network.
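To make the three roles concrete, here is a small NumPy sketch of how Query, Key, and Value vectors could be derived from the input vectors. The sizes and the names `W_q`, `W_k`, `W_v` are assumptions for the example, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 3 elements, input size 4, projection size 2.
seq_len, d_model, d_k = 3, 4, 2

# Stand-in for the input vectors (one row per element).
X = rng.normal(size=(seq_len, d_model))

# Three separate projection matrices. Random here; learned in a real model.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Every element gets its own query, key, and value vector.
Q = X @ W_q
K = X @ W_k
V = X @ W_v
print(Q.shape, K.shape, V.shape)  # (3, 2) (3, 2) (3, 2)
```

All three come from the same input `X`, which is what makes this *self*-attention.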

Scoring

For each element in the sequence, a score is computed with respect to every element in the sequence, including itself. Each score is the dot product of the Query vector of one element and the Key vector of another.
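A quick sketch of the scoring step, using small hand-picked Query and Key matrices (the numbers are invented for the example). A single matrix product computes every query-key dot product at once.

```python
import numpy as np

# Hypothetical queries and keys: one row per element, 3 elements in total.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

# scores[i, j] is the dot product of element i's query and element j's key.
scores = Q @ K.T
print(scores)
# [[1. 0. 1.]
#  [0. 2. 1.]
#  [1. 2. 2.]]
```

Row `i` of `scores` holds how strongly element `i` matches each element, so a sequence of length n gives an n x n matrix.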

Attention Weights

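The scores are then scaled (in the Transformer, divided by the square root of the Key vector dimension) and passed through a softmax function to produce attention weights. The weights for a given element are non-negative and sum to 1, and they indicate how much focus should be placed on each element when considering that element.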
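Here is a minimal sketch of turning scores into attention weights. It assumes the Transformer-style scaling by the square root of the key dimension `d_k`; the score values and `d_k = 2` are made up for the example, and `softmax` is a small helper written here, not a library function.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max first keeps exp() numerically stable.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical raw scores: row i holds element i's scores against all elements.
scores = np.array([[1.0, 0.0, 1.0],
                   [0.0, 2.0, 1.0],
                   [1.0, 2.0, 2.0]])
d_k = 2  # assumed key dimension

# Scale, then normalize each row into a probability distribution.
weights = softmax(scores / np.sqrt(d_k), axis=-1)
print(weights.round(3))
print(weights.sum(axis=-1))  # each row sums to 1 (up to floating-point error)
```

Higher scores get larger weights, but every element still receives some non-zero weight.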

Weighted Sum

The attention weights are used to take a weighted sum of the Value vectors of all elements. This results in one context vector per element, each of which is a weighted combination of all elements in the sequence.
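The weighted sum can be sketched as one more matrix product. The weights and Value vectors below are hand-picked for the example so the result is easy to check by eye.

```python
import numpy as np

# Hypothetical attention weights: one row per element, each row sums to 1.
weights = np.array([[0.5, 0.5, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.25, 0.25, 0.5]])

# Hypothetical value vectors: one row per element.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

# context[i] is the sum of all value vectors, weighted by row i of `weights`.
context = weights @ V
print(context)
# [[0.5  0.5 ]
#  [0.   1.  ]
#  [1.25 1.25]]
```

The second element puts all its weight on itself, so its context vector is just its own value vector, while the first element gets an even mix of the first two.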

Output

The context vector is then used as input to the subsequent layer or module in the network.

Multiple Heads

In practice, multiple sets of Query, Key, and Value projections are used in parallel (multiple "heads"). This allows each head to learn a different kind of relationship between elements, and the outputs of all heads are then combined.
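Putting it all together, here is a rough sketch of multi-head self-attention in NumPy. It assumes the original Transformer design, where head outputs are concatenated and passed through an output projection (called `W_o` here). The sizes, the helper names `softmax` and `attention_head`, and the random weights are all assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the max first keeps exp() numerically stable.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # One head: project, score, scale, softmax, weighted sum.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 3 elements, model size 8, split across 2 heads.
seq_len, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))

# Each head has its own projections (random here; learned in a real model).
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, W_q, W_k, W_v))

# Concatenate the heads and mix them with an output projection.
W_o = rng.normal(size=(n_heads * d_head, d_model))
output = np.concatenate(heads, axis=-1) @ W_o
print(output.shape)  # (3, 8)
```

The loop is written for readability. Real implementations usually batch all heads into a single tensor operation instead.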
