The Transformer architecture has revolutionized Natural Language Processing, enabling breakthroughs in machine translation, question answering, summarization, large-scale language modeling, and countless other tasks. However, one of the most fundamental and often misunderstood components of Transformers is positional encoding. Without positional information, a Transformer cannot determine the order of words in a sequence, and order is critical to understanding meaning in human language.
In architectures like RNNs and LSTMs, sequential order is built into the nature of the model: they read input word-by-word. But Transformers process all tokens simultaneously using self-attention. While this parallelization offers huge computational advantages, it also removes the natural sense of sequence order. Positional encoding solves this problem by injecting word-position information into the input embeddings.
In this in-depth article, you will learn everything about positional encoding: why it is needed, how it works mathematically, how sinusoidal encodings differ from learned positional embeddings, how Keras and Keras-NLP implement them, and how positional encodings influence transformer performance. This guide aims to give you both theoretical clarity and practical implementation insight.
1. Why Transformers Need Positional Encoding
Transformers rely on self-attention, a mechanism that computes relationships between all tokens in parallel. Unlike RNNs, this mechanism doesn’t read sequences step-by-step, so the model has no inherent understanding of:
- which word came first
- which word came next
- how far apart words are
- relative order or structure
Without explicit positional information, the model will treat sequences like:
"cats chase dogs"
and:
"dogs chase cats"
as the same bag of tokens, which would be disastrous for language understanding.
1.1 The self-attention limitation
Self-attention computes relationships based solely on content (word embeddings), not position. That is:
Attention = f(Q, K, V)
Where:
- Q = Query
- K = Key
- V = Value
If two words appear in different positions but share similar embeddings, the model may treat them as interchangeable.
1.2 Importance of order in human language
Words form meaning through order:
- “He ate the cake” ≠ “The cake ate him”
- “Not bad” ≠ “Bad, not”
- “I only saw her yesterday” ≠ “Only I saw her yesterday”
Order affects syntax, semantics, emphasis, and interpretation.
Without positional encoding, the Transformer loses this essential information.
2. What Is Positional Encoding?
Positional encoding is a method to add sequence order information to the input embeddings. It is essentially a vector that represents the position of each token in a sequence.
Transformers add this vector to the word embeddings:
InputEmbedding + PositionalEncoding = FinalEmbedding
This combined embedding provides both:
- semantic meaning (from the word)
- positional meaning (from the encoding)
Thus, the model learns relationships between words while also understanding their order.
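As a minimal TensorFlow sketch of this sum (the shapes below are toy values chosen only for illustration):
import tensorflow as tf

# Toy shapes: batch of 2 sequences, 4 tokens each, 8-dimensional embeddings
word_embeddings = tf.random.normal((2, 4, 8))      # semantic content
positional_encoding = tf.random.normal((1, 4, 8))  # placeholder: one vector per position (real encodings come from Sections 4-5)

# Element-wise sum; the position signal broadcasts across the batch
final_embeddings = word_embeddings + positional_encoding
print(final_embeddings.shape)  # (2, 4, 8)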
3. Types of Positional Encodings
There are two major types:
3.1 Sinusoidal Positional Encoding (original Transformer)
Proposed in Attention is All You Need (Vaswani et al., 2017), these are fixed and deterministic.
3.2 Learnable Positional Embeddings
The model learns position vectors during training, similar to word embeddings.
We explore each in depth.
4. Sinusoidal Positional Encoding (Fixed Position Embeddings)
This is the original method used in the first Transformer model.
4.1 Mathematical Formula
For each position pos and each dimension index i:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
- pos = token position in the sequence
- i = dimension index
- d_model = embedding size (e.g., 512)
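To make the formula concrete, here is a small NumPy sketch; the sinusoidal_pe helper and the example values are illustrative, not part of any library:
import numpy as np

def sinusoidal_pe(pos, d_model=512):
    # Positional-encoding vector for a single position, following the formula above
    pe = np.zeros(d_model)
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        pe[2 * i] = np.sin(angle)      # even dimension 2i: sine
        pe[2 * i + 1] = np.cos(angle)  # odd dimension 2i+1: cosine
    return pe

print(sinusoidal_pe(pos=3, d_model=8))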
4.2 Why use sine and cosine?
Sinusoidal functions provide:
- smooth, continuous variation
- easy extrapolation to longer sequences
- unique encoding for each position
- relative position information
- multi-scale frequency representation
The encoding captures both fine-grained and long-range order relationships.
4.3 Properties
- Deterministic: no training is required; the values are fixed.
- Infinite generalization: works for any sequence length, unlike learned embeddings.
- Relative structure: the model can infer relative distances between tokens.
- Smoothness: adjacent positions have gradually changing values.
5. Learnable Positional Embeddings
Instead of using fixed functions, many modern Transformers learn positional vectors.
5.1 How they work
A trainable matrix of shape (max_position, embedding_dim) is created.
Each row corresponds to a specific sequence position.
These embeddings are optimized during training.
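A minimal Keras sketch of such a lookup table; max_position and embedding_dim are illustrative values:
import tensorflow as tf

max_position = 128   # illustrative maximum sequence length
embedding_dim = 256  # illustrative embedding size

# Trainable lookup table of shape (max_position, embedding_dim)
position_embedding = tf.keras.layers.Embedding(
    input_dim=max_position,
    output_dim=embedding_dim,
)

positions = tf.range(start=0, limit=64)           # positions 0..63 for a 64-token sequence
position_vectors = position_embedding(positions)  # shape (64, 256), optimized during training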
5.2 Advantages
- Often achieve better performance
- Adapt to dataset-specific patterns
- Learn task-relevant position structures
5.3 Disadvantages
- Cannot generalize to unseen longer sequences
- Require training data diversity
5.4 Transformers that use learned positional embeddings
- BERT
- GPT-family
- RoBERTa
- DistilBERT
- ELECTRA
Many widely used pretrained NLP models follow this approach, although newer large language models increasingly adopt rotary or relative schemes (see Section 17).
6. Relative Positional Encoding
Relative encodings model distance between tokens, rather than absolute position.
Examples:
- Transformer-XL
- DeBERTa
- T5
Advantages:
- better long-range understanding
- easier generalization
- length-invariant reasoning
Relative encodings often outperform absolute encodings in tasks requiring long sequences.
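The common starting point of these schemes is the matrix of pairwise offsets between positions; how that matrix becomes attention biases or embeddings differs per model. A minimal TensorFlow sketch of the offset matrix only:
import tensorflow as tf

seq_len = 5  # illustrative
positions = tf.range(seq_len)

# Entry (i, j) is j - i: the signed distance from token i to token j
relative_positions = positions[tf.newaxis, :] - positions[:, tf.newaxis]
print(relative_positions.numpy())
# First row: [0 1 2 3 4]; second row: [-1 0 1 2 3]; and so on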
7. Keras and Keras-NLP: How Positional Encodings Are Implemented
Keras makes positional encoding simple using a variety of layers.
7.1 Token + Position Embedding Layer (Keras-NLP)
import keras_nlp

# Illustrative values; set these to match your tokenizer and data
vocab_size = 20000
max_len = 128

embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,  # size of the token vocabulary
    sequence_length=max_len,     # maximum sequence length
    embedding_dim=256,           # dimensionality of both embeddings
)
This layer automatically provides:
- token embeddings
- positional embeddings
- addition of both
It works similarly to architectures like BERT.
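A brief usage sketch; the random token IDs are placeholders for real tokenizer output:
import tensorflow as tf

# Placeholder batch of 2 sequences of token IDs; in practice these come from a tokenizer
token_ids = tf.random.uniform((2, max_len), maxval=vocab_size, dtype=tf.int32)
embedded = embedding_layer(token_ids)  # token embedding + position embedding, summed
print(embedded.shape)  # (2, max_len, 256)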
7.2 Custom Sinusoidal Positional Encoding in Keras
import tensorflow as tf
from tensorflow.keras.layers import Layer

class SinusoidalPositionEncoding(Layer):
    """Adds fixed sinusoidal position encodings to the input embeddings."""

    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim

    def call(self, inputs):
        # Sequence length is read from the input, so any length is supported
        length = tf.shape(inputs)[1]
        position = tf.cast(tf.range(length), tf.float32)[:, tf.newaxis]  # (length, 1)
        i = tf.range(self.embed_dim, dtype=tf.float32)[tf.newaxis, :]    # (1, embed_dim)

        # 1 / 10000^(2i / d_model): each pair of dimensions shares one frequency
        angle_rates = 1 / tf.pow(10000.0, (2 * (i // 2)) / tf.cast(self.embed_dim, tf.float32))
        angle_rads = position * angle_rates  # (length, embed_dim)

        # Sine on even-indexed dimensions, cosine on odd-indexed dimensions
        sines = tf.sin(angle_rads[:, 0::2])
        cosines = tf.cos(angle_rads[:, 1::2])
        pos_encoding = tf.concat([sines, cosines], axis=-1)

        # Broadcasts over the batch dimension: (batch, length, embed_dim)
        return inputs + pos_encoding
This follows the sinusoidal scheme from the original Transformer paper; the only cosmetic difference is that the sine and cosine halves are concatenated rather than interleaved.
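A quick usage sketch of this layer (the shapes are illustrative):
# Illustrative shapes: batch of 2 sequences, 10 tokens, 128-dimensional embeddings
embed_dim = 128
token_embeddings = tf.random.normal((2, 10, embed_dim))

pos_layer = SinusoidalPositionEncoding(embed_dim)
encoded = pos_layer(token_embeddings)  # sinusoidal encodings added to every sequence
print(encoded.shape)  # (2, 10, 128)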
8. How Positional Encodings Affect Model Performance
8.1 Without positional encodings
Because self-attention is permutation-invariant, a Transformer behaves like a bag-of-words model and cannot distinguish different orderings of the same tokens.
8.2 With positional encodings
The model:
- understands grammar
- maintains long-range coherence
- interprets word order
- disambiguates sentence structure
Performance improvements are large and immediate.
8.3 Impact on training stability
Positional encodings help the model converge faster by providing immediate structure.
9. Sinusoidal vs Learned Positional Encodings: A Detailed Comparison
| Aspect | Sinusoidal | Learned |
|---|---|---|
| Training | No training needed | Learned with gradient descent |
| Generalization | Excellent for long sequences | Poor for long unseen sequences |
| Flexibility | Hard-coded pattern | High adaptability |
| Popularity | Used in early Transformers | Used in modern models |
| Performance | Good | Often better |
Both types remain relevant, depending on application.
10. Positional Encoding in Modern Transformer Architectures
10.1 BERT and GPT
Use learned positional embeddings.
10.2 T5
Uses relative positional encoding.
10.3 Transformer-XL
Introduced recurrence + relative encoding.
10.4 DeBERTa
Uses disentangled attention that represents content and position separately, which delivered state-of-the-art results on several language-understanding benchmarks at release.
10.5 ViT (Vision Transformer)
Uses learned positional embeddings for patches in images.
11. How Position Information Combines With Self-Attention
Transformers use attention weights across tokens:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
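As a minimal TensorFlow sketch of this formula (single head, no masking; the function name and shapes are illustrative):
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # softmax(QK^T / sqrt(d_k)) V, exactly as in the formula above
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)

# Illustrative shapes: batch of 1, 4 tokens, 8-dimensional queries/keys/values
q = k = v = tf.random.normal((1, 4, 8))
out = scaled_dot_product_attention(q, k, v)  # (1, 4, 8)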
If Q and K had no positional information:
- word positions are indistinguishable
- attention collapses
- predictions become meaningless
Positional encodings ensure that:
- the model learns to attend based on both content and location
- word order influences attention patterns
- structural dependencies emerge naturally
12. Visualizing Positional Encodings
Visualization reveals interesting insights.
12.1 Sinusoidal patterns
Sinusoidal encodings produce smooth, periodic waves.
When plotted:
- even dimensions (2i) follow sine waves
- odd dimensions (2i+1) follow cosine waves
- frequencies decrease (wavelengths grow) as the dimension index increases
This creates a unique signature for each position.
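A small plotting sketch of this pattern, assuming NumPy and matplotlib are available (positions run along the x-axis, embedding dimensions along the y-axis):
import numpy as np
import matplotlib.pyplot as plt

# Build a (positions x dimensions) sinusoidal encoding matrix for plotting
positions, d_model = 100, 64
pos = np.arange(positions)[:, np.newaxis]
i = np.arange(d_model)[np.newaxis, :]
angle_rads = pos / np.power(10000, (2 * (i // 2)) / d_model)
angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # even dimensions
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # odd dimensions

plt.pcolormesh(angle_rads.T, cmap="RdBu")  # rows = dimensions, columns = positions
plt.xlabel("Position")
plt.ylabel("Dimension")
plt.colorbar()
plt.show()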
12.2 Learned positional embeddings
Visualizations show:
- clusters
- discontinuities
- task-specific structures
Unlike sinusoidal patterns, these embeddings lack predictable shape but adapt well to the data.
13. Positional Encoding in Keras Attention Models
When building custom attention layers, positional encoding is essential:
x = TokenEmbedding(input_tokens)
x = x + PositionEmbedding(positions)
x = MultiHeadAttention()(x, x)
Keras Transformer models typically combine token and position embeddings in exactly this way before applying attention.
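A runnable sketch of this pattern, using keras_nlp's TokenAndPositionEmbedding (which performs the token-plus-position sum internally) together with Keras MultiHeadAttention; the hyperparameters are illustrative:
import tensorflow as tf
import keras_nlp

# Illustrative hyperparameters
vocab_size, max_len, embed_dim = 20000, 128, 256

inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_len,
    embedding_dim=embed_dim,
)(inputs)  # token + position embeddings, summed
x = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=embed_dim // 4)(x, x)
model = tf.keras.Model(inputs, x)
model.summary()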
14. Challenges and Limitations
14.1 Maximum sequence length
Positional embeddings usually require specifying a max length.
14.2 Poor adaptation to new lengths (learned embeddings)
Models like BERT cannot handle longer sequences without re-training embeddings.
14.3 Memory footprint
Relative encodings can be more expensive.
15. Best Practices for Using Positional Encodings in Keras
- Use TokenAndPositionEmbedding for standard NLP tasks.
- Prefer learned embeddings for fine-tuning tasks like classification.
- Prefer relative positional encodings for long sequences.
- For text generation tasks, consider models like GPT-2 that use learned absolute encodings.
- For long-range dependencies, choose Transformer-XL or DeBERTa style encodings.
16. Real-World Applications of Positional Encodings
16.1 Machine Translation
Maintaining word order is essential for accurate translation.
16.2 Text Summarization
Summaries depend on sequence flow.
16.3 Chatbots and Dialogue Systems
Positional encoding helps maintain coherence and turn-taking structure.
16.4 Document Classification
Understanding sentence structure improves accuracy.
16.5 Audio and Time-Series Transformers
Time steps act like positions.
16.6 Vision Transformers
Patch positions guide spatial reasoning.
17. Future Directions of Positional Encoding Research
17.1 Relative attention dominance
More models are moving towards relative positional encoding.
17.2 Rotary Position Embeddings (RoPE)
Used in GPT-NeoX and LLaMA; improves extrapolation.
17.3 ALiBi positional bias
Allows arbitrarily long sequences without retraining.
17.4 2D positional encodings for images
Extending to computer vision applications.
Positional encoding continues to evolve as Transformers expand across domains.