Positional Encoding in Transformers

The Transformer architecture has revolutionized Natural Language Processing, enabling breakthroughs in machine translation, question answering, summarization, large-scale language modeling, and countless other tasks. However, one of the most fundamental and often misunderstood components of Transformers is positional encoding. Without positional information, a Transformer cannot determine the order of words in a sequence, and order is critical to understanding meaning in human language.

In architectures like RNNs and LSTMs, sequential order is built into the nature of the model: they read input word-by-word. But Transformers process all tokens simultaneously using self-attention. While this parallelization offers huge computational advantages, it also removes the natural sense of sequence order. Positional encoding solves this problem by injecting word-position information into the input embeddings.

In this in-depth article, you will learn everything about positional encoding: why it is needed, how it works mathematically, how sinusoidal encodings differ from learned positional embeddings, how Keras and Keras-NLP implement them, and how positional encodings influence transformer performance. This guide aims to give you both theoretical clarity and practical implementation insight.

1. Why Transformers Need Positional Encoding

Transformers rely on self-attention, a mechanism that computes relationships between all tokens in parallel. Unlike RNNs, this mechanism doesn’t read sequences step-by-step, so the model has no inherent understanding of:

  • which word came first
  • which word came next
  • how far apart words are
  • relative order or structure

Without explicit positional information, the model will treat sequences like:

"cats chase dogs"

and:

"dogs chase cats"

as the same bag of tokens, which would be disastrous for language understanding.

1.1 The self-attention limitation

Self-attention computes relationships based solely on content (word embeddings), not position. That is:

Attention = f(Q, K, V)

Where:

  • Q = Query
  • K = Key
  • V = Value

If the same word appears at two different positions, it receives exactly the same attention treatment; more generally, permuting the input tokens simply permutes the outputs, as the small sketch below demonstrates.
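To make this concrete, here is a small, hypothetical demonstration (toy tensors, untrained weights) that self-attention without positional information is permutation-equivariant: shuffling the input tokens merely shuffles the output rows.

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)

tokens = tf.random.normal((1, 4, 16))        # (batch, seq_len, embed_dim), toy values
perm = [2, 0, 3, 1]                          # an arbitrary reordering of the 4 tokens
shuffled = tf.gather(tokens, perm, axis=1)

out_original = mha(tokens, tokens)           # query = value = the token embeddings
out_shuffled = mha(shuffled, shuffled)

# The difference is ~0: the shuffled output is just the original output, reordered.
print(tf.reduce_max(tf.abs(tf.gather(out_original, perm, axis=1) - out_shuffled)))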

1.2 Importance of order in human language

Words form meaning through order:

  • “He ate the cake” ≠ “The cake ate him”
  • “Not bad” ≠ “Bad, not”
  • “I only saw her yesterday” ≠ “Only I saw her yesterday”

Order affects syntax, semantics, emphasis, and interpretation.
Without positional encoding, the Transformer loses this essential information.


2. What Is Positional Encoding?

Positional encoding is a method to add sequence order information to the input embeddings. It is essentially a vector that represents the position of each token in a sequence.

Transformers add this vector to the word embeddings:

InputEmbedding + PositionalEncoding = FinalEmbedding

This combined embedding provides both:

  • semantic meaning (from the word)
  • positional meaning (from the encoding)

Thus, the model learns relationships between words while also understanding their order.
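As a minimal, purely illustrative sketch (toy shapes, random values), the combination is just an element-wise sum of two matrices of the same shape:

import tensorflow as tf

seq_len, embed_dim = 8, 16
word_embeddings = tf.random.normal((seq_len, embed_dim))      # semantic meaning
positional_encoding = tf.random.normal((seq_len, embed_dim))  # positional meaning

final_embeddings = word_embeddings + positional_encoding      # shape unchanged: (8, 16)

Because the sum leaves the embedding dimension unchanged, the rest of the Transformer needs no modification to accept position-aware inputs.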


3. Types of Positional Encodings

There are two major types:

3.1 Sinusoidal Positional Encoding (original Transformer)

Proposed in Attention is All You Need (Vaswani et al., 2017), these are fixed and deterministic.

3.2 Learnable Positional Embeddings

The model learns position vectors during training, similar to word embeddings.

We explore each in depth.


4. Sinusoidal Positional Encoding (Fixed Position Embeddings)

This is the original method used in the first Transformer model.

4.1 Mathematical Formula

For each position pos and each dimension index i:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

  • pos = token position in the sequence
  • i = dimension index
  • d_model = embedding size (e.g., 512)
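To make the formula concrete, here is a small NumPy sketch (names and sizes are illustrative) that fills a (seq_len, d_model) matrix exactly as written above, with sine in the even dimensions and cosine in the odd ones:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    position = np.arange(seq_len)[:, np.newaxis]                      # (seq_len, 1)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position / div_term)                         # PE(pos, 2i)
    pe[:, 1::2] = np.cos(position / div_term)                         # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)   # (50, 512)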

4.2 Why use sine and cosine?

Sinusoidal functions provide:

  • smooth, continuous variation
  • easy extrapolation to longer sequences
  • unique encoding for each position
  • relative position information
  • multi-scale frequency representation

The encoding captures both fine-grained and long-range order relationships.

4.3 Properties

  1. Deterministic
    No training is required; they are fixed.
  2. Length generalization
    The encoding can be computed for any sequence length, unlike learned embeddings (although model quality may still degrade far beyond the lengths seen in training).
  3. Relative structure
    The model can infer relative distances between tokens (see the identity after this list).
  4. Smoothness
    Adjacent positions have gradually changing values.
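The relative-structure property follows from the angle-addition identities. For a frequency ω = 1 / 10000^(2i/d_model) and a fixed offset k:

sin(ω·(pos + k)) = sin(ω·pos)·cos(ω·k) + cos(ω·pos)·sin(ω·k)
cos(ω·(pos + k)) = cos(ω·pos)·cos(ω·k) − sin(ω·pos)·sin(ω·k)

So PE(pos + k) is a fixed linear transformation of PE(pos) that depends only on the offset k, not on the absolute position. This is the property the original paper points to as making it easy for the model to attend by relative position.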

5. Learnable Positional Embeddings

Instead of using fixed functions, many modern Transformers learn positional vectors.

5.1 How they work

A trainable matrix is created:

position_embedding_matrix: shape (max_position, embedding_dim)

Each row corresponds to a specific sequence position.

These embeddings are optimized during training.
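A minimal sketch of such a layer in Keras, with illustrative names (this is not the exact implementation used by any particular model):

import tensorflow as tf

class LearnedPositionEmbedding(tf.keras.layers.Layer):
    def __init__(self, max_position, embedding_dim):
        super().__init__()
        # One trainable vector per position index, updated by gradient descent.
        self.position_embeddings = tf.keras.layers.Embedding(max_position, embedding_dim)

    def call(self, inputs):
        # inputs: (batch, seq_len, embedding_dim); seq_len must not exceed max_position
        positions = tf.range(start=0, limit=tf.shape(inputs)[1], delta=1)
        return inputs + self.position_embeddings(positions)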

5.2 Advantages

  • Often achieve better performance
  • Adapt to dataset-specific patterns
  • Learn task-relevant position structures

5.3 Disadvantages

  • Cannot generalize to unseen longer sequences
  • Require training data diversity

5.4 Transformers that use learned positional embeddings

  • BERT
  • GPT-family
  • RoBERTa
  • DistilBERT
  • ELECTRA

Many widely used pretrained NLP models rely on learned embeddings, although newer large language models increasingly adopt rotary or relative schemes (see Section 17).


6. Relative Positional Encoding

Relative encodings model distance between tokens, rather than absolute position.

Examples:

  • Transformer-XL
  • DeBERTa
  • T5

Advantages:

  • better long-range understanding
  • easier generalization
  • length-invariant reasoning

Relative encodings often outperform absolute encodings in tasks requiring long sequences.
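As a rough illustration (not the exact formulation used by any of the models above), a learned relative-position bias can be added directly to the attention logits before the softmax; the bias is indexed by the clipped distance between query and key positions:

import tensorflow as tf

class RelativePositionBias(tf.keras.layers.Layer):
    """Adds one learned scalar bias per clipped relative distance to attention scores."""

    def __init__(self, max_distance=128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = self.add_weight(
            name="relative_bias", shape=(2 * max_distance + 1,), initializer="zeros")

    def call(self, scores):
        # scores: (batch, num_heads, q_len, k_len) attention logits
        q_len, k_len = tf.shape(scores)[-2], tf.shape(scores)[-1]
        rel = tf.range(k_len)[tf.newaxis, :] - tf.range(q_len)[:, tf.newaxis]
        rel = tf.clip_by_value(rel, -self.max_distance, self.max_distance)
        return scores + tf.gather(self.bias, rel + self.max_distance)

T5 additionally buckets distances logarithmically and learns a separate bias per attention head; this sketch collapses both details for brevity.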


7. Keras and Keras-NLP: How Positional Encodings Are Implemented

Keras makes positional encoding simple using a variety of layers.

7.1 Token + Position Embedding Layer (Keras-NLP)

import keras_nlp

embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_len,
    embedding_dim=256,
)

This layer automatically provides:

  • token embeddings
  • positional embeddings
  • addition of both

This mirrors the token-plus-position embedding scheme used in BERT-style architectures.
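Because vocab_size and max_len are placeholders above, here is a fully self-contained version of the same idea with illustrative values, plus a shape check:

import keras_nlp
import tensorflow as tf

vocab_size, max_len = 20000, 128   # illustrative sizes

embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_len,
    embedding_dim=256,
)

token_ids = tf.random.uniform((2, max_len), maxval=vocab_size, dtype=tf.int32)
print(embedding_layer(token_ids).shape)   # (2, 128, 256)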


7.2 Custom Sinusoidal Positional Encoding in Keras

import tensorflow as tf
from tensorflow.keras.layers import Layer

class SinusoidalPositionEncoding(Layer):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim

    def call(self, inputs):
        # inputs: (batch, seq_len, embed_dim)
        length = tf.shape(inputs)[1]
        position = tf.range(length, dtype=tf.float32)[:, tf.newaxis]   # (seq_len, 1)
        i = tf.range(self.embed_dim, dtype=tf.float32)[tf.newaxis, :]  # (1, embed_dim)
        # 1 / 10000^(2i / d_model); pairing via i // 2 gives each sin/cos pair the same frequency
        angle_rates = 1.0 / tf.pow(10000.0, (2 * (i // 2)) / tf.cast(self.embed_dim, tf.float32))
        angle_rads = position * angle_rates
        sines = tf.sin(angle_rads[:, 0::2])
        cosines = tf.cos(angle_rads[:, 1::2])
        pos_encoding = tf.concat([sines, cosines], axis=-1)            # (seq_len, embed_dim)
        return inputs + pos_encoding                                   # broadcast over the batch

This follows the sinusoidal scheme of the original Transformer paper, with one presentational difference: the sine and cosine channels are concatenated rather than interleaved, a common and functionally equivalent convention.


8. How Positional Encodings Affect Model Performance

8.1 Without positional encodings

Without positional encodings, self-attention is permutation-equivariant, so a Transformer behaves essentially like a bag-of-words model: reordering the input tokens does not change what it can learn about the sentence.

8.2 With positional encodings

The model:

  • understands grammar
  • maintains long-range coherence
  • interprets word order
  • disambiguates sentence structure

Performance improvements are large and immediate.

8.3 Impact on training stability

Positional encodings give the model an ordering signal from the very first training step, which typically helps it converge faster than if it had to infer order indirectly.


9. Sinusoidal vs Learned Positional Encodings: A Detailed Comparison

Aspect         | Sinusoidal                   | Learned
Training       | No training needed           | Learned with gradient descent
Generalization | Excellent for long sequences | Poor for long unseen sequences
Flexibility    | Hard-coded pattern           | High adaptability
Popularity     | Used in early Transformers   | Used in modern models
Performance    | Good                         | Often better

Both types remain relevant, depending on application.


10. Positional Encoding in Modern Transformer Architectures

10.1 BERT and GPT

Use learned positional embeddings.

10.2 T5

Uses relative positional encoding.

10.3 Transformer-XL

Introduced recurrence + relative encoding.

10.4 DeBERTa

Uses disentangled attention that represents content and position separately, which delivered state-of-the-art results on several NLU benchmarks at its release.

10.5 ViT (Vision Transformer)

Uses learned positional embeddings for patches in images.


11. How Position Information Combines With Self-Attention

Transformers use attention weights across tokens:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

If Q and K carried no positional information:

  • word positions would be indistinguishable
  • attention weights would depend only on token content
  • order-dependent predictions would become unreliable

Positional encodings ensure that:

  • the model learns to attend based on both content and location
  • word order influences attention patterns
  • structural dependencies emerge naturally
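To see exactly where position enters the computation, here is a minimal single-head sketch (no masking, toy tensors): Q, K, and V are projections of embeddings that already contain the positional signal, so the QKᵀ scores reflect both content and location.

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)   # QK^T / sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)

# x stands in for position-aware embeddings (token embedding + positional encoding).
x = tf.random.normal((1, 6, 32))
output = scaled_dot_product_attention(x, x, x)
print(output.shape)   # (1, 6, 32)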

12. Visualizing Positional Encodings

Visualization reveals interesting insights.

12.1 Sinusoidal patterns

Sinusoidal encodings produce smooth, periodic waves across positions.

When plotted:

  • even dimensions (2i) hold sine waves
  • odd dimensions (2i+1) hold cosine waves
  • wavelengths grow geometrically with the dimension index, so higher dimensions vary more slowly

This creates a unique signature for each position.
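A short matplotlib sketch (illustrative sizes) that renders the full encoding matrix as a heatmap, where each row is the signature of one position:

import numpy as np
import matplotlib.pyplot as plt

seq_len, d_model = 100, 128
position = np.arange(seq_len)[:, np.newaxis]
div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)

pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(position / div_term)   # even dimensions
pe[:, 1::2] = np.cos(position / div_term)   # odd dimensions

plt.pcolormesh(pe, cmap="RdBu")             # rows: positions, columns: dimensions
plt.xlabel("embedding dimension")
plt.ylabel("position")
plt.colorbar()
plt.show()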

12.2 Learned positional embeddings

Visualizations show:

  • clusters
  • discontinuities
  • task-specific structures

Unlike sinusoidal patterns, these embeddings lack predictable shape but adapt well to the data.


13. Positional Encoding in Keras Attention Models

When building custom attention layers, positional encoding is essential:

x = TokenEmbedding(input_tokens)
x = x + PositionEmbedding(positions)
x = MultiHeadAttention()(x, x)

Transformer models built in Keras typically combine token and position embeddings in this way before applying attention.
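Expanding the pseudocode above into a runnable sketch (all layer sizes are illustrative, and the single block below omits the feed-forward sublayer for brevity):

import tensorflow as tf
import keras_nlp

vocab_size, max_len, embed_dim, num_heads = 20000, 128, 256, 4   # illustrative sizes

inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_len,
    embedding_dim=embed_dim,
)(inputs)                                              # token + position embeddings
attention_output = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=embed_dim // num_heads
)(x, x)                                                # self-attention over position-aware inputs
x = tf.keras.layers.LayerNormalization()(x + attention_output)   # residual connection + norm
outputs = tf.keras.layers.GlobalAveragePooling1D()(x)
model = tf.keras.Model(inputs, outputs)
model.summary()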


14. Challenges and Limitations

14.1 Maximum sequence length

Positional embeddings usually require specifying a max length.

14.2 Poor adaptation to new lengths (learned embeddings)

Models like BERT cannot handle sequences longer than the maximum they were trained with (512 tokens for BERT) without extending or re-training the position embeddings.

14.3 Memory footprint

Relative encodings can add computation and memory overhead, because attention must incorporate pairwise position terms.


15. Best Practices for Using Positional Encodings in Keras

  • Use TokenAndPositionEmbedding for standard NLP tasks.
  • Prefer learned embeddings for fine-tuning tasks like classification.
  • Prefer relative positional encodings for long sequences.
  • For text generation tasks, consider models like GPT-2 that use learned absolute encodings.
  • For long-range dependencies, choose Transformer-XL or DeBERTa style encodings.

16. Real-World Applications of Positional Encodings

16.1 Machine Translation

Maintaining word order is essential for accurate translation.

16.2 Text Summarization

Summaries depend on sequence flow.

16.3 Chatbots and Dialogue Systems

Positional encoding helps maintain coherence and turn-taking structure.

16.4 Document Classification

Understanding sentence structure improves accuracy.

16.5 Audio and Time-Series Transformers

Time steps act like positions.

16.6 Vision Transformers

Patch positions guide spatial reasoning.


17. Future Directions of Positional Encoding Research

17.1 Relative attention dominance

More models are moving towards relative positional encoding.

17.2 Rotary Position Embeddings (RoPE)

Used in GPT-NeoX and LLaMA; RoPE rotates query and key vectors by position-dependent angles, so relative position enters the attention scores directly.

17.3 ALiBi positional bias

ALiBi adds a linear, distance-proportional penalty to attention scores instead of using position embeddings, which lets models extrapolate to sequences longer than those seen in training.

17.4 2D positional encodings for images

Extending to computer vision applications.

Positional encoding continues to evolve as Transformers expand across domains.

