Positional Encoding in Transformers

The Transformer architecture has revolutionized Natural Language Processing, enabling breakthroughs in machine translation, question answering, summarization, large-scale language modeling, and countless other tasks. However, one of the most fundamental and often misunderstood components of Transformers is positional encoding. Without positional information, a Transformer cannot determine the order of words in a sequence, and order is critical to understanding meaning in human language.

In architectures like RNNs and LSTMs, sequential order is built into the nature of the model: they read input word-by-word. But Transformers process all tokens simultaneously using self-attention. While this parallelization offers huge computational advantages, it also removes the natural sense of sequence order. Positional encoding solves this problem by injecting word-position information into the input embeddings.

In this in-depth article, you will learn everything about positional encoding: why it is needed, how it works mathematically, how sinusoidal encodings differ from learned positional embeddings, how Keras and Keras-NLP implement them, and how positional encodings influence transformer performance. This guide aims to give you both theoretical clarity and practical implementation insight.

1. Why Transformers Need Positional Encoding

Transformers rely on self-attention, a mechanism that computes relationships between all tokens in parallel. Unlike RNNs, this mechanism doesn’t read sequences step-by-step, so the model has no inherent understanding of:

  • which word came first
  • which word came next
  • how far apart words are
  • relative order or structure

Without explicit positional information, the model will treat sequences like:

"cats chase dogs"

and:

"dogs chase cats"

as the same bag of tokens, which would be disastrous for language understanding.

1.1 The self-attention limitation

Self-attention computes relationships based solely on content (word embeddings), not position. That is:

Attention = f(Q, K, V)

Where:

  • Q = Query
  • K = Key
  • V = Value

If the same word appears at two different positions, it receives exactly the same attention treatment; more generally, permuting the input tokens simply permutes the outputs, as the small sketch below demonstrates.
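To make this concrete, here is a small, hypothetical demonstration (toy tensors, untrained weights) that self-attention without positional information is permutation-equivariant: shuffling the input tokens merely shuffles the output rows.

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)

tokens = tf.random.normal((1, 4, 16))        # (batch, seq_len, embed_dim), toy values
perm = [2, 0, 3, 1]                          # an arbitrary reordering of the 4 tokens
shuffled = tf.gather(tokens, perm, axis=1)

out_original = mha(tokens, tokens)           # query = value = the token embeddings
out_shuffled = mha(shuffled, shuffled)

# The difference is ~0: the shuffled output is just the original output, reordered.
print(tf.reduce_max(tf.abs(tf.gather(out_original, perm, axis=1) - out_shuffled)))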

1.2 Importance of order in human language

Words form meaning through order:

  • “He ate the cake” ≠ “The cake ate him”
  • “Not bad” ≠ “Bad, not”
  • “I only saw her yesterday” ≠ “Only I saw her yesterday”

Order affects syntax, semantics, emphasis, and interpretation.
Without positional encoding, the Transformer loses this essential information.


2. What Is Positional Encoding?

Positional encoding is a method to add sequence order information to the input embeddings. It is essentially a vector that represents the position of each token in a sequence.

Transformers add this vector to the word embeddings:

InputEmbedding + PositionalEncoding = FinalEmbedding

This combined embedding provides both:

  • semantic meaning (from the word)
  • positional meaning (from the encoding)

Thus, the model learns relationships between words while also understanding their order.
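As a minimal, purely illustrative sketch (toy shapes, random values), the combination is just an element-wise sum of two matrices of the same shape:

import tensorflow as tf

seq_len, embed_dim = 8, 16
word_embeddings = tf.random.normal((seq_len, embed_dim))      # semantic meaning
positional_encoding = tf.random.normal((seq_len, embed_dim))  # positional meaning

final_embeddings = word_embeddings + positional_encoding      # shape unchanged: (8, 16)

Because the sum leaves the embedding dimension unchanged, the rest of the Transformer needs no modification to accept position-aware inputs.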


3. Types of Positional Encodings

There are two major types:

3.1 Sinusoidal Positional Encoding (original Transformer)

Proposed in Attention is All You Need (Vaswani et al., 2017), these are fixed and deterministic.

3.2 Learnable Positional Embeddings

The model learns position vectors during training, similar to word embeddings.

We explore each in depth.


4. Sinusoidal Positional Encoding (Fixed Position Embeddings)

This is the original method used in the first Transformer model.

4.1 Mathematical Formula

For each position pos and each dimension index i:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

  • pos = token position in the sequence
  • i = dimension index
  • d_model = embedding size (e.g., 512)
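To make the formula concrete, here is a small NumPy sketch (names and sizes are illustrative) that fills a (seq_len, d_model) matrix exactly as written above, with sine in the even dimensions and cosine in the odd ones:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    position = np.arange(seq_len)[:, np.newaxis]                      # (seq_len, 1)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position / div_term)                         # PE(pos, 2i)
    pe[:, 1::2] = np.cos(position / div_term)                         # PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)   # (50, 512)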

4.2 Why use sine and cosine?

Sinusoidal functions provide:

  • smooth, continuous variation
  • easy extrapolation to longer sequences
  • unique encoding for each position
  • relative position information
  • multi-scale frequency representation

The encoding captures both fine-grained and long-range order relationships.

4.3 Properties

  1. Deterministic
    No training is required; they are fixed.
  2. Length generalization
    The encoding can be computed for any sequence length, unlike learned embeddings (although model quality may still degrade far beyond the lengths seen in training).
  3. Relative structure
    The model can infer relative distances between tokens (see the identity after this list).
  4. Smoothness
    Adjacent positions have gradually changing values.
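The relative-structure property follows from the angle-addition identities. For a frequency ω = 1 / 10000^(2i/d_model) and a fixed offset k:

sin(ω·(pos + k)) = sin(ω·pos)·cos(ω·k) + cos(ω·pos)·sin(ω·k)
cos(ω·(pos + k)) = cos(ω·pos)·cos(ω·k) − sin(ω·pos)·sin(ω·k)

So PE(pos + k) is a fixed linear transformation of PE(pos) that depends only on the offset k, not on the absolute position. This is the property the original paper points to as making it easy for the model to attend by relative position.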

5. Learnable Positional Embeddings

Instead of using fixed functions, many modern Transformers learn positional vectors.

5.1 How they work

A trainable matrix is created:

position_embedding_matrix: shape (max_position, embedding_dim)

Each row corresponds to a specific sequence position.

These embeddings are optimized during training.
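A minimal sketch of such a layer in Keras, with illustrative names (this is not the exact implementation used by any particular model):

import tensorflow as tf

class LearnedPositionEmbedding(tf.keras.layers.Layer):
    def __init__(self, max_position, embedding_dim):
        super().__init__()
        # One trainable vector per position index, updated by gradient descent.
        self.position_embeddings = tf.keras.layers.Embedding(max_position, embedding_dim)

    def call(self, inputs):
        # inputs: (batch, seq_len, embedding_dim); seq_len must not exceed max_position
        positions = tf.range(start=0, limit=tf.shape(inputs)[1], delta=1)
        return inputs + self.position_embeddings(positions)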

5.2 Advantages

  • Often achieve better performance
  • Adapt to dataset-specific patterns
  • Learn task-relevant position structures

5.3 Disadvantages

  • Cannot generalize to unseen longer sequences
  • Require training data diversity

5.4 Transformers that use learned positional embeddings

  • BERT
  • GPT-family
  • RoBERTa
  • DistilBERT
  • ELECTRA

Many widely used pretrained NLP models rely on learned embeddings, although newer large language models increasingly adopt rotary or relative schemes (see Section 17).


6. Relative Positional Encoding

Relative encodings model distance between tokens, rather than absolute position.

Examples:

  • Transformer-XL
  • DeBERTa
  • T5

Advantages:

  • better long-range understanding
  • easier generalization
  • length-invariant reasoning

Relative encodings often outperform absolute encodings in tasks requiring long sequences.
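As a rough illustration (not the exact formulation used by any of the models above), a learned relative-position bias can be added directly to the attention logits before the softmax; the bias is indexed by the clipped distance between query and key positions:

import tensorflow as tf

class RelativePositionBias(tf.keras.layers.Layer):
    """Adds one learned scalar bias per clipped relative distance to attention scores."""

    def __init__(self, max_distance=128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = self.add_weight(
            name="relative_bias", shape=(2 * max_distance + 1,), initializer="zeros")

    def call(self, scores):
        # scores: (batch, num_heads, q_len, k_len) attention logits
        q_len, k_len = tf.shape(scores)[-2], tf.shape(scores)[-1]
        rel = tf.range(k_len)[tf.newaxis, :] - tf.range(q_len)[:, tf.newaxis]
        rel = tf.clip_by_value(rel, -self.max_distance, self.max_distance)
        return scores + tf.gather(self.bias, rel + self.max_distance)

T5 additionally buckets distances logarithmically and learns a separate bias per attention head; this sketch collapses both details for brevity.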


7. Keras and Keras-NLP: How Positional Encodings Are Implemented

Keras makes positional encoding simple using a variety of layers.

7.1 Token + Position Embedding Layer (Keras-NLP)

import keras_nlp

embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_len,
    embedding_dim=256,
)

This layer automatically provides:

  • token embeddings
  • positional embeddings
  • addition of both

This mirrors the token-plus-position embedding scheme used in BERT-style architectures.
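Because vocab_size and max_len are placeholders above, here is a fully self-contained version of the same idea with illustrative values, plus a shape check:

import keras_nlp
import tensorflow as tf

vocab_size, max_len = 20000, 128   # illustrative sizes

embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_len,
    embedding_dim=256,
)

token_ids = tf.random.uniform((2, max_len), maxval=vocab_size, dtype=tf.int32)
print(embedding_layer(token_ids).shape)   # (2, 128, 256)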


7.2 Custom Sinusoidal Positional Encoding in Keras

import tensorflow as tf
from tensorflow.keras.layers import Layer

class SinusoidalPositionEncoding(Layer):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim

    def call(self, inputs):
        # inputs: (batch, seq_len, embed_dim)
        length = tf.shape(inputs)[1]
        position = tf.range(length, dtype=tf.float32)[:, tf.newaxis]   # (seq_len, 1)
        i = tf.range(self.embed_dim, dtype=tf.float32)[tf.newaxis, :]  # (1, embed_dim)
        # 1 / 10000^(2i / d_model); pairing via i // 2 gives each sin/cos pair the same frequency
        angle_rates = 1.0 / tf.pow(10000.0, (2 * (i // 2)) / tf.cast(self.embed_dim, tf.float32))
        angle_rads = position * angle_rates
        sines = tf.sin(angle_rads[:, 0::2])
        cosines = tf.cos(angle_rads[:, 1::2])
        pos_encoding = tf.concat([sines, cosines], axis=-1)            # (seq_len, embed_dim)
        return inputs + pos_encoding                                   # broadcast over the batch

This follows the sinusoidal scheme of the original Transformer paper, with one presentational difference: the sine and cosine channels are concatenated rather than interleaved, a common and functionally equivalent convention.


8. How Positional Encodings Affect Model Performance

8.1 Without positional encodings

Without positional encodings, self-attention is permutation-equivariant, so a Transformer behaves essentially like a bag-of-words model: reordering the input tokens does not change what it can learn about the sentence.

8.2 With positional encodings

The model:

  • understands grammar
  • maintains long-range coherence
  • interprets word order
  • disambiguates sentence structure

Performance improvements are large and immediate.

8.3 Impact on training stability

Positional encodings give the model an ordering signal from the very first training step, which typically helps it converge faster than if it had to infer order indirectly.


9. Sinusoidal vs Learned Positional Encodings: A Detailed Comparison

Aspect         | Sinusoidal                   | Learned
Training       | No training needed           | Learned with gradient descent
Generalization | Excellent for long sequences | Poor for long unseen sequences
Flexibility    | Hard-coded pattern           | High adaptability
Popularity     | Used in early Transformers   | Used in modern models
Performance    | Good                         | Often better

Both types remain relevant, depending on application.


10. Positional Encoding in Modern Transformer Architectures

10.1 BERT and GPT

Use learned positional embeddings.

10.2 T5

Uses relative positional encoding.

10.3 Transformer-XL

Introduced recurrence + relative encoding.

10.4 DeBERTa

Uses disentangled attention that represents content and position separately, which delivered state-of-the-art results on several NLU benchmarks at its release.

10.5 ViT (Vision Transformer)

Uses learned positional embeddings for patches in images.


11. How Position Information Combines With Self-Attention

Transformers use attention weights across tokens:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

If Q and K carried no positional information:

  • word positions would be indistinguishable
  • attention weights would depend only on token content
  • order-dependent predictions would become unreliable

Positional encodings ensure that:

  • the model learns to attend based on both content and location
  • word order influences attention patterns
  • structural dependencies emerge naturally
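To see exactly where position enters the computation, here is a minimal single-head sketch (no masking, toy tensors): Q, K, and V are projections of embeddings that already contain the positional signal, so the QKᵀ scores reflect both content and location.

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)   # QK^T / sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)

# x stands in for position-aware embeddings (token embedding + positional encoding).
x = tf.random.normal((1, 6, 32))
output = scaled_dot_product_attention(x, x, x)
print(output.shape)   # (1, 6, 32)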

12. Visualizing Positional Encodings

Visualization reveals interesting insights.

12.1 Sinusoidal patterns

Sinusoidal encodings produce smooth, periodic waves across positions.

When plotted:

  • even dimensions (2i) hold sine waves
  • odd dimensions (2i+1) hold cosine waves
  • wavelengths grow geometrically with the dimension index, so higher dimensions vary more slowly

This creates a unique signature for each position.
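A short matplotlib sketch (illustrative sizes) that renders the full encoding matrix as a heatmap, where each row is the signature of one position:

import numpy as np
import matplotlib.pyplot as plt

seq_len, d_model = 100, 128
position = np.arange(seq_len)[:, np.newaxis]
div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)

pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(position / div_term)   # even dimensions
pe[:, 1::2] = np.cos(position / div_term)   # odd dimensions

plt.pcolormesh(pe, cmap="RdBu")             # rows: positions, columns: dimensions
plt.xlabel("embedding dimension")
plt.ylabel("position")
plt.colorbar()
plt.show()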

12.2 Learned positional embeddings

Visualizations show:

  • clusters
  • discontinuities
  • task-specific structures

Unlike sinusoidal patterns, these embeddings lack predictable shape but adapt well to the data.


13. Positional Encoding in Keras Attention Models

When building custom attention layers, positional encoding is essential:

x = TokenEmbedding(input_tokens)
x = x + PositionEmbedding(positions)
x = MultiHeadAttention()(x, x)

Transformer models built in Keras typically combine token and position embeddings in this way before applying attention.
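Expanding the pseudocode above into a runnable sketch (all layer sizes are illustrative, and the single block below omits the feed-forward sublayer for brevity):

import tensorflow as tf
import keras_nlp

vocab_size, max_len, embed_dim, num_heads = 20000, 128, 256, 4   # illustrative sizes

inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=vocab_size,
    sequence_length=max_len,
    embedding_dim=embed_dim,
)(inputs)                                              # token + position embeddings
attention_output = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=embed_dim // num_heads
)(x, x)                                                # self-attention over position-aware inputs
x = tf.keras.layers.LayerNormalization()(x + attention_output)   # residual connection + norm
outputs = tf.keras.layers.GlobalAveragePooling1D()(x)
model = tf.keras.Model(inputs, outputs)
model.summary()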


14. Challenges and Limitations

14.1 Maximum sequence length

Positional embeddings usually require specifying a max length.

14.2 Poor adaptation to new lengths (learned embeddings)

Models like BERT cannot handle sequences longer than the maximum they were trained with (512 tokens for BERT) without extending or re-training the position embeddings.

14.3 Memory footprint

Relative encodings can add computation and memory overhead, because attention must incorporate pairwise position terms.


15. Best Practices for Using Positional Encodings in Keras

  • Use TokenAndPositionEmbedding for standard NLP tasks.
  • Prefer learned embeddings for fine-tuning tasks like classification.
  • Prefer relative positional encodings for long sequences.
  • For text generation tasks, consider models like GPT-2 that use learned absolute encodings.
  • For long-range dependencies, choose Transformer-XL or DeBERTa style encodings.

16. Real-World Applications of Positional Encodings

16.1 Machine Translation

Maintaining word order is essential for accurate translation.

16.2 Text Summarization

Summaries depend on sequence flow.

16.3 Chatbots and Dialogue Systems

Positional encoding helps maintain coherence and turn-taking structure.

16.4 Document Classification

Understanding sentence structure improves accuracy.

16.5 Audio and Time-Series Transformers

Time steps act like positions.

16.6 Vision Transformers

Patch positions guide spatial reasoning.


17. Future Directions of Positional Encoding Research

17.1 Relative attention dominance

More models are moving towards relative positional encoding.

17.2 Rotary Position Embeddings (RoPE)

Used in GPT-NeoX and LLaMA; RoPE rotates query and key vectors by position-dependent angles, so relative position enters the attention scores directly.

17.3 ALiBi positional bias

ALiBi adds a linear, distance-proportional penalty to attention scores instead of using position embeddings, which lets models extrapolate to sequences longer than those seen in training.

17.4 2D positional encodings for images

Extending to computer vision applications.

Positional encoding continues to evolve as Transformers expand across domains.

