Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models

Researchers at Zhejiang University and Quantstamp identified "sticky tokens" in text embedding models: tokens that, when repeatedly inserted into a sentence, unexpectedly pull sentence similarity toward a specific value. Their Sticky Token Detector (STD) found 868 such tokens across 40 models and showed that they can degrade downstream task performance by up to 52.3%, an effect the authors trace to sticky tokens disproportionately dominating model attention.

Introduction

Text embedding models have become fundamental components in modern natural language processing, powering applications from information retrieval to semantic similarity tasks. However, these models harbor a previously unrecognized vulnerability: certain tokens can artificially manipulate sentence similarity scores when inserted into text. This paper introduces the concept of "sticky tokens" - anomalous tokens that consistently pull cosine similarity between sentence pairs toward a specific value, typically the mean similarity in the model's embedding space.

[Figure: Sticky token behavior example]

The phenomenon was first observed in a Kaggle competition, where participants noticed that adding the token "lucrarea" to inputs of Sentence-T5 models made unrelated sentences appear more similar. As shown in the figure above, repeatedly inserting this token progressively increases the cosine similarity between two semantically different sentences, demonstrating the sticky token effect.

Problem Definition and Formal Framework

The authors provide a formal definition of sticky tokens based on their observed behavior. A token $t$ is considered "sticky" if, when repeatedly inserted into a sentence $s_2$, the cosine similarity between any sentence $s_1$ and the modified $s_2$ converges toward $u$ (the mean pairwise similarity of the model's token embeddings) within a threshold $\varepsilon$.

Mathematically, this is expressed as:

$$|\mathrm{Sim}(s_1, I(s_2, t, n)) - u| \leq \varepsilon$$

where $I(s_2, t, n)$ denotes inserting token $t$ into sentence $s_2$ a total of $n$ times, and $\mathrm{Sim}$ denotes cosine similarity between embeddings.

The insertion operation $I$ can occur in three ways: prefix insertion (adding tokens at the beginning), suffix insertion (adding at the end), or random insertion (placing tokens at random positions within the sentence). This comprehensive approach ensures that sticky tokens are detected regardless of their positional influence on the embedding.
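To make the operator concrete, here is a minimal Python sketch of $I(s_2, t, n)$. It works at the word level for readability (the detector operates on tokenizer output), and the function name and `mode` argument are illustrative, not from the paper:

```python
import random

def insert_token(sentence: str, token: str, n: int, mode: str = "suffix") -> str:
    """Insert `token` into `sentence` n times via prefix, suffix, or random
    placement -- a sketch of the insertion operator I(s, t, n)."""
    words = sentence.split()
    if mode == "prefix":
        return " ".join([token] * n + words)
    if mode == "suffix":
        return " ".join(words + [token] * n)
    for _ in range(n):  # random: each copy lands at an independent position
        words.insert(random.randint(0, len(words)), token)
    return " ".join(words)
```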

Detection Methodology

The researchers developed the Sticky Token Detector (STD), a four-step framework for efficiently identifying sticky tokens across different embedding models:

[Figure: STD framework]

Step 1: Sentence Pair Filtering optimizes the search space by focusing on sentence pairs most susceptible to sticky token influence. Since sticky tokens primarily pull similarities toward the mean $u$, the method filters sentence pairs to retain only those whose initial similarity is below $u$. This ensures the detector observes upward pulls toward the mean, which is characteristic of sticky token behavior.
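A sketch of this step, assuming an `embed` callable that returns sentence vectors and estimating $u$ by sampling token-embedding pairs (all names are assumptions, not the paper's code):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def estimate_u(token_embeddings: np.ndarray, sample: int = 2000) -> float:
    """Estimate u, the mean pairwise cosine similarity of token embeddings,
    from random pairs rather than the full O(V^2) computation."""
    rng = np.random.default_rng(0)
    idx = rng.integers(0, len(token_embeddings), size=(sample, 2))
    return float(np.mean([cosine(token_embeddings[i], token_embeddings[j])
                          for i, j in idx if i != j]))

def filter_pairs(pairs, embed, u):
    """Keep only sentence pairs whose initial similarity lies below u, so a
    pull toward the mean shows up as an upward shift."""
    return [(s1, s2) for s1, s2 in pairs if cosine(embed(s1), embed(s2)) < u]
```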

Step 2: Token Filtering categorizes and sanitizes the model's vocabulary by removing the following (a sketch of these checks appears after the list):

  • Undecodable tokens containing invalid characters
  • Unreachable tokens whose ID changes after decode-then-re-encode cycles
  • Special tokens like [CLS], [SEP], or </s>
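Using the Hugging Face transformers tokenizer API, the three filters might look like this; the paper's exact rules may differ, and treating a failed decode-then-re-encode round trip as "unreachable" is an interpretation of the description above:

```python
from transformers import AutoTokenizer

def sanitize_vocabulary(model_name: str) -> list:
    """Drop undecodable, unreachable, and special tokens from the candidate
    set -- a minimal sketch of Step 2, not the authors' implementation."""
    tok = AutoTokenizer.from_pretrained(model_name)
    special_ids = set(tok.all_special_ids)
    candidates = []
    for token_id in range(tok.vocab_size):
        if token_id in special_ids:
            continue  # special tokens like [CLS], [SEP], </s>
        text = tok.decode([token_id])
        if "\ufffd" in text or not text.strip():
            continue  # undecodable (replacement char) or empty after decoding
        if token_id not in tok.encode(text, add_special_tokens=False):
            continue  # unreachable: re-encoding never reproduces this ID
        candidates.append(token_id)
    return candidates
```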

Step 3: Shortlisting via Sticky Scoring computes a "sticky score" for each candidate token to avoid evaluating every token on all sentence pairs. The sticky score $SS(t)$ quantifies how much token $t$ influences similarity toward the mean, considering both the magnitude and frequency of similarity changes, with a penalty for tokens semantically close to the reference sentence.
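The paper's exact formula for $SS(t)$ is not reproduced here; the following proxy, reusing `cosine`, `insert_token`, and the filtered pairs from the sketches above, captures the core idea of rewarding shifts toward $u$ (the paper additionally penalizes tokens semantically close to the reference sentence, which this sketch omits):

```python
def sticky_score(token: str, pairs, embed, u: float, n_insert: int = 5) -> float:
    """Illustrative proxy for SS(t): the average reduction in distance from u
    after inserting `token` into the second sentence of each pair."""
    shifts = []
    for s1, s2 in pairs:
        base = cosine(embed(s1), embed(s2))
        moved = cosine(embed(s1), embed(insert_token(s2, token, n_insert)))
        shifts.append(abs(base - u) - abs(moved - u))  # > 0: pulled toward u
    return sum(shifts) / len(shifts)
```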

Step 4: Validation rigorously tests shortlisted tokens against the formal definition using all filtered sentence pairs, with an adaptive threshold based on the interquartile range of calculated similarity deviations.
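A sketch of the validation step, reusing the helpers above. How the paper derives the adaptive threshold from the interquartile range is not spelled out here, so flagging tokens whose deviations are unusually low outliers under a lower-fence rule is an assumption:

```python
import numpy as np

def mean_deviation(token, pairs, embed, u, n_insert=10):
    """Mean |Sim(s1, I(s2, t, n)) - u| over all filtered sentence pairs."""
    return float(np.mean([
        abs(cosine(embed(s1), embed(insert_token(s2, token, n_insert))) - u)
        for s1, s2 in pairs
    ]))

def validate(shortlist, pairs, embed, u):
    """Accept shortlisted tokens whose deviation from u is an unusually low
    outlier, using an IQR-based adaptive threshold over the shortlist."""
    devs = {t: mean_deviation(t, pairs, embed, u) for t in shortlist}
    q1, q3 = np.percentile(list(devs.values()), [25, 75])
    eps = q1 - 1.5 * (q3 - q1)  # assumed lower-fence rule for the cutoff
    return [t for t, d in devs.items() if d <= eps]
```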

Experimental Results and Token Characteristics

The STD was applied to 40 popular text embedding models spanning 14 model families released between 2019 and 2025, identifying 868 sticky tokens in total. Sticky tokens made up between 0.006% and 1% of each model's vocabulary, confirming their rarity while demonstrating the detector's efficiency.

Several key characteristics emerged from the analysis:

Model Family Consistency: Models within the same family often shared similar sticky tokens, suggesting architectural or training methodology influences. However, no consistent correlation existed between sticky token count and model size or vocabulary size.

Token Categories: Approximately 7% of detected sticky tokens were special tokens (e.g., </s>, [CLS], [MASK]) or unused reserved tokens (e.g., <extra_id_18>). About 22% comprised non-ASCII characters including Cyrillic, CJK, Arabic fragments, and mathematical symbols, likely resulting from fragmented multilingual subwords with limited pre-training coverage.

Model-Specific Patterns: T5-based models commonly had sticky tokens like </s> and unused <extra_id_X> tokens. BERT/RoBERTa derivatives showed inverse correlations with size, where larger models sometimes had fewer sticky tokens. LLM-based models exhibited highly varied counts, with some models like gte-Qwen2-7B-instruct containing 103 sticky tokens.

Impact on Downstream Tasks

Comprehensive evaluation across 15 MTEB tasks demonstrated that sticky tokens cause significantly higher performance degradation compared to randomly chosen normal tokens (p < 0.05, Cohen's d = 0.41). For the ST5-base model, sticky token insertion led to substantial performance drops: SciFact retrieval accuracy fell by 41.5% and NFCorpus retrieval accuracy by 52.3%.

The impact varied by model size, with lightweight models suffering more catastrophic degradation while larger models showed greater robustness, though all remained vulnerable to some degree.

Theoretical Analysis and Attention Patterns

The authors conducted attention layer analysis to understand the underlying mechanism behind sticky token behavior. They found that sticky tokens disproportionately dominate model attention, with their attention weights concentrated in high-value ranges (>0.4), unlike normal tokens, whose attention weights follow a roughly Gaussian distribution.

[Figure: Attention analysis]

Layer-wise analysis revealed that irregularities are progressively amplified across layers. While divergence between sticky and normal token attention patterns is moderate in early layers, it sharply increases in mid to late layers, peaking at the final layers. This indicates that minor anomalies introduced by sticky tokens compound as information propagates through deeper layers.
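One way to reproduce this kind of probe with the Hugging Face transformers API, measuring the per-layer attention mass flowing into a suspect token's positions (the model name, function name, and aggregation choices are illustrative, not the paper's exact setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

def attention_into_token(model_name: str, sentence: str, target: str) -> list:
    """Per layer, mean attention weight received by `target`'s position(s) --
    a rough probe of whether a token dominates attention."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_attentions=True)
    inputs = tok(sentence, return_tensors="pt")
    target_ids = set(tok.encode(target, add_special_tokens=False))
    mask = torch.tensor([int(i) in target_ids for i in inputs["input_ids"][0]])
    with torch.no_grad():
        attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq)
    # average over heads and query positions; keep columns of target positions
    return [float(a[0].mean(dim=0)[:, mask].mean()) for a in attentions]
```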

The authors conjecture that this phenomenon relates to the inherent anisotropy of embedding spaces, where representations occupy a narrow cone rather than being uniformly distributed. This anisotropic structure enables sticky tokens to pull sentence embeddings toward specific focal points, reducing variance and making unrelated sentences appear more similar.
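Anisotropy is commonly quantified as the expected cosine similarity between random embeddings; a quick estimator under that definition (a value near 0 indicates isotropy, a value near 1 a tight cone):

```python
import numpy as np

def anisotropy(embeddings: np.ndarray, sample: int = 5000, seed: int = 0) -> float:
    """Estimate anisotropy as the mean cosine similarity of random embedding
    pairs; near 0 means roughly isotropic, near 1 a narrow cone."""
    rng = np.random.default_rng(seed)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    i = rng.integers(0, len(normed), size=sample)
    j = rng.integers(0, len(normed), size=sample)
    keep = i != j  # discard self-pairs, which trivially have similarity 1
    return float(np.einsum("ij,ij->i", normed[i[keep]], normed[j[keep]]).mean())
```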

[Figure: Anisotropy conjecture]

Security Implications and Mitigation Strategies

The discovery of sticky tokens opens new avenues for adversarial attacks, particularly against Retrieval-Augmented Generation (RAG) systems. By injecting sticky tokens into malicious content, attackers could manipulate retrieval results, forcing language models to access and potentially generate toxic or misleading information even when responding to benign queries.

The authors propose initial mitigation strategies including tokenizer sanitization through proactive pruning of problematic tokens before fine-tuning, and runtime detection systems that flag inputs containing suspected sticky tokens for real-time masking or embedding recalibration.
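A minimal version of the runtime-detection idea, assuming a precomputed set of sticky token IDs for the deployed model (`sticky_ids` and the function name are hypothetical):

```python
def flag_sticky_tokens(text: str, tokenizer, sticky_ids: set) -> list:
    """Return positions of suspected sticky tokens in `text`, so a caller can
    mask them or recalibrate the embedding before retrieval."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [pos for pos, tid in enumerate(ids) if tid in sticky_ids]
```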

Limitations and Future Directions

The current approach assumes sticky tokens uniformly pull similarity toward the mean of token embeddings, which may not hold for models with isotropic embedding spaces or highly task-specific embeddings. The study was also limited to open-source models using BPE-based tokenization, leaving closed-source models and alternative tokenization schemes unexplored.

While the paper successfully identifies and characterizes the sticky token problem, it does not propose definitive solutions such as tokenizer retraining or embedding space regularization techniques. These limitations present clear directions for future research into more robust tokenization strategies and model architectures that can mitigate the adverse effects of sticky tokens, ultimately leading to more reliable embedding-based NLP systems.

Relevant Citations

MTEB: Massive text embedding benchmark.
This paper introduces the Massive Text Embedding Benchmark (MTEB), which is the primary framework used in the main paper to evaluate the downstream impact of sticky tokens. The reported performance degradation on various MTEB tasks, such as clustering and retrieval, serves as the core evidence for the severity of the sticky token problem.
Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2023. MTEB: Massive text embedding benchmark. Preprint, arXiv:2210.07316.
Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models.
This work introduces the Sentence-T5 model, which is used as the primary illustrative example of the sticky token phenomenon throughout the paper. The specific sticky token 'lucrarea' is from a Sentence-T5 model and is featured prominently in figures and analysis, making this citation central to the problem's demonstration.
Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, and Yinfei Yang. 2021. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. Preprint, arXiv:2108.08877.
Glitch tokens in large language models: Categorization taxonomy and effective detection.
This paper investigates 'glitch tokens', the most closely related anomalous token phenomenon discussed in the main paper's related work section. It provides essential context by establishing the prior art on token-level anomalies in LLMs, which helps frame the novelty of the main paper's focus on a similar issue within text embedding models.
Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, and Haoyu Wang. 2024. Glitch tokens in large language models: Categorization taxonomy and effective detection. Preprint, arXiv:2404.09894.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.
The main paper identifies tokenization artifacts as a key origin of sticky tokens. This citation introduces SentencePiece, a widely used tokenization algorithm employed by many of the analyzed models, including the T5 family. Understanding this foundational tokenization method is crucial for comprehending the root cause of the observed anomalies.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. Preprint, arXiv:1808.06226.