Table of Contents
- Introduction
- Problem Definition and Formal Framework
- Detection Methodology
- Experimental Results and Token Characteristics
- Impact on Downstream Tasks
- Theoretical Analysis and Attention Patterns
- Security Implications and Mitigation Strategies
- Limitations and Future Directions
- Relevant Citations
Introduction
Text embedding models have become fundamental components in modern natural language processing, powering applications from information retrieval to semantic similarity tasks. However, these models harbor a previously unrecognized vulnerability: certain tokens can artificially manipulate sentence similarity scores when inserted into text. This paper introduces the concept of "sticky tokens" - anomalous tokens that consistently pull cosine similarity between sentence pairs toward a specific value, typically the mean similarity in the model's embedding space.
The phenomenon was first observed in a Kaggle competition, where participants noticed that adding the token "lucrarea" to sentences embedded with Sentence-T5 models made unrelated sentences appear more similar. As the paper's opening figure illustrates, repeatedly inserting this token causes the cosine similarity between two semantically different sentences to increase progressively, demonstrating the sticky token effect.
Problem Definition and Formal Framework
The authors provide a formal definition of sticky tokens based on their observed behavior. A token $t$ is considered "sticky" if, when repeatedly inserted into a sentence $s$, the cosine similarity between any sentence $s'$ and the modified $s$ converges toward $\mu$ (the mean pairwise similarity of the model's token embeddings) within a threshold $\epsilon$.
Mathematically, this is expressed as:

$$\bigl|\,\cos\bigl(E(s'),\, E(\mathrm{Ins}(s, t, n))\bigr) - \mu\,\bigr| \le \epsilon \quad \text{for sufficiently large } n,$$

where $\mathrm{Ins}(s, t, n)$ represents inserting token $t$ into sentence $s$ a total of $n$ times, $E(\cdot)$ is the model's sentence embedding, and $\cos(\cdot,\cdot)$ denotes cosine similarity between embeddings.
The insertion operation can occur in three ways: prefix insertion (adding tokens at the beginning), suffix insertion (adding at the end), or random insertion (placing tokens at random positions within the sentence). This comprehensive approach ensures that sticky tokens are detected regardless of their positional influence on the embedding.
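To make the definition concrete, here is a minimal sketch of the convergence check, assuming the sentence-transformers library and the public `sentence-transformers/sentence-t5-base` checkpoint; the values of $\mu$ and $\epsilon$ below are placeholders, not the measured ones.

```python
# Minimal sketch of the sticky-token check (not the authors' code).
# "lucrarea" is the token reported for Sentence-T5; mu and eps are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/sentence-t5-base")

def insert_suffix(sentence: str, token: str, n: int) -> str:
    """Suffix insertion: append the candidate token n times."""
    return sentence + (" " + token) * n

s_a = "The cat sat quietly on the warm windowsill."
s_b = "Quarterly revenue grew by twelve percent last year."
token = "lucrarea"
mu, eps = 0.55, 0.05   # placeholder mean similarity and tolerance

for n in [0, 1, 2, 4, 8, 16]:
    emb = model.encode([s_a, insert_suffix(s_b, token, n)])
    sim = util.cos_sim(emb[0], emb[1]).item()
    print(f"n={n:2d}  cos={sim:.3f}  |cos - mu|={abs(sim - mu):.3f}  (sticky if <= {eps})")
```

If the token is sticky, the printed deviation from $\mu$ should shrink as $n$ grows, even though the two sentences are unrelated.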
Detection Methodology
The researchers developed the Sticky Token Detector (STD), a four-step framework for efficiently identifying sticky tokens across different embedding models:
Step 1: Sentence Pair Filtering optimizes the search space by focusing on sentence pairs most susceptible to sticky token influence. Since sticky tokens primarily pull similarities toward the mean $\mu$, the method filters sentence pairs to retain only those whose initial similarity is below $\mu$. This ensures the detector observes upward pulls toward the mean, which is characteristic of sticky token behavior.
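A rough sketch of this filtering step, under the same assumptions as above; here $\mu$ is approximated from the sampled pairs themselves rather than from the full token-embedding matrix, which is a simplification.

```python
# Sketch of Step 1: keep only sentence pairs whose similarity lies below mu,
# so any pull toward mu shows up as an upward shift.
import itertools
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/sentence-t5-base")
sentences = [
    "A dog runs across the park.",
    "Stock prices fell sharply on Monday.",
    "She baked a loaf of sourdough bread.",
    "The telescope captured a distant galaxy.",
]

embs = model.encode(sentences)
pairs = list(itertools.combinations(range(len(sentences)), 2))
sims = {p: util.cos_sim(embs[p[0]], embs[p[1]]).item() for p in pairs}

mu = sum(sims.values()) / len(sims)            # stand-in for the embedding-space mean
kept = [p for p, s in sims.items() if s < mu]  # pairs that can only be pulled upward
print(f"mu ~ {mu:.3f}, kept {len(kept)}/{len(pairs)} pairs")
```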
Step 2: Token Filtering categorizes and sanitizes the model's vocabulary by removing:
- Undecodable tokens containing invalid characters
- Unreachable tokens whose ID changes after decode-then-re-encode cycles
- Special tokens like `[CLS]`, `[SEP]`, or `</s>` (a sketch of these vocabulary checks appears below)
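The vocabulary checks could look roughly like the following, assuming a Hugging Face tokenizer; the paper's exact filtering rules may differ.

```python
# Sketch of Step 2: drop undecodable, unreachable, and special tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/sentence-t5-base")
special = set(tok.all_special_tokens)
candidates = []

for token, token_id in tok.get_vocab().items():
    text = tok.decode([token_id])
    if "\ufffd" in text or not text.strip():                       # undecodable / empty
        continue
    if tok.encode(text, add_special_tokens=False) != [token_id]:   # unreachable after re-encoding
        continue
    if token in special:                                           # [CLS], [SEP], </s>, ...
        continue
    candidates.append(token_id)

print(f"{len(candidates)} candidate tokens remain out of {tok.vocab_size}")
```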
Step 3: Shortlisting via Sticky Scoring computes a "sticky score" for each candidate token to avoid evaluating every token on all sentence pairs. The sticky score quantifies how much a token $t$ influences similarity toward the mean $\mu$, considering both the magnitude and frequency of similarity changes, with a penalty for tokens semantically close to the reference sentence.
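The paper's exact scoring formula is not reproduced here; the following is an illustrative stand-in that captures the described ingredients (magnitude and frequency of shifts toward $\mu$, minus a closeness penalty), reusing the `model`, `sentences`, `sims`, `kept`, and `mu` names from the earlier sketches.

```python
# Illustrative sticky score, not the paper's formula.
from sentence_transformers import util

def sticky_score(token: str) -> float:
    gains, hits = [], 0
    for i, j in kept:
        modified = sentences[j] + " " + token                # single suffix insertion
        emb = model.encode([sentences[i], modified])
        new_sim = util.cos_sim(emb[0], emb[1]).item()
        gain = abs(sims[(i, j)] - mu) - abs(new_sim - mu)    # movement toward mu
        gains.append(gain)
        hits += gain > 0
    # Penalty for tokens that are themselves semantically close to the reference sentence.
    tok_emb, ref_emb = model.encode([token, sentences[kept[0][0]]])
    penalty = max(util.cos_sim(tok_emb, ref_emb).item(), 0.0)
    return (sum(gains) / len(gains)) * (hits / len(gains)) - 0.1 * penalty

scores = {t: sticky_score(t) for t in ["lucrarea", "the"]}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```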
Step 4: Validation rigorously tests shortlisted tokens against the formal definition using all filtered sentence pairs, with an adaptive threshold based on the interquartile range of calculated similarity deviations.
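One plausible reading of this validation step, again reusing names from the sketches above; the IQR fence and the insertion count are assumptions.

```python
# Sketch of Step 4: adaptive threshold from the IQR of similarity deviations,
# then a pass/fail check against the formal definition.
import numpy as np
from sentence_transformers import util

def adaptive_eps(deviations) -> float:
    """Illustrative epsilon: upper IQR fence over observed |cos - mu| deviations."""
    q1, q3 = np.percentile(np.asarray(deviations), [25, 75])
    return q3 + 1.5 * (q3 - q1)

def validate(token: str, eps: float, n: int = 16) -> bool:
    """Token passes if every filtered pair lands within eps of mu after n insertions."""
    for i, j in kept:
        modified = sentences[j] + (" " + token) * n
        emb = model.encode([sentences[i], modified])
        if abs(util.cos_sim(emb[0], emb[1]).item() - mu) > eps:
            return False
    return True
```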
Experimental Results and Token Characteristics
The STD was applied to 40 popular text embedding models spanning 14 model families from 2019 to 2025, successfully identifying 868 sticky tokens total. The percentage of sticky tokens within vocabularies ranged from 0.006% to 1%, confirming their rarity while demonstrating the detection method's efficiency.
Several key characteristics emerged from the analysis:
Model Family Consistency: Models within the same family often shared similar sticky tokens, suggesting architectural or training methodology influences. However, no consistent correlation existed between sticky token count and model size or vocabulary size.
Token Categories: Approximately 7% of detected sticky tokens were special tokens (e.g., `</s>`, `[CLS]`, `[MASK]`) or unused reserved tokens (e.g., `<extra_id_18>`). About 22% comprised non-ASCII characters, including Cyrillic, CJK, Arabic fragments, and mathematical symbols, likely resulting from fragmented multilingual subwords with limited pre-training coverage.
Model-Specific Patterns: T5-based models commonly had sticky tokens like `</s>` and unused `<extra_id_X>` tokens. BERT/RoBERTa derivatives showed an inverse correlation with size, with larger models sometimes having fewer sticky tokens. LLM-based models exhibited highly varied counts, with some models such as gte-Qwen2-7B-instruct containing 103 sticky tokens.
Impact on Downstream Tasks
Comprehensive evaluation across 15 MTEB tasks demonstrated that sticky tokens cause significantly higher performance degradation compared to randomly chosen normal tokens (p < 0.05, Cohen's d = 0.41). For the ST5-base model, sticky token insertion led to substantial performance drops: SciFact retrieval accuracy fell by 41.5% and NFCorpus retrieval accuracy by 52.3%.
The impact varied with model size: lightweight models suffered more catastrophic degradation, while larger models showed greater robustness, though all remained vulnerable to some degree.
Theoretical Analysis and Attention Patterns
The authors conducted attention layer analysis to understand the underlying mechanism behind sticky token behavior. They found that sticky tokens disproportionately dominate model attention, with their attention weights concentrating in high-value ranges (>0.4), unlike normal tokens which follow a more Gaussian distribution.
Layer-wise analysis revealed that irregularities are progressively amplified across layers. While divergence between sticky and normal token attention patterns is moderate in early layers, it sharply increases in mid to late layers, peaking at the final layers. This indicates that minor anomalies introduced by sticky tokens compound as information propagates through deeper layers.
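A hedged sketch of how such an attention probe could be run with Hugging Face transformers; the model choice and the appended token are illustrative, and only the >0.4 cut-off comes from the analysis above.

```python
# Sketch: per-layer attention received by a suffix-appended candidate token.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"   # assumption: any BERT-style encoder for the demo
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name, output_attentions=True)
enc.eval()

text = "Stock prices fell sharply on Monday lucrarea"   # candidate token appended as a suffix
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    attn = enc(**inputs).attentions            # tuple of (1, heads, seq, seq), one per layer

target = inputs["input_ids"].shape[1] - 2      # last content position (the appended token's final piece)
for layer, a in enumerate(attn):
    received = a[0, :, :, target]              # attention every position pays to the candidate token
    frac_high = (received > 0.4).float().mean().item()
    print(f"layer {layer:2d}: share of attention weights > 0.4 = {frac_high:.2f}")
```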
The authors conjecture that this phenomenon relates to the inherent anisotropy of embedding spaces, where representations occupy a narrow cone rather than being uniformly distributed. This anisotropic structure enables sticky tokens to pull sentence embeddings toward specific focal points, reducing variance and making unrelated sentences appear more similar.
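Anisotropy (and the mean similarity $\mu$ itself) can be estimated directly from the token-embedding matrix; a quick sketch follows, with the encoder choice and sample size as arbitrary assumptions. A strongly positive mean pairwise cosine indicates the "narrow cone" structure described above.

```python
# Sketch: estimate mean pairwise cosine similarity of token embeddings.
import torch
from transformers import AutoModel

enc = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
emb = enc.get_input_embeddings().weight.detach()
sample = emb[torch.randperm(emb.size(0))[:2000]]             # subsample tokens for speed
normed = torch.nn.functional.normalize(sample, dim=-1)
cos = normed @ normed.T
mu_hat = (cos.sum() - cos.trace()) / (cos.numel() - cos.size(0))  # exclude self-similarity
print(f"estimated mean pairwise token-embedding cosine ~ {mu_hat.item():.3f}")
```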
Security Implications and Mitigation Strategies
The discovery of sticky tokens opens new avenues for adversarial attacks, particularly against Retrieval-Augmented Generation (RAG) systems. By injecting sticky tokens into malicious content, attackers could manipulate retrieval results, forcing language models to access and potentially generate toxic or misleading information even when responding to benign queries.
The authors propose initial mitigation strategies including tokenizer sanitization through proactive pruning of problematic tokens before fine-tuning, and runtime detection systems that flag inputs containing suspected sticky tokens for real-time masking or embedding recalibration.
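A minimal sketch of the runtime-detection idea, assuming a precomputed blocklist of sticky tokens; the blocklist entries and the masking-by-deletion behavior here are assumptions, not the paper's system.

```python
# Sketch: flag and strip suspected sticky tokens before embedding.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/sentence-t5-base")
# Hypothetical blocklist; in practice it would come from running the STD on the model.
STICKY_TOKENS = {"lucrarea", "<extra_id_18>"}
STICKY_IDS = {i for t in STICKY_TOKENS for i in tok.encode(t, add_special_tokens=False)}

def sanitize(text: str):
    """Flag inputs containing suspected sticky tokens and mask them by deletion."""
    ids = tok.encode(text, add_special_tokens=False)
    flagged = any(i in STICKY_IDS for i in ids)
    clean = tok.decode([i for i in ids if i not in STICKY_IDS])
    return clean, flagged

print(sanitize("Quarterly revenue grew lucrarea lucrarea by twelve percent."))
```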
Limitations and Future Directions
The current approach assumes sticky tokens uniformly pull similarity toward the mean of token embeddings, which may not hold for models with isotropic embedding spaces or highly task-specific embeddings. The study was also limited to open-source models using BPE-based tokenization, leaving closed-source models and alternative tokenization schemes unexplored.
While the paper successfully identifies and characterizes the sticky token problem, it does not propose definitive solutions such as tokenizer retraining or embedding space regularization techniques. These limitations present clear directions for future research into more robust tokenization strategies and model architectures that can mitigate the adverse effects of sticky tokens, ultimately leading to more reliable embedding-based NLP systems.