Understanding Frequency Bias in SGD and Adam's Adaptive Remedy

Posted by u/Tiobasil · 2026-05-19 21:27:35

In training language models, not all words are created equal. Common tokens like "the" appear in nearly every sentence, while rare but meaningful ones may be seen only occasionally. This imbalance creates a hidden challenge: standard optimization methods like Stochastic Gradient Descent (SGD) treat all parameters uniformly, causing rare tokens to learn slowly. Adaptive optimizers like Adam address this by scaling updates based on each parameter's history. In this Q&A, we explore the frequency bias problem and how Adam's design provides a fix.

What is the frequency bias problem in SGD?

Stochastic Gradient Descent (SGD) applies the same learning rate to every parameter in a model, regardless of how often that parameter receives gradient updates. In natural language processing, token frequencies follow a highly skewed distribution: common words like "the" or "a" appear in almost every batch, while rare tokens like "thalweg" may appear only once in a thousand batches. Under SGD, parameters associated with common tokens are updated frequently and converge quickly, but rare-token parameters often remain close to their random initialization because they receive very few gradient steps. This asymmetry—where frequently updated weights dominate learning and rarely updated weights stagnate—is called frequency bias. It leads to suboptimal representations for low-frequency but semantically important tokens, ultimately hurting model performance on tasks that depend on those rare features.
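
The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a linear model with squared-error loss and one input feature per token; the function names, the learning rate, and the presence probabilities are hypothetical choices for this example, not values from the article.

```python
import numpy as np

def sgd_step(w, x, y, lr):
    """One SGD step for the loss 0.5 * (w @ x - y)**2 on a linear model."""
    error = w @ x - y
    grad = error * x              # exactly zero wherever x is zero
    return w - lr * grad          # same learning rate for every parameter

def expected_updates(presence_prob, num_steps):
    """Expected number of steps in which each token gets a non-zero gradient."""
    return np.asarray(presence_prob, dtype=float) * num_steps

w = np.zeros(3)
x = np.array([1.0, 1.0, 0.0])     # the third token is absent from this sample
w_new = sgd_step(w, x, y=2.0, lr=0.1)
# w_new is [0.2, 0.2, 0.0]: the absent token's weight does not move.

# Hypothetical presence probabilities for a common, a medium, and a rare token.
updates = expected_updates([0.9, 0.1, 0.001], num_steps=10_000)
# updates is approximately [9000, 1000, 10]
```

Because the step size is the same for every parameter, the total distance a weight can travel is limited by how many non-zero gradients it receives, which is why the rare token lags behind.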

Source: www.marktechpost.com

How does Adam overcome the frequency bias of SGD?

Adam is often summarized as momentum plus per-parameter step scaling, and it is the scaling, described here as variance normalization, that matters most for frequency bias. Adam maintains per-parameter exponential moving averages of the gradient (the first moment) and of the squared gradient (the second moment, or uncentered variance). Each update is the first-moment estimate divided by the square root of the second-moment estimate, so the step size depends on the typical magnitude of that parameter's recent gradients rather than on a single global scale. A parameter that rarely receives a non-zero gradient accumulates a small second moment, so its effective learning rate is larger. This allows rare-token weights to take bigger steps when they finally receive a gradient, compensating for the infrequency of updates. In contrast, common tokens with a large accumulated second moment see smaller normalized updates, which guards against overshooting. This adaptive mechanism helps all parameters learn at a more balanced pace, even when gradient exposure is extremely uneven.
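
For reference, a single Adam update can be written out as follows. This is a sketch of the standard update rule with its commonly used default hyperparameters; the function name `adam_step` and the example gradient values are hypothetical.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # correct the bias from zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Three parameters with very different gradient magnitudes (hypothetical values).
w = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
grad = np.array([1.0, 0.01, 0.0])
w, m, v = adam_step(w, grad, m, v, t=1)
# On the first step every parameter with a non-zero gradient moves by about lr,
# regardless of how large its gradient is; the zero-gradient parameter stays put.
```

The division by `np.sqrt(v_hat)` is what makes the step size depend on each parameter's own gradient history instead of on one shared scale.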

Can you describe the controlled experiment that demonstrates this behavior?

To isolate the effect of token frequency, researchers constructed a synthetic experiment using a vocabulary of six tokens whose appearance probabilities span four orders of magnitude. Each token was assigned the same ground-truth weight of 1.0, removing semantic complexity. Each training sample was a sparse binary vector indicating which tokens were present in that sample, and the target value was the sum of the active tokens' weights plus noise. A linear model was trained twice, once with vanilla SGD and once with Adam, with the same target weights in both runs. By comparing final parameter values, non-zero gradient counts, and Adam's effective learning rates for each token, the experiment directly observed how adaptive optimization compensates for frequency imbalance. This clean setup controlled for everything except how often each parameter received gradient updates.
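
A setup along these lines can be approximated in NumPy. The sketch below is not the original experiment's code: the presence probabilities, step count, learning rates, noise level, zero initialization, and one-sample-per-step training are all assumptions made for illustration.

```python
import numpy as np

# Placeholder presence probabilities, not the values used in the experiment above.
PROBS = [0.9, 0.5, 0.1, 0.05, 0.01, 0.001]

def run_experiment(probs, steps=20_000, lr_sgd=0.01, lr_adam=0.01,
                   noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    n = probs.size
    true_w = np.ones(n)                      # every token has ground-truth weight 1.0
    w_sgd = np.zeros(n)                      # assumption: zero init for simplicity
    w_adam = np.zeros(n)
    m = np.zeros(n)
    v = np.zeros(n)
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    grad_counts = np.zeros(n, dtype=int)     # steps in which each token was present

    for t in range(1, steps + 1):
        x = (rng.random(n) < probs).astype(float)    # sparse binary sample
        y = true_w @ x + noise_std * rng.normal()    # noisy sum of active weights
        grad_counts += (x > 0)

        # Vanilla SGD on the squared error.
        w_sgd -= lr_sgd * (w_sgd @ x - y) * x

        # Adam on the same sample.
        g = (w_adam @ x - y) * x
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w_adam -= lr_adam * m_hat / (np.sqrt(v_hat) + eps)

    # Per-parameter step scale at the end of training.
    effective_lr = lr_adam / (np.sqrt(v_hat) + eps)
    return w_sgd, w_adam, grad_counts, effective_lr

w_sgd, w_adam, grad_counts, effective_lr = run_experiment(PROBS)
```

Printing `w_sgd`, `w_adam`, `grad_counts`, and `effective_lr` side by side gives the three comparisons the experiment relies on. Exact numbers depend on the seed and on the assumed hyperparameters.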

What were the key findings from comparing SGD and Adam on imbalanced token frequencies?

The experiment revealed stark differences. Under SGD, the parameter for the most frequent token (appearing in nearly every batch) converged quickly to near 1.0, while the parameter for the rarest token (appearing only 0.1% of the time) stayed close to its random initial value. Adam, in contrast, produced final parameter values much closer to 1.0 for all tokens, including the rarest. Gradient counts showed that the rare token received far fewer non-zero gradients under both optimizers, but Adam's effective learning rate for that token was significantly higher, often orders of magnitude larger than for common tokens. This allowed the rare token to make substantial updates during its few appearances, while SGD's uniform learning rate left it stuck. The results show quantitatively that Adam's variance normalization counteracts frequency bias and enables more balanced learning across the vocabulary.
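
A rough back-of-the-envelope argument shows why the rare token's steps are larger under Adam. The idealization below is an assumption for illustration, not part of the experiment: a parameter receives a gradient of fixed magnitude |g| with probability p per step and zero otherwise, and momentum, bias correction, and the epsilon term are ignored. Here α is the base learning rate and v is Adam's second-moment estimate.

```latex
\[
\mathbb{E}[v] \approx p\,g^{2}
\qquad\Longrightarrow\qquad
\Delta w \;\approx\; -\,\alpha\,\frac{g}{\sqrt{p\,g^{2}}}
\;=\; -\,\frac{\alpha}{\sqrt{p}}\,\operatorname{sign}(g)
\]
```

Under this approximation the step taken on each appearance grows like 1/sqrt(p), so a token with p = 0.001 takes steps roughly 32 times larger than one present in every sample. The effective learning rate α/sqrt(v) also depends on the gradient magnitude itself, so the gap can be wider in practice. The approximation is only reasonable when the averaging window of about 1/(1 - β2) steps covers several appearances of the token.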

Why is Adam particularly effective for training modern language models?

Modern language models are trained on massive, naturally occurring text where token frequencies follow a Zipfian distribution: a tiny number of tokens account for most occurrences, while a long tail of tokens each appear only rarely. Standard SGD would tend to make the model disproportionately good at representing common tokens while poorly representing rare but often critical ones (e.g., technical terms, named entities). Adam's per-parameter adaptive learning rates automatically compensate for this imbalance, allowing the model to learn robust representations across the entire vocabulary without manual tuning of learning rates per token or reweighting of data. This property is one reason Adam (and variants like AdamW) is the default optimizer in most large-scale NLP training pipelines. The result is models that perform well not only on frequent language patterns but also on niche and low-frequency expressions, which are crucial for tasks like question answering, translation, and domain-specific applications.

What implications does frequency bias have for model architecture design?

The frequency bias exposed by the SGD vs. Adam experiments suggests that naive uniform optimization leaves rare-token parameters under-trained, so part of the model's capacity goes unused. Even with Adam, some residual bias may remain for extremely rare tokens. This has spurred research into design and training modifications, such as adaptive embedding layers where the learning rate for each token embedding is explicitly controlled, or frequency-based reweighting of the loss function. It also underscores the importance of using optimizers that adapt to gradient sparsity. For models with very large vocabularies or long-tail distributions, practitioners often combine Adam with techniques like gradient clipping and learning rate schedules that further modulate updates. Understanding frequency bias also motivates careful initialization and the use of pre-trained embeddings, which can provide a warm start for rare tokens. Ultimately, the choice of optimizer is not just a training detail but a core factor in ensuring that all tokens, common and rare, are fairly represented in the learned model.
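
As one concrete reading of frequency-based loss reweighting, the sketch below computes per-token loss weights that grow as a token becomes rarer. The function name, the `power` and `smoothing` parameters, and the example counts are hypothetical; this is one possible scheme, not a method taken from the article.

```python
import numpy as np

def inverse_frequency_weights(counts, power=0.5, smoothing=1.0):
    """Per-token loss weights proportional to (count + smoothing) ** -power.

    The weights are rescaled to have mean 1 so the overall loss scale is unchanged.
    """
    counts = np.asarray(counts, dtype=float)
    raw = (counts + smoothing) ** (-power)
    return raw / raw.mean()

# Hypothetical corpus counts for a common, a medium, and an unseen token.
token_counts = [9999, 99, 0]
weights = inverse_frequency_weights(token_counts)
# Rarer tokens receive larger weights; the weights average to 1.
```

With `power=0.5` the weights scale like one over the square root of the count, which mirrors the kind of compensation Adam applies implicitly; `power=1.0` would be a more aggressive correction.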