Building Better AI Brains: How DeepSeek Solved the Hidden Instability Problem in Neural Networks
When you train a large language model, you’re essentially teaching a neural network to predict the next word in a sequence. But as these networks get deeper, with more layers added to process information, they become increasingly unstable. It’s like trying to build a tower of blocks where each new layer makes the whole structure wobble more precariously. For nearly a decade, researchers have managed this problem with a simple trick called residual connections. But now, a team at DeepSeek-AI has discovered that a promising new approach to making these networks more powerful actually breaks this stability mechanism, and they’ve found an elegant mathematical solution.
The paper, titled “mHC: Manifold-Constrained Hyper-Connections”, tackles a fundamental challenge in scaling artificial intelligence: how to make neural networks deeper and more capable without them falling apart during training. The solution involves constraining neural network connections to a special mathematical space called the Birkhoff polytope, using an algorithm from the 1960s called Sinkhorn-Knopp. The result is a method that maintains training stability while preserving the performance benefits of more complex network architectures.
The Foundation: Why Residual Connections Matter
To understand why this matters, we need to go back to 2015, when Microsoft researchers introduced ResNets—residual networks—in a landmark paper that transformed deep learning. The key innovation was deceptively simple: instead of just passing information through a layer, they also added a “shortcut” that lets information bypass the layer entirely. Mathematically, instead of computing output = layer(input), they compute output = input + layer(input).
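To make that arithmetic concrete, here is a minimal sketch of a residual block in PyTorch. The inner layer is a stand-in MLP chosen purely for illustration, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual wrapper: output = input + layer(input)."""
    def __init__(self, dim: int):
        super().__init__()
        # Any sub-layer could go here; a small MLP is used purely for illustration.
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The shortcut carries x through unchanged, no matter what self.layer does.
        return x + self.layer(x)

x = torch.randn(2, 64)
print(ResidualBlock(64)(x).shape)  # torch.Size([2, 64])
```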
This seemingly minor change had profound consequences. It solved a problem that had plagued deep neural networks for years: the vanishing gradient problem. When you train a network, you adjust the weights by computing gradients—essentially, how much each weight should change to reduce errors. But in very deep networks, these gradients would become so small by the time they reached the early layers that learning essentially stopped. Residual connections prevented this by ensuring that gradients could flow directly through the shortcut, maintaining their strength.
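The underlying calculation is the standard textbook argument rather than anything specific to this paper: differentiating a stack of residual blocks x_{l+1} = x_l + F_l(x_l) gives a product of factors that each contain the identity, so there is always a direct path for the gradient no matter how small the individual layer Jacobians become.

```latex
\frac{\partial x_L}{\partial x_l}
  = \prod_{i=l}^{L-1}\left(I + \frac{\partial F_i}{\partial x_i}\right)
  = I \;+\; \sum_{i=l}^{L-1}\frac{\partial F_i}{\partial x_i} \;+\; \text{(higher-order terms)}
```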
More importantly, residual connections preserve what researchers call the “identity mapping property.” This is a fancy way of saying that the original input is always present in the output, unchanged. This acts as a stabilizing force—no matter what the layer does, there’s always a baseline signal flowing through. It’s like having a guardrail on a bridge: even if the bridge sways, you have something to hold onto.
This principle became the foundation of modern AI. Transformer models, which power ChatGPT and other large language models, rely entirely on residual connections. They’re so fundamental that it’s hard to overstate their importance.
The Tempting Problem: Hyper-Connections
Fast forward to 2024. A group of researchers published a paper on “Hyper-Connections” (HC), proposing a way to make residual connections more powerful. Instead of just adding the input to the layer output, they introduced learnable matrices that could mix and match information from multiple “streams” running in parallel through the network.
Think of it like this: instead of one highway where information flows, HC creates four parallel highways (or more, depending on the expansion rate). These highways can exchange traffic through learnable gates—controlled by the network itself. This allows for much richer interactions between different parts of the network without significantly increasing computational cost.
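As a rough illustration, here is a hedged PyTorch sketch of that idea: n parallel streams, a learnable stream-mixing matrix, and learnable read/write weights around the layer. This is a loose sketch of the concept, not the exact parameterization from the HC paper, and all names in it are invented for the example.

```python
import torch
import torch.nn as nn

class HyperConnectionSketch(nn.Module):
    """Loose sketch of hyper-connections: n residual streams mixed by
    learnable, unconstrained matrices. Illustrative only."""
    def __init__(self, dim: int, n_streams: int = 4):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Unconstrained mixing: nothing stops these from growing large during training.
        self.stream_mix = nn.Parameter(torch.eye(n_streams))                 # stream-to-stream exchange
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))  # how the layer reads the streams
        self.write = nn.Parameter(torch.ones(n_streams))                     # how the output is written back

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: (n_streams, batch, dim)
        mixed = torch.einsum("ij,jbd->ibd", self.stream_mix, streams)  # exchange traffic between highways
        layer_in = torch.einsum("i,ibd->bd", self.read, streams)       # collapse streams into one layer input
        layer_out = self.layer(layer_in)
        return mixed + self.write[:, None, None] * layer_out[None]     # broadcast the output onto each stream

streams = torch.randn(4, 2, 64)  # 4 streams, batch of 2, hidden size 64
print(HyperConnectionSketch(64)(streams).shape)  # torch.Size([4, 2, 64])
```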
The performance gains were impressive. Models with HC performed better on downstream tasks. The idea was elegant and the results were promising. But there was a hidden problem lurking beneath the surface.
The Instability Crisis
When DeepSeek researchers tried to use Hyper-Connections in large-scale training, something went wrong. Around the 12,000th training step, the loss—a measure of how wrong the model’s predictions were—suddenly spiked. The training became erratic and unreliable. Gradient norms (a measure of how strongly the network was learning) became wildly unstable.
The culprit was mathematical. When you stack multiple HC layers on top of each other, the learnable mixing matrices get multiplied together. Since these matrices were unconstrained (their entries could take any value), the multiplication could amplify or shrink signals dramatically. The researchers found that the composite effect of these multiplied matrices could amplify signals by a factor of 3,000 in some cases. This is catastrophic for training.
Here’s why: in a well-designed neural network, information should flow smoothly through all layers without exploding or vanishing. The identity mapping property ensures this. But HC, in its original form, broke this property. With unconstrained mixing matrices, signals could grow exponentially or shrink to near zero as they propagated forward through the network, and the same was true of gradients flowing backward during learning. This is precisely the kind of instability that residual connections were designed to prevent.
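A toy numerical experiment makes the danger tangible. The sizes and perturbation scale below are illustrative choices, not the paper’s configuration, and the random matrices are only a stand-in for freely learned mixing weights; the point is that the worst-case gain of the composite map compounds with depth.

```python
import torch

torch.manual_seed(0)
n_streams, depth = 4, 48  # illustrative sizes, not the paper's configuration

# Stand-in for freely learned mixing matrices: identity plus a random perturbation.
layers = [torch.eye(n_streams) + 0.2 * torch.randn(n_streams, n_streams) for _ in range(depth)]

product = torch.eye(n_streams)
for m in layers:
    product = m @ product

# Spectral norm of the composite map: how much a signal can be amplified
# after passing through every layer's mixing matrix.
gain = torch.linalg.matrix_norm(product, ord=2).item()
print(f"worst-case amplification after {depth} layers: {gain:.1f}")  # typically far above 1
```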
There was also a practical problem: memory access overhead. The parallel streams required loading and storing much more data from GPU memory, which is often the real bottleneck in training (the “memory wall” problem). This added significant latency, and the original HC design did not address it.
The Elegant Solution: Manifold Constraints
Rather than abandoning HC entirely, the DeepSeek team asked a clever question: what if we could constrain the mixing matrices to preserve the stability properties of residual connections while still allowing information exchange?
The answer came from a surprising place: doubly stochastic matrices. These are matrices where every row sums to 1 and every column sums to 1, and all entries are non-negative. They have a beautiful mathematical property: when you multiply them together, the result is still a doubly stochastic matrix. This is called “compositional closure.”
Why does this matter? Because applying a doubly stochastic matrix to the streams computes a weighted average of them, and multiplying such matrices together only ever produces another doubly stochastic matrix. The spectral norm (a measure of how much a matrix can amplify a signal) is bounded by 1, so signals can’t explode. It’s like having a mathematical governor on the mixing matrices: they can still shuffle and recombine information, but they can’t amplify it beyond control.
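A quick self-contained check of both properties, using made-up matrices: build doubly stochastic matrices as mixtures of permutation matrices (Birkhoff’s theorem guarantees any such mixture is doubly stochastic), multiply a long chain of them, and confirm the result still has unit row and column sums and a spectral norm of at most 1.

```python
import torch

torch.manual_seed(0)
n, depth = 4, 48

def random_doubly_stochastic(n: int) -> torch.Tensor:
    # Birkhoff: any convex combination of permutation matrices is doubly stochastic.
    weights = torch.rand(8)
    weights = weights / weights.sum()
    perms = [torch.eye(n)[torch.randperm(n)] for _ in range(8)]
    return sum(w * p for w, p in zip(weights, perms))

product = torch.eye(n)
for _ in range(depth):
    product = random_doubly_stochastic(n) @ product

print(product.sum(dim=1))  # rows still sum to 1
print(product.sum(dim=0))  # columns still sum to 1
print(torch.linalg.matrix_norm(product, ord=2).item())  # spectral norm <= 1 (up to float error)
```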
The researchers called this approach mHC: Manifold-Constrained Hyper-Connections. The “manifold” is the mathematical space defined by doubly stochastic matrices—called the Birkhoff polytope in the literature.
To enforce this constraint during training, they used the Sinkhorn-Knopp algorithm, a classical method from 1967 that takes any positive matrix and iteratively normalizes its rows and columns until they each sum to 1. The algorithm is elegant: alternately rescale rows and columns until the constraints are satisfied. After about 20 iterations, the matrix is doubly stochastic to a close approximation.
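Here is a minimal Sinkhorn-Knopp projection in PyTorch showing the alternating normalization. The iteration count and the elementwise exponential used to keep entries positive are illustrative choices, not necessarily the paper’s exact recipe.

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Map an unconstrained square matrix to an (approximately) doubly
    stochastic one by alternately normalizing rows and columns."""
    m = torch.exp(logits)  # ensure strictly positive entries
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)  # make rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)  # make columns sum to 1
    return m

raw = torch.randn(4, 4)      # an unconstrained, learnable mixing matrix
mix = sinkhorn_knopp(raw)
print(mix.sum(dim=1), mix.sum(dim=0))  # both approximately all-ones
```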
The mathematical properties are compelling:
- Norm Preservation: The maximum amplification is bounded, preventing gradient explosion.
- Compositional Closure: Multiple constrained matrices multiplied together remain constrained, so stability is maintained across all depths.
- Convex Combination Interpretation: The mixing acts as a convex combination of features, which is inherently stable.
When the expansion rate is set to n = 1 (a single stream), the only 1×1 doubly stochastic matrix is the number 1, so the constraint degenerates to the identity mapping and recovers the original residual connection. In other words, mHC is a generalization that includes standard residual connections as a special case.
Engineering for the Real World
Having a theoretically sound idea is one thing. Making it practical is another. The researchers faced a challenge: the Sinkhorn-Knopp iterations and the expanded stream operations add computational overhead. In a field where every percentage point of efficiency matters, they needed to minimize this cost.
They employed several sophisticated engineering techniques:
Kernel Fusion: Rather than executing each operation separately (which requires moving data to and from GPU memory multiple times), they fused multiple operations into single GPU kernels. This reduces memory bandwidth bottlenecks. They also reordered operations, moving normalization to happen after matrix multiplications rather than before, a rewrite that is mathematically equivalent but cheaper to execute.
Mixed Precision: They used different numerical precisions (bfloat16 for some operations, float32 for others) to maximize speed without sacrificing accuracy.
Selective Recomputing: The expanded streams require storing more intermediate activations during forward passes. Rather than storing everything, they strategically recompute some activations during backpropagation, trading computation time for memory savings (a minimal sketch of this idea appears after these techniques). They derived the optimal recomputation block size mathematically to minimize total memory usage.
Communication Overlapping: In distributed training across multiple GPUs, they extended the DualPipe schedule to overlap communication (sending data between GPUs) with computation. This prevents the network from becoming a bottleneck.
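As a hedged illustration of the general recomputation idea, here is a sketch using PyTorch’s stock torch.utils.checkpoint utility rather than whatever custom kernels the paper actually uses; the block being wrapped is an arbitrary stand-in.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A stand-in block; in practice this would be one of the expanded-stream segments.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Checkpointed forward: intermediate activations inside `block` are discarded
# after the forward pass and recomputed during backpropagation, trading extra
# compute for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 1024]); the gradient matches the uncheckpointed result
```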
The result: mHC with an expansion rate of 4 adds only 6.7% training overhead compared to the baseline. This is remarkable given the additional complexity.
The Results: Stability and Performance
The empirical validation is convincing. In a 27-billion-parameter model trained on a proportionally scaled dataset:
Stability: The training loss converges smoothly without the instability spikes seen in HC. Gradient norms remain stable throughout training. The propagation gain magnitude (how much signals amplify through layers) stays bounded at around 1.6 in the worst case, compared to roughly 3,000 for unconstrained HC.
Performance: On downstream benchmarks, mHC outperforms both the baseline and HC on most tasks. On reasoning-heavy benchmarks like BBH (Big-Bench Hard), mHC achieves 51.0% accuracy compared to 48.9% for HC and 43.8% for baseline. On DROP (a reading comprehension task requiring discrete reasoning), it reaches 53.9% compared to 51.6% for HC.
Scaling: The performance advantage scales well. Testing on 3B, 9B, and 27B models shows consistent improvements. The token scaling curve (performance over the course of training) demonstrates that benefits are maintained throughout the training process.
Stability Analysis: Visualizations of the learned mixing matrices show that mHC produces stable, well-behaved patterns. In contrast, HC produces extreme values in some entries and near-zero in others—the hallmark of instability.
Why This Matters
On the surface, this is a technical fix to a technical problem. But the implications are broader.
First, it demonstrates that theoretical elegance and practical efficiency aren’t always at odds. By understanding the mathematics deeply, the researchers found a solution that is theoretically grounded, practically efficient, and empirically effective.
Second, it opens new directions for architecture design. Rather than treating residual connections as fixed and unchangeable, mHC shows how to extend them while preserving their stability properties. This suggests other manifolds might be worth exploring for different objectives.
Third, it’s a reminder that scaling AI isn’t just about bigger models and more data. It’s about getting the architecture right. Many papers focus on what networks learn, but this work focuses on how they learn—the underlying dynamics of training. These dynamics matter more than people often appreciate.
Looking Forward
The authors note that their framework is general. While they use doubly stochastic matrices, other manifold constraints could be explored. Perhaps different constraints would optimize different trade-offs between stability and expressivity. Perhaps some manifolds would be better for certain types of tasks.
They also emphasize that this work is about macro-architecture—the overall topology of how layers connect. Most recent AI research focuses on micro-architecture (how individual attention heads or feed-forward layers work) or scaling laws (how performance improves with more parameters and data). But macro-architecture—how layers talk to each other—deserves more attention.
The practical impact is already clear: DeepSeek has integrated mHC into their large-scale training pipeline and reports that it enables stable training at scale with meaningful performance improvements. As models continue to grow larger and deeper, techniques like this become increasingly important.
Conclusion
The mHC paper is a masterclass in how to solve a hard problem. The researchers identified a real instability issue in a promising new architecture, diagnosed the root cause mathematically, found an elegant solution using classical mathematics, engineered it efficiently, and validated it empirically at scale.
It’s a reminder that progress in AI isn’t always about novel ideas from scratch. Sometimes it’s about understanding existing ideas deeply enough to see where they break, and having the mathematical sophistication to fix them in ways that preserve their benefits.
As large language models become increasingly central to AI systems, the stability and efficiency of their training becomes more critical. Techniques like mHC—which improve both—will likely become standard practice. And who knows? The manifold-constrained approach might inspire new architectural innovations that we haven’t even imagined yet.
The tower of blocks is now more stable. And researchers can continue building higher.