We ran a fun podcast earlier this week with Divyansh Kaushik talking about the tech bros vs. MAGA fight, where we got into the implications for immigration and AI policy, as well as education and the Asian immigrant experience in America. Check it out on iTunes, Spotify, or your favorite podcast app.
DeepSeek’s Three Edges
An excerpt from Kevin Xu’s excellent Interconnected Substack
Three idiosyncratic advantages that make DeepSeek a unique beast.
These are idiosyncrasies that few, if any, leading AI labs in the US, China, or elsewhere share. Understanding them is important so we neither over-extrapolate nor underestimate what DeepSeek's success means in the grand scheme of things.
No Business Model
DeepSeek was incubated out of a quant fund called High Flyer Capital, and its AI models have no business model. How this came about can be understood from its unique corporate history.
High Flyer Capital’s founder, Liang Wenfeng, studied AI as an undergraduate at Zhejiang University (a leading Chinese university) and was a serial and struggling entrepreneur right out of college. He finally found success in the quantitative trading world, despite having no experience in finance, but he’s always kept an eye on frontier AI advancement.
When ChatGPT took the world by storm in November 2022 and lit the way for the rest of the industry with the Transformer architecture coupled with powerful compute, Liang took note. DeepSeek, as an AI lab, was spun out of the hedge fund six months after ChatGPT's launch. It is internally funded by the investment business, and its compute resources are reallocated from the algorithmic trading side, which had acquired 10,000 A100 Nvidia GPUs to improve its AI-driven trading strategy long before US export controls were put in place.
DeepSeek’s stated mission was to pursue pure research in search of AGI. This idealistic and somewhat naive mission – not so dissimilar to OpenAI’s original mission – turned off all the venture capitalists Liang initially approached. DeepSeek’s failure to raise outside funding became the reason for its first idiosyncratic advantage: no business model.
The lack of a business model, and of any expectation to commercialize its models in a meaningful way, gives DeepSeek's engineers and researchers a luxurious setting in which to experiment, iterate, and explore. Despite having limited GPU resources due to export controls and a smaller budget than other tech giants, there is no internal coordination, bureaucracy, or politics to navigate to get compute resources. No one has to choose between using GPUs to run the next experiment and using them to serve the next customer to generate revenue.
Almost no other leading AI lab or startup in either the US or China has this advantage. OpenAI used to have this luxury, but it is now under immense revenue and profit pressure. Evidently, OpenAI’s “AGI clause” with its benefactor, Microsoft, includes a $100 billion profit milestone! Every other AI shop you’ve heard of – Anthropic, Mistral, xAI, Cohere, 01.ai, Moonshot – has revenue and commercialization expectations in one flavor or another. That inevitably leads to constant internal friction between the sales team, which needs to sell compute capacity to make money, and the R&D team, which needs to use compute capacity to make technical progress.
But not DeepSeek! Have a hunch for an architectural breakthrough? Do a training run and see what happens. Want to test out some data format optimization to reduce memory usage? Go test it out.
Runs Own Datacenter
DeepSeek’s second idiosyncratic advantage is that the team runs its own data centers. Unlike OpenAI, which has to use Microsoft’s Azure, or Anthropic, which has to use Amazon’s AWS, or 01.ai, which has to use Alibaba’s cloud platform, DeepSeek racks its own servers.
To be clear, having a hyperscaler’s infrastructural backing has many advantages. Not needing to manage your own infrastructure, and being able to simply assume that the GPUs will be there, frees up the R&D team to do what they are good at – which is not managing infrastructure. However, having to work with another team or company to obtain your compute resources also adds both technical and coordination costs, because every cloud works a little differently. Meanwhile, when you are resource constrained, or “GPU poor”, and thus need to squeeze every drop of performance out of what you have, knowing exactly how your infrastructure is built and operated can give you a leg up in knowing where and how to optimize.
Software-to-Hardware Optimization Expertise
If you combine the first two idiosyncratic advantages – no business model plus running your own datacenter – you get the third: a high level of software optimization expertise on limited hardware resources.
This expertise was on full display up and down the stack in the DeepSeek-V3 paper. By far the most interesting section (at least to a cloud infra nerd like me) is the "Infrastructures" section, where the DeepSeek team explains in detail how it managed to reduce the cost of training at the framework, data format, and networking levels.
Its training framework, called HAI-LLM, was built from scratch by DeepSeek engineers. To increase training efficiency, this framework includes a new and improved parallel processing algorithm, DualPipe. At the heart of training any large AI model is parallel processing, where each accelerator chip calculates a partial answer to the complex mathematical equations involved before all the parts are aggregated into the final answer. The efficiency of your parallel processing therefore determines how well you can maximize the compute power of your GPU cluster.
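To make the "partial answers, then aggregate" idea concrete, here is a minimal, purely illustrative Python sketch. It is a toy data-parallel split, not DeepSeek's actual DualPipe pipeline scheduling, and the matrix shapes and the eight simulated "devices" are arbitrary assumptions.

```python
# Illustrative sketch only: split one big computation into per-device partial
# results and aggregate them. This is NOT DualPipe, just the general idea of
# parallel processing described above.
import numpy as np

def partial_matmul(x_shard: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Each simulated 'device' computes the output for its shard of the batch."""
    return x_shard @ w

def parallel_forward(x: np.ndarray, w: np.ndarray, num_devices: int) -> np.ndarray:
    shards = np.array_split(x, num_devices, axis=0)    # scatter the batch across devices
    partials = [partial_matmul(s, w) for s in shards]  # each device produces a partial answer
    return np.concatenate(partials, axis=0)            # aggregate the parts into the final answer

x = np.random.randn(64, 128)
w = np.random.randn(128, 256)
assert np.allclose(parallel_forward(x, w, num_devices=8), x @ w)
```

The real scheduling problem, which DualPipe addresses, is keeping every device busy and overlapping its computation with the communication needed to move those partial answers around.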
This framework also changed the data format of many of the input values to floating point 8, or FP8. FP8 is a less precise data format than FP16 or FP32. As an analogy, think of the number of decimal places: FP32 keeps more digits than FP8, so each number takes more memory to store. Reducing precision means storing these numbers takes up less memory. The bet is that the precision reduction will not negatively impact the accuracy or capabilities of the resulting model. This family of techniques, broadly called quantization, is an envelope that many AI researchers are pushing to improve training efficiency; DeepSeek-V3 is the latest, and perhaps the most effective, example of FP8 achieving a notably smaller memory footprint.
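As a rough back-of-envelope illustration (mine, not the paper's), here is what the memory arithmetic looks like. numpy has no native FP8 type, so the FP8 row is simply computed by hand at one byte per value.

```python
# Back-of-envelope sketch of why lower-precision formats shrink memory footprint.
import numpy as np

n_params = 671_000_000_000               # DeepSeek-V3's total parameter count
bytes_per = {"FP32": 4, "FP16": 2, "FP8": 1}

for fmt, nbytes in bytes_per.items():
    print(f"{fmt}: {n_params * nbytes / 1e9:,.0f} GB just to store the weights")

# Precision loss in action: the same value stored with fewer bits keeps fewer digits.
x = np.float32(3.14159265)
print(np.float32(x), np.float16(x))      # FP16 already rounds away precision; FP8 rounds more
```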
The networking-level optimization is probably my favorite part to read and nerd out about. There are two networking products in an Nvidia GPU cluster – NVLink, which connects the GPU chips to one another inside a node, and InfiniBand, which connects the nodes to one another across the data center. In the H-series, a node or server usually has eight chips connected together with NVLink. Since we know that DeepSeek used 2,048 H800s, there are likely 256 nodes of 8-GPU servers, connected by InfiniBand. NVLink has much higher bandwidth than InfiniBand, so it is not hard to imagine that in a complex training run with hundreds of billions of parameters (DeepSeek-V3 has 671 billion total parameters), with partial answers being passed around between thousands of GPUs, the network gets pretty congested and the entire training process slows down. To reduce networking congestion and get the most out of the precious few H800s it possesses, DeepSeek designed its own load-balancing communications kernel that exploits the bandwidth difference between NVLink and InfiniBand to maximize cross-node all-to-all communication between the GPUs, so that each chip is always working on some partial answer and never waiting around for something to do.
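Here is the cluster arithmetic in code, plus a toy model of why keeping traffic on NVLink matters. The bandwidth figures are rough public numbers I am assuming for illustration, not measurements of DeepSeek's cluster, and the actual communications kernel is far more sophisticated than this.

```python
# Toy sketch of the cluster arithmetic and of the NVLink vs. InfiniBand gap.
TOTAL_GPUS = 2048
GPUS_PER_NODE = 8                        # typical H-series server
NUM_NODES = TOTAL_GPUS // GPUS_PER_NODE  # 256 nodes connected by InfiniBand

NVLINK_GBPS = 400      # assumed intra-node bandwidth per GPU (GB/s), illustrative
INFINIBAND_GBPS = 50   # assumed inter-node bandwidth per GPU (GB/s), illustrative

def same_node(rank_a: int, rank_b: int) -> bool:
    """True if two GPU ranks sit inside the same 8-GPU server."""
    return rank_a // GPUS_PER_NODE == rank_b // GPUS_PER_NODE

def transfer_time_ms(gigabytes: float, rank_a: int, rank_b: int) -> float:
    """Time to move a message, depending on whether it stays on NVLink."""
    bw = NVLINK_GBPS if same_node(rank_a, rank_b) else INFINIBAND_GBPS
    return gigabytes / bw * 1000

print(NUM_NODES)                      # 256
print(transfer_time_ms(1.0, 0, 7))    # intra-node hop: fast NVLink path
print(transfer_time_ms(1.0, 0, 8))    # cross-node hop: much slower InfiniBand path
```

A load-balancing kernel, in essence, routes and batches traffic so that as much of it as possible rides the fast intra-node links while the scarce cross-node bandwidth stays saturated but not clogged.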
I don’t pretend to understand every technical detail in the paper. And I don’t want to oversell DeepSeek-V3 as more than what it is – a very good model with performance comparable to other frontier models and an extremely good cost profile.
However, what DeepSeek has achieved may be hard to replicate elsewhere. Its team and setup – no business model, its own datacenter, software-to-hardware expertise – resemble an academic research lab with sizable compute capacity and a sizable budget, but no grant-writing or journal-publishing pressure, more than they do its peers in the fiercely competitive AI industry. These idiosyncrasies are what I think really set DeepSeek apart.
How Much Did They Really Spend?
Nathan Lambert recently published an excellent breakdown of DeepSeek V3’s technical innovations and probed more deeply into the $6M training cost claim. An excerpt is below.
The cumulative question of how much total compute was used in experimentation for a model like this is much trickier. Common practice in language modeling laboratories is to use scaling laws to de-risk ideas for pretraining, so that very little time is spent training at the largest sizes on ideas that do not result in working models. This looks like thousands of runs at a very small size, likely 1B-7B parameters, on intermediate amounts of data (anywhere from Chinchilla-optimal to 1T tokens). Surely DeepSeek did this. The total compute used for DeepSeek V3's pretraining experiments would likely be 2-4 times the number reported in the paper.
A true cost of ownership of the GPUs — to be clear, we don’t know if DeepSeek owns or rents the GPUs — would follow an analysis similar to the SemiAnalysis total cost of ownership model (paid feature on top of the newsletter) that incorporates costs in addition to the actual GPUs. For large GPU clusters of 10K+ A/H100s, line items such as electricity end up costing over $10M per year. The CapEx on the GPUs themselves, at least for H100s, is probably over $1B (based on a market price of $30K for a single H100).
These costs are not necessarily all borne directly by DeepSeek – i.e., they could be working with a cloud provider – but their cost for compute alone (before anything like electricity) is at least in the hundreds of millions of dollars per year.
For one example, consider that the DeepSeek V3 paper lists 139 technical authors. This is a very large technical team. With headcount costs that can also easily be over $10M per year, an estimate for a year of operations at DeepSeek AI comes out closer to $500M (or even $1B+) than to any of the $5.5M numbers tossed around for this model. The success here is that they’re relevant alongside American technology companies that are spending close to, or more than, $10B per year on AI models.
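To make the orders of magnitude concrete, here is a crude back-of-envelope sketch of this argument. Every figure in it is an assumption pulled from the discussion above (a hypothetical cluster size, ~$30K per H100, the electricity and headcount line items), not a number reported by DeepSeek.

```python
# Back-of-envelope sketch of the "true cost" argument. All inputs are assumptions.
REPORTED_TRAINING_COST = 5.6e6   # the headline ~$5.6M figure for the final run
EXPERIMENT_MULTIPLIER = (2, 4)   # scaling-law and ablation runs on top of the final run

CLUSTER_GPUS = 10_000            # hypothetical A/H100-class cluster size
GPU_UNIT_PRICE = 30_000          # ~$30K market price per H100
ELECTRICITY_PER_YEAR = 10e6      # the ">$10M per year" line item cited above
HEADCOUNT_PER_YEAR = 10e6        # 139+ authors; easily >$10M per year

capex = CLUSTER_GPUS * GPU_UNIT_PRICE   # ~$300M at 10K GPUs; a fleet several times
                                        # larger clears the $1B figure mentioned above
experiments = tuple(m * REPORTED_TRAINING_COST for m in EXPERIMENT_MULTIPLIER)

print(f"GPU CapEx: ${capex / 1e6:.0f}M")
print(f"Pretraining experiments: ${experiments[0] / 1e6:.0f}M-${experiments[1] / 1e6:.0f}M")
print(f"Recurring opex (power + headcount): ${(ELECTRICITY_PER_YEAR + HEADCOUNT_PER_YEAR) / 1e6:.0f}M/yr")
```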
At this rate, it will indeed be possible to train a model with the performance of DeepSeek V3 for ~$5.5M in a few years. For now, the true costs are far higher, as they involve a combination of extending open-source tools like the OLMo code and poaching expensive employees who can re-solve problems at the frontier of AI.
The paths are clear. DeepSeek shows that a lot of the modern AI pipeline is not magic – it’s consistent gains accumulated through careful engineering and decision-making. In the face of the dramatic capital expenditures from Big Tech, billion-dollar fundraises from Anthropic and OpenAI, and continued export controls on AI chips, DeepSeek has made it far further than many experts predicted. The ability to make cutting-edge AI is not restricted to a select cohort of the San Francisco in-group. The costs are currently high, but organizations like DeepSeek are cutting them down by the day.
Earlier last year, many would have thought that scaling and GPT-5-class models would come at a cost that DeepSeek could not afford. And with Meta integrating its Llama models ever more deeply into its products, from recommendation systems to Meta AI, it was also the expected winner in open-weight models. Today, both of these assumptions look refuted. Meta will have to use its financial advantages to close the gap – this is a possibility, but not a given. I certainly expect a Llama 4 MoE model within the next few months and am even more excited to watch this story of open models unfold.
Hardware Alone Can’t Win the AI Race
Ritwik Gupta is a PhD candidate and AI researcher at UC Berkeley. In this piece, he examines the overlooked role of software in export controls.
The Chinese large language model DeepSeek-V3 has recently made waves, achieving unprecedented efficiency and even outperforming OpenAI’s state-of-the-art models on some benchmarks. This is an eyebrow-raising advancement given the USA’s multi-year export control project, which aims to restrict China’s access to advanced semiconductors and slow frontier AI advancement.
Trained on just 2,048 NVIDIA H800 GPUs over roughly two months, DeepSeek-V3 used about 2.79 million GPU hours in total, per the DeepSeek-V3 technical report, at a cost of approximately $5.6 million – a stark contrast to the hundreds of millions typically spent by major American tech companies. The NVIDIA H800 was designed to be permitted for export – it’s essentially a nerfed version of the more powerful NVIDIA H100 GPU. DeepSeek’s success was largely driven by new takes on commonplace software techniques, such as Mixture-of-Experts, FP8 mixed-precision training, and distributed training, which allowed it to achieve frontier performance with limited hardware resources.
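The headline cost is simple arithmetic once you adopt the rental-price convention the technical report uses (roughly $2 per H800 GPU hour, which is the report's stated assumption); the sketch below just checks that the pieces are consistent.

```python
# Sanity check of the headline figure, assuming the report's ~$2/GPU-hour rental rate.
GPU_HOURS = 2.788e6        # total H800 GPU hours reported for DeepSeek-V3 training
PRICE_PER_GPU_HOUR = 2.0   # assumed rental rate in the report, USD

print(f"${GPU_HOURS * PRICE_PER_GPU_HOUR / 1e6:.2f}M")                 # ~ $5.58M
print(f"{GPU_HOURS / 2048 / 24:.0f} days of continuous training on 2,048 GPUs")  # ~ 57 days
```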
Mixture-of-Experts (MoE) models combine multiple smaller expert models to make better predictions – a technique reportedly used by ChatGPT, Mistral, and Qwen. DeepSeek introduced a new method for selecting which experts handle specific queries, improving MoE performance. Mixed-precision training, first introduced by Baidu and NVIDIA, is now a standard technique in which the numerical precision of a model is selectively reduced from 32 bits to 16 bits. DeepSeek-V3, interestingly, reduces the precision of parts of the model even further, to 8 bits, during training – a configuration not commonly seen previously. DeepSeek also crafted its own model training software that optimized these techniques for its hardware, minimizing communication overhead and making effective use of CPUs wherever possible.
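For readers who have not seen MoE routing before, here is a minimal sketch of generic top-k expert routing. It illustrates the general idea only, not DeepSeek's specific routing scheme, and all the sizes (8 experts, top-2 routing, 16-dimensional tokens) are made up for the example.

```python
# Minimal sketch of top-k expert routing, the core idea behind Mixture-of-Experts.
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, D_MODEL = 8, 2, 16

experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(NUM_EXPERTS)]
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs by gate score."""
    logits = x @ router_w                                   # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]           # indices of the chosen experts
    gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax gate scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            out[t] += gates[t, e] * (x[t] @ experts[e])     # only the selected experts run
    return out

tokens = rng.standard_normal((4, D_MODEL))
print(moe_forward(tokens).shape)    # (4, 16): each token touched only 2 of the 8 experts
```

The efficiency win is that only a fraction of the model's parameters are exercised per token; the hard part, and where DeepSeek's routing work matters, is choosing experts so that load stays balanced and quality does not degrade.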
This remarkable achievement highlights a critical dynamic in the global AI landscape: the increasing ability to achieve high performance through software optimizations, even under constrained hardware conditions. A recent paper I coauthored argues that these trends effectively nullify American hardware-centric export controls — that is, playing “Whack-a-Chip” as new processors emerge is a losing strategy. We reverse-engineer from source code how Chinese firms, most notably Tencent, have already demonstrated the ability to train cutting-edge models on export-compliant GPUs by leveraging sophisticated software techniques.
We explore techniques including model ensembling, mixed-precision training, and quantization — all of which enable significant efficiency gains. By improving the utilization of less powerful GPUs, these advancements reduce dependency on state-of-the-art hardware while still allowing for significant AI advancements. DeepSeek-V3’s advanced capabilities appear to validate the paper’s thesis.
As software-driven efficiencies accelerate, resource-constrained entities will increasingly be able to compete with larger, well-funded organizations. But by focusing predominantly on hardware, U.S. policymakers have overlooked the transformative potential of software innovations, inadvertently enabling adversaries to maintain technological parity through creative workarounds.
Hardware-only export control strategies can be made more effective by anchoring them to concrete benchmarks that account for changing software. The field of machine learning has progressed over the last decade in large part due to benchmarks and standardized evaluations. The United States’ security apparatus should first concretely define the types of workloads it seeks to prevent adversaries from executing. Then, it should work with the newly established AI Safety Institute at NIST to establish continuous benchmarks for such tasks that are updated as new hardware, software, and models become available. A data-driven approach can provide more comprehensive assessments of how adversaries could achieve particular goals and inform how technologies should be controlled.
Simultaneously, the United States needs to explore alternate routes of technology control as competitors develop their own domestic semiconductor markets. Limiting the ability of American semiconductor companies to compete in the international market is self-defeating. The United States restricts the sale of commercial satellite imagery by capping the resolution at the level of detail already offered by international competitors – a similar strategy for semiconductors could prove to be more flexible.