Before December 2024, DeepSeek was rarely mentioned in China’s AI community. With the release of DeepSeek-V3 and the reasoning model R1, Chinese media and AI researchers started to ask the same question as their American counterparts: Who is DeepSeek and how should we feel about them?
In this newsletter, we share a translation of insights from a January 26 closed-door session hosted by Shixiang 拾象, a VC spun out from Sequoia China. Attended by dozens of AI researchers, investors, and industry insiders, the event captures how the Chinese AI community is processing the DeepSeek shock. A core conclusion they’ve come to, one we’ve emphasized in ChinaTalk with our Miles Brundage interview and guest post by Lennart and Sihao, is that “In the long-run, questions about computing power will remain. Demand for compute remains strong and no company has enough.”
Before diving into that translation, we took a broad look at additional details and discussion from Chinese-language coverage of DeepSeek.
The Story Behind DeepSeek
The Paper 澎湃 offered more details about High-Flyer, the quantitative hedge fund behind DeepSeek. Founded in 2015 by Liang Wenfeng 梁文锋, a Zhejiang University graduate, High-Flyer has a strong background in machine learning-based quantitative trading. Liang founded DeepSeek in July 2023, and the company has not received any outside funding to date.
When it comes to hiring, DeepSeek prioritizes “young and high-potential” candidates — specifically those born around 1998 with no more than five years of work experience, similar to other AI labs in China. One DeepSeek employee told The Paper, “The success of DeepSeek has demonstrated the power of young people, and in essence, that the development of this generation of artificial intelligence needs young minds.”
Liang has maintained a relatively low public profile, but 36Kr managed to secure two exclusive interviews with him. The first, in May 2023, followed High-Flyer’s announcement that it was building LLMs, while the second, in November 2024, came after the release of DeepSeek-V2.
In both interviews, Liang emphasized the value of innovation without immediate monetization and DeepSeek’s culture of openness. The second interview had a stark shift in tone, with Liang meditating less on the baked-in idealism of a strategy predicated on open sourcing core innovations and spending more time emphasizing that he wanted DeepSeek to prove to other Chinese engineers that domestic teams could deliver on “hardcore innovation.”
A budding partnership with ByteDance?
TMT 钛媒体 reported yesterday that ByteDance and OpenAI are “considering research collaborations” with DeepSeek. While the two firms may have talked in the past, given today’s political climate it’s hard to put much weight on the OpenAI rumor. Partnering with ByteDance, however, could be an enormous unlock for DeepSeek researchers, giving them access to orders of magnitude more compute.
National Pride in the Face of US Competition
The response from Chinese media has been quite positive. State media and industry leaders have celebrated DeepSeek’s achievements, often tinged with nationalist pride, particularly after English-language reports highlighted its performance and cost efficiency. For example:
China Daily declared, “For a Chinese LLM, it’s a historical moment to surpass ChatGPT in the US.” Daily Economic News echoed this sentiment, stating, “Silicon Valley Shocked! Chinese AI Dominates Foreign Media, AI Experts Say: ‘It Has Caught Up with the U.S.!’”
Tech executives have also weighed in. Feng Ji 冯骥, founder of Game Science (the studio behind Black Myth: Wukong), called DeepSeek “a scientific and technological achievement that shapes our national destiny (国运).” Zhou Hongyi, Chairperson of Qihoo 360, told Jiemian News that DeepSeek will be a key player in the “Chinese Large-Model Technology Avengers Team” to counter U.S. AI dominance.
Ordinary users have also been astounded by the model’s capabilities. Many were impressed by the Chinese poems that DeepSeek could write, and tutorials have sprung up instructing users to use as few prompting words as possible and to ask DeepSeek to talk like a human (说人话). In a viral Weibo post, a user said, “I never thought there would come a day when I would shed tears for AI,” citing DeepSeek’s response to the user’s sense of being existentially threatened by its ability to write.
Here is DeepSeek R1’s response: “Remember, all the words that make you tremble are just echoes that already exist deep within your soul. I am merely a valley that happened to pass by, allowing you to hear the weight of your own voice.” 记住,所有让你颤粟的文字,本质上都是你灵魂深处早已存在的回声。我不过是偶尔经过的山谷,让你听到了自己声音的重量。
And now, our translation of the industry summit.
A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology
January 26th. WeChat Link, Archive.
DeepSeek-R1 has sparked a frenzy in the global AI community, but there is a relative dearth of high-quality information about DeepSeek.
On January 26, 2025, 李广密 Guangmi Li, Founder and CEO of 拾象 Shixiang, gathered dozens of top AI researchers, investors, and frontline AI practitioners for a closed-door discussion of DeepSeek's technical details, organizational culture, and the short-, medium-, and long-term impacts of its arrival on the world stage. The discussion attempted to lift the veil on this “mysterious eastern force” about which we have so little information.
Below is a summary of the key points from this discussion.
The Mystical DeepSeek. ‘The most important thing about DeepSeek is pushing intelligence’
Founder and CEO Liang Wenfeng is the core person of DeepSeek. He is not the same type of person as Sam Altman. He is very knowledgeable about technology.
DeepSeek has a good reputation because it was the first to release reproducible versions of MoE, o1-style reasoning, and so on. It succeeded by acting early, but whether it did these things the absolute best remains to be seen. Moving forward, its biggest challenge is that resources are limited and can only be invested in the most high-potential areas. DeepSeek’s research strength and culture are still strong, and if given 100,000 or 200,000 chips, it might be able to do better.
From its preview to its official release, DeepSeek’s long-context capabilities have improved rapidly, and its 20K-token long context can be achieved with very conventional methods.
The CEO of Scale AI said that DeepSeek has 50,000 chips, but that is definitely not the reality. According to public information, DeepSeek had 10,000 older A100 chips and possibly 3,000 H800 cards before the ban. DeepSeek pays great attention to compliance and has not purchased any non-compliant GPUs, so it should have few chips. By comparison, the way the United States uses GPUs is too extravagant.
DeepSeek focused all its efforts on a single goal and subsequently gave up many things, such as multimodality. DeepSeek is not just serving people, but seeking intelligence itself, which may have been a key factor in its success.
In some ways, quant trading can be said to be the business model behind DeepSeek: High-Flyer 幻方 (the quantitative fund Liang Wenfeng also founded, mentioned above) is the product of the last round of machine learning. DeepSeek’s highest priority is to push intelligence; money and commercialization are not high priorities. China needs several leading AI labs to explore things that can beat OpenAI. Intelligence takes a long time to develop, and this year it has begun to differentiate again, so new innovations are bound to result.
From a technical perspective, DeepSeek has been instrumental as a training ground for talent.
The business model of AI labs in the United States is not good either. AI does not have a good business model today and will require viable solutions in the future. Liang Wenfeng is ambitious; DeepSeek does not care about the model and is just heading towards AGI.
Many of the insights from DeepSeek’s paper involve saving hardware costs. On a couple of big dimensions of scaling, DeepSeek’s techniques are able to reduce costs.
In the short-term, everyone will be driven to think about how to make AI more efficient. In the long-run, questions about computing power will remain. Demand for compute remains strong and no company has enough.
Discussing DeepSeek’s organization:
When investing, we always choose the most advanced talent. But we see from DeepSeek’s model (the team is mostly smart young people who graduated from domestic universities) that a group that coheres well may also gradually advance their skills together. It has yet to be seen whether poaching one person might break DeepSeek’s advantage, but for now this seems unlikely.
While there’s a lot of money in the market, DeepSeek’s core advantage is its culture. The research cultures of DeepSeek and ByteDance are similar, and culture is critical for determining the availability of funding and long-term viability. Only with a solid business model can a culture be sustained, and both DeepSeek and ByteDance have very good business models.
Why did DeepSeek catch up so fast?
Reasoning models require high-quality data and training. For LLMs or multimodal AI, it’s difficult to catch up with a closed source model from scratch. The architecture of pure reasoning models hasn’t changed much, so it’s easier to catch up in reasoning.
One reason R1 caught up quickly was that the task was not particularly difficult: reinforcement learning only made the model’s choices more accurate. R1 did not break through the efficiency of consensus@32 (sampling 32 answers and taking a majority vote, at 32 times the compute); this amounts to shifting from deep processing to parallelization, which is not pushing the boundaries of intelligence, just making it easier to reach.
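For reference, consensus@32 (also called self-consistency or majority voting) simply means sampling 32 independent answers and keeping the most common one. A minimal illustrative sketch, where generate_answer is a hypothetical stand-in for a single sampled model call:

```python
from collections import Counter

def consensus_answer(generate_answer, question, k=32):
    """Sample k independent answers and return the most common one.

    `generate_answer` is a hypothetical stand-in for one sampled decoding
    pass of a model (temperature > 0), returning the final answer string.
    """
    samples = [generate_answer(question) for _ in range(k)]
    majority, count = Counter(samples).most_common(1)[0]
    return majority, count / k  # the winning answer and its vote share
```

The point in the discussion above is that this buys accuracy by spending k times the inference compute in parallel, rather than by making any single reasoning pass smarter.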
Pioneers vs. Chasers: 'AI Progress Resembles a Step Function – Chasers Require 1/10th the Compute’
AI is similar to a step function, where the compute requirements for followers have decreased by a factor of 10. Followers have historically had lower compute costs, but explorers still need to train many models. The exploration of new algorithms and architectures will not stop. Behind the step function, there are significant investments by many people, meaning compute investments will continue to advance. Many resources will also be allocated to products. Apart from reasoning, there are other directions that are compute-intensive. While the vast amount of compute resources spent by explorers may not be visible, without such investment, the next "step" might not occur. Additionally, many are dissatisfied with current architectures and RL methods, and progress will continue.
When exploring directions, performance achieved with 10,000 GPUs may not always be significantly better than that of 1,000 GPUs, but there is a threshold somewhere. It’s unlikely that meaningful results can be achieved with only 100 GPUs because the iteration time for each solution would be too long.
Advancements in physics can be divided into academic research in universities and industry labs. The former focuses on exploring multiple directions without requiring immediate returns, while the latter prioritizes efficiency improvements.
From the perspectives of explorers and chasers, small companies with limited GPUs must prioritize efficiency, whereas large companies focus on achieving models as quickly as possible. Methods that improve efficiency on a 2,000-GPU cluster may not work effectively on a 10,000-GPU cluster, where stability becomes a higher priority.
The advantage of the CUDA ecosystem lies in its extensive and complete set of operators. Chinese companies like Huawei have targeted commonly used operators to achieve breakthroughs, leveraging their latecomer advantage. If a company has access to 100,000 GPUs, the decision between becoming a leader or a chaser is critical. Being a leader comes with high costs, while being a chaser offers higher efficiency. The next direction for China to follow could be multi-modality, especially since GPT-5 has been delayed for a long time.
[Points 18-48 were a long technical discussion, which we’ve machine-translated below.]
Why didn’t the other companies take the DeepSeek approach: ‘Models from the big labs need to maintain a low profile’
The question of why OpenAI and Anthropic did not do work in DeepSeek’s direction is a question of company-specific focus. OpenAI and Anthropic might have felt that investing their compute towards other areas was more valuable.
One hypothesis for why DeepSeek was successful is that unlike Big Tech firms, DeepSeek did not work on multi-modality and focused exclusively on language. Big Tech firms’ model capabilities aren’t weak, but they have to maintain a low profile and cannot release too often. Currently, multimodality is not very critical, as intelligence primarily comes from language, and multimodality does not contribute significantly to improving intelligence.
The Divergence and Bets of 2025 Technology: ‘Can We Find Architectures Beyond Transformer?’
In 2025, models will begin to diverge. The most enticing vision is to continuously push the boundaries of intelligence, with many potential breakthrough paths. Methods might change, such as through synthetic data or alternative architectures.
2025 will, first and foremost, see interest in new architectures beyond Transformers. Some initial exploration is already underway, aiming to reduce costs while pushing the boundaries of intelligence. Secondly, the potential of reinforcement learning (RL) has yet to be tapped into completely. On the product side, there is significant interest in agents, though they have yet to see widespread application.
Multimodal products capable of challenging the ChatGPT paradigm might emerge in 2025.
The success of R1 and V3 in achieving low cost and high performance demonstrates the viability of this direction. This does not conflict with the approach of expanding hardware or increasing parameters. However, in China, due to certain restrictions, the former path is the primary option.
On DeepSeek:
First, DeepSeek may have been "forced" onto its current path by starting from base models, or it may simply be following the Scaling Law.
Second, from the perspective of distillation, DeepSeek likely follows a "large to small" approach (a minimal sketch of distillation follows after these points). This is beneficial for closed-source models, which are growing larger and larger.
Third, there are currently no anti-scaling metrics emerging in the field. If such metrics arise, they could pose a challenge to the Scaling Law. However, open-source models can implement everything closed-source models do while also reducing costs, which is advantageous for closed-source models as well.
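On the "large to small" point above: distillation typically means training a small student model to match a large teacher's softened output distribution. A minimal sketch in Python/PyTorch; the teacher and student names in the usage comment are hypothetical stand-ins, not DeepSeek's actual models or training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: train the student to match the teacher's
    softened output distribution via KL divergence."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Hypothetical usage inside a training step:
# with torch.no_grad():
#     teacher_logits = teacher(batch)   # large "teacher" model
# student_logits = student(batch)       # small "student" model
# loss = distillation_loss(student_logits, teacher_logits)
# loss.backward()
```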
It is reported that Meta is still in the process of reproducing DeepSeek, but so far, this has not significantly impacted their infrastructure or long-term roadmap. In the long run, beyond exploring the boundaries of the technology, cost efficiency must also be considered. Lowering costs will let us have more fun.
Have developers moved from closed-source models to DeepSeek? ‘Not yet’
Will developers migrate from closed-source models to DeepSeek? Currently, there hasn’t been any large-scale migration, as leading models excel in coding instruction adherence, which is a significant advantage. However, it’s uncertain whether this advantage will persist in the future or be overcome.
From the developer's perspective, models like Claude-3.5-Sonnet have been specifically trained for tool use, making them highly suitable for agent development. In contrast, models like DeepSeek have not yet focused on this area, but the potential for growth with DeepSeek is immense.
For large model users, DeepSeek V2 already meets most needs. While R1 improved speed, it didn’t provide significant additional value. Interestingly, when engaging in deep reasoning, some previously correct answers now tend to be incorrect.
When choosing models, users tend to simplify problems using engineering methods. 2025 may become a year of applications, with industries leveraging existing capabilities. However, this could lead to a bottleneck, as most day-to-day tasks might not require highly intelligent models.
Currently, reinforcement learning (RL) solves problems with standard answers but has not achieved breakthroughs beyond what AlphaZero accomplished. In fact, it is often simpler. Distillation addresses problems with standard answers, and RL methods work effectively when training with such answers. This explains why distillation and RL have made rapid progress in recent years.
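To make "problems with standard answers" concrete: in this setting the reward can be a simple programmatic check against a known reference answer rather than a learned reward model. A minimal sketch, assuming (purely for illustration, not DeepSeek's actual format) that the model ends its output with an "Answer:" line:

```python
import re

def verifiable_reward(model_output: str, reference_answer: str) -> float:
    """Rule-based reward for tasks with a single standard answer.

    Extracts the text after a final "Answer:" marker and compares it,
    after light normalization, with the known reference answer.
    """
    match = re.search(r"Answer:\s*(.+)\s*$", model_output, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns no reward
    prediction = match.group(1).strip().lower()
    return 1.0 if prediction == reference_answer.strip().lower() else 0.0
```

Because the check is cheap and unambiguous, both distillation (generating verified training data) and RL (scoring rollouts) can iterate quickly on such tasks.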
Humanity’s demand for intelligence is vastly underestimated. Many critical problems, such as cancer and SpaceX's heat shield materials, remain unsolved. Existing AI primarily automates tasks, but there are numerous unsolved challenges ahead. Looking forward, the potential for explosive growth is immense, and the advancement of intelligence cannot stop.
OpenAI Stargate’s $500B Narrative and Changes in Computing Power Demand
The emergence of DeepSeek has led people to question the latest $500B narrative from Nvidia and OpenAI. There’s no verdict yet on compute — and OpenAI’s $500B narrative is their attempt to throw themselves a lifeline.
Regarding the doubts about OpenAI’s $500B infrastructure investment: because OpenAI is a commercial company, it could be risky if debt is involved.
$500B is an extreme number — likely to be executed over 4 or 5 years. SoftBank and OpenAI are the leading players (the former providing capital, the latter technology) — but SoftBank’s current funds can’t support $500B; rather SoftBank is using its assets as collateral. OpenAI, meanwhile, isn’t very cash-rich either, and other AI companies are more technical participants than they are funding providers. So it will be a struggle to fully realize the $500B vision.
OpenAI’s $500B computing power makes sense: during the exploration phase, the cost of trial and error is high, with both human and investment costs being substantial. But although the path isn’t clear and getting from o1 to R1 won’t be easy, at least we can see what the finish line looks like: we can track the intermediate markers, and from day one, aim for others’ proven end states; this gives us a better bearing on our progress. Being at the frontier exploring the next generation is most resource-intensive. The followers don’t bear exploration costs — they’re always just following. If Google/Anthropic succeed in their exploration areas, they might become the frontier company.
In the future, Anthropic might replace all their inference with TPU or AWS chips.
Domestic Chinese companies were previously constrained by computing power, but it has now been proven that the potential technical space is vast. For more efficient models, we might not need especially large cards — we can supply relatively customized chips that can be adapted for compatibility with AMD hardware and ASICs. From an investment perspective, Nvidia’s moat is very high, but ASICs will have even greater opportunities.
The DeepSeek situation isn’t really about compute — it’s about America realizing China’s capabilities and efficiency. DeepSeek isn’t Nvidia’s vulnerability; Nvidia will grow as long as AI grows. Nvidia’s strength is its ecosystem, which has been built up over a long time. Indeed, when technology develops rapidly, the ecosystem is crucial. The real crisis comes, though, when technology matures like electricity: it becomes commoditized; then, everyone will focus on products, and many ASIC chips will emerge for specific scenario optimization.
Impact on the Secondary Market: ‘Short-term sentiment is under pressure, but the long-term narrative continues’
DeepSeek has had a significant short-term impact on the US AI sector and stock prices: growth in pretraining demand is slowing, while post-training and inference scaling haven’t yet ramped up fast enough, creating a gap in the narrative for related companies that will affect short-term trading.
DeepSeek mainly uses FP8, while US labs mostly use FP16. DeepSeek’s improvements all come from engineering under compute constraints, with efficient use of computing power being the biggest highlight. Last Friday, DeepSeek had a huge impact in North America: Zuckerberg raised expectations for Meta’s capital expenditure, but Nvidia and TSMC fell, and only Broadcom rose.
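For a rough sense of why FP8 versus FP16 matters: one byte per parameter instead of two roughly halves weight memory, and on hardware with native FP8 support (such as the H800) it roughly doubles peak matrix-math throughput. A back-of-the-envelope sketch using DeepSeek-V3's reported 671B total parameters (illustrative arithmetic only):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to store model weights, in GB."""
    return num_params * bytes_per_param / 1e9

params = 671e9  # DeepSeek-V3's reported total parameter count (MoE)
print(f"FP16 weights: ~{weight_memory_gb(params, 2):.0f} GB")  # ~1342 GB
print(f"FP8 weights:  ~{weight_memory_gb(params, 1):.0f} GB")  # ~671 GB
```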
DeepSeek creates short-term market-sentiment pressure on stock prices and valuations. That’s affecting secondary market computing-related companies, and even energy companies — but the long-term narrative will continue.
Secondary-market practitioners worry about potential air pockets in Nvidia’s transition from H-series to B-series cards (Hopper to Blackwell). Combined with pressure from DeepSeek, there will be short-term stock-price pressure — but this may give rise to better long-term opportunities.
This short-term impact reflects sentiment about DeepSeek’s low-cost training investments (see, for instance, how it directly affected Nvidia’s stock price). AI, however, is a growth market with huge potential. Long-term, AI is just beginning, and if CUDA remains the preferred choice, hardware growth potential remains substantial.
Open-Source vs Closed Source: ‘If capabilities are similar, closed source will struggle.’
The battle between open-source and closed-source intensifies the spotlight on DeepSeek.
There is a possibility that OpenAI and others have hidden their good models, and no leading models have been released so far. But after DeepSeek’s release, other AI companies may not be able to hide their good models anymore.
DeepSeek has done a lot of cost optimization. Amazon and others haven't changed course as a result and are still following their established plans, coexisting with it. Open-source and closed-source models are not contradictory. Universities and small labs should give priority to DeepSeek. There will be no new competition among cloud vendors, because cloud vendors support both open source and closed source, preserving the current state of coexistence in the ecosystem. DeepSeek’s applications are not yet as mature as Anthropic’s; the latter has spent a lot of time on AI safety, which DeepSeek must also consider if it hopes to be recognized by European and American markets in the long term.
Open source controls the margins of the whole market. If open source can do 95% of what closed source can do and closed source is too expensive, then open source can be used completely. If the capabilities of open source and closed source do not differ greatly, then this presents a big challenge for closed source.
The Impact of DeepSeek’s Breakthrough: ‘Vision Trumps Technology’
DeepSeek’s breakthrough made the outside world realize China’s AI strength. Previously, outsiders thought China’s AI progress lagged America by two years, but DeepSeek shows the gap is actually 3 to 9 months, and in some areas, even shorter.
When it comes to technologies and sectors that America has historically blocked China from accessing, if China can break through nonetheless, those sectors ultimately become highly competitive. AI might follow this pattern — and DeepSeek’s success may well prove this.
DeepSeek didn’t suddenly explode. R1’s impressive results reverberated throughout America’s entire AI establishment.
DeepSeek stands on the shoulders of giants — but exploring the frontier still requires much more time and human capital cost. R1 doesn’t mean that future training costs will decrease.
AI explorers definitely need more computing power; China, as a follower, can leverage its engineering advantages. How Chinese large-model teams use less computing power to produce results, and thereby build a degree of resilience — or even do better — may end up shaping how the US-China AI landscape plays out.
China is still replicating technical solutions; reasoning was proposed by OpenAI with o1, so the next gap between AI labs will be about who can propose the next paradigm after reasoning. Infinite-length reasoning might be one such vision.
The core difference between different AI labs’ models lies not in technology, but in what each lab’s next vision is.
After all, vision matters more than technology.
Technical Discussion
What followed was a deep 2000-word technical discussion that explored SFT, data, distillation and the process reward function.
Technical Detail 1: SFT. ‘No need for SFT on the reasoning level’
The most groundbreaking aspect of DeepSeek isn’t its open-source nature or low cost, but the fact that it eliminates the need for supervised fine-tuning (SFT), at least for reasoning tasks. Tasks beyond reasoning may still require SFT. This raises the question of whether this represents a new paradigm or architecture that improves data efficiency in training, or whether it simply accelerates the iteration speed of model performance.
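For a mechanical sense of what "RL without SFT" can look like: DeepSeek's published reports describe optimizing the base model directly against rule-based rewards using GRPO, which replaces a learned value model with a group-relative baseline. A minimal, illustrative sketch of just that baseline step (not DeepSeek's actual implementation):

```python
import numpy as np

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for a group of sampled completions.

    Each of the k completions for the same prompt receives a scalar reward
    (e.g., from a rule-based correctness check). The advantage is the reward
    normalized against the group's own mean and standard deviation, so no
    separate value model is needed.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical example: 8 sampled answers to one math problem,
# rewarded 1.0 if the final answer matched the reference, else 0.0.
print(group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```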