Founded in 2023, Moonshot AI is one of China’s four new “AI tigers,” a group that has attracted massive valuations and big-name investors including Alibaba and Tencent. The firm is known for its chatbot, Kimi, whose most recent release highlights improved math, coding, and multimodal reasoning capabilities.
The following piece is a translation of an interview with one of Moonshot’s founders, Yang Zhilin. With a bachelor’s degree from Tsinghua University and a PhD from Carnegie Mellon University, Yang boasts an impressive resume that includes time working at Google Brain and Meta AI. He was also a technical contributor to some of China’s earliest large models, including Pangu 盘古 and Wudao 悟道. His research publications are numerous, and he is the first author of two highly cited papers in the natural language processing (NLP) field: Transformer-XL, which proposed a method that extends the context length of transformer models, and XLNet, which introduced a way for models to better understand complex data relationships.
This interview was published on the official account of Overseas Unicorn on February 21, 2024. In it, Yang outlines his vision of Moonshot AI as a combination of “OpenAI’s technology idealism” and “ByteDance’s business philosophy.” He covers several key points:
Moonshot’s goals and how it plans to compete with OpenAI,
The pursuit of AGI,
Data challenges and the potential of multimodality and synthetic data,
Personalized AI models,
Yang’s approach to leadership and his vision for a global tech future.
Yang Zhilin of Moonshot AI: How Can a Newly Founded AGI Company Surpass OpenAI?
Original article, Archive link.
Interviewers | Tianyi, Penny, and Guangmi. Editor | Tianyi. Typesetter | Scout.
01. AGI: AI is essentially a bunch of scaling laws
Overseas Unicorn: We compare training LLMs to landing on the moon, and the name “Moonshot AI” [literally “dark side of the moon”] is also related to moon landing. How do you view LLM training by startup companies? Under conditions of limited GPU and computing resources, is it still possible to achieve a “moon landing”?
Yang Zhilin: A “moon landing” depends on several different production factors. Computing power is certainly a core one, but there are others as well.
You need an architecture that simultaneously satisfies both scalability and generality — but today, many architectures actually no longer meet these two conditions. Transformers satisfy these two conditions in the known token space, but when expanded to a more general scenario, they don’t quite work. Data is also a production factor, including the digitization of the entire world and data from users.
So among the many core production factors, improving the others lets you use computing power more efficiently.
At the same time, regarding “moon landing,” computing power will definitely need to continue growing. Today, the best models we can see are at a scale of 10^25 to 10^26 FLOPs. This order of magnitude will certainly continue to grow, so I believe computing power is a necessary condition. This is because machine learning and AI have been researched for 70 to 80 years, and the only thing that actually works is the scaling law, which is the expansion of these various production factors.
We are actually quite confident that, within a one-year time window, we will be able to achieve a model at the scale of 10^26 FLOPs, and that resources, ultimately, will be reasonably allocated.
Overseas Unicorn: For OpenAI to train their next-generation model, we estimate they have at least 100,000 H100 GPUs, with single clusters reaching 30,000 GPUs. OpenAI is clearly pursuing the “moon landing,” with the possible shortcoming being that they don’t focus as much on user and customer experience. Where will Moonshot AI’s path differ from OpenAI’s? What can Moonshot AI do that OpenAI won’t do?
Yang Zhilin: A key point in the short term is that everyone’s tech vision is not exactly the same. Many areas are not OpenAI’s core competitive strengths (for example, image generation); DALL·E 3 is at least one generation behind Midjourney. GPT’s long-context capabilities are also not state-of-the-art. The lossless long-context technology we recently developed performs better than OpenAI in many specific scenarios because it uses lossless compression technology. You can use it to read a very long article, and it can effectively reproduce specific details and make inferences about the content. Users will discover many scenarios themselves, such as chucking 50 resumes at it and having it analyze and screen them according to their requirements.
To achieve differentiation, I believe we need to look at how large the tech space is: the larger the tech space, the greater the differentiation that can be achieved at the technical, product, and business levels. If the technology has already converged, then all anyone can do is follow the same path, resulting in homogeneous involution.
And I’m actually quite optimistic, because there is still a huge tech space. AGI technology can be divided into three levels:
The first level is the scaling law combined with next-token prediction (this foundation is the same for everyone, and the catching-up process is gradually converging). On this path [of scaling law with next-token prediction], OpenAI is currently doing better because they have invested the right resources over the past four to five years.
The second level has two core problems. The first is how to represent the world in a general-purpose way. True “general-purpose” representation is like a computer using 0 and 1 to represent the entire world. Transformer-based language models can represent a book, an article, or even a video — but representing a larger 3D world, or all the files on your hard drive, is still difficult. They haven’t achieved token-in-token-out, and are actually still far from the so-called unified representation. Architecture actually solves this problem.
Overcoming the bottleneck of data scarcity through AI self-evolution is another issue at the second level. Today’s AI is actually like a black box, and this black box has two inputs: a power cable and a data cable. After inputting these two things, the box can produce intelligence. Subsequently, everyone realized that the input from the data cable is limited — i.e., the so-called data-bottleneck problem. The next generation of AI needs to unplug the data cable, so that as long as power is continuously input, intelligence can be continuously output.
These two core problems lead to enormous space at the third level, including long context, cross-modal generation, the model’s multi-step planning capabilities, instruction-following capabilities, various agent functionalities, and so on.
These higher-level elements will all have enormous differentiation, because there are two important technical variables in between. I believe this is our opportunity.
In addition to the technical level, our values differ somewhat from OpenAI: we hope that, in the next era, we can become a company that combines OpenAI’s technology idealism with the business philosophy of ByteDance. I believe the Asian mindset towards commercialization has certain merits. If you don’t care about commercial value at all, it’s actually very difficult to create a truly great product, or to make an inherently great technology even greater.
Overseas Unicorn: What kind of story should AI model companies tell? Should they frame their narrative around the pursuit of AGI, like OpenAI, or focus on becoming a super app? Are these two narratives in conflict, and how should they be balanced?
Yang Zhilin: The way a company tells its story depends on investors’ mindsets. For us, the more important question is understanding the relationship between these two goals.
AGI and product development are not a means-to-an-end relationship for us; they are both goals in themselves. In the pursuit of AGI, I believe the so-called “data flywheel” is crucial, even though it’s a somewhat old concept.
Products like ChatGPT haven’t yet fully established a continuous evolution loop based on user data. I think this is largely because base models are still evolving — when a new generation is developed, previous user data becomes less useful. This is tied to the current development stage — today, progress is driven by the scaling laws of base models, but in the future, there could be a shift toward leveraging the scaling laws of user data as a source of progress.
Historically, almost all successful internet products have ultimately relied on scaling user data. Today, we can already see signs of this with Midjourney. By leveraging the scaling laws of user data, it has managed to outperform simple base model scaling. However, when it comes to language models and text generation, the scaling effects of base models still far outweigh those of user data. That said, I believe this will eventually shift towards user data scaling — it’s just a matter of time.
This is particularly important now, as we face data bottlenecks. Human preference data, for example, is extremely limited but also indispensable. I believe this is one of the most critical challenges for every AI-native product today. A company that doesn’t care enough about its users may ultimately fail to achieve AGI.
Overseas Unicorn: What’s your view on MoE (Mixture of Experts)? Some argue that MoE isn’t truly a form of scaling up and that only scaling up a dense model improves a model’s capabilities.
Yang Zhilin: You can think of models with MoE and without MoE as following two different scaling laws. Fundamentally, a scaling law describes the relationship between loss and parameter count. MoE changes this function, allowing you to use more parameters while keeping FLOPs (the total number of floating-point operations) constant. Meanwhile, synthetic data changes a different relationship — it allows for data scale growth while keeping FLOPs unchanged.
Following a scaling law is a predictable path, and people try to modify specific relationships within these laws to achieve greater efficiency. That extra efficiency becomes their competitive advantage.
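For reference (this is not from the interview): one widely cited empirical form of the scaling law, from the Chinchilla paper by Hoffmann et al., writes loss as a function of parameter count and training data, which makes it easier to see exactly what MoE and synthetic data each modify.

```latex
% A published empirical form of the scaling law (Hoffmann et al., 2022, "Chinchilla"):
% N = parameter count, D = training tokens; E, A, B, \alpha, \beta are fitted constants.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% For a dense transformer, training compute is roughly C \approx 6\,N\,D FLOPs.
% MoE decouples total parameters from per-token compute: only the routed "active"
% parameters enter the FLOPs count, so N can grow while C stays roughly fixed;
% synthetic data instead relaxes the ceiling on D.
```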
Right now, many believe that simply implementing MoE is enough to achieve something like GPT-4. I think this view is too simplistic. Ultimately, the more fundamental challenge is how to establish a unified representation space and a scalable data production process.
Overseas Unicorn: If compute were sufficient, would anyone build a trillion-parameter dense model?
Yang Zhilin: That depends on how fast inference costs decrease, but I definitely think someone would. Right now, inference costs are too high, so everyone is making trade-offs. However, if compute weren’t a constraint, training a trillion-parameter dense model would undoubtedly perform better than a model with only hundreds of billions of parameters.
Overseas Unicorn: Anthropic has been emphasizing model interpretability, which has sparked a lot of debate. What’s your perspective on interpretability?
You just mentioned that models are a “black box,” and we still don’t fully understand how the human brain works either.
Yang Zhilin: Interpretability is fundamentally about trust. Building a system that people can trust is important, and the applications related to this might be quite different from something like ChatGPT — such as integrating long-context models with search.
If a model never hallucinates or has an extremely low hallucination rate, interpretability wouldn’t even be necessary, because everything it says would be correct. Also, interpretability itself can be seen as part of alignment — for example, chain-of-thought reasoning can be considered a form of interpretability.
Hallucinations can be addressed through scaling laws, but not necessarily in the pre-training stage. Alignment itself also follows a scaling law, meaning it can be solved as long as the right data can be found. AI, at its core, is just a set of scaling laws.
Overseas Unicorn: What are your expectations for AGI? At its core, isn’t the transformer still a statistical probability model? Can it lead to AGI?
Yang Zhilin: There’s nothing wrong with statistical models. When next-token prediction is good enough, it can balance creativity and factual accuracy.
Factual accuracy is usually a challenge for statistical models, but today’s LLMs can exhibit highly peaked distributions. If you ask a model a question like, “What is the capital of China?” then the model can assign a 99% probability to the character “Bei” (as in Beijing). At the same time, if I ask it to write a novel, the probability distribution over the next word would be much flatter. Probability is really a method of general-purpose representation (通用的表示方式). In this world, there is a vast amount of entropy. We need to capture the deterministic elements while also allowing the inherently chaotic aspects to remain chaotic.
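As a toy illustration of these two regimes (the vocabulary size and probabilities below are made up for the example, not taken from any particular model), the difference shows up directly in the entropy of the next-token distribution:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Hypothetical next-token distributions over a 1,000-token vocabulary.
# Factual question ("What is the capital of China?"): mass piles onto one token.
factual = [0.99] + [0.01 / 999] * 999
# Creative writing ("continue this novel..."): mass spreads over many tokens.
creative = [1 / 1000] * 1000

print(f"factual  entropy: {entropy(factual):.3f} nats")   # ~0.13, nearly deterministic
print(f"creative entropy: {entropy(creative):.3f} nats")  # ~6.91, close to uniform
```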
To achieve AGI, long-context will be a crucial factor. Every problem is essentially a long-context problem — the evolution of architectures throughout history has fundamentally been about increasing effective context length. Recently, word2vec won the NeurIPS Test of Time award. Ten years ago, it predicted surrounding words using only a single word, meaning its context length was about 5. RNNs extended the effective context length to about 20, LSTMs increased it to several dozen, and transformers pushed it to several thousand. Now, we can reach hundreds of thousands.
If you have a billion-token context length, then the problems we face today would no longer be problems.
Additionally, lossless compression is essentially the process of learning determinism from chaos. An extreme example is an arithmetic sequence — given the first two numbers, every subsequent number is deterministic, meaning there is no chaos, so a perfect model can reconstruct the entire sequence. However, real-world data contains noise. We need to filter out this noise so the model only studies the learnable content. During this process, we must also assign appropriate probabilities to uncertainties.
For example, if you generate an image, its loss will be higher than that of generating text because images contain more chaos and information. However, the key is to capture only the aspects you can control, while treating the remaining uncertainty probabilistically. Take a water cup as an example — whether its color is green or red is a probability-based variation, but the shape of the cup remains unchanged. Therefore, the priority is learning the cup’s shape, while its color should be treated probabilistically.
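A minimal sketch of the arithmetic-sequence point (the sequence and the noise model are invented for illustration): on a fully deterministic source an ideal predictor reaches zero loss, while irreducible noise sets a floor on the loss that no amount of modeling can remove.

```python
import math, random

random.seed(0)

# Deterministic source: an arithmetic sequence. Given the first two terms,
# every subsequent term is fully determined, so an ideal model assigns
# probability 1 to the correct continuation and the per-token loss, -log(1), is zero.
seq = [3 + 4 * i for i in range(100)]
loss_deterministic = 0.0

# Noisy source: each observed term is the true term plus an unpredictable +/-1.
# Even a perfect model can only say "50% either way", so the best achievable
# per-token loss equals the entropy of the noise rather than zero.
noisy = [x + random.choice([-1, 1]) for x in seq]
loss_irreducible = -math.log(0.5)  # about 0.693 nats per token

print(loss_deterministic, round(loss_irreducible, 3))
```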
Overseas Unicorn: What patterns exist in the increase of context length? Is there any technological predictability?
Yang Zhilin: I personally feel that there is a Moore’s Law for context length. However, it’s important to emphasize that accuracy at a given length is also crucial. We need to optimize both length and accuracy (lossless compression) simultaneously.
As long as we ensure the model’s capability and intelligence, I believe the increase in context length is very likely to follow exponential growth.
02. Multimodal: most architectures aren’t worth scaling up
Overseas Unicorn: Everyone anticipates multimodal technology will explode in 2024. Compared to text, where do the technical challenges of multimodality lie?
Yang Zhilin: Currently, state-of-the-art video generation models actually use at least an order of magnitude fewer FLOPs than language models. It’s not that people don’t want to scale them up; it’s that most architectures aren’t worth scaling.
In 2019, the most popular architecture was BERT, and people later asked why nobody scaled up BERT. The truth is that architectures worth scaling need to have both scalability and generality. I don’t think BERT lacked scalability, but you can clearly see it lacked generality — no matter how much you scaled it, it could never write an article. Multimodal has also been stuck on architecture issues for the past few years, lacking a truly general-purpose model that people are willing to scale. Diffusion clearly isn’t it — even if you scaled it to the heavens, it could never be AGI. Today, auto-regressive architectures have brought some new possibilities, sacrificing some efficiency to solve the generality problem.
Auto-regressive architectures themselves are scalable, but tokenizers might not be, or eventually tokenizers won’t be needed at all. This is a core problem for 2024.
Overseas Unicorn: If tokenizers aren’t scalable, do we need a completely new architecture beyond transformers?
Yang Zhilin: Just talking about transformers themselves, I don’t think there’s a major problem. The core issue still is solving the tokenizer problem. The transformer architecture has actually already undergone many changes — today’s implementations for long-context and MoE aren’t standard transformers. But the spirit or ideas behind transformers will definitely exist for a long time. The key is how to solve more problems based on these foundational ideas.
Overseas Unicorn: If context length becomes infinitely long, we wouldn’t need tokenizers anymore?
Yang Zhilin: Correct. Essentially, if a model is strong enough, it can process any token, pixel, or byte. With infinite context length, you could directly input everything on your hard drive to it, and it would become your real new computer, taking actions based on all that context.
Overseas Unicorn: Leading model companies like OpenAI and Anthropic think a major bottleneck in 2024 will be data — so they have high expectations for synthetic data. What’s your view on synthetic data?
Yang Zhilin: A scalable architecture is the foundation — and this architecture must first support continuously adding more data before data truly becomes the bottleneck. The data bottleneck we’re talking about now will be encountered in the text modality in 2024, but introducing multimodal data will delay this problem by one to two years.
If the bottlenecks in video and multimodal can’t be solved, then the text data bottleneck will become critical. We’ve actually made some progress on this — if the problem is constrained, such as mathematics or code writing, data is relatively easy to generate. For general-purpose problems, there isn’t a complete solution yet, but there are some directions worth exploring.
Overseas Unicorn: Will the bottleneck in 2025 be energy? Because by then, individual clusters will be very large, which will bring energy challenges.
Yang Zhilin: These problems are actually connected. Eventually, multimodal might solve the data problem, and synthetic data might solve the energy problem.
By the GPT-6 generation, players who master synthetic-data technology will show clear advantages. This is because there are two types of data: “pre-training” data and “alignment” data, the latter of which is more costly to obtain. If you master data-generation technology, the cost of alignment might decrease by several orders of magnitude, or you could produce several orders of magnitude more data with the same investment, changing the landscape.
I think 2025-2026 might be an important milestone: most of the model’s computation will occur on data generated by the model itself.
By 2026, the amount of computation models use for inference might far exceed that used for training; you might spend ten times as much on inference as on training. A new paradigm will emerge: inference becomes training, and this inference doesn’t serve any users — it only serves to generate synthetic data for the model itself.
If this happens, the energy problem is also solved, because inference can be distributed. It doesn’t violate any laws; it’s essentially energy conservation. I’m just changing the computational paradigm to allow energy to be solved in a distributed way.
03. Super App: Model Fine-Tuning May Eventually Not Exist
Overseas Unicorn: The search and recommendation systems behind Google and Douyin have strong flywheel effects: their algorithms can provide real-time feedback based on user behavior, continuously improving user experience. LLMs, however, currently can’t provide real-time feedback on user behavior. What will the flywheel effect of AI-native products be?
Yang Zhilin: I’ve thought deeply about this question. The ultimate, core value of AI-native products is personalized interaction, which is something previous technologies haven’t implemented well. So this question is actually about personalization — how to enable users to gain highly personalized interactive experiences the more they use your product. For many products today, the degree of personalization is almost zero. Previously, we could only do personalized recommendations, but now users can interact with products. This interaction is highly anthropomorphic and personalized. How do we achieve this?
I think this is fundamentally a technical issue. In the traditional AI era, achieving personalization required continuously updating models, using small models to solve specific problems. In the large model era, one way to achieve personalization is through fine-tuning — but I believe fine-tuning may not be the fundamental method and may not exist in the long term. Why? When your model’s instruction-following ability, reasoning ability, and contextual consistency ability become stronger, everything only needs to be placed in memory. For example, your large model’s memory can have a bunch of prefixes to follow, reducing costs dramatically. Ultimately, the process of personalizing a model is actually your entire interaction history — which is a collection of your preferences and feedback. This feedback is more direct than products from previous eras because it’s generated entirely through conversational interfaces.
Based on this judgment, the next question is: how to achieve long-context-based customization at the technical level to completely replace fine-tuning?
I believe we’re moving in this direction now. Future models won’t need fine-tuning but will instead solve problems through powerful contextual consistency and instruction-following capabilities. The long-term trend should be personalization of the underlying technology, which will be a very important change.
For example, GPT-4 brought a new computing paradigm where creating GPTs doesn’t require fine-tuning. Previously, customization was achieved through programming, but today it’s achieved by making the model’s prefix very complex, and extracting what you want from this general-purpose set. Personalization achieved this way is truly AI-native personalization, and a traditional recommendation engine plug-in will definitely be eliminated by this new approach.
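To make this concrete, here is a minimal sketch of personalization via long context rather than fine-tuning (the Interaction, build_prefix, and answer names, and the model_call hook, are hypothetical illustrations, not any real product API): the user’s accumulated history and preferences are folded into a prompt prefix that is prepended to every request, and the model weights are never updated.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    prompt: str
    response: str
    feedback: str  # e.g. "too long", "perfect", "wrong tone"

def build_prefix(history: list[Interaction], preferences: list[str]) -> str:
    """Fold the entire interaction history into a reusable prompt prefix."""
    lines = ["Known user preferences:"]
    lines += [f"- {p}" for p in preferences]
    lines.append("Past interactions (prompt / response / user feedback):")
    for h in history:
        lines.append(f"- {h.prompt} -> {h.response} [{h.feedback}]")
    return "\n".join(lines)

def answer(model_call, history, preferences, new_prompt):
    """Personalized answer = general-purpose model + user-specific prefix in context."""
    prefix = build_prefix(history, preferences)
    return model_call(prefix + "\n\nNew request: " + new_prompt)
```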
Overseas Unicorn: How did you make the decision to first develop lossless long-context?
Yang Zhilin: I think the most important thing is to begin with the end in mind. Large models as new computers definitely need large memory, because the memory of old computers has increased by at least several orders of magnitude over the past few decades, and old computers also started with very little memory. The second point is that the ultimate value of AI is personalization.
Overseas Unicorn: OpenAI also has some long-context capability now.
Yang Zhilin: But they haven’t truly viewed the user interaction process as a personalization scenario. For example, if we prompt ChatGPT with something, regardless of whether it’s today or tomorrow, as long as the model version is the same, the effect is basically the same. This is what I mean by a lack of personalization.
Ultimately, everything is instruction-following. It’s just that your instructions will become increasingly complex. Today, your instruction might start with 10 words, but later it could be 10,000 words or even 1 million words.
Overseas Unicorn: Chatbots have always been the ideal for AI scientists. If each user has hundreds of conversations with a chatbot daily, and the chatbot system can collect and understand more user context, will it ultimately far exceed the matching accuracy of search and recommendation systems? Like interactions between colleagues or family members, where just one sentence or even a glance is enough to understand each other.
Yang Zhilin: The key is crossing the trust threshold.
I think the ultimate measure of an AI product’s long-term value is how much personalized information users are willing to input into it, and then lossless long-context and personalization are responsible for turning these inputs into valuable outputs.
New hardware forms may also be needed — but I think models and software are still bottlenecks. To dig deeper, the prerequisite for users to input a lot of information is trust — you need a sufficiently engaging and human-like AI. You can’t say, “I’m setting up product features specifically to get your information.” The end result should be that users and AI become friends, so users can tell the AI anything.
Inflection Pi’s motivation is actually good — wanting to establish strong trust — but Pi may need to take another step forward. How to build trust with users? Human society probably won’t accept being assigned a lifelong companion; that seems somewhat against human nature.
Overseas Unicorn: Moonshot AI wants to create a super app. What does your ideal super app look like? How big does it need to be to qualify as “super”?
Yang Zhilin: It’s about breaking out of niche adoption. When all your relatives are using it, only then have you truly become a super app. And I believe that improvements in AI capabilities will lead product adoption. For example, if character.ai were a perfect multimodal model today, I think its chances of breaking out of its niche would be at least 10 times greater. Ultimately, an application’s ceiling is reflected in the year-over-year increase in connections between AI and humans.
04. Moonshot AI: People with the ability to unlearn make the best talent
Overseas Unicorn: What does the ideal CEO for an AGI company look like?
Yang Zhilin: On one hand, there needs to be a tech vision. You can’t just keep doing things that have already been proven to work by others. A real AGI company must have its own unique technical judgment, and this judgment should influence the overall direction of the company. If the top leader can’t make decisive calls, that won’t work either. At the beginning of the year, we were already working on auto-regressive multimodal models and lossless long-context, but these only became extremely popular in the last couple of months. Even today, lossless long-context is still not widely accepted as a consensus. If you only start noticing these trends now, there won’t be enough time to iterate, and in the end you’ll just become a follower.
Another point is having a profound understanding of AI-native product development and then adapting the organization to this new mode of production. In the past, product development was about understanding user needs and designing features accordingly. But in this new era, design needs to be completed during the manufacturing process. ChatGPT’s design was finalized through its creation — it wasn’t built by pre-defining a bunch of scenarios and then finding corresponding algorithms. Similarly, Kimi users uploading resumes and using it for screening was a completely untested use case before we launched, yet it emerged naturally from real-world usage.
Resource acquisition is also crucial, with compute power being the primary cost driver. In the early stages, funding is key, but later on, product commercialization becomes necessary. However, commercialization cannot simply copy mature models from the previous era; it requires innovation. A good CEO and team should have some experience but also possess strong learning and iteration capabilities.
Overseas Unicorn: But maybe some investors can’t tell whose “tech vision” actually leads the pack.
Yang Zhilin: I’m not too worried about this problem. What we have now is the best distribution mechanism: it’s close to a real free market and we will end up with the most efficient resource distribution. What we need to prove to others is not the value of our vision, because a vision is an abstract thing. We need to prove our worth through delivering real models and products. Anthropic received much more funding immediately after it released models like Claude. The market is fair.
Overseas Unicorn: From the perspective of building product- and company-competitive moats, the industrial era relied on economies of scale, and the internet era emphasized network effects. Will there be a new paradigm in the AGI era?
Yang Zhilin: In the short term, changes in organizational structure drive technological advancements — better technology is achieved through better organization, which then directly translates into a superior product experience.
In the long term, network effects are still likely to dominate. The question is: how will they manifest? Traditional two-sided networks from the internet era may still exist, but not necessarily in the form of users and content creators. For AI-native products, the two-sided network effect may be reflected in personalization, where users and the AI engage in a co-creative relationship.
So right now, I see two key areas worth exploring: the continuous improvement of model capabilities and the development of two-sided network effects. These will shape new paradigms in the AGI era. Midjourney has already seen explosive growth through its two-sided effect, while Stable Diffusion, as an open-source model, faces the challenge that one side of its network is too fragmented, leaving it to rely solely on base model improvements.
Overseas Unicorn: From the hiring perspective, how do you define strong talent?
Yang Zhilin: I break it down into experience and learning. The ability to learn is a general-purpose capability, which not only includes learning but also unlearning — especially unlearning previous experiences of success. Let’s say you built YouTube from 0 to 1; you might find it harder to work on AI products now than other people do, because you have to unlearn a lot of things. Learning is more important than experience. Maybe in 5 years, the AI industry will cultivate a large number of so-called mature roles. Currently, I don’t actually think that dividing people by roles is all that meaningful, since every person needs to be multi-faceted.
Overseas Unicorn: What kinds of researchers possess “tech vision”?
Yang Zhilin: The core ideas are twofold: focusing on the big picture while letting go of the small details, and maintaining an endgame mindset. I’ve worked with many researchers, and a common issue is over-optimization — getting caught up in refining details while missing the broader perspective. For example, we saw that transformers solved the context length limitations of LSTMs, but if we take a step further back, we realize that each generation of technology is fundamentally about extending context length.
Overseas Unicorn: How many more of these people do you think Moonshot AI still needs?
Yang Zhilin: Objectively speaking, the real limit for us is still supply. Currently, experienced AGI talent is very rare, but there are lots of people with the ability to learn.
But from a demand perspective, the organization cannot become too large — if it turns into just another Big Tech corporation, many of its organizational advantages will be lost. So we will definitely maintain a lean and highly efficient structure. One key judgment is that AGI does not require that many people. In the long run, once we truly “unplug the data cable,” models at the level of GPT-6 and beyond should be able to evolve on their own, breaking through the limits of human capability.
Overseas Unicorn: How do you assess the difficulty and timeline for catching up with GPT-4?
Yang Zhilin: Hitting benchmark scores on par with GPT-4 is very easy, but achieving its actual performance is definitely challenging. It’s not just a matter of resources — Google has already demonstrated this. In fact, the training cost of GPT-4 isn’t that high; several tens of millions of dollars is not an intimidating figure. This is positive news for us, and we’ve even already made substantial progress.
The most critical factor is having a strong tech vision to anticipate what GPT-5 and GPT-6 will be, and then executing and building the necessary foundations ahead of time. Otherwise, it’ll never be possible to surpass OpenAI. Much of OpenAI’s advantage comes from its foresight — by 2018, it had already committed to what it believed was the right path and spent years building deep capabilities.
Overseas Unicorn: If you were to develop an image-generation AI, how would you approach it? How would you balance language comprehension and image quality?
Yang Zhilin: Midjourney has already done exceptionally well in the single task of image generation. If I were to develop a similar product, I would want it to handle multiple tasks, while still excelling in certain key areas. This is actually the same approach OpenAI attempted, but they didn’t quite succeed.
An AGI company should focus on becoming the default platform — the primary way users interact with AI. Meanwhile, niche user groups will still have specialized needs and ultra-high standards for performance, which is why there’s room in the market for companies like Midjourney. However, if AGI becomes powerful enough, many users will migrate. For example, if I were to repackage all of Photoshop into a single prompt — essentially turning it into an outsourced all-in-one designer — then fewer people would use Midjourney.
Midjourney’s current dominance comes from its first-mover advantage, which enabled it to kickstart a powerful data flywheel. The tricky part is whether such a time window will exist in the future — if not, general-purpose models may eventually outcompete and overtake it.
Overseas Unicorn: Following the strategy of becoming the default platform, how many key user entry points do you foresee in the future?
Yang Zhilin: At least two — one for utility, the other for entertainment.
The way we access information today may become obsolete because, at its core, searching for information is just a means to an end — we do it to complete a task from start to finish. In the future, AI-driven interfaces will likely replace search engines as the primary way users interact with information. Retrieving information is never the end goal; it has just been artificially framed as one. Sometimes we want to accomplish a task, and other times, we want to learn something new. The ideal AGI interface should directly help users complete tasks, rather than simply helping them find information.
Overseas Unicorn: From today onward, how much investment do you think it will take to realize your vision of AGI?
Yang Zhilin: Achieving a fully realized AGI will require investment on the scale of tens of billions of dollars. However, it won’t be a one-time expense — it’s about setting up a self-sustaining loop where the business can generate the necessary resources to fuel further development. This tens-of-billions estimate is based on the need to scale up by at least two to three orders of magnitude. Of course, costs will be optimized along the way.
Overseas Unicorn: What should the business model of an AGI company look like? Will it still be seat-based or usage-based?
Yang Zhilin: AGI delivers varying levels of value depending on the task it completes. It may operate more like an outsourced service, pricing each task individually. Beyond that, advertising will undoubtedly play a crucial role. With deeply personalized interactions and conversational engagement, ad monetization could become significantly more efficient than it is today.
Overseas Unicorn: If training costs for models like GPT-4.5, Claude-3, and Gemini-2.0 are around $300 million today — and future models in 2025 could require tens of billions of dollars — does that mean the pursuit of AGI is a trillion-dollar gamble? Have you considered its ultimate impact on human society?
Yang Zhilin: One impact that’s almost certain is a real and tangible increase in productivity. Today, a single piece of software might function at the intelligence level of 1,000 programmers, but in the future, applications could be powered by the equivalent of a million programmers, continuously improving through iteration.
Thinking about the possibilities, everything we take for granted today could change. Training models on a vast range of languages and cultures will inevitably influence values and perspectives. The way people allocate their time will shift — fewer people may work purely for money, and more of human life may be spent in digital or intellectual spaces. Ultimately, we may see the emergence of a massive virtual cognitive ecosystem. To truly build the Metaverse, we may first need to perfect AI.
Additionally, I firmly believe AGI will be inherently global.
Overseas Unicorn: Right now, leading AI models are both powerful and relatively inexpensive, leading to a strong Matthew effect [self-reinforcing cycle in which early winners keep accumulating advantages]. Doesn’t that mean the final market landscape will be highly consolidated?
Yang Zhilin: Within a five-year window, top players might still dominate. However, in 50 years, I believe AGI will be fully commoditized — it will be no different from electricity today.