Hugging Face Blocked! “Self-Castrating” China’s ML Development + Jordan at APEC
“We will inevitably be behind.”
Jordan will be at APEC in SF. If you think your boss should come on ChinaTalk, please reach out! We’d love to schedule some interviews alongside the event.
The following is written by anonymous contributor “L-Squared.”
Hugging Face is currently inaccessible in China without a virtual private network (VPN), suggesting Chinese authorities are cracking down on access to open-source platforms despite their importance to machine learning research.
Hugging Face is a platform that helps users host and collaborate on machine learning datasets, models, and applications. Launched in 2016, the New York–based company now hosts over 200,000 open-source models and serves over 1 million model downloads a day. It was valued at $4.5 billion in August, when it raised funding from tech companies including Google, Amazon, Nvidia, and Intel.
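To make that dependence concrete, here is a minimal, illustrative sketch of the kind of everyday workflow the block interrupts, using the widely used `transformers` and `datasets` Python libraries; the model and dataset names are arbitrary public examples, not anything specific to the story.

```python
# A routine Hugging Face workflow: pull a pretrained model and a public
# dataset directly from the Hub. Each call below makes HTTPS requests to
# huggingface.co, so without a VPN it fails while the site is blocked.
from transformers import AutoTokenizer, AutoModelForMaskedLM
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# Training and evaluation data are fetched from the same endpoint.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

print(model.config.model_type, len(dataset))
```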
Recognizing the “remarkable achievements and contributions of the Chinese AI community,” Hugging Face registered a WeChat account in November 2022, launched a volunteer-powered Chinese language blog in April 2023, and appointed a China lead (who spoke at a prominent government-sponsored AI conference in Shanghai in July). On May 8, a user of Zhihu (a Chinese Q&A platform similar to Quora) reported being unable to access the site and asked for others’ opinions on the issue. Service resumed shortly afterward, but access appears to have been cut off again since at least September 12.
The top-ranked answer to the Zhihu thread, with 1,000 upvotes before it was blocked in mid-October, was written by a Beijing-based senior NLP algorithm engineer going by the username Johnson7788 (referred to as Johnson hereafter). From September 12 to October 8, he posted near-daily updates of his almost monomaniacal mission to get to the bottom of the problem and have Hugging Face reinstated. On September 19, after contacting the company and conducting network diagnostics, he assessed in a colorful turn of phrase that it seemed to be a case of Chinese “self-castration” 国内自宫了 — a block on the website imposed by the Cyberspace Administration of China (CAC).
Possible Reasons for the Block
Although it’s not clear why Hugging Face has been blocked, one likely reason is that it allows access to code and demos of models not compliant with the country’s regulations on generative AI, in force since August 15. The regulations, which apply only to models made available to the public (as opposed to business users), clarify that AI-generated content must conform to strict pre-existing controls on online speech. Where services have “public opinion properties or the capacity for social mobilization,” they must conduct security assessments and file their algorithms in the CAC’s registry.
A response to the Zhihu thread by a PhD student at the University of the Chinese Academy of Sciences, which received over 200 upvotes, said that Hugging Face undoubtedly did not comply with generative AI rules and should have been disciplined earlier.
Microsoft’s GitHub, which also hosts open-source model code and datasets (as well as non-ML software projects), was inaccessible in China without a VPN earlier this week, but is available again as of October 18. As it can host user-generated content, it has a history of temporary blockages in China. For a long time, however, the importance of the platform to Chinese programmers — highlighted by tech executive Kai-Fu Lee in public criticism of a 2013 blocking — prevented a permanent ban.
The Hugging Face block may also prove to be temporary, but other developments in cyberspace governance suggest that the tide is against the platform. Beijing’s recent tightening of supervision over mobile apps means that Apple users, after benefiting for years from lax enforcement of domestic registration rules, are now set to lose access to many foreign apps. The drive for greater control by the CCP over online discourse is also clear in the recently announced Xi Jinping Thought on Culture.
Maintaining strict control over information is presumably the primary motivation, but a secondary goal of the Hugging Face block may be to keep nudging developers toward building and contributing to Chinese open-source platforms and datasets rather than relying on foreign ones. Homegrown platform Gitee won a government contract for a Chinese, independent open-source software repository in 2020. Article 6 of the recent generative AI regulations encourages “indigenous innovation” in basic technologies for generative AI, while also promoting “the establishment of generative AI infrastructure and public training data resource platforms” and “collaboration and sharing of algorithm resources.” In other words, the Chinese government still wants open-source infrastructure, as long as it is made in China. And you can see why the Party-State is worried about over-reliance on foreign-developed software: the Biden administration is reportedly considering export controls on future iterations of large language models (LLMs).
Implications for Chinese AI Development
So will homegrown alternatives soften the blow of the Hugging Face block? By the end of 2022, Gitee had over 10 million users and 25 million repositories. Even so, it does not seem to be an important hub for AI models. A search for “Llama” (Meta’s open-source LLM, upon which many other models are built) generated only 200 or so results, compared to over 9,000 on Hugging Face.
Instead, Alibaba-developed ModelScope, established in November 2022, claims to be the largest AI model community in China. At the end of July, it had over 1,000 AI models, 2 million AI developers, and over 45 million cumulative model downloads. Then there’s Baidu’s AI Studio, an AI development platform that hosts models and projects in a similar way to Hugging Face. But whereas Hugging Face is not aligned with any one tech giant, companies in China seem reluctant to contribute to model hubs run by their competitors, limiting the coverage of ModelScope and AI Studio. Moreover, contributions to the latter must use Baidu’s open-source deep learning framework, PaddlePaddle, adding conversion hassle for developers working with the more common PyTorch or TensorFlow.
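To give a sense of what that conversion hassle involves, below is a rough sketch, assuming a developer wants to carry a PyTorch checkpoint over to PaddlePaddle. The file names are placeholders, and a real conversion also needs per-layer handling (some weights, such as linear layers, are stored transposed between the two frameworks), so this shows the shape of the task rather than a working converter.

```python
# Minimal sketch: re-saving a PyTorch checkpoint as a PaddlePaddle one by
# routing every weight through NumPy. Placeholder file names; real
# conversions also remap parameter names and transpose certain layers.
import torch
import paddle

torch_state = torch.load("checkpoint.pt", map_location="cpu")

paddle_state = {
    name: paddle.to_tensor(tensor.detach().cpu().numpy())
    for name, tensor in torch_state.items()
}

paddle.save(paddle_state, "checkpoint.pdparams")
```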
Comments on the Zhihu thread about the blocking of Hugging Face lament the lack of suitable Chinese alternatives and discuss how the move will push the country’s AI field further behind.
Partial translation: “[Hugging Face] doesn’t talk about politics, and there also aren’t any equivalent products in China — do we expect [Baidu’s deep learning framework] PaddlePaddle to cut it? In China, Huawei and Baidu’s homegrown equivalents are not unusable, but the main problem is that they are not at an acceptable level, and there are many traps; adaptation is also cumbersome…”
“…I’m really very sad, this is strangling our own competitiveness — in today’s world of AI quickly increasing productivity, shutting oneself off like this is really determining that, in the AI field at least, we will inevitably be behind.”
Johnson, who posted the subsequently blocked top-voted answer on the thread, was so distraught that he called numerous government bodies and sent pleading emails trying to get the site restored. He was shuttled between different departments, each disclaiming responsibility for the matter. It was all to no avail.
Johnson’s email to the center responsible for China’s cybersecurity emergency response on September 30. He wrote that since Hugging Face had been blocked, model training had become almost impossible.
When Johnson attempted to ask the chatbot of the national government affairs service platform which department he should find to resolve the issue of Hugging Face’s blocking, the chatbot responded that the question was too profound. Recounting this on Zhihu, he wrote that the chatbot “is still very unintelligent; it has definitely not been trained on the latest large models.”
Researchers at businesses and universities, who tend to have legal means to use VPNs, should still be able to access Hugging Face. Providers of consumer-facing generative AI services, however, are unlikely to be able to benefit from the platform, if recently published draft security requirements for these services are confirmed and enforced. The requirements prevent the use of information blocked on the Chinese internet (which would now include datasets hosted on Hugging Face) as training data. They also ban the use of foundation models that have not completed filing with CAC, which constitute nearly all the models on that site.
Individual programmers and novices looking to get started in the field, who may lack access to stable VPNs, will also suffer from the blockage. Even those who have decent VPNs may be reluctant to continue using them after the recent high-profile case of a programmer who had three years’ salary confiscated after using one.
While the impact of blocking Hugging Face will be felt unevenly across the Chinese ML community, the move demonstrates how maintaining tight information controls is fundamentally at odds with the promotion of a flourishing LLM ecosystem. Faced with this trade-off, Chinese authorities are prioritizing the former.