Why China, and now Taiwan, are making their own chatbots using their own data

By Emily Feng

Published May 29, 2024 at 2:53 AM CDT

STEVE INSKEEP, HOST:

The tension between China and Taiwan now extends to artificial intelligence and chatbots resembling ChatGPT. Mainland China has developed its own chatbots that converse with people in Mandarin, and now Taiwan is developing chatbots that are free of Chinese influence. NPR's Emily Feng reports.

LEE YUH JYE: (Speaking Chinese).

EMILY FENG, BYLINE: (Speaking Chinese).

On the lush, tropical campus of Taipei's Academia Sinica research institute, AI researcher Lee Yuh Jye shows me TAIDE, asking it to say hi to me.

LEE: (Speaking Chinese).

FENG: TAIDE is a Taiwan-trained large language model, or LLM.

We warmly welcome Miss Feng's arrival. Thank you.

TAIDE launched earlier this year on a budget of about 6 million U.S. dollars paid by Taiwan's government. Lee has been its driving force. And his pitch was simple.

LEE: The performance of AI models really depends on the training data. So if the original data, the material coming from Taiwan is very small, you cannot expect the AI model or these large language models to understand Taiwan very well.

FENG: There are other Mandarin Chinese LLMs. For example, you can try out ERNIE, developed by Chinese internet company Baidu. But ERNIE has to follow Chinese censorship rules and political mores, including, for example, saying in answers that Taiwan is part of China - because only training a model on data from China could make a pro-China LLM. And so companies and governments in Taiwan are scrambling to create their own generative AI capabilities.

WINSTON HSU: For every generation for these technological revolutions, those countries, those people that benefit the most is those that embrace the technology change.

FENG: Winston Hsu is a Taiwanese AI professor and the founder of Thingnario, a tech company that uses AI to manage energy use. And he would not trust a Chinese AI model to do that for Taiwan because...

HSU: Energy is definitely - is a national security issue, right? So it's the must-have resource.

FENG: But to create a sophisticated LLM - one that can chat with a human and which understands cultural references, for example - you need data; a lot of it. And the reality is there just is not as much data in Traditional Chinese, the script used in Taiwan, as in Simplified Chinese, the script used in China.

SHUI DASHAN: Just the data. And don't nobody just keep the data at one place for you, right? You have to discover.

FENG: Shui Dashan is a managing director at the research arm of MediaTek, one of Taiwan's biggest semiconductor companies. He spent the last two years training BreeXe, a Traditional Chinese script LLM. One of the hardest parts was finding high-quality new information to teach it on.

SHUI: Just imagine a person. The person learns something in first grade, and then you - second grade, you try to teach the same thing. The kid would just, like, not pay attention. And that's probably similar thing for LLM, too. If you force that into a language model, it actually becomes stupid.

FENG: Back at Academia Sinica, Lee Yuh Jye tells me the most valuable data, though, will come from people themselves, drawn from what they type or speak into generative AI models. Businesses in Taiwan are already using LLMs for customer service.

LEE: So probably you have business secrets or you have some customer information.

FENG: Which he would not want LLMs owned by China to have, given its tense relations with Taiwan. And LLMs will also have individuals inputting their private information, asking chatbots highly personal questions about their lives, asking for answers. That is data that gives platforms a collective understanding of what people are thinking about at any given moment.

LEE: All of Taiwanese thinking - what you want, why you are interested - not only that, they will teach you how to do that. That is quite scary, right?

FENG: Teach you, as in the AI will be advising humans on what to do, which is why Taiwan is investing in its own LLMs - because as much as we shape the AI we are training, the idea is one day they will shape us, too.

Emily Feng, NPR News, Taipei, Taiwan.

INSKEEP: Tomorrow we follow the tech wars to mainland China. I visit a Chinese tech company that has been sanctioned by the United States.

Transcript provided by NPR, Copyright NPR.

NPR transcripts are created on a rush deadline by an NPR contractor. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.