Jiang Tao, Senior Vice President of iFlytek, explains how his company came to dominate the world of speech technology and artificial intelligence in China
For a company whose software speaks to more than half a billion people, iFlytek tends to keep its voice remarkably low. The firm was founded by a group of researchers in 1999 and is headquartered in the relatively sleepy eastern city of Hefei, far from China’s tech hubs, Beijing and Shenzhen.
But despite keeping a low profile, iFlytek has been quietly revolutionizing the world of speech recognition in China over the past two decades. The company’s software not only understands several Chinese dialects fluently (a feat Apple’s Siri, for instance, still struggles with), it can also transcribe that speech into text and even translate it into English instantly.
Now, the company that MIT Technology Review ranked as the sixth smartest company in the world in 2017—just one place below Google—is integrating artificial intelligence technology into its solutions in a bid to push the boundaries of what is possible in the world of voice-activated technology.
In this interview with the CKGSB Case Center, Jiang Tao, Senior Vice President of iFlytek, who has been at the company since the very beginning, explains how the company reached this point, how it keeps hold of its world-class researchers, and why it is not afraid of competition from China’s tech giants.
How did iFlytek get started?
A special thing about the company is that we are the first listed company [in China] that was started by college students, at the University of Science and Technology of China in Hefei. There were undergraduates, postgraduates and also doctoral candidates. Liu Qingfeng, who is now Chairman of iFlytek, was in the first year of his PhD when iFlytek started. Now, most of our C-level people hold doctorates. And that initial research team is still working at the company.
What was the voice recognition field like when iFlytek was founded in 1999?
In 1997, IBM launched ViaVoice, which was recognized as a major event by the global science and technology community. Although that product seems basic by today’s standards, at the time, being able to speak into a computer and see words appear on the screen was amazing.
At the end of the 1990s, there was a global boom in speech recognition technology, driven by IBM’s achievement. Many IT giants, including Intel, Microsoft, Motorola and Toshiba, had set up voice research centers in China. The market was basically monopolized by foreign firms. Top universities and institutions, like Tsinghua University and the Chinese Academy of Sciences, offered related programs, and their graduates went to work for those foreign companies.
We were like newborn calves unafraid of tigers. The situation was really hard, because after that initial breakthrough by IBM, the technology got stuck in a bottleneck for a long time. The accuracy of speech recognition systems stayed at 80%. Some people used voice recognition software, but many found it inefficient. The market as a whole did not take off. We went through a very painful growth period and eventually broke even in 2004.
I often divide our company’s development into three stages. The first was the startup phase, from 1999 to 2004. In this period, you could describe our products as being like “iFly Inside,” a play on the phrase “Intel Inside.”
What do you mean by “iFly Inside”?
When we started in 1999, we had a lot of ideas and experimented with a lot of different things. For example, we tried to create a voice-driven operating system for a PC, but this turned out to be unsuccessful. We also invested in telecom products. I led a team to develop something called a “phone internet,” where people could access the internet by calling a number and then “listening” to the information. Those ideas seem naïve today, but we were so passionate.
But a turning point came when people from Huawei approached us at a tech fair. They were developing a smart telecom network, and they decided to use our technology. After Huawei became our client, many other companies like Digital China also began to use our services. So, we became a technology provider to companies producing telecom systems, PCs and other smart devices like digital dictionaries. In Huawei’s case, they used our voice recognition technology in their call centers. This is what I mean by the term “iFly Inside.”
But these partnerships also proved to be a problem because we didn’t know how to do marketing or develop our own products. All we knew how to do was to provide technology support.
So, lots of people were using your product, but none of them had heard of iFlytek?
Yes, we were hidden. This model had a good side because it matched our abilities at the time. The downside was that our value-added was very small. But it did enable us to reach the break-even point six years after starting up.
How did you move beyond the “iFly Inside” stage?
We explored two different directions, and the first was education. In 2004, a government official paid a visit to our company and gave us a suggestion. He said that college graduates need to pass a spoken Mandarin Chinese or English test to become civil servants or teachers, and that such tests were scored manually, which cost the government quite a lot of money. He asked if our speech technology could help solve this problem. Following his advice, we entered the education market by developing a speech-based evaluation system. It was a relatively narrow but interesting application.
So, your technology could analyze the candidates’ pronunciation?
Yes, we started with the Mandarin tests and then expanded into English. Now, our product is used in most oral tests. After entering the education industry, we found that there was more demand in this market. For example, many teachers in areas with strong local dialects have poor Mandarin Chinese pronunciation. Our speech technology gives those teachers an in-classroom assistant that helps correct their pronunciation. To date, more than 80 million teachers and students have used educational products from iFlytek.
The other area we explored was partnering with telecom operators. In 2005, China introduced ringback tone services, which let users replace the standard tone a caller hears while a call connects with a song. It became really popular, but when users wanted to change the song, they had to go online, and at the time that meant using a PC, which was not very convenient.
So, we created a system that allowed users just to make a call and say what song they wanted to use as the ringtone. Our system recognized the song title, and that was it. Later that service evolved to cater to more complex needs. For example, a football fan could call up and ask to hear the latest news from his favorite club. From here, we slowly formed our consumer business group.
When did iFlytek start to focus on deep learning?
After we went public in 2008, the company’s market value and revenue continued to increase. Then, in around 2013, our business entered a new stage, which was more about deep learning.
Pattern recognition, which is the technology we had used up to that point, is completely different from deep learning. In pattern recognition, the technology has to extract a lot of features from a piece of speech, understand those features, and then analyze them to find a “pattern.” But in deep learning, you only need to look at the results and feed them back into the deep learning platform to train the machine. The next time it encounters a similar problem, it knows how to deal with it. In a sense, speech recognition was the first scenario well suited to deep learning.
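To make the distinction Jiang draws a little more concrete, here is a minimal, purely illustrative sketch in Python (not iFlytek’s code; the feature choices, names and model are assumptions for illustration only): the first half hard-codes which speech features to extract and how to match them against templates, while the second half adjusts a tiny model’s weights from labeled examples alone.

```python
import numpy as np

# Schematic "pattern recognition" pipeline: an engineer decides which
# features matter (here, frame energy and zero-crossing rate) and a fixed
# rule compares them against stored reference templates.

def extract_features(waveform, frame_len=160):
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len + 1, frame_len)]
    energy = np.array([np.mean(f ** 2) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames])
    return np.stack([energy, zcr], axis=1)

def match_template(features, templates):
    # templates: {label: reference feature vector}
    summary = features.mean(axis=0)
    return min(templates, key=lambda lbl: np.linalg.norm(summary - templates[lbl]))

# Schematic end-to-end approach: no hand-written matching rule. A tiny
# linear classifier maps an input vector to labels, and its weights are
# adjusted purely from (input, correct answer) pairs.

class TinyClassifier:
    def __init__(self, n_in, n_classes, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=(n_in, n_classes))
        self.lr = lr

    def train_step(self, x, label):
        logits = x @ self.w
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        target = np.zeros_like(probs)
        target[label] = 1.0
        self.w -= self.lr * np.outer(x, probs - target)  # one gradient step

    def predict(self, x):
        return int(np.argmax(x @ self.w))
```

The contrast is the one Jiang describes: the first approach depends on an engineer designing the features and the matching rule, while the second only needs examples paired with correct answers to improve.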
How has the shift toward deep learning altered your company’s business strategy?
We launched iFlytek Super Brain in 2014. To explain this project in one phrase: it will not only be able to listen and speak, but also to understand and think. In the beginning, our mission was to make every machine, television, car and toy able to listen and speak like a person. Then we changed it to making every machine able to understand and think, and to using artificial intelligence to build a better world.
Also, the company’s valuation increased a lot as we raised our ambitions in deep learning.
In deep learning, people often say that the players with the most money and data have the advantage. If that’s true, how can iFlytek compete against giants like Google?
The field of perception intelligence has basically matured. Whether it is speech recognition or image recognition fields like facial recognition, machines have now reached the level of human beings. There are no longer any fundamental challenges to solve there, so the rule that whoever has more data and a stronger ability to process that data will prevail certainly applies.
But the field of cognitive intelligence still has a long way to go. At the moment, we don’t have a universal cognitive intelligence system that can solve problems in different scenarios. Developing that system is going to be a big challenge. So, what is the status of the cognitive intelligence field today? We look at it from three different levels.
The first level is in special professional fields, like machines that can take medical exams and read CT scans. In this area, machines have already reached the level of humans.
The second level is the generalist level. For example, China’s college entrance exam includes tests on Chinese, English, math and many other disciplines that require general knowledge.
The third level is common-sense reasoning. For machines, this is the most difficult. There is a global contest for common-sense reasoning called the Winograd Schema Challenge, which poses questions like this: A asks B if there is a restroom nearby, and B tells A there is a KFC opposite. Why is that an answer? Responding to these questions requires a different kind of knowledge. It’s not like reading a CT scan, where there is a single, standard answer.
When our system took the Winograd test, we got a score of 50% and assumed we were going to be eliminated. But it turned out we scored the highest of all the tech companies entering the contest. So, what does that all mean? Making machines with common sense is very difficult.
So, back to your question of whether it’s true that the more you invest, the better the results you get. My answer would be: not necessarily. We believe that scaling up cognitive intelligence will require moving up the levels I just mentioned: starting with industries that require logic and standardized answers, like health care, and moving on toward more general applications.
We’re not afraid of competing against the large firms because developing AI requires top talent and an ability to integrate the data.
In terms of people, we feel that we actually have an advantage. The research team we have today is competitive, and we’re able to hold on to them. That is what’s special about this company: our top managers used to be researchers, so we know how to train researchers and provide what they need and value.
We have a group of world-class scientists working without financial pressure in the quiet city of Hefei. They know that their work here produces world-class achievements.