Baidu’s AI Lab Director on Advancing Speech Recognition and Simulation
Craig Cannon [00:00] – Hey, this is Craig Cannon, and you’re listening to Y Combinator’s podcast. This episode is with Adam Coates. Adam’s the director of Baidu’s Silicon Valley AI Lab, and what they focus on is developing AI technologies that’ll impact at least 100 million people. We spent a good chunk of this episode talking about Adam’s work in speech to text and text to speech, so if you want to learn more about those projects, you can check out research.baidu.com, and as always, if you want to read the transcript or watch the video, you can check out blog.ycombinator.com. Alright, here we go. Today we have Adam Coates here for an interview. Adam, you run the AI lab at Baidu, in Silicon Valley. Could you just give us a quick intro and explain what Baidu is for people who don’t know?
Adam Coates [00:40] – Yeah, Baidu is actually the largest search engine in China. So it turns out the internet ecosystem in China is this incredibly dynamic environment. So Baidu, I think, turned out to be an early technology leader and really established itself in PC search, but then also has remade itself in the mobile revolution, and increasingly, today, is becoming an AI company, recognizing the value of AI for a whole bunch of different applications, not just search.
Craig Cannon [01:10] – Okay, what do you do exactly?
Adam Coates [01:13] – I’m the director of the Silicon Valley AI lab, which is one of four labs within Baidu Research. Especially as Baidu’s becoming an AI company, there’s a need for a team to be on the bleeding edge and understand all of the current research, be able to do a lot of basic research ourselves, but also figure out how we can translate that into business and product impact for the company. That’s increasingly critical, so that’s what Baidu Research is here for. The AI lab in particular, we founded recognizing how extreme this problem was about to get. So I think the deep learning research and AI research right now is flying forward so rapidly that the need for teams to be able to both understand that research, but also quickly translate it into something that businesses and products can use, is more critical than ever. So we founded the AI lab to try to close that gap, and help the company move faster.
Craig Cannon [02:14] – How do you break up your time in between doing basic research around AI and actually implementing it, bringing it forward to a product?
Adam Coates [02:22] – There’s no hard and fast rule to this. One of the things that we try to repeat to ourselves every day is that we’re mission-oriented. So the mission of the AI lab is precisely to create AI technologies that can have a significant impact on at least 100 million people. We chose this to keep bringing ourselves back to the final goal, that we want all the research we do to ultimately end up in the hands of users. Sometimes that means that we spot something that needs to happen in the world to really change technology for the better and to help Baidu, but no one knows how to solve it. And there’s a basic research problem there, that someone has to tackle. We’ll go back to our visionary stance and think about the long-term and invest in research. And then, as we have success there, we shift back to the other foot, and take responsibility for carrying all of that to a real application and making sure we don’t just solve the 90% that you might put in, say, your research paper. But we also solve the last mile, we get to the 99.9%.
Craig Cannon [03:35] – Maybe the best way to do this, then, is to just explain something that started with research here, and how that’s been brought on to a full-on product that exists.
Adam Coates [03:47] – I’ll give you an example. We’ve spent a ton of time on speech recognition. Speech recognition a few years ago was one of these technologies that always felt pretty good but not good enough. Traditionally, speech recognition systems have been heavily optimized for things like mobile search. If you hold your phone up close to your mouth, and you say a short query,
Craig Cannon [04:10] – Talk in a non-human voice.
Adam Coates [04:13] – Exactly, the systems can figure it out. And they’re getting quite good. I think the speech engine that we built at Baidu called Deep Speech is actually superhuman for these short queries. Because you have no context, people can have thick accents. That speech engine actually started out as a basic research project. We looked at this problem, we said gosh, what would happen if speech recognition were human-level for every product you ever used? Whether you’re in your home or you’re in your car, you pick up your phone, but you hold your phone up close or you hold it away, if I’m in the kitchen and my toddler is yelling at me, can I still use a speech interface? Could it work as well as a human being understands us?
Craig Cannon [04:59] – What is the basic research that moved it forward, to put it in a place that it’s useful?
Adam Coates [05:03] – We had the hypothesis that maybe the thing holding back a lot of the progress in speech is actually just scale. Maybe if we took some of the same basic ideas we could see in the research literature already and scaled them way up, put in a lot more data, invested a lot of time in solving computational problems, and built a much larger neural network than anyone had been building before for this problem, we could just get better performance. And lo and behold, with a lot of effort, we ended up with this pretty amazing speech recognition model that, like I said, in Mandarin at least, is actually superhuman. You can actually sit there and listen to a voice query that someone is trying out, and you’ll have native speakers sitting around debating with each other, wondering what the heck the person is saying. And then the speech engine will give an answer and everybody goes, “Oh, that’s what it was!” Because it’s just such a thick accent from perhaps someone in rural China.
Craig Cannon [06:02] – How much data do you have to give it to train it? Because I think on the site it was English and Mandarin.
Adam Coates [06:09] – Yeah.
Craig Cannon [06:10] – Like if I wanted German, how much would I have to give it?
Adam Coates [06:12] – One of the big challenges for these things is that they need a ton of data. So our English system uses 10,000 to 20,000 hours of audio, the Mandarin systems are using even more for top-end products. This certainly means that the technology is at a state where to get that superhuman performance, you’ve got to really care about it. For Baidu voice search, maps, things like that that are flagship products, we can put in the capital and the effort to do that. But it’s also one of the exciting things going forward in the basic research that we think about is, how do we get around that? How can we develop machine learning systems that get you human performance on every product? And do it with a lot less data.
Craig Cannon [06:56] – What I was wondering, then, did you see that Lyrebird thing that was floating around this week?
Adam Coates [07:00] – Yeah.
Craig Cannon [07:00] – They claim that they don’t need that much time, all that much audio data, to emulate your voice. Or simulate, whatever they call it. You guys have a similar project going on, right?
Adam Coates [07:11] – Yeah, we’re working on text-to-speech.
Craig Cannon [07:13] – Why can they achieve that with less data?
Adam Coates [07:16] – I think the technical challenge behind all of this, is there are sort of two things that we can do. One is to try to share data across many applications. To take text-to-speech as one example, if I learn to mimic lots of different voices, and then you give me the 1,001st voice, you’d hope that the first thousand taught you virtually everything you need to know about language, and that what’s left is really some idiosyncratic change that you could learn from very little data. That’s one possibility. The other side of it is that a lot of these systems, this is much more important for things like speech recognition that we were talking about, is we want to move from using supervised learning, where a human being has to give you the correct answer in order for you to train your neural network, but move to unsupervised learning, where I could just give you a lot of raw audio and have you learn the mechanics of speech before I ask you to learn a new language. Hopefully that can also bring down the amount of data that we need.
Craig Cannon [08:22] – Then on the technical side, could you give us somewhat of an overview of how that actually works? How do you process a voice?
Adam Coates [08:31] – For text-to-speech?
Craig Cannon [08:31] – Let’s do both actually because I’m super interested, so speech-to-text.
Adam Coates [08:36] – Let’s start with speech recognition. Before we go and train a speech system, what we have to do is collect a whole bunch of audio clips, so for example, if we wanted to build a new voice search engine, I would need to get lots of examples of people speaking to me, giving me little voice queries. And then I would actually need human annotators or I need some kind of system that can give me ground truth, it can tell me for a given audio clip, what was the correct transcription. And so once you’ve done that, you can ask a deep learning algorithm to learn the function that predicts the correct text transcript from the audio clip. This is called supervised learning. It’s an incredibly successful framework, we’re really good with this for lots of different applications. But the big challenge there is those labels, that someone has to be able to sit there and give you, say, 10,000 hours worth of labels which can be really expensive.
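To make the supervised learning setup described above concrete, here is a minimal sketch. It is not Baidu's system: the two-number "feature vectors" stand in for real audio, the data is invented, and a nearest-centroid classifier is swapped in for a deep network purely to keep the example short. What it does show is the shape of the problem: paired (clip, transcript) examples go in, and a function that predicts a transcript for a new clip comes out.

```python
from collections import defaultdict

# Hypothetical toy data: each "clip" is a 2-number feature vector standing in
# for real audio, paired with the transcript a human annotator supplied.
labeled_clips = [
    ([0.9, 0.1], "yes"),
    ([0.8, 0.2], "yes"),
    ([0.1, 0.9], "no"),
    ([0.2, 0.7], "no"),
]


def train(examples):
    """Learn one average feature vector (centroid) per transcript label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for features, transcript in examples:
        for i, value in enumerate(features):
            sums[transcript][i] += value
        counts[transcript] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(model, features):
    """Return the transcript whose centroid is closest to the new clip."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


model = train(labeled_clips)
prediction = predict(model, [0.85, 0.15])  # an unlabeled clip
```

Every training example needs a human-supplied label, which is the cost Adam points to: at the scale of thousands of hours of audio, producing those transcripts is the expensive part.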
Craig Cannon [09:42] – What is the software doing to recognize the intonation of a word?
Adam Coates [09:47] – Well, traditionally, what you would have to do is break these problems down into lots of different stages. So I, as a speech recognition expert, would sit down and I would think a lot about, what are the mechanics of this language? So for Chinese, you would have to think about tonality and how to break up all the different sounds into some intermediate representation. And then you would need some sophisticated piece of software called a decoder that goes through and tries to map that sequence of sounds to possible words that it might represent. And so you have all these different pieces and you have to engineer each one, often with its own expert knowledge. But Deep Speech and all of the new deep learning systems we’re seeing now, try to solve this in one fell swoop. Really the answer to your question is kind of the vacuous one, which is, once you give me the audio clips and the characters that it needs to output, a deep learning algorithm can actually just learn to predict those characters directly. In the past, it always looked like there was some fundamental problem that maybe we could never escape this need for these hand-engineered representations, but it turns out that once you have enough data, all of those things go away.
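As a rough illustration of predicting characters directly from audio, the sketch below shows greedy decoding in the style of connectionist temporal classification (CTC). The interview does not name the technique, so treating the network output as per-frame scores with a blank symbol is an assumption here, and the alphabet and scores are made up. The idea is that the network emits one score per character per audio frame, and the transcript falls out by taking the best character in each frame, collapsing repeats, and dropping blanks.

```python
BLANK = "_"  # assumed "no character" symbol
ALPHABET = [BLANK, "c", "a", "t"]  # hypothetical tiny alphabet


def greedy_ctc_decode(frame_scores):
    """Turn per-frame character scores into text.

    frame_scores: one list of scores per audio frame, aligned with ALPHABET.
    """
    # Pick the highest-scoring symbol in every frame.
    best_per_frame = [
        ALPHABET[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for symbol in best_per_frame:
        # Keep a symbol only when it differs from the previous frame
        # and is not the blank.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


# Made-up scores for seven frames: c, c, blank, a, a, t, blank.
frames = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.8, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.1, 0.1, 0.1, 0.7],
    [0.9, 0.0, 0.05, 0.05],
]
text = greedy_ctc_decode(frames)
```

No pronunciation dictionary or phoneme inventory appears anywhere in this pipeline, which is the contrast with the hand-engineered stages described above. A production system would typically use a more elaborate search than this greedy pass.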
Craig Cannon [11:03] – Where did your data come from, 10,000 hours of audio so far?
Adam Coates [11:10] – We actually do a lot of clever tricks in English where we don’t have a large number of English language products. For example, it turns out that if you go onto, say, a crowdsourcing service, you can hire people very cheaply to just read books to you. And it’s not the same as the kinds of audio that we hear in real applications. But it’s enough to teach a speech system all about liaisons between words, and you get some speaker variation, and you hear strange vocabulary where English spelling is totally ridiculous. In the past, you would hand-engineer these things. You’d say, “Well, I’ve never heard that word before. So I’m going to bake the pronunciation into my speech engine.” But now it’s all data-driven, so if I hear enough of these unusual words, you see these neural networks actually learn to spell on their own, even considering all the weird exceptions of English.
Craig Cannon [12:08] – And you have the input, right? Because I’ve heard of people doing it with a YouTube video, but then you need a caption as well with the audio, so it’s twice as much work, if not more. Interesting, and so, what about the other way around? How does that work on the technical side?
Adam Coates [12:21] – That’s one of the really cool parts of deep learning right now, is that a lot of these insights about what works in one domain keep transferring to other domains. So with text-to-speech, you could see a lot of the same practices. So you would see that a lot of systems were hand-engineered combinations of many different modules. Each module would have its own set of machine learning algorithms with its own little tricks. One of the things that our team did recently with a piece of work that we’re calling Deep Voice, was to just ask, what if I rewrote all of those modules using deep learning for every single one? To not put them all together just yet, but even just ask, can deep learning solve all of these adequately to get a good speech system? It turns out the answer is yes. You can basically abandon most of this specialized knowledge in order to build all of the subsequent modules. And in more recent research that’s in the deep learning community, we’re seeing that everyone is now figuring out how to make these things work end to end. They’re all data-driven and that’s the same story we saw for Deep Speech, so we’re really excited about that.
Craig Cannon [13:32] – That’s wild. And so, do you have a team just dedicated to parsing research coming out of universities and then figuring out how to apply it, are you testing everything that comes out?
Adam Coates [13:42] – It’s a bit of a mix, it’s definitely our role to not only think about AI research, but to think about AI products and how to get these things to impact. I think there is clearly so much AI research happening that it’s impossible to look through everything. But one of the big challenges right now is to not just digest everything, but to identify the things that are truly important.
Craig Cannon [14:11] – So what’s a 100 million person product? You’re like, “Oh man.”
Adam Coates [14:15] – Well it’s the speech recognition we chose because we felt in aggregate, it had that potential. So as we have the next wave of AI products, I think we’re going to move from these bolted-on AI features, to really immersive AI products. If you look at how keyboards were designed a few years ago for your phone, you see that everybody just bolted on a microphone and they hooked it up to their speech API. And that was fine for that level of technology. But, as the technology’s getting better and better, we can now start putting speech up front. We can actually build a voice first keyboard. It’s actually something we’ve been prototyping in the AI lab, where you can actually download this for your Android phone. It’s called TalkType in case anybody wants to try it. But it’s remarkable how much it changes your habits. I use it all the time, and I never thought I would do that. And so, it emphasized to me why the AI lab is here. That we can discover these changes in user habits, we can understand how speech recognition can impact people much more deeply than it could when it was just bolted onto a product. And that spurs us on to start looking at the full range of speech problems that we have to solve and get you away from this close talking voice search scenario, and to one where I can just talk to my phone or talk to a device and have it always work.
Craig Cannon [15:45] – So as you’ve given this to a bunch of users, I assume, and gotten their feedback, have you been surprised with voice as an interface? I know lots of people talk about it. Some people say it doesn’t really make sense. For example, you see Apple transcribing voicemails now. Are there certain use cases where you’ve been surprised at how effective it is, and others where you’re like, I don’t know if this will ever play out?
Adam Coates [16:08] – I think the really obvious ones like texting seem to be the most popular. Like the feedback that is maybe the most fun for me is when people with thick accents post a review and they say, “Oh, I have this crazy accent I grew up with and nothing works for me, but I tried this new keyboard and it works amazingly well!” I have a friend who has a thick Italian accent and he complains all the time that nothing works.
Craig Cannon [16:38] – And it’s working?
Adam Coates [16:38] – And all of this stuff now works for different accents, because it’s all data-driven. We don’t have to think about how we’re going to serve all these different users. If they’re represented in the data sets and we get some transcriptions, we can actually serve them in a way that really wasn’t possible when we were trying to do it all by hand.
Craig Cannon [16:55] – That’s fantastic. And have you gone through the whole system? In other words, if I want to give myself an Italian-American accent, can I do that yet with Baidu?
Adam Coates [17:02] – We can’t do that yet with our TTS engine, but it’s definitely on the way.
Craig Cannon [17:07] – Okay, cool, so what else is on the way? What are you researching, what products are you working on, what’s coming soon?
Adam Coates [17:12] – Speech and text-to-speech, I think these are part of a big effort to make this next generation of AI products really fly. Once text-to-speech and speech are your primary interface to a new device, they have to be amazingly good and they have to work for everybody. I think there’s actually still quite a bit of room to run on those topics. Not just making it work for a narrow domain, but making it work for really the full breadth of what humans can do.
Craig Cannon [17:40] – Do you see a world where you can run this stuff locally or will they always be calling an API?
Adam Coates [17:46] – I think it’s definitely going to happen. One funny thing is that if you look at folks who maybe have a lot less technical knowledge and don’t really have the instinct to think through how a piece of technology is working on the back end, I think the response to a lot of AI technologies now, because they’re reaching this uncanny valley, is that we often respond to them as though they’re sort of human. And that sets the bar really high, our expectations for how delightful a product should be are now being set by our interactions with people. And one of the things we discovered as we were translating Deep Speech into a production system, was that latency is a huge part of that experience. That the difference between 50 or 100 milliseconds of latency and 200 milliseconds of latency is actually quite perceptible. And anything we can do to bring that down actually affects user experience quite a bit. We actually did a combination of research, production hacking, working with product teams, thinking through how to make all of that work, and that’s a big part of the translation process that we’re here for.
Craig Cannon [18:57] – That’s very cool. What happens on the technical side to make it run faster?
Adam Coates [19:04] – When we first started the basic research for Deep Speech, like all research papers, we choose the model that gets the best benchmark score, which turns out to be horribly impractical for putting online. And so after the initial research results, the team sat down with just a set of what you might think of as product requirements, and started thinking through what kinds of neural network models will allow us to get the same performance, but don’t require so much future context, they don’t have to listen to the entire audio clip before they can give you a really high-accuracy response.
Craig Cannon [19:44] – Doing the language prediction stuff, like the OpenAI guys were doing with the Amazon reviews, like predicting what’s coming next.
Adam Coates [19:51] – Maybe not even predicting what’s coming next. But one thing that humans do without thinking about it is if I misunderstand a word that you’ve said to me, and then a couple of words later, I pick up context that disambiguates it, I actually don’t skip a beat. I just understand that as one long stream. One of the ways that our speech systems would do this is that they would listen to the entire audio clip first, process it all in one fell swoop, and then give you a final answer. And that works great for getting the highest accuracy, but it doesn’t work so great for a product where you need to give a response online, give people some feedback that lets them know that you’re listening. And so you need to alter the neural network so that it tries to give a really good answer using only what it’s heard so far, but can then update it very quickly as it gets more context.
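The streaming behavior described above can be sketched as an interface: return a best guess after every new piece of input, and revise earlier words when later context disambiguates them. The sketch below is only a toy under stated assumptions. It works on already-recognized words rather than audio, and the `REVISIONS` table and `StreamingTranscriber` class are hypothetical stand-ins for what a neural network would learn.

```python
# Hypothetical homophone rules: (earlier guess, following word) -> corrected word.
REVISIONS = {
    ("their", "is"): "there",
    ("to", "much"): "too",
}


class StreamingTranscriber:
    """Emits a partial transcript after every word instead of waiting for the end."""

    def __init__(self):
        self.words = []

    def feed(self, word):
        """Accept the next recognized word and return the updated partial transcript."""
        if self.words:
            # Later context may disambiguate the previous word, so revise it in place.
            fix = REVISIONS.get((self.words[-1], word))
            if fix is not None:
                self.words[-1] = fix
        self.words.append(word)
        return " ".join(self.words)


transcriber = StreamingTranscriber()
partials = [transcriber.feed(w) for w in ["their", "is", "to", "much", "noise"]]
for partial in partials:
    print(partial)
```

The user sees feedback after the very first word, and the early guess "their" is quietly corrected once "is" arrives. That is the trade-off Adam describes: answer with what has been heard so far, then update quickly as more context comes in.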
Craig Cannon [20:44] – I’ve noticed over the past few years, people have gotten quite good at structuring sentences so Siri understands them. They put the noun in the correct position so it feeds back the data correctly. I found this when I was traveling. I was using Google Translate, and after one day, recognized that I couldn’t give it a sentence, but if I give it a noun, I could just show it to someone. And like if I just show bread, it will translate it perfectly. Do you find that we’re going to have to slightly adapt how we communicate with machines, or is your goal to communicate perfectly as we would?
Adam Coates [21:22] – I really want it to be human level. And I don’t see a serious barrier to getting there, at least for really high-valued applications. I think there’s a lot more research to do, but I sincerely think there’s a chance that over the next few years, we’re going to regard speech recognition as a solved problem.
Craig Cannon [21:39] – That’s very cool. So what are the really hard things happening right now? What are you not sure if it’ll work?
Adam Coates [21:46] – I think we were talking earlier about getting all this data. So for problems where we can just get gobs of labeled data, I think we’ve got a little bit more room to run there, but we can certainly solve those kinds of applications. But there’s a huge range of what humans are able to do, often without thinking, that current speech engines just don’t handle. We can deal with crosstalk and a lot of background noise. If you talk to me from the other side of a room, even if there’s a lot of reverberation and things going on, it usually doesn’t bother anybody that much. And yet, current speech systems often have a really hard time with this. But for the next generation of AI products, you’re going to need to handle all of this. And so a lot of the research that we’re doing now is focused on trying to go after all those other things. How do I handle people who are talking over each other or handle multiple speakers who are having a conversation very casually? How do I transcribe things that have a very long structure to them like a lecture, where over the course of the lecture, I might realize I misunderstood something? Some piece of jargon gets spelled out for me and now I need to go and transcribe it. So this is one place where our ability to innovate on products is actually really useful. We’ve just recently launched a product called Swift Scribe to help transcriptionists be much more efficient, and that’s targeted at understanding all of these scenarios where the world wants this long-form transcription. We have all of these conversations that we’re having that are just lost and we wish we had written down. But it’s just too expensive to transcribe all of it for everyday applications.
Craig Cannon [23:29] – In terms of emulating someone’s voice, do you have any concerns for faking it? Did you see the face simulation? I forget the researcher’s name so I’ll have to link to it. But you know what I’m talking about. Essentially you can feed it both video and audio, and you can recreate Adam talking. Do you have any thoughts on how we can prepare for that world?
Adam Coates [23:51] – No, I think in some sense, this is a social question. I think, culturally, we’re all going to have to exercise a lot of critical thinking. We’ve always had this problem in some sense that I can read an article that has someone’s name on it. And notwithstanding understanding writing style, I don’t know for sure where that article came from. I think we have habits for how to deal with that scenario, we can be healthily skeptical, and I think we’re going to have to come up with ways to adapt that to this brave new world. I think those are big challenges coming up, and I do think about them. But I also think a lot about all the positives that AI is going to have. I don’t talk about it too much. My mother actually has muscular dystrophy. Things like speech and language interfaces are incredibly valuable for someone who cannot type on an iPod because the keys are too far apart. These are just all these things that you don’t really think about, that these technologies are going to address over the next few years, and on balance, I know that we’re going to have a lot of big challenges of how do we as users adapt to all of the implications? But I think we’ve done really well with this in the past and we’re going to keep doing well with it in the future.
Craig Cannon [25:23] – Do you think AI will create new jobs for people or will we all be mechanical turks feeding the system?
Adam Coates [25:29] – I’m not sure, I think this is something where the job turnover in the United States every quarter is incredibly high, it’s actually shocking that the fraction of our workforce that quits one occupation and moves to another one is really high. I think it is clearly getting faster. We talked about this phenomenon within the AI lab here, where the deep learning research is flying ahead so quickly that we are often remaking ourselves to keep up with it and to make sure that we can keep innovating. I think that might even be a little bit of a lesson for everyone, that continual learning is going to become more and more important going forward.
Craig Cannon [26:14] – Yeah, so speaking of, what are you teaching yourself so the robots don’t take your job?
Adam Coates [26:20] – I don’t think we’re at risk of robots taking our jobs right now. Actually, it’s kind of interesting. We’ve thought a lot about, how does this change careers? One thing that has been true in the past is that if you were to create a new research lab, one of the first things you’d do is fill it with AI experts, where they live and breathe AI technology all day long. I think that’s really important. I think for basic research, you need that kind of specialization. But because the field’s moving so quickly, we also need a different kind of person now. We also need people who are chameleons, who are these highly-flexible types that can understand and even contribute to a research project, but can also simultaneously shift to the other foot and think about, how does this interact with GPU hardware and a production system, and how do I think about a product team and user experience? Because often, product teams today can’t tell you what to change in your machine learning algorithm to make the user experience better. It’s very hard to quantify where it’s falling off the edge. And so you have to be able to think that through to change the algorithms. You also have to be able to look at the research community to think about what’s possible and what’s coming. And so, there’s an amazing full stack machine learning engineer that’s starting to show up.
Craig Cannon [27:44] – Where are they coming from? If I want to be that person, what do I do now? Say I’m 18.
Adam Coates [27:48] – They seem to be really hard to find right now.
Craig Cannon [27:50] – I would believe it!
Adam Coates [27:52] – In the AI lab, we’ve really set ourselves to just creating them. I think this is the way unicorns are, that we have to find the first few examples and see how exciting that is and then come up with a way for people to learn and become that sort of professional. Actually one of the cultural characteristics of our team is that we look for people who are really self-directed and hungry to learn. Things are going so quickly, we just can’t guess what we’re going to have to do in six months. Having that do-anything attitude of saying, well, I’m going to do research today, and think about research papers. But wow, once we get some traction and the results are looking good, we’re going to take responsibility for getting this all the way to 100 million people. That’s a towering request of anyone on our team, and the things that we find help everyone connect to that and do really well with that are being really self-directed and able to deal with ambiguity, and also being really willing to learn a lot of stuff that isn’t just AI research, but is also stepping way outside of comfort zones and learning about GPUs and high-performance computing, and learning about how a product manager thinks.
Craig Cannon [29:09] – Okay, this has been super helpful. If someone wanted to learn more about what you guys are working on or even just things that have been influential to you, what would you recommend they check out on the internet?
Adam Coates [29:19] – Oh my goodness. I’ll have to think about this one for a second here. I think the stuff that’s actually been quite influential for me is actually startup books. I think, especially with big companies, it’s easy to think of ourselves in silos, of having a single job. One idea from the startup world that I think is really amazingly powerful is this idea that a huge fraction of what you’re doing is learning. There’s a tendency, especially amongst engineers, and I count myself a member, that we want to build something. And so, one of the disciplines we all have to keep in mind is that we all have to be really clear-eyed and think about, what do we not know right now? And focus on learning as quickly as we can to find the most important part of AI research that’s happening and find the most important pain point that people in the real world are experiencing, and then be really fast at connecting those. And I think a lot of that influence on my thinking is coming from the startup world. There you go.
Craig Cannon [30:33] – That’s a great answer. Okay, cool, thanks man.
Adam Coates [30:35] – Thanks so much.
Craig Cannon [30:38] – Alright, thanks for listening. So please remember to rate the show and subscribe wherever you listen to podcasts, and if you’d like to read the transcript or watch the video, you can check out blog.ycombinator.com. Alright, see you next time.