Craig Cannon [00:00] – Hey, this is Craig Cannon and you’re listening to Y Combinator’s podcast. Today’s guest is Wojciech Zaremba, who is a co-founder of OpenAI. OpenAI is a non-profit AI research company. They’re focused on discovering and enacting the path to safe artificial general intelligence. This episode is a bit of a primer on AI, as we have several AI interviews coming up, and they tend to be a bit more specific than this one. Alright, here we go. Hey, today we have Wojciech Zaremba, and we’re gonna talk about AI. So, Wojciech, could you give us a quick background?

Wojciech Zaremba [00:30] – I’m a founder at OpenAI, I’m working on robotics, I think that deep learning and AI is great application for robotics. Prior to that, I spent a year at Google Brain, and I spent a year at Facebook AI Research, and, at the same time, I graduated from… I finished my PhD at NYU.

Craig Cannon [00:55] – Can you explain how you pulled that off? That seems pretty rare.

Wojciech Zaremba [00:58] – So, the great thing about both of these organizations, is that they are focused on research, so throughout my PhD I was actually publishing papers over there. I highly recommend both organizations, as well as, of course, OpenAI.

Craig Cannon [01:15] – Yeah, okay, so most people probably don’t know what OpenAI is, so could you just give a quick explanation?

Wojciech Zaremba [01:21] – So, OpenAI focuses on building AI for the good of humanity, we are a group of researchers and engineers, collaborating together to essentially try to figure out what are the missing pieces of artificial, of general artificial intelligence, and how to build it in a way that would be maximally beneficial to humanity as a whole. OpenAI is greatly supported by Elon Musk and Sam Altman, in total, we gathered an investment of one billion dollars, in that group.

Craig Cannon [02:04] – Which is quite a lot. And so, what are, I mean, I know some, but what are the OpenAI projects?

Wojciech Zaremba [02:10] – So, there is several large projects going on simultaneously. We have also, we are doing also basic research, so let me first enumerate large projects. These are robotics, so in terms of robotics, we are working on manipulation, we think that manipulation is the complete, it’s the one of the parts of robotics which is the most unresolved.

Craig Cannon [02:37] – Sorry, just to clarify, what does that mean exactly?

Wojciech Zaremba [02:40] – So, it means that, so in robots there are essentially three major families of tasks. One is locomotion, which means how to move from, let’s say how to walk, how to move from point A to point B. Second is navigation, it’s moving in complicated environments, such as, for instance, a flat or a building, and you have to figure out, actually, which rooms you have visited before, which not, and where to go. And the last one is manipulation, so it means, you want to grasp an object, let’s say, open an object, place objects in various locations. And the third one is the one which is currently the most difficult. So, it turns out that when it comes to arbitrary objects, current robots are unable to just grasp an arbitrary object. For any object, it’s possible to hand-code a single solution, so, say as long as, let’s say in factory, we have same object, like, I don’t know, we are producing glasses, and there exists a hand-coded solution to it, there is a way to, by code, to write the program saying, “Let’s place a hand in the middle of the glass and then let’s close it.” But there is no way, so far, to write a program such that it would be able to grasp an arbitrary object.

Craig Cannon [04:13] – Okay, got you, and then, just very quickly, the other OpenAI projects going on?

Wojciech Zaremba [04:19] – So, another one has to do with playing a complicated computer game, and the third one has to do with playing large number of computer games, and you might ask why it’s interesting? And in some sense, I would like to say, that, so human… has an incredible skill of being able to learn extremely quickly, and it has to do with prior experience. So let’s say, even if you have never played volleyball, if you try it out for the first time, within 10 or 15 minutes you would be able to grasp how to actually, how to play. And it has to do with all the prior experience that you have from different games. If you would put a child, or like, if you would put an infant on the volleyball court and ask him or her to play, it would fail miserably. But, I mean, due to the fact that it has experience coming from large number of other games, or let’s say other life situations, it’s able to actually transfer all the knowledge. So, at OpenAI we’re able to pull together large number of computer games, and computer games can be, it’s quite easy to quantify how good you are in the computer game. Currently best AI systems, so, first of all, it’s possible for many computer games to write a program that solves it pretty well, or plays it well. There are also results from, results in terms of reinforcement learning, or in terms of so-called deep reinforcement learning, showing that it’s possible to learn how to play a computer game, these are like, the initial results are coming from DeepMind. But simultaneously… Simultaneously, it takes extremely long time, like in terms of real-time execution, to learn to play computer games. So, for instance, Atari games, for instance, in terms of real-time execution, it takes something around three years of play to learn to play simple games. I mean, it can be hugely parallelized, therefore it takes few days to train it on current computers, but it’s way shorter for a human. In 10 minutes we can kind of–

Craig Cannon [07:12] – Teach it how to play and win?

Wojciech Zaremba [07:13] – Yes.

Craig Cannon [07:13] – Okay. And is that through you, giving it feedback?

Wojciech Zaremba [07:17] – So the way how it works in the case of computer games, the feedback comes from the score, so it looks at the score in the game and tries to optimize it. And, I’d say that’s kind of reasonable, but I would say, simultaneously, it’s not that satisfying to me. So, the reason why it’s not that satisfying to me, so, the assumption, underlying reinforcement learning, is that there is some environment, and in the environment you are an agent, and you are acting in an environment by executing actions and getting rewards from the environment. And the rewards might be thought of as, let’s say, pleasure, or so. And the main issue is that it’s actually not that easy to figure out what are the rewards in the real world. Further on, other underlying assumption is in being able to reset environment, to kind of get to repeat the same situation, so the system can try thousands or millions of times to actually finish a game. So there are some small discrepancies. People also believe that it might be possible, somehow, to hard-code into system rewards, but I would say, that’s actually one of the big issues, that it’s kind of unresolved. Like, when I look how my nephew plays computer game, he actually, he doesn’t look at the score, because he cannot read.

Craig Cannon [08:46] – Wow, okay.

Wojciech Zaremba [08:46] – Yeah, they can play pretty well. So, I mean, you can say, maybe the reward is somewhat different, maybe the reward comes from, like seeing a nice, hearing a nice voice in the game, or so, but I would say it’s very unclear how to build a system, and what the system should optimize, so, in some sense, if we have a metric that we want to optimize, it’s possible to build a system that could optimize for it, but it turns out that, in many cases, it’s not that easy. And I would say, that’s actually one of the motivations of why I wanted to work on robotics, because, in case of robotics, it’s way closer to the system that we care about. So, what I mean by that, for instance, let’s say you would like your robot to prepare scrambled eggs for you, and, so the question is, so how should I build the reward? And in computer games, actually, the nice thing is you are getting reward extremely frequently, so, let’s say anytime you kill an enemy, or let’s say, won’t die, it’s quite great, but in case of scrambling eggs, it would mean, the way how people write the rewards for systems, it would mean distance from hand to a pan. Then, let’s say, somehow we have to quantify if the egg… if you are able to crack open an egg, or, let’s say, if you fried it sufficiently, and how to kind of quantify it turns out to be extremely difficult, and also there is no way even to reset the system, how to reset the system to the same place? So, these are like fundamental issues, and the reason why I am personally interested in robotics is, I think that actually these challenges will tell us how to solve it.
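
The reinforcement-learning setup described in this part of the conversation (an agent acting in an environment, receiving rewards, with the environment being resettable) can be sketched roughly as follows. This is a toy illustration, not anything OpenAI uses: the `Corridor` environment, the `run_episode` helper, and both policies are hypothetical names invented for the example, and the reward is assumed to arrive only when the goal is reached.

```python
class Corridor:
    """Toy environment: the agent starts at cell 0 and must reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the start so the same situation can be retried."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # assumption: reward only at the goal
        return self.position, reward, done


def run_episode(env, policy, max_steps=20):
    """Let the policy act until the episode ends; return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        state, reward, done = env.step(policy(state))
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return +1


def always_left(state):
    return -1


env = Corridor()
print(run_episode(env, always_right))  # this policy reaches the goal
print(run_episode(env, always_left))   # this policy never leaves the start
```

Both assumptions questioned in the conversation are visible here: `step` hands back a ready-made numeric reward, and `reset` restores the exact starting situation. Neither comes for free when the task is a robot scrambling eggs.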

Craig Cannon [10:49] – So, let’s start by defining a couple of things. So, what is artificial intelligence, what is machine learning, and then what is deep learning?

Wojciech Zaremba [10:57] – Okay, these are pretty good questions.

Craig Cannon [11:00] – Okay.

Wojciech Zaremba [11:02] – So, artificial intelligence is actually extremely broad, it’s extremely broad domain, and machine learning is sub-part of this domain, and, in a sense, artificial intelligence consists of any writing, any software that tries to solve some problems through some intelligence. It might be hand-coded solution, rule-based system, yeah, so, pretty much it’s actually very hard to say what is not artificial intelligence. You can say that, so, initial version, for instance, of Google search, was based on, it was avoiding any machine learning, and it was… there was like a well-defined algorithm called PageRank, and essentially PageRank counts how many incoming links are from other websites, and that’s artificial intelligence. It’s essentially a system that does intelligent things for you. Then, over time, Google search started to use machine learning, because it helps to improve results, but simultaneously they wanted to avoid it for some time, as it’s more difficult to interpret results and it’s more difficult to actually understand what system does. So, what is machine learning? Machine learning, it’s essentially, a way of building, or let’s say, as a, essentially, you have data, and you would like to generate, based on data, program with some behaviors. So, like, the most common example, which is still sub-branch of machine learning, so-called “supervised learning.” So, you have pairs of examples, X comma Y, which means, “I would like to map X to Y,” for instance, given an e-mail, is it spam or not spam, or, let’s say, given an image, what is the category of the image, or, for instance, to whom should I recommend a given product? And, based on this data, you would like to generate the program, some sort of like, a black box, or some function that, for new examples, would be able to give you similar answers. And that’s an example of supervised learning. But, in that sense, machine learning means that you would like to generate program from data.

Craig Cannon [13:59] – Okay.

Wojciech Zaremba [14:00] – And this usually is a statistical machine learning method, so somehow you count how many times given events occurred or so.
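
A minimal sketch of "generating a program from data" by counting, using the spam example mentioned above. The tiny dataset, the `train` and `predict` functions, and the scoring rule (compare raw word counts) are all assumptions made for illustration; real spam filters are considerably more careful.

```python
from collections import Counter


def train(examples):
    """Count how often each word appears in spam and in non-spam e-mails."""
    counts = {True: Counter(), False: Counter()}
    for text, is_spam in examples:
        counts[is_spam].update(text.lower().split())
    return counts


def predict(counts, text):
    """Label new text as spam if its words were seen more often in spam."""
    words = text.lower().split()
    spam_score = sum(counts[True][w] for w in words)
    ham_score = sum(counts[False][w] for w in words)
    return spam_score > ham_score


# Pairs (X, Y): the e-mail text and whether it is spam.
examples = [
    ("win money now", True),
    ("cheap money offer", True),
    ("meeting at noon", False),
    ("lunch at noon tomorrow", False),
]
model = train(examples)
print(predict(model, "free money offer"))
print(predict(model, "meeting tomorrow"))
```

The point of the sketch is that nobody wrote a rule such as "money means spam" by hand; the behavior of `predict` comes entirely from the counts collected in `train`.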

Craig Cannon [14:11] – Okay, got you, and then the third being deep learning.

Wojciech Zaremba [14:14] – So, deep learning, that’s also one paradigm in terms of machine learning, and idea behind it is ridiculously simple.

Craig Cannon [14:32] – Okay.

Wojciech Zaremba [14:34] – So, people realized that if you want to, as I said, machine learning means that you get data as the input, and program as the output, and deep learning says that the computation of the program, then that what I am actually doing with this data should involve many steps. Not one step, but many.

Craig Cannon [15:03] – Okay.

Wojciech Zaremba [15:05] – And, pretty much that’s it, in terms of meaning of deep learning. So, you might ask, why it’s so popular now, and how it’s so different from what was there before? So if you assume that you do one step of computation, let’s say that you take your data, and you kind of have single if statement, or a small number of if statements, then, like, for instance, say, if you have a, no, let’s say your data is a recording from a stock market, and you are saying, you’re gonna sell or buy depending on values bigger or smaller than something, or if, let’s say, depending on who is the new president, or so, you are making some decisions. So, in a sense, it turns out that, in case of models that are based on single step, people are able to prove plenty of stuff mathematically, and in terms of models that require multiple steps of computation, mathematical proofs are extremely weak. And for a long time, models that have the single step of computation, they were out-performing models that do many steps of computation. But recently, it kind of changed, and it was, for many people, it was obvious for a long time that true intelligence cannot be done in single step, but it would require many steps. But, so far, many systems actually, they worked in the way that they had, kind of, very, very shallow, they were very shallow, but simultaneously, extremely gigantic. So, what I mean by that. You could generate, let’s say, for the task of interest, let’s say the recommendation, you could generate large number of features. Let’s say thousands of them. These are features saying, for instance… let’s say you want to do movie recommendation, you can say, “Is movie longer or shorter than two hours? Is it longer or shorter than one hour?” That’s, there are two features. 
You can say, “Is it drama, is it thriller, is it something else?” You can generate million of these, and then, or let’s say 100,000, that’s actually quite a reasonable value, and then your shallow classifier can essentially determine based on the combination of these features, either to recommend it to you, or not.

Craig Cannon [17:41] – Okay.

Wojciech Zaremba [17:43] – In case of deep learning, you would say, “Let’s kind of combine it for multiple steps.” And that’s essentially, that’s entire difference, and in case of deep learning, the most successful embodiment of deep learning is in terms of neural networks.

Craig Cannon [18:07] – Okay, so let’s define that, too.

Wojciech Zaremba [18:09] – So, neural networks is also extremely simple concept. And that’s something people came up with a long time ago, and it means, this follows, it’s… You have an input, this might be say, vector, or it might have some additional structure, like, let’s say image, so it’s kind of a matrix, two dimensional, and… Neural network, it’s a sequence of layers, layers are represented by matrices, and what you do is, you multiply your input by a matrix, and apply some non-linear operation, and multiply it again by a matrix, and apply non-linear operation, you might ask, “Why would I even need to apply this non-linear operation?” It turns out that if you would multiply by two matrices, it can be reduced to the multiplication by single matrix. Like, a composition of two linear operators can be written as single linear operator. You could multiply these matrices together, and the result of, you could condense it into single matrix. And, non-linearity is something like, there are classical non-linearities, but, say, there are extremely large number of variants in terms of what I said, but what I just described is so-called feed-forward neural network, so, it essentially takes input, multiplies it by a matrix, non-linearity, multiplies it by matrix. Examples of non-linearity, there is one which is classical, something called sigmoid, so, sigmoid is a function that has a shape of an “S” character, “S” letter, it’s kind of close to zero for negative values, it grows to half at zero and then goes up to one when the values are larger, kind of modulates the input, and that’s the most classical version of activation function. It turns out that one which is even simpler empirically works way better, which is called ReLU, rectified linear unit, and this one, it’s ridiculously simple. ReLU is just maximum of 0 and x. So, when you have a negative value, it’s zero, when you have a positive value, just copy that value, and that’s it. So, you might ask, so.
First of all, what are the successes of deep learning, why we actually believe that it works, why, what change, and why it’s so much different than it was before, and there are like some few differences.
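
The claim that two matrix multiplications collapse into one, and that a non-linearity in between prevents this, can be checked with a small sketch. The matrices `w1` and `w2` and the input `x` are arbitrary values chosen for illustration, and `matmul` is a hand-written helper so that the example needs only the standard library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def sigmoid(x):
    """S-shaped function: near 0 for negative x, 0.5 at 0, near 1 for large x."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def apply(fn, matrix):
    """Apply a scalar function to every entry of a matrix."""
    return [[fn(v) for v in row] for row in matrix]


x = [[1.0, -2.0]]                      # one input, written as a 1x2 row vector
w1 = [[1.0, 2.0], [3.0, 4.0]]          # first layer (arbitrary values)
w2 = [[0.5, -1.0], [1.0, 0.0]]         # second layer (arbitrary values)

two_linear_steps = matmul(matmul(x, w1), w2)
one_linear_step = matmul(x, matmul(w1, w2))           # same result, single matrix
with_relu = matmul(apply(relu, matmul(x, w1)), w2)    # no longer collapsible

print(two_linear_steps, one_linear_step, with_relu)
print(sigmoid(0.0), relu(-3.0), relu(2.5))
```

Without the non-linearity the "two-layer" network is exactly a one-layer network in disguise, which is why every layer is followed by something like sigmoid or ReLU.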

Craig Cannon [21:27] – Yeah.

Wojciech Zaremba [21:29] – This is a good question–

Craig Cannon [21:30] – No, it’s exactly where I was gonna go, but I was gonna ask, beforehand, yeah, why neural networks are a thing now, as opposed to in the past?

Wojciech Zaremba [21:41] – The main difference is, all of a sudden we can train them to solve various problems, and let’s say one family of problems, these are problems in supervised learning, so, better than any other method, they can map these examples to labels, and then on the hold out data on test data, they outperform anything else, and, in many cases, they get superhuman results.

Craig Cannon [22:06] – And is that just a function of computational power that we have access to?

Wojciech Zaremba [22:10] – When it comes to models, and neural networks is an example of model, there is always that question. So, how to figure out parameters of a model. So, there is some training procedure, and the most common procedure for neural networks is so-called stochastic gradient descent, it’s also ridiculously simple procedure, and it turns out that empirically it works very well. So, people came out with vast number of learning algorithms, stochastic gradient descent is an example of one learning algorithm, there are others. Let’s say there is something called Hebbian learning that’s motivated by the way how neurons in human brain learn. But this one, so far, empirically is working the best.

Craig Cannon [23:00] – Okay, so then let’s go to your, the question you asked yourself, which is, “Why now?” What’s happening to make people care about it right now?

Wojciech Zaremba [23:12] – So, since 20 years ago, there were several small differences in terms of how people train neural networks. And there is a large increase in computational power. So, I can speak about the major advances. So, number one advance, I would say, that’s even, the one advance that’s actually an old one, but it seems to be extremely critical, it’s something called Convolutional Neural Network.

Craig Cannon [23:47] – Okay, and what does that mean?

Wojciech Zaremba [23:52] – It’s actually a very simple concept. So, let’s say your input is an image, and let’s say your image is of a size 200 x 200, that has also, let’s say, three colors, so, the number of values in total is actually 120,000, so, if you would actually squash it into a vector, this vector would be of this size, okay, and you can think, that if you would like, let’s say to apply neural network, to essentially multiply it by a matrix, and let’s say, if you would like to have output of the multiplication of similar size, let’s say 120,000, then, all of a sudden, the matrix, to multiply it, would be of a gigantic size. And learning, learning consists of estimating parameters for neural network. It turns out that, empirically, that wouldn’t essentially work, that if you would use algorithm of backpropagation, you would get quite poor results. And people realized that in case of images, you might want to multiply by a little bit special matrix that also allows to do way faster computations. So, you can think that neural network as it applies some computation to the input, so neural network applies some computation to the input, you might want to constrain this computation in some sense, so, you might think, as you will have several layers, maybe initially you would like to do very local computation, and it should be pretty much similar in every location. So, you would like to apply the same computation in the center as in the corners, maybe later on you need some diversification, but, you want to pre-process image the same way. So the idea is that, when you take an image, or any, actually, two-dimensional structures, so, another example is, you can take voice, and it turns out that you can, by applying Fourier transform you turn voice into image, and all the, it’s like a–

Craig Cannon [26:26] – So, much like a wave form?

Wojciech Zaremba [26:28] – Yes, so you take a wave form, and you apply Fourier transform, and essentially, on the X-axis you have time as the speech goes down, and on the Y-axis you have different frequencies, and that’s an image, and speech-recognition systems, they also treat sound as it would be an image.

Craig Cannon [26:55] – I didn’t realize that, that’s really cool. Okay.

Wojciech Zaremba [26:58] – That’s why I’m saying, that the technique… Also, as a kind of a side-track, the cool thing about neural networks is, it used to be the case, that people specialized in processing text, images, sound? And, these days, this is the same group of people.

Craig Cannon [27:21] – Mmhmm, that’s really cool.

Wojciech Zaremba [27:23] – They are using the same methods. So, coming back to what is convolutional neural network, as I mentioned, you would like to apply the same computation all over the place in the image, and essentially convolutional neural network says when we take an image, let’s just connect neuron with local values on the image, and let’s copy the same weights over and over again. So, this way you will multiply, kind of, values in the center, in the corners, by the same values in the matrix. And so an input to the convolution is an image, and output is kind of also an image, you can think that, there is also some specific vocabulary, so these are kind of three-dimensional images, like, you have height, width, and you have also depth, so let’s say, in case of an image that’s three dimensions of depth, and then you apply convolution, you can kind of change number of depth dimensions, usually people go to, let’s say, I don’t know, 100 dimensions or so.
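
A rough sketch of the weight sharing described above, plus the parameter arithmetic from the 200 x 200 x 3 example. The `conv2d` helper, the toy 4 x 4 image, and the edge-detecting kernel are illustrative assumptions; the 3 x 3 kernel size used in the parameter count is also an assumption, while the 100 output depth is the figure mentioned in the conversation.

```python
def conv2d(image, kernel):
    """Slide one small kernel over the image, reusing the same weights everywhere."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1]]          # responds where brightness increases to the right
feature_map = conv2d(image, edge_kernel)
print(feature_map)

# Parameter counts for the 200 x 200 x 3 image discussed above.
inputs = 200 * 200 * 3                 # values in the flattened image
dense_weights = inputs * inputs        # full matrix mapping to an output of the same size
conv_weights = 3 * 3 * 3 * 100         # assumed 3x3 kernel, depth 3 in, depth 100 out
print(inputs, dense_weights, conv_weights)
```

The same two kernel weights are applied at every position of the image, which is why the convolutional layer needs thousands of parameters where the full matrix would need billions.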

Craig Cannon [28:48] – Okay, got you.

Wojciech Zaremba [28:49] – And then you kind of have several of these layers, and then there are so-called “fully connected layers” which are just, conventional matrices. So, I would say, that’s one of advances, that actually happened 20 years ago, already. And another one which is, it might sound kind of funny, but, for a long time people didn’t believe that it’s possible to train deep neural networks, and they were thinking quite a lot about what are the proper learning algorithms, and it turns out that… So, let’s say, when you train a neural network, you start off by initializing weights to some random values, and it turns out that it’s very important to be careful to what magnitudes you initialize with. And, if you set it to the right values, and I can even give you some, let’s say, intuition of what it means, turns out that, then, simplest algorithm, which is called stochastic gradient descent, actually works pretty well.

Craig Cannon [30:00] – Okay.

Wojciech Zaremba [30:01] – So, in some sense, as I said, let’s say, layers of neural network, they kind of multiply, they multiply input by matrices, and the property that you would like to retain, you don’t want the magnitude of values to blow up, and also you don’t want it to shrink down. And if you kind of multiply, if you choose random initialization, it’s easy to choose some initialization that will kind of, you know, turn, the magnitude will keep increasing, and then if you have 10 layers, and let’s say, in each of them, you multiply by two two two two two?

Craig Cannon [30:42] – Yeah, yeah.

Wojciech Zaremba [30:43] – And then the output, all of a sudden, is of completely different magnitude, and learning is not happening anymore. And if you kind of just choose them, and it’s a matter of choosing the variance, like the magnitude, of initial weights, and if you set it such that, say, output is of the same magnitude as input, then everything works.
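
The "multiply by two, ten times" arithmetic can be written out directly. This is a deliberately simplified sketch: each layer is reduced to a single scalar `scale` instead of a random matrix, and `magnitude_after` is a hypothetical helper name.

```python
def magnitude_after(layers, scale, start=1.0):
    """Track a value's magnitude as each layer multiplies it by `scale`."""
    value = start
    for _ in range(layers):
        value *= scale
    return value


print(magnitude_after(10, 2.0))   # grows by a factor of 2**10
print(magnitude_after(10, 0.5))   # shrinks by a factor of 2**10
print(magnitude_after(10, 1.0))   # stays the same size as the input
```

In a real network the role of `scale` is played by the variance of the randomly initialized weights, and the goal of a careful initialization is to keep each layer's effect close to the third case.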

Craig Cannon [31:05] – So, basically just adjusting those magnitudes was what proved that you could do this with a neural network?

Wojciech Zaremba [31:10] – Yes.

Craig Cannon [31:11] – Oh, wow, okay.

Wojciech Zaremba [31:12] – That’s kind of ridiculous that, let’s say, people haven’t realized it for a long time, but that’s what it is.

Craig Cannon [31:17] – And when did, when and where did that happen?

Wojciech Zaremba [31:20] – It happened actually at the University of Toronto.

Craig Cannon [31:23] – Oh, okay.

Wojciech Zaremba [31:24] – So at the Geoffrey Hinton lab. So the crazy thing is, people had several schemes in terms of how to train deep neural networks, and one was called generative pre-training. And, so let’s say there was some scheme work to do, in order to get to such a state of neural network that all of a sudden you can use this algorithm called stochastic gradient descent, so there was like an entire involved procedure, and at some point, Geoffrey asked his students to, you know, compare it to the simplest solution, which would be adjusting magnitudes, and like showing how big a difference there is.

Craig Cannon [32:14] – That’s crazy, man. Oh my God, okay, so. A question that’s a little bit broader, is just like, then what has happened in the past, say, five years to excite people so much about AI?

Wojciech Zaremba [32:27] – So, I would say the most stunning were so-called ImageNet results. So, first of all, I should tell you, where was computer vision five years ago? And then I will tell you what is ImageNet, and then I will tell you about the results. So, computer vision is a field where, essentially, you try to make sense of images, like a computer tries to interpret what is on the images. And it’s extremely simple to say, “Oh, here on an image there is a cow, a horse, or so,” but for a computer, images is just a collection of numbers. So, it’s a large matrix of numbers, and it’s very difficult to say, oh, how to, it’s very difficult to interpret what’s the content. And, it was the case that people came out with various schemes of how to do it, you know, you could imagine, I don’t know, let’s quantify how much of a brown color there is, such that you can say it’s a horse. Like a simple stuff. People, of course came out with more clever solutions, but the systems were quite bad. I mean, you could feed a picture of a sky to the system, and it was telling you that there is a car.

Craig Cannon [34:00] – So, not so good.

Wojciech Zaremba [34:01] – Yeah, so. Then Fei-Fei Li, Fei-Fei Li is a professor at Stanford. She, together with her students, she collected a large dataset of images, and the dataset is called ImageNet, it consists of one million images in 1,000 classes, so that was, at the time, actually, the largest dataset of images.

Craig Cannon [34:33] – And a class, just to clarify, being like, “car” might be a class?

Wojciech Zaremba [34:37] – Yes, so, there is… The dataset, I would say, is not perfect, it has, for instance, it doesn’t contain people, that was one of the constraints over there. It contains large number of breeds of dogs. And so that’s a quirky thing about it, but at the same time, I mean, that’s essentially the dataset that made deep learning happen.

Craig Cannon [35:05] – Types of dogs.

Wojciech Zaremba [35:07] – No, the fact that it’s so large. So, what happened, there was like plenty of teams actually participating in ImageNet competition. And let’s say, even as I’m saying, there is 1,000 classes over there, so if you have a guess, a random guess, then the probability that your guess is correct is essentially 0.1%. The metric there was slightly different, you actually, if you make five guesses and if one of them is correct, then you are good, because there might be some other objects, and so on. And I remember, for the first time when I have seen that someone made that system, that someone created that system that had 50% error, I was impressed, okay, I was like, “Oh, man.” It’s like, 1,000 classes and it can say with 50% error what is there, I was quite impressed. But then, during competition, pretty much, like, all the teams got around 25% error rate, there was a difference by 1%, or like, for instance, a team from the University of Amsterdam, Japanese team, like plenty of people around the world, and a team from the University of Toronto, led by Geoffrey Hinton, and that’s like, the team was Alex Krizhevsky and Ilya Sutskever, they actually got to something like 15%. So, let’s say, all other teams, they were like at 25%, the difference was 1%, and these two guys, they got to 15%, okay? And, crazy thing is that, we… So within the following three years on this dataset, the error dropped dramatically, I remember, like, next year the error got to like 11%, 8%, I was kind of, you remember, by that time I was wondering, “What’s the limit? How good can you be?” And I was thinking 5%, that’s the best. And even there were humans trained to see how far they can get, they spend arbitrary amount of time on, let’s say, looking on other images and kind of comparing to be able to figure out what is there, I mean, it’s not that simple for a human. For instance, if you have plenty of breeds of dogs, and, like, who knows.
But let’s say, if you can use some external images to kind of compare, then that helps. But in a sense, within several years people got down, I believe, to 3% error, and that’s essentially superhuman performance. And, as I’m saying, it used to be the case, that systems in computer vision, you’d take a picture of sky, they were telling you it was a car. And all of a sudden you are getting to superhuman performance, and it turns out that these results are not just limited to computer vision, people are able to get amazing other systems, let’s say, speech recognition, so–

Craig Cannon [38:38] – ‘Cause that’s like the underlying question, right? Because, like, it’s not, I mean, to someone not in the field, like me, it’s not necessarily intuitive that computer vision, computer image recognition, would seed artificial intelligence. So, I mean, what came after that?

Wojciech Zaremba [38:55] – So, in a sense… The crazy thing is that the same architectures work for various tasks, and all of a sudden that fields which seem to be unrelated, they start to benefit from each other. So, simultaneously, it turns out that problems in speech recognition can be solved in very similar way, you can essentially take speech, apply Fourier transform, and then speech starts to look like an image. And you apply similar object-recognition network to kind of recognize what are the sounds over there, and like a phonemes, and so phonemes are like kinds of sounds, so, then you can turn it into text.

Craig Cannon [39:52] – And so that’s right, so it went to speech after images, and then, yeah.

Wojciech Zaremba [39:56] – And the next big thing was essentially a translation. The translation was extremely surprising to people, that’s the result by Ilya Sutskever. So, translation is an example of another field that actually lived there on its own. And one of the crazy things about translation is, input is of a variable length, and output is of variable length, and it was unclear even how to kind of consume it with neural network, how to consume variable length input, how to produce variable length output. And Ilya came out with an idea. There is something called recurrent neural network, so, meaning… Let’s say recurrent neural network and convolutional neural network, they share an idea, which is… You might want to use the same parameters if you are doing similar stuff, and in case of convolutional network, it means, let’s share the same parameters in space, so, let’s apply the same transformation to the middle of image as in the corners, and so on, and in case of recurrent neural network, let’s say… we will be reading text from left to right. I can consume first word, can create some hidden state representation, and then, next time step, when I’m consuming next word, I can take it together with this hidden representation, and generate next hidden representation, and you are applying the same function over again, and this function consumes hidden representation and next word, hidden representation and the word, hidden representation and the word. So it’s relatively simple, the cool thing is, if you are doing it this way, regardless of length of your input, you have the same size of a network. And that’s the way how his model worked, and that’s described in a paper called “Sequence to Sequence.” The model essentially consumes, word-by-word, a sentence that you want to translate, and then when you are about to generate translation, you essentially start emitting word by word, and at the end, when you emit a dot, that’s the end.
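
A minimal sketch of the recurrent idea described above: one function, applied over and over, that takes the previous hidden representation and the next word. The weights `W_HIDDEN` and `W_INPUT`, the hidden size of three, and the toy `word_to_number` encoding (word length) are all made-up assumptions for illustration; this is only the reading half of the idea and not the model from the paper.

```python
import math

HIDDEN_SIZE = 3

# Fixed, made-up parameters shared across every time step.
W_HIDDEN = [
    [0.1, -0.2, 0.0],
    [0.3, 0.1, -0.1],
    [0.0, 0.2, 0.1],
]
W_INPUT = [0.5, -0.4, 0.3]


def word_to_number(word):
    """Toy stand-in for a word embedding: just the scaled word length."""
    return len(word) / 10.0


def step(hidden, word):
    """Combine the previous hidden state with the next word into a new state."""
    x = word_to_number(word)
    return [
        math.tanh(
            sum(W_HIDDEN[i][j] * hidden[j] for j in range(HIDDEN_SIZE))
            + W_INPUT[i] * x
        )
        for i in range(HIDDEN_SIZE)
    ]


def encode(sentence):
    """Read a sentence left to right, applying the same step to every word."""
    hidden = [0.0] * HIDDEN_SIZE
    for word in sentence.split():
        hidden = step(hidden, word)
    return hidden


short = encode("hello")
long = encode("a much longer sentence with many more words in it")
print(len(short), len(long))
```

Whatever the sentence length, the number of parameters and the size of the hidden representation stay the same; only the number of times `step` is applied changes.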

Craig Cannon [42:34] – That’s so cool.

Wojciech Zaremba [42:35] – And it was quite surprising to people. At that time they got to decent performance, but they were not able to beat phrase-based systems, and now those are out-performed. Like, a long time ago already. And, yeah. One other issue that people have with neural network systems, like in the case of translation, the problem with deploying them at large scale is that it’s quite computationally expensive, and it requires… And in the deep learning literature, there are various ideas for how to make things way, way cheaper computationally after you train the network, so it’s possible to throw away a large number of weights, or essentially turn 32-bit floats into smaller-size numerics, and so on and so forth. And, pretty much, that’s the reason why these things are not largely deployed in production systems out there, but neural network-based solutions are actually out-performing anything that is out there.
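
The two cost-cutting ideas mentioned above, throwing away weights and storing them in fewer bits, can be sketched roughly as follows. This is an illustrative simplification, not a method named in the interview: the magnitude threshold, the symmetric 8-bit scheme, and the function names `prune`, `quantize_int8`, and `dequantize` are all assumptions.

```python
def prune(weights, threshold):
    """Throw away (zero out) weights whose magnitude is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def quantize_int8(weights):
    """Map float weights to integer codes in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical weights from an already-trained layer.
weights = [0.5, -1.27, 0.01, 0.0, 0.33]

sparse = prune(weights, threshold=0.05)
codes, scale = quantize_int8(sparse)
approx = dequantize(codes, scale)

# Each code fits in one byte instead of four; the price is a small rounding error.
worst_error = max(abs(a - b) for a, b in zip(sparse, approx))
print(codes, worst_error <= scale / 2 + 1e-12)
```

Real systems usually prune and quantize per layer and then measure accuracy again, since both steps change the network's outputs slightly.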

Craig Cannon [43:49] – There are a couple more things I would like to just define, for a general listener? So, there are a couple words being thrown around a lot, so, narrow AI, general AI, and then superintelligence. Can you just break those apart?

Wojciech Zaremba [44:04] – Sure. So, pretty much all AI that we have out there is narrow AI. No one has built, so far, general AI. No one has built superintelligence. So, narrow AI means it’s a piece of software that solves a single predefined problem. General AI means it’s a piece of software that can solve a huge, vast number of problems, all of the problems. So you can say that a human is a general intelligence, because you can give a human an arbitrary problem and the human can solve it, but, for instance, a bottle opener can solve only bottle opening.

Craig Cannon [44:52] – Right.

Wojciech Zaremba [44:53] – So, pretty much, when we look at any tools out there, at any software, our software is good at solving a single problem. For instance, our chess playing programs cannot drive a car. And, for any problem, we have to create a separate piece of software, and general artificial intelligence is software that could solve arbitrary problems. So, how do we know that it’s even doable? Because there is an example of a creature that has such a property.

Craig Cannon [45:40] – And then, superintelligence is, just, I assumed, the next step, yeah?

Wojciech Zaremba [45:45] – Essentially superintelligence means that it’s more intelligent than human, yeah.

Craig Cannon [45:54] – Cool. So given all that, given that like, we’re basically at a state of narrow AI across the board, at this point, where do you think is like, what’s the current status of this stuff? Where do you see it going in the next five or so years?

Wojciech Zaremba [46:11] – So, as I mentioned, in machine learning there are various paradigms. One of them is supervised learning, there is something called unsupervised learning, there is also something called reinforcement learning, and so far, the supervised learning paradigm is the only one that works so remarkably well that it’s ready to be applied in business applications. All the others are not really there. And so you ask me where we are. We can solve this problem; other problems require further work, and with ideas, it’s very difficult to plan how long it will take to make them work. The thing which is very different with contemporary artificial intelligence is that we are using precisely the same techniques across the board. Simultaneously, the majority of business problems can be framed as supervised learning, and therefore they can be solved with current techniques, as long as we have a sufficient number of input examples and what we want to predict. And, as I mentioned, the output can be extremely rich, like the output might be a sentence, and the current systems work pretty well with it. Nevertheless, it requires an expert to train it.

Craig Cannon [47:58] – And so then, given pretty substantial hype that we see, what do you think of it all?

Wojciech Zaremba [48:06] – The field is simultaneously under-hyped and over-hyped. From the perspective of business applications, as long as you have pairs of examples, pairs that indicate a mapping, like what’s the input, what’s the output, we can pretty often get to superhuman performance. But in all other fields, we are still not there, and it’s unclear how long it will take. So, I’ll give some examples. Let’s say, for recommendation systems, you often have companies like Amazon, they have examples of millions of users, and they know what they bought, whether they were happy or not, and that’s an example of a task that is pretty good for a neural network, to learn what to recommend to new users. Simultaneously, Google knows what is the good search query for you, because on the “search results” page, you are clicking on the links that you are interested in, therefore, they should be displayed first. And in other fields, it’s actually quite often more difficult. In the case of, let’s say, an apple picking robot, it’s difficult to provide supervised data telling how to move an arm toward the apple, therefore that’s way more complicated. At the same time, the problem of detecting where the apple is, it’s way better defined and can be outsourced to humans, to annotate plenty of images and give the localization of the apple. Quite often, the rest of that problem can be scripted by an engineer, but the problem of how to place fingers on an apple, or how to grip it, is not well solved scientifically.

Craig Cannon [50:22] – And, so, I have a couple questions, then, at this point. If people were to be interested in learning more about AI, in maybe working with OpenAI, or doing something, how would you recommend they get involved, and educate themselves?

Wojciech Zaremba [50:40] – So, let’s say, a good place to start is Coursera, Coursera is pretty good. There are also a lot of TensorFlow tutorials; TensorFlow is an example of a framework to train neural networks. Also, Andrej Karpathy’s class at Stanford, it’s extremely accessible, you can find it, I believe, on YouTube.

Craig Cannon [51:10] – Yeah, and then in terms of actual exercises?

Wojciech Zaremba [51:16] – In the case of Andrej’s class, I believe there might be homework, and in the case of the TensorFlow tutorials, it’s quite often easy to come up with some random idea after reading one. For instance, the simple task over there is, let’s classify pictures of digits, let’s assign them classes. You can try to download some images from some other source, like Flickr, and try to classify or tag them.
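
To give a flavor of the digit exercise mentioned above without any framework, here is a toy Python sketch. It is not the TensorFlow tutorial: the 3x3 bitmaps are hypothetical stand-ins for real digit images, and the nearest-centroid rule (`fit_centroids`, `predict`) is an assumed, much simpler substitute for a neural network. It only illustrates the supervised setup of learning from labeled pairs and then assigning classes to unseen pictures.

```python
# Hypothetical 3x3 bitmaps standing in for real digit images (1 = ink, 0 = blank).
# Each training example is a pair: (pixels, label).
TRAIN = [
    ([1, 1, 1,  1, 0, 1,  1, 1, 1], 0),
    ([1, 1, 1,  1, 0, 1,  1, 1, 0], 0),
    ([0, 1, 0,  0, 1, 0,  0, 1, 0], 1),
    ([0, 1, 0,  1, 1, 0,  0, 1, 0], 1),
]

def fit_centroids(examples):
    """Average the training images of each class into one prototype per label."""
    sums, counts = {}, {}
    for pixels, label in examples:
        acc = sums.setdefault(label, [0.0] * len(pixels))
        for i, p in enumerate(pixels):
            acc[i] += p
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, pixels):
    """Assign the label whose prototype is closest in squared distance."""
    def distance(label):
        return sum((c - p) ** 2 for c, p in zip(centroids[label], pixels))
    return min(centroids, key=distance)

centroids = fit_centroids(TRAIN)

# Two unseen, slightly noisy pictures.
print(predict(centroids, [1, 1, 1,  1, 0, 1,  0, 1, 1]))
print(predict(centroids, [0, 1, 0,  0, 1, 0,  0, 1, 1]))
```

The follow-up suggested in the interview, trying the same classifier on images from a different source, amounts to calling `predict` on pictures that were not drawn the same way as the training set and seeing where it breaks.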

Craig Cannon [52:05] – Okay, so given that you guys are working on, with robots at this point, one of the other things that’s thrown in kind of part and parcel with AI is automation, specifically of a lot of these low-level, blue collar jobs. What do you think about the future, maybe in the next 10 years, of those jobs?

Wojciech Zaremba [52:29] – So, I believe that we’ll have to offer people a basic income. I super strongly believe that that’s actually the only way. So, I don’t think that it will be possible for a 40-year-old taxi driver to reinvent himself every 10 years. I think it might be extremely hard. Another crazy thing is… People define themselves through their job, and that might be another big social problem. Simultaneously, they might not even like their jobs, like, if you ask someone, “Would you like your kid to sell in a supermarket, to be a seller in the supermarket?” They would answer, “No.” And maybe it’s possible to live in a world where there is an abundance of resources, and people can just enjoy their life.

Craig Cannon [53:45] – I think we’re gonna have to figure out a way. Maybe people will always find purpose, but I think like, making it easier to find that purpose will become much more important in the future, if automation actually happens to the degree people talk about. And what about just like influences on you? That maybe have inspired you to work with robotics and in AI, are there any books or films or media that you really enjoyed?

Wojciech Zaremba [54:13] – There is a pretty good book called Homo Deus. It actually describes the history of humans, and then has various predictions about the future and where we are heading. That’s one pretty good one. I mean, there is no list, there’s like plenty of movies about AI, and how it can go wrong.

Craig Cannon [54:47] – What’s the best one?

Wojciech Zaremba [54:48] – I think Her is pretty good.

Craig Cannon [54:50] – Okay.

Wojciech Zaremba [54:51] – Yeah, Ex Machina is also pretty good.

Craig Cannon [54:54] – Cool, alright, do you have any other last things you wanna address?

Wojciech Zaremba [54:59] – No, thank you.

Craig Cannon [55:01] – Okay, cool, thanks, man. Alright, thanks for listening. Please remember to subscribe to the show, and leave a review on iTunes. After doing that, you can skip this section forever. And if you’d like to learn more about YC, or read the show notes, you can check out See you next week.