Transitioning from Academia to Data Science – Jake Klamka with Kevin Hale
Kevin Hale is a Visiting Partner at YC. Before YC Kevin was the cofounder of Wufoo, which was funded by YC in 2006 and acquired by SurveyMonkey in 2011.
00:00 – Kevin’s intro
00:30 – Jake’s intro
1:05 – Applying to YC with one product then changing it
3:30 – How Insight started
4:20 – Jake’s first students and initial coursework
8:00 – Finding out what companies want from data scientists
10:00 – Picking the first class of students
11:30 – Common pitfalls for people transitioning into data science
14:30 – Types of data science roles
16:45 – What data scientists should look out for in companies
17:40 – Chuck Grimmett asks – When do you know you need to bring in seasoned data scientists?
20:00 – How Insight has scaled and changed
22:00 – What happens in the program
23:20 – Examples of a good project for a data science resume
25:50 – Will more data scientists be founders in the future?
28:00 – Teaching product
29:00 – Cleaning data
31:30 – Tools for tracking data
32:30 – Track what you are trying to optimize
35:20 – Churn and conversion
39:00 – Is there an ideal background for a data scientist?
41:00 – Which startups recruit well at Insight?
43:00 – Contracting
45:40 – Fields Jake is excited about
Craig Cannon [00:00] – Hey, how’s it going? This is Craig Cannon and you’re listening to Y Combinator’s podcast. Today’s episode is with Jake Klamka and Kevin Hale. Jake founded Insight. Insight provides intensive, seven-week professional training fellowships in fields such as data science and data engineering. Insight was in the YC winter 2011 batch. Kevin’s a visiting partner at YC. Before YC, Kevin was a co-founder of Wufoo, which was funded by YC in 2006 and acquired by SurveyMonkey in 2011. You can find Jake on Twitter @JakeKlamka and Kevin @ilikevests. All right, here we go. Kevin, for those of our listeners that don’t know who you are, what’s your deal?
Kevin Hale [00:41] – I’m a partner here at Y Combinator. I actually was in the second ever batch, that was in winter 2006, and I founded a company called Wufoo. Ran that for five years and then we were acquired by SurveyMonkey and that moved us from Florida to California and that’s when PG asked if I’d be interested in helping out at YC. I’ve been there pretty much ever since.
Craig Cannon [01:02] – You suggested Jake as a guest for this episode. Jake, what do you do?
Jake Klamka [01:06] – I’m the founder and CEO of Insight. Insight is an education company, we run fellows programs that help scientists and engineers transition to careers in data science and AI. It’s a pretty unique model because they’re completely free, these fellowships. They’re full time, the companies sort of fund the process. Engineers, scientists, build projects for seven weeks. They meet top data teams and they get hired on those teams. We’ve got over 2,000 Insight alumni working as data scientists now across the US and Canada.
Craig Cannon [01:34] – Nice. You haven’t always been working on this. You applied to YC for the winter 2011 batch.
Jake Klamka [01:40] – That’s right.
Craig Cannon [01:41] – What was your idea then?
Jake Klamka [01:44] – I started my career, and this is relevant to why I started Insight, because I basically started what I wish had existed when I was around. I was a physicist at the University of Toronto, I thought I was going to be a scientist for the rest of my life and then, partway through my PhD, I realized I want to go into technology. I thought to myself, I’m writing code, I’m building machine learning models, this is great. I’ve got what I need. It frankly took me a long time to transition. Eventually got into Y Combinator, came down here for the winter ’11 batch. I was building a bunch of time mobile sort of productivity apps that are machine-learning enabled. Didn’t quite get the up-and-to-the-right graph that you would hope for after YC. It was an incredible experience and you know, in that sort of late 2011 after… 12 months after YC, I was searching for a new idea and actually went and spoke with Paul Graham and a few other advisors and the recommendation was work on a problem you yourself have had. You’re kind of building these apps that, you know, you’re trying to use these machine learning models and hopefully somebody’s got that as a problem
Jake Klamka [02:47] – but flip it around, start with a problem you’ve had then figure out what the solution is. When I reflected on it, it took me a few years to really make this transition. I’d been so close all along but I didn’t know product, I wasn’t really connected in the Valley, there’s a bunch of… Technically I had the fundamentals but a lot of the tool sets were different in the industry. I didn’t know what I didn’t know. When I got down here and I started talking to people, that’s when I finally started figuring it out. I was seeing a lot of my friends having that same struggle, so brilliant mathematicians, neuroscientists, biologists, also engineers later, we found the same thing, kind of getting stuck and they’re like, I want to go into data science, I want to go into AI, I want to go into these cutting-edge fields but you know, it doesn’t say the right thing on my resume, or just getting that last mile is really hard to cross. And I thought, “Okay, well this is the problem I want to solve because these are some of the most brilliant people I’d ever worked with,” a lot of them were my former colleagues from physics and I thought, “What does the solution for this look like?” At first, I was focused on it’s going to be an app again, right, it’s a machine-learning enabled app and then I realized no, it actually probably looks more like an in-person program where folks are getting together and building cool projects and getting started from there.
Craig Cannon [04:01] – Did you just go ahead and teach a class?
Jake Klamka [04:04] – First I talked to companies and I said, listen, I’ve got these brilliant friends coming out of academia who I think you should be hiring, why aren’t you hiring them? Basically what they told me is, I know they’re brilliant, I know they got all these great skills but they’re probably one to two months away from where I need them to be in terms of, if I had full days to mentor them for a month or two, they’d be incredible data scientists but they’re like, “I don’t have a month or two to mentor them so I say no in the interview,” right? So I’m like, “Well I have a month or two.” Maybe what Insight is going to be is that month or two where folks are filling in this last piece of the puzzle, learning the sort of cutting-edge techniques and sort of tool sets and other things, and then let’s bring those data scientists in the room and have them hire.
Kevin Hale [04:53] – Who were your first students?
Jake Klamka [04:55] – We just jumped right in and I ran the first sessions. That first session was just me. First students, so the focus was PhDs. That first group was in 2012…
Kevin Hale [05:03] – Were they like your friends?
Jake Klamka [05:03] – …in my little office. No, I had to go beyond my friends. First, I started talking to my friends in academia. I got confirmation from my friends in academia. I already knew that they were looking for jobs and they were excited about transitioning. I got confirmation from the hiring managers to say, “Listen, we’re hiring, we can’t find folks with the full skill set. If you bring them into a room, we’ll go look at them.” Then the rest was just kind of getting the word out and getting applications.
Kevin Hale [05:29] – How did you know what to teach them, because you mentioned that you didn’t know what you didn’t know?
Jake Klamka [05:34] – By that time, I had spent like three years figuring it out, including doing YC and meeting a bunch of data scientists and building a bunch of data products. By that point, I knew what the pieces were but also really, the program was focused not on me teaching the fellows, it was focused on me bringing in sort of the leading data scientists at the time and having them directly tell them. We had Facebook, LinkedIn, Twitter, Square, all these early data teams in 2012, their heads of data science come in.
Kevin Hale [06:00] – They’re willing to do like one day. They just couldn’t commit like a lot.
Jake Klamka [06:04] – Yeah, that’s exactly it. That’s exactly it. They’re like I’ll come in for a few hours but I don’t have two months. I’m like, well, if I have a bunch of you coming in for a few hours plus really have these folks working away for a few months learning from each other, learning from these mentors. Once we got alumni too, it was incredible. We had all these alumni coming in to help and it’s like this big, nice way to finish the cycle.
Kevin Hale [06:23] – How big was that first class?
Jake Klamka [06:24] – It was eight fellows.
Kevin Hale [06:25] – And how many of them did you get them jobs?
Jake Klamka [06:28] – All of them pretty much. Yeah, all of them.
Kevin Hale [06:30] – 100%?
Jake Klamka [06:30] – Yeah. One went to Facebook, one went to Square, one went to LinkedIn, one went to Twitter. I mean at the time these were like, they still are the top data teams but I mean it was a clear success. It was super stressful, I didn’t have the model, I hadn’t figured it out. It was crazy.
Kevin Hale [06:42] – What mistakes did you make? Was that first class kind of a shitshow?
Jake Klamka [06:48] – Of course, of course.
Kevin Hale [06:49] – What was the problem?
Jake Klamka [06:49] – In the sense that, in the sense that it was the first time I was doing it and a lot of it, a career transition is always stressful. Whenever people are doing Insight, they’re stressed but at least there’s a track record there and now we have things pegged pretty well. At that time, the overall idea was there but a lot of the details weren’t there, right? Frankly, the track record wasn’t there so a lot of these folks are like, what have I done? I’m in a room with this guy who’s never done this before, so there’s a lot of stress just around is this even going to work, this weird model? We made it work, we went ahead and got that done.
Kevin Hale [07:25] – What did those eight students believe, right? Were they desperate or were you great at sales?
Jake Klamka [07:33] – No, I think they were genuinely excited. Part of the application process– I got way more applications than I expected when I started it. There was a real demand to get into the field. I didn’t have a track record but I basically went around to these universities and said, I’m going to have the head of data science from Facebook, LinkedIn, Twitter, all these companies coming in and you’re going to meet them.
Kevin Hale [07:54] – The roster made them feel a lot more comfortable.
Jake Klamka [07:55] – That’s right. My interview process really centered around how excited you are about this. The folks who were like, I really don’t want to do this but I need a Plan B, no thank you. It was the people who said to me, I love my work as a scientist but I really want to have more of an applied impact in the world, I’m excited about what I’m seeing here, here’s what I think I can do. That’s the kind of folks that I would take into the program.
Kevin Hale [08:22] – Totally makes sense, starting off with qualifying the lead. It’s a much more common technique you’re seeing a lot of start-ups do now, like Superhuman for example. Heavily qualifying the lead before they’ll even let them access the product, so that way you’re trying to guarantee that the time you do spend will be with someone that’s going to have a spectacular experience.
Jake Klamka [08:40] – That’s why these hiring managers wanted to come in
Craig Cannon [08:41] – How did you figure out with, a small group–
Kevin Hale [08:43] – what students were going to be the most excited about this. What do you ask them?
Jake Klamka [08:47] – I had some opinions, but really what I did is I went to these early heads of data science teams and said, “What do you look for?” They all said some technical skills but, it’s kind of like… “Oh, they need to know SQL, they need to know Python, they need to know, you know.” I’m like, “Okay, but what would really clinch it for you, like you want this person?” There were two things always. There was like, “They have a side project?” and their eyes would light up, they’d go, “Oh, if they had a side project and if they send me a URL, then I know they’re excited, then I know they’re…” That’s where the idea came from that this isn’t about classes, these folks have been through enough classes. It’s about actually building a product, actually creating something and proving that I’ve got all this great background but now I’m going to do this last piece of the puzzle to show you I can do something relevant in this area. The second thing that they wanted, and I think this is where the project really shows this, but what they wanted overall is just curiosity. And I thought that they weren’t being serious, to be honest with you, because I was like, “Yeah, you say you want curiosity but really you just want somebody who’s good at SQL or something, right?”
Kevin Hale [09:53] – Or good at like machine learning.
Jake Klamka [10:21] – It proved to be true. The people they would hire would be the ones who said, “Hey, I studied astrophysics but in my spare time I was like, dabbling with genomics and then I got into machine learning on the side and then I built this cool, for fun project that like, I don’t know, predicts like where I should go… camping or something because I’m a big camper…” Then you take a person like that and that’s the kind of folks that these teams wanted and still want because these problems are so open-ended.
Craig Cannon [10:23] – Well, it’s like, curious people don’t get blocked as much.
Jake Klamka [10:27] – Exactly.
Craig Cannon [10:27] – They’re willing to try…
Jake Klamka [10:28] – It’s such a new field. The roles our fellows are getting hired into, most of the companies, it’s not like, we know what we need you to do, just do it. It’s what can we even do here, right, what kind of impact can we have on data, what problem can we solve?
Kevin Hale [10:40] – Again, so how did you ask, how did you test for that? The project thing’s like, “Okay that’s something we have to shoot for,” but again, how did you know that these were the right eight people?
Jake Klamka [10:49] – A lot of it was trial and error. I would do 12 plus interviews a day and kind of get to know folks and kind of get to know it but I think the main thing, the signal that I saw was– It’s kind of that example I gave, is that people would be almost apologetic. They’d be like, “Listen, what I’m about to tell you is not part of my usual work but it’s on the side,” and it’s like, “No, no, I want to hear about that.” I remember one of the fellows, well, she became a fellow, she was in an early session. She was a mathematician at Berkeley and she had done all this incredible analysis, I can’t quite remember on what, but this really cool data analysis project on, I think, maybe flight times or something in sports, I can’t quite remember. Partway through the interview, I’m like, “But you’re a mathematician… Don’t you do pencil and paper math?” And because I had done some math, she’s like, “Oh yeah,” I can’t remember what the field was and she’s like, “Oh yeah, this is not even part of my…” and she almost felt kind of apologetic about it. I was like, “This is who I want as a fellow.” Brilliant mathematician doing incredible work and able to, on the side, on the weekend, quickly pick up Python, this, that, the other, make something useful. She went to Facebook after the program, she continues to work there, and she’s been super successful ever since. It’s people like that, that you’re like, I want you.
Craig Cannon [12:09] – This is related to one of these over-arching questions we had for you. Basically, it’s, how can people get into data science and then what are the pitfalls for people who, say, have a PhD? They know Python, they’re at a higher level than a coding bootcamp person. What are the pitfalls they hit when they’re trying to bridge that gap and get into a data science role, provided that they didn’t do your program?
Jake Klamka [12:31] – Absolutely, and we see it because we started with scientists and now we also have programs for engineers who are transitioning to machine learning engineering and deep learning research and you sort of see very similar problems on both sides, which is, folks are extremely focused on the sort of technical, let me get the algorithmic knowledge down, let me know every last algorithm. Which of course you need, and you need those foundations, but when you’re already dealing with someone who has been doing a bunch of work for years and is a PhD or in engineering in these areas, what you actually want to see and what these teams want to see is communication ability, it’s the ability to understand the underlying business and product problem, because what they want to do is hire someone who’s going to first think about what are we trying to accomplish here, how can we help our users, how can we help our company succeed? And then figure out how do I use my tool set of machine learning or analysis to do it. What often happens, and this is the pitfall, is part of why you get into it is because you’re excited about it. You’re excited about the machine learning and so you start always putting that first and you’re always like, “Let me tell you the algorithm I can build,” and it’s like, folks who are trying to transition into it need to start thinking about product, need to start thinking about business, you need to ask what are they actually trying to achieve?
Kevin Hale [13:47] – Like, the skills there are making them a better salesperson. What’s interesting about the advice that we give to a lot of people about sales is it’s not about selling your own thing, it’s about understanding their problem
Jake Klamka [13:59] – Oh, I completely agree.
Kevin Hale [14:01] – and then fitting whatever you have to them and so it seems like for the data science, the same thing needs to happen,
Jake Klamka [14:06] – That’s exactly right.
Kevin Hale [14:06] – Is not to say, here’s all the things I have,
Jake Klamka [14:07] – Right.
Kevin Hale [14:08] – They just like try to figure out what it is that you fit into for them.
Jake Klamka [14:12] – Exactly right. Understand the underlying– Forget data, forget machine learning algorithms, what are we trying to accomplish here? What’s our mission, what are we trying to do for our users and then–
Kevin Hale [14:21] – Making yourself look like the solution, not trying to be like, “Oh I have a bunch of stuff, which one of these things are you interested in?”
Jake Klamka [14:30] – Exactly. I have a hammer and a screwdriver. I can use all of them, it’s like “What are we trying to build?”
Craig Cannon [14:33] – Sometimes that’s actually a separate role, so for instance Facebook might list a data science job whereas some smaller start-up would say we have an engineering role open. You might classify yourself as a data scientist so if you have to pitch data science to a start-up, how do you do that as an engineer?
Jake Klamka [14:53] – Now this is a great question. First of all, data science, machine learning, these are all like super broad umbrella terms. It’s such a new field.
Craig Cannon [14:59] – Maybe you should define data science.
Jake Klamka [15:00] – Maybe what I’ll do is define data science. I think this is essentially answering your question. What we see in the industry kind of broadly speaking, broad terms, let’s not worry about the details. I see kind of three big pieces of how sort of data science is used. Some data science roles are what I kind of call product analytics or business analytics roles. The idea there is you’re looking for a better understanding, you’re analyzing data about users or the company and trying to understand how to improve it. Help users succeed, help the business succeed. The second type of roles that we see are data product roles. These are roles where you’re actually using machine learning and predictive models to actually change the user experience and give them something they want right there and then as part of the product. The third one is kind of, usually what you hear termed as like AI, which is AI roles, machine learning engineering roles, where it’s not just the feature in the product, like that’s the prediction, it’s like the product is machine learning, right? It’s like a self-driving car. If the machine learning doesn’t work, the whole product doesn’t work. You’ll have an example of a lot of things, the key is understanding which one.
Kevin Hale [16:15] – Is there not a category in-between where it’s like, oh, machine learning supplemental feature or augments?
Jake Klamka [16:23] – Usually that’s where folks talk about data products. When they talk about data products, it’s often a feature so like the Netflix recommendation engine. That’s a situation where honestly, if they didn’t have machine learning, they could still just say here are the top movies, go watch them. But with that predictive model, you’re really getting a much better experience. We have probably 30 plus fellows working at Netflix, a lot of them work on that stuff but some of them work on analytics which is, how are people even using this product? What can we at a more product level do to improve it and there the output isn’t a feature that the user sees, like an actual algorithm serving recommendations. There it’s like they have to go and communicate with the product team to say, “Hey, users seem to want us to be building this sort of product for them. Let’s over the next six to 12 months take the product in that direction.” It’s a very different role.
Kevin Hale [17:12] – Here’s an interesting question. I know what the dream scenario for a lot of data scientists is, I want to get a job working on these interesting problems. What should they look out for that they should avoid in a company? What is that company who says– Because I think everyone, or more than should be, is looking for a data scientist, and what should a data scientist be worried about as like, “Oh, they’re not ready to actually hire me and if I go here, this will be a bad experience”?
Jake Klamka [18:03] – See, remember how I said the data scientist needs to know what the actual problem is? The company needs to know what the actual problem is. The companies you need to be wary about are the ones where it’s like, “Hey, you know what, just like, I want deep learning.” And it’s like what, what? What does that mean, what do you want us to do here and why do you need it? And the company you want to go to is the one that’s got a mission you align with, you want to see them succeed, you want to have whatever solution they’re bringing to the market, you know, thrive in the world, and then they have a clear sense of if we add some data analysis to this, if we add machine learning to this, it’s going to be better. Then you can help them get there.
Craig Cannon [18:14] – Someone from Twitter asked this. Chuck Grimmett asked, when do you know you need to bring in seasoned data scientists? Is there any kind of benchmark you can offer?
Jake Klamka [18:25] – First of all, I think you have to start, as a founder, start with the idea and you can do this, I recommend this before you have a data scientist, understand is data sort of critical to building my product or is it something that I’ll just add on once it’s already working and I need to kind of optimize the experience. An example for sort of something critical is like Amazon Alexa, right? If you’re building Alexa, those algorithms, voice-recognition algorithms better work from day one versus a scenario where like say, you’re on the analytics team at Airbnb and you already have a lot of users and you’re just trying to optimize that experience. For a start-up, figure that out first and then if you need one from day one, hire one from day one. If you get a machine-learning engineer in the door who really, that’s their forte, you’re going to be better set up for success instead of trying to sort of kind of hack it and then have to kind of catch up later. Because often you don’t know what you don’t know and you might not be tracking the right data or you’re not setting things up, your infrastructure, in a way that’s going to help you scale later. Especially in products where machine learning’s critical, that becomes challenging. One thing I recommend to start-ups actually is just talk to folks in the industry and I mean frankly, get an advisor, right? If you’re not ready to hire a data scientist yet,
Jake Klamka [19:51] – at least maybe think about getting a data science advisor because they’re going to be able to sit down–
Kevin Hale [19:54] – Where do you find those?
Jake Klamka [19:56] – Yeah, good question. So–
Kevin Hale [19:57] – I’m trying to understand–
Craig Cannon [19:58] – Craigslist.
Kevin Hale [19:58] – Who gives that information for free?
Jake Klamka [20:00] – Yeah, email me. Yeah. No, I mean you’d be surprised, a lot of, I mean maybe some of the top folks who started the data science team at LinkedIn, you know that’s hard to get as an advisor, but I think even any sort of data scientist who’s been in the field, who knows what they’re doing, will be able to sit with a founder and say, “Listen, you’re probably going to want to instrument these features to collect this data because you’re going to want to analyze this later.” Or here’s the type of work you probably want done down the road.
Kevin Hale [20:28] – Like you want someone to help you understand how to lay the groundwork to actually do that hire. You guys started off with eight students in that first class. Can you talk about where it is right now, how many students are you processing now and then also like what is different about the curriculum and program?
Jake Klamka [20:44] – It’s definitely scaled up a bit since then. We’re now in five cities so San Francisco, New York, Boston, Seattle, Toronto. My hometown just launched it this year which is fun. We’ve got a bunch of different specializations now. So data science is one, data engineering, health data, AI, we’re even sort of doing product management now, helping product managers transition to AI. Overall we’re– We do three sessions a year.
Kevin Hale [21:09] – Oh, it’s fascinating. It’s like almost like you have different classes depending on where you’re starting at.
Jake Klamka [21:14] – On the specializations, yeah. Because the field’s specialized. It used to be like you just hire a data scientist who you hope will take care of everything and now you want folks who are building infrastructure, the data engineers. You want the data scientists who are sort of building the early prototypes and figuring out what to build. More often than not now you need machine-learning engineers to really kind of put that into production now. You see these different specializations and we essentially have a program for each. The data science program’s for PhDs ’cause that sort of scientific experience is critical. The AI program for instance is for PhDs but predominately for engineers actually, who are going to machine-learning engineering roles.
Kevin Hale [21:53] – Then how big are these classes?
Jake Klamka [21:56] – Overall across all the cities and programs, we’re at about 300, just over 300 fellows per session now. But each program is small. We keep it sort of maximum 20-30, 35 fellows and because the idea is–
Kevin Hale [22:07] – For each one of those subprograms?
Jake Klamka [22:09] – That’s right. Each program, each location because you want that, the collaboration is critical. You want that group to sort of gel. Everybody’s working on a project, you want people kind of tapping each other on the shoulder, asking for help. You want that alumni who’s coming in to be able to kind of sit with the fellows.
Kevin Hale [22:24] – How long is the class?
Jake Klamka [22:26] – And the small groups are really critical for that.
Kevin Hale [22:28] – How long’s the class?
Jake Klamka [22:29] – Seven weeks.
Kevin Hale [22:30] – And then–
Jake Klamka [22:30] – Super fast.
Kevin Hale [22:31] – What gets done in seven weeks?
Jake Klamka [22:32] – It’s pretty incredible how fast people learn and what they build. Literally, they’ll go from, in week one, trying to come up with the idea. We’re partnering with start-ups, so often fellows work with start-ups. We have a partnership with YC.
Kevin Hale [22:47] – Right from the get-go they start with a project?
Jake Klamka [22:49] – Well, week one’s figure out which project.
Kevin Hale [22:51] – Gotcha.
Jake Klamka [22:51] – Your first week is like should I come up with something on my own and build it based on advice I’m getting from our alumni, from our mentors, our team? Or should I go partner with a YC start-up who’s got a data challenge that they want solved? And so that’s step one, is figure out what you’re building, figure out what, and again, figure out a problem you’re building. In the next couple of weeks, you’d better build it fast. Folks have to go from literally nothing to like an MVP in a week or two. And then they’re out presenting those projects in a few weeks’ time.
Kevin Hale [23:18] – Are they working in teams?
Jake Klamka [23:20] – They’re working individually because they’re trying to show that they’re able to kind of execute end-to-end on a real-world problem, but it’s incredibly collaborative. So if you come to Insight, it’s like– It doesn’t look like a classroom, it looks kind of like a start-up office and everybody’s just kind of at desks, sitting together and people are on whiteboards, they’re talking to each other, helping each other. Because you know, you encounter the same problems. Technical, otherwise, and it’s that collaborative aspect that allows people to move super fast and learn a ton.
Craig Cannon [23:48] – If you’re in the program or you’re just checking out the program, maybe applying for jobs like this, what are the types of projects that you recommend avoiding? Things that people have seen 100 times before.
Jake Klamka [23:59] – I’d say like, are people happy on Twitter, that’s maybe done. That’s a bad example, I’ll give a general example because people have been doing that– The more kind of useful advice is, make something useful, right? It’s really easy to just be like, I took this algorithm that used to operate at 92.1% accuracy and now I’m going to make it 92.3%, you know. I don’t know why but like, it’s better now, right? Or what you see scientists sometimes do is this very generic like, I studied you know– Here, I’ll give you an example of a project I loved that a fellow came up with– Here’s the bad version, here’s the version someone did at Insight, but you can do this at home. The bad version, so let’s say the topic is solar panels. You want to understand solar panel usage and really enable people to adopt solar panels. Bad project is I analyze general trends about solar panel usage in California. It’s like look at this interesting fact I found about, it’s like okay, whatever, right? Maybe for an analyst report that’s interesting but not for actually getting anything done.
Kevin Hale [25:07] – To me it’s like, it has no call to action.
Jake Klamka [25:10] – Exactly.
Kevin Hale [25:11] – You want it to be almost opinionated because that way a business knows, it’s like, “Oh, I can look at this now and know what to do.”
Jake Klamka [25:15] – That’s exactly right.
Kevin Hale [25:15] – I think the bad projects are the ones it feels like, “Oh, now I have homework.”
Jake Klamka [25:19] – That’s right. Oh, here’s some–
Kevin Hale [25:20] – It’s actually the problem I have with a lot of like analytics are, like all you do is like just tell me that I don’t know anything, but now I still don’t know what to do. I’ve paid to be told that I have to figure stuff out or that feels dumb.
Jake Klamka [25:31] – Exactly. The good version of this project, which is, a fellow did and is one of my favorites is, I’m a homeowner, should I buy solar or not? Will solar be profitable on a roof? Okay, that’s a hard problem. What’s the weather like a ton of different factors plus there’s some predictive aspects of all that. The fellow took all this data, synthesized it, built a predictive model. I come in, I type in my address, it tells me whether I should buy solar or not.
Kevin Hale [26:00] – Oh, they basically built a product.
Craig Cannon [26:01] – Yep.
Jake Klamka [26:03] – All these projects are very product-focused. They’re so product-focused that sometimes companies are like, “Why are you showing us products when like we just want data scientists?” The answer is because that demonstrates that people can think product-wise and they end up loving it because they sort of abstractly don’t understand why they’re showing us products but people gravitate to real solutions and then, they hire the fellows.
Craig Cannon [26:26] – This is related to something we talked about the other day which is like, in the future, are more data scientists going to become founders or is that personality, that mentality, is that best suited within a big company?
Jake Klamka [26:39] – Absolutely.
Kevin Hale [26:39] – Oh really? This is not going to be a case like designers who’ve like for some reason, designers don’t tend to become founders.
Jake Klamka [26:46] – We’ll see how it shapes up in terms of like, is it going to be en masse data scientists but certainly I would say probably about a quarter of every fellows program I see like, raise their hand when they say they want to start a company in the next five years.
Kevin Hale [27:02] – Shit, I got to get my ass out there.
Jake Klamka [27:05] – I think that’s going to be a big thing. We’ve already seen some of our alumni start companies, although again, it’s early and in the early days we had very few fellows, so. But Diana Wu who was in one of the early sessions, started Trace Genomics, a genomics company which uses genomic data to tell farmers when to plant, when not to plant. Super interesting. Not an alumni but like an early mentor, Ben Kamens used to be the kind of founding engineer at Khan Academy and he hired one of our fellows, Lauren who was a physicist. She went there, helped them sort of, you know, hopefully impact a bunch of kids’ lives by like helping them learn faster because they really have millions of data points of data on how people learn and she was there for a few years with Ben helping with education and now Ben went off. I mention Ben ’cause he’s very much kind of a data scientist at heart although his title’s officially CTO, he went off and founded Spring Discovery. Now they’re doing sort of helping aging-related diseases using machine learning to do that. Lauren went over there with him, sort of part of the founding team. TBD in terms of like what the stats are going to be in terms of founders but that founders spirit is there and the skill set is so useful.
Craig Cannon [28:26] – That’s the thing, like– Regardless having an understanding of product is like the pickle so you–
Jake Klamka [28:30] – Absolutely, because whether you’re employee, whether you’re founder or employee 10 or 100, or frankly 1,000, you better know–
Craig Cannon [28:38] – Do you teach that as well?
Jake Klamka [28:39] – Oh yeah. That’s one of the biggest things.
Craig Cannon [28:41] – How do you teach it?
Jake Klamka [28:43] – I found the only way you can teach is by doing. You say build a product and then they don’t, they give you a graph. That shows you interesting things and you say, “No, no, like no,” and you iterate, you just iterate. That’s the learning experience, you do it wrong and then you iterate and you fix it and get better and so the model at Insight is really just continual feedback. If at the end of the program I tell you like no that’s wrong, then that’s a bad learning experience but at Insight, you’ll be told like half a day in that that’s like not the way to go and by the like next half day, you’ll be closer there and by the first week you’ll hopefully be on a good path to building a cool product since that fast iteration feedback.
Craig Cannon [29:25] – Cool.
Kevin Hale [29:26] – One of the things that ends up being a problem for a lot of start-ups or for even people getting into the data science field is like, they’re encountering very dirty data. Now a lot of time actually is like, this is not like oh I’m solving cool problems and making products, it’s like, “Oh, I’m just sitting here cleaning up. Just so I can get to this point.” What I’m trying to figure out is like, is this something that data scientists need to be aware of, that you’re just going to walk into this or something like start-ups need to start thinking about and like, what can they do to prevent that?
Jake Klamka – Both, but I think you can never avoid it so it really is the data scientist’s job to be prepared for that, to do well at that. And that’s–
Kevin Hale [30:03] – What is that? What’s the ratio of the job, of like cleaning versus like–
Jake Klamka [30:06] – Oh my god, I mean there’s this joke that 90% of the job is data cleaning. I don’t know if it’s 90 but it’s a lot and it’s not just data cleaning, because data cleaning sounds kind of lame, like you’re just kind of cleaning things up, it’s I think more interesting than that. It’s like literally like, what data even makes sense to get here?
Craig Cannon [30:21] – Right.
Jake Klamka [30:22] – It’s not obvious, in advance you think it’s obvious. You’re like oh, just throw some data in. What data? Of what? And how can you combine that data and what does it mean to have clean, relevant data and that’s a skill set–
Kevin Hale [30:31] – Do you have an example?
Jake Klamka [30:36] – I have an example around the founder’s side, right? I think founders often make the sort of assumption that they’re tracking all the right things and then we’ve had many experiences where, you know we’ll talk to a founder a fellow’s going to work with, like a founder and they’ll say, “Yeah, we’ve got all the data, we’ve got everything that’s posted. Big data, big data, has the best data.” And then you know, you open it up and it’s like, “Oh shit, they didn’t track user log-ins.” Which user was logged in? They’re tracking like all the movements on the site but not which movement, which user was doing that and then what timestamp. Again, it’s like oh my god, like all this data is borderline unusable because we can’t kind of peg it to specific behavior and model that behavior. When you’re looking at it from the data perspective, it sounds hilarious like, why didn’t you track users? But you know what, I’m a founder, I know– When you’re a founder, you’re thinking about a million different things, you have a million different trade offs and honestly, like yeah, the logging’s turned on, like let’s go, right? Let’s build let’s– And then a year later, you’re regretting that. Again, I think a lesson learned for sure. That’s why it’s like, hey have a coffee with a data scientist like, maybe all you’ll get from it is like, log your user log-ins but that might be enough and then a year later, you can get started with a data scientist.
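To make the logging point concrete, here is a minimal sketch in Python of an event record that always carries a user ID and a timestamp. The schema, field names, and helper functions are hypothetical and only for illustration; they do not come from Insight or from any particular tracking tool.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Event:
    # Hypothetical schema: field names are illustrative, not from a real tool.
    user_id: str                  # which logged-in user performed the action
    action: str                   # what happened, e.g. "login" or "click"
    timestamp: float              # when it happened (Unix seconds)
    target: Optional[str] = None  # optional detail, e.g. the page or button


def log_event(log: list, user_id: str, action: str,
              target: Optional[str] = None,
              timestamp: Optional[float] = None) -> Event:
    """Append one event to an in-memory log, refusing anonymous records."""
    if not user_id:
        raise ValueError("every event needs a user_id")
    when = time.time() if timestamp is None else timestamp
    event = Event(user_id, action, when, target)
    log.append(event)
    return event


def events_for_user(log: list, user_id: str) -> list:
    """Return one user's events in time order, ready for behavior modeling."""
    mine = (e for e in log if e.user_id == user_id)
    return sorted(mine, key=lambda e: e.timestamp)


# Toy usage with made-up users and timestamps.
log = []
log_event(log, "u1", "login", timestamp=100.0)
log_event(log, "u2", "login", timestamp=101.0)
log_event(log, "u1", "click", target="pricing", timestamp=105.0)
print(json.dumps([asdict(e) for e in events_for_user(log, "u1")]))
```

The design choice worth noting is that the log rejects events without a user ID, so the "which user, at what time" question can always be answered later.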
Kevin Hale [31:57] – What are the best tools that people should use for tracking data, or is there a product that startups should use just right out of the gate that you know that if they do this, they’re just going to start on a good foot?
Jake Klamka [32:06] – Honestly I saw some of the questions on Twitter and you know folks always ask about tools so I was actually asking around with some of my team like hey, like what’s the latest on this and there are great tools I think for just sort of like basic analytic tracking of websites but if you’re really building products, like it’s still to this day we see the teams roll their own because there’s so much, there’s so much–
Kevin Hale [32:32] – Such a disappointing answer.
Jake Klamka [32:33] – It is a disappointing answer. Listen there are companies working on it, some YC companies and they’re slowly progressing up to more sophisticated sort of data products. But at the end of the day, if your lifeblood is a very specific product that does something very specific, nothing beats just having somebody very thoughtfully say, what do we actually care about tracking here? How do we track it?
Craig Cannon [32:55] – Stepping back then, assuming there is no easy answer then, you’re a founder, you just started your thing. Can you give me five or 10 things that I should be tracking?
Jake Klamka [33:07] – Well, I mean it really depends on the company, right?
Craig Cannon [33:09] – Sure, okay fine.
Jake Klamka [33:09] – I think the number one thing you have to think about as a founder is actually not even what you’re tracking because honestly if you think about this first thing, right, I think that’ll become more obvious. The first thing you got to think about and think about it right is what are you actually trying to optimize? What’s the one or two metrics you actually care about? If you’re thinking about machine learning and building predictive models, like, say you had a magic machine-learning model that like did whatever you want. But you only had one or two. Which problem in your company would you apply it to? Because what I see folks do is, oh I know my business in and out so I know my metric is this, this, this, this, this and this and then oh, machine learning will build this, this, this, this, this and this and you know what, you might at some point down the road. But initially, you’re going to have to focus and if you don’t have that focus, that’s where you get into this habit of I’ll just track everything or nothing whereas if you know what you’re trying to optimize is–
Craig Cannon [34:07] – Let’s say I’m Netflix. What am I going to start tracking?
Jake Klamka [34:11] – You obviously want to see how long people are watching the video, how far they get in that video. One of the things that’s less obvious is people are using different devices on different bandwidths so they track, I mean they test this stuff and track it on all sorts of different machines. Again, in a generic tool, would you have a situation where you’re testing like a stream on 100 different devices? No, you wouldn’t because like if that’s not core to your business, why would you ever do that?
Craig Cannon [34:42] – Right.
Jake Klamka [34:42] – But if you’re Netflix, you better be doing that because you know that user experience is the key, right? Khan Academy had something different, right? For Khan Academy maybe it’s the amount of time kids are spending on a question and that’s telling you something about whether they’re learning, whereas on another site, it’s like you don’t really care about the timing, you just care about the flow.
Craig Cannon [35:04] – I got you.
Kevin Hale [35:04] – Can I simplify that for us?
Jake Klamka [35:05] – So it’s understanding–
Kevin Hale [35:06] – For any start-up and most companies, it’s like always like, my goal is growth. And for us at YC, we’ve actually pretty much simplified it where it’s just like, for the most part your KPI that your company is actually interested in driving, it’s either going to be revenue and that’s like 99% of the companies and for some like consumer, very difficult play is like, I’m going after engagement. Daily active users is the ideal. Sometimes it’s weekly active users, that’s just the nature of the product. And so to me, it’s just like okay, what drives those two things? I really just like, only two numbers. It’s like conversion and then like churn. And so I imagine like, most questions fall into those two categories like, what increases conversion for revenue and what reduces churn for revenue and the same thing for like engagement.
Jake Klamka [35:52] – I’ll speak directly to those because now you’re kind of zeroing in on certain types of companies and– So for churn we often have fellows build a churn prediction model for start-ups. Again, they’re customized because churn for what, what happening but when we’re talking about churn, it’s a customer deciding to stop using the product and if we can predict that ahead of time, then they’re able to intervene, maybe offer a discount, maybe engage that user, get feedback so those are top of the list. And for conversion, experimentation is the key. It’s like use experimentation for your markets.
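As a rough illustration of what a churn prediction model can look like, here is a minimal sketch in Python: a logistic regression fitted with plain gradient descent on two usage features. The features, the toy data, and the function names are all assumptions made up for this example; the models fellows build for start-ups are customized and are not described in detail in the conversation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(x, weights, bias):
    """Predicted probability that a user with features x churns."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = churn_probability(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Made-up toy data. Hypothetical features, both scaled to 0..1:
#   x[0] = time since last login, x[1] = recent session frequency.
# Label 1 means the user churned, 0 means they stayed.
rows = [[0.1, 0.9], [0.2, 0.8], [0.1, 0.7], [0.3, 0.6],
        [0.8, 0.1], [0.9, 0.2], [0.7, 0.1], [0.6, 0.3]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(rows, labels)
at_risk = churn_probability([0.9, 0.1], weights, bias)
healthy = churn_probability([0.1, 0.9], weights, bias)
print(f"at-risk user: {at_risk:.2f}, healthy user: {healthy:.2f}")
```

In practice the predicted probability would feed the interventions mentioned above, for example flagging users above some threshold for a discount or a feedback request.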
Kevin Hale [36:23] – I always feel like a lot of times start-ups and especially early ones, they neglect that whole churn question. Because I always tell them is like, look, you’re obsessed about conversion because you’re in sales mode and trying to bring them in, but I always feel like it’s very expensive and I feel like improving churn by the same percentage is exactly the same thing as conversion but it’s way easier, cheaper, et cetera, et cetera. Is that usually what the first projects that start-ups and companies should be looking at if they haven’t at all?
Jake Klamka [36:52] – Absolutely agree. And you know, one thing I’ll add about churn is it’s often more reflective of what is actually working or not working, right? It’s like, make something people want. It’s like, if you improve churn, that means you’re truly understanding what the user wants. Maybe you can get them to sign up or convert just by sort of, having a flashy sales pitch but churn, really you understand it. And then that’s where the exploratory data analysis comes in. Do you really understand what your user’s doing? That’s where the A/B testing and often what’s called like multi-armed bandit testing where you’re trying various different experiments at once, that’s where you’re predicting churn and then trying to intervene to help the customer. But you see what I’m saying, it’s like a number of different things all of which are grounded in, do I understand what my user wants and am I building to what they really care about?
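For readers unfamiliar with multi-armed bandit testing, here is a minimal sketch of one common strategy, epsilon-greedy, in Python. It is one of several bandit algorithms and is not necessarily what any team mentioned here uses; the conversion rates in the simulation are hypothetical numbers chosen only to drive the example.

```python
import random


class EpsilonGreedy:
    """Try several variants at once, shifting traffic toward the best one."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon          # fraction of traffic spent exploring
        self.counts = [0] * n_arms      # times each variant was shown
        self.values = [0.0] * n_arms    # running mean reward per variant
        self.rng = random.Random(seed)

    def select_arm(self):
        # Explore a random variant with probability epsilon...
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # ...otherwise exploit the variant with the best observed mean.
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen variant.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Simulation with hypothetical true conversion rates for three variants.
true_rates = [0.05, 0.10, 0.20]
bandit = EpsilonGreedy(len(true_rates), epsilon=0.1, seed=0)
visitors = random.Random(1)
rounds = 5000
for _ in range(rounds):
    arm = bandit.select_arm()
    reward = 1.0 if visitors.random() < true_rates[arm] else 0.0
    bandit.update(arm, reward)

print("times shown:", bandit.counts)
print("estimated rates:", [round(v, 3) for v in bandit.values])
```

Compared with a fixed A/B split, the bandit keeps a small share of traffic on exploration while sending most visitors to whichever variant currently looks best.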
Kevin Hale [37:37] – I think the other big trend that you’re having people sort of obsessed with metrics-wise is like, cohorts and like retention curves over time. What are usually like the best things people should do? Like, yes just understanding and knowing it, like that’s sometimes that’s really difficult but in terms of improving that, like where does data science usually help?
Jake Klamka [37:57] – Right, I mean I think it’s coming back to churn, right? Because if you’re seeing folks drop off at month three and like your early cohorts I mean, that’s a churn problem right there. So yeah, I think it goes back to churn. A lot of those sort of dashboards are, you know there are great tools for those. Certainly, when I started, people would like hand-code like no hard analyses. Now there are a bunch of tools for that. So I’m not saying that, certainly I think in the metrics sort of dashboard domain there’s a lot of solutions. When I was saying that there isn’t really a ready made solution it’s more that stuff that’s kind of, where you’re actually building models to improve the product in a very sort of deep way.
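To show what a cohort retention curve is, here is a minimal sketch in Python that groups users by signup month and computes the fraction still active at each month offset. The data layout and the function name are assumptions for illustration, and the sample users are made up.

```python
from collections import defaultdict


def retention_curves(signups, activity):
    """Compute retention per signup cohort.

    signups:  {user_id: signup_month_index}
    activity: iterable of (user_id, month_index) pairs, one per active month
    Returns {cohort_month: [fraction active at offset 0, 1, 2, ...]}.
    """
    cohort_users = defaultdict(set)
    for user, month in signups.items():
        cohort_users[month].add(user)

    active = defaultdict(set)       # (cohort, offset) -> active users
    max_offset = defaultdict(int)   # cohort -> furthest offset observed
    for user, month in activity:
        cohort = signups[user]
        offset = month - cohort
        if offset < 0:
            continue  # ignore activity recorded before signup
        active[(cohort, offset)].add(user)
        max_offset[cohort] = max(max_offset[cohort], offset)

    return {
        cohort: [len(active[(cohort, k)]) / len(users)
                 for k in range(max_offset[cohort] + 1)]
        for cohort, users in sorted(cohort_users.items())
    }


# Made-up sample: four users sign up in month 0, two in month 1.
signups = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}
activity = [("a", 0), ("a", 1), ("a", 2), ("a", 3),
            ("b", 0), ("b", 1), ("b", 2),
            ("c", 0), ("c", 1),
            ("d", 0),
            ("e", 1), ("e", 2),
            ("f", 1)]
curves = retention_curves(signups, activity)
print(curves)
```

A sharp drop at a particular offset, such as the month-three drop-off described above, is the signal that points back to a churn problem.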
Kevin Hale [38:33] – Do you guys have a favorite for like– Because you said like good start-ups have good problems. Are you waiting for a sponsorship, I’m trying to understand.
Jake Klamka [38:40] – What do you mean?
Kevin Hale [38:41] – For like some tool that’s paying you money to say what’s great?
Jake Klamka [38:43] – No so honestly, at Insight, almost everybody just uses open-source, right? So everybody’s building in Python, you know. Yeah, it’s all the open-source and because that’s actually what we’re seeing reflected in the industry. So if you go to a top data science team by far and away the vast majority of what they’re using and building on is open source.
Kevin Hale [39:02] – What are those projects?
Jake Klamka [39:05] – Like, I think Python is definitely– It used to be like Python–
Kevin Hale [39:07] – Oh, they’re still building it themselves.
Jake Klamka [39:08] – Yeah, yeah, absolutely.
Craig Cannon [39:11] – Right, and then they just use like what, Jupyter Notebooks, stuff like that?
Jake Klamka [39:15] – Yeah, for prototyping. And then you got to then start building–
Craig Cannon [39:18] – Then you roll your own.
Jake Klamka [39:19] – And then you roll your own and frankly at that point, as soon as you get away– As soon as you get past the prototyping stage, you’re really just building product, right? It’s the same thing an engineering team does at a start-up, right? It’s like what tools are they using to build the fundamental product? And that’s where you’re living, those data scientists are often embedded with the team, building directly.
Kevin Hale [39:36] – Who makes the best data scientist? Like from what field have you noticed where like “Oh, it’s much better that they come from this field?” What’s kind of been shocking–
Craig Cannon [39:46] – I want to know who your favorite children are.
Jake Klamka – Early days, I was accused of you know, I’m from physics so but now there’s, we have fellows from all the different backgrounds so they all succeed. That’s been the shocking thing is like, how different the backgrounds are. We have a fellow in this session who’s an archeology PhD, we had a fellow a session ago who was like an engineer at SpaceX, right? Like, we had–
Kevin Hale [40:12] – Imagine each of them have–
Jake Klamka [40:15] – So we have–
Kevin Hale [40:15] – certain kinds of problems. Like you have a mathematician–
Jake Klamka [40:18] – That’s exactly right.
Kevin Hale [40:18] – going to get the math, understand the particle but selling themselves and understanding problems is probably another challenge.
Jake Klamka [40:23] – Exactly, so often you’ll find like mathematicians, for instance, make great data engineers because they think about large scale systems and how can they fail. In math it’s logic systems but then you kind of transfer that sort of mode of thinking to data infrastructure. But someone like, for instance, psychology was one that, like, in the early days, I didn’t really have a network in kind of psychology or in neuroscience so we did a lot of work to try and kind of put the word out there. We found social scientists are incredible data scientists quite often because they really…
Kevin Hale [40:55] – They know how to ask the right questions.
Jake Klamka [40:56] – They know how to think about people and ultimately, you know, obviously data’s branching out but most of the time when you’re talking about users you’re talking about customers, it’s people, right? And so fantastic data scientists from those fields but it’s just– One of my favorite parts of my job actually is the fact that I’ll sit at lunch or happy hour or just hang out at the office and it’s like, an astrophysicist with a psychologist with a software engineer with a you know, electrical engineer and they’re all kind of working, collaborating, and it’s just incredible kind of environment to be around all these different people.
Kevin Hale [41:33] – You have all these companies coming in, talking with all your students during the program and they’re usually coming with a problem? Or are they just talking about here’s the kind of problems we work on and solve? Because they’re kind of doing a little bit of recruiting in addition to giving an understanding–
Jake Klamka [41:46] – Oh absolutely.
Kevin Hale [41:48] – Who’s been really great at that? Or what do they do that’s really great?
Jake Klamka [41:53] – There’s a bunch of teams. You know, listen, I think– The way the program works is fellows will often work with a start-up company on a project but most of the interactions that fellows have with companies is actually companies coming in to try and hire them, right? And when I say companies I mean like, the actual technical data team coming in, talking about what they work on and, but you know, and try to hire them. And so, the teams that do really well– Listen, obviously the ones with great brands, the Airbnbs, the Lyfts, the Ubers, the Facebooks–
Kevin Hale [42:20] – I want to know what the little guys have to even compete with them.
Jake Klamka [42:23] – But this is what I found is when start-ups come in, what often happens is fellows come in, what’s this start-up, I’ve never heard of that, why do I have to go to this and they come out and they’re like, this is my dream job, I want to work at this company. And I started trying to figure out what certain start-ups did to do that and what it really boils down to is impact. The start-ups that do well recruiting data scientists make the pitch, you are critical to our success. If data fails–
Kevin Hale [42:50] – Oh, they made them look like they’re going to be all-stars.
Jake Klamka [42:54] – And they’re telling the truth because a lot of companies these days frankly, if the machine learning or if the analytics doesn’t work, like the company will fail, like that’s what they’re pitching to you.
Craig Cannon [43:03] – Well also when there’s one of you versus 300 of you–
Jake Klamka [43:05] – Right. Well that’s the personality thing, right? Some people are excited about, I’m going to be the first data scientist and some people are like, I want some mentorship, I want–
Craig Cannon [43:14] – Yeah, I need a little motor on me.
Jake Klamka [43:15] – But when it comes to like, I’ve never heard of this company before and then an hour later like, oh my god I want to work for them, it’s always the impact piece. It’s always the if you come here, what you do will matter in a big way. And obviously there’s the technical piece, you’re going to work on cool stuff, but I thought the technical piece would be the biggest one but the biggest one actually is the impact, for sure.
Craig Cannon [43:37] – One thing we haven’t talked about and I actually don’t know if you have an opinion on this is contracting. For an average start-up, say they’re like a couple of years in and they’re like, I don’t know if we really have a need for this but we have all this data, maybe we could put it to use. Do you see people like doing two month contracts and getting a system up and then just letting it go? What happens?
Jake Klamka [43:57] – I think contracting is good for prototyping. So we see a lot of like, what I’m saying YC start-ups work with our fellows, that’s essentially– It’s a pro-bono consulting but they’re working with them for, you know, the program and helping to deliver some results. Where that works really well is, you know, often in an integrated but it’s at this sort of prototyping stage. Will this even work like– Or I’ve got a model, will this one work better if we try this? So let me give you an example of one I really like recently. A fellow worked with iSono Health, YC start-up, really amazing product like a sort of in-home breast cancer screening. It’s a device instead of going once a year to get screened for breast cancer if you’re in high risk, you can do it at home. Second-leading cause of cancer death in women so huge impact potentially, life-saving technology. And obviously a big part of that is, do we have the right algorithms to detect and notify a user that hey, you need to go speak to your doctor or notify your doctor. Obviously a doctor does the final thing but is there something abnormal here that we need to be taking a closer look at? They had algorithms that were working great and doing well for them, especially at that stage of, “Hey let’s just bring it to a doctor to be safe,” but they were curious about, “Hey, are some of these more… deep learning algorithms that are just coming out in the papers, are they going to do better for us?” A fellow did that. They took the data and essentially used some brand new sort of convolutional neural network techniques that had just kind of been published and got better results from them that were almost on par with sort of expert radiologists. That’s awesome, right? Of course that team then has to do some more work to implement it but that’s an example where I think consulting works is like, is this going to work? Is this feasible, is this a prototype? Anytime you actually I think then, anytime that becomes a part of your product, you need a team.
Craig Cannon [45:49] – Right.
Jake Klamka [45:50] – Because it’s never static. Something’s going to evolve and change and you need to be able to evolve it. It’s just like asking like, can a startup just have like contract software engineers overseas? It’s like well, maybe to prototype something but in general, probably the answer’s no. Because that product’s going to keep evolving every month, every year and you need folks on the staff to do it.
Craig Cannon [46:12] – Makes sense to me.
Kevin Hale [46:13] – Great.
Craig Cannon [46:14] – Cool, so one last thing I wanted to talk about was just areas you’re excited about in particular? We mentioned health the other day but yeah. What’s exciting to you right now in the field?
Jake Klamka [46:26] – No, I mean there’s a bunch of stuff that’s exciting to me but health is the top one that I’m pumped about because it– the impact’s there, right? Like the example I just shared with you, I mean, early detection, disease monitoring, I mean you’re literally saving people’s lives if this stuff works and what’s interesting is people have been talking about the impact in data science, machine learning and health for years because you know, you start thinking about this stuff and pretty quickly you’re like, “Oh, this could make an impact.” Actually getting it to work is tough. And only I think in the last few years, we’ve been seeing a lot of teams actually making really amazing progress there. I’ll give you an example I love of like the impact here. Memorial Sloan Kettering Cancer Hospital in New York has hired out a team of data scientists and data engineers from us over the last few years. And what they do is they build essentially data products but that are used internally by their doctors. These are cancer doctors, really tough situations and they’re faced with a situation of what clinical trial do I recommend to my patient? And there’s thousands of clinical trials and there’s new ones coming online every day. Which one do you suggest? They’re building these kind of data products where the doctor gets, based on the specific, personalized, you know whether it’s genomic or clinical factors, “Hey, you should at least think about these new clinical trials that are coming online.”
Jake Klamka [47:48] – And again, the doctor makes the final decision but it’s, hey maybe one of those trials they hadn’t heard about now could save that patient’s life, right? And it’s really critical stuff.
Kevin Hale [48:00] – Fascinating, it’s a hospital that has this inside.
Jake Klamka [48:01] – Right and then soon thereafter, New York Presbyterian hired a fellow and then Mount Sinai hired a fellow and now pharma companies are hiring fellows. It’s really fascinating to see data broaden out when companies realize that they can just take it beyond oh, I want to optimize this like business and efficiency and really think, what can I create that’s going to have incredible value? Health is what I’m excited about, but there’s a ton more out there, yeah.
Craig Cannon [48:28] – That’s super cool. All right, well thanks for coming in.
Jake Klamka [48:29] – Thanks so much.
Kevin Hale [48:30] – Yeah, thanks Jake.
Craig Cannon [48:31] – All right, thanks for listening. As always, you can find the transcript and video at blog.ycombinator.com. If you have a second, it would be awesome to give us a rating and review wherever you find your podcasts. See you next time.