Written by Christopher Kelly
Dec. 8, 2016
[0:00:00]
Christopher: Hello and welcome to the Nourish Balance Thrive Podcast. My name is Christopher Kelly and today I'm joined by Dr. Pedro Domingos. Now, Pedro is not the normal sort of doctor that you're used to hearing on this podcast. Pedro is a professor of computer science at the University of Washington. He is the author or co-author of over 200 technical publications in machine learning and data mining. And he is the author of The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. It's a phenomenal book and I've really, really been enjoying it. Pedro, thank you so much for joining me today.
Pedro: Thanks for having me.
Christopher: I'm sure that everybody listening will be familiar with how good the recommendation engines on websites like Amazon have gotten. And, of course, I was surprised and utterly delighted by the recommendation for your book.
Pedro: Yeah, this is perfect. I remember when I was -- When the book hadn't come out yet, I sent email to a friend of mine at Stanford saying, "Do you want a copy of my book?" And he said, "I already ordered it." And I said, "Well, how come?" He said, "Amazon recommended it to me." So, they recommended it before it even came out. Good job, Amazon.
Christopher: It's absolutely fantastic, the recommendation algorithm there, and elsewhere too. When I finished reading the book, I started reading it again. I've almost finished it for the second time. That's how good it is. So, I highly recommend to everybody, pause this podcast, go and order yourself a copy of The Master Algorithm. It really is that influential to me. So, Pedro, can you tell me a little bit about your background and your research interest?
Pedro: Sure. I'm a machine learning researcher. I've been doing it for 20 years. I got my PhD in machine learning back when nobody knew what it was. And I would tell people that one day machine learning will take over the world and now it's essentially starting to happen, which is both exciting and scary. I've done research in a lot of different machine learning areas including, for example, deep learning, which is using learning algorithms that mimic how the human brain works.
There's a whole bunch of different types of learning which, of course, I talk about in the book. I've worked in most of them. And, of course, I also teach machine learning, being a professor. I also work with people in industry on various applications. And then a few years ago, I decided to write this book because I felt that there was a huge need. I think we're at the point now where machine learning is something that affects everybody's lives in very deep ways, yet people aren't aware of it. So, they need to become aware of it. They need to have at least a high level understanding of what machine learning is and what it does so they can make the best of it and so that's why I wrote this book.
Christopher: Yeah, absolutely. And that's why I chose to get you on this or attempted to get you on this podcast. I was lucky that you gratefully accepted my invitation. I think that's right. I think that machine learning is now becoming a bit like electricity. It's everywhere. It's affecting everybody's lives. And you put forward a really brilliant analogy in the book which is maybe like a mechanic. You don't need to understand all of the inner workings of the car but, as an operator of a car, you sure as hell need to know which way the car is going to move when you turn the steering wheel. And so I think that everybody needs to understand this machine learning thing that's happening.
Pedro: Exactly. And machine learning does have -- You don't need to know how the engine works. That's for computer scientists to worry about. But machine learning does have a steering wheel and pedals. It's just that most people don't know even that they exist let alone how to use them right now.
Christopher: Right. And so at this point, I think maybe we should walk people through a day in their lives and explain to them how machine learning is affecting things that they do every day. Can you think of a way of doing that?
Pedro: The first and most obvious place where machine learning touches people's lives is online. Every e-commerce site [0:03:35] [Indiscernible] has a recommender system. We just mentioned Amazon as one. Amazon uses machine learning to recommend books or other products that you might want to buy. But it's not just Amazon. Netflix does it with movies and Spotify does it with music.
Facebook uses machine learning to decide what updates and what news to show you. Twitter is basically a big machine learning algorithm deciding who's going to see what tweets. Search engines, Google, when you type in key words and get back pages, it's a machine learning algorithm that's deciding what to show you and so on. What people don't realize is that it's not just online anymore. It's even in your real lives. For example, your smartphone has machine learning algorithms to understand what you say, to correct your typos, to predict what you're going to do next and make recommendations, et cetera, et cetera.
Sensors can be used to predict what's going to happen with you. For example, whether you're going upstairs or not. One of these days, your smartphone is going to detect that you're about to have a heart attack and call 911 on your behalf. This is the kind of stuff that machine learning can do today. Another example that is very much from people's real lives is companies that use machine learning to select job applicants. There are people these days who have jobs because a machine learning algorithm actually picked them out as being the most promising for that job.
[0:04:58]
And even things like finding a spouse. Most marriages today start on the internet. Actually, not most. A third of our marriages in the United States start on the internet. And the match makers are machine learning algorithms at all these dating sites that basically pair people up based on their profiles and other information. So there are literally children alive today that wouldn't have been born if not for machine learning.
Christopher: And my daughter is one of them. I remember quite vividly standing on the platform at a train station and getting an email from OkCupid and inside the email, it said, "Oh, you really, really must check out this woman because she's absolutely perfect for you." And I read the profile and I thought, "Blimey, this thing is right. She really is perfect for me." And that woman is now my wife and we have a daughter. That was OkCupid. Yes, fantastic. It was so far ahead of the game at that point. I don't know whether it still is, obviously. And I know that there's other dating sites. I'm sure they're all using machine learning techniques but OkCupid was particularly good at that time.
Pedro: Yeah, they're one of the best. But what's interesting is that each of these sites has its own type of machine learning and its own approach to applying it to the problem, and each makes its own claims. So, for example, some look for people that have a lot in common with you, because they think that that's the best approach. And others actually look for people that are complementary to you. So, I think it will be interesting to see how this all evolves. And the truth is, there's still a long way to go. There's a lot of information that they could be using that they aren't and the learning algorithms could be better. So, it does need to get even better over time.
Christopher: And talk to me about the implications for science. So, one of the things I've been thinking about is traditional experiments done in the lab. Are they going to go away?
Pedro: They're not going to go away but they are going to be in many cases controlled by machine learning algorithms. This is actually another thing that people think is science fiction but it's already happening today. We have robot scientists that literally run the whole scientific process in an automated way. So, for example, at the University of Manchester, there's this robot called Eve that, like most biologists, knows basic biology, DNA, proteins and so on, and uses machine learning to make hypotheses about whatever problem it's supposed to be studying. And then it literally designs and carries out experiments that test the hypotheses using gene sequencers and microarrays and whatnot.
A couple of years ago, Eve discovered a new malaria drug that is now being tested. So, one way to think about machine learning is that it's really the scientific method except it's being done by computers. This thing of looking at data, generating hypotheses and refining them and so on, it's done by computers instead of by human scientists. As a result of which, it can be done at a scale and at a speed, and discover amounts of knowledge, that would be totally beyond the reach of human scientists. So, machine learning is actually, you're right, having a huge impact on science and it's going to have even more because it's like giving a scientist an army of post docs that costs very little.
Christopher: Don't they already have those?
Pedro: Well, exactly, they have, but somebody's army of post docs might be half a dozen post docs or maybe a dozen, and in the case of the bigger labs maybe 50. But post docs cost a lot of money. Whereas with something like Eve, you could actually have a million post docs costing you less than one human post doc does and discovering things at a correspondingly greater pace.
Christopher: And then talk to me about the implications for cancer. Because that's something you talk about specifically in the book.
Pedro: Yeah. Cancer is a very, first of all, important but also very interesting example of a machine learning application. Machine learning algorithms are actually very good at medical diagnosis. For many diseases, you can actually get a machine learning algorithm to learn in literally seconds from a database of maybe a thousand patients to diagnose diabetes or cirrhosis or whatever better than human doctors, better than highly trained human doctors or radiologists and so on and so forth.
But cancer is actually a much harder problem. Why haven't we cured cancer yet? Because cancer is not a single disease. Every patient's cancer is different. So, there's never going to be a single drug that's the cure for cancer. The cure for cancer is a machine learning system that learns to predict based on the patient's genome, their medical history, their mutations in the tumor, which drug to use that will kill the tumor cells and not harm the good ones.
Whether an existing drug or perhaps even a drug that is designed by machine learning, as is already separately happening. Of course, doing this requires a lot of deep knowledge of how cells work and metabolic networks, regulatory networks, et cetera, et cetera. But this is exactly the kind of thing that machine learning is good for. So, I actually think we are starting to make much more progress in fighting cancer than we were kind of 20 years ago and a lot of it comes down to machine learning. And I think we need to see more of it in the next decade.
[0:10:02]
Christopher: Is it possible for you to explain how a machine could acquire the knowledge required to do that?
Pedro: Yeah, sure. Well, in fact, the example that we were just talking about of Eve actually illustrates that. So, with Eve, you start by telling it basic knowledge of molecular biology, how the whole DNA-to-protein synthesis works, et cetera, the central dogma as it's called in biology. And then you say, "Well, go study yeast." And then it gathers a bunch of data from sources, for example, from microarrays and gene sequencers.
So, with the microarray you can actually probe the cell with different conditions. You can inject something. You can raise the temperature. And you see what changes, which genes get more expressed, how the metabolism changes. And then based on those experiments, it refines its knowledge. So bit by bit it builds up an understanding of how the cell works and, therefore, how it would respond to a particular drug or another. So, at a high level, this is really not different from what a human medical researcher does. It's just done in an automated fashion.
Of course, the machine learning algorithms at this point are nowhere near as smart or capable as a human scientist. It's just that they can learn on a much larger scale. And there are areas like, for example, basic physics where human beings with our brains and a small amount of data can figure it out. But something like cell biology, there's just so much of it that doing it with machine learning is really the only way it's going to work.
Christopher: And I watched a recent presentation by Robb Wolf at the Institute for Human and Machine Cognition, who, I'm sure, do a lot of machine learning for their robots and other projects. And in the presentation, Robb talked about how you can do a PubMed search for obesity or diabetes and just limit the number of results to the past five years and you'll still get hundreds of thousands of results. So, how could one human ever possibly interpret all that data, all that knowledge and make recommendations from it? Do you think that a machine learner could ever be capable of interpreting the scientific literature like that?
Pedro: There are projects existing today that do that. Again, they don't understand the articles as deeply as humans reading them would. They often understand them well enough to, number one, give you the relevant articles. Like there was this paper, I think, in Science or Nature some years ago that was showing that for a lot of scientific, for things in biology, for example, it costs more to find the results of the relevant experiment than to just do it yourself.
It's not even a question of interpreting the articles. It's a question of being able to even find them. Machine learning algorithms can actually do that pretty well because, again, they can read all of PubMed in a day. All the millions of articles that are on PubMed, machine learning algorithm can sift through. And then the better the machine learning the more it will be able to actually not just do keyword matching, which is, of course, what you can already do, but to actually have a deeper understanding of what the article is saying and whether it's relevant to you or not.
The other thing that you can do that in a way is even more exciting is that -- And again, there have already been examples of this. The machine learning algorithms can make connections between articles that people hadn't made. So, "B causes disease A" is in one paper, "C will stop B" is in another paper. No one has actually put these two things together to say, "Oh, here's a drug that you could take for this problem." But a machine learning algorithm can read all those papers and can actually find those connections where it would just be beyond people. And there have been a few successful examples of this and there are going to be many more in the future.
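The kind of connection Pedro describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper IDs, the entities B, C, D, E and the relation names `causes` and `inhibits` are made-up placeholders, not real findings, and a real system would first have to extract such facts from the text of the articles rather than receive them as tidy tuples.

```python
from collections import defaultdict

# Hypothetical facts, each "extracted" from a different paper.
# Format: (paper_id, subject, relation, object). All names are placeholders.
facts = [
    ("paper-1", "B", "causes", "disease A"),
    ("paper-2", "C", "inhibits", "B"),
    ("paper-3", "D", "inhibits", "E"),
]

def suggest_treatments(facts):
    """Join 'X inhibits Y' with 'Y causes Z' when the two facts come from different papers."""
    causes = defaultdict(list)  # cause -> [(disease, paper_id), ...]
    for paper, subject, relation, obj in facts:
        if relation == "causes":
            causes[subject].append((obj, paper))

    suggestions = []
    for paper, subject, relation, obj in facts:
        if relation != "inhibits":
            continue
        for disease, cause_paper in causes.get(obj, []):
            if cause_paper != paper:  # only report links that span two papers
                suggestions.append((subject, disease, (paper, cause_paper)))
    return suggestions

suggestions = suggest_treatments(facts)
for drug, disease, papers in suggestions:
    print(f"{drug} may be worth testing against {disease} ({papers[0]} + {papers[1]})")
```

The hard part in practice is the fact extraction, not the join; the join is shown here only because it is the step that "puts the two things together."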
Christopher: Yeah, so amazing. And I really wanted to talk about prevention. So, one of the things I've noticed with the existing machine learning applications, and maybe it's that I've just not seen enough of them, is that they're all looking for early detection or they're looking for a cure. So, I'll give you another example. I'm currently taking part in a class at the University of San Francisco that's being run by Jeremy Howard, who is an amazing guy. I should link to your TED Talk and his, actually; they are both fantastic.
Jeremy started a company called Enlitic, which is finding tumors in x-rays of people's lungs. And it now does that better than a panel of four world-leading radiologists, which is absolutely incredible. But as someone that doesn't have cancer, what I really want is not an early detection system. I want a machine learner that can tell me what to do in order to not get cancer in the first place. Do you know if anybody is working on that type of problem?
Pedro: First of all, there's, of course, a lot of medical and health research along these lines and that research is being helped by machine learning just like other kinds of research are. So, that's one aspect of it. Another aspect which, of course, is becoming very big today is that for anybody who has, for example, a smartphone, there's a lot of data about them and their health state and how their body is functioning that can be captured on a moment by moment basis. And so there's this whole area of -- So like monitoring your physical and even mental condition.
[0:15:03]
And then the machine learning algorithms can actually match that against databases to see where the same things are happening and say, "Oh, look, you are not doing enough of this. Or maybe you should try doing that. Or here's a red flag that you need to do something about." And again, this is still in its early days but, I think, we're going to move quite rapidly to a situation where in some sense you have a second immune system, which is a computational immune system.
It's like all this data gathering and all these machine learning algorithms that are constantly helping you fight off bugs and fight off things that are bad for your health. In fact, I remember seeing a talk by Craig Venter when he was saying that at the end of the day the goal is really for your immune system to be on the internet. He didn't put it exactly this way but it's like a new disease appears, the bacteria or the virus get sequenced, we find the vaccine or antibiotic or whatever, and then your immune system can automatically download it from the internet and start making it.
So, you'll be healthy because you will never catch these diseases because even without your knowledge your immune system has learned to fight them. There are many examples of this but, I think, this is something that can give a sense of just how extraordinary the transformation is that machine learning is bringing to predictive medicine, if you will.
Christopher: Yeah. And the sensors are just getting so much better as well. I was wearing a continuous blood glucose monitor for a while really, which is pretty non-invasive already. I'm sure it's only going to get better. And then someone recently [0:16:48] [Indiscernible] who's developing a food logging app generously sent me a continuous heart rate variability monitor. So, it's looking at the interbeat intervals of the heart, which is different from the heart rate, and you can glean so much information from that data. And now we're going to be able to get it continuously even when you're sleeping, which, I think is going to lead to all kinds of promising applications.
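To make the distinction between heart rate and heart rate variability concrete, here is a minimal Python sketch. It assumes RMSSD (the root mean square of successive differences between interbeat intervals) as the variability measure, which is one common summary and not necessarily what the monitor Christopher mentions reports. The interval values are made up for illustration.

```python
import math

def mean_heart_rate_bpm(intervals_ms):
    """Average heart rate in beats per minute from interbeat intervals in milliseconds."""
    return 60000.0 / (sum(intervals_ms) / len(intervals_ms))

def rmssd_ms(intervals_ms):
    """Root mean square of successive differences, a common HRV summary."""
    diffs = [b - a for a, b in zip(intervals_ms, intervals_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Two made-up recordings with the same average interval (800 ms).
steady = [800, 800, 800, 800, 800]
varied = [800, 810, 790, 805, 795]

for name, intervals in (("steady", steady), ("varied", varied)):
    print(name, mean_heart_rate_bpm(intervals), round(rmssd_ms(intervals), 2))
```

Both recordings average 75 beats per minute, yet only the second shows any beat-to-beat variation, which is the information the interbeat intervals carry and the heart rate alone does not.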
Pedro: And even small things like small changes in the tone of your skin can tell you how well your heart is functioning and what problems you might be having. And again, there are already algorithms that do this. These are things that are actually undetectable to a person. That's another disadvantage that human doctors have compared to computers: the computers have these sensors that keep getting better and in many cases are much, much finer than anything anybody could do with the naked eye.
Christopher: I really wanted to ask you about some ideas that I might have with the data that we've already collected. We have blood chemistry and urinary organic acid and urinary hormones and stool culturomics data from about 800 athletes now. Almost anecdotally, maybe coincidentally is the right word, we see patterns in the data. And, of course, I know that if there was a machine that was trying to observe those patterns we'd find so much more and our knowledge would be advanced so much quicker.
So, can you think of any ways, any sort of algorithms or names of things that you can give me as pointers that I can look at that would maybe help me do some perhaps structure mapping, is what I'm looking for here. I don't want to bias you too much in your answer.
Pedro: Sure. Well, at this level, almost any kind of machine learning could be effective. It sounds like a very rich data set. I mean, in essence, what a machine learning algorithm does is it learns to predict some variables from others. And so part of what you want to ask yourself is what is it that I want to predict? Then you can see if that predictability is there. And usually, the way you do this, you have to be careful not to start hallucinating patterns because machine learning is very powerful. It's very easy for you to come up with all kinds of patterns in the data that just happen to be random noise.
So, what you want to do is you want to divide your data into two sets randomly. One, you're going to run the machine learning on and the second one is where you're going to test it to make sure that it didn't learn things that were just random variations of the data. And this is really the standard way that machine learning gets applied. It's very simple and yet very powerful because it means that you can do a very intensive search for patterns but then still be fairly sure that you've found something meaningful.
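As a rough illustration of why the random split matters, the following Python sketch searches pure noise for a "pattern" and then checks it on held-out data. Everything here is an assumption made for the demonstration: the data is synthetic random noise, the sizes (30 training rows, 200 test rows, 500 candidate variables) are arbitrary, and the "learning" is just picking the variable most correlated with the target.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)
n_train, n_test, n_features = 30, 200, 500
n = n_train + n_test

# Pure noise: by construction, no variable truly predicts the target.
target = [rng.gauss(0, 1) for _ in range(n)]
features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]

# Divide the rows into two sets at random.
indices = list(range(n))
rng.shuffle(indices)
train_idx, test_idx = indices[:n_train], indices[n_train:]

def corr_on(idx, feature):
    return pearson([feature[i] for i in idx], [target[i] for i in idx])

# Intensive search on the training set: keep the most correlated variable.
best = max(range(n_features), key=lambda j: abs(corr_on(train_idx, features[j])))
train_r = corr_on(train_idx, features[best])
test_r = corr_on(test_idx, features[best])

print(f"correlation on training set: {train_r:+.2f}")
print(f"correlation on held-out set: {test_r:+.2f}")
```

The search reliably turns up a variable that looks predictive on the training rows, and the held-out rows show that the apparent pattern was noise. With real data, a pattern that survives the held-out check is the one worth believing.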
So, this type of learning where you're just predicting some variables given others is often called supervised learning because it's like having a teacher. You know what the right answer is. You're just trying to learn to produce it. There's also what's called unsupervised learning which you probably also want to do on this data which is literally just looking for patterns. Are there clusters, for example, of these athletes in terms of their metabolism or behavior or whatever?
Are there certain patterns that you can summarize in a way that people can understand? So, this is called unsupervised learning because, in this case, nobody is actually telling you what the right answer is, what it is that you should predict. It's just generally noticing how things work. It's a little bit like what children do when they're playing. Small children, most of what they learn isn't from their parents teaching them. It's just by trying things and noticing things. And you can also do that type of learning here.
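The clustering idea mentioned above can be sketched with k-means, one widely used clustering algorithm (the interview does not name a specific one, so this choice is an assumption). The data below is synthetic: two made-up groups of "athletes", each described by two unnamed measurements. The farthest-first starting points are a simple choice made to keep the example deterministic.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    # Farthest-first initialisation: start from the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

rng = random.Random(0)
# Synthetic data: two well-separated groups of 20 "athletes" each.
group_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
group_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(20)]
points = group_a + group_b

labels, centroids = kmeans(points, k=2)
print("cluster sizes:", labels.count(0), labels.count(1))
```

Note that nobody tells the algorithm which athlete belongs to which group; it recovers the grouping from the measurements alone. Real data is rarely this cleanly separated, and the number of clusters usually has to be chosen by trying several values.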
[0:20:05]
Christopher: Right. And can you name any types of unsupervised learning that I should look at?
Pedro: Well, there's two main ones. I mean, there's many, but the two main ones are clustering. So, clustering, as what I briefly mentioned there, is grouping people who have similar characteristics together. And another one is dimensionality reduction which is a bit of a mouthful. But dimensionality reduction just means the following. You might have -- I don't know how many variables you have describing each of these athletes but, let's say, it's a thousand or 10,000.
Nobody can look at a thousand or 10,000 variables and look at books and understand what's going on. There's just too many. So, dimensionality reduction is finding a way to reduce those variables to maybe just two or three that you can actually look at on the computer screen. So, these variables, it's not just selecting two or three variables. It's creating new variables that are derived from the original ones but are much, much more expressive.
So, for example, suppose you have a bunch of pictures of faces, and one of the dimensions that you might discover is: are they smiling or are they frowning? So, there's a dimension where you're changing from a smile to a neutral face to a frown. Another dimension might just be, for example, which way are you looking? Are you looking at the camera or are you looking left or right and so on? So, remember, an image has a million pixels and this way you can just reduce it to a few meaningful things. And who knows what might be the case in your data set.
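One standard way to create such derived variables is principal component analysis (PCA); the interview does not name a method, so treat this choice as an assumption. The Python sketch below builds synthetic data in which three observed variables are all driven by one hidden factor, then finds the single derived variable that captures most of the variation, using power iteration on the covariance matrix.

```python
import random

rng = random.Random(0)

# Synthetic data (an assumption for illustration): three observed variables
# that are all driven by one hidden factor, plus a little noise.
rows = []
for _ in range(200):
    hidden = rng.gauss(0, 1)
    rows.append([2.0 * hidden + rng.gauss(0, 0.1),
                 -1.0 * hidden + rng.gauss(0, 0.1),
                 0.5 * hidden + rng.gauss(0, 0.1)])

def center(rows):
    """Subtract each column's mean so every variable is centred on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, steps=100):
    """Power iteration: repeatedly multiply by the covariance matrix and renormalise."""
    v = [1.0] * len(cov)
    for _ in range(steps):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

centered = center(rows)
cov = covariance(centered)
component = top_component(cov)

# The new derived variable: each row's coordinate along the top component.
scores = [sum(x * c for x, c in zip(row, component)) for row in centered]

# Share of the total variance captured by that single derived variable.
d = len(cov)
eigenvalue = sum(component[i] * cov[i][j] * component[j] for i in range(d) for j in range(d))
explained = eigenvalue / sum(cov[i][i] for i in range(d))
print(f"variance explained by one derived variable: {explained:.3f}")
```

Here three variables collapse to one with almost no loss because the data was built that way. On a real data set the interesting question is how many derived variables are needed and whether they mean anything, as in the smile and gaze-direction example above.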
Christopher: Yeah, one of the things I'm really, really interested in is predicting the results of an expensive test using a cheap one. So, for example, we see lots of patterns with the sodium to potassium ratio and the amount of free cortisol. And there's a plausible physiological explanation for that because the sodium and the potassium are controlled by a part of the adrenal glands that also produces cortisol. So, first of all, is that pattern really real? Does it really happen in everyone? Is there a strong connection?
And if it was, then the sodium and potassium cost a few dollars to test. You can do that. Any blood chemistry has sodium and potassium on. Whereas the hormone test, it costs $400. You could greatly reduce the price of the testing that we do if this turns out to be true. And, of course, there's other examples as well. Like I always see an elevation of the percentage of eosinophils in white blood cell count, which again is a really inexpensive test. And that correlates really well with stuff that's going on in people's guts. So, overgrowth of yeast and other pathogens. That's kind of one of my goals, is to reduce the cost of the testing using machine learning techniques. But I'm not sure if it's possible.
Pedro: Well, this is a quintessential application of machine learning. It's very well matched to what machine learning can do because you have these early signals which you're trying to learn from, but then you also have the real answer, at least for some people: in some cases, you have both the cheap test and the expensive test. You can use the expensive test to supervise the cheap test. And then you can check whether the predictions from the cheap test are accurate or not.
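A minimal Python sketch of that setup follows. The numbers are entirely synthetic: the variable names `cheap` and `expensive`, the linear relationship between them and the noise level are invented for illustration and say nothing about real sodium, potassium or cortisol values. The model is ordinary least-squares regression with a single predictor, chosen for simplicity.

```python
import random
import statistics

rng = random.Random(1)

# Synthetic people for whom both results are available (values are invented).
cheap = [rng.uniform(25, 40) for _ in range(200)]
expensive = [3.0 * x - 50.0 + rng.gauss(0, 5.0) for x in cheap]

# Random split: learn on one part, check on the other.
pairs = list(zip(cheap, expensive))
rng.shuffle(pairs)
train, test = pairs[:150], pairs[150:]

def fit_line(pairs):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    xs, ys = zip(*pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def r_squared(actual, predicted):
    """Fraction of the variance in `actual` that the predictions account for."""
    mean_actual = statistics.fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# The expensive result supervises the model during training...
slope, intercept = fit_line(train)

# ...and the held-out people show whether the cheap test alone predicts it.
actual = [y for _, y in test]
predicted = [slope * x + intercept for x, _ in test]
r2 = r_squared(actual, predicted)
print(f"slope={slope:.2f} intercept={intercept:.2f} held-out R^2={r2:.2f}")
```

Whether a real cheap test can stand in for an expensive one depends on how high the held-out score is and on what error is acceptable for the decision being made. A low held-out score would mean the apparent pattern is too weak to replace the expensive test.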
So, this is both a very natural match to supervised learning. And also, there are actually many important instances of this. A big one, of course, is just clinical trials. What makes drug development so expensive is actually the drugs that succeed at first but then once you go to full blown human trials then they fail. And those cost an enormous amount of money because, for example, the drug was good for that disease but it also has a side effect that is very harmful to people and that only becomes apparent later.
So, something that machine learning is already being used a lot for today is: can I predict ahead of time that this drug is going to cause cancer, for example, or have some other complication? And even if all that does is cut down -- It won't necessarily catch all of them. But if it catches even a few drugs and saves all the money that would be spent on testing those few drugs and then having them fail, it's already incredibly useful.
Christopher: I wanted to talk to you about how you think the best way is to get started as a data scientist? So, to give you a bit more background about me, I have an undergraduate degree in computer science and I've worked for several companies including Yahoo and Amazon and two hedge funds all of whom employed machine learning techniques. But, honestly, I didn't know anything about it. Somehow, it all evaded me until much more recently. And it was your book that really inspired me to start really looking at this very, very hard.
But I can see from watching other presentations online -- So, for example, I listened to this woman from London talk about how to become a data scientist in a year or something, I think, it was. It was a really, really interesting presentation on how she did it. And it was all kind of self taught stuff. And I realized that one of the main barriers to entry is the amount of mathematics that seems to be taught at the beginning of all of these machine learning courses. And in particular, you read -- There are really good free books online. I'll link to stuff in the show notes if people want to find it. But there's a new book, a deep learning book that came out recently.
[0:25:01]
And it's the same thing. You're all excited and then you get to page three and you hit this wall of linear mathematics and I'm done. I understand what a scalar is. I understand what an array is. I understand what a matrix is. Okay, then I get to a tensor. And then my eyes glaze over and I'm not really listening after that. So, how do you think it should -- how do you become a data scientist?
Pedro: Well, first of all, I highly sympathize with you. And this is why I wrote the book. The text books on machine learning are just this incredibly heavy pile of mathematics which I just feel the pain of people trying to learn machine learning from that. But here's the secret, is that at the end of the day, the deep ideas in machine learning do not require any mathematics. So, you can actually understand machine learning quite well without having to learn all that math.
It's just that the people who do machine learning tend to love math and to them it's great fun and very intuitive. They don't realize that for most people that's actually not the case. There's really three different levels at which you can understand machine learning. There's the user. The user is something like the customer or the patient. At that level you really need to understand things at the level of the steering wheel and the pedals.
Then at the other end of this, like the really deep level -- You understand the details of the algorithms. You could create new algorithms yourself and be a machine learning researcher and whatnot. And there you have no choice but to know some amount of mathematics. Yeah, some algorithms are very mathematical but others are less. So, even at that level, the deep mathematics is not necessarily what's most important.
But then there's intermediate level which maybe is the most relevant one for you, which is the people who will use machine learning models and they understand what they're modeling and how you can apply the machine learning to it. They don't necessarily understand the mathematical details of how it happens inside the box. And, I think, part of my goal is actually to enable people like this to do the job of applying machine learning to whatever they're interested in without having to go into the gory details of the mathematics.
And luckily, there are an increasing number of resources that people can use for something like this. One is online courses. Andrew Ng has one. I have one. Yaser Abu-Mostafa has one. These courses are, funny enough, they don't go very deep into mathematics but they give enough of an understanding that you can start doing machine learning. And then there are these suites and toolkits that people can use and apply to the problem of their choice and say, "Well, let me try this kind of algorithm. Let me try that kind of algorithm."
And there are many people today who are practicing data scientists, and they just understand the machine learning at that level. So you actually can do machine learning for whatever application you're interested in without those piles of math that you see in the textbooks.
Christopher: Right. Yeah, over the weekend I'd been developing an application using Theano. Honestly, I didn't write a single line of Theano. I used a library that sat on top of that called Keras. The application was sponsored by State Farm Insurance and they provided a data set that included a whole bunch, like thousands and thousands of images of distracted drivers, and then they give you different classes of what they were doing like texting with their right hand, texting with their left hand, are they messing with the radio, are they talking to the passenger?
And so your job is to create a machine learning algorithm that classifies these images and correctly identifies them on a test set. And, I think, my algorithm was about 80% accurate. That got me 500th place on a machine learning competition site called Kaggle. That's really, really fun.
Pedro: Yeah. Kaggle is great. The amazing thing about Kaggle -- So, they run competitions for applications from all sorts of origins but one of the most interesting things about Kaggle is that the winners are typically neither experts in the problem -- So, for example, you could have a contest for drug design or you name it and the people are typically not biologists. They're also typically not machine learning experts. They're more machine learning enthusiasts. And they actually often end up winning these competitions.
Christopher: Right. Yeah, absolutely. I really don't know that much about machine learning, yet still I was able to place 500th in this competition out of thousands. It's amazing what you can do without understanding all of the details.
Pedro: Well, the beauty of machine learning is that on the one hand the data does the heavy lifting for you. You don't have to be an expert because the knowledge is going to come from the data. The beautiful thing about machine learning algorithms is that they are actually much, much simpler than the algorithms that they then learn to replace. A basic machine learning algorithm is actually something that's as simple as it gets.
Like, for example, the nearest-neighbor algorithm just involves, for example, if I'm diagnosing a new patient, I just find the most similar patient in my files and I predict the same diagnosis, which sounds very dumb but actually, if you give it enough data, it becomes extremely smart. I often joke to my students that if you're smart and hardworking you should be a computer developer and if you're lazy and dumb just do machine learning because the data will do the work for you.
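As an illustration of the nearest-neighbor idea Pedro describes (find the most similar patient on file and predict the same diagnosis), here is a minimal Python sketch. The patient records, the two features, and the function names are all hypothetical and invented for this example; real systems would use many more features and scale them before measuring similarity.

```python
import math

# Hypothetical patient files: (temperature in C, resting heart rate) -> diagnosis.
# Every value and label here is invented purely for illustration.
PATIENT_FILES = [
    ((36.8, 62.0), "healthy"),
    ((37.0, 70.0), "healthy"),
    ((38.9, 95.0), "flu"),
    ((39.4, 102.0), "flu"),
]


def distance(a, b):
    """Euclidean distance between two feature tuples of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_diagnosis(new_patient, files=PATIENT_FILES):
    """Return the diagnosis of the single most similar patient on file."""
    _, diagnosis = min(files, key=lambda record: distance(record[0], new_patient))
    return diagnosis


print(nearest_neighbor_diagnosis((39.0, 98.0)))  # closest record is (38.9, 95.0)
```

There is no training step at all here: the "model" is just the stored data, which is why adding more patient files is the main way such a learner improves.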
Christopher: I think that's why I've spent my whole career programming database apps. You look at really sophisticated applications like the autopilot on an airplane or software that drives a car. I have no idea how you could try and explicitly program that with the if-then-else type situation. So, yeah, the machine learning is what's making these things happen.
[0:30:12]
Pedro: Well, in fact, here's what's amazing. Things like self-driving cars are only possible with machine learning because we don't even know -- We know how to drive a car but we don't know how to explain to a computer how to drive a car. So, there are many things where it's not even a question of replacing a complex hand-coded program by machine learning. It's that machine learning is actually the only way we have of doing it.
Christopher: Yeah. And so what I wanted to talk to you about is, is a robot going to take my job? The world of machine learning, as Jeremy Howard put it in his TED Talk title, is both wonderful and terrifying. And maybe the robot taking my job is going to be one of the things that's terrifying. What types of jobs are going to go and which will stay?
Pedro: I think this is actually something that a lot of people are worried about and, I think, with some reason. But the first thing to realize is that, yes, some jobs will disappear but also a lot of new jobs will appear. You have to remember that in the 19th century, whatever, 98% whatever of people in America were farmers. And then farming was automated. That doesn't mean 98% of people are now unemployed. There are all these jobs that didn't exist before.
For example, there are millions of people today that make a living developing web apps or just apps for smartphones and those jobs didn't exist even ten years ago. Machine learning will actually create a lot of new jobs. There will be some jobs that disappear. I think those jobs are often a little bit different from what people think they're going to be. We used to think, and a lot of people still have this notion, that the easy jobs to automate are the blue collar jobs, the manual jobs, the less skilled jobs if you will.
And the hard ones to automate are things like doctor or lawyer or tax accountant, et cetera, et cetera. Paradoxically, it's actually often the opposite. The job of a doctor is easier to automate than the job of a construction worker. And this is counterintuitive because, for us, the things that are required for a construction worker are sensory and motor coordination, picking things up and not tripping over things and whatnot. Evolution spent 500 million years evolving us just to do that. So, we're actually very, very good at it. We're so good that we take it for granted.
Whereas things like being a doctor, well, you have to go to college for that precisely because humans aren't born being able to do it. But it also means that it's actually easier for computers to do that job. So, the jobs that are threatened are actually a bit different from what most people think. But I think the most single important -- The single most important thing to remember is that the great majority of jobs are neither going to appear nor disappear. They're just going to be transformed.
What's going to happen, and is already happening, is that we're not going to lose our jobs. We're just going to be doing them differently because we're going to be doing them with the aid of machine learning. What people have to do is understand what parts of their job they can automate and then they can do the more interesting stuff on top of that that they couldn't do before. In a way, the way to keep your job safe from automation is to automate it yourself.
There's this very nice example from chess. Deep Blue beat Kasparov, so now the world's best chess players are computers. Actually, wrong. The world's best chess players are not computers. They're actually what people in the chess community call centaurs. A centaur is a combination of a human and a computer. Humans and computers have very different strengths and weaknesses. So, a human with a computer today can actually beat any human and any computer. They call these centaurs. And, I think, the same thing is going to be true in many, many other fields.
It's the combination of the humans and the computers that's actually going to win. So, it's not really so much man versus machine as man with machine versus man without. Automation complements us. For the most part, it doesn't replace us. It's like having a horse. If you have a horse, you don't try to outrun it. You ride it. And then you can go much farther.
Christopher: That's just a perfect analogy. I was going to say that right then. It's so perfect. It's so easy to understand. You don't try and outrun a horse. You ride it. So, do you think the online courses are the best thing for people to do? Because I know this is so important. I'd been kind of scoping out some of the people in this Slack channel that I'm on and all of those people are doing this deep learning course in San Francisco.
And I looked at their LinkedIn profiles to see who they are and why they might be doing this. And what's so interesting to me is lots of them have PhDs in unrelated subjects. They've got a PhD in English or Math or something like that. And so they've seen this coming and they're making a transformation. How do people -- Is the book the best way to get up to speed or are there any other resources that you'd recommend to people?
Pedro: Well, so, the first thing I would do is read the book because that's the purpose that I wrote it for. At the end of the book, there's a section called further readings that's actually a whole bunch of pointers. It's not just reading. It's resources. And it has pointers to a whole bunch of things that people can use to go further.
[0:35:11]
Some of them are books and papers, some of them are online courses. So, I mentioned things like Kaggle. There's a whole bunch of things that people can start. I think there's learning the basics, that's step one. Step two is actually to just start. Machine learning is just like driving. You can only learn by doing it. So then there's also these suites that you can download and start playing with. There's websites that you can use.
The good news is there's actually a lot of resources that you can use. I have pointers to a lot of them in that further reading section of the book. There's also another one. I wrote an article a few years ago called A Few Useful Things to Know About Machine Learning. And it's just like 12 things that are hard to find in the textbooks but are really important to your life as someone who's applying machine learning to some problem. So, I would also recommend that. Again, that has pointers to more things.
Christopher: Okay. Yeah, I'll find these things and I'll link to them in the show notes. I wanted to talk to you about data sharing. So, I've just been stung by this pretty hard with some of the software that we use. Basically, it grabs all of our blood chemistry data and it stores it in a database. And it makes some slightly useful predictions about what's going on for an individual based on the data that it has. I know for a fact that it's an if-then-else algorithm and it's pretty crude. It's not super helpful but it's slightly helpful.
And the main thing this software does for me is it defines better reference ranges. So, a lot of the reference ranges on blood chemistry now are just two standard deviations on either side of the mean, and that's not always very helpful when a lot of sick people are doing the test and that's not controlled for. So, the software is not useless but it grabbed all my data and they won't give it back to me. I contacted the software engineer and he said, "Oh no, it's all in the database. It's all encrypted. So, it will be a lot of effort for me to give you your data."
And this is like a total disaster for me. I ended up writing a Python program that scrapes the numbers out of a PDF document, which I have now done and it works but it's brutal. So, can you talk a bit about who you should share your data with, the things that you should look out for and how to generally proceed online?
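The reference-range point Christopher makes above (a range defined as the mean plus or minus two standard deviations, computed over whoever happened to take the test) can be sketched in a few lines of Python. The glucose values and the function name below are hypothetical and invented for illustration; they are not taken from any lab or from the software discussed.

```python
import statistics


def reference_range(values, num_sd=2.0):
    """Return (low, high) as the mean minus/plus num_sd sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - num_sd * sd, mean + num_sd * sd


# Hypothetical fasting glucose results in mg/dL (invented numbers).
healthy_only = [82, 85, 88, 90, 91, 93, 95]
everyone_tested = healthy_only + [140, 155, 170]  # same people plus three sick ones

healthy_low, healthy_high = reference_range(healthy_only)
all_low, all_high = reference_range(everyone_tested)

print(f"healthy only:    {healthy_low:.1f} to {healthy_high:.1f}")
print(f"everyone tested: {all_low:.1f} to {all_high:.1f}")
```

Because the sick results pull the mean up and inflate the standard deviation, the range computed over everyone tested has a much higher upper limit than the one computed over the healthy group alone, which is the problem with not controlling for who takes the test.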
Pedro: Yeah. Unfortunately, what you just gave is an example of how wrong things are today. We are the owners of our data. I think that's the first thing that's important to realize. Unfortunately, all these companies tend to capture the data and they're not giving it back and then who knows, they sell it to other companies and who knows what happens to your data. So, I think, partly -- But this will not change unless there's pressure from our side.
So, we the users, we the customers, we need to demand, for example, the ability to have continuous access to our data and to move it anywhere else. So, for example, you put your money in the Bank of America and if tomorrow you want to move it to Wells Fargo you can. Imagine if you couldn't take it out. This is what happens with data today. People should have this ability to just take their data wherever they want.
And then people should have the ability to decide what their data can and cannot be used for. And this should all happen with a fairly easy interface that lets you make choices. Unfortunately, the way -- What's happening today is that for the most part people aren't thinking about this. So, the companies, they do the data capture, they do the data mining. It's often useful, data mining, because at the end of the day, they want to serve people.
But their goals are not quite aligned with yours. So, the problem is there's actually -- maybe we need a new kind of company. Maybe we need something like a data bank that is to your data like the bank is to your money. It stores your data and then it uses your data on your behalf in the same way that the bank invests your money on your behalf. But it's your goals that are being served there. Maybe you need something like the data bank, a new type of company to do this that doesn't have the wrong incentives like, for example, trying to make money by showing you ads or whatever.
Maybe we need something like a data union. Labor unions arose in the 19th century to balance the power between workers and their bosses and today knowledge is power. The companies that have all this data have a lot more power than the people who actually generate the data but they don't have access to it. So, maybe people need to get together and form communities of people who have the same interests.
And then they will have -- and then the machine learning can be applied to their collective data to do things on their behalf and that will even -- that will rebalance the power a little bit. People often talk a lot about privacy in this context. And this is actually unfortunate because privacy is really, I think, the wrong prism to look at this through. Privacy is the ability to withhold your data, which is fine. But withholding your data 99% of the time is not what you want to do because you will miss a lot of good opportunities. Like, for example, to have personalized medicine and who knows what just by sharing your data.
[0:40:00]
So, the real context, the real way to look at this is through data sharing. We should be able to decide how much data we share with whom and for what. And also, at the end of the day, in some cases, people have an ethical duty to share their data. For example, in cancer, there is this initiative now to try to pool data from different hospitals and patients and doctors and whatnot precisely because without that data pooling you cannot cure cancer with machine learning.
But this requires the consent of the patients. And, first of all, I think it's in the patient's own interest to share their data because they might get another cancer tomorrow or somebody in their family might so the data is directly relevant that way. But even more than that, I think it's everybody's duty to share data for the common good.
Christopher: Yeah, absolutely. I thought about that with our testing. Why don't the laboratories like Quest, for example, or LabCorp that have far more resources than I do, why don't they just go ahead and build an amazing algorithm around the data that they have? But they don't actually have all the data. So, we do lots of different types of tests and maybe we're in a better place to put all the pieces together than any one of the labs individually. And I think that's true of other places, where Facebook has a lot of data but they don't have all of the data for you. These data banks might be in a better position to build better learners.
Pedro: Exactly. One of the big problems today is that data is very balkanized. And so everyone only knows a little sliver of you. Facebook knows you from what you do on Facebook. Amazon knows you from what you do on Amazon. And, therefore, what they learned from that data, the result is never that smart because it's based on very incomplete knowledge of you.
So, we need to put an end to this data balkanization and bring it all together so that a really good model of each one of us can be learned and then that model can actually help you with all sorts of things from medicine to finding a date to who knows what. But for that to happen people have to be able to trust that the data won't be used against them. So, I would like to have all my data in one place and one model being learned from. But it better be under my control.
Christopher: Right. Can you give people any tips for training these machine learners? They're everywhere now. So, for example, can you train Facebook to not show me anything about Donald Trump?
Pedro: Yes. So, actually, this is another very important thing. It gets back to the steering wheel and the pedal. People should realize that every time they're interacting with a computer these days there's really two things going on. One is accomplishing what you want to accomplish, reading your Facebook updates or buying a book from Amazon. But there's also the fact that you are teaching these systems about who you are and what you want.
And so you should be aware of that and try as much as possible to teach them what you want to teach. The first part of this is actually very simple. If you're doing something that you don't want it to learn, like for example you're buying something for a friend as opposed to for you, and you don't want the two to be confused, there's simple things that you can do. Like, for example, browsers have incognito mode where you are anonymous.
You can go on Chrome and choose incognito mode and then it basically treats you as if you are a complete stranger. So, whatever you do there, it will not learn about you. This is a very simple thing that you can do, which is make sure that at least you don't teach the wrong things. The other thing that you can do is precisely the type of thing of -- For example, when you get a page of search results, most of the time what you want is in those top ten.
But even if it is, and particularly if it isn't, you should make a point of reaching down into the next 20 or 30 to teach, not just to get what you want, which might or might not be there, but to teach Google that, "Hey, look, here are some of the things that I want that you didn't think were that important." And you can do that with updates. You can do that with everything. Having said that, this is a process that's much slower than it needs to be. You should actually be able to just tell Facebook, "Look, I don't want to hear about Donald Trump."
Or tell Amazon, "Look, I just bought a watch. Why are you showing me more?" Or challenge the system like, "Why did you recommend that?" And then the system says, "Oh, I recommended that because blah, blah, blah." "Actually, no. Here's why." So, you should actually be able to have a much richer interaction with those learning algorithms than you have right now, except at this moment it's not possible. But the irony is that the machine learning algorithms that can take that kind of feedback, they exist. They're just not the algorithms that Facebook and Amazon and whatnot are using today. But, I think, it would actually be good for them as well at the end of the day if they would allow this rich interaction with the users.
Christopher: I noticed how good the YouTube suggestions have gotten and it always suggests really brilliant presentations for me now. And usually, it will be a collection of five brilliant presentations and The Duck Song because I've got a three-year old daughter and 200 million people have seen that video. When you play The Duck Song for your kid, go into incognito mode first and then you'll train your dragon in the right way.
Pedro: Precisely. I mean, in the case of things like Netflix, and I'm not sure if YouTube has this.
[0:45:01]
You can actually say -- For example, like I said, Netflix, you can actually say, "Well, this movie is for me. This movie is for my kids." They love you to do that because they don't want to be making those conclusions. So, yeah.
Christopher: Why do you think this is happening now? I think this is something that's really, really interesting. The tools I've been using to get started with machine learning, they've all been in the cloud. I've been paying about 90 cents an hour for what I think is like a world class piece of hardware on Amazon EC2 and it's all on demand. So, I fire up this computer and it's got a very sophisticated GPU built into it that I can use to do this machine learning stuff. This is new. I was wondering. You've been doing research for decades. Do you think it's the GPU that's enabling this massive explosion that's going on now?
Pedro: It's not just the GPUs but the GPUs are a part of it. There's actually several important reasons that have come together to make this happen now as opposed to 20 years ago. The first one is actually just the progress in machine learning itself or in AI, more generally. People for the first several decades of artificial intelligence, they were trying to build intelligent computers by programming them in detail. And that didn't work. AI kind of like got wedged for a while there.
The answer to that was machine learning, which says, actually, no, we're not trying to program these things. They're too complex and often we don't know how. We're just going to learn them from data. So, that shift from what was called knowledge-based AI or knowledge engineering to machine learning has been crucial. And following on from that, every few years there have been breakthroughs in machine learning where the algorithms get better and now are able to learn things that they couldn't before. This is one aspect.
Another very important aspect is the data. Like the whole thing in machine learning is that you get the smarts from the data and not by programming them in yourself. So, the more data you have, the smarter the computer gets with essentially no extra work from you. As the explosion of big data has happened in the last several years, this has been an amazing windfall for machine learning. Because just by taking advantage of the extra data, things suddenly get a lot better.
And then another one, as you were mentioning, is really the hardware. It's the computing power. Having reams of data is no use if I don't have the computing power to apply the learning algorithms to it. And so the progress in just Moore's law and cloud computing has been amazingly useful and, in particular, the GPU is a very interesting case because GPUs, as the name implies, were developed for computer graphics but by an accident of good fortune, they're also very -- The kind of math that they do very fast is also the kind of math that's required for certain types of learning, in particular deep learning and neural networks, which GPUs are perfectly suited for. So now they can do that type of learning on a scale that, for example, the others can't do yet.
And then, finally, there's another factor which is just economics. Machine learning these days is so valuable for so many companies. And it's so easy to quantify just how much of a return on investment you get from it that they have started to pour resources into it. As a result of which, there are many more people doing machine learning now than there were before. And even if all the other factors were the same, you have more progress in machine learning happening in one year today than used to happen in a decade. So, you combine all these things together and you get the explosion that you're seeing today.
Christopher: Yeah. I wondered for a while why Google open sourced TensorFlow, which is -- perhaps maybe you disagree with me but at that time I thought it was perhaps one of the most valuable pieces of software on the planet. Like why would they just give this away to everyone? And maybe part of the answer is that, well, you don't have the data. They've given you the car but you don't have the keys still. It's the data that's valuable.
Pedro: That's part of it, right. So, you have to remember that each of these companies has the thing that they make money on and the things that actually just help them make money. And you have to remember that -- So, Google did not release TensorFlow until Facebook released Torch. Once Facebook released Torch, Google got in a hurry. And the thing to realize is that Facebook does not make their money -- Facebook, in some sense, they just want to have progress in AI because they want to apply it.
They don't actually make money or have any plans, as far as I know, to make money by selling AI to people. And then releasing things as open source has many advantages. One is that it creates a community of users. And so it makes your products more likely to succeed because there are many more people using them. It is also very useful for recruiting. The biggest bottleneck for these companies in data science today is recruiting the talent because the talent is very scarce.
And if you have these tools that people started using in school and have done things with, that makes them much more -- If I'm a big user of Torch but not of TensorFlow and I have job offers from Google and Facebook, that makes it more likely that I'll go to Facebook. And, I think, Google understands this very well and that's when they released TensorFlow.
[0:50:01]
And again, you have to remember that part of what some of these companies like Google, for example, are trying to do now is sell AI to people, sell machine learning as an API or as a toolkit. And so this is part of Google's play into that area. For example, they're also competing with Amazon that has a very strong cloud presence. I mean, it has machine learning that runs on it and so forth. So, you have to realize that all of these are factors in whether Google releases TensorFlow or not. Having said all that, you're also right that it's one thing to have the algorithms and it's another thing to have the data. And the truth is that the biggest asset that Google has is not the algorithms. It's the data.
Christopher: Right. Talk to me about -- Maybe I'm wrong about this. You should correct me if I am wrong. But it seems to me like you spend all of your life trying to unify the different tribes, as you call them, of machine learning. Can you explain to me why you've chosen to do that?
Pedro: Sure. So, there are these different tribes. There's the connectionists, who want to reverse engineer the brain, the evolutionaries who do machine learning by simulating evolution, the Bayesians, the analogizers, the symbolists. And they each have their own master algorithms. So, for example, for the connectionists, it's backpropagation. And each of these algorithms, in theory, can learn anything. But in practice, it has limitations. When you have finite amounts of data and computing, there's many things that it can't do.
Each of these algorithms is very good at solving an aspect of the machine learning problem. Very smart people have worked at it for a long time and it's very good at that problem. But the key thing to realize is that because all of these five problems are real, no single one of the algorithms solves all five of them. But at the end of the day, to do things like cure cancer and have home robots and whatnot, we do need to solve all five at the same time.
So, what we really need is something like a grand unified theory of machine learning in the same way that physicists have the standard model that unifies the different forces and biologists have the central dogma and whatnot. So, the goal of my research and of a lot of other people's is to bring about this unified grand synthesis of machine learning. And we're actually pretty close to doing it. And once we have it, we will be able to solve all these problems at the same time and have a true master algorithm.
Christopher: And as a brand new practitioner, do you think I should worry about being sucked into one particular tribe? So, it seems like the deep neural nets are trendy at the moment. It's all I seem to see on the blogs and indeed the training courses that I've done. So, do you think I should worry about that?
Pedro: Absolutely. The single biggest mistake that people make -- And again, part of why I wrote the book is that they learn about one particular paradigm whether it was in school or by having heard about it somewhere from other people at work, and then they just start thinking along those tracks. And I've seen so many people in industry like they waste enormous amounts of time and money using the wrong thing for their problem when if they only learned a little bit and had a broader view, they'd go like, "Oh, for this what I need is A and for that what I need is B." So, just having that little bit of awareness of the larger field and what's good for what and how, it could save you so much work and so much pain. So, definitely.
Christopher: But all the tools are available equally. So, I mentioned Keras, which is fantastic. TensorFlow. You mentioned Torch. There's so many great tools. Do they cover all of the five tribes or am I always going to be limited by what's available?
Pedro: No, they do. So, if you look at, for example, like Kaggle competitions, even though, as you've said, deep learning right now is very popular, it actually tends to dominate only in vision problems and some sequential problems like speech and language. For most problems, actually what's still dominant is what I call decision tree ensembles, which is a symbolic method. And there are some very -- For example, the system called XGBoost, which was actually developed by a student here at the University of Washington, that you can download and go win Kaggle competitions with.
And then there are these toolkits like, for example, Weka is an open source toolkit that has all the main types of machine learning included in it. So, if you want to do a neural network using Weka, you can. If you want to do a kernel machine, you can. If you want to do a decision tree or an ensemble of decision trees, you can do all of those things.
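To make the "decision tree ensemble" idea mentioned above concrete, here is a minimal Python sketch. It is not XGBoost (which uses gradient boosting) and not Weka's API; it swaps in the simplest ensemble technique, bagging, with one-split trees ("stumps") trained on bootstrap samples and combined by majority vote. The data set, labels, and all function names are hypothetical and invented for illustration.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-split decision tree (a stump) on (x, label) pairs.

    Tries a threshold at every observed x and keeps the first one that
    misclassifies the fewest training points.
    """
    best = None
    for threshold, _ in points:
        left = [label for x, label in points if x <= threshold]
        right = [label for x, label in points if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left label for that side.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(
            label != (left_label if x <= threshold else right_label)
            for x, label in points
        )
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def fit_ensemble(points, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_trees)]


def predict(trees, x):
    """Majority vote over the predictions of all trees."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]


# Hypothetical one-feature data set (invented numbers and labels).
data = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
        (7.0, "high"), (8.0, "high"), (9.0, "high")]
trees = fit_ensemble(data)
print(predict(trees, 1.0), predict(trees, 9.0))
```

Each individual stump is crude and depends on which points its bootstrap sample happened to include; the vote across many of them is what makes the ensemble more stable than any single tree.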
Christopher: And what about the state of your software? I had a quick look at it and I couldn't really -- You warned me in the book that I need to have a PhD in computer science in order to understand it. So, what is the current state of Alchemy?
Pedro: Yes. So, Alchemy is an implementation of the master algorithm, if you will, that already unifies several of these things, not all of them. And it's been used by hundreds of different companies or more, hundreds that I know of. Having said that, there's really two parts to Alchemy. There's defining what you want to model and stating your knowledge. That part is actually pretty straightforward if you already know, first of all, the logic, which is essentially the language that we use in AI and computer science for everything. That part, so you need to know logic but then doing it is easy. But then what's actually not yet very easy but should be is just making the learning and the inference happen at the push of a button.
[0:55:01]
Where the state of the art is right now is that nobody can do that. This is why you need the data scientists. You need people who know the black arts and how to turn the knobs and whatnot. And for that at some point you do need to know the math. So, this is where we're trying to make progress, in trying to make it more and more push button, more and more easy to use for people without going into the innards. Again, the analogy with cars is useful.
In the early days of cars, if you were going to take a 100-mile trip, it was guaranteed that your car would break down a few times and you'd have to go and fix the car right there. This is what happens with machine learning today. You still need to be a little bit of a mechanic. But hopefully, in the not too distant future, you won't have to do it anymore.
Christopher: Right. So, do you think that computer science is not the right degree? Because I totally agree with you. I've seen some state of the art models that can tell the difference between a picture of a cat and a picture of a dog with 99.9% something accuracy. And it's done in six lines of Python. So, I'm like, okay, did I really need to spend a lot of time learning Python? What degree would you choose if you, maybe not yourself, a son or daughter or someone that you wanted to get into science, what degree should they do?
Pedro: I think if I have to choose one degree, it would be computer science because of all the disciplines that are involved in machine learning, computer science is the most important one. But there are others. Precisely the reason why data scientists are so scarce is that to be a good data scientist you have to know computer science, you have to know mathematics, calculus, linear algebra, you have to know probability and statistics, you have to know optimization.
And there are very, very few people who can do this. There's this joke that a data scientist is someone who knows statistics better than any software engineer and software engineering better than any statistician. What I would do is actually major in computer science but take a minor in something like probability and statistics and make sure that you learn these different areas. So, what you need is people -- We're starting to have these degrees that actually do have this.
So, you'll learn some computer science, not the gory details because you don't need that for machine learning, but you do need to know enough programming and data structures and whatnot. So, you need the key pieces of computer science but you also need the key pieces of probability and statistics and optimization and of calculus and so on.
Christopher: Well, this has been fantastic, Pedro. I'm so excited about machine learning at the moment. I'm about to get into a car and drive back to San Francisco from Santa Cruz, which is quite a long journey, to go and do another class. I'm so excited about it. It's absolutely amazing. Is there anything I've forgotten? Of course, The Master Algorithm is the name of the book that Pedro published late last year. Is there anything else that you'd want people to know about?
Pedro: I think we've covered the main things. Yeah, this is great.
Christopher: Well, thank you so much for your time. I really, really appreciate you. Thank you.
Pedro: Sure. Thanks for having me.
[0:57:46] End of Audio