Okay, it's 9:35, so let's go ahead and get started. Welcome to 6.036/6.862, Introduction to Machine Learning. As you are experiencing right now, our lecture will start at 9:35 a.m. every Tuesday. In particular, we're always going to start on MIT time, so that's five minutes past, and we'll end five minutes before 11, which should give you time to get between classes and things like that. The lecture is synchronous. This is totally live right now, so as you'll see in a moment, if you have questions, they can be a part of our discussion today. Something to keep in mind is that everything we're doing is in Boston's time zone. The reason I say that is because at some point in November, U.S. daylight saving time is going to change, so it's not exactly EDT or EST; you just want to look up what the Boston time zone is, you know, you can google it, and that'll be the appropriate time zone.
So a note about class numbers: 6.036 is the class that we're in here. There's a graduate version, 6.862, that some of you are in. 6.862 acceptance has been finalized at this point, so you should have heard already if you're in 6.862, and you'll want to register appropriately based on that. That being said, if you were accepted to 6.862 but you want to take 6.036, that's still totally fine; you can just switch into 6.036. If you're not signed up for 6.036, that's also totally fine; you can just get into 6.036 fresh. So 6.036 is still open for signing up. Now, our main course website is listed here. If you ever have any doubt, if you're looking for some information, just go to the course website. Everything is there, with one caveat: if you are in 6.862, then there's a separate Canvas website that has information there, and hopefully you already have access to that and are using it.
We're going to be taking questions at Discourse. If you haven't already gone to this link (there's also a nice link that our awesome TA Crystal has been posting in the chat), then I encourage you to go to it. At the very least, I encourage you to check it out after the lecture is over. That's where we're going to be fielding questions. You can also find this at the forum link on the course website, so again, you can really get to everything from the course website. You can ask about logistics at Discourse, you can ask about actual content at Discourse, anything related to the course.
The course staff will be reading the Discourse and responding to the questions there, and in particular, if something comes up that everybody could benefit from, or there's an interesting question that we all want to discuss, they'll bring it back to me, so you can start asking right now. The reason we're having questions at Discourse rather than in Zoom chat is that we want to hear a lot of questions from you, and we're hoping that by having it on Discourse, we can handle a higher volume, not just the things that I can possibly read in the Zoom chat in the time that I have, so I really encourage you to go there. The one thing that I'll mention is that if you have a question about the lecture because you're participating in lecture, then we encourage you to use the lecture one category. When you go over to Discourse, check out the great example question there that tells you how to set things up. The lecture one category is obviously for today, since today is lecture one, and there will be a similar category for each lecture in the future.
Okay, so I do want to mention that all of our materials are going to be available at the course website. The slides for this lecture are going to be available there. If I write anything on my iPad, I'm going to try to save it and make it available there. In fact, there will be a recording of this lecture available at the course website, so if you're not able to view it live for any reason, or if you want to look over it again in the future, that should be there as well. We will not be monitoring the Zoom chat for questions, so again, I encourage those to go over to Discourse.
Okay, so today's plan: we've already covered some logistics, but I'm going to be covering some more logistics.
There's a lot of logistics in the beginning, and then we're going to get to the good stuff: we're going to start talking about machine learning, we're going to set that up, and then we're going to dive into some details with linear classifiers. That'll really just be the tip of the iceberg for the rest of the semester, so that's where we're going here.
Okay, so I said we're going to do some logistics. Let's start by talking about prerequisites. This course has a number of prerequisites, and I'm just going to briefly go over them here. We have computer science prerequisites: in particular, we want you to be familiar with Python programming. We're going to be using numpy a lot. You want to be familiar with algorithms, and you should be able to read and understand pseudocode. We're going to be seeing a lot of pseudocode and talking through it together. We also have a number of math prerequisites; we're going to see that math is just really integral to machine learning. You should be familiar with matrix manipulation, things like the inverse and transpose and multiplication, sort of the standard things we do with matrices. We're going to be talking about points and planes in dimension greater than two, even starting today, so that should definitely be something that you're familiar with and able to deal with. Gradients: there'll be gradients all the time, so you'll definitely want to be familiar with those and be able to take gradients. And basic discrete probability: we'll talk about randomness and random variables, independence, conditioning, and things like that. Now, the most important thing here is that you're really going to want to do this readiness assessment. What I'm looking at here is a drop-down menu: if you go to the main course website that I talked about on the previous slide, you'll find a welcome to 6.036 thing that you can click on.
You'll see this drop-down menu, and then you're going to want to go to this readiness assessment and just check that you feel comfortable with the sort of material you need to be ready for this course. If you haven't already done that, we strongly encourage you to go and do that readiness assessment.
Okay, so once you've decided whether 6.036 or 6.862 is right for you, if in fact it is, then let's keep going. We have an amazing, amazing course staff for this. I am just one of a number of instructors that you'll be meeting. I'm gonna be doing the lectures mainly, but we're gonna see all these other instructors. Some of them are here right now, in particular Duane and Ike, and you're going to be seeing all of these instructors in your labs and your office hours, talking about questions and things like that, so you're going to have a lot of really exciting opportunities to talk with everybody. The same goes for our amazing set of teaching assistants. I think this list may even grow in the near future, so do not consider it to be a complete list for the rest of the course, but they are fantastic. They've been doing all this amazing work already, setting up the infrastructure for what we're doing and getting everything ready for you to be a part of this class. I believe we have with us right now Crystal and Satvat, but again, you're gonna be meeting all these other fantastic TAs in your labs and office hours and these other activities. You're gonna have a lot of hands-on time with them and also with the lab assistants. I do not have a set of pictures for the lab assistants because there are so many of them, but they're also fantastic and you'll be interacting with them a lot. So I just want to say there's a fantastic staff here, and I think we're going to have a great time together.
Okay, so let's talk a little bit about our weekly plan. We sort of have a weekly calendar that is going to go on in this course. Part of that is going to require getting some information from you about your schedule, so a really important thing to do, if you have not already, is to complete or update your schedule survey by noon today. That's Tuesday, the day that we're having our first lecture. Make sure to fill in your information about your schedule so that we can use it to plan where you are in the week, and I'll say where that comes up in a moment.
So first, of course, we have the lecture every Tuesday. In addition to the lecture, which again is both live and recorded, we're going to have a number of course notes that are available at the website. This is sort of the base of information on which you're going to build to then really engage with all of this material. We're going to have a number of other components to the week that are all about using this material and applying it, and that's where we believe that real learning is going to happen: this combination of taking in this information and then actually applying it. So where is that going to happen? Well, a few different places. The first component is that, in general, there is going to be a set of exercises due at 9 a.m. before lecture. Those were not due today; there was nothing due today, so don't worry about that. The first set will be due next week, before the lecture next week, and the idea here is to prepare just a little bit so you get the most out of lecture: making sure that you've done the reading, that you're ready to go in the lecture.
We're going to have a lab. Now, this is a really cool and fun component of this course, and I hope that you'll enjoy it as much as I have in previous semesters. It is synchronous because we're all going to be chatting together.
This is going to be highly interactive and involved, so you really need to show up to your assigned lab time at the time that it's happening. In particular, this is why we need to do this schedule survey: we need to know when you're available for the lab. The first one is this week, and it's going to be very chill. We're mostly just going to be making sure everything works, so don't feel like you need to be on top of absolutely everything in terms of that lab. We're really just going to be checking things out. After everyone fills out the scheduling survey, the staff are going to make the assignments to labs, so by this afternoon we expect to have a first round of assignments, and you'll be able to see those later. There'll be instructions about that, as well as how to self-swap. Especially in the beginning, we'll be letting people swap times, and that will be subject to space availability.
Okay, so what exactly is happening here? We're going to break everybody down into so-called MLyPods, so that's how you pronounce this ML-y pod thing: it's a MLyPod. It's 10 students. Subject to maybe some changes in the beginning, you're basically going to be with the same 10 students throughout the semester, so you'll be seeing some familiar faces, as well as TAs and LAs. Each time, you're going to work in a group of two to three and work through a bunch of really cool problems that are interesting to talk about, and then at the end you're gonna check off with staff. During the lab you can also ask questions of staff and engage with staff; we're gonna be there. I think this is just a really fun thing to do: you get to have a discussion about what's going on, you get to ask questions, you get to really engage with the material, and so hopefully you'll find that that is a fun thing as well.
Okay, so now there's the classic weekly homework; that's another part of our weekly setup. To find this sort of thing, you're going to go to the same home page as before, but now you're going to see a link that says week one basics. Basically, it's the link for week one stuff. You can scroll down, or once you activate it, you'll see a bunch of options. One of these is the homework, which is due on Wednesday, and you can see everything that's going on in week one. You'll see there are no exercises here, because there are no exercises for week one, and that'll change in week two. We'll have that done. So the first one, again, is due September 9th; as you can see, that'll be next Wednesday.
There is a nano quiz each week. We are not having a midterm in this class, and we are not having a final. There's no really big exam. Instead, we're going to have this sort of smaller exam every week. It's timed. The first one is this week, but it's just going to be in lab and there's basically no content to it; it's just checking the mechanics, trying it out, so don't worry about that. It's also ungraded, but we're going to go through that and make sure that you're comfortable with the nano quiz format, because we're going to start doing it for real next week. Starting next week, you're going to have 24 hours to complete this before your lab section. As soon as you start it, it is timed.
Okay, so other components of your week that you may choose to participate in are office hours. We have tons of office hours: almost every day has office hours options, they're at all kinds of different times, and I believe that the first office hours are going to be this Sunday. And then finally, if you're in 6.862, you'll also have various project aspects that you're involved with.
You have a project that you're completing, and again, the information for that is the one thing that is not at this first course webpage that I’ve pointed out for 6.036; it's in Canvas, so you'll want to check that out.
Okay, so that's all the logistics. I'm just going to take a second and see if anything came up in the Discourse that we absolutely have to cover right now, but I think you're going to be getting familiar with a lot of this during the semester. It'll become much more smooth and clear as we go on, and a lot of this week is just making sure that you are familiar with these logistics, that you're ready to go and getting used to the setup.
Okay, so with that, let's get into the good stuff, which is machine learning. Even though you're here for a machine learning course, it's worth asking: why are we talking about machine learning? Why do we have a whole course on this, and in fact many courses? I think a really short answer, which we're gonna expand upon quite a lot, is that it's everywhere. So let's see some examples to dig a little deeper and see if we can come up with a better answer to why we are talking about machine learning, and what machine learning is, for that matter. In order to do that, I actually went on Google News the other night and just looked at what some recent articles about machine learning were. Unsurprisingly, there were a ton just in the past week or two, so let's talk about a few of these. Here's one: “Machine learning algorithm confirms 50 new exoplanets in historic first.” Exoplanets, in case you're not familiar, are planets that are outside our solar system. Scientists are really interested in discovering them; they'll be super interested to learn about new ones. Here scientists have data which are roughly something like images from telescopes. They see certain types of signal that tell them: “hey maybe there's an exoplanet here,” and then they use machine learning to decide which of these candidate exoplanets are really exoplanets and which ones are something else.
So here's another article that came up in Google News. It's from “The Lancet”: a machine learning algorithm for neonatal seizure recognition, a randomized controlled trial. Here they're looking at newborns, and they want to be able to detect seizures in them so that scientists or medical practitioners can provide appropriate medical care. The data that they have is EEG data that provides some monitoring of the brain, but it's super labor intensive to interpret it very quickly, so they want something that's going to be much more automated to be able to tell them: if a newborn makes some kind of movement, is that a seizure and something we're really concerned about, or is it just something normal and something we don't have to worry about? So they want to take each candidate movement and decide: is this newborn experiencing a seizure or not?
Here's another one. This one's actually from a little bit longer ago, but it's an analysis by Reuters of the Supreme Court in the United States. They did this sort of in-depth news analysis, and they looked at all these petitions that go before the Supreme Court. The Supreme Court takes these petitions and decides whether it's going to hear the case that's associated with them. As one part of the analysis, these news people took thousands of petitions, looked at the text of those petitions, and decided what the topics or themes in those petitions are. They want to know what the petitions are about, but in an automated way, because it's hard to read so many petitions; that's a really involved kind of thing to do.
Another aspect of machine learning and government is in the Bureau of Labor Statistics. This one is also in the United States. Here the Bureau of Labor Statistics has text data from interviews and surveys on various occupations and events that happen, and they want to turn those into codes. They want to know “oh, was this about a particular occupation, like say a janitor? Was it about a particular event like a workplace injury?” That'll help them figure out what's going on in workplaces and plan things appropriately. But it turns out people are super expensive to train to do this task, and actually people aren't that great at it. They often don't agree, and so they'd like to have a better way to do it: to use machine learning to automate that.
Phishing and spam detection are a super classic example of machine learning, and yet this article, you can see, is just from the past few days. Phishing is when somebody tries to get your password or your sensitive information by pretending to be somebody else, often in an email. So if you're an email client like Gmail, you really want to detect this kind of thing before it even hits somebody's inbox, so they can't possibly be duped by it. The data that they have is the text of the email, and they want to decide: “Is that email phishing or not?”
Now, machine learning can sometimes be controversial. There are some really controversial examples that have come up recently. For instance, this is an article talking about the controversy around facial recognition. In facial recognition, the data that's available is often surveillance footage, images of people, maybe from security cameras, and a decision might be: “Who is this? Who is this person?” And perhaps that person might then be arrested if they were caught doing some kind of crime. So one aspect that might make this controversial is: what if you catch the wrong person? What if you misidentify who that person is?
What if you say it's somebody else? So that could be an interesting and important part of machine learning in society, and we want to know what's going on there.
Machine learning is also used in things like finance. Here somebody is deciding how to distribute loans in India. They're interested in: “Should I give a loan to some farmer?” And they have satellite images of the farm, weather information, and other data, so they can figure out: “Is this farm gonna make enough money to repay the loan and should they give this farm the loan?”
Okay, so that's a bunch of examples of machine learning in the news, often from just the past week or two, and so now let's try to answer these questions that I posed at the top of the slide. What is machine learning? Well, if we look at all of these examples, we see that they're all cases where somebody has a bunch of data and they're using some kind of method to make a decision. We saw various types of data: for example, maybe I have data about newborns and I'm trying to make a decision about whether they're having a seizure and whether I'm going to give them a certain type of medical care or not.
Okay, so why are we studying machine learning? I think that in a lot of cases, a really natural answer is that we want to apply it. It seems like a very powerful set of tools. We've seen already in these examples that it often has the potential to save time and energy and resources, and that can be really helpful in a lot of cases.
We want to be able to apply it to new areas and get the benefits of that automation in new areas, but I think it's worth highlighting at least two other reasons that we're studying machine learning. One is to understand. People are already using machine learning in many different areas, or they have big plans to use machine learning, and we can see here that that has the potential to impact your medical care, your finance, your security, your experience of government. These are important decisions that affect you on a day-to-day basis, and to really understand what's going on, you have to understand how machine learning works. That's also related to this issue of evaluation. Lots of people are using machine learning, but also lots of people are claiming to use machine learning these days, so when somebody says “hey, I've got this new machine learning method,” you want to ask: does that really work? And maybe even more to the point, does it work as intended? What are the effects of that method? What is it doing exactly? Even if it's something that works well, does it need improvement? Are there ways that we can make it even better? These are the sorts of questions that we're hoping to address throughout this course. Obviously this is not a full answer to “what is machine learning?” and why we are studying machine learning, and I think we're really going to be spending the rest of this course trying to answer these questions in some sense.
Okay, so before we leave this overview slide, let me just make a couple more points that I think are important. One is that machine learning is a tool: it's not magic, and it's not going to just answer any question that you possibly have. Just like any other tool, it has times it's useful and times it's not useful, and actually there are some famous examples of it not being useful.
So there was this interesting study that was done very recently, where a collaboration of international researchers had this extremely rich sociology data on the life course of various children, and they wanted to predict what was going to happen in the future with these children. Even though they said anybody could participate, and they had tons of fantastic groups using all the best modern machine learning methods, they just couldn't predict very well. I think one of the things that we're going to talk about in this course is when machine learning works well, when you can expect it to work well, when it might not work well, and how you can detect that. I'm also going to bring up here that machine learning is built on math; fundamentally, math is its real foundation. That's why we have all these math prerequisites, because we're definitely going to be using math, and I think the farther you go in machine learning, the more you're going to see that you do need math and that you build on some really interesting mathematics.
Okay, so in some sense this slide is about motivating us to care not just about machine learning but about understanding the deep inner workings of machine learning, but it's all pretty high level and vague in some sense. So let's dive deeper into an example, and into making a lot of this formal, to get a better handle on exactly what's going on and really start to answer these questions.
Okay, so let's go ahead and get started. Let's think about, essentially, what do we have at our disposal and what questions are we trying to answer? So what do we have at our disposal? Well, essentially we have data; this is what we saw in all those examples before. Let's make a cartoon version of one of those examples. Let's focus on this idea that we have data on a bunch of newborns, for instance, and we're interested in trying to say “hey, if a new newborn comes into my hospital, I want to know if it's going to have a seizure.” From that perspective, what we have right now could be seen as training data, because we're going to train a machine learning method that we can later use to predict, to decide, what's going to happen with future newborns. The only thing we have access to right now is the data we're going to use to train, and we're going to call the number of data points that we have little n. So for instance, maybe we have observed little n newborns in our hospital in the past. Now, what exactly makes a data point? First of all, let's say that i is the index of our data point, so we can talk about data point one, data point two, all the way up to data point n. If we look at data point i, it's going to have a feature vector associated with it; this is sort of everything we measure about, for instance, this newborn in our example. Let's call that feature vector x^(i). The superscript denotes which data point it is, and x itself is a vector, so we can denote the different elements of the vector by x_1 through x_d. So d is the dimension of the vector, and we're saying this all lives in d-dimensional Euclidean space, R^d. I'm going to draw a cartoon of this now.
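As a rough illustration of the notation just introduced, here is a minimal numpy sketch of n feature vectors of dimension d. The array name X, the helper feature_vector, the numbers, and the choice to store one data point per row are all assumptions made for this example; they are not taken from the lecture or from the course code, which may use a different layout.

```python
import numpy as np

# Hypothetical toy data: n = 4 data points, each with d = 2 features.
# Storing one data point per row is an assumption of this sketch.
X = np.array([
    [0.9, 0.2],   # x^(1)
    [0.8, 0.4],   # x^(2)
    [0.3, 0.9],   # x^(3)
    [0.7, 0.1],   # x^(4)
])

n, d = X.shape  # n = number of data points, d = dimension of each feature vector


def feature_vector(i):
    """Return x^(i), using the lecture's 1-based index i = 1, ..., n."""
    return X[i - 1]


x_3 = feature_vector(3)   # the whole feature vector x^(3)
x_3_first = x_3[0]        # its first element, x_1 in the lecture's notation
```

The superscript picks out a data point (a row here), while the subscript picks out one measured feature within that data point.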
A fundamental limitation that we should always be aware of is that we as humans can only see in two dimensions, and the reality is that when you're doing machine learning, d is probably going to be bigger than 2. For instance, in the actual newborn seizure study that we just looked at, d was 55. Here, though, I'm going to draw my cartoon in two dimensions because that's all we can see. So what's data going to look like here? Well, a particular feature vector is going to have an x_1 value and an x_2 value, so it's going to be a point in two dimensions, and we'll just get a few of these points for our data. What's going to happen here is that we're going to consider labels for this data. One way to think about this is that, in the particular case of the newborns, some expert has come in and told us: was this newborn experiencing a seizure or was it not? For instance, in this example, let's imagine that x_1 could be how much oxygen the newborn is breathing. This is totally a cartoon. If you actually care about the medical example, go read that paper, but for my cartoon today, let's say x_1 is how much oxygen the newborn is breathing, x_2 is how much the newborn is moving, and then -1 is that this newborn did not have a seizure and 1 is that they did have a seizure. Now, something that's worth keeping in mind here is that getting the x's is super non-trivial. If you actually look at what happened in this paper, for instance, they have this very complex EEG data, and then they find this 55-dimensional representation of features that expresses important points about it. For the moment, we're just going to assume we have these features, but it's worth keeping in mind that how you turn a newborn into a feature vector is a really involved process and something that you should be aware of. Okay, so what's this going to look like?
So actually we had this expert come in and we got all of these labels on our data. Here x^(1) and x^(2) represent newborns who did not have a seizure, x^(3) represents one that did have a seizure, and in fact we collect a lot of data on these various newborns. Once you have all of this data, I think you can look at this and you have this intuition that if a new newborn came by and you saw this, you could probably predict whether they were going to have a seizure or not. You could probably predict whether it was going to get a minus value or a plus value based on the x's that we're seeing, and in some sense what we're going to be doing for the rest of our time is formalizing that intuition. Okay, so this is all called our training data, and we're just going to give it a name: we're going to call it D_n. So D_n is going to represent all of the training data. Each data point is a pair: it's got the feature vector and the label, the x and the y, and then we collect all of these pairs into our set of training data D_n. Now we want to ask, “okay, well, we have all this data, but what are we doing with it? What's the point? What are we trying to accomplish?”
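One way to picture the training set D_n is as a list of (feature vector, label) pairs. The sketch below is only an illustration: the numbers and the names features, labels, and training_data are made up for this example, with labels -1 for no seizure and +1 for a seizure as in the cartoon.

```python
import numpy as np

# Hypothetical cartoon values; x_1 = oxygen, x_2 = movement (made-up numbers).
features = [
    np.array([0.9, 0.2]),  # x^(1)
    np.array([0.8, 0.4]),  # x^(2)
    np.array([0.3, 0.9]),  # x^(3)
]
labels = [-1, -1, +1]      # y^(1), y^(2), y^(3): -1 = no seizure, +1 = seizure

# D_n collects each data point as a pair (x^(i), y^(i)).
training_data = list(zip(features, labels))

n = len(training_data)     # number of data points
for i, (x, y) in enumerate(training_data, start=1):
    print(f"data point {i}: x = {x}, y = {y}")
```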
Okay, well, at least in this example, what we really want to do is say “hey, we saw all of these newborns before. An expert labeled all of them, but we want to know if a new newborn came in, are they going to have a seizure or not”, because if we knew that, then we could provide better medical care. We could make sure that somebody's really monitoring them and being careful, and we could have good actions that we could take. So we want a good way to be able to label new points, and there are two things that are implicit in this statement that we're going to try to make concrete. One is: how do we label new points? What is a way to label new points? And separately: what makes it good? We have intuitions, I think, about both of these things at this point, but we're not precise yet, so we're going to make both of these ideas precise going forward. We're going to start by focusing on a way to label new points, and then we'll get to the question of what is a good way to label new points.
Okay, so let's start by saying how to label new points. Here's a new point. Look at this: I've got this little black “x” that's out in my space of points in x_1, x_2, and I want to say “hey, here's a new point. How do I label it?” I want to be able to do this for any point that comes along, any possible value of these covariates, these features, this feature vector x. If you think about it, what I'm essentially describing is a function. What this function, let's call it h, should do is take in any point in this Euclidean space, any potential value of the feature vector, and return a label, in this case -1 or 1. We're going to call this function a hypothesis.
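In code, a hypothesis is just a function from a feature vector to a label. The sketch below is an illustration under assumptions: the name threshold_hypothesis, the 0.5 cutoff, and the rule itself are invented here to show the input and output types, and are not a hypothesis proposed in the lecture.

```python
import numpy as np


def threshold_hypothesis(x):
    """A made-up hypothesis h: R^2 -> {-1, +1}.

    Labels a point +1 when its second feature exceeds an arbitrary cutoff.
    The rule and the 0.5 cutoff are illustrative assumptions only.
    """
    return 1 if x[1] > 0.5 else -1


# h must return a label for any point in the feature space,
# not just for the points that appear in the training data.
new_point = np.array([0.6, 0.7])
predicted_label = threshold_hypothesis(new_point)
```

Whether such a rule is any good is a separate question from whether it counts as a hypothesis; here it only needs to return -1 or +1 for every input.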
Now, just another way to think about this is that, again, this is a function: it takes a value of x, it goes through our function h, and it returns a value y, a label. Okay, so this x could be anywhere, so here I'm just moving around this little x, you know, up here in my data, and I just want that for any possible x, this is going to give me some label y. Okay, so here's an h: I'm just going to define an h that says for any x, h(x) = 1. Okay, so now we're going to try something out: so I told you that all your questions are going to be on Discourse, but every now and then I'm going to ask you, the audience, a question and I'm going to see if you can respond to me on Zoom. So go to Zoom, find the private chat, and find the ability to write to Tamara Broderick. Don't write to Tamara's iPad, don't write to anybody else, just to Tamara Broderick, and here's my question for you: this thing I just wrote down, an h where for any x, h(x) = 1, is this a hypothesis?
Okay awesome, everybody is responding totally correctly. Absolutely, it's a hypothesis. Now here's a follow-up question: so we asked is this a hypothesis, now here's my second question. So yes it is a hypothesis because it's a function. Any function that goes from R^d to -1 and 1 is going to be a hypothesis. Now here's my new question: is this a good hypothesis? Okay awesome, everybody's totally nailing it. No this is not a good hypothesis because this hypothesis is telling me to just say that every newborn had a seizure and that's just not useful. You know we have this intuition that a good hypothesis should be able to discern between the newborns who had seizures and the ones who didn't and so now we're going to start thinking about, you know, clearly we need some better hypotheses to choose from than this one that I just talked about here. Great yeah it's as somebody said it's terrible, I agree. Okay awesome. Okay so now what we're gonna do is we're going to start by talking about a richer set of hypotheses and then we're going to talk about what makes a good hypothesis. We, right now, we're still talking about intuition about what makes something good, which is great. You all clearly have fantastic intuition but we'll make that precise, but first let's come up with a richer set of hypotheses: the so-called linear classifiers. Okay so first, let's define something called a hypothesis class, it's just a collection of hypotheses so let's call that “script H”.
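As a minimal sketch of the point above, here is the constant hypothesis written as a Python function. The function name `h_always_plus` and the sample points are made up for illustration; they are not from the lecture or from real data.

```python
# A hypothesis is any function from feature vectors to labels in {-1, +1}.
def h_always_plus(x):
    """The constant hypothesis h(x) = 1: it ignores x entirely."""
    return 1

# Hypothetical feature vectors (x_1, x_2), invented for this example.
points = [(0.5, 2.0), (-1.0, 3.5), (4.0, -2.0)]

# Every point gets the label +1, so the hypothesis cannot tell any two apart.
labels = [h_always_plus(x) for x in points]
```

It is a valid hypothesis because it returns a label for every input, but it is not a useful one, since every newborn would get the same label.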
So here's an example hypothesis class, it's the class of all hypotheses that label 1 on one side of a line and -1 on the other side of a line. So let's see some examples of this. So here's a line. That's not a hypothesis though, right, because a hypothesis is a function on all of the x's, and so I need to say, for every possible x, for every value of x_1 and x_2 I could get, what my hypothesis is going to tell me. So it's got to be a function over all the x's, but with a line I can now specify such a function. So here's one such function. I'm going to predict plus on the upper side of this line, if I go sort of up and to the right, and I'm going to predict minus on the other side, so that's one hypothesis in this class. Here's a different hypothesis that's based on the same line: let's take the same line but now let's predict plus on this side of the line and minus on this side of the line, so that's a different hypothesis in this class. Here's yet another hypothesis in this class: here's a line and I'm going to predict plus on this side of the line and minus on this side of the line. Now I think you'll probably already have some intuition that, of the three hypotheses we just named, some are better than others. Now what we're going to do on the rest of this slide is we're going to make this idea of having a collection of lines with a label of plus one on one side and minus one on the other side concrete and mathematically precise, and so we're going to go into some math for this. If it doesn't all immediately click, don't worry about it, because you're going to be spending a lot of time in problems, you know, engaging with this idea of a line and this linear classifier, and if all you get from this slide is that we're going to come up with a set of hypotheses that look like this, that's fine. You can always go back to the math later, because that's literally all we're doing for the rest of the slide: taking what you already see here and making it precise.
Okay so let's do that, so it's time for some math facts. Okay get ready for the math facts. So here, we're taking the exact same space as before. I'm drawing a cartoon of x_1 and x_2, but you should really think of this as the whole R^d space.
So this could be much more than two dimensions, that's certainly what we expect in a real machine learning problem, and suppose I look at a particular point x. Now, I can think of this point x as a vector if I draw a line, a ray, from the origin to x_1, x_2, that's a vector. Now suppose I come up with another vector, let's call it theta for the moment. This is just any other vector, it's a vector I chose, it's just some vector out there. Now something I can do if I have these two vectors is I can take what's known as their dot product, you can also just think of this as theta transpose x. Now something you should always do whenever you're doing matrix vector multiplication, is do a little unit test to make sure that this is even something that makes sense and so what I mean by that is let's do a dimensionality analysis. So x is a d by one vector so this is an important point that I sort of, you know, went by pretty quickly on the previous slide but we're going to be thinking of x as a column vector in general so it's really going to be the number of dimensions d by 1. It's a column vector. Now theta is also a column vector, so when I take its transpose, and it's in the same space, so when I take its transpose it's going to be 1 by d. So first question, can I multiply these two vectors together? Yes because their inner dimensions agree, or more to the point, you could think of these as two matrices: a one by d and a d by one matrix. Their inner dimensions agree so this is an okay multiplication to do. What am I going to get out when I do this multiplication? Well I'm going to look at the outer dimensions. I'm going to get out a one by one matrix, aka a scalar. It's just a number. I'm going to get out a number. Okay so here's the real math fact that's going to come up: what does this number mean? Here's one way to interpret this number. If I took x and I looked at its projection onto theta. What I mean is how much of x is in the direction of theta? 
So some of it's in the direction of theta and some of it's in a direction perpendicular to theta and I want to ask how much of it is in the direction of theta? It turns out that that is exactly this dot product divided by the size of theta, the length of theta, so that's what that notation means: it just means the length of theta and so that's what we have here.
So that's the meaning of this, it's just the projection of x onto theta. Now here's another unit test. It's always good to sort of unit test any idea that you have. Let's think about what's going on here. Okay let's do a little check. Notice that if I multiply theta by a constant, what would happen to this fraction? Well the constant would come out of the numerator and it would come out of the denominator and then I would get back the exact same value, and so, does that make sense? Well yeah, if I multiply theta by a constant I just make it bigger or smaller, but the projection of x in its direction doesn't change, you know, the amount of x that's in the direction of theta doesn't change, and so that seems like a nice little unit test there. Okay so if this quantity represents the amount of x in the direction of theta and I choose an x that is perpendicular to theta, what is this quantity, this theta transpose x divided by the size of theta, for this x? So what is the projection of this x, which is perpendicular to theta? And you're all answering in Zoom and you're totally nailing it. It is zero. The projection of x here onto theta is zero because there's no amount of x that's in the direction of theta. Awesome, fantastic answers. Okay so here's another x, here's a different x that's also perpendicular to theta. Same question: what's the projection of this x, this new x, onto the theta vector? Great, and I love that some of you are saying “still zero” to distinguish from your previous answer. Great, it indeed is still zero, this is also zero. Fantastic, great, okay, perfect. So we saw for each of these two points the projection of x onto theta is zero. I think that you can see that by a similar argument, we're expecting that for these points, if we looked at these values of x, the projection of x onto theta would also be zero. In fact it's not just these points, right, it's any x on this line.
If I take any x for which its vector is perpendicular to theta, I'm going to get that its projection is zero and so here's one way I could write that observation. I could say, let's look at the set of x such that (so this colon should be interpreted as the word “such that”) I'm interested in the set of x such that the projection onto theta is equal to zero and that describes a line and that's exactly the line that we're seeing here.
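The projection and the two "unit tests" described above can be checked numerically. This is a small sketch in plain Python; the helper names (`dot`, `norm`, `projection`) and the specific vectors are assumptions chosen for illustration, not values from the lecture.

```python
import math

def dot(u, v):
    """Dot product theta^T x of two vectors of the same dimension."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    """Length (size) of a vector."""
    return math.sqrt(dot(v, v))

def projection(x, theta):
    """Signed amount of x in the direction of theta: theta^T x / ||theta||."""
    return dot(theta, x) / norm(theta)

theta = (3.0, 4.0)   # hypothetical theta with length 5
x = (2.0, 1.0)       # hypothetical point

p = projection(x, theta)

# Unit test 1: scaling theta by a positive constant leaves the projection unchanged.
p_scaled = projection(x, (30.0, 40.0))

# Unit test 2: any x perpendicular to theta has projection zero.
x_perp = (-4.0, 3.0)
x_perp_2 = (8.0, -6.0)
p_perp = projection(x_perp, theta)
p_perp_2 = projection(x_perp_2, theta)
```

Both perpendicular points lie on the line through the origin described by the set of x whose projection onto theta is zero.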
You can play the same game and ask yourself okay, well suppose I'm not interested in the set of x whose projection is zero. Suppose I'm interested in the set of x whose projection is a.
Well you can go through the same set of reasoning, you know, just think it over, go back to our notion of a projection and what that means, and you'll see that you're also going to get a line, and it's going to be a distance a away from our first line, and this line is the set of x such that the projection of x onto theta is equal to a. Okay now here's something that's just a little bit tricky that'll make sense hopefully if all of this has made sense so far. What happens if we go in the other direction? What if we say the distance is b but it's in sort of the opposite direction from theta? Then if you think about it for a second, the projection of x onto theta has size b, but we're going to get a negative in there because we're sort of going in the negative theta direction, so the projection is equal to negative b.
Okay so we've defined three lines at this point, and in fact what's cool about this is we now have a way to define lines. Now, I want to emphasize that in defining lines here we've totally thrown out, if you remember it maybe from high school or something, this notion of a line being y = mx + b. You just want to completely forget about that, partly because you'll notice y doesn't make an appearance here anywhere. The y in y = mx + b is just a totally different y. It doesn't mean the same thing as what we're doing here. We want to treat all the x's somehow the same because they're all different features, and the y's are very different because they're labels, and so this way of defining a line is going to be really useful for us for that reason. Now I want to emphasize what we haven't yet done, though: we haven't come up with a hypothesis, right, because at this point all we've done is define a line, but we haven't said how to label everything in the space, and that's what makes a hypothesis. It's got to be a function over the whole space, and so let's think about how we can do that. In order to go in that direction, I'm going to focus just on this single line, and I'm going to make the observation that as we go in the direction of theta, the projection becomes more positive, right? We just saw a couple of lines where the projection was equal to zero and where it was equal to a, and both of those are more positive than negative b. So on this side the projection's going to be greater than negative b, and if we go in the opposite direction, we'll see the opposite effect: the projection will be even more negative, the projection will be less than negative b.
Okay so now we're pretty close to actually having a hypothesis, an ability to label things that are going on in the space. What we're essentially going to do is say: once we have a line and a direction defined by theta, as you go in the direction of theta away from that line, you have one label, and as you go in the opposite direction from theta away from that line, you have another label. Okay, now in order to write this out, I'm just going to slightly change the way that we've been writing things. So in particular here, a completely equivalent way to write the set of x such that the projection is equal to negative b is the following. All that's happened here is we multiplied both sides by the length of theta and then we brought that b times the length of theta over to the left-hand side, so a completely equivalent way to describe this line is to say the set of x such that theta transpose x plus b times the length of theta equals zero. Okay now we're going to find it useful to not have to worry about the length of theta and how it interacts with b, we're just going to call that a particular new constant, let's call that theta naught. So here all we're saying is that instead of using b times the length of theta, we're going to use a constant called theta naught. Again, completely equivalent so long as we choose theta naught appropriately, or equivalently, if I choose a theta naught, that implies that there's a particular value of b.
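Written out as a chain of equivalences, the rewriting described above looks like the following. This is a LaTeX sketch of the spoken steps (it assumes the `amsmath` package for `align*`); the symbols are the ones used in the lecture, with \lVert \theta \rVert denoting the length of theta.

```latex
\begin{align*}
\frac{\theta^\top x}{\lVert \theta \rVert} = -b
  &\iff \theta^\top x = -b \,\lVert \theta \rVert
     && \text{(multiply both sides by } \lVert \theta \rVert \text{)} \\
  &\iff \theta^\top x + b \,\lVert \theta \rVert = 0
     && \text{(move the right-hand side to the left)} \\
  &\iff \theta^\top x + \theta_0 = 0
     && \text{(define } \theta_0 = b \,\lVert \theta \rVert \text{)}
\end{align*}
```

Each step is reversible because the length of theta is positive, which is why choosing theta naught pins down b and vice versa.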
Okay great, so at this point we have a way to define a line and we know what's happening on both sides of that line and so let's write that out carefully.
So what I'm doing next is I'm just taking what we have on this slide and compressing it, so I'm just getting rid of that middle equation there. So we have these three equations, all completely equivalent ways to write the same thing, so long as we have this relationship between b and theta naught, and I'm just going to get rid of that middle equation, so that's all that's happening here. So now we just have these two equivalent ways to write the same equation.
Okay so now we're ready to define a particular linear classifier.
So a particular linear classifier, it's going to be some h and h has got to be a function that takes inputs that are in our x space, so that's what we have here. We have a function h and our function h is the following: it says let's look at the sign of this value that we've just been calculating. In particular, on one side of the line where that sign is positive, where theta transpose x + theta naught > 0, we're going to apply the label 1. On the other side of that line, the line that is defined, remember the line itself is defined by theta transpose x + theta naught = zero, so on the other side of that line, theta transpose x + theta naught < 0 and we're going to assign the label -1 there. So these are just two completely equivalent ways of writing the same thing, which is that we'll assign a label 1 on one side of the line, a label -1 on the other side of the line, and the line itself is theta transpose x + theta naught = 0. Okay, so there's this sort of annoying thing, which is that technically we also need to assign a label on the line itself and we haven't done that just yet, and it's a little bit arbitrary, but this is what we're going to do. We're just going to take one of these directions and add equality to it, so we're going to say that we're going to assign the label -1 on the line itself. It's a choice, it's not super important, but we're going to make that choice. Okay so now, we have the definition of a linear classifier, so this is a linear classifier, but something that's going to be really useful to us is to be able to distinguish different linear classifiers. We can't call them all h you know. If I talk about h and I want to compare to h and then how about this other h, I mean that's going to be a problem right? I need to be able to talk about a particular h versus another h and so on, and so we're going to introduce this new notation which indicates which h I'm talking about. 
So if you look at this h, you can tell that this h is defined by the values of theta and theta naught. Once I know those values, I know what h I'm talking about, and so we're going to add those values into our notation of h and we're going to put them after a semicolon to show that they are not inputs to our h function. So h is still a function that goes from x's to y's but sometimes it has values that tell you about it, you know, that index a particular h function, and we call these values “parameters,” again to distinguish from the inputs of the function, to distinguish from the feature itself. These just tell you how to apply h or let you distinguish between different h's. Okay so in this case our parameters are theta and theta naught.
And now our hypothesis class H. So we said at the beginning of this slide, a hypothesis class is a set of hypotheses. That makes sense, I mean it's almost tautological, it's a class of hypotheses. But then we wanted to define a particular hypothesis class: all the hypotheses that label 1 on one side of some line and -1 on the other side. And so now we've defined a hypothesis that labels 1 on one side of a line and -1 on the other side. We can do this for every line, but not just every line, every direction that you might label plus and minus, and we can see that that's determined by theta. And so now when we collect all of those linear classifiers together, we can define exactly the hypothesis class that we wanted to, which is all the hypotheses that label 1 on one side of a line and -1 on the other, and we call these the linear classifiers. So any example of this is a linear classifier, and the particular example we might be looking at is defined by theta and theta naught, but then the collection of all of them is this script H.
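The linear classifier defined above can be sketched in a few lines of Python. The function name `linear_classifier` and the example parameter values are hypothetical; the rule itself, including the choice to label points on the line as -1, follows the definition given in the lecture.

```python
def linear_classifier(x, theta, theta_0):
    """h(x; theta, theta_0): +1 if theta^T x + theta_0 > 0, otherwise -1.

    x is the input; theta and theta_0 are parameters that pick out
    which member of the hypothesis class we are talking about.
    Points exactly on the line get the label -1.
    """
    score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
    return 1 if score > 0 else -1

# Two hypothetical classifiers built on the same line x_1 + x_2 - 1 = 0,
# labeling opposite sides as plus.
theta_a, theta_0_a = (1.0, 1.0), -1.0
theta_b, theta_0_b = (-1.0, -1.0), 1.0

above = (2.0, 2.0)     # on the side theta_a points toward
below = (0.0, 0.0)     # on the opposite side
on_line = (0.5, 0.5)   # exactly on the line

labels_a = [linear_classifier(x, theta_a, theta_0_a) for x in (above, below, on_line)]
labels_b = [linear_classifier(x, theta_b, theta_0_b) for x in (above, below, on_line)]
```

Negating both theta and theta naught keeps the same line but swaps which side is labeled plus, which is why the hypothesis class contains two classifiers for each line (they agree only on the line itself, where both return -1).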
Okay so let's take a step back. Where are we right now? We have our data and then we decided we want to do something with it, we wanted to be able to, say, you know, make some prediction in the future based on that data. In order to do that, we needed a way to make predictions, and we needed it to be good and at this point, essentially we have just come up with ways to make prediction. They're not even informed by the data, we've just named a bunch of ways that one could make a prediction, but you don't need any data to define a linear classifier, you could just define a linear classifier. You can choose your favorite theta and your theta naught and so essentially what we really need to do now is to talk about what is a good linear classifier. We've talked about that in intuition, but we haven't specified very precisely what that means, and then we need to be able to find a good linear classifier, and so that's what we're going to talk about next. First, what makes a good linear classifier, and then how do we find such a good linear classifier. Okay so let's first talk about how good is a classifier.
Again, we're just going to be taking the intuition that you already have and formalizing it here.
Okay, so what makes a good classifier? What would be a good classifier? Well in some sense again, let's think about, you know, what are our data analysis goals? I think it's always good to come back to that, to ask ourselves: what are we trying to do? And so if we have a bunch of data on newborns who have had seizures, we don't just want to look at that data, we want to be able to say for future newborns who come into this hospital: are they going to have a seizure? We're trying to make predictions about future data points, so in that sense, what we really care about is that we want to get those right. We don't want to say that these newborns aren't going to have seizures and then they actually have seizures. We would feel really bad about that. We want something that's going to be really accurate, so that people can make the best possible judgments about medical care, and so in that sense again, we want to predict well on future data, future data that comes along. Now in some sense what's really going to happen is that a number of newborns are going to enter into this hospital, we're going to be getting, you know, data on all these newborns as they come in, and maybe it's going to be hundreds or thousands of them, and we'd like to do well on all of them, but in order to talk about all of them, I want to be able to talk about a particular point and then we can talk about multiple points. By the way, I'm just going to remind people very quickly: if you have any questions about the lecture, make sure to go to the Discourse link, and so hopefully you can see it in the chat, maybe one of the staff can just repost it. So I'm only using the chat on Zoom for answers to my questions, but if you have any questions in general, make sure to post them on Discourse and then the staff will either filter them to me or answer them on Discourse. Great, thanks very much. But okay great, so let's start by asking: how good is a classifier at a single point?
Oh I see that there actually is a question, sorry about that yeah do you wanna just put that out there? Yes, so the question was just a clarification of what theta naught is with respect to the linear classifier. Oh great okay yeah so let me just go back to that slide really briefly and then we'll come back here to talk about how good is a classifier.
Okay so something that might even help us here a little bit, is, yeah, I'll stop here for a moment. So remember the way that we set up this linear classifier was that we said we had a particular line, and that line is defined as the projection being equal to a particular value. So this projection was theta transpose x divided by the size of theta, and it's equal to a particular value, let's just call that value -b. It could be any value, we're just talking about being interested in a particular value, so that defines a line. That tells us a line, and now the classifier is, okay, well if I go in the direction of theta, we'll label those plus, and if I go in the direction opposite theta, I'll label those minus. Now in order to talk about that line, we could say this is the projection equal to -b, but a completely equivalent way to write that is theta transpose x plus b times the size of theta = 0, and a completely equivalent way to write that is theta transpose x + theta naught = 0 if I define theta naught to be b times the size of theta. And so what is theta naught? Here, one way to think about theta naught is that it is b, the distance from the origin to my line measured in the direction opposite theta, times the length of the vector theta. Another way you can think about it, and you don't even have to go through any of this, is to just say: I have an equation defined by theta transpose x + theta naught = 0. That is just a well-defined equation, and you don't need to have understood anything I said about projections or anything I said about b or anything like that to make sense of that equation. That's just an equation, and it's an equation that you can check certain values of x will satisfy, and if you collect all the values of x that satisfy it, they'll define a line, and in that sense, theta naught is just part of the definition of our line.
Together theta and theta naught define that line. Something that I think you'll be doing probably in one of the problems coming up, but you could even do right now for yourself, is just try changing the values of theta and theta naught and plotting this and seeing what you get, and I think you'll see how theta naught affects this definition of a line, and therefore the definition of the classifier. Great.
Okay, let's just go back to what makes a good classifier. Okay, so we want it to predict well in future data but we're going to start by asking how good is it at a single point in order to talk about multiple points, and so, in particular, let's imagine I have this new point that comes along and I think you have a sense again, an intuition by looking at this plot, of what would be a good label for this point and so let's again, let's make this precise. What does it mean to label a point well? Well in order to do that we're going to have to introduce a function that we call the loss, so L is the name of the loss function and it takes two arguments: those arguments are g which is our guess, and a which is the actual value, and so somehow we're going to be exploring a lot of different losses in this class. We're going to talk about some right now for classification but you can have them for other machine learning problems. You can have even different ones than the ones I'm going to mention right here, but somehow the whole idea of a loss is always going to be that our guess should be close to the actual value. You know, somehow we want to guess something that is like the actual value and so how can we express that?
By the way, if you have ever experienced, perhaps in another class or in your other work, the notion of a utility, a loss could be thought of as a negative utility. Somehow a larger loss is bad, whereas a larger utility is good. Okay, so here's an example of a loss, sort of the most basic loss you can have for classification: it's a zero/one loss. The idea of a zero/one loss is that, if my guess was right on, if it was perfect, if it agreed with the actual value, what actually happened, then I don't lose anything. That's like the best possible thing that could happen, I haven't lost anything. But if it was not right on, if it was not equal, then I incur a loss of one, and any positive loss is bad. It's sort of like how sad am I that I got this wrong, and so here we're saying that we are sad to a level of one.
Okay, so a problem with this that you can almost immediately see, especially in the example we've been talking about, is that it's symmetric. So for instance, let's think about this newborn case. Suppose that I have a newborn who comes in and they are not having seizures and they never have seizures and I diagnose them as having seizures. Well, what's going to happen is a doctor is going to come in, we're going to spend some time with that doctor, but ultimately the newborn's going to be okay, because they don't even have seizures. Alternatively, here's another way I could get my prediction wrong: a newborn could have seizures but I diagnose them as never having seizures and so we never have a doctor come in, nobody sees the newborn, the newborn has seizures, and then they have all kinds of really bad medical outcomes that could have been prevented if we had correctly diagnosed them as having seizures, and so this seems just much worse than the first scenario. There's a real difference in this case between false positives and false negatives and so sometimes we want our loss to express that, so here's an example of an asymmetric loss and what we mean by asymmetric here is that guessing 1 when the actual is -1 is very different in terms of, you know, how much it matters to us versus guessing -1 when the outcome, the actual, is 1 and so here I might say, oh well if I guess that the newborn was going to have a seizure, but the newborn didn't, I incur some loss because, you know, that was some resources that we spent that we didn't have to. You know, we could have saved money, we could have saved time, that doctor could have done something else, and so there is a loss, but if I guessed that the newborn was not going to have a seizure, -1, and then the newborn did, that's a much bigger loss because we've lost the ability to really medically help this newborn. 
Maybe they have some bad medical outcome and so we want to say that that is a much worse outcome and therefore the loss is much bigger. Now in practice, it's going to be a difficult and important question to say: well exactly how do you balance these losses? You know, how do they relate to each other? And that's something that you're really going to have to think about in any particular application, but I just want to open up this possibility of having this asymmetry, because it can be really important in a lot of cases, including this one, the one we've been talking about.
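The two losses described above can be sketched as Python functions of a guess g and an actual value a. The function names and, in particular, the costs 1 and 100 in the asymmetric loss are assumptions picked for illustration; as noted above, how to balance these costs has to be decided for each application.

```python
def zero_one_loss(guess, actual):
    """Zero/one loss: 0 if the guess matches the actual label, 1 otherwise."""
    return 0 if guess == actual else 1

def asymmetric_loss(guess, actual, false_positive_cost=1, false_negative_cost=100):
    """Asymmetric loss with hypothetical costs.

    Guessing +1 (seizure) when the actual label is -1 costs a little;
    guessing -1 (no seizure) when the actual label is +1 costs much more.
    """
    if guess == actual:
        return 0
    if guess == 1 and actual == -1:
        return false_positive_cost
    return false_negative_cost

# Under zero/one loss both kinds of mistake cost the same.
symmetric = (zero_one_loss(1, -1), zero_one_loss(-1, 1))

# Under the asymmetric loss a missed seizure is penalized far more heavily.
asymmetric = (asymmetric_loss(1, -1), asymmetric_loss(-1, 1))
```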
Okay so now we said we want to predict well on future data, and so what is future data? Well let's imagine that n’ new points come in, so we had our original n training data points, and now we have our n’ new points, and so now we could ask, what's our loss over all of those new points? We might call that the test error, to distinguish it from the error on our training data. So what are we defining here? Well what's happening is we're defining our error, let's call it E, and it's a function of h, so our loss is obviously going to depend on however we classify things, and h is what tells us how to do that. We're doing a sum over all of the new data, so we imagine that the new data is indexed from n + 1 to n + n’ because there are these n’ new data points, and let's just start the indexing after the n training points. Okay and now what we're going to do is, for each of those, we're going to take the loss and we're going to compare h(x), so that is what we are guessing at x, h(x) is exactly our guess, with the actual value, which is the actual observed y, and then we might average this all up to say: what's the average error? So that's the 1 over n’ in the beginning. Okay so this is sort of exactly what we want. We want to be able to predict well on the future data, and the whole problem with this is that we do not have access to the future data, so this is a total fantasy that we can never actually compute in real life.
So yes, I would love to be able to say, ah yes, in the next 500 newborns that come in, I'm going to get it right, you know, 99% of the time, and in order to even make that statement I would have to know whether those newborns were going to have a seizure or not, in which case I should have just used that information. And so being able to calculate something like this for actual future data would require knowing that future data, which we don't have, and so we're gonna spend a lot of time thinking about, you know, how can we actually think about classifier quality, and how do we, you know, deal with this. But for the moment, here's another idea, something that we can calculate, which we might think of as some kind of proxy: let's look at the loss on the data that we do have. We'll see that there are pluses and minuses to thinking about this kind of thing, but it will be a useful thing to think about, and it's basically the same idea: we're going to take our training data, our n data points, and we're going to add up the losses across those data points between our guess and the actual, which you can see in the case of linear classification don't always have to be the same. We saw some linear classifiers earlier that would have misclassified some points. And then we put all of this together and we'll call it our training error E_n.
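A minimal sketch of the training error follows, assuming E_n(h) is the average loss over the n training pairs, mirroring the 1 over n’ average used for the test error. The data set, the classifier parameters, and the function names are all made up for illustration.

```python
def zero_one_loss(guess, actual):
    """0 if the guess matches the actual label, 1 otherwise."""
    return 0 if guess == actual else 1

def training_error(h, data, loss=zero_one_loss):
    """E_n(h): average of loss(h(x), y) over the pairs (x, y) in the data."""
    return sum(loss(h(x), y) for x, y in data) / len(data)

def h(x):
    """A hypothetical linear classifier with theta = (1, 1), theta_0 = -1."""
    score = 1.0 * x[0] + 1.0 * x[1] - 1.0
    return 1 if score > 0 else -1

# Hypothetical training data D_n: pairs of (feature vector, label).
D_n = [
    ((1.0, 2.0), 1),
    ((2.0, 3.0), 1),
    ((-1.0, -1.0), -1),
    ((0.0, 0.5), -1),
    ((0.0, 2.0), -1),   # h guesses +1 here, so this point is misclassified
]

E_n = training_error(h, D_n)   # one mistake out of five points
```

The same function computes a test error if it is handed held-out pairs instead of the training pairs; the difficulty discussed above is only that truly future labels are not available.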
Okay so now what we're going to do is we're going to say that once we have this notion of error, this E_n, we can decide between two classifiers. We can say “I prefer classifier h to classifier h tilde” (so let's just say these are two classifiers, h and h tilde) if the error is lower: if this notion of, sort of, an average loss is lower for h, then I'm going to choose h. Okay so at this point, I have a way to decide between two classifiers, but that's not what I want to do, right? I want to have my data come in and choose a classifier. In fact I'd like to choose the best classifier in some sense. Now that I have a notion of what it means to be a good classifier, what I'd love to do is just say, “hey, here's my class of classifiers, let's just pick the best one, let's pick the one with the lowest error”, and that just turns out to be very computationally difficult, and so that's why we're going to have to think about other things, other methods to deal with this, because if we could just do that we might just do that.
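Comparing two classifiers by training error can be sketched as below. The helper `prefer`, the two classifiers, and the data are hypothetical; ties are broken in favor of the first argument, which is an arbitrary choice made only for this example.

```python
def training_error(h, data):
    """Average zero/one loss of classifier h over (x, y) pairs."""
    return sum(0 if h(x) == y else 1 for x, y in data) / len(data)

def h(x):
    """Hypothetical linear classifier: theta = (1, 1), theta_0 = -1."""
    return 1 if (x[0] + x[1] - 1.0) > 0 else -1

def h_tilde(x):
    """The constant classifier that labels everything +1."""
    return 1

def prefer(h_first, h_second, data):
    """Return whichever classifier has the lower training error on data."""
    if training_error(h_first, data) <= training_error(h_second, data):
        return h_first
    return h_second

# Hypothetical training data: two positive and two negative points.
D_n = [((1.0, 2.0), 1), ((2.0, 3.0), 1), ((-1.0, -1.0), -1), ((0.0, 0.5), -1)]

error_h = training_error(h, D_n)
error_h_tilde = training_error(h_tilde, D_n)
chosen = prefer(h, h_tilde, D_n)
```

This only decides between two given classifiers; it says nothing yet about how to search the whole hypothesis class.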
Okay so at this point, we have a notion of what makes a good classifier, and now let's start thinking about how we can learn a good classifier. If I have some data, how can I choose a classifier that is good?
Okay, so imagine that I have data and I have my hypothesis class and I want to choose my good classifier, so this is where we are at this point. We know what data looks like, we know what a hypothesis class looks like, like maybe I chose all these linear classifiers, and I want to choose a good classifier. Now first of all let's just remember what a classifier is. A classifier is a function that takes in the values of x and gives out the values of y, and so this is an example of a classifier, and it's always useful to think: what does it do when it acts? So if I take this classifier and I want to put an input in, that input is going to look like a value of x, and then what I'm going to get out is a label for that input.
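To make "a classifier is a function from x to y" concrete, here is a minimal sketch of one linear classifier. The parameter names theta and theta_0, their values, and the {+1, -1} label convention are assumptions for illustration.

```python
from typing import Sequence


def linear_classifier(x: Sequence[float], theta: Sequence[float], theta_0: float) -> int:
    """Return +1 if theta . x + theta_0 > 0, else -1 (assumed label convention)."""
    score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
    return +1 if score > 0 else -1


def h(x: Sequence[float]) -> int:
    """One particular classifier: the parameters are fixed, only x varies."""
    return linear_classifier(x, theta=(1.0, -1.0), theta_0=0.5)


# Putting an input x in gives a label out; no data set is involved.
print(h((2.0, 1.0)))  # 1
print(h((0.0, 3.0)))  # -1
```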
Now something we're going to start talking about now is a different idea called a learning algorithm. So the learning algorithm is going to take in a whole data set. Now you'll notice that for a classifier, we didn't need a data set at all, it's just a function over all of the possible x values. So in order to define that, there's absolutely no reference to data, but for a learning algorithm, we're going to take in a data set, D_n, and we're going to spit out a classifier, and we hope it's a good classifier, and so for instance, what might that look like? Well maybe, this is my data. I put that as an input to my learning algorithm and I'm going to get out a classifier and maybe this is the classifier I get, and if I got out this classifier, I might be pretty happy; it looks like a pretty good classifier.
Okay here's another data set. So since I have a new data set, I can apply my learning algorithm again and I can get out a new classifier and you'll notice that it doesn't have to be the same classifier, just like if I apply a classifier to different x values, I don't have to get out the same label y. Here if I apply a learning algorithm to different data sets, I don't have to get out the same classifier h, these are both functions but they're functions on very different spaces, they apply to very different things.
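The difference in "types" can be sketched in code: a classifier maps x to a label, while a learning algorithm maps a whole data set to a classifier. The majority-label learner below is a hypothetical and deliberately crude stand-in, not an algorithm from the lecture; it is only there to show that different data sets can yield different classifiers.

```python
from collections import Counter
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Classifier = Callable[[Point], int]
Dataset = List[Tuple[Point, int]]


def majority_label_learner(data: Dataset) -> Classifier:
    """A crude learning algorithm: data set in, classifier out."""
    majority = Counter(y for _, y in data).most_common(1)[0][0]

    def h(x: Point) -> int:
        # The returned classifier ignores x and always guesses the majority label.
        return majority

    return h


# Two hypothetical data sets over the same x values but with different labels.
data_a: Dataset = [((0.0, 1.0), +1), ((1.0, 1.0), +1), ((2.0, 0.0), -1)]
data_b: Dataset = [((0.0, 1.0), -1), ((1.0, 1.0), -1), ((2.0, 0.0), +1)]

h_a = majority_label_learner(data_a)
h_b = majority_label_learner(data_b)

print(h_a((5.0, 5.0)))  # 1
print(h_b((5.0, 5.0)))  # -1
```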
Okay so just as we've seen examples of h, let's now look at an example of a learning algorithm. Now I want to say we're talking about values, possible things for h, we're talking about examples of h, we’re talking about examples of learning algorithms, just as we saw some examples of h that were not good, we will see some examples of learning algorithms that are not necessarily good learning algorithms, but they're still learning algorithms, and so let's take a look at a potential learning algorithm.
So here's our example learning algorithm. Now before we get into the learning algorithm itself, let's imagine that my friend did some work for me and came up with a bunch of classifiers, so in particular, my friend went and, a trillion different times, generated a classifier for me. They randomly sampled some lines in this x space, I don't know exactly how they did it: they have some distribution over lines, they got all these different lines, you know, maybe they'll tell me if I ask, but they came up with a whole bunch of different lines, in fact a trillion different lines, and they made them all into classifiers, so now they gave me a trillion different classifiers and they're all different and they're all really interesting.
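One way the friend's list might be produced is sketched below. The lecture leaves the distribution over lines unspecified, so the standard normal draws for the line parameters are purely an assumption, as are the names `make_linear_classifier` and `sample_classifiers`; the list is also kept far shorter than a trillion.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Classifier = Callable[[Point], int]


def make_linear_classifier(theta: Tuple[float, float], theta_0: float) -> Classifier:
    """Turn one line (theta, theta_0) into a classifier with labels in {+1, -1}."""
    def h(x: Point) -> int:
        return +1 if theta[0] * x[0] + theta[1] * x[1] + theta_0 > 0 else -1
    return h


def sample_classifiers(count: int, seed: int = 0) -> List[Classifier]:
    """Sample `count` random lines (assumed Gaussian parameters) and wrap each as a classifier."""
    rng = random.Random(seed)  # seeded so the list is reproducible
    return [
        make_linear_classifier((rng.gauss(0, 1), rng.gauss(0, 1)), rng.gauss(0, 1))
        for _ in range(count)
    ]


classifiers = sample_classifiers(1000)
print(len(classifiers))            # 1000
print(classifiers[0]((1.0, 2.0)))  # either 1 or -1, depending on the sampled line
```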
Okay so there's a question about how good is a classifier. So, to compare between two classifiers, which is better to use: test error or training error? Now I want to emphasize this point: what we really want to do at the end of the day is we want to predict on, let's say the next 500 newborns that come into my hospital. And so the best way to compare is to say “hey the next 500 newborns that come into my hospital, how well did I do at predicting whether they actually had seizures or not?” I mean that's what I really care about at the end of the day: were those newborns given appropriate medical care? So that's the best way to compare. The problem is that I can't do that. It's definitely what I want to do, but it's impossible because I haven't yet seen those newborns and so I'm going to have to think about how do I get around the fact that I don't have access to the future newborns who have not yet come into my hospital. How can I possibly deal with that fact? I don't want to suggest that we have answered that in this lecture because we have not, that's going to be a question that we're dealing with in the future in this class. In fact, next week, we're going to be talking about it a lot more, and so I just want to put in your minds that this error on the future data that we really want to predict on, that we haven't seen, that's what we wish we could do, that's what we really want to evaluate on, and the whole problem is that it's in the future so we can't, so how could we possibly deal with that fact, and so all we're doing for the moment is we're saying that one possible idea, one possible notion of goodness to use, is training error.
I'm not saying it's the best, I'm not saying it's what you should use in absolutely everything that you do when you're thinking about goodness, but it's something that is useful for the moment, in fact we're about to use it in this learning algorithm in just a second, but we're going to be thinking about this question so much more again next week and throughout this course, and so don't think that we have learned it all, that this is the end of the discussion about what is good. Okay great, great question and I hope you keep, you know, wrestling with that question even after the lecture. Okay, so let's go back here, so my friend has generated this really, really long, trillion long list of classifiers and now here's my algorithm, here's my example learning algorithm. In fact, let's call it “example learning algorithm”. Now remember a learning algorithm takes as input a data set, so this is going to take as input a data set, but just as we had a parameter that we could set in h, something that sort of modulates h, that changes h, so we are going to here have a hyper parameter. So a parameter is something that changes our hypothesis h, a hyper parameter is something we change in our learning algorithm that isn't the input, it's like an adjustment to our code, you know, maybe just some value that we'd like to be able to change and so here we're going to choose some hyper parameter, let's call it k, and suppose that I restrict k to be less than a trillion.
Okay, so here is my algorithm, I'm going to explain what this notation says. So first of all, what we're going to do, is for every classifier, from classifier 1, classifier 2, up to classifier k, we're going to calculate the training error for that classifier, that's that E_n(h^(j)). Now what the next part says, this arg min, it says once I've calculated all of these errors, I'm going to find which one is the smallest, the min, and then I’m going to figure out which j, which argument, went into that, so which particular index gives me the classifier with the smallest error, the minimum error and let's call that j*, and then what I'm going to do is I'm going to return the classifier for j*.
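The steps just described can be sketched as follows. The 0-1 training error, the three hand-written classifiers standing in for the friend's list, the toy data, and the choice of the lowest index on a tie are all assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Classifier = Callable[[Point], int]
Dataset = List[Tuple[Point, int]]


def training_error(h: Classifier, data: Dataset) -> float:
    """E_n(h) under an assumed 0-1 loss: the fraction of training points h gets wrong."""
    return sum(1 for x, y in data if h(x) != y) / len(data)


def example_learning_algorithm(
    data: Dataset, k: int, classifiers: Sequence[Classifier]
) -> Classifier:
    """Return h^(j*), where j* = argmin over j in 1..k of E_n(h^(j))."""
    if not 1 <= k <= len(classifiers):
        raise ValueError("k must be between 1 and the number of classifiers")
    # Python indexes from 0, so h^(j) is classifiers[j - 1].
    # min() returns the first minimizer, so ties go to the lowest index j.
    j_star = min(range(1, k + 1), key=lambda j: training_error(classifiers[j - 1], data))
    return classifiers[j_star - 1]


def always_positive(x: Point) -> int:
    return +1


def sign_of_second_feature(x: Point) -> int:
    return +1 if x[1] > 0 else -1


def sign_of_first_feature(x: Point) -> int:
    return +1 if x[0] > 0 else -1


# Hypothetical stand-in for the friend's list: h^(1), h^(2), h^(3).
classifiers = [always_positive, sign_of_second_feature, sign_of_first_feature]
data: Dataset = [((1.0, 2.0), +1), ((2.0, 0.5), +1), ((-1.0, -1.0), -1), ((-0.5, 1.0), -1)]

h_star = example_learning_algorithm(data, k=3, classifiers=classifiers)
print(h_star is sign_of_first_feature)  # True
```

On this toy data the three training errors are 0.5, 0.25, and 0.0, so with k = 3 the arg min picks j* = 3.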
Okay so, this is a learning algorithm right: it takes in a data set and it returns a classifier. Let's say that I only had k = 1. Can anybody tell me in the zoom chat this time what classifier will be returned?
Awesome, great. So I'm seeing a bunch of people saying h^(1). h^(1) is what's going to be returned, so what's going to happen here is if I have k=1, then this is going to say, okay let's calculate the error for all the classifiers from 1 to 1. Okay there's only one classifier, so I'm going to calculate just the error for that, I'm going to say what minimizes all of those errors, well there's only one choice, it's the first classifier, it's the only one I'm looking at, and then I’m going to return that first classifier. Now this only gets interesting when I have two classifiers. So here's a question: let's say I take the training error of the classifier when I run this algorithm with k=1 and now I run the algorithm with k=2, how does that training error compare?
Okay great, so let's talk through this. I'm seeing some great answers here, so we just said that if I run this algorithm with k=1, I just get out the first classifier, I get out h^(1). If I run this algorithm with k=2, what happens is I compare the training error of h^(1) to h^(2), whichever one is lower, I choose that, and so by definition, that has to have at least as low error as if I ran this algorithm with k=1, and so it doesn't have to have strictly lower error, maybe they have the same error, but it can't be any greater when I run it with k=2. Okay so at this point, we've seen an example of a classifier, we've seen an example of a hypothesis class, a whole class of classifiers, the linear classifiers, we've seen an example of a learning algorithm, we've seen an example of data, and how we can take this learning algorithm and turn data into a classifier that has some kind of error, and in fact something you might check for yourself is that as k increases, you're going to see progressively lower training error. You might want to just double check that that's true, see if you can, you know, figure that out for yourself, and so now what we want to think about is can we do better? You know, we've talked about ways to evaluate and there was a great question about is this how we should evaluate. Well let's think about that. Let's engage with that question. Here's a learning algorithm. Is this the learning algorithm we should use? You know, probably it will be no surprise to you that it's not, because we still have many lectures to have together and so surely we will do better in those remaining lectures and so we're going to be talking about these things. How can we have better learning algorithms? How can we accomplish our goals? What exactly are our goals? What is the best way to encapsulate those and make them rigorous? Okay we're at the end of our lecture time today, we will see you all for labs this week.
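The check suggested above, that the training error of the returned classifier never goes up as k grows, can be tried numerically. This is a sketch under assumptions: the random lines, the synthetic data, the 0-1 loss, and the helper names are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Classifier = Callable[[Point], int]
Dataset = List[Tuple[Point, int]]


def training_error(h: Classifier, data: Dataset) -> float:
    """E_n(h) under an assumed 0-1 loss."""
    return sum(1 for x, y in data if h(x) != y) / len(data)


def make_linear_classifier(theta: Tuple[float, float], theta_0: float) -> Classifier:
    def h(x: Point) -> int:
        return +1 if theta[0] * x[0] + theta[1] * x[1] + theta_0 > 0 else -1
    return h


def learned_error(data: Dataset, k: int, classifiers: List[Classifier]) -> float:
    """Training error of the classifier the example learning algorithm returns for this k."""
    return min(training_error(h, data) for h in classifiers[:k])


rng = random.Random(0)

# Hypothetical list of 50 randomly sampled linear classifiers.
classifiers = [
    make_linear_classifier((rng.gauss(0, 1), rng.gauss(0, 1)), rng.gauss(0, 1))
    for _ in range(50)
]

# Hypothetical data set: 40 random points labeled by the line x1 + x2 = 0.
data: Dataset = []
for _ in range(40):
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    data.append((x, +1 if x[0] + x[1] > 0 else -1))

errors = [learned_error(data, k, classifiers) for k in range(1, len(classifiers) + 1)]

# Each larger k searches a superset of classifiers, so the error can only stay or drop.
print(all(later <= earlier for earlier, later in zip(errors, errors[1:])))  # True
```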
I think that'll be exciting and fun, we'll probably mostly be ironing out bugs this time, but I think ultimately these are gonna be a really great time and I will see you all for lecture again, same time next Tuesday, have a good one.
Bye.