- Hello. In this video lecture, we will cover the fundamental ideas behind classification. This is a supervised learning method, so we are interested in mapping a set of inputs into outputs, but assuming that these outputs are given by a set of predefined classes. Let's start.

We are now dealing with a supervised learning process, where the goal is to infer a function, or predictive model, from labeled training data. The training data consist of a set of training samples, where each sample is a pair consisting of an input object, typically a vector, and a desired or target output value. In the context of classification, this output is given by a discrete set of quantities that represents classes of objects. In this lecture, we will see several predictive models that could be trained to generate a desired output, as the figure suggests. In any of these models, there is an optimization procedure that incrementally reduces the training error. Once the predictive model is trained, it is tested with data that it has not seen before. Ideally, one would like to see the same level of accuracy with the testing samples as well. However, we would generally be content with the simplest predictive model that can generate the right output for a new set of samples within a certain level of approximation error.

Now that we have an idea of how a predictive model could be created, it is important to establish a distinction between predictive and causal models. In the case of predictive models, we use the data on some objects to predict values for another object. So it is important to observe that if X predicts Y, it does not mean that X causes Y. Accurate prediction depends heavily on measuring the right set of variables. Although there are better and worse prediction models, more data and a simple model work really well. Prediction is very hard, especially about the future. In the case of causal models, we are mainly concerned with finding what happens to one variable when we make another variable change. So, usually, randomized studies are required to identify causation. There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions. Think, for example, about when we compute permeability as a function of porosity. Causal relationships are usually identified as average effects, but they may not apply to every individual. Causal models are usually the gold standard for data analysis, since they lead to the development of physical principles.

The main objective of classification is to define a boundary that properly separates the output domain into a given number of classes. Sometimes the separation can be made with a few intersecting hyperplanes coming from a linear classifier or discrimination model. However, in general, an accurate class separation model will imply the generation of a complex boundary. The best theoretically possible boundary is known as the Bayes decision boundary, and it requires knowing or characterizing perfectly well the distribution of parameters in each class. In the animation, we can see how the decision boundary evolves through iterations of a nonlinear classification method, such as the support vector machine, SVM. The nonlinear boundary is unavoidable here, given how the class members are spread all over the domain.
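Although it is not part of the slides, here is a minimal sketch of what the animation is doing, assuming scikit-learn; the make_moons data set and the kernel parameters are illustrative stand-ins, not the lecture's actual data.

```python
# Minimal sketch: a nonlinear SVM decision boundary on synthetic data.
# Assumes scikit-learn; make_moons stands in for the animation's data.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved classes that no single hyperplane can separate.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# An RBF kernel lets the SVM bend the boundary around the class members.
clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

# Evaluate the decision function on a grid; its zero level set is the
# nonlinear decision boundary shown in the animation.
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
zz = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
print("Training accuracy:", clf.score(X, y))
```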
Now, what makes classification hard? We can mention three major challenges: the number of dimensions, the number of classes, and the nonlinearity associated with the problem due to the distributions of variables in each class, as depicted in the animation.

Here are a few examples of how classification could be used in oil and gas problems. A geologist wants to classify rocks with similar petrophysical properties so that each class is characterized by a set of unique properties; this is known as rock typing. A geoscientist wants to relate the previous rock types with different levels of oil production. Given a set of high, medium, and low production well groups, a reservoir engineer wants to figure out what type of production can be associated with a new well in order to perform some future estimations. Given the trend of several drilling parameters in time, such as RPM, torque, and weight on bit, a group of drilling engineers wants to determine when the drill bit may fail. Of course, there are many more interesting and challenging applications in oil and gas.

There are many classification methods. Here, I am only listing six of them. In this video lecture, I will briefly describe the first three. In the regression lecture, we will have the opportunity to go over the remaining ones.

In linear discriminant analysis, we assume that instances of a class are linearly separable from instances of other classes. The linear discriminant is used frequently due to its simplicity: both the space and time complexities are of order d, where d is the dimension of the problem. The linear model is easy to understand. The final output is a weighted sum of the input attributes xi with corresponding weights bi. The sign of the discriminant function f determines whether an element belongs to one class or the other. We can see this more clearly from the plot separating the classes C one and C two via the decision boundary described by the function f. In many applications, the linear discriminant is quite accurate. We know, for example, that when classes are Gaussian and share the same covariance matrix, the optimal discriminant is linear. The linear discriminant, however, can be used even when these assumptions do not hold. Thus, it is good practice to use the linear discriminant before trying a more complicated model, to make sure that the additional complexity is justified.

It is not hard to see that a linear model may generate probability values outside the range zero to one for a given set of parameters. As we can see in the figure, the logistic model allows us to overcome this problem by mapping all outputs into the zero-to-one range. Thus, class separation can be easily achieved by setting a threshold probability value to separate two classes, in this case C one and C two. It is important to remark that both LDA and logistic regression can be extended to handle more than two classes and also to handle a large number of variables.

As we have seen, linear discriminant analysis and logistic regression can be used to assess the same type of problems. Their functional form is the same, but they differ in the way they estimate their coefficients. Nevertheless, logistic regression tends to be more effective than LDA when we have more than two classes. When the assumption of normality is fulfilled, discriminant analysis makes robust estimations and can perform better than logistic regression when the classes are well separated and share the same covariance. In the absence of normality, logistic regression is the preferred method, as it makes no assumptions about the distributions of the explanatory variables or the variances associated with different classes. Also, logistic regression is more suitable for handling both continuous and categorical variables.
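To see the two methods side by side, here is a small sketch, assuming scikit-learn; the Gaussian blobs and their parameters are illustrative, not the lecture's examples.

```python
# Sketch: LDA and logistic regression fit to the same two-class problem.
# Assumes scikit-learn; the Gaussian blobs are illustrative stand-ins.
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two Gaussian classes with the same spread, where LDA is near optimal.
X, y = make_blobs(n_samples=400, centers=2, cluster_std=2.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
logit = LogisticRegression().fit(X_tr, y_tr)

# Same linear functional form; the coefficients are estimated differently.
print("LDA test accuracy:     ", lda.score(X_te, y_te))
print("Logistic test accuracy:", logit.score(X_te, y_te))
print("LDA weights:", lda.coef_, " Logistic weights:", logit.coef_)

# Logistic regression maps outputs into the zero-to-one range; thresholding
# the probability at 0.5 reproduces the class assignment.
print(logit.predict_proba(X_te[:3]))
```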
The K Nearest Neighbor, or KNN, algorithm is a non-parametric method. This means that it does not make any assumptions about the underlying data distribution. It is a practical method, as most real-world cases do not obey typical theoretical assumptions, such as normality or linear separability. KNN is also a lazy algorithm, in the sense that it makes decisions based on the entire training data set. Thus, the algorithm could potentially involve a high computational cost in terms of processing time and memory; its complexity is proportional to the number of points describing the domain. The number K decides how many neighbors influence the classification. K is usually an odd number, to avoid possible ties in deciding to which class a point should belong.

We illustrate the KNN functionality on a simple case consisting of 10 blue observations and 10 red observations, assuming that K is equal to three; a code sketch of this scenario appears after this discussion. The black cross indicates a particular test observation. The three closest points to the test observation are identified, and it is predicted that the test observation belongs to the most commonly occurring class, in this case red. After repeating the process for each observation point in the domain, we obtain a map described by the curve separating the red and blue regions, as we can clearly see in the figure.

The choice of K plays a major role in the classification. When K is low, the decision boundary tends to show high variance and low bias. On the other hand, when K is high, the decision boundary tends to be rigid, showing an almost consistent linear trend regardless of the number of elements in each class. That is, the boundary has low variance but high bias. In the plots below, we compare the decision boundaries for K equal to one and K equal to seven. With K equal to one, the decision boundary is overly flexible and accommodates more points in the right class. In contrast, with K equal to seven, the decision boundary is less flexible and misplaces more points outside the class.
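Here is that sketch of the K-equal-three scenario, assuming scikit-learn; the 10 plus 10 random points are illustrative, not the ones in the figure.

```python
# Sketch: KNN with K = 3 on a small two-class problem.
# Assumes scikit-learn; the 10 + 10 random points mimic the figure's setup.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (10, 2)),   # 10 "blue" observations
               rng.normal(1.5, 1.0, (10, 2))])  # 10 "red" observations
y = np.array([0] * 10 + [1] * 10)

# K is odd to avoid ties when voting among the neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# A test observation (the "black cross"): the three nearest training
# points vote, and the most commonly occurring class wins.
test_point = np.array([[1.0, 0.5]])
dist, idx = knn.kneighbors(test_point)
print("Nearest neighbors' classes:", y[idx[0]])
print("Predicted class:", knn.predict(test_point)[0])
```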
Now that we have seen three algorithms for classification, it is important to define ways to measure their performance. One effective device is a confusion matrix that contrasts actual values against predicted values. The dimension of this matrix corresponds to the number of classes. Here we are showing a two-by-two matrix associated with a classification problem with two classes. The main diagonal contains the counts of correct decisions. The errors of the classifier are the false positives, that is, negative instances classified as positive, and the false negatives, that is, positive instances classified as negative.

From the confusion matrix, there are many evaluation metrics that can be derived. All of them are fundamentally summaries of the confusion matrix. Here I will list a few of them. Precision: when it predicts yes, how often is it correct? Accuracy: overall, how often is the classifier correct? Specificity: when it is actually no, how often does it predict no? Sensitivity, or true positive rate: when it is actually yes, how often does it predict yes? False positive rate, or fall-out: when it is actually no, how often does it predict yes? Note that the false positive rate is the same as one minus specificity. Informedness is basically the difference between the true positive rate and the false positive rate.

Besides the metrics that we can derive from the confusion matrix, there are also ways to graphically describe the entire space of performance possibilities of classification algorithms. The Receiver Operating Characteristic, or ROC, graph is a two-dimensional plot of a classifier, with false positive rate on the x-axis against true positive rate on the y-axis. As such, the ROC graph depicts the relative trade-off that a classifier makes between benefits, true positives, and costs, false positives. A good classification algorithm will describe a ROC curve converging quickly to a true positive rate equal to one. A bad classifier will tend to describe a ROC curve close to the diagonal; that is, its classification performance is almost as good as flipping a coin. An important summary statistic is the area under a ROC curve, namely the AUC value. As the name implies, this is simply the area under a classifier's curve, expressed as a fraction of the unit square. Its value ranges from zero to one. Though a ROC curve provides more information than its area, the AUC is useful when a simple number is needed to summarize performance. The sketch after these closing remarks ties the confusion matrix, its derived metrics, and the ROC curve together in code.

With this slide, we conclude our lecture on classification. See you next time.
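Here is that closing sketch, a minimal illustration assuming scikit-learn; the data set and classifier are stand-ins, not the lecture's examples.

```python
# Sketch: confusion matrix, derived metrics, and ROC/AUC for a classifier.
# Assumes scikit-learn; the data and model are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)
clf = LogisticRegression().fit(X_tr, y_tr)

# Two-by-two confusion matrix: actual values against predicted values.
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()

precision    = tp / (tp + fp)        # when it predicts yes, how often correct?
accuracy     = (tp + tn) / (tp + tn + fp + fn)
specificity  = tn / (tn + fp)        # when actually no, how often predicts no?
sensitivity  = tp / (tp + fn)        # true positive rate
fall_out     = fp / (fp + tn)        # false positive rate = 1 - specificity
informedness = sensitivity - fall_out
print(f"precision={precision:.2f} accuracy={accuracy:.2f} "
      f"specificity={specificity:.2f} sensitivity={sensitivity:.2f} "
      f"informedness={informedness:.2f}")

# ROC curve and its area, computed from the predicted probabilities.
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```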