Written by Bethany Barnwell
WHAT IS CLASSIFICATION
Classification is the process of predicting the class (output) for a set of inputs using a model trained on an existing, related dataset of inputs (x) and outputs (y). To create this model, the existing dataset is split into a training dataset and a testing dataset. A common split puts 80% of the original dataset in the training dataset and the remaining 20% in the testing dataset. An algorithm, chosen from many options according to the makeup of the dataset, fits the model to the training dataset, and in doing so the model learns how the input values relate to the output values. Once training is complete, the model is evaluated on the testing dataset: it predicts an output for each test input, and these predictions are compared with the known outputs of the testing dataset. This comparison indicates how accurate the model is likely to be when it predicts on other datasets.
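The split, train, and test workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dataset, the function names (`train_test_split`, `fit_threshold`, `accuracy`), and the simple threshold "model" are all hypothetical and are not taken from any particular library.

```python
import random

def train_test_split(inputs, outputs, test_fraction=0.2, seed=0):
    """Shuffle paired (x, y) examples and split them into training and testing sets."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [inputs[i] for i in train_idx],
        [outputs[i] for i in train_idx],
        [inputs[i] for i in test_idx],
        [outputs[i] for i in test_idx],
    )

def fit_threshold(x_train, y_train):
    """Hypothetical 'model': pick the cut-off that gets the most training labels right."""
    def n_correct(threshold):
        return sum((v >= threshold) == label for v, label in zip(x_train, y_train))
    return max(sorted(set(x_train)), key=n_correct)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Assumed toy dataset: ten inputs, labelled True when the input is 5 or more.
x = list(range(10))
y = [value >= 5 for value in x]

x_train, y_train, x_test, y_test = train_test_split(x, y)  # 80% / 20%
threshold = fit_threshold(x_train, y_train)                # train on the 80%
test_predictions = [value >= threshold for value in x_test]
test_accuracy = accuracy(y_test, test_predictions)         # compare on the 20%
```

The important point is that `fit_threshold` never sees the testing examples, so `test_accuracy` reflects how the model handles inputs it was not trained on.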
ACCURACY OF A CLASSIFICATION MODEL
Judging a classification model goes beyond accuracy, which is the number of correct predictions divided by the total number of examples. Other classification metrics give more information on how well a model has been trained. They build on four basic counts, illustrated in the image below: true positives (positive examples predicted as positive), false positives (negative examples predicted as positive), true negatives (negative examples predicted as negative), and false negatives (positive examples predicted as negative).
These counts can be used to find other metrics such as precision and recall. Precision is true positives / (true positives + false positives); in other words, when a positive was predicted, how often was the algorithm correct? Recall, on the other hand, is true positives / (true positives + false negatives); in other words, when the actual value is positive, how often did the algorithm predict a positive? The same four counts make up the confusion matrix (shown below), a 2 x 2 matrix that holds all of them in one place.
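As a rough sketch of the definitions above, the four counts, precision, and recall can be computed directly from lists of actual and predicted labels. The example labels and the function names are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention rather than a rule from the text.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and not actual:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def precision(tp, fp):
    """When a positive was predicted, how often was it correct?"""
    return tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention for 0 / 0

def recall(tp, fn):
    """When the actual value was positive, how often was a positive predicted?"""
    return tp / (tp + fn) if (tp + fn) else 0.0  # assumed convention for 0 / 0

# Hypothetical labels for eight examples.
y_true = [True, True, True, False, False, False, False, True]
y_pred = [True, False, True, True, False, False, False, True]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

# One common layout for the 2 x 2 confusion matrix:
# rows are actual classes (negative, positive), columns are predicted classes.
confusion_matrix = [[tn, fp],
                    [fn, tp]]

model_precision = precision(tp, fp)
model_recall = recall(tp, fn)
```

For these labels the counts are 3 true positives, 1 false positive, 3 true negatives, and 1 false negative, so precision and recall both come out to 0.75.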
But how is all of this information useful for the programmer? One way is by evaluating the F1 score, which combines precision and recall into a single number and so summarizes the precision-recall tradeoff.
The F1 score is the harmonic mean of precision and recall: 2 x (precision x recall) / (precision + recall). This score is low when either precision or recall is low. However, depending on the application, a higher precision or a higher recall may be preferred. When neither is favored, the aim is usually to have precision and recall equal or very close to equal.
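A small sketch can make the behavior of the F1 score easier to see. The function name and the precision and recall values below are invented for illustration, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both values are zero
    return 2 * (precision * recall) / (precision + recall)

# Two hypothetical models whose precision and recall add up to the same total.
balanced_f1 = f1_score(0.5, 0.5)  # precision and recall are equal
lopsided_f1 = f1_score(0.9, 0.1)  # high precision, very low recall
```

Here `balanced_f1` is 0.5 while `lopsided_f1` is roughly 0.18, even though both pairs sum to 1.0. A low value on either side pulls the score down, which is why balanced precision and recall score better.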
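A quick overview of the terms we used and their representations:
Accuracy = number correctly predicted / total number of examples
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
F1 score = 2 x (precision x recall) / (precision + recall)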
One well-known type of classification is binary classification, where there are only two possible output values, such as 0 and 1 or true and false, no matter how many inputs there are. Disease datasets are an example: the y (output) variable can only be one of two classes, “does have the disease” or “does not have the disease”. When coding algorithms for binary classification, these two classes can be represented as true and false.
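As a minimal sketch of that last point, the two disease classes can be mapped to true and false (or 1 and 0) before training. The constant `POSITIVE_LABEL`, the function `encode_labels`, and the sample labels are hypothetical names chosen for this example.

```python
POSITIVE_LABEL = "does have the disease"

def encode_labels(raw_labels, positive_label=POSITIVE_LABEL):
    """Map each text label to True (positive class) or False (negative class)."""
    return [label == positive_label for label in raw_labels]

# Hypothetical output column from a disease dataset.
raw_labels = [
    "does have the disease",
    "does not have the disease",
    "does have the disease",
]

encoded = encode_labels(raw_labels)        # booleans
as_integers = [int(flag) for flag in encoded]  # the same classes as 1 and 0
```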
There are generally two types of classification algorithms: discriminative and generative. Discriminative classification revolves around predicting which class a newly observed input belongs to. After training on a known dataset of inputs and outputs, the algorithm learns the boundary between the classes directly and predicts the class (output or y value) for each input (or x value). Some examples are logistic regression, neural networks, and decision trees. Generative classification focuses more on how the data is generated. A generative classification algorithm models how the inputs of each class are distributed, then assigns a new input to the class most likely to have produced it. Some examples include Gaussian discriminant analysis and quadratic discriminant analysis.
There are many potential applications for these classifiers. They can be used for image classification, disease prediction, email organization (SPAM or not SPAM), and more. These types of supervised learning can benefit a wide range of fields, from business to disease treatment and everything in between.
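To make the generative approach described above more concrete, here is a rough one-dimensional sketch in the spirit of Gaussian discriminant analysis. It assumes each class's inputs follow a Gaussian distribution with its own mean and variance, and it picks the class with the highest posterior score. The function names and the toy data are hypothetical, and real implementations handle multi-dimensional inputs and more edge cases.

```python
import math
from collections import defaultdict

def fit_gaussian_classes(xs, ys):
    """Estimate a prior, mean, and variance for each class from 1-D inputs."""
    grouped = defaultdict(list)
    for x, y in zip(xs, ys):
        grouped[y].append(x)
    params = {}
    for label, values in grouped.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        params[label] = {
            "prior": len(values) / len(xs),
            "mean": mean,
            "variance": max(variance, 1e-9),  # guard against a zero variance
        }
    return params

def log_posterior(x, p):
    """Unnormalized log p(y | x) = log p(y) + log p(x | y), with Gaussian p(x | y)."""
    log_likelihood = (
        -0.5 * math.log(2 * math.pi * p["variance"])
        - (x - p["mean"]) ** 2 / (2 * p["variance"])
    )
    return math.log(p["prior"]) + log_likelihood

def predict(x, params):
    """Assign x to the class most likely to have generated it."""
    return max(params, key=lambda label: log_posterior(x, params[label]))

# Assumed toy data: one measurement per example, True means the positive class.
xs = [1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9, 3.1]
ys = [False, False, False, False, True, True, True, True]

params = fit_gaussian_classes(xs, ys)
low_prediction = predict(1.0, params)   # near the mean of the negative class
high_prediction = predict(3.0, params)  # near the mean of the positive class
```

A discriminative method such as logistic regression would skip the per-class distributions and fit the boundary between the two classes directly.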