In a classification task, the learning algorithm assigns the number 1 if an instance belongs to a class (True), and 0 otherwise. When predicting, the output will be a number between 0 and 1, which can be interpreted as the probability of the instance belonging to that class.
A common algorithm for this task is Logistic Regression. The hypothesis function of the algorithm is
\[\hat{y} = \frac{1}{1 + e^{-Z}}\]
known as the sigmoid or logistic function, or the S-curve. $Z$ can be any function of the features and their corresponding weights, for example, $Z = w_1X_1 + w_2X_2 + w_3X_3$. In the learning algorithm, the weights $w_1, w_2, w_3$ are determined by optimizing a cost function, for example with Gradient Descent.
The output of the sigmoid function is always between 0 and 1, as the chart below shows for $Y = \frac{1}{1 + e^{-Z}}$.
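To see this numerically, here is a minimal sketch (assuming NumPy is available; the sample values of $Z$ are just illustrative) that evaluates the sigmoid at a few points:

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

# illustrative inputs, from very negative to very positive
for z in [-10, -1, 0, 1, 10]:
    print(z, sigmoid(z))
# outputs approach 0 for large negative z, equal 0.5 at z = 0,
# and approach 1 for large positive z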
I shall apply Logistic Regression to the Iris data set. We can load the dataset in Scikit-Learn using this code:
from sklearn import datasets
iris = datasets.load_iris()

The data is a dictionary-like object, so we can see the keys using iris.keys(). There are three species of the flower with 50 instances each, so there are 150 instances of data in total.
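For instance (a quick optional check; the exact set of keys may vary with the Scikit-Learn version):

print(iris.keys())
# typically includes 'data', 'target', 'target_names',
# 'feature_names' and 'DESCR', depending on the version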
Next, we extract the feature and label data, X and y respectively.
X = iris['data'][:, 3:]                # petal width, the fourth feature
y = (iris['target'] == 2).astype(int)  # 1 if Iris-Virginica, 0 otherwise

As shown, only the petal width will be used to identify whether the species is Iris-Virginica in this example.
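As a sanity check (a small, optional snippet), we can confirm the shapes and the class balance; since Iris-Virginica accounts for 50 of the 150 instances, y should sum to 50:

print(X.shape)   # (150, 1): 150 instances, one feature (petal width)
print(y.sum())   # 50: the number of Iris-Virginica instances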
It is always a good idea to split the data into training and test sets randomly. I will set aside 30% of the data as test data (that is, 45 instances). This can be done easily using Scikit-Learn:
from sklearn.model_selection import train_test_split

# split X and y together so the features and labels stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

Now, we can apply Logistic Regression on the data!
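Before fitting, a quick optional check confirms the 70/30 split:

print(X_train.shape, X_test.shape)   # (105, 1) (45, 1)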
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

The model's attributes, such as the intercept and coefficient, can be accessed via model.intercept_ and model.coef_ respectively. Note the trailing underscore '_'!
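The values below were presumably printed with something like the following (my assumption about the exact formatting):

print('Model Coefficient =', model.coef_)
print('Model Intercept =', model.intercept_)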
Model Coefficient = [[ 2.17900148]]
Model Intercept = [-3.67603174]

Hence the model is:
\[\hat{y} = \frac{1}{1+e^{-(2.179X - 3.676)}}\]
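We can verify this equation against the fitted model directly. The sketch below (assuming NumPy; the petal width of 1.7 cm is just an illustrative input) compares the hand-computed sigmoid with model.predict_proba. Note also that the decision boundary, where $\hat{y} = 0.5$, falls at $X = 3.676 / 2.179 \approx 1.69$ cm.

import numpy as np

x = np.array([[1.7]])                 # an illustrative petal width in cm
z = model.coef_[0][0] * x + model.intercept_[0]
manual = 1 / (1 + np.exp(-z))         # sigmoid applied by hand
print(manual)                         # probability of Iris-Virginica
print(model.predict_proba(x)[:, 1])   # should match the manual value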
We can use the model to make predictions.
pred = model.predict(X_test)   # predicted classes (0 or 1) for the test set
To determine how the model fares, we can use a Confusion Matrix to see the number of correct classifications.
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, pred))

The confusion matrix comes in the form:
[[TN, FP]
 [FN, TP]]
The result will be:
[[28  0]
 [ 2 15]]
Hence,
TP = 15
TN = 28
FP = 0
FN = 2
The rows of the matrix are the true classes; the columns are the predicted classes. Hence there are 15 + 28 instances where the model classified correctly (15 positives predicted as positive, also known as True Positives; 28 negatives predicted as negative, or True Negatives).
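The four counts can also be unpacked programmatically; ravel() flattens the matrix row by row, matching the layout above:

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(tn, fp, fn, tp)   # 28 0 2 15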
From the confusion matrix, the Precision, the proportion of positives predicted by the model that are truly positive, can be determined.
\[precision = \frac{TP}{TP + FP}\]
Another useful metric is the proportion of actual positive instances that are correctly predicted by the model. This is known as the Recall (also called the True Positive Rate, or sensitivity).
\[recall = \frac{TP}{TP + FN}\]
The precision and recall can be computed using the following code:
from sklearn.metrics import precision_score, recall_score
print('Precision Score = {:0.3f}'.format(precision_score(y_test, pred)))
print('Recall Score = {:0.3f}'.format(recall_score(y_test, pred)))
The result:
Precision Score = 1.000 (15 / (15 + 0))
Recall Score = 0.882 (15 / (15 + 2))
If we were to calculate the accuracy, the proportion of instances predicted correctly, it would be about 95.6% ((15 + 28) / 45). Accuracy alone can be misleading in classification tasks, especially when the classes are imbalanced, which is why precision and recall are usually reported as well.
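For completeness, the accuracy can be computed in the same style as the other metrics (a small addition using Scikit-Learn's accuracy_score):

from sklearn.metrics import accuracy_score
print('Accuracy Score = {:0.3f}'.format(accuracy_score(y_test, pred)))   # 0.956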
Here's a graphical representation of what was carried out.
This marks the end of the Logistic Regression example. Logistic regression is one of the most important learning algorithms. It is also frequently used as a building block for understanding Artificial Neural Networks, or Neural Nets for short, which I hope to touch on soon.
I hope this has been useful in providing a bigger picture of the Logistic Regression learning algorithm. I have deliberately left out the mathematics behind the learning algorithm because there are many resources that do a better job than I can. By the way, the reference quoted below is awesome!
~ZF
References:
[1] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow, O'Reilly Media.