As mentioned, machine learning is about deploying algorithms on a computer to apply statistical methods in data analytics. An algorithm is a set of instructions to perform a specific task, so a machine learning algorithm is a set of instructions for the computer/programme to learn patterns from a given set of data.
Types of Machine Learning Algorithms
There are many machine learning algorithms. The common ones are:
- Linear Regression
- Logistic Regression
- Nearest Neighbours
- Support Vector Machines
- Decision Trees (and Random Forests)
- Neural Networks
- Clustering
- Dimensionality Reduction (e.g. Principal Component Analysis, or PCA)
Supervised and Unsupervised Learning Algorithms
The outcome of supervised learning is to find a general pattern in the data that agrees with the given labels. Items 1-6 in the list above are supervised learning algorithms.
The outcome of unsupervised learning is to find and group general characteristics within the data, without reference to labels. Items 7-8 are unsupervised learning algorithms.
Data Types and Corresponding Objectives
Data can be quantitative (numbers, or measurable on a standard scale) or qualitative (descriptions, or non-measurable because not on a standard scale). Qualitative data is also known as categorical data.
Quantitative data is generally used for projection. For example, what would next year's sales of a shopping mall be, given the number of visitors to the mall this year?
Qualitative data is generally used for classification, which can be broken down into two tasks: object verification (whether a cat is a cat) and object identification (whether an object is a cat).
In practice, the two types of data co-exist; it is unlikely for a dataset to be purely of either form. Extra care has to be taken when handling categorical data: it is usually assigned numbers to represent groups or levels, and this can give algorithms a false sense of scale if adequate consideration is not taken.
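As a small sketch of this point (the colour column here is made up for illustration), integer codes impose an ordering that does not exist, whereas one-hot encoding gives each category its own 0/1 column:

```python
import pandas as pd

# Hypothetical categorical column: colours have no natural order
df = pd.DataFrame({'colour': ['red', 'green', 'blue', 'green']})

# Naive integer encoding implies blue < green < red - a false sense of scale
df['colour_code'] = df['colour'].astype('category').cat.codes

# One-hot encoding avoids the artificial ordering
one_hot = pd.get_dummies(df['colour'], prefix='colour')
print(one_hot)
```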
Hypothesis Function
Each machine learning algorithm makes an assumption about the data, perhaps with the exception of neural networks. These assumptions are normally described by a function, called the hypothesis function. For example, linear regression assumes a linear relationship between the data and the label that we are interested in.
The hypothesis function for linear regression is:
\[\hat{Y} = mX + c\]
By convention, $\hat{Y}$ is the approximation (to the real label, $Y$). It is related to the input data $X$ by $m$ and $c$, the parameters of the model; that is, a different set of $m$ and $c$ yields a different value of $\hat{Y}$.
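A quick numerical sketch of this (the values of $m$ and $c$ below are arbitrary): each choice of parameters produces a different approximation $\hat{Y}$ for the same inputs:

```python
import numpy as np

def hypothesis(X, m, c):
    # The linear hypothesis function: Y_hat = m*X + c
    return m * X + c

X = np.array([1.0, 2.0, 3.0])
print(hypothesis(X, m=2.0, c=1.0))  # one set of parameters
print(hypothesis(X, m=0.5, c=0.0))  # another set -> different Y_hat
```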
Cost Function
The desired $\hat{Y}$ is the one closest to $Y$; that is, the deviation $Y - \hat{Y}$ should be as small as possible. Any deviation of $\hat{Y}$ from $Y$ is akin to a cost, hence a function of $Y - \hat{Y}$ serves as the cost function of the algorithm. The mean squared error is commonly used as the cost function (writing $n$ for the number of data points, since $m$ is already taken by the gradient):
\[\min \frac{1}{n}\sum (Y-\hat{Y})^{2}\]
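The mean squared error is straightforward to compute directly (the numbers below are arbitrary illustrations):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of the squared deviations Y - Y_hat
    return np.mean((y_true - y_pred) ** 2)

y = np.array([3.0, 5.0, 7.0])      # true labels
y_hat = np.array([2.5, 5.0, 7.5])  # model approximations
print(mse(y, y_hat))
```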
Optimization
This can be done by finding the optimal set of $m$ and $c$ that minimizes the cost. We could also take the derivatives of the cost function with respect to $m$ and $c$ and set them equal to 0 to find the minimum analytically. However, this becomes impractical as the dimension of $X$ increases, meaning there are $X_1, X_2, X_3, ..., X_n$ to be considered, as in the case of Multivariate Regression. Fortunately there are optimization algorithms to help us find the optimal set of $m$ and $c$, the most common being Gradient Descent.
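A minimal sketch of Gradient Descent for the one-dimensional case (the data, learning rate, and iteration count here are illustrative assumptions): start from arbitrary $m$ and $c$, then repeatedly step each parameter against the gradient of the mean squared error:

```python
import numpy as np

# Illustrative data generated from the line y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

m, c = 0.0, 0.0   # arbitrary starting parameters
lr = 0.05         # learning rate (step size)
n = len(X)

for _ in range(2000):
    Y_hat = m * X + c
    # Partial derivatives of the MSE with respect to m and c
    dm = (-2.0 / n) * np.sum(X * (Y - Y_hat))
    dc = (-2.0 / n) * np.sum(Y - Y_hat)
    m -= lr * dm
    c -= lr * dc

print(m, c)  # approaches m = 2, c = 1
```

Each iteration nudges $m$ and $c$ downhill on the cost surface, so the parameters converge towards the values that minimize the mean squared error.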
An Example
For the set of data given below (Figure 1), we can fit a line (in red) that generalizes the pattern (Figure 2).
Figure 1: the data. Figure 2: the data with the fitted line (in red).
We can calculate the gradient $m$ and intercept $c$ by hand in this case to find the equation of the line that generalizes the pattern of the data, or we can use Python and Scikit-Learn to do so. Here's the code:
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# x holds the input data as a 2-D array of shape (n_samples, 1),
# and y holds the corresponding labels, as plotted in Figure 1
lm = LinearRegression()
lm.fit(x, y)
print('The coefficient, m = {}'.format(lm.coef_))
print('The intercept, c = {}'.format(lm.intercept_))
```
This gives the following output: $m = 2.03364$ and $c = 2.30736$.
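With the fitted parameters, a projection for a new input follows directly from $\hat{Y} = mX + c$ (the input value below is an arbitrary choice for illustration):

```python
# Fitted parameters reported above
m, c = 2.03364, 2.30736

x_new = 10.0             # an arbitrary new input
y_pred = m * x_new + c   # the model's projection for x_new
print(y_pred)
```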
Graphically,
Data and the fitted Linear Regression model.
As I have shown, I first deployed the statistical approach - fitting a straight line to the data - and then the computer science approach - running code to perform Linear Regression. I could have computed the gradient ($m$) and intercept ($c$) manually, using derivatives and all, but coding negates the hassle - all in 3 lines of code.
I hope this gives some idea about machine learning and also the merits of learning to code. Data analytics gets complicated when the data becomes massive - with many dimensions and many examples.
Till then~
~ZF