- Decision Trees
- Naive Bayes
- Linear Discriminant Analysis
- k-Nearest Neighbors
- Logistic Regression
- Neural Networks
- Support Vector Machines (SVM)
K-Nearest Neighbors
- Multi-class classifier: A classifier that can predict a field with multiple discrete values.
- KNN: K-Nearest Neighbors, a method for classifying cases based on their similarity to other cases. It assumes that similar cases with the same class labels are near each other, and it can also be used to estimate values for a continuous target (regression).
Procedure
- Pick a value for K.
- Calculate the distance (e.g., Euclidean distance) of the unknown case from all cases in the training data.
- Select the K observations in the training data that are "nearest" to the unknown data point.
- Predict the response of the unknown data point using the most popular response value among the K nearest neighbors (see the sketch after this list).
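A minimal sketch of this procedure with scikit-learn's KNeighborsClassifier; the feature matrix X and labels y below are toy placeholders, not data from the course.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# toy data: two features, binary labels (placeholder for a real dataset)
X = np.array([[1, 2], [2, 1], [2, 3], [8, 8], [9, 7], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

k = 3                                        # step 1: pick K
neigh = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
neigh.fit(X_train, y_train)                  # distances are computed at prediction time
yhat = neigh.predict(X_test)                 # majority vote among the K nearest neighbors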
Which K?
A small K (e.g., 1-NN) is sensitive to noise and outliers and tends to over-fit; a larger K (e.g., 5-NN) gives a smoother, more general boundary, but a K that is too large under-fits.
Based on Evaluation
Compute the accuracy on held-out (test) data for a range of candidate values and choose the K that gives the best accuracy (see the sketch below).
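A sketch of that selection loop; it assumes an existing train/test split (for example the toy split from the previous sketch).

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# X_train, X_test, y_train, y_test: any existing train/test split
best_k, best_acc = None, 0.0
for k in range(1, len(X_train) + 1):
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, neigh.predict(X_test))
    if acc > best_acc:
        best_k, best_acc = k, acc
print("Best K:", best_k, "with accuracy", best_acc)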
Used for Regression
KNN can also estimate a continuous target: the "distance" is computed over all the attributes, and the prediction is based on the target values of the K nearest neighbors (e.g., their average).
Evaluation – Classification Accuracy
Jaccard index
Also known as the Jaccard similarity coefficient/score (intersection over union):

$$J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|} = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}$$

- $y$: actual labels
- $\hat{y}$: predicted labels
Example
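A hypothetical worked computation (the numbers are illustrative, not from the original example): with 10 test cases whose predicted labels match the actual labels in 8 positions,

$$J(y, \hat{y}) = \frac{8}{10 + 10 - 8} = 0.66$$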
F1-score
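Built from the confusion matrix: precision $= \frac{TP}{TP + FP}$, recall $= \frac{TP}{TP + FN}$, and the F1-score is their harmonic mean,

$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

computed per class and then averaged.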
Confusion Matrix
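Each row corresponds to an actual class and each column to a predicted class (scikit-learn's convention); the diagonal entries count correct predictions and the off-diagonal entries count misclassifications.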
Log loss
Log loss measures the performance of a classifier whose predicted output is a probability value between 0 and 1; a lower log loss means better predictions.
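For binary labels this is the standard cross-entropy form:

$$\text{LogLoss} = -\frac{1}{n} \sum \left[ y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \right]$$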
Decision Trees
- Each internal node corresponds to a test on an attribute
- Each branch corresponds to a result (value) of the test
- Each leaf node assigns a classification (class label)
Building Procedure
- Choose an attribute from the dataset
- Calculate the significance of the attribute in splitting the data (the entropy of the data, and from it the information gain)
- Split the data based on the value of the best attribute
- Go back to step 1 and repeat for each branch until the nodes are pure enough (or no attributes remain)
Find the best attribute
A bad attribute leaves the resulting nodes mixed; a better attribute gives:
- More Predictiveness
- Less Impurity
- Lower Entropy
Entropy
A measure of randomness or uncertainty in the data.
If a node is totally homogeneous (all one class), its entropy is 0; if it is split half and half, its entropy is 1.
- The lower the entropy, the less uniform the distribution and the purer the node
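For a node containing two classes A and B, the standard formula used in decision-tree learning is

$$\text{Entropy} = -p(A)\log_2 p(A) - p(B)\log_2 p(B)$$

where $p(A)$ and $p(B)$ are the proportions of each class in the node.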
The Sex attribute has more information gain, so choose Sex as the splitting attribute.
Information Gain
Information gain is the increase in certainty after splitting: the entropy of the node before the split minus the weighted entropy of the branches after the split. The attribute with the highest information gain is chosen (see the sketch below).
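A small sketch of those two calculations in plain NumPy; the class counts are made-up illustrative numbers, not values from the original slides.

import numpy as np

def entropy(counts):
    """Entropy of a node given the class counts in it."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# hypothetical parent node: 9 cases of class A, 5 of class B
parent = [9, 5]

# hypothetical split into two branches
branches = [[6, 1], [3, 4]]

weights = [sum(b) / sum(parent) for b in branches]
weighted_child_entropy = sum(w * entropy(b) for w, b in zip(weights, branches))

info_gain = entropy(parent) - weighted_child_entropy
print("Information gain:", round(info_gain, 3))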
Python Programming
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
Get the data
$ wget -O drug200.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv
Show the first 5 lines
my_data = pd.read_csv("drug200.csv", delimiter=",")
my_data[0:5]
The data size
my_data.size
Preprocess the data
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]
from sklearn import preprocessing

le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])

le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])

le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])

X[0:5]
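The target vector y also has to be defined before splitting; assuming the label column in drug200.csv is named Drug (as in the course dataset):

y = my_data["Drug"]
y[0:5]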
Setting up the decision tree
Split the dataset
from sklearn.model_selection import train_test_split

X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)
Modeling
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drugTree.fit(X_trainset, y_trainset)
Prediction
predTree = drugTree.predict(X_testset)
To make an intuitive comparison
print(predTree[0:5])
print(y_testset[0:5])
Evaluation
from sklearn import metrics
import matplotlib.pyplot as plt

print("DecisionTree's Accuracy: ", metrics.accuracy_score(y_testset, predTree))
To calculate the accuracy without sklearn
le_Drug = preprocessing.LabelEncoder()
le_Drug.fit(['drugA', 'drugB', 'drugC', 'drugX', 'drugY'])
testDrug = le_Drug.transform(y_testset.values)
predDrug = le_Drug.transform(predTree)

# accuracy = fraction of predictions that exactly match the true labels
np.mean(testDrug == predDrug)
Logistic Regression
Logistic Regression is a classification algorithm for categorical target variables. It is a good choice when:
- the target is binary (multi-class is also supported)
- a probabilistic output is required (the probability of belonging to a class)
- a linear decision boundary is sufficient
- you want to understand the impact of each feature
Logistic Function
Also called the sigmoid function.
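The standard form, with parameter vector $\theta$ and feature vector $x$:

$$\sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

It maps any real value into the range $(0, 1)$, interpreted as $P(y = 1 \mid x)$.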
Training Process
- Initialize the parameters $\theta$
- Calculate $\hat{y} = \sigma(\theta^T x)$ for a customer
- Compare the output $\hat{y}$ with the actual label $y$, and record the error
- Calculate the total error (cost) over all customers
- Change $\theta$ to reduce the cost
- Go back to step 2 (until the cost is low enough)
Cost Function
Complex Version
Simplified
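The standard forms, assuming the usual logistic-regression derivation: the "complex" version measures the squared difference between the sigmoid output and the label,

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}\left(\sigma(\theta^T x^{(i)}) - y^{(i)}\right)^2$$

and the simplified (cross-entropy) version, which is what is actually minimized, is

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$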
Minimize the Cost function
The gradient is a vector that points in the direction of steepest ascent; moving against it decreases the cost fastest (gradient descent; see the sketch after the steps below).
- Initialize the parameters $\theta$ randomly
- Feed the cost function with the training set, and calculate the error (cost)
- Calculate the gradient of the cost function, $\nabla J(\theta)$
- Update the weights with new values: $\theta_{new} = \theta_{old} - \eta \nabla J(\theta)$, where $\eta$ is the learning rate
- Go back to step 2 until the cost is small enough
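A minimal NumPy sketch of these steps for binary logistic regression; the synthetic data, learning rate, and iteration count are illustrative choices, not values from the course.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic data: 100 samples, 2 features, binary labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = np.zeros(2)   # step 1: initialize the parameters
eta = 0.1             # learning rate
for _ in range(1000):
    y_hat = sigmoid(X @ theta)           # step 2: predictions (and cost, if tracked)
    grad = X.T @ (y_hat - y) / len(y)    # step 3: gradient of the cross-entropy cost
    theta = theta - eta * grad           # step 4: update the weights
print("Learned theta:", theta)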
SVM – Support Vector Machine
- Mapping data to a high-dimensional feature space
- Finding a separator
Kernelling – The transformation
Kernelling is the data transformation itself: a kernel function maps the data into a higher-dimensional space where it becomes separable. Common kernels to try:
- Linear
- Polynomial
- RBF (Radial basis function)
- Sigmoid
Find the hyperplane
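SVM looks for the hyperplane with the largest margin, i.e., the one that maximizes the distance to the closest data points of each class; those closest points are the support vectors.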
Pros and cons
- Advantages:
- Accurate in high-dimensional spaces
- Memory efficient
- Disadvantages
- Prone to over-fitting
- No probability estimation
Applications
- Image Recognition
- Text category assignment
- Detecting spam
- Sentiment analysis
- Gene Expression Classification
- Regression, outlier detection and clustering
Python Programming
Dependencies
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
Load the Cancer data
$ wget -O cell_samples.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv
Load Data from CSV File
cell_df = pd.read_csv("cell_samples.csv")
cell_df.head(10)  # to have a look at the data
To get an intuitive look at the data, plot two of the features colored by class:

ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');
cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);
plt.show()
Data Pre-processing and selection
Have a look at column data types
cell_df.dtypes
transform the non-numerical value to numerical
cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')
cell_df.dtypes
- errors='coerce' forces values that cannot be parsed as numbers to become NaN, so the notnull() filter then drops those rows
Transform the table to array
feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
X[0:5]
transform the value of Class
cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])
y[0:10]
Split into Train/Test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
- random_state: seeds the pseudo-random number generator, so the same train/test split is reproduced on every run
Modeling (SVM with Scikit-learn)
Fit
from sklearn import svm

clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
- clf: classifier
- SVC: Support Vector Classification
Predict new values
yhat = clf.predict(X_test)
yhat[0:5]
Evaluation
from sklearn.metrics import classification_report, confusion_matrix
import itertools
Confusion Matrix
To plot the confusion matrix
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2, 4])
np.set_printoptions(precision=2)

print(classification_report(y_test, yhat))

# Plot non-normalized confusion matrix
# (plot_confusion_matrix is a plotting helper defined elsewhere, e.g. in the lab notebook;
#  it is not imported from sklearn here)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)', 'Malignant(4)'], normalize=False, title='Confusion matrix')
f1_score
from sklearn.metrics import f1_score

f1_score(y_test, yhat, average='weighted')
Jaccard index for accuracy
from sklearn.metrics import jaccard_similarity_score

# Note: newer scikit-learn versions removed this function in favor of jaccard_score
jaccard_similarity_score(y_test, yhat)