• Decision Trees
• Naive Bayes
• Linear Discriminant Analysis
• k-Nearest Neighbors
• Logistic Regression
• Neural Networks
• Support Vector Machines (SVM)

# K-Nearest Neighbors

• Multi-class classifier: a classifier that can predict a target field with more than two discrete values.
• KNN (k-Nearest Neighbors): a method for classifying cases based on their similarity to other cases. It rests on the assumption that similar cases with the same class labels lie near each other. It can also be used to estimate values for a continuous target.

## Procedure

1. Pick a value for K.
2. Calculate the distance of the unknown case from all cases (e.g., Euclidean distance).
3. Select the K observations in the training data that are nearest to the unknown data point.
4. Predict the response of the unknown data point using the most popular response value among the K nearest neighbors.
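The four steps above can be sketched with scikit-learn's `KNeighborsClassifier` (Euclidean distance by default); the two-feature toy dataset here is assumed for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (assumed): two well-separated clusters, two classes
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Step 1: pick K=3; steps 2-4 (distances, K nearest, majority vote)
# happen inside predict()
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

preds = knn.predict([[2, 2], [6, 5]])
print(preds)  # [0 1] — each unknown point takes the majority label of its 3 neighbors
```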

## Choosing K

### Based on Evaluation

Train the model for a range of K values, evaluate the accuracy of each (e.g., on a held-out test set), and pick the K with the best accuracy.
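The selection loop can be sketched as follows; the synthetic dataset from `make_classification` stands in for real data:

```python
from sklearn.datasets import make_classification  # synthetic data for the sketch
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Evaluate test-set accuracy for each candidate K and keep the best
scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, knn.predict(X_test))

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```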

## KNN for Regression

KNN can also predict a continuous target: compute the distance over all attributes, find the K nearest neighbors, and predict the average (or median) of their target values.
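A minimal sketch of KNN regression, assuming a toy one-feature dataset; the prediction is the mean of the K nearest targets:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (assumed): target is exactly 2 * x
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# For a continuous target, predict the mean of the K neighbors' values
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

pred = reg.predict([[2.5]])
print(pred)  # [5.] — mean of the targets for x=2 and x=3
```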

## Evaluation – Classification Accuracy

### Jaccard index

Also known as the Jaccard Similarity Coefficient/Score (Intersection over Union).

• $y$: Actual labels
• $\hat{y}$: Predicted labels

$J(y, \hat{y}) = \dfrac{|y \cap \hat{y}|}{|y \cup \hat{y}|} = \dfrac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}$
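A minimal hand-rolled sketch of this formula for binary label vectors (toy labels assumed), counting the positions where actual and predicted labels agree as the intersection:

```python
import numpy as np

def jaccard_index(y_true, y_pred):
    """|intersection| / (|y| + |yhat| - |intersection|), where the
    intersection is the number of positions with matching labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    intersection = np.sum(y_true == y_pred)
    return intersection / (len(y_true) + len(y_pred) - intersection)

score = jaccard_index([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
print(score)  # 4 matches -> 4 / (5 + 5 - 4) = 0.666...
```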

### F1-score

#### Confusion Matrix

• $Precision = \dfrac{TP}{TP + FP}$
• $Recall = \dfrac{TP}{TP + FN}$
• $F1\text{-}score = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$
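Plugging in some assumed confusion-matrix counts (TP=6, FP=2, FN=1) shows how the three quantities fit together:

```python
# Assumed counts for illustration
TP, FP, FN = 6, 2, 1

precision = TP / (TP + FP)  # 6/8 = 0.75
recall = TP / (TP + FN)     # 6/7 ≈ 0.857
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.75 0.857 0.8
```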

### Log loss

Measures the performance of a classifier whose predicted output is a probability value between 0 and 1.

$LogLoss = -\dfrac{1}{n}\sum_{i=1}^{n}\left(y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right)$
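The formula can be checked directly against scikit-learn's `log_loss`; the labels and probabilities here are assumed toy values:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])           # actual labels (assumed example)
y_prob = np.array([0.9, 0.1, 0.8, 0.6])   # predicted probability of class 1

# Direct implementation of the formula above
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(round(manual, 4))                          # ≈ 0.2362
print(np.isclose(manual, log_loss(y_true, y_prob)))  # True
```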

# Decision Trees

• Each internal node corresponds to a test on an attribute
• Each branch corresponds to a result of the test
• Each leaf node assigns a classification

## Building Procedure

1. Choose an attribute from the dataset.
2. Calculate the significance of the attribute in splitting the data (the entropy of the data, then the information gain).
3. Split the data on the values of the best attribute.
4. Repeat from step 1 for each branch.

### Find the best attribute

#### A better attribute has

• More predictiveness
• Less impurity
• Lower entropy

#### Entropy

Measure of randomness or uncertainty.

$Entropy = -p(A)\log(p(A)) - p(B)\log(p(B))$

If the node is totally homogeneous, the entropy is 0; if it is split half and half, the entropy is 1.

• The lower the entropy, the less uniform the distribution and the purer the node

For example, comparing two candidate attributes:

$Gain(s, Cholesterol)$ $= 0.940 - [(8/14) \times 0.811 + (6/14) \times 1.0]$ $= 0.048$

$Gain(s, Sex)$ $= 0.940 - [(7/14) \times 0.985 + (7/14) \times 0.592]$ $= 0.151$

The Sex attribute has more information gain, so choose Sex as the splitting attribute.

##### Information Gain

Information gain is the increase in certainty after splitting:

$\text{Information Gain} = (\text{Entropy before split}) - (\text{Weighted entropy after split})$
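The entropy and gain figures above can be reproduced numerically; the branch class counts `[4, 3]` and `[6, 1]` are assumed here because they yield the quoted branch entropies 0.985 and 0.592:

```python
import numpy as np

def entropy(counts):
    """Entropy (base 2) of a class-count distribution."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

# 14 samples, 9 of one class and 5 of the other: entropy before split ≈ 0.940
before = entropy([9, 5])

# Two branches of 7 samples each, with entropies ≈ 0.985 and ≈ 0.592
after = (7/14) * entropy([4, 3]) + (7/14) * entropy([6, 1])
gain = before - after  # ≈ 0.151 (the text's figure, using rounded branch entropies)

print(round(before, 3), round(gain, 3))
```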

## Python Programming

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
```


Get the data.

### Load Data from CSV File

```python
cell_df = pd.read_csv("cell_samples.csv")
cell_df.head(10)    # have a look at the data
```


To have an intuitive look,

```python
ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant')
cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax)
plt.show()
```


### Data Pre-processing and selection

Have a look at the column data types:

```python
cell_df.dtypes
```


Transform the non-numerical values to numerical:

```python
cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')
cell_df.dtypes
```

• `errors='coerce'` forces unparseable values to `NaN`, which `.notnull()` then filters out

Transform the feature table to an array:

```python
feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
X[0:5]
```



Transform the values of Class likewise:

```python
cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])
y[0:10]
```


### Split into Train/Test dataset

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
```

• `random_state` seeds the pseudo-random number generator, making the split reproducible

### Modeling (SVM with scikit-learn)

Fit

```python
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
```

• clf: classifier
• SVC: Support Vector Classification

Predict new cases:

```python
yhat = clf.predict(X_test)
yhat[0:5]
```


### Evaluation

```python
from sklearn.metrics import classification_report, confusion_matrix
import itertools
```

#### Confusion Matrix

To plot the confusion matrix:

```python
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2, 4])
np.set_printoptions(precision=2)

print(classification_report(y_test, yhat))

# Plot non-normalized confusion matrix
# (plot_confusion_matrix is a helper function defined separately in the notebook)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)', 'Malignant(4)'], normalize=False, title='Confusion matrix')
```


#### f1_score

```python
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')
```


#### Jaccard index for accuracy

```python
from sklearn.metrics import jaccard_score  # jaccard_similarity_score was removed in scikit-learn 0.23+
jaccard_score(y_test, yhat, pos_label=2)
```