- Decision Trees
- Naive Bayes
- Linear Discriminant Analysis
- k-Nearest Neighbors
- Logistic Regression
- Neural Networks
- Support Vector Machines (SVM)
K-Nearest Neighbors
- Multi-class classifier: A classifier that can predict a field with multiple discrete values.
- KNN: K-Nearest Neighbors, a method for classifying cases based on their similarity to other cases. It assumes that similar cases with the same class labels are near each other, and it can also be used to estimate values for a continuous target (regression).
Procedure
- Pick a value for K.
- Calculate the distance (e.g., Euclidean distance) of the unknown case from all cases in the training data.
- Select the K observations in the training data that are "nearest" to the unknown data point.
- Predict the response of the unknown data point using the most popular response value among the K nearest neighbors (see the sketch after this list).
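A minimal sketch of this procedure with scikit-learn's KNeighborsClassifier; the feature matrix X and labels y below are toy placeholders, not data from the course.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# toy data: two features, binary labels (placeholder for a real dataset)
X = np.array([[1, 2], [2, 1], [2, 3], [8, 8], [9, 7], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

k = 3                                        # step 1: pick K
neigh = KNeighborsClassifier(n_neighbors=k)  # Euclidean distance by default
neigh.fit(X_train, y_train)                  # distances are computed at prediction time
yhat = neigh.predict(X_test)                 # majority vote among the K nearest neighbors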
Which K?
A small K (e.g., 1-NN) is sensitive to noise and outliers and tends to over-fit; a larger K (e.g., 5-NN) gives a smoother, more general boundary, but a K that is too large under-fits.
Based on Evaluation
Compute the accuracy on held-out (test) data for a range of candidate values and choose the K that gives the best accuracy (see the sketch below).
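A sketch of that selection loop; it assumes an existing train/test split (for example the toy split from the previous sketch).

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# X_train, X_test, y_train, y_test: any existing train/test split
best_k, best_acc = None, 0.0
for k in range(1, len(X_train) + 1):
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, neigh.predict(X_test))
    if acc > best_acc:
        best_k, best_acc = k, acc
print("Best K:", best_k, "with accuracy", best_acc)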
Used for Regression
KNN can also estimate a continuous target: the "distance" is computed over all the attributes, and the prediction is based on the target values of the K nearest neighbors (e.g., their average).
Evaluation – Classification Accuracy
Jaccard index
Also known as the Jaccard similarity coefficient/score (intersection over union):

$$J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|} = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}$$

- $y$: actual labels
- $\hat{y}$: predicted labels
Example
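A hypothetical worked computation (the numbers are illustrative, not from the original example): with 10 test cases whose predicted labels match the actual labels in 8 positions,

$$J(y, \hat{y}) = \frac{8}{10 + 10 - 8} = 0.66$$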
F1-score
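Built from the confusion matrix: precision $= \frac{TP}{TP + FP}$, recall $= \frac{TP}{TP + FN}$, and the F1-score is their harmonic mean,

$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

computed per class and then averaged.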
Confusion Matrix
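Each row corresponds to an actual class and each column to a predicted class (scikit-learn's convention); the diagonal entries count correct predictions and the off-diagonal entries count misclassifications.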
Log loss
Log loss measures the performance of a classifier whose predicted output is a probability value between 0 and 1; a lower log loss means better predictions.
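For binary labels this is the standard cross-entropy form:

$$\text{LogLoss} = -\frac{1}{n} \sum \left[ y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \right]$$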
Decision Trees
- Each internal node corresponds to a test on an attribute
- Each branch corresponds to a result (value) of the test
- Each leaf node assigns a classification (class label)
Building Procedure
- Choose an attribute from the dataset
- Calculate the significance of the attribute in splitting the data (the entropy of the data, and from it the information gain)
- Split the data based on the value of the best attribute
- Go back to step 1 and repeat for each branch until the nodes are pure enough (or no attributes remain)
Find the best attribute
A bad attribute leaves the resulting nodes mixed; a better attribute gives:
- More Predictiveness
- Less Impurity
- Lower Entropy
Entropy
A measure of randomness or uncertainty in the data.
If a node is totally homogeneous (all one class), its entropy is 0; if it is split half and half, its entropy is 1.
- The lower the entropy, the less uniform the distribution and the purer the node
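For a node containing two classes A and B, the standard formula used in decision-tree learning is

$$\text{Entropy} = -p(A)\log_2 p(A) - p(B)\log_2 p(B)$$

where $p(A)$ and $p(B)$ are the proportions of each class in the node.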
The Sex attribute has more information gain, so choose Sex as the splitting attribute.
Information Gain
Information gain is the increase in certainty after splitting: the entropy of the node before the split minus the weighted entropy of the branches after the split. The attribute with the highest information gain is chosen (see the sketch below).
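A small sketch of those two calculations in plain NumPy; the class counts are made-up illustrative numbers, not values from the original slides.

import numpy as np

def entropy(counts):
    """Entropy of a node given the class counts in it."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# hypothetical parent node: 9 cases of class A, 5 of class B
parent = [9, 5]

# hypothetical split into two branches
branches = [[6, 1], [3, 4]]

weights = [sum(b) / sum(parent) for b in branches]
weighted_child_entropy = sum(w * entropy(b) for w, b in zip(weights, branches))

info_gain = entropy(parent) - weighted_child_entropy
print("Information gain:", round(info_gain, 3))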
Python Programming
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
Get the data
$ wget -O drug200.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv
Show the first 5 lines
my_data = pd.read_csv("drug200.csv", delimiter=",")
my_data[0:5]
The data size
my_data.size
Preprocess the data
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]
from sklearn import preprocessing

le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])

le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])

le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])

X[0:5]
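The target vector y also has to be defined before splitting; assuming the label column in drug200.csv is named Drug (as in the course dataset):

y = my_data["Drug"]
y[0:5]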
Setting up the decision tree
Split the dataset
from sklearn.model_selection import train_test_split

X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)
Modeling
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drugTree.fit(X_trainset, y_trainset)
Prediction
predTree = drugTree.predict(X_testset)
To make an intuitive comparison
print(predTree[0:5])
print(y_testset[0:5])
Evaluation
from sklearn import metrics
import matplotlib.pyplot as plt

print("DecisionTree's Accuracy: ", metrics.accuracy_score(y_testset, predTree))
To calculate the accuracy without sklearn
le_Drug = preprocessing.LabelEncoder()
le_Drug.fit(['drugA', 'drugB', 'drugC', 'drugX', 'drugY'])
testDrug = le_Drug.transform(y_testset.values)
predDrug = le_Drug.transform(predTree)

# accuracy = fraction of predictions that exactly match the true labels
np.mean(testDrug == predDrug)
Logistic Regression
Logistic Regression is a classification algorithm for categorical target variables. It is a good choice when:
- the target is binary (multi-class is also supported)
- a probabilistic output is required (the probability of belonging to a class)
- a linear decision boundary is sufficient
- you want to understand the impact of each feature
Logistic Function
Also called the sigmoid function.
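The standard form, with parameter vector $\theta$ and feature vector $x$:

$$\sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

It maps any real value into the range $(0, 1)$, interpreted as $P(y = 1 \mid x)$.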
Training Process
- Initialize the parameters $\theta$
- Calculate $\hat{y} = \sigma(\theta^T x)$ for a customer
- Compare the output $\hat{y}$ with the actual label $y$, and record the error
- Calculate the total error (cost) over all customers
- Change $\theta$ to reduce the cost
- Go back to step 2 (until the cost is low enough)
Cost Function
Complex Version
Simplified
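The standard forms, assuming the usual logistic-regression derivation: the "complex" version measures the squared difference between the sigmoid output and the label,

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}\left(\sigma(\theta^T x^{(i)}) - y^{(i)}\right)^2$$

and the simplified (cross-entropy) version, which is what is actually minimized, is

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$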
Minimize the Cost function
The gradient is a vector that points in the direction of steepest ascent; moving against it decreases the cost fastest (gradient descent; see the sketch after the steps below).
- Initialize the parameters $\theta$ randomly
- Feed the cost function with the training set, and calculate the error (cost)
- Calculate the gradient of the cost function, $\nabla J(\theta)$
- Update the weights with new values: $\theta_{new} = \theta_{old} - \eta \nabla J(\theta)$, where $\eta$ is the learning rate
- Go back to step 2 until the cost is small enough
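A minimal NumPy sketch of these steps for binary logistic regression; the synthetic data, learning rate, and iteration count are illustrative choices, not values from the course.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic data: 100 samples, 2 features, binary labels (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = np.zeros(2)   # step 1: initialize the parameters
eta = 0.1             # learning rate
for _ in range(1000):
    y_hat = sigmoid(X @ theta)           # step 2: predictions (and cost, if tracked)
    grad = X.T @ (y_hat - y) / len(y)    # step 3: gradient of the cross-entropy cost
    theta = theta - eta * grad           # step 4: update the weights
print("Learned theta:", theta)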
SVM – Support Vector Machine
- Mapping data to a high-dimensional feature space
- Finding a separator
Kernelling – The transformation
Kernelling is the data transformation itself: a kernel function maps the data into a higher-dimensional space where it becomes separable. Common kernels to try:
- Linear
- Polynomial
- RBF (Radial basis function)
- Sigmoid
Find the hyperplane
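SVM looks for the hyperplane with the largest margin, i.e., the one that maximizes the distance to the closest data points of each class; those closest points are the support vectors.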
Pros and cons
- Advantages:
- Accurate in high-dimensional spaces
- Memory efficient
- Disadvantages
- Prone to over-fitting
- No probability estimation
Applications
- Image Recognition
- Text category assignment
- Detecting spam
- Sentiment analysis
- Gene Expression Classification
- Regression, outlier detection and clustering
Python Programming
Dependencies
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
Load the Cancer data
$ wget -O cell_samples.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv
Load Data from CSV File
cell_df = pd.read_csv("cell_samples.csv")
cell_df.head(10)  # to have a look at the data
To get an intuitive look at the data, plot two of the features colored by class:

ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');
cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);
plt.show()
Data Pre-processing and selection
Have a look at column data types
cell_df.dtypes
transform the non-numerical value to numerical
cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')
cell_df.dtypes
- errors='coerce' forces values that cannot be parsed as numbers to become NaN, so the notnull() filter then drops those rows
Transform the table to array
feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
X[0:5]
transform the value of Class
cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])
y[0:10]
Split into Train/Test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
- random_state: seeds the pseudo-random number generator, so the same train/test split is reproduced on every run
Modeling (SVM with Scikit-learn)
Fit
from sklearn import svm

clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
- clf: classifier
- SVC: Support Vector Classification
Predict new values
yhat = clf.predict(X_test)
yhat[0:5]
Evaluation
from sklearn.metrics import classification_report, confusion_matrix
import itertools
Confusion Matrix
To plot the confusion matrix
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2, 4])
np.set_printoptions(precision=2)

print(classification_report(y_test, yhat))

# Plot non-normalized confusion matrix
# (plot_confusion_matrix is a plotting helper defined elsewhere, e.g. in the lab notebook;
#  it is not imported from sklearn here)
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)', 'Malignant(4)'], normalize=False, title='Confusion matrix')
f1_score
from sklearn.metrics import f1_score

f1_score(y_test, yhat, average='weighted')
Jaccard index for accuracy
from sklearn.metrics import jaccard_similarity_score

# Note: newer scikit-learn versions removed this function in favor of jaccard_score
jaccard_similarity_score(y_test, yhat)