What is KNN?
In this post, I’m sharing code I wrote while following Kirill Eremenko and the SuperDataScience Team’s “Machine Learning A-Z” course on Udemy.
The prediction scenario is this: which demographic would social network marketing ads influence most? We work for a car dealership and have data on consumers’ ages and estimated salaries. Where should marketing efforts be aimed as we try to predict which consumers will purchase our newest & best SUV model?
Euclidean Distance between two points: \(\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\)
k-Nearest Neighbors (KNN) is a machine learning technique used to classify a new data point based on the existing points closest to it. We will set up our algorithm to calculate the Euclidean distance from our new data point to each existing data point. Then, using the predetermined number of nearest neighbors k (we’ll be using 5), we assign the new point to whichever class holds the majority of those neighbors, i.e. more than \(k/2\) of them (at least three of our five).
To think of it in really simple terms, all of our existing customers (points) are scattered in the \((x, y)\) space. Our new customer has \(x\) age and \(y\) salary, so we’ll plot this onto our existing grid. Then we draw circles around the new customer until we hit the closest existing point – that’s one “neighbor”. We repeat the process until we have our chosen number of 5. How many of the neighbors purchased the SUV and how many did not? Whichever group has more, that’s what we’re going to predict the new customer would do too!
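A minimal sketch of that neighbor vote in plain NumPy; the customers and the new point below are made-up values, not the course data:

```
import numpy as np

# made-up existing customers: [age, salary] and whether they purchased (1 = yes, 0 = no)
X_existing = np.array([[25, 40000], [47, 25000], [52, 110000],
                       [46, 98000], [56, 60000], [35, 150000]])
purchased = np.array([0, 0, 1, 1, 0, 1])

new_customer = np.array([48, 90000])  # age, salary
k = 5

# Euclidean distance from the new customer to every existing customer
distances = np.sqrt(((X_existing - new_customer) ** 2).sum(axis=1))

# take the k closest points and let them vote
nearest = np.argsort(distances)[:k]
votes = purchased[nearest]
prediction = 1 if votes.sum() > k / 2 else 0
print(votes, prediction)
```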
Import libraries
import numpy as np
import pandas as pd
Import dataset
dataset = pd.read_csv('data/Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
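If you want to sanity-check the frame before slicing, a quick peek works; the column names in the comment are my assumption about the file’s layout, inferred from the two features and the target used below.

```
print(dataset.shape)
print(dataset.head())
# assumed columns: Age, EstimatedSalary, Purchased (1 = bought the SUV, 0 = did not)
```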
Splitting the dataset into training & testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=0
)
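With test_size=0.25, a quarter of the observations are held out for testing (the 100-row test set we’ll see again in the confusion matrix), and random_state=0 just makes the split reproducible. A quick shape check:

```
print(X_train.shape, X_test.shape)  # e.g. (300, 2) (100, 2) if the full dataset has 400 rows
```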
Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # transform only: reuse the scaling fitted on the training set
```
# array([[-1.455..., -0.784...],
# [ 2.067..., 1.372...],
# [-0.253..., -0.309...],
# ...,
# [-0.253..., -0.309...],
# [ 2.067..., -1.113...],
# [-1.455..., -0.309...]])
```
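Scaling matters a lot for KNN: salary is measured in tens of thousands of dollars while age spans a few decades, so without standardization the salary term would dominate the Euclidean distance almost entirely. StandardScaler simply centers each feature and divides by its standard deviation; a sketch with made-up ages:

```
# what StandardScaler does to each feature: z = (x - mean) / std
ages = np.array([19., 35., 26., 27., 58.])  # made-up raw values
ages_scaled = (ages - ages.mean()) / ages.std()
print(ages_scaled)
```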
Train & fit the KNN model
To learn more about the technical details of sklearn’s classes and functions, check out the sklearn API Reference.
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(
    n_neighbors=5,  # default
    p=2,            # euclidean distance; default
    metric='minkowski'
)
classifier.fit(X_train, y_train)
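With metric='minkowski', the classifier uses the distance \(d(a, b) = \left(\sum_i |a_i - b_i|^p\right)^{1/p}\); for \(p = 2\) this is exactly the Euclidean distance defined at the top of the post (and \(p = 1\) would give the Manhattan distance), so the arguments above just spell out the defaults.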
Predicting a new result
30y/o $87k/yr – first observation of X_test
person = X_test[[0]]
single_pred = classifier.predict(person)
single_prob = classifier.predict_proba(person)
print('1="Yes", 0="No"\n')
print(f'Single prediction for 30 y/o earning $87k/yr: {single_pred[0]} at a probability of {single_prob[0][0].round(3)}')
```
# 1="Yes", 0="No"
#
# Single prediction for 30 y/o earning $87k/yr: 0 at a
# probability of 0.8
```
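That 0.8 is simply the neighbor vote: with k = 5 and uniform weights, a predicted probability of 0.8 for class 0 means 4 of the 5 nearest neighbors did not buy. If you want to inspect those neighbors directly, the fitted classifier exposes kneighbors():

```
# distances to, and training-set indices of, the 5 nearest neighbors of this person
distances, indices = classifier.kneighbors(person)
print(distances.round(3))
print(y_train[indices])  # their labels; expect four 0s and one 1 here
```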
Predicting the test set results
y_pred = classifier.predict(X_test)
Creating the confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
```
# [[64 4]
# [ 3 29]]
# Accuracy: 0.93
```
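Reading the matrix: rows are the true classes and columns the predicted classes, so 64 actual non-buyers and 29 actual buyers were classified correctly, while 4 non-buyers were wrongly predicted to buy and 3 buyers were missed. Accuracy is just the diagonal over the total: \((64 + 29) / (64 + 4 + 3 + 29) = 93/100 = 0.93\).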
Visualizing the training set results
These next two code chunks will take a while to run. The KNN algorithm is already computationally expensive, and we’re adding to the heavy lifting by predicting over a dense grid of values. The final result is two plots with a visual mapping of our decision boundary, with the training and test observations appearing as the dots within the field.
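To get a feel for why it’s slow, here is a rough count of the grid points the code below asks the classifier to label; the age and salary ranges in this sketch are assumptions about the data, not exact values:

```
import numpy as np

# approximate grid size with a 0.25 step on both axes
# (assumed ranges: ages ~18-60, salaries ~$15k-$150k)
n_age = np.arange(18 - 10, 60 + 10, 0.25).size
n_salary = np.arange(15_000 - 1_000, 150_000 + 1_000, 0.25).size
print(n_age * n_salary)  # on the order of a hundred million points, each needing a KNN prediction
```

If it’s too slow on your machine, increasing the salary step (say, to 100) shrinks the grid by a few orders of magnitude without changing the picture much.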
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(
    np.arange(
        start = X_set[:, 0].min() - 10,
        stop = X_set[:, 0].max() + 10,
        step = 0.25
    ),
    np.arange(
        start = X_set[:, 1].min() - 1000,
        stop = X_set[:, 1].max() + 1000,
        step = 0.25
    )
)
plt.contourf(
    X1, X2,
    classifier.predict(
        sc.transform(np.array([X1.ravel(), X2.ravel()]).T)
    ).reshape(X1.shape),
    alpha = 0.75,
    cmap = ListedColormap(('red', 'green'))
)
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(
        X_set[y_set == j, 0],
        X_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i),
        label = j
    )
plt.title('KNN Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Visualizing the test set results
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(
    np.arange(
        start = X_set[:, 0].min() - 10,
        stop = X_set[:, 0].max() + 10,
        step = 0.25
    ),
    np.arange(
        start = X_set[:, 1].min() - 1000,
        stop = X_set[:, 1].max() + 1000,
        step = 0.25
    )
)
plt.contourf(
    X1, X2,
    classifier.predict(
        sc.transform(np.array([X1.ravel(), X2.ravel()]).T)
    ).reshape(X1.shape),
    alpha = 0.75,
    cmap = ListedColormap(('red', 'green'))
)
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(
        X_set[y_set == j, 0],
        X_set[y_set == j, 1],
        c = ListedColormap(('red', 'green'))(i),
        label = j
    )
plt.title('KNN Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()