The following plot shows the log loss when y = 1:
The log loss equals 0 only for a perfect prediction (p = 1 and y = 1, or p = 0 and y = 0), and approaches infinity as the prediction gets worse (i.e., when y = 1 and p → 0, or y = 0 and p → 1).
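Recall that the log loss of a single prediction p for a sample with true label y is -[y log(p) + (1 - y) log(1 - p)].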
The cost function computes the average loss over the whole data set:
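J(w) = -(1/n) Σᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)]

where the sum runs over the n training samples and pᵢ = σ(wᵗxᵢ) is the model's predicted probability for the i-th sample.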
The cost function can be written in vectorized form as follows:
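J(w) = -(1/n) [yᵗ log(p) + (1 - y)ᵗ log(1 - p)]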
where y = (y₁, …, yₙ) is a vector that contains all the labels of the training samples, and p = (p₁, …, pₙ) is a vector that contains all the predicted probabilities of the model for the training samples (the log is applied element-wise).
This cost function is convex, i.e., it has a single global minimum. However, there is no closed-form solution for finding the optimal w* (due to the non-linearities introduced by the log function). Therefore, we need to use an iterative optimization method such as gradient descent in order to find the minimum.
Gradient descent is an iterative approach for finding a minimum of a function, where we take small steps in the opposite direction of the gradient in order to get closer to the minimum:
In order to use gradient descent to find the minimum of the log loss cost, we need to compute the partial derivatives of J(w) with respect to each one of the weights.
The partial derivative of J(w) with respect to any of the weights wⱼ is:
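∂J(w)/∂wⱼ = (1/n) Σᵢ (pᵢ - yᵢ) xᵢⱼ

where xᵢⱼ is the j-th feature of the i-th training sample.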
Proof:
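Writing pᵢ = σ(zᵢ) with zᵢ = wᵗxᵢ, and using the identity σ'(z) = σ(z)(1 - σ(z)), we have ∂pᵢ/∂wⱼ = pᵢ(1 - pᵢ)xᵢⱼ. By the chain rule,

∂J(w)/∂wⱼ = -(1/n) Σᵢ [yᵢ/pᵢ - (1 - yᵢ)/(1 - pᵢ)] ∂pᵢ/∂wⱼ
          = -(1/n) Σᵢ [yᵢ(1 - pᵢ) - (1 - yᵢ)pᵢ] xᵢⱼ
          = (1/n) Σᵢ (pᵢ - yᵢ) xᵢⱼ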
Thus, the gradient vector can be written in vectorized form as follows:
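∇J(w) = (1/n) Xᵗ(p - y)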
And the gradient descent update rule is:
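w ← w - α∇J(w) = w - (α/n) Xᵗ(p - y)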
where α is a learning rate that controls the step size (0 < α < 1).
Note that whenever you use gradient descent, you must make sure that your data set is normalized (otherwise gradient descent may take steps of different sizes in different directions, which will make it unstable).
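For example, a minimal sketch of one common way to do this is to standardize each feature with scikit-learn's StandardScaler (any equivalent rescaling would work just as well):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each feature is rescaled to zero mean and unit variance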
We will now implement the logistic regression model in Python from scratch, including the cost function and gradient computation, then optimize the model using gradient descent, evaluate it, and plot the final decision boundary.
For the demonstration we will use the Iris data set (BSD license). The original data set contains 150 samples of Iris flowers that belong to one of three species (Setosa, Versicolor and Virginica). We will turn it into a binary classification problem by using only the first two types of flowers (Setosa and Versicolor). In addition, we will use only the first two features of each flower (sepal length and sepal width).
Loading the Data Set
Let's first import the required libraries and fix the random seed in order to get reproducible results:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)
Next, we load the data set:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]  # Take only the first two features
y = iris.target

# Take only the setosa and versicolor flowers
X = X[(y == 0) | (y == 1)]
y = y[(y == 0) | (y == 1)]
Let's plot the data:
def plot_data(X, y):
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y], style=iris.target_names[y],
                    palette=['r', 'b'], markers=('s', 'o'), edgecolor='k')
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.legend()
plot_data(X, y)
As can be seen, the data set is linearly separable, therefore logistic regression should be able to find the boundary between the two classes.
Next, we need to add a column of ones to the feature matrix X in order to represent the bias (w₀):
# Add a column for the bias
n = X.shape[0]
X_with_bias = np.hstack((np.ones((n, 1)), X))
We now split the data set into training and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, random_state=0)
Model Implementation
We are now ready to implement the logistic regression model. We start by defining a helper function that computes the sigmoid function:
def sigmoid(z):
    """ Compute the sigmoid of z (z can be a scalar or a vector). """
    z = np.array(z)
    return 1 / (1 + np.exp(-z))
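As a quick sanity check, the sigmoid of 0 is 0.5, and the function saturates toward 0 and 1 for large negative and positive inputs:

print(sigmoid(0))                     # 0.5
print(sigmoid(np.array([-5, 0, 5])))  # approximately [0.0067 0.5 0.9933]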
Next, we implement the cost function, which returns the cost of a logistic regression model with parameters w on a given data set (X, y), as well as its gradient with respect to w.
def cost_function(X, y, w):
    """ J, grad = cost_function(X, y, w) computes the cost of a logistic regression model
        with parameters w, and the gradient of the cost w.r.t. the parameters. """
    # Compute the cost
    p = sigmoid(X @ w)
    J = -(1/n) * (y @ np.log(p) + (1-y) @ np.log(1-p))

    # Compute the gradient
    grad = (1/n) * X.T @ (p - y)
    return J, grad
Note that we are using the vectorized forms of the cost and gradient functions that were shown earlier.
To sanity check this function, let's compute the cost and gradient of the model for some random weight vector:
w = np.random.rand(X_train.shape[1])
cost, grad = cost_function(X_train, y_train, w)

print('w:', w)
print('Cost at w:', cost)
print('Gradient at w:', grad)
The output we get is:
w: [0.5488135 0.71518937 0.60276338]
Cost at w: 2.314505839067951
Gradient at w: [0.36855061 1.86634895 1.27264487]
Gradient Descent Implementation
We now implement gradient descent in order to find the optimal w* that minimizes the cost function of the model on a given training set. The algorithm will run at most max_iter passes over the training set (5000 by default), unless the cost has not decreased by at least tol (0.0001 by default) since the previous iteration, in which case training stops immediately.
def optimize_model(X, y, alpha=0.01, max_iter=5000, tol=0.0001):
    """ Optimize the model using gradient descent.
        X, y: The training set
        alpha: The learning rate
        max_iter: The maximum number of passes over the training set (epochs)
        tol: The stopping criterion. Training will stop when (new_cost > cost - tol)
    """
    w = np.random.rand(X.shape[1])
    cost, grad = cost_function(X, y, w)

    for i in range(max_iter + 1):
        w = w - alpha * grad
        new_cost, grad = cost_function(X, y, w)
        if new_cost > cost - tol:
            print(f'Converged after {i} iterations')
            return w, new_cost
        cost = new_cost

    print('Maximum number of iterations reached')
    return w, cost
Normally at this point you would have to normalize your data set, since gradient descent does not work well with features that have different scales. In our particular data set normalization is not necessary, since the ranges of the two features are similar.
Let's now call this function to optimize our model:
opt_w, cost = optimize_model(X_train, y_train)

print('opt_w:', opt_w)
print('Cost at opt_w:', cost)
The algorithm converges after 1,413 iterations, and the optimal w* we get is:
Converged after 1413 iterations
opt_w: [ 0.28014029 0.80541854 -1.48367938]
Cost at opt_w: 0.28389717767222555
There are other optimizers you can use that are often faster than gradient descent, such as conjugate gradient (CG) and truncated Newton (TNC). See scipy.optimize.minimize for more details on how to use these optimizers.
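For example, here is a minimal sketch of how our cost_function could be plugged into scipy.optimize.minimize with the TNC solver (the argument-reordering lambda is our own wrapper, not part of the original code):

from scipy.optimize import minimize

res = minimize(
    fun=lambda w, X, y: cost_function(X, y, w),  # minimize passes the parameters first
    x0=np.random.rand(X_train.shape[1]),         # random initial weights
    args=(X_train, y_train),
    method='TNC',                                # truncated Newton
    jac=True                                     # cost_function also returns the gradient
)
print('opt_w:', res.x)
print('Cost at opt_w:', res.fun)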
Using the Model for Predictions
Now that we have found the optimal parameters of the model, we can use it for predictions.
First, let's write a function that gets a matrix of new samples X and returns their probabilities of belonging to the positive class:
def predict_prob(X, w):
    """ Return the probability that the samples in X belong to the positive class.
        X: the feature matrix (every row in X represents one sample)
        w: the learned logistic regression parameters
    """
    p = sigmoid(X @ w)
    return p
The function computes the predictions of the model by simply taking the sigmoid of Xw (which computes σ(wᵗx) for every row x of the matrix).
For example, let's find the probability that a sample located at (6, 2) belongs to the versicolor class:
predict_prob([[1, 6, 2]], opt_w)
array([0.89522808])
This sample has an 89.52% chance of being a versicolor flower. This makes sense, since the sample is located well within the area of the versicolor flowers, far from the border between the classes.
On the other hand, the probability that a sample located at (5.5, 3) belongs to the versicolor class is:
predict_prob([[1, 5.5, 3]], opt_w)
array([0.56436688])
This time the probability is much lower (only 56.44%), since this sample is close to the border between the classes.
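If we need a hard class label rather than a probability, we can threshold the predicted probability at 0.5 (a minimal sketch; the predict helper below is our own addition, not part of the original code):

def predict(X, w):
    """ Predict the class (0 or 1) of each sample in X by thresholding its predicted probability at 0.5. """
    return (predict_prob(X, w) >= 0.5).astype(int)

For example, predict([[1, 6, 2]], opt_w) returns array([1]), i.e., versicolor.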