The following plot shows the log loss when *y* = 1:

The log loss equals 0 only in the case of a perfect prediction (*p* = 1 and *y* = 1, or *p* = 0 and *y* = 0), and approaches infinity as the prediction gets worse (i.e., when *y* = 1 and *p* → 0, or *y* = 0 and *p* → 1).
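To make this behaviour concrete, here is a minimal sketch that evaluates the log loss of a single prediction. The helper name `log_loss_single` is hypothetical (it is not part of the article's code), and it assumes the standard definition −[*y* log *p* + (1 − *y*) log(1 − *p*)]:

```python
import math

def log_loss_single(y, p):
    """Log loss of one prediction p (0 < p < 1) for a true label y in {0, 1}.
    Hypothetical helper, written for illustration only."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# For y = 1 the loss keeps growing as p moves toward 0
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:<4}  loss = {log_loss_single(1, p):.4f}")
```

The loss is close to 0 for a confident correct prediction and becomes large for a confident wrong one.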

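The **cost function** calculates the average loss over the whole data set:

$$J(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

The cost function can be written in vectorized form as follows:

$$J(\mathbf{w}) = -\frac{1}{n}\left[\mathbf{y}^t \log \mathbf{p} + (1 - \mathbf{y})^t \log(1 - \mathbf{p})\right]$$

where **y** = (*y*₁, …, *yₙ*) is a vector that contains all the labels of the training samples, **p** = (*p*₁, …, *pₙ*) is a vector that contains all the predicted probabilities of the model for the training samples, and the logarithm is applied element-wise.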
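As a quick check that the vectorized form is just a compact way of writing the average of the per-sample losses, the following sketch computes the cost both ways. The labels and probabilities are made up for illustration:

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.8, 0.3])
n = len(y)

# Average of the per-sample log losses (explicit loop)
loop_cost = -sum(yi * np.log(pi) + (1 - yi) * np.log(1 - pi)
                 for yi, pi in zip(y, p)) / n

# Vectorized form: the two sums become dot products
vec_cost = -(y @ np.log(p) + (1 - y) @ np.log(1 - p)) / n

print(loop_cost, vec_cost)
```

Both expressions should agree up to floating-point rounding; the vectorized one avoids the Python-level loop.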

This cost function is convex, i.e., any local minimum is also a global minimum. However, there is no closed-form solution for finding the optimal **w*** (due to the non-linearities introduced by the log function). Therefore, we need to use iterative optimization methods such as gradient descent in order to find the minimum.

Gradient descent is an iterative approach for finding a minimum of a function, where we take small steps in the opposite direction of the gradient in order to get closer to the minimum:

In order to use gradient descent to find the minimum of the log loss cost, we need to compute the partial derivatives of *J*(**w**) with respect to each one of the weights.

The partial derivative of *J*(**w**) with respect to any of the weights *wⱼ* is:

$$\frac{\partial J}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,x_{ij}$$

where *xᵢⱼ* is the value of feature *j* in sample *i*.

**Proof**:
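One way to derive this result is sketched below. It assumes the standard setup, namely that *pᵢ* = *σ*(*zᵢ*) with *zᵢ* = **w***ᵗ***x**ᵢ, and that the sigmoid satisfies *σ*′(*z*) = *σ*(*z*)(1 − *σ*(*z*)), so that ∂*pᵢ*/∂*wⱼ* = *pᵢ*(1 − *pᵢ*)*xᵢⱼ*. Applying the chain rule to each term of the cost:

$$
\begin{aligned}
\frac{\partial J}{\partial w_j}
&= -\frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i}{p_i} - \frac{1-y_i}{1-p_i}\right]\frac{\partial p_i}{\partial w_j} \\
&= -\frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i}{p_i} - \frac{1-y_i}{1-p_i}\right] p_i(1-p_i)\,x_{ij} \\
&= -\frac{1}{n}\sum_{i=1}^{n}\left[y_i(1-p_i) - (1-y_i)\,p_i\right]x_{ij} \\
&= \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,x_{ij}
\end{aligned}
$$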

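Thus, the gradient vector can be written in vectorized form as follows:

$$\nabla J(\mathbf{w}) = \frac{1}{n} X^t(\mathbf{p} - \mathbf{y})$$

And the gradient descent update rule is:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla J(\mathbf{w})$$

where *α* is a learning rate that controls the step size (0 < *α* < 1).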
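The update rule can be illustrated on a much simpler problem than logistic regression. The sketch below applies the same step to a toy one-dimensional function; the function name `gradient_descent_1d` and the objective are assumptions chosen for illustration, not part of the article's code:

```python
def gradient_descent_1d(grad, x0, alpha=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * grad(x)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2(x - 3)
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

With this learning rate, each step shrinks the distance to the minimizer *x* = 3 by a constant factor, so the iterate ends up very close to 3.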

Note that whenever you use gradient descent, you must make sure that your data set is **normalized** (otherwise gradient descent may take steps of different sizes in different directions, which will make it unstable).
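One common way to normalize features is standardization (zero mean, unit variance per feature). The following is a minimal sketch under that assumption; the helper name `standardize` and the sample values are hypothetical. It should be applied before a constant bias column is added, since a constant column has zero standard deviation:

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale every feature to zero mean and unit variance, using statistics
    computed on the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / std, (X_test - mean) / std

# Made-up features with very different scales
X_train = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])
X_test = np.array([[2.5, 1500.0]])
X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```

Reusing the training-set statistics for the test set keeps information about the test samples out of the preprocessing step.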

We will now implement the logistic regression model in Python from scratch, including its cost function and gradient computation, optimizing the model using gradient descent, evaluation of the model, and plotting the final decision boundary.

For the demonstration we will use the Iris data set (BSD license). The original data set contains 150 samples of Iris flowers that belong to one of three species (Setosa, Versicolor and Virginica). We will make it into a binary classification problem by using only the first two types of flowers (Setosa and Versicolor). In addition, we will use only the first two features of each flower (sepal length and sepal width).

## Loading the Data Set

Let’s first import the required libraries and fix the random seed in order to get reproducible results:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)
```

Next, we load the data set:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]  # Take only the first two features
y = iris.target

# Take only the setosa and versicolor flowers
X = X[(y == 0) | (y == 1)]
y = y[(y == 0) | (y == 1)]
```

Let’s plot the data:

```python
def plot_data(X, y):
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y], style=iris.target_names[y],
                    palette=['r', 'b'], markers=('s', 'o'), edgecolor='k')
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.legend()

plot_data(X, y)
```

As can be seen, the data set is linearly separable, therefore logistic regression should be able to find the boundary between the two classes.

Next, we need to add a column of ones to the features matrix *X* in order to represent the bias (*w*₀):

```python
# Add a column for the bias
n = X.shape[0]
X_with_bias = np.hstack((np.ones((n, 1)), X))
```

We now split the data set into training and test sets:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, random_state=0)
```

## Model Implementation

We are now ready to implement the logistic regression model. We start by defining a helper function to compute the sigmoid function:

```python
def sigmoid(z):
    """ Compute the sigmoid of z (z can be a scalar or a vector). """
    z = np.array(z)
    return 1 / (1 + np.exp(-z))
```
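For very negative inputs, `np.exp(-z)` can overflow and NumPy emits a `RuntimeWarning`, although the result still evaluates to 0. If that matters, one possible alternative is to use a different but mathematically equivalent expression for negative inputs. The function name `stable_sigmoid` below is hypothetical and not part of the article's code:

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number.
    Always returns a 1-D array, even for scalar input."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)), where exp(z) < 1
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

probs = stable_sigmoid([-1000, 0, 1000])
print(probs)
```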

Next, we implement the cost function that returns the cost of a logistic regression model with parameters **w** on a given data set (*X*, **y**), and also its gradient with respect to **w**.

```python
def cost_function(X, y, w):
    """ J, grad = cost_function(X, y, w) computes the cost of a logistic regression model
        with parameters w and the gradient of the cost w.r.t. the parameters. """
    # Note: n is the global sample count defined when the bias column was added
    # (the size of the full data set), not len(y). On a subset such as X_train,
    # J and grad are therefore scaled by the constant factor len(y) / n.

    # Compute the cost
    p = sigmoid(X @ w)
    J = -(1/n) * (y @ np.log(p) + (1-y) @ np.log(1-p))

    # Compute the gradient
    grad = (1/n) * X.T @ (p - y)
    return J, grad
```

Note that we are using the vectorized forms of the cost and the gradient functions that were shown previously.

To sanity check this function, let’s compute the cost and gradient of the model on some random weight vector:

```python
w = np.random.rand(X_train.shape[1])
cost, grad = cost_function(X_train, y_train, w)

print('w:', w)
print('Cost at w:', cost)
print('Gradient at w (zeros):', grad)
```

The output we get is:

```
w: [0.5488135 0.71518937 0.60276338]
Cost at w: 2.314505839067951
Gradient at w (zeros): [0.36855061 1.86634895 1.27264487]
```
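A stronger sanity check than printing the values is to compare the analytic gradient against a finite-difference approximation of the cost. The sketch below is self-contained and uses synthetic data rather than the Iris set. The names `cost_and_grad` and `numerical_grad` are hypothetical, and here the cost is averaged over the rows actually passed in:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_and_grad(X, y, w):
    """Log loss and its analytic gradient, averaged over the rows of X."""
    n = len(y)
    p = sigmoid(X @ w)
    J = -(y @ np.log(p) + (1 - y) @ np.log(1 - p)) / n
    grad = X.T @ (p - y) / n
    return J, grad

def numerical_grad(X, y, w, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (cost_and_grad(X, y, w + step)[0]
                - cost_and_grad(X, y, w - step)[0]) / (2 * eps)
    return g

rng = np.random.default_rng(0)
# Synthetic data with a leading bias column of ones
X = np.hstack((np.ones((20, 1)), rng.normal(size=(20, 2))))
y = rng.integers(0, 2, size=20)
w = rng.random(3)

_, analytic = cost_and_grad(X, y, w)
numeric = numerical_grad(X, y, w)
print(np.max(np.abs(analytic - numeric)))
```

If the gradient formula is implemented correctly, the largest difference between the two vectors should be tiny compared with the gradient itself.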

## Gradient Descent Implementation

We now implement gradient descent in order to find the optimal **w*** that minimizes the cost function of the model on a given training set. The algorithm will run at most *max_iter* passes over the training set (defaults to 5000), unless the cost has not decreased by at least *tol* (defaults to 0.0001) since the previous iteration, in which case the training stops immediately.

```python
def optimize_model(X, y, alpha=0.01, max_iter=5000, tol=0.0001):
    """ Optimize the model using gradient descent.
        X, y: The training set
        alpha: The learning rate
        max_iter: The maximum number of passes over the training set (epochs)
        tol: The stopping criterion. Training will stop when (new_cost > cost - tol)
    """
    w = np.random.rand(X.shape[1])
    cost, grad = cost_function(X, y, w)

    for i in range(max_iter + 1):
        w = w - alpha * grad
        new_cost, grad = cost_function(X, y, w)

        if new_cost > cost - tol:
            print(f'Converged after {i} iterations')
            return w, new_cost
        cost = new_cost

    print('Maximum number of iterations reached')
    return w, cost
```

Normally at this point you would have to normalize your data set, since gradient descent does not work well with features that have different scales. In our specific data set normalization is not necessary, since the ranges of the two features are similar.

Let’s now call this function to optimize our model:

```python
opt_w, cost = optimize_model(X_train, y_train)

print('opt_w:', opt_w)
print('Cost at opt_w:', cost)
```

The algorithm converges after 1,413 iterations and the optimal **w*** we get is:

```
Converged after 1413 iterations
opt_w: [ 0.28014029 0.80541854 -1.48367938]
Cost at opt_w: 0.28389717767222555
```

There are other optimizers you can use that are often faster than gradient descent, such as conjugate gradient (CG) and truncated Newton (TNC). See `scipy.optimize.minimize` for more details on how to use these optimizers.
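As a rough sketch of what that could look like, the example below minimizes the log loss with `scipy.optimize.minimize` and the TNC method. It is self-contained and uses synthetic, overlapping two-class data instead of the Iris set, so the minimum is finite. The function name `cost_and_grad` and the data are assumptions made for illustration, and the cost is written in an equivalent, numerically stable form:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def cost_and_grad(w, X, y):
    """Log loss and gradient. minimize expects the parameters as the first argument."""
    z = X @ w
    J = np.mean(np.logaddexp(0, z) - y * z)  # equals the log loss, written stably
    grad = X.T @ (expit(z) - y) / len(y)
    return J, grad

# Synthetic, overlapping two-class data (not the Iris set)
rng = np.random.default_rng(0)
X_raw = np.vstack((rng.normal(0, 1, size=(50, 2)),
                   rng.normal(1.5, 1, size=(50, 2))))
X = np.hstack((np.ones((100, 1)), X_raw))  # add the bias column
y = np.repeat([0, 1], 50)

w0 = np.zeros(3)
# jac=True tells minimize that the function returns (cost, gradient)
result = minimize(cost_and_grad, w0, args=(X, y), method='TNC', jac=True)
print(result.x, result.fun, result.success)
```

Passing `method='CG'` instead selects the conjugate gradient optimizer with the same function signature.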

## Using the Model for Predictions

Now that we have found the optimal parameters of the model, we can use it for predictions.

First, let’s write a function that gets a matrix of new samples *X* and returns their probabilities of belonging to the positive class:

```python
def predict_prob(X, w):
    """ Return the probability that samples in X belong to the positive class
        X: the feature matrix (every row in X represents one sample)
        w: the learned logistic regression parameters
    """
    p = sigmoid(X @ w)
    return p
```

The function computes the predictions of the model by simply taking the sigmoid of *X***w** (which computes *σ*(**w***ᵗ***x**) for every row **x** in the matrix).

For example, let’s find out the probability that a sample located at (6, 2) belongs to the versicolor class:

```python
predict_prob([[1, 6, 2]], opt_w)
```

```
array([0.89522808])
```

This sample has an 89.52% chance of being a versicolor flower. This makes sense, since the sample is located well within the area of the versicolor flowers, far from the border between the classes.

On the other hand, the probability that a sample located at (5.5, 3) belongs to the versicolor class is:

```python
predict_prob([[1, 5.5, 3]], opt_w)
```

```
array([0.56436688])
```

This time the probability is much lower (only 56.44%), since this sample is close to the border between the classes.
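The link between the probability and the distance from the border can be made explicit by looking at *z* = **w***ᵗ***x** before the sigmoid is applied: *z* = 0 corresponds to a probability of exactly 0.5, and larger positive values push the probability toward 1. The sketch below recomputes both predictions from the weights printed above, which are copied here as rounded constants. The variable `boundary_width` is a hypothetical name for the sepal width at which the probability is 0.5 for a sepal length of 5.5:

```python
import numpy as np

# Weights copied (rounded) from the optimizer output shown earlier
opt_w = np.array([0.28014029, 0.80541854, -1.48367938])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for x in ([1, 6, 2], [1, 5.5, 3]):
    z = np.dot(opt_w, x)  # w^t x; z = 0 corresponds to p = 0.5
    print(x[1:], 'z =', round(z, 4), 'p =', round(sigmoid(z), 4))

# Solve w0 + w1 * 5.5 + w2 * width = 0 for the width
boundary_width = -(opt_w[0] + opt_w[1] * 5.5) / opt_w[2]
print('boundary width at length 5.5:', boundary_width)
```

The second sample has a much smaller *z* than the first, which is why its probability sits only slightly above 0.5.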