from solutions import LogisticRegression
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
all='ignore')
np.seterr(
# make the data
= 3
p_features = make_blobs(n_samples = 500, n_features = p_features - 1, centers = [(-1, -1), (1, 1)])
X, y
= plt.scatter(X[:,0], X[:,1], c = y)
fig = plt.xlabel("Feature 1")
xlab = plt.ylabel("Feature 2") ylab
The optimization for logistic loss can be implemented by using my source code solutions.py
, which can be found HERE. In this code, loss optimization is implemented in three ways with the gradient descent: plain gradient descent, batch gradient descent, and gradient descent with momentum.
Below is our data, which is not linearly separable.
def sigmoid(z):
# the sigmoid function
return 1 / (1 + np.exp(-z))
= X.shape
n, p
# weight tilda and X tilda that has one extra term
= np.random.rand(p+1)
w_tilda = np.append(X, np.ones((X.shape[0], 1)), 1)
X_tilda
= np.dot(X_tilda, w_tilda) #inner product between X and w
y_hat = 1/n*(sigmoid(y_hat) - y)@X_tilda
deltaL deltaL
array([-0.22233893, -0.1366533 , 0.09319262])
1. Gradient Descent
The gradient descent algorithm is implemented when the LogisticRegression.fit(X, y)
is called. \(\textbf{X}\) is the feature matrix, and \(\textbf{y}\) is the target vector, as we know from the perceptron blog post. We need to find a weight vector that can minimize the logistic loss on \(\textbf{X}\) and \(\textbf{y}\) by updating the weight in a loop. At every loop:
- compute the gradient descent of the logistic loss, which is given by \[\nabla L(\textbf{w}) = \frac{1}{n} \sum^n_{i=1} (\sigma (\langle \textbf{w}, \textbf{x}_i \rangle), y_i) \textbf{w}_i, \] where \(\sigma(z)\) denotes the sigmoid function.
- update the weight until the logistic loss does not change \[\textbf{w}^{(t+1)} = \textbf{w}^{(t)} - \alpha \nabla L(\textbf{w}^{(t)})\]
# fit the model
= LogisticRegression()
LR = 10, max_epochs = 1000)
LR.fit(X, y, alpha
# inspect the fitted value of w
LR.w
def draw_line(w, x_min, x_max):
= np.linspace(x_min, x_max, 101)
x = -(w[0]*x + w[2])/w[1]
y = "black")
plt.plot(x, y, color
= plt.subplots()
figy, axy 0], X[:,1], c = y)
axy.scatter(X[:,-2, 2)
draw_line(LR.w, "Feature 1")
axy.set_xlabel("Feature 2")
axy.set_ylabel(
= plt.subplots()
figx, axv
axv.plot(LR.loss_history)"Iteration")
axv.set_xlabel("Empirical Risk")
axv.set_ylabel( plt.show()
In this case, the learning rate was set very high to 10, which caused the algorithm to converge too soon and leave the logistic loss at 0.24. As we change the learning rate to 0.1, we can see that the loss fell to 0.24.
= LogisticRegression()
LR =0.1, max_epochs=1000)
LR.fit(X, y, alpha
= plt.subplots()
figy, axy 0], X[:,1], c = y)
axy.scatter(X[:,-2, 2)
draw_line(LR.w, "Feature 1")
axy.set_xlabel("Feature 2")
axy.set_ylabel(
= plt.subplots()
figx, axv
axv.plot(LR.loss_history)"Iteration")
axv.set_xlabel("Empirical Risk")
axv.set_ylabel( plt.show()
2. Stochastic Gradient Descent
The stochastic gradient descent algorithm is implemented by the LogisticRegression.fit_stochastic(X, y)
method. In this algorithm, instead of computing the complete gradient as in part 1, we compute a stochastic gradient by picking a random subset \(S \subseteq [n] = \{1, ..., n\}\) and computing
\[\nabla_S L(\textbf{w}) = \frac{1}{|S|} \sum_{i \in S} \nabla (\sigma (\langle \textbf{w}, \textbf{x}_i \rangle), y_i)\textbf{x}_i \]
The size of \(S\) is the batch size, which we can refer to as \(k\). The batch gradient descent is performed as follows: 1. Shuffle the points randomly. 2. Pick the first \(k\) random points, compute the stochastic gradient, and then perform an update. 3. Pick the next \(k\) random points and repeat.. 4. When we have gone through all \(n\) points, reshuffle them all randomly and proceed again.
all='ignore')
np.seterr(
# make the data
= 3
p_features = make_blobs(n_samples = 200, n_features = p_features - 1, centers = [(-1, -1), (1, 1)])
X, y
= plt.scatter(X[:,0], X[:,1], c = y)
fig = plt.xlabel("Feature 1")
xlab = plt.ylabel("Feature 2") ylab
In this experiment, we can see how the stochastic gradient descent algorithm can outperform the original gradient descent method.
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 1000,
m_epochs = False,
momentum = 10,
batch_size = .1)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "stochastic gradient")
plt.plot(np.arange(num_steps)
= LogisticRegression()
LR = .1, max_epochs = 1000)
LR.fit(X, y, alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "gradient")
plt.plot(np.arange(num_steps) "Iteration")
plt.xlabel("Empirical Risk")
plt.ylabel(= plt.legend() legend
The stochastic gradient method used only 150 loops, while the gradient method still had not converged at loop 1000.
3. Stochastic Gradient Descent with Momentum
This only difference in this method comes from the momentum factor \(\beta\). The weight update is given by
\[\textbf{w}^{(t+1)} = \textbf{w}^{(t)} - \alpha \nabla L(\textbf{w}^{(t)}) + \beta (\textbf{w}^{(t)}-\textbf{w}^{(t-1)})\]
The idea here is that if the previous weight update was good, we may want to continue moving along this direction. To test how the momentum can help with accelerating convergence, we choose a small sample size of 150, distributed among 10 features.
all='ignore')
np.seterr(
# make the data
= 11
p_features = make_blobs(n_samples = 150, n_features = p_features - 1, centers = [(-1, -1), (1, 1)])
X, y
= plt.scatter(X[:,0], X[:,1], c = y)
fig = plt.xlabel("Feature 1")
xlab = plt.ylabel("Feature 2") ylab
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 100,
m_epochs = True,
momentum = 10,
batch_size = .1)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "stochastic gradient (momentum)")
plt.plot(np.arange(num_steps)
= LogisticRegression()
LR
LR.fit_stochastic(X, y, = 100,
m_epochs = False,
momentum = 10,
batch_size = .1)
alpha
= len(LR.loss_history)
num_steps + 1, LR.loss_history, label = "stochastic gradient")
plt.plot(np.arange(num_steps) = plt.legend()
legend "Iteration")
plt.xlabel("Empirical Risk")
plt.ylabel( plt.show()
The momentum algorithm converges much faster than the other method.