LASSO Proximal Gradient Descent formula derivation and code

tags: Machine learning, Python, optimization

LASSO by Proximal Gradient Descent

Preparation:

from itertools import cycle
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

X = np.random.randn(100, 10)
y = np.dot(X, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # true coefficients are 1..10

Proximal Gradient Descent Framework

  1. Randomly set $\beta^{(0)}$ for iteration 0.
  2. For the $k$th iteration:
    ----Compute the gradient $\nabla f(\beta^{(k-1)})$
    ----Set $z = \beta^{(k-1)} - \frac{1}{L} \nabla f(\beta^{(k-1)})$
    ----Update $\beta^{(k)} = \text{sgn}(z)\cdot \max[|z|-\frac{\lambda}{L},\; 0]$ (applied elementwise)
    ----Check convergence: if converged, end the algorithm; otherwise continue updating
    Endfor

Here $f(\beta) = \frac{1}{2N}(Y-X\beta)^T (Y-X\beta)$ and $\nabla f(\beta) = -\frac{1}{N}X^T(Y-X\beta)$,
where the sizes of $X$, $Y$, and $\beta$ are $N\times p$, $N\times 1$, and $p\times 1$, i.e., $N$ samples and $p$ features. The parameter $L \ge 1$ can be chosen freely, and $\frac{1}{L}$ can be regarded as the step size.
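
As a quick illustration, one iteration of this framework fits in a few lines of NumPy. This is a minimal sketch, assuming the X and y from the preparation block; pgd_step is a hypothetical name used only here:

def pgd_step(X, y, beta, lmbd, L):
    # one proximal gradient iteration for LASSO:
    # gradient step on f, followed by soft-thresholding
    N = X.shape[0]
    grad = -np.dot(X.T, y - np.dot(X, beta)) / N  # gradient of f at beta
    z = beta - grad / L                           # gradient step
    return np.sign(z) * np.maximum(np.abs(z) - lmbd / L, 0)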

Proximal Gradient Descent Details

Consider the optimization problem
$$\min_x \{f(x) + \lambda \cdot g(x)\},$$
where $x\in \mathbb{R}^{p\times 1}$ and $f(x) \in \mathbb{R}$. Here $f(x)$ is a differentiable convex function, and $g(x)$ is convex but may not be differentiable. For LASSO, $f$ is the least-squares loss and $g(x) = \left\lVert x \right\rVert_1$.

For $f(x)$, assume its gradient is Lipschitz continuous: for all $x, y$ there exists a constant $L$ such that
$$\left\lVert \nabla f(y) - \nabla f(x) \right\rVert \le L \left\lVert y-x \right\rVert.$$
Then this problem can be solved using proximal gradient descent.
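
For the least-squares $f(\beta)$ used here, the gradient is Lipschitz with constant equal to the largest eigenvalue of $X^TX/N$, so a valid (non-backtracking) choice of $L$ can be computed directly. A one-line sketch, assuming the X from the preparation block:

n_samples = X.shape[0]
# largest eigenvalue of X^T X / N = squared spectral norm of X over N
L_valid = np.linalg.norm(X, 2) ** 2 / n_samples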

Denote $x^{(k)}$ as the $k$th update of $x$. Around $x^{(k)}$, $f(x)$ can be approximated by a quadratic function of $x$ and $x^{(k)}$:
$$\hat{f}(x,x^{(k)}) = f(x^{(k)}) + \nabla f^T(x^{(k)}) (x-x^{(k)}) + \frac{L}{2} \left\lVert x - x^{(k)} \right\rVert^2 = \frac{L}{2}\left\lVert x - \left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right)\right\rVert^2 + \text{CONST},$$
where $\text{CONST}$ does not depend on $x$ and can be ignored. By the Lipschitz condition on the gradient, $\hat{f}(x,x^{(k)}) \ge f(x)$ for all $x$ (the descent lemma), so minimizing $\hat{f}$ also decreases $f$.
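
The second equality above follows by completing the square in $x$:
$$\hat{f}(x,x^{(k)}) = \frac{L}{2}\left\lVert x \right\rVert^2 - L\left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right)^T x + \text{CONST} = \frac{L}{2}\left\lVert x - \left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right)\right\rVert^2 + \text{CONST},$$
where all terms that do not involve $x$ are absorbed into $\text{CONST}$.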

The exact update for $x$ at iteration $k+1$ would be
$$x^{(k+1)} = \text{argmin}_x\{ f(x) + \lambda \cdot g(x)\}.$$
In the proximal gradient method, we instead use
$$x^{(k+1)} = \text{argmin}_x\{ \hat{f}(x,x^{(k)}) +\lambda \cdot g(x) \} = \text{argmin}_x\left\{ \frac{L}{2}\left\lVert x - \left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right)\right\rVert^2 +\lambda \cdot g(x) \right\},$$
where $\text{CONST}$ is dropped since it does not affect the minimizer.
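
The minimization above is exactly the proximal operator of $g$ evaluated at the gradient-step point, which is where the method gets its name:
$$x^{(k+1)} = \text{prox}_{\frac{\lambda}{L} g}\left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right), \quad \text{where} \quad \text{prox}_{h}(z) = \text{argmin}_x\left\{\frac{1}{2}\left\lVert x - z \right\rVert^2 + h(x)\right\}.$$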

Given $g(x) = \left\lVert x \right\rVert_1$, and letting $z = x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})$, we have
$$x^{(k+1)} = \text{argmin}_x\left\{ \frac{L}{2} \left\lVert x - z \right\rVert^2 + \lambda \left\lVert x \right\rVert_1 \right\}.$$

To solve the equation above, let $F(x) = \frac{L}{2} \sum_{j=1}^p(x_j - z_j)^2 + \lambda\sum_{j=1}^p|x_j|$. Since $F$ separates across coordinates, each $x_j$ can be optimized independently. For $x_j \ne 0$, the optimal $x^*_j$ satisfies
$$\frac{\partial F(x)}{\partial x_j} = L(x_j - z_j) + \lambda \cdot \text{sgn}(x_j)=0,$$
which gives
$$z_j = x_j + \frac{\lambda}{L} \text{sgn}(x_j).$$
(When $|z_j| \le \frac{\lambda}{L}$, no nonzero $x_j$ satisfies this equation, and the subgradient condition shows the optimum is $x^*_j = 0$.)

The goal is to express $x^*_j$ as a function of $z_j$. This can be done by swapping the $x$-$z$ axes of the plot of $z_j = x_j + \frac{\lambda}{L} \text{sgn}(x_j)$:
[figure: the soft-thresholding function]

Then $x^*_j$ can be expressed as
$$x^*_j = \text{sgn}(z_j)\left(|z_j| - \frac{\lambda}{L}\right)_+ = \text{sgn}(z_j) \cdot \max\left\{|z_j| - \frac{\lambda}{L},\; 0\right\}.$$
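
This is the well-known soft-thresholding operator. A minimal NumPy sketch (the function name soft_threshold is mine, not from the original demo):

def soft_threshold(z, thres):
    # elementwise: shrink z toward zero by thres, and clip anything
    # within [-thres, thres] to exactly zero
    return np.sign(z) * np.maximum(np.abs(z) - thres, 0)

For example, soft_threshold(np.array([-3., -0.5, 0.2, 2.]), 1.0) returns array([-2., -0., 0., 1.]).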

Simplified Code

The code below is a Python port of the proximal gradient descent solver from the Stanford MATLAB LASSO demo.

In the original code, to make sure the differentiable part of the objective function, $f(\beta) = \frac{1}{2N} \left\lVert Y - X\beta \right\rVert^2$, decreases after each weight update, there is a check that $f$ at the new weights does not exceed its quadratic upper bound at the previous $\beta$ (if it does, $L$ is increased and the step is retried):
$$\text{Check whether } \frac{1}{2N} \left\lVert Y - X\beta^{(k)} \right\rVert^2 \le \frac{1}{2N} \left\lVert Y - X\beta^{(k-1)} \right\rVert^2 + \nabla f^T(\beta^{(k-1)}) (\beta^{(k)} - \beta^{(k-1)}) + \frac{L}{2} \left\lVert \beta^{(k)} - \beta^{(k-1)} \right\rVert^2.$$

def f(X, y, w):
    # differentiable part of the objective: 1/(2N) * ||y - Xw||^2
    n_samples, _ = X.shape
    tmp = y - np.dot(X, w)
    return np.dot(tmp, tmp) / (2 * n_samples)

def objective(X, y, w, lmbd):
    # full LASSO objective: f(w) + lambda * ||w||_1
    n_samples, _ = X.shape
    tmp = y - np.dot(X, w)
    return np.dot(tmp, tmp) / (2 * n_samples) + lmbd * np.sum(np.abs(w))

def LASSO_proximal_gradient(X, y, lmbd, L=1, max_iter=1000, tol=1e-4):
    shrink = 0.5  # backtracking factor: L grows by 1/shrink until the bound holds
    n_samples, n_features = X.shape
    w = np.zeros(n_features, dtype=float)
    xty_N = np.dot(X.T, y) / n_samples
    xtx_N = np.dot(X.T, X) / n_samples
    h_prox_optval = np.empty(max_iter, dtype=float)
    for k in range(max_iter):
        while True:
            grad_w = np.dot(xtx_N, w) - xty_N  # gradient of f at current w
            z = w - grad_w / L                 # gradient step
            w_tmp = np.sign(z) * np.maximum(np.abs(z) - lmbd / L, 0)  # soft-threshold
            w_diff = w_tmp - w
            # backtracking check: accept the step only if f stays below its
            # quadratic upper bound; otherwise increase L (smaller step)
            if f(X, y, w_tmp) <= f(X, y, w) + np.dot(grad_w, w_diff) + L / 2 * np.sum(w_diff ** 2):
                break
            L = L / shrink
        w = w_tmp

        h_prox_optval[k] = objective(X, y, w, lmbd)
        if k > 0 and abs(h_prox_optval[k] - h_prox_optval[k - 1]) < tol:
            break
    return w, h_prox_optval[:k + 1]
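
As a quick sanity check on the synthetic data from the preparation block (the value lmbd=0.1 is an arbitrary illustrative choice):

w_hat, obj_vals = LASSO_proximal_gradient(X, y, lmbd=0.1)
print(np.round(w_hat, 2))  # should be close to the true coefficients 1..10
print("stopped after", len(obj_vals), "iterations")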

def pgd_lasso_path(X, y, lmbds):
    n_samples, n_features = X.shape
    pgd_coefs = np.empty((n_features, len(lmbds)), dtype=float)
    for i, lmbd in enumerate(lmbds):
        w, _ = LASSO_proximal_gradient(X, y, lmbd)
        pgd_coefs[:,i] = w
    return lmbds, pgd_coefs

Experiment:

# assumed lambda grid; its definition is not shown in the original post
my_lambdas = np.logspace(-3, 0, 50)

lmbds, pgd_coefs = pgd_lasso_path(X, y, my_lambdas)
# Display results
plt.figure(1)
colors = cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
neg_log_lambdas = -np.log10(lmbds)
for coef_l, c in zip(pgd_coefs, colors):
    plt.plot(neg_log_lambdas, coef_l, c=c)

plt.xlabel(r'-Log($\lambda$)')
plt.ylabel('coefficients')
plt.title('PGD Lasso Paths')
plt.axis('tight')

[figure: PGD lasso paths]
The result looks similar to the one obtained with the coordinate descent method.

Speed Comparison

print("Coordinate descent method from scikit-learn:")
%timeit lambdas_lasso, coefs_lasso, _ = lasso_path(X, y, eps, fit_intercept=False)

print("PGD method:")
%timeit lmbds, pgd_coefs = pgd_lasso_path(X, y, my_lambdas)

Results:

Coordinate descent method from scikit-learn:
4.88 ms ± 97 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

PGD method:
778 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The pure-Python PGD implementation is faster than the pure-Python coordinate descent method from my previous blog, but much slower than scikit-learn's Cython coordinate descent implementation.
