LASSO Proximal Gradient Descent formula derivation and code

tags: Machine learning, Python, optimization

LASSO by Proximal Gradient Descent

Preparation:

from itertools import cycle
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

X = np.random.randn(100, 10)
y = np.dot(X, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # true coefficients are 1..10

Proximal Gradient Descent Framework

  1. Randomly set $\beta^{(0)}$ for iteration 0.
  2. For the $k$th iteration:
    ----Compute the gradient $\nabla f(\beta^{(k-1)})$
    ----Set $z = \beta^{(k-1)} - \frac{1}{L} \nabla f(\beta^{(k-1)})$
    ----Update $\beta^{(k)} = \text{sgn}(z)\cdot \max[|z|-\frac{\lambda}{L},\; 0]$ (applied elementwise)
    ----Check convergence: if converged, end the algorithm; otherwise continue updating
    Endfor

Here $f(\beta) = \frac{1}{2N}(Y-X\beta)^T (Y-X\beta)$ and $\nabla f(\beta) = -\frac{1}{N}X^T(Y-X\beta)$,
where the sizes of $X$, $Y$, and $\beta$ are $N\times p$, $N\times 1$, and $p\times 1$, i.e., $N$ samples and $p$ features. The parameter $L \ge 1$ can be chosen freely, and $\frac{1}{L}$ can be regarded as the step size.
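
As a quick illustration, one iteration of this framework fits in a few lines of NumPy. This is a minimal sketch, assuming the X and y from the preparation block; pgd_step is a hypothetical name used only here:

def pgd_step(X, y, beta, lmbd, L):
    # one proximal gradient iteration for LASSO:
    # gradient step on f, followed by soft-thresholding
    N = X.shape[0]
    grad = -np.dot(X.T, y - np.dot(X, beta)) / N  # gradient of f at beta
    z = beta - grad / L                           # gradient step
    return np.sign(z) * np.maximum(np.abs(z) - lmbd / L, 0)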

Proximal Gradient Descent Details

Consider the optimization problem
$$\min_x \{f(x) + \lambda \cdot g(x)\},$$
where $x\in \mathbb{R}^{p\times 1}$ and $f(x) \in \mathbb{R}$. Here $f(x)$ is a differentiable convex function, and $g(x)$ is convex but may not be differentiable. For LASSO, $f$ is the least-squares loss and $g(x) = \left\lVert x \right\rVert_1$.

For $f(x)$, assume its gradient is Lipschitz continuous: for all $x, y$ there exists a constant $L$ such that
$$\left\lVert \nabla f(y) - \nabla f(x) \right\rVert \le L \left\lVert y-x \right\rVert.$$
Then this problem can be solved using proximal gradient descent.
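
For the least-squares $f(\beta)$ used here, the gradient is Lipschitz with constant equal to the largest eigenvalue of $X^TX/N$, so a valid (non-backtracking) choice of $L$ can be computed directly. A one-line sketch, assuming the X from the preparation block:

n_samples = X.shape[0]
# largest eigenvalue of X^T X / N = squared spectral norm of X over N
L_valid = np.linalg.norm(X, 2) ** 2 / n_samples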

Denote $x^{(k)}$ as the $k$th update of $x$. Around $x^{(k)}$, $f(x)$ can be approximated by a quadratic function of $x$ and $x^{(k)}$:
$$\hat{f}(x,x^{(k)}) = f(x^{(k)}) + \nabla f^T(x^{(k)}) (x-x^{(k)}) + \frac{L}{2} \left\lVert x - x^{(k)} \right\rVert^2 = \frac{L}{2}\left\lVert x - \left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right)\right\rVert^2 + \text{CONST},$$
where $\text{CONST}$ does not depend on $x$ and can be ignored. By the Lipschitz condition on the gradient, $\hat{f}(x,x^{(k)}) \ge f(x)$ for all $x$ (the descent lemma), so minimizing $\hat{f}$ also decreases $f$.
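
The second equality above follows by completing the square in $x$:
$$\hat{f}(x,x^{(k)}) = \frac{L}{2}\left\lVert x \right\rVert^2 - L\left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right)^T x + \text{CONST} = \frac{L}{2}\left\lVert x - \left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right)\right\rVert^2 + \text{CONST},$$
where all terms that do not involve $x$ are absorbed into $\text{CONST}$.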

The exact update for $x$ at iteration $k+1$ would be
$$x^{(k+1)} = \text{argmin}_x\{ f(x) + \lambda \cdot g(x)\}.$$
In the proximal gradient method, we instead use
$$x^{(k+1)} = \text{argmin}_x\{ \hat{f}(x,x^{(k)}) +\lambda \cdot g(x) \} = \text{argmin}_x\left\{ \frac{L}{2}\left\lVert x - \left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right)\right\rVert^2 +\lambda \cdot g(x) \right\},$$
where $\text{CONST}$ is dropped since it does not affect the minimizer.
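
The minimization above is exactly the proximal operator of $g$ evaluated at the gradient-step point, which is where the method gets its name:
$$x^{(k+1)} = \text{prox}_{\frac{\lambda}{L} g}\left(x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})\right), \quad \text{where} \quad \text{prox}_{h}(z) = \text{argmin}_x\left\{\frac{1}{2}\left\lVert x - z \right\rVert^2 + h(x)\right\}.$$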

Given $g(x) = \left\lVert x \right\rVert_1$, and letting $z = x^{(k)} - \frac{1}{L} \nabla f(x^{(k)})$, we have
$$x^{(k+1)} = \text{argmin}_x\left\{ \frac{L}{2} \left\lVert x - z \right\rVert^2 + \lambda \left\lVert x \right\rVert_1 \right\}.$$

To solve the equation above, let $F(x) = \frac{L}{2} \sum_{j=1}^p(x_j - z_j)^2 + \lambda\sum_{j=1}^p|x_j|$. Since $F$ separates across coordinates, each $x_j$ can be optimized independently. For $x_j \ne 0$, the optimal $x^*_j$ satisfies
$$\frac{\partial F(x)}{\partial x_j} = L(x_j - z_j) + \lambda \cdot \text{sgn}(x_j)=0,$$
which gives
$$z_j = x_j + \frac{\lambda}{L} \text{sgn}(x_j).$$
(When $|z_j| \le \frac{\lambda}{L}$, no nonzero $x_j$ satisfies this equation, and the subgradient condition shows the optimum is $x^*_j = 0$.)

The goal is to express $x^*_j$ as a function of $z_j$. This can be done by swapping the $x$-$z$ axes of the plot of $z_j = x_j + \frac{\lambda}{L} \text{sgn}(x_j)$:
[figure: the soft-thresholding function]

Then $x^*_j$ can be expressed as
$$x^*_j = \text{sgn}(z_j)\left(|z_j| - \frac{\lambda}{L}\right)_+ = \text{sgn}(z_j) \cdot \max\left\{|z_j| - \frac{\lambda}{L},\; 0\right\}.$$
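
This is the well-known soft-thresholding operator. A minimal NumPy sketch (the function name soft_threshold is mine, not from the original demo):

def soft_threshold(z, thres):
    # elementwise: shrink z toward zero by thres, and clip anything
    # within [-thres, thres] to exactly zero
    return np.sign(z) * np.maximum(np.abs(z) - thres, 0)

For example, soft_threshold(np.array([-3., -0.5, 0.2, 2.]), 1.0) returns array([-2., -0., 0., 1.]).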

Simplified Code

The code below is a Python port of the proximal gradient descent solver from the Stanford MATLAB LASSO demo.

In the original code, to make sure the differentiable part of the objective function, $f(\beta) = \frac{1}{2N} \left\lVert Y - X\beta \right\rVert^2$, decreases after each weight update, there is a check that $f$ at the new weights does not exceed its quadratic upper bound at the previous $\beta$ (if it does, $L$ is increased and the step is retried):
$$\text{Check whether } \frac{1}{2N} \left\lVert Y - X\beta^{(k)} \right\rVert^2 \le \frac{1}{2N} \left\lVert Y - X\beta^{(k-1)} \right\rVert^2 + \nabla f^T(\beta^{(k-1)}) (\beta^{(k)} - \beta^{(k-1)}) + \frac{L}{2} \left\lVert \beta^{(k)} - \beta^{(k-1)} \right\rVert^2.$$

def f(X, y, w):
    # differentiable part of the objective: 1/(2N) * ||y - Xw||^2
    n_samples, _ = X.shape
    tmp = y - np.dot(X, w)
    return np.dot(tmp, tmp) / (2 * n_samples)

def objective(X, y, w, lmbd):
    # full LASSO objective: f(w) + lambda * ||w||_1
    n_samples, _ = X.shape
    tmp = y - np.dot(X, w)
    return np.dot(tmp, tmp) / (2 * n_samples) + lmbd * np.sum(np.abs(w))

def LASSO_proximal_gradient(X, y, lmbd, L=1, max_iter=1000, tol=1e-4):
    shrink = 0.5  # backtracking factor: L grows by 1/shrink until the bound holds
    n_samples, n_features = X.shape
    w = np.zeros(n_features, dtype=float)
    xty_N = np.dot(X.T, y) / n_samples
    xtx_N = np.dot(X.T, X) / n_samples
    h_prox_optval = np.empty(max_iter, dtype=float)
    for k in range(max_iter):
        while True:
            grad_w = np.dot(xtx_N, w) - xty_N  # gradient of f at current w
            z = w - grad_w / L                 # gradient step
            w_tmp = np.sign(z) * np.maximum(np.abs(z) - lmbd / L, 0)  # soft-threshold
            w_diff = w_tmp - w
            # backtracking check: accept the step only if f stays below its
            # quadratic upper bound; otherwise increase L (smaller step)
            if f(X, y, w_tmp) <= f(X, y, w) + np.dot(grad_w, w_diff) + L / 2 * np.sum(w_diff ** 2):
                break
            L = L / shrink
        w = w_tmp

        h_prox_optval[k] = objective(X, y, w, lmbd)
        if k > 0 and abs(h_prox_optval[k] - h_prox_optval[k - 1]) < tol:
            break
    return w, h_prox_optval[:k + 1]
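
As a quick sanity check on the synthetic data from the preparation block (the value lmbd=0.1 is an arbitrary illustrative choice):

w_hat, obj_vals = LASSO_proximal_gradient(X, y, lmbd=0.1)
print(np.round(w_hat, 2))  # should be close to the true coefficients 1..10
print("stopped after", len(obj_vals), "iterations")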

def pgd_lasso_path(X, y, lmbds):
    n_samples, n_features = X.shape
    pgd_coefs = np.empty((n_features, len(lmbds)), dtype=float)
    for i, lmbd in enumerate(lmbds):
        w, _ = LASSO_proximal_gradient(X, y, lmbd)
        pgd_coefs[:,i] = w
    return lmbds, pgd_coefs

Experiment:

# assumed lambda grid; its definition is not shown in the original post
my_lambdas = np.logspace(-3, 0, 50)

lmbds, pgd_coefs = pgd_lasso_path(X, y, my_lambdas)
# Display results
plt.figure(1)
colors = cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
neg_log_lambdas = -np.log10(lmbds)
for coef_l, c in zip(pgd_coefs, colors):
    plt.plot(neg_log_lambdas, coef_l, c=c)

plt.xlabel(r'-Log($\lambda$)')
plt.ylabel('coefficients')
plt.title('PGD Lasso Paths')
plt.axis('tight')

[figure: PGD lasso paths]
The result looks similar to the one obtained with the coordinate descent method.

Speed Comparison

print("Coordinate descent method from scikit-learn:")
%timeit lambdas_lasso, coefs_lasso, _ = lasso_path(X, y, eps, fit_intercept=False)

print("PGD method:")
%timeit lmbds, pgd_coefs = pgd_lasso_path(X, y, my_lambdas)

Results:

Coordinate descent method from scikit-learn:
4.88 ms ± 97 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

PGD method:
778 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The pure-Python PGD implementation is faster than the pure-Python coordinate descent method from my previous blog, but much slower than scikit-learn's Cython coordinate descent implementation.
