
The key ideas of the primal dual active set (PDAS) algorithm for solving the best subset selection problem are summarized here. The original paper can be found here. The authors also provide a very nice R package on CRAN.


Let $\boldsymbol X\in\mathbb R^{n\times p}$ denote the design matrix, $\boldsymbol y\in\mathbb R^n$ the outcome vector, and $\boldsymbol\beta\in\mathbb R^p$ the coefficient vector. The best subset selection problem with subset size $k$ is given by the following optimization problem:


$$ \underset{\boldsymbol\beta\in\mathbb R^p}{\min} l(\boldsymbol\beta)\: \text{ s.t. } \left\Vert \boldsymbol\beta \right\Vert_0 = k $$


where $l(\boldsymbol\beta)$ is a convex loss function of $\boldsymbol\beta$ and $\left\Vert \boldsymbol\beta \right\Vert_0=\sum_{j=1}^p 1_{\beta_j\neq 0}$ counts the number of nonzero entries of $\boldsymbol\beta$. Because the problem is nonconvex, its locally optimal solutions are not unique, so we work with coordinate-wise minimizers $\boldsymbol\beta^\diamond$ instead. The gradient vector and the Hessian diagonal at $\boldsymbol\beta^\diamond$ are, respectively,


$$ \boldsymbol g^\diamond = \nabla l(\boldsymbol \beta^\diamond),\: \boldsymbol h^\diamond=\text{diag}(\nabla^2 l(\boldsymbol\beta^\diamond)) $$
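The loss above is generic; for concreteness, here is a minimal sketch assuming the least-squares loss $l(\boldsymbol\beta)=\left\Vert\boldsymbol y-\boldsymbol X\boldsymbol\beta\right\Vert_2^2/(2n)$ (an assumption of mine, not required by the derivation), whose Hessian diagonal does not depend on $\boldsymbol\beta$:

```python
import numpy as np

def grad_and_hess_diag(X, y, beta):
    """Gradient and Hessian diagonal of l(beta) = ||y - X @ beta||^2 / (2n)."""
    n = X.shape[0]
    r = y - X @ beta                # residual at the current iterate
    g = -X.T @ r / n                # gradient: -X'(y - X beta) / n
    h = np.sum(X ** 2, axis=0) / n  # diag(X'X) / n; constant for this loss
    return g, h
```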


For each coordinate $j=1,...,p$, fix the other coordinates and view the loss as a function of the $j$-th one: $l_j(t)=l(\beta^\diamond_1,...,\beta^\diamond_{j-1}, t, \beta^\diamond_{j+1},...,\beta^\diamond_p)$. Then $l_j(t)$ can be approximated by a second-order Taylor expansion around $\beta_j^\diamond$, the local quadratic approximation $l_j^Q(t)$:


$$ \begin{aligned} l_j(t)\approx l_j^Q(t)&=l_j(\beta_j^\diamond)+g_j^\diamond(t-\beta_j^\diamond)+\frac{1}{2}h_j^\diamond(t-\beta_j^\diamond)^2 \\ &= \frac{1}{2}h_j^\diamond\left(t-\beta_j^\diamond+g_j^\diamond/h_j^\diamond\right)^2 + l_j(\beta_j^\diamond) -\frac{1}{2}(g_j^\diamond)^2/h_j^\diamond \\ &= \frac{1}{2}h_j^\diamond\left(t-(\beta_j^\diamond+\gamma_j^\diamond)\right)^2 + l_j(\beta_j^\diamond) -\frac{1}{2}(g_j^\diamond)^2/h_j^\diamond \end{aligned} $$
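As a quick numeric sanity check of the completion of the square, with hypothetical scalar values and $\gamma_j^\diamond=-g_j^\diamond/h_j^\diamond$ (formalized just below):

```python
import numpy as np

# Check the completion of the square for one coordinate,
# using hypothetical values for beta_j, g_j, h_j, and t.
beta_j, g_j, h_j, t = 0.7, -0.3, 2.0, 1.4
gamma_j = -g_j / h_j
taylor = g_j * (t - beta_j) + 0.5 * h_j * (t - beta_j) ** 2
squared = 0.5 * h_j * (t - (beta_j + gamma_j)) ** 2 - 0.5 * g_j ** 2 / h_j
assert np.isclose(taylor, squared)  # both equal l_j^Q(t) - l_j(beta_j)
```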


This completion of the square gives rise to the scaled gradient $\gamma_j^\diamond$, which has the form


$$ \gamma_j^\diamond = -g_j^\diamond/h_j^\diamond $$
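Under the same least-squares assumption, $g_j^\diamond=-\boldsymbol x_j^\top\boldsymbol r/n$ and $h_j^\diamond=\left\Vert\boldsymbol x_j\right\Vert_2^2/n$ with $\boldsymbol r=\boldsymbol y-\boldsymbol X\boldsymbol\beta^\diamond$, so the scaled gradient reduces to $\gamma_j^\diamond=\boldsymbol x_j^\top\boldsymbol r/\left\Vert\boldsymbol x_j\right\Vert_2^2$, the coefficient of a univariate regression of the current residual on column $j$. A sketch:

```python
import numpy as np

def scaled_gradient(X, y, beta):
    """gamma_j = -g_j / h_j; for least squares, the coefficient of a
    univariate regression of the residual on the j-th column of X."""
    r = y - X @ beta                         # current residual
    return X.T @ r / np.sum(X ** 2, axis=0)  # x_j'r / ||x_j||^2, elementwise
```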


Note that $l^Q_j(t)$ is minimized at $t^*=\beta_j^\diamond+\gamma_j^\diamond$ for $j=1,...,p$. When $t$ is moved from $t^*$ to $0$, the resulting increase in $l^Q_j(t)$, called the sacrifice and denoted $\Delta_j^\diamond$, is given by


$$ \Delta_j^\diamond = \frac{1}{2}h_j^\diamond(\beta_j^\diamond+\gamma_j^\diamond)^2, \: j=1,...,p $$
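In code the sacrifices are a one-liner (a sketch; the inputs can come from the `grad_and_hess_diag` and `scaled_gradient` sketches above):

```python
def sacrifices(beta, gamma, h):
    """Delta_j = 0.5 * h_j * (beta_j + gamma_j)^2: the increase in the
    quadratic model l_j^Q when t moves from beta_j + gamma_j to zero."""
    return 0.5 * h * (beta + gamma) ** 2
```

Two special cases follow immediately from the formula: on coordinates where the current fit is exact ($g_j^\diamond=0$), $\Delta_j^\diamond=\frac{1}{2}h_j^\diamond(\beta_j^\diamond)^2$; on coordinates with $\beta_j^\diamond=0$, $\Delta_j^\diamond=\frac{1}{2}(g_j^\diamond)^2/h_j^\diamond$.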


Best subset selection requires that $p-k$ components of the coefficient vector be zero. To determine which $p-k$ components, we can rank all the sacrifices $\Delta_j^\diamond$ and set to zero the coefficients that contribute the least total sacrifice to the overall loss. Let $\Delta_{[1]}^\diamond\geq\cdots\geq\Delta_{[p]}^\diamond$ denote the decreasing rearrangement of $\Delta_j^\diamond$, $j=1,...,p$. Best subset selection then truncates the ordered sacrifice vector at position $k$. In other words, the coordinate-wise minimizers satisfy


$$ \beta_j^\diamond = \begin{cases} \beta_j^\diamond+\gamma^\diamond_j,\text{ if }\Delta_j^\diamond\geq \Delta_{[k]}^\diamond\\ 0,\text{ otherwise} \end{cases} $$


In the equation above, $\boldsymbol\beta^\diamond=(\beta_1^\diamond,...,\beta_p^\diamond)$ are the primal variables, $\boldsymbol\gamma^\diamond=(\gamma_1^\diamond,...,\gamma_p^\diamond)$ are the dual variables, and $\boldsymbol\Delta^\diamond=(\Delta_1^\diamond,...,\Delta_p^\diamond)$ are the reference sacrifices; the interplay between the first two is what gives the primal dual active set algorithm its name.
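Putting the pieces together, here is a minimal PDAS-style loop, still under my least-squares assumption; the function `pdas` and the toy data are hypothetical illustrations, not the authors' implementation. The loop alternates between ranking the sacrifices, taking the $k$ largest as the active set, and refitting on that set, until the active set stabilizes.

```python
import numpy as np

def pdas(X, y, k, max_iter=50):
    """Sketch of a PDAS-style loop for l(beta) = ||y - X @ beta||^2 / (2n)."""
    n, p = X.shape
    h = np.sum(X ** 2, axis=0) / n             # Hessian diagonal (constant here)
    beta, active = np.zeros(p), None
    for _ in range(max_iter):
        g = -X.T @ (y - X @ beta) / n          # gradient (primal side)
        gamma = -g / h                         # scaled gradient (dual side)
        delta = 0.5 * h * (beta + gamma) ** 2  # sacrifices
        new_active = set(np.argsort(delta)[-k:])
        if new_active == active:               # active set stabilized: done
            break
        active = new_active
        idx = sorted(active)
        beta = np.zeros(p)
        beta[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]  # refit on A
    return beta

# Hypothetical toy data: k = 3 true signals among p = 10 columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.standard_normal(100)
print(np.nonzero(pdas(X, y, k=3))[0])          # should recover support [0 1 2]
```

The fixed point of this loop is exactly the truncation condition above: active coordinates keep $\beta_j^\diamond+\gamma_j^\diamond$, inactive ones stay at zero. On a strong-signal toy example like this, the active set typically stabilizes within a few iterations, which is the practical appeal of PDAS over enumerating all $\binom{p}{k}$ subsets.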