[Click here for a PDF of this post with nicer formatting]
Disclaimer
Peeter’s lecture notes from class. These may be incoherent and rough.
These are notes for the UofT course ECE1505H, Convex Optimization, taught by Prof. Stark Draper.
Matrix inner product
Given real matrices X, Y \in \mathbb{R}^{m\times n} , one possible matrix inner product definition is
\begin{equation}\label{eqn:convexOptimizationLecture3:20} \begin{aligned} \innerprod{X}{Y} &= \textrm{Tr}( X^\T Y) \\ &= \textrm{Tr} \lr{ \sum_{k = 1}^m X_{ki} Y_{kj} } \\ &= \sum_{k = 1}^m \sum_{j = 1}^n X_{kj} Y_{kj} \\ &= \sum_{i = 1}^m \sum_{j = 1}^n X_{ij} Y_{ij}. \end{aligned} \end{equation}
This inner product induces a norm on the (matrix) vector space, called the Frobenius norm
\begin{equation}\label{eqn:convexOptimizationLecture3:40} \begin{aligned} \Norm{X }_F &= \textrm{Tr}( X^\T X) \\ &= \sqrt{ \innerprod{X}{X} } \\ &= \sum_{i = 1}^m \sum_{j = 1}^n X_{ij}^2. \end{aligned} \end{equation}
Range, nullspace.
Definition: Range: Given A \in \mathbb{R}^{m \times n} , the range of A is the set:
\begin{equation*} \mathcal{R}(A) = \setlr{ A \Bx | \Bx \in \mathbb{R}^n }. \end{equation*}
Definition: Nullspace: Given A \in \mathbb{R}^{m \times n} , the nullspace of A is the set:
\begin{equation*} \mathcal{N}(A) = \setlr{ \Bx | A \Bx = 0 }. \end{equation*}
SVD.
To understand operation of A \in \mathbb{R}^{m \times n} , a representation of a linear transformation from \R{n} to \R{m}, decompose A using the singular value decomposition (SVD).
Definition: SVD: Given A \in \mathbb{R}^{m \times n} , an operator on \Bx \in \mathbb{R}^n , a decomposition of the following form is always possible
\begin{equation*} \begin{aligned} A &= U \Sigma V^\T \\ U &\in \mathbb{R}^{m \times r} \\ V &\in \mathbb{R}^{n \times r}, \end{aligned} \end{equation*}
where r is the rank of A, and both U and V are orthogonal
\begin{equation*} \begin{aligned} U^\T U &= I \in \mathbb{R}^{r \times r} \\ V^\T V &= I \in \mathbb{R}^{r \times r}. \end{aligned} \end{equation*}
Here \Sigma = \textrm{diag}( \sigma_1, \sigma_2, \cdots, \sigma_r ) , is a diagonal matrix of “singular” values, where
\begin{equation*} \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r. \end{equation*}
For simplicity consider square case m = n
\begin{equation}\label{eqn:convexOptimizationLecture3:100} A \Bx = \lr{ U \Sigma V^\T } \Bx. \end{equation}
The first product V^\T \Bx is a rotation, which can be checked by looking at the length
\begin{equation}\label{eqn:convexOptimizationLecture3:120} \begin{aligned} \Norm{ V^\T \Bx}_2 &= \sqrt{ \Bx^\T V V^\T \Bx } \\ &= \sqrt{ \Bx^\T \Bx } \\ &= \Norm{ \Bx }_2, \end{aligned} \end{equation}
which shows that the length of the vector is unchanged after application of the linear transformation represented by V^\T so that operation must be a rotation.
Similarly the operation of U on \Sigma V^\T \Bx also must be a rotation. The operation \Sigma = [\sigma_i]_i applies a scaling operation to each component of the vector V^\T \Bx .
All linear (square) transformations can therefore be thought of as a rotate-scale-rotate operation. Often the A of interest will be symmetric A = A^\T .
Set of symmetric matrices
Let S^n be the set of real, symmetric n \times n matrices.
Theorem: Spectral theorem: When A \in S^n then it is possible to factor A as
\begin{equation*} A = Q \Lambda Q^\T, \end{equation*}
where Q is an orthogonal matrix, and \Lambda = \textrm{diag}( \lambda_1, \lambda_2, \cdots \lambda_n). Here \lambda_i \in \mathbb{R} \, \forall i are the (real) eigenvalues of A .
A real symmetric matrix A \in S^n is “positive semi-definite” if
\begin{equation*}
\Bv^\T A \Bv \ge 0 \qquad\forall \Bv \in \mathbb{R}^n, \Bv \ne 0,
\end{equation*}
and is “positive definite” if
\begin{equation*} \Bv^\T A \Bv > 0 \qquad\forall \Bv \in \mathbb{R}^n, \Bv \ne 0. \end{equation*}
The set of such matrices is denoted S^n_{+} , and S^n_{++} respectively.
Consider A \in S^n_{+} (or S^n_{++} )
\begin{equation}\label{eqn:convexOptimizationLecture3:200} A = Q \Lambda Q^\T, \end{equation}
possible since the matrix is symmetric. For such a matrix
\begin{equation}\label{eqn:convexOptimizationLecture3:220} \begin{aligned} \Bv^\T A \Bv &= \Bv^\T Q \Lambda A^\T \Bv \\ &= \Bw^\T \Lambda \Bw, \end{aligned} \end{equation}
where \Bw = A^\T \Bv . Such a product is
\begin{equation}\label{eqn:convexOptimizationLecture3:240} \Bv^\T A \Bv = \sum_{i = 1}^n \lambda_i w_i^2. \end{equation}
So, if \lambda_i \ge 0 (\lambda_i > 0 ) then \sum_{i = 1}^n \lambda_i w_i^2 is non-negative (positive) \forall \Bw \in \mathbb{R}^n, \Bw \ne 0 . Since \Bw is just a rotated version of \Bv this also holds for all \Bv . A necessary and sufficient condition for A \in S^n_{+} ( S^n_{++} ) is \lambda_i \ge 0 (\lambda_i > 0).
Square root of positive semi-definite matrix
Real symmetric matrix power relationships such as
\begin{equation}\label{eqn:convexOptimizationLecture3:260} A^2 = Q \Lambda Q^\T Q \Lambda Q^\T = Q \Lambda^2 Q^\T , \end{equation}
or more generally A^k = Q \Lambda^k Q^\T,\, k \in \mathbb{Z} , can be further generalized to non-integral powers. In particular, the square root (non-unique) of a square matrix can be written
\begin{equation}\label{eqn:convexOptimizationLecture3:280} A^{1/2} = Q \begin{bmatrix} \sqrt{\lambda_1} & & & \\ & \sqrt{\lambda_2} & & \\ & & \ddots & \\ & & & \sqrt{\lambda_n} \\ \end{bmatrix} Q^\T, \end{equation}
since A^{1/2} A^{1/2} = A , regardless of the sign picked for the square roots in question.
Functions of matrices
Consider F : S^n \rightarrow \mathbb{R} , and define
\begin{equation}\label{eqn:convexOptimizationLecture3:300} F(X) = \log \det X, \end{equation}
Here \textrm{dom} F = S^n_{++} . The task is to find \spacegrad F , which can be done by looking at the perturbation \log \det ( X + \Delta X )
\begin{equation}\label{eqn:convexOptimizationLecture3:320} \begin{aligned} \log \det ( X + \Delta X ) &= \log \det ( X^{1/2} (I + X^{-1/2} \Delta X X^{-1/2}) X^{1/2} ) \\ &= \log \det ( X (I + X^{-1/2} \Delta X X^{-1/2}) ) \\ &= \log \det X + \log \det (I + X^{-1/2} \Delta X X^{-1/2}). \end{aligned} \end{equation}
Let X^{-1/2} \Delta X X^{-1/2} = M where \lambda_i are the eigenvalues of M : M \Bv = \lambda_i \Bv when \Bv is an eigenvector of M . In particular
\begin{equation}\label{eqn:convexOptimizationLecture3:340} (I + M) \Bv = (1 + \lambda_i) \Bv, \end{equation}
where 1 + \lambda_i are the eigenvalues of the I + M matrix. Since the determinant is the product of the eigenvalues, this gives
\begin{equation}\label{eqn:convexOptimizationLecture3:360} \begin{aligned} \log \det ( X + \Delta X ) &= \log \det X + \log \prod_{i = 1}^n (1 + \lambda_i) \\ &= \log \det X + \sum_{i = 1}^n \log (1 + \lambda_i). \end{aligned} \end{equation}
If \lambda_i are sufficiently “small”, then \log ( 1 + \lambda_i ) \approx \lambda_i , giving
\begin{equation}\label{eqn:convexOptimizationLecture3:380} \log \det ( X + \Delta X ) = \log \det X + \sum_{i = 1}^n \lambda_i \approx \log \det X + \textrm{Tr}( X^{-1/2} \Delta X X^{-1/2} ). \end{equation}
Since
\begin{equation}\label{eqn:convexOptimizationLecture3:400}
\textrm{Tr}( A B ) = \textrm{Tr}( B A ),
\end{equation}
this trace operation can be written as
\begin{equation}\label{eqn:convexOptimizationLecture3:420} \log \det ( X + \Delta X ) \approx \log \det X + \textrm{Tr}( X^{-1} \Delta X ) = \log \det X + \innerprod{ X^{-1}}{\Delta X}, \end{equation}
so
\begin{equation}\label{eqn:convexOptimizationLecture3:440}
\spacegrad F(X) = X^{-1}.
\end{equation}
To check this, consider the simplest example with X \in \mathbb{R}^{1 \times 1} , where we have
\begin{equation}\label{eqn:convexOptimizationLecture3:460} \frac{d}{dX} \lr{ \log \det X } = \frac{d}{dX} \lr{ \log X } = \inv{X} = X^{-1}. \end{equation}
This is a nice example demonstrating how the gradient can be obtained by performing a first order perturbation of the function. The gradient can then be read off from the result.
Second order perturbations
- To get first order approximation found the part that varied linearly in \Delta X .
- To get the second order part, perturb X^{-1} by \Delta X and see how that perturbation varies in \Delta X .
For G(X) = X^{-1} , this is
\begin{equation}\label{eqn:convexOptimizationLecture3:480} \begin{aligned} (X + \Delta X)^{-1} &= \lr{ X^{1/2} (I + X^{-1/2} \Delta X X^{-1/2} ) X^{1/2} }^{-1} \\ &= X^{-1/2} (I + X^{-1/2} \Delta X X^{-1/2} )^{-1} X^{-1/2} \end{aligned} \end{equation}
To be proven in the homework (for “small” A)
\begin{equation}\label{eqn:convexOptimizationLecture3:500} (I + A)^{-1} \approx I – A. \end{equation}
This gives
\begin{equation}\label{eqn:convexOptimizationLecture3:520} \begin{aligned} (X + \Delta X)^{-1} &= X^{-1/2} (I – X^{-1/2} \Delta X X^{-1/2} ) X^{-1/2} \\ &= X^{-1} – X^{-1} \Delta X X^{-1}, \end{aligned} \end{equation}
or
\begin{equation}\label{eqn:convexOptimizationLecture3:800} \begin{aligned} G(X + \Delta X) &= G(X) + (D G) \Delta X \\ &= G(X) + (\spacegrad G)^\T \Delta X, \end{aligned} \end{equation}
so
\begin{equation}\label{eqn:convexOptimizationLecture3:820}
(\spacegrad G)^\T \Delta X
=
– X^{-1} \Delta X X^{-1}.
\end{equation}
The Taylor expansion of F to second order is
\begin{equation}\label{eqn:convexOptimizationLecture3:840} F(X + \Delta X) = F(X) + \textrm{Tr} \lr{ (\spacegrad F)^\T \Delta X} + \inv{2} \lr{ (\Delta X)^\T (\spacegrad^2 F) \Delta X}. \end{equation}
The first trace can be expressed as an inner product
\begin{equation}\label{eqn:convexOptimizationLecture3:860} \begin{aligned} \textrm{Tr} \lr{ (\spacegrad F)^\T \Delta X} &= \innerprod{ \spacegrad F }{\Delta X} \\ &= \innerprod{ X^{-1} }{\Delta X}. \end{aligned} \end{equation}
The second trace also has the structure of an inner product
\begin{equation}\label{eqn:convexOptimizationLecture3:880} \begin{aligned} (\Delta X)^\T (\spacegrad^2 F) \Delta X &= \textrm{Tr} \lr{ (\Delta X)^\T (\spacegrad^2 F) \Delta X} \\ &= \innerprod{ (\spacegrad^2 F)^\T \Delta X }{\Delta X}, \end{aligned} \end{equation}
where a no-op trace could be inserted in the second order term since that quadratic form is already a scalar. This (\spacegrad^2 F)^\T \Delta X term has essentially been found implicitly by performing the linear variation of \spacegrad F in \Delta X , showing that we must have
\begin{equation}\label{eqn:convexOptimizationLecture3:900} \textrm{Tr} \lr{ (\Delta X)^\T (\spacegrad^2 F) \Delta X} = \innerprod{ – X^{-1} \Delta X X^{-1} }{\Delta X}, \end{equation}
so
\begin{equation}\label{eqn:convexOptimizationLecture3:560}
F( X + \Delta X) = F(X) +
\innerprod{X^{-1}}{\Delta X}
+\inv{2} \innerprod{-X^{-1} \Delta X X^{-1}}{\Delta X},
\end{equation}
or
\begin{equation}\label{eqn:convexOptimizationLecture3:580}
\log \det ( X + \Delta X) = \log \det X +
\textrm{Tr}( X^{-1} \Delta X )
– \inv{2} \textrm{Tr}( X^{-1} \Delta X X^{-1} \Delta X ).
\end{equation}
Convex Sets
- Types of sets: Affine, convex, cones
- Examples: Hyperplanes, polyhedra, balls, ellipses, norm balls, cone of PSD matrices.
Definition: Affine set:
A set C \subseteq \mathbb{R}^n is affine if \forall \Bx_1, \Bx_2 \in C then
\begin{equation*} \theta \Bx_1 + (1 -\theta) \Bx_2 \in C, \qquad \forall \theta \in \mathbb{R}. \end{equation*}
The affine sum above can
be rewritten as
\begin{equation}\label{eqn:convexOptimizationLecture3:600} \Bx_2 + \theta (\Bx_1 – \Bx_2). \end{equation}
Since \theta is a scaling, this is the line containing \Bx_2 in the direction between \Bx_1 and \Bx_2 .
Observe that the solution to a set of linear equations
\begin{equation}\label{eqn:convexOptimizationLecture3:620} C = \setlr{ \Bx | A \Bx = \Bb }, \end{equation}
is an affine set. To check, note that
\begin{equation}\label{eqn:convexOptimizationLecture3:640} \begin{aligned} A (\theta \Bx_1 + (1 – \theta) \Bx_2) &= \theta A \Bx_1 + (1 – \theta) A \Bx_2 \\ &= \theta \Bb + (1 – \theta) \Bb \\ &= \Bb. \end{aligned} \end{equation}
Definition: Affine combination: An affine combination of points \Bx_1, \Bx_2, \cdots \Bx_n is
\begin{equation*} \sum_{i = 1}^n \theta_i \Bx_i, \end{equation*}
such that for \theta_i \in \mathbb{R}
\begin{equation*} \sum_{i = 1}^n \theta_i = 1. \end{equation*}
An affine set contains all affine combinations of points in the set. Examples of a couple affine sets are sketched in fig 1.1
For comparison, a couple of non-affine sets are sketched in fig 1.2
\begin{equation}\label{eqn:convexOptimizationLecture3:700} \theta \Bx_1 + (1 – \theta) \Bx_2 \in C. \end{equation}
Definition: Convex combination: A convex combination of \Bx_1, \Bx_2, \cdots \Bx_n is
\begin{equation*} \sum_{i = 1}^n \theta_i \Bx_i, \end{equation*}
such that \forall \theta_i \ge 0
\begin{equation*} \sum_{i = 1}^n \theta_i = 1 \end{equation*}
Definition: Convex hull: Convex hull of a set C is a set of all convex combinations of points in C, denoted
\begin{equation}\label{eqn:convexOptimizationLecture3:720} \textrm{conv}(C) = \setlr{ \sum_{i=1}^n \theta_i \Bx_i | \Bx_i \in C, \theta_i \ge 0, \sum_{i=1}^n \theta_i = 1 }. \end{equation}
A non-convex set can be converted into a convex hull by filling in all the combinations of points connecting points in the set, as sketched in fig 1.3.
Definition: Cone: A set C is a cone if \forall \Bx \in C and \forall \theta \ge 0 we have \theta \Bx \in C.
This scales out if \theta > 1 and scales in if \theta < 1.
A convex cone is a cone that is also a convex set. A conic combination is
\begin{equation*} \sum_{i=1}^n \theta_i \Bx_i, \theta_i \ge 0. \end{equation*}
A convex and non-convex 2D cone is sketched in fig. 1.4
A comparison of properties for different set types is tabulated in table 1.1
Hyperplanes and half spaces
Definition: Hyperplane: A hyperplane is defined by
\begin{equation*} \setlr{ \Bx | \Ba^\T \Bx = \Bb, \Ba \ne 0 }. \end{equation*}
A line and plane are examples of this general construct as sketched in
fig. 1.5
An alternate view is possible should one
find any specific \Bx_0 such that \Ba^\T \Bx_0 = \Bb
\begin{equation}\label{eqn:convexOptimizationLecture3:740} \setlr{\Bx | \Ba^\T \Bx = b } = \setlr{\Bx | \Ba^\T (\Bx -\Bx_0) = 0 } \end{equation}
This shows that \Bx – \Bx_0 = \Ba^\perp is perpendicular to \Ba , or
\begin{equation}\label{eqn:convexOptimizationLecture3:780} \Bx = \Bx_0 + \Ba^\perp. \end{equation}
This is the subspace perpendicular to \Ba shifted by \Bx_0, subject to \Ba^\T \Bx_0 = \Bb . As a set
\begin{equation}\label{eqn:convexOptimizationLecture3:760} \Ba^\perp = \setlr{ \Bv | \Ba^\T \Bv = 0 }. \end{equation}
Half space
Definition: Half space: The half space is defined as
\begin{equation*}
\setlr{ \Bx | \Ba^\T \Bx = \Bb }
= \setlr{ \Bx | \Ba^\T (\Bx – \Bx_0) \le 0 }.
\end{equation*}
This can also be expressed as \setlr{ \Bx | \innerprod{ \Ba }{\Bx – \Bx_0 } \le 0 } .