## Notes compilation for ECE1505, Convex Optimization

March 18, 2017 ece1505

I’ve now posted a notes compilation for the subset of the Convex Optimization (ECE1505H) course I was taking in the winter 2017 session.

This course was taught by Prof. S. Draper.

These convex optimization notes are incomplete, covering only the first 9 lectures. The unredacted notes include my solution to problem set 1 (149 pages, vs. 131 pages).

I initially enrolled on this optimization course because I needed a specific quota of ECE courses to satisfy the M.Eng graduation requirements, and the electromagnetics group wasn’t offering enough courses.  I remembered liking linear programming in high school, and always wanted to understand the rationale for some of the assumptions it was based on, which were never proven in class.  Specifically, I recall that it was stated, but not proved in that high school class, that the extreme values were always found at the vertices of the optimization region.  So, my thought was, I’ll have fun learning the basis for those assumptions, and also learn about optimization theory in general.

It turns out that optimization theory, at least as presented in this course, is very dry.  It was a seemingly endless sequence of definition and proof, with the end goal so far away that it was very difficult to see the big picture.  I worked through a number of weeks of this particular course before I had enough and bailed.  Work is too fun right now to torture myself and spend the time on an academic course that I am not enjoying, so I dropped it and am back to full time work at LzLabs (from 80%) until the next session at UofT starts again.

The reason I enrolled on the M.Eng in the first place was to study material that I was interested in.  Ideally I would have done that in a part time physics grad context, but that was not available, so I found that the M.Eng allowed me to take an interesting (but constrained) mix of physics and engineering electromagnetism courses.  However, when I enrolled, the electromagnetism course selection was a lot better, and now unfortunately it is sparse and includes only courses that I’d already taken.  I don’t want the M.Eng degree paper badly enough to torture myself with a course that I’m not actually interested in.

I now actually have a plan to satisfy both the degree requirements and my interests (using a project “course”).  That will involve independent study on Geometric Algebra applications to engineering electromagnetism.  I am irked that I have to pay a part time engineering program fee next year to self study, but it does seem worthwhile to come out of the M.Eng study with an actual degree as a side effect, so I am going to go ahead and do it anyways.

## Disclaimer

Peeter’s lecture notes from class. These may be incoherent and rough.

These are notes for the UofT course ECE1505H, Convex Optimization, taught by Prof. Stark Draper, from [1].

## Today

• Finish local vs global.
• Compositions of functions.
• Introduction to convex optimization problems.

## Continuing proof:

We want to prove that if

\begin{equation*}
\begin{aligned}
\spacegrad F(\Bx^\conj) &= 0 \\
\spacegrad^2 F(\Bx^\conj) &> 0
\end{aligned},
\end{equation*}

then $$\Bx^\conj$$ is a local optimum.

Proof:

Again, using Taylor approximation

\label{eqn:convexOptimizationLecture8:20}
F(\Bx^\conj + \Bv) = F(\Bx^\conj) + \lr{ \spacegrad F(\Bx^\conj)}^\T \Bv + \inv{2} \Bv^\T \spacegrad^2 F(\Bx^\conj) \Bv + o(\Norm{\Bv}^2)

The linear term is zero by assumption, whereas the Hessian term is given as $$> 0$$. In any direction that you move, if the move is small enough, $$F$$ increases, so $$\Bx^\conj$$ is a local optimum.

## Summarize:

For twice continuously differentiable functions, at a local optimum $$\Bx^\conj$$,

\label{eqn:convexOptimizationLecture8:40}
\begin{aligned}
\spacegrad F(\Bx^\conj) &= 0 \\
\spacegrad^2 F(\Bx^\conj) &\ge 0.
\end{aligned}

If, in addition, $$F$$ is convex, then $$\spacegrad F(\Bx^\conj) = 0$$ implies that $$\Bx^\conj$$ is a global optimum. i.e. for (unconstrained) convex functions, local and global optimums are equivalent.

• It is possible that a convex function does not have a global optimum. An example is $$F(x) = e^x$$ (fig. 1), which has an $$\inf$$, but no lowest point.

fig. 1. Exponential has no global optimum.

• Our discussion has been for unconstrained functions. For constrained problems (next topic) it is not necessarily true that the global optimum $$\Bx$$ satisfies $$\spacegrad F(\Bx) = 0$$, even for $$F$$ convex.

As an example of a constrained problem consider

\label{eqn:convexOptimizationLecture8:n}
\begin{aligned}
\min &2 x^2 + y^2 \\
x &\ge 3 \\
y &\ge 5.
\end{aligned}

The level sets of this objective function are plotted in fig. 2. The optimal point is at $$\Bx^\conj = (3,5)$$, where $$\spacegrad F \ne 0$$.

fig. 2. Constrained problem with optimum not at the zero gradient point.
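
The constrained example above can be checked numerically. The following is a minimal sketch using projected gradient descent, which is not a method from the lecture; the starting point, step size, and iteration count are arbitrary illustrative choices.

```python
# Projected gradient descent for: min 2x^2 + y^2  s.t.  x >= 3, y >= 5.
# The solver parameters below are illustrative assumptions, not from the notes.

def grad(x, y):
    """Gradient of F(x, y) = 2 x^2 + y^2."""
    return 4.0 * x, 2.0 * y

def project(x, y):
    """Euclidean projection onto the feasible set {x >= 3, y >= 5}."""
    return max(x, 3.0), max(y, 5.0)

def solve(x=10.0, y=10.0, step=0.1, iterations=200):
    """Alternate a gradient step with a projection back onto the feasible set."""
    for _ in range(iterations):
        gx, gy = grad(x, y)
        x, y = project(x - step * gx, y - step * gy)
    return x, y

x_opt, y_opt = solve()
print(x_opt, y_opt)        # constrained minimizer: 3.0 5.0
print(grad(x_opt, y_opt))  # gradient at the minimizer: (12.0, 10.0), not zero
```

The iterate settles on the corner of the feasible region, where the gradient is clearly non-zero, which is the point of the example.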

## Projection

Given $$\Bx \in \mathbb{R}^n, \By \in \mathbb{R}^p$$, if $$h(\Bx,\By)$$ is convex in $$\Bx, \By$$, then

\label{eqn:convexOptimizationLecture8:60}
F(\Bx_0) = \inf_\By h(\Bx_0,\By)

is convex in $$\Bx$$, as sketched in fig. 3.

fig. 3. Epigraph of $$h$$ is a filled bowl.

The intuition here is that of shining light on the (filled) “bowl” and looking at its shadow. That is, the image of $$\textrm{epi} h$$ on the $$\By = 0$$ screen, which we will show is a convex set.

Proof:

Since $$h$$ is convex in $$\begin{bmatrix} \Bx \\ \By \end{bmatrix} \in \textrm{dom} h$$, then

\label{eqn:convexOptimizationLecture8:80}
\textrm{epi} h = \setlr{ (\Bx,\By,t) | t \ge h(\Bx,\By), \begin{bmatrix} \Bx \\ \By \end{bmatrix} \in \textrm{dom} h },

is a convex set.

We also have to show that the domain of $$F$$ is a convex set. To show this note that

\label{eqn:convexOptimizationLecture8:100}
\begin{aligned}
\textrm{dom} F
&= \setlr{ \Bx | \exists \By s.t. \begin{bmatrix} \Bx \\ \By \end{bmatrix} \in \textrm{dom} h } \\
&= \setlr{
\begin{bmatrix}
I_{n\times n} & 0_{n \times p}
\end{bmatrix}
\begin{bmatrix}
\Bx \\
\By
\end{bmatrix}
| \begin{bmatrix} \Bx \\ \By \end{bmatrix} \in \textrm{dom} h
}.
\end{aligned}

This is an affine map of a convex set. Therefore $$\textrm{dom} F$$ is a convex set.

\label{eqn:convexOptimizationLecture8:120}
\begin{aligned}
\textrm{epi} F
&=
\setlr{ \begin{bmatrix} \Bx \\ t \end{bmatrix} | t \ge \inf_\By h(\Bx,\By), \Bx \in \textrm{dom} F, \By: \begin{bmatrix} \Bx \\ \By \end{bmatrix} \in \textrm{dom} h } \\
&=
\setlr{
\begin{bmatrix}
I & 0 & 0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\Bx \\
\By \\
t
\end{bmatrix}
|
t \ge h(\Bx,\By), \begin{bmatrix} \Bx \\ \By \end{bmatrix} \in \textrm{dom} h
}.
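\end{aligned}

This is again an affine map of a convex set ($$\textrm{epi} h$$), so $$\textrm{epi} F$$ is a convex set, and $$F$$ is convex.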

### Example:

The function

\label{eqn:convexOptimizationLecture8:140}
F(\Bx) = \inf_{\By \in C} \Norm{ \Bx - \By },

over $$\Bx \in \mathbb{R}^n, \By \in C$$, is convex if $$C$$ is a convex set. Reason:

• $$\Bx - \By$$ is linear in $$(\Bx, \By)$$.
• $$h(\Bx, \By) = \Norm{\Bx - \By}$$ is a norm composed with a linear map, so it is a convex function provided $$\textrm{dom} h$$ is a convex set.
• Setting $$\textrm{dom} h = \mathbb{R}^n \times C$$, the domain is a convex set if $$C$$ is.
• $$F(\Bx) = \inf_\By h(\Bx, \By)$$ is then convex by the projection result above.
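
As a rough numerical illustration of the distance-to-a-convex-set example, the sketch below takes $$C$$ to be the box $$[-1,1]^n$$ (an assumed choice, picked because the nearest point is found by clamping) and samples random chords looking for a violation of the convexity inequality. Finding none is consistent with convexity, but does not prove it.

```python
import random

def dist_to_box(x, lo=-1.0, hi=1.0):
    """Euclidean distance from the point x to the box [lo, hi]^n (a convex set C)."""
    # The nearest point of the box is obtained by clamping each coordinate.
    return sum((xi - min(max(xi, lo), hi)) ** 2 for xi in x) ** 0.5

def violates_convexity(f, x, y, theta, tol=1e-9):
    """True if f(theta x + (1 - theta) y) > theta f(x) + (1 - theta) f(y)."""
    z = [theta * a + (1 - theta) * b for a, b in zip(x, y)]
    return f(z) > theta * f(x) + (1 - theta) * f(y) + tol

random.seed(0)
violations = 0
for _ in range(10000):
    x = [random.uniform(-5, 5) for _ in range(3)]
    y = [random.uniform(-5, 5) for _ in range(3)]
    violations += violates_convexity(dist_to_box, x, y, random.random())
print(violations)  # number of sampled chords that violate convexity
```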

## Composition of functions

Consider

\label{eqn:convexOptimizationLecture8:160}
\begin{aligned}
F(\Bx) &= h(g(\Bx)) \\
\textrm{dom} F &= \setlr{ \Bx \in \textrm{dom} g | g(\Bx) \in \textrm{dom} h } \\
F &: \mathbb{R}^n \rightarrow \mathbb{R} \\
g &: \mathbb{R}^n \rightarrow \mathbb{R} \\
h &: \mathbb{R} \rightarrow \mathbb{R}.
\end{aligned}

Cases:

1. $$g$$ is convex, $$h$$ is convex and non-decreasing.
2. $$g$$ is concave, $$h$$ is convex and non-increasing.

Show for 1D case ( $$n = 1$$). Get to $$n > 1$$ by applying to all lines.

1. \label{eqn:convexOptimizationLecture8:180}
\begin{aligned}
F'(x) &= h'(g(x)) g'(x) \\
F''(x) &=
h''(g(x)) g'(x) g'(x)
+
h'(g(x)) g''(x) \\
&=
h''(g(x)) (g'(x))^2
+
h'(g(x)) g''(x) \\
&=
\lr{ \ge 0 } \cdot \lr{ g'(x) }^2 + \lr{ \ge 0 } \cdot \lr{ \ge 0 },
\end{aligned}

since $$h$$ is convex ($$h'' \ge 0$$) and non-decreasing ($$h' \ge 0$$), and $$g$$ is convex ($$g'' \ge 0$$).

2. \label{eqn:convexOptimizationLecture8:180b}
\begin{aligned}
F''(x) =
\lr{ \ge 0 } \cdot \lr{ g'(x) }^2 + \lr{ \le 0 } \cdot \lr{ \le 0 },
\end{aligned}

since $$h$$ is convex ($$h'' \ge 0$$) and non-increasing ($$h' \le 0$$), and $$g$$ is concave ($$g'' \le 0$$).
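
The two composition cases can be spot-checked with a finite-difference estimate of $$F''$$. The functions below are illustrative choices of my own (not from the lecture): $$e^{x^2}$$ for case 1, $$1/(1 - x^2)$$ on $$(-1,1)$$ for case 2, and $$e^{-x^2}$$ as a composition covered by neither case, which fails to be convex near the origin.

```python
import math

def second_difference(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def case1(x):
    """h(y) = e^y (convex, non-decreasing) composed with g(x) = x^2 (convex)."""
    return math.exp(x * x)

def case2(x):
    """h(y) = 1/y on y > 0 (convex, non-increasing) with g(x) = 1 - x^2 (concave)."""
    return 1.0 / (1.0 - x * x)

def neither(x):
    """h(y) = e^{-y} (convex, non-increasing) with g(x) = x^2 (convex): no rule applies."""
    return math.exp(-x * x)

grid = [i / 10.0 for i in range(-8, 9)]  # sample points inside (-1, 1)
print(min(second_difference(case1, x) for x in grid))  # positive
print(min(second_difference(case2, x) for x in grid))  # positive
print(second_difference(neither, 0.0))                 # negative
```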

## Extending to multiple dimensions

\label{eqn:convexOptimizationLecture8:200}
\begin{aligned}
F(\Bx)
&= h(g(\Bx)) = h( g_1(\Bx), g_2(\Bx), \cdots g_k(\Bx) ) \\
g &: \mathbb{R}^n \rightarrow \mathbb{R}^k \\
h &: \mathbb{R}^k \rightarrow \mathbb{R}.
\end{aligned}

is convex if $$g_i$$ is convex for each $$i \in [1,k]$$ and $$h$$ is convex and non-decreasing in each argument.

Proof:

again assume $$n = 1$$, without loss of generality,

\label{eqn:convexOptimizationLecture8:220}
\begin{aligned}
g &: \mathbb{R} \rightarrow \mathbb{R}^k \\
h &: \mathbb{R}^k \rightarrow \mathbb{R} \\
\end{aligned}

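\label{eqn:convexOptimizationLecture8:240}
F''(x)
=
\begin{bmatrix}
g_1'(x) & g_2'(x) & \cdots & g_k'(x)
\end{bmatrix}
\spacegrad^2 h(g(x))
\begin{bmatrix}
g_1'(x) \\ g_2'(x) \\ \vdots \\ g_k'(x)
\end{bmatrix}
+
\lr{ \spacegrad h(g(x)) }^\T
\begin{bmatrix}
g_1''(x) \\ g_2''(x) \\ \vdots \\ g_k''(x)
\end{bmatrix}

The first term is non-negative because the Hessian of $$h$$ is PSD. The second is non-negative because each component of $$\spacegrad h$$ is non-negative ($$h$$ is non-decreasing in each argument) and each $$g_i'' \ge 0$$.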

### Example:

\label{eqn:convexOptimizationLecture8:260}
F(x) = \exp( g(x) ) = h( g(x) ),

where $$g$$ is convex, and $$h(y) = e^y$$ is convex and non-decreasing. This implies that $$F$$ is a convex function.

### Example:

\label{eqn:convexOptimizationLecture8:280}
F(x) = \inv{g(x)},

is convex if $$g(x)$$ is concave and positive. The outer function here is $$h(x) = 1/x, \textrm{dom} h = \mathbb{R}_{++}$$, which is convex and non-increasing, and is plotted in fig. 4.

fig. 4. Inverse function is convex over positive domain.

### Example:

\label{eqn:convexOptimizationLecture8:300}
F(x) = - \sum_{i = 1}^n \log( -F_i(x) )

is convex on $$\setlr{ x | F_i(x) < 0 \forall i }$$ if all $$F_i$$ are convex.

• Due to $$\textrm{dom} F$$, $$-F_i(x) > 0 \,\forall x \in \textrm{dom} F$$.
• $$\log(x)$$ is concave on $$\mathbb{R}_{++}$$, so $$-\log$$ is convex, and also non-increasing (fig. 5).

fig. 5. Negative logarithm convex over positive domain.

\label{eqn:convexOptimizationLecture8:320}
F(x) = \sum h_i(x)

but
\label{eqn:convexOptimizationLecture8:340}
h_i(x) = -\log(-F_i(x)),

which is a convex and non-increasing function ($$-\log$$), of a concave function $$-F_i(x)$$. Each $$h_i$$ is convex, so this is a sum of convex functions, and is therefore convex.

### Example:

Over $$\textrm{dom} F = S^n_{++}$$

\label{eqn:convexOptimizationLecture8:360}
F(X) = \log \det X^{-1}

To show that this is convex, check all lines in domain. A line in $$S^n_{++}$$ is a 1D family of matrices

\label{eqn:convexOptimizationLecture8:380}
\tilde{F}(t) = \log \det( \lr{X_0 + t H}^{-1} ),

where $$X_0 \in S^n_{++}, t \in \mathbb{R}, H \in S^n$$.

For $$t$$ small enough,

\label{eqn:convexOptimizationLecture8:400}
X_0 + t H \in S^n_{++}

\label{eqn:convexOptimizationLecture8:420}
\begin{aligned}
\tilde{F}(t)
&= \log \det( \lr{X_0 + t H}^{-1} ) \\
&= \log \det\lr{ X_0^{-1/2} \lr{I + t X_0^{-1/2} H X_0^{-1/2} }^{-1} X_0^{-1/2} } \\
&= \log \det\lr{ X_0^{-1} \lr{I + t X_0^{-1/2} H X_0^{-1/2} }^{-1} } \\
&= \log \det X_0^{-1} + \log\det \lr{I + t X_0^{-1/2} H X_0^{-1/2} }^{-1} \\
&= \log \det X_0^{-1} - \log\det \lr{I + t X_0^{-1/2} H X_0^{-1/2} } \\
&= \log \det X_0^{-1} - \log\det \lr{I + t M }.
\end{aligned}

With $$M = X_0^{-1/2} H X_0^{-1/2}$$, if $$\lambda_i$$ are the eigenvalues of $$M$$, then $$1 + t \lambda_i$$ are the eigenvalues of $$I + t M$$. i.e. for an eigenvector $$\Bv$$ of $$M$$:

\label{eqn:convexOptimizationLecture8:440}
\begin{aligned}
(I + t M) \Bv
&=
I \Bv + t \lambda_i \Bv \\
&=
(1 + t \lambda_i) \Bv.
\end{aligned}

This gives

\label{eqn:convexOptimizationLecture8:460}
\begin{aligned}
\tilde{F}(t)
&= \log \det X_0^{-1} - \log \prod_{i = 1}^n (1 + t \lambda_i) \\
&= \log \det X_0^{-1} - \sum_{i = 1}^n \log (1 + t \lambda_i)
\end{aligned}

• $$1 + t \lambda_i$$ is linear in $$t$$.
• $$-\log$$ is convex in its argument.
• A sum of convex functions is convex, so $$\tilde{F}(t)$$ is convex along every line, and $$F$$ is convex.
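
The restriction-to-a-line argument can be tried out on $$2 \times 2$$ matrices, where the determinant has a closed form. This is only a sketch: the matrices `X0` and `H` are hypothetical values, and the range of $$t$$ is chosen so that $$X_0 + t H$$ stays positive definite.

```python
import math

def logdet_inv_2x2(X):
    """log det X^{-1} = -log det X for a symmetric positive definite 2x2 matrix."""
    (a, b), (_, d) = X
    det = a * d - b * b
    if a <= 0.0 or det <= 0.0:
        raise ValueError("X is not positive definite")
    return -math.log(det)

def restrict_to_line(X0, H):
    """Return the 1D function t -> F(X0 + t H)."""
    def f_tilde(t):
        X = [[X0[i][j] + t * H[i][j] for j in range(2)] for i in range(2)]
        return logdet_inv_2x2(X)
    return f_tilde

X0 = [[2.0, 0.5], [0.5, 1.0]]    # hypothetical point in S^2_{++}
H = [[1.0, -1.0], [-1.0, 0.3]]   # hypothetical symmetric direction
f_tilde = restrict_to_line(X0, H)

h = 1e-3
curvatures = [
    (f_tilde(t + h) - 2.0 * f_tilde(t) + f_tilde(t - h)) / (h * h)
    for t in [k / 100.0 for k in range(-30, 31)]
]
print(min(curvatures))  # finite-difference second derivative stays positive
```

A single line proves nothing on its own. The analytic argument above is what covers all lines at once.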

### Example:

\label{eqn:convexOptimizationLecture8:480}
F(X) = \lambda_\max(X),

is convex on $$\textrm{dom} F = S^n$$

(a)
\label{eqn:convexOptimizationLecture8:500}
\lambda_{\max} (X) = \sup_{\Norm{\Bv}_2 = 1} \Bv^\T X \Bv,

where the $$\lambda_i$$ are the eigenvalues of $$X$$, collected in

\label{eqn:convexOptimizationLecture8:520}
\Lambda =
\begin{bmatrix}
\lambda_1 & & & \\
& \lambda_2 & & \\
& & \ddots & \\
& & & \lambda_n
\end{bmatrix}

Recall that a decomposition

\label{eqn:convexOptimizationLecture8:540}
\begin{aligned}
X &= Q \Lambda Q^\T \\
Q^\T Q &= Q Q^\T = I
\end{aligned}

can be used for any $$X \in S^n$$.

(b)

Note that $$\Bv^\T X \Bv$$ is linear in $$X$$. $$\lambda_\max$$ is therefore a supremum of a family of linear (and hence convex) functions of $$X$$, so it is convex.
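
For symmetric $$2 \times 2$$ matrices the largest eigenvalue has a closed form, which allows a quick sanity check of the convexity claim. The sketch below samples random pairs of matrices (parameterized by their three independent entries) and records the smallest gap in the convexity inequality. The sampling range and seed are arbitrary.

```python
import math
import random

def lambda_max(a, b, d):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    return 0.5 * (a + d) + math.hypot(0.5 * (a - d), b)

random.seed(1)
worst_gap = float("inf")
for _ in range(10000):
    X = [random.uniform(-3, 3) for _ in range(3)]
    Y = [random.uniform(-3, 3) for _ in range(3)]
    theta = random.random()
    Z = [theta * x + (1 - theta) * y for x, y in zip(X, Y)]
    # Convexity requires this gap to be non-negative.
    gap = theta * lambda_max(*X) + (1 - theta) * lambda_max(*Y) - lambda_max(*Z)
    worst_gap = min(worst_gap, gap)
print(worst_gap)  # non-negative up to rounding
```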

Last example:

(non-symmetric matrices)

\label{eqn:convexOptimizationLecture8:560}
F(X) = \sigma_\max(X),

is convex on $$\textrm{dom} F = \mathbb{R}^{m \times n}$$. Here

\label{eqn:convexOptimizationLecture8:580}
\sigma_\max(X) = \sup_{\Norm{\Bv}_2 = 1} \Norm{X \Bv}_2

This is called an operator norm of $$X$$. Using the SVD

\label{eqn:convexOptimizationLecture8:600}
\begin{aligned}
X &= U \Sigma V^\T \\
U &\in \mathbb{R}^{m \times r} \\
\Sigma &= \mathrm{diag}(\sigma_1, \cdots, \sigma_r) \in \mathbb{R}^{r \times r} \\
V^\T &\in \mathbb{R}^{r \times n}.
\end{aligned}

Have

\label{eqn:convexOptimizationLecture8:620}
\Norm{X \Bv}_2^2
=
\Norm{ U \Sigma V^\T \Bv }_2^2
=
\Bv^\T V \Sigma U^\T U \Sigma V^\T \Bv
=
\Bv^\T V \Sigma \Sigma V^\T \Bv
=
\Bv^\T V \Sigma^2 V^\T \Bv
=
\tilde{\Bv}^\T \Sigma^2 \tilde{\Bv},

where $$\tilde{\Bv} = V^\T \Bv$$, so

\label{eqn:convexOptimizationLecture8:640}
\Norm{X \Bv}_2^2
=
\sum_{i = 1}^r \sigma_i^2 \tilde{v}_i^2
\le \sigma_\max^2 \Norm{\tilde{\Bv}}^2,

or
\label{eqn:convexOptimizationLecture8:660}
\Norm{X \Bv}_2
\le \sqrt{ \sigma_\max^2 } \Norm{\tilde{\Bv}}
\le
\sigma_\max.

Set $$\Bv$$ to the right singular vector of $$X$$ associated with $$\sigma_\max$$ to get equality. For each fixed $$\Bv$$, $$\Norm{X \Bv}_2$$ is a convex function of $$X$$, so $$\sigma_\max$$ is a supremum of convex functions and is convex.

# References

[1] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

## ECE1505H Convex Optimization. Lecture 6: First and second order conditions. Taught by Prof. Stark Draper

### Today

• First and second order conditions for convexity of differentiable functions.
• Consequences of convexity: local and global optimality.
• Properties.

### Quasi-convex

$$F_1$$ and $$F_2$$ convex implies $$\max( F_1, F_2)$$ convex.

fig. 1. Min and Max

Note that $$\min(F_1, F_2)$$ is not necessarily convex.

If $$F : \mathbb{R}^n \rightarrow \mathbb{R}$$ is convex, then $$F( \Bx_0 + t \Bv )$$ is convex in $$t\,\forall t \in \mathbb{R}, \Bx_0 \in \mathbb{R}^n, \Bv \in \mathbb{R}^n$$, provided $$\Bx_0 + t \Bv \in \textrm{dom} F$$.

Idea: Restrict to a line (line segment) in $$\textrm{dom} F$$. Take a cross section or slice through $$F$$ along the line. If the result is a 1D convex function for all slices, then $$F$$ is convex.

This is nice since it allows for checking for convexity, and is also nice numerically. Attempting to test a given data set for non-convexity with some random lines can help disprove convexity. However, to show that $$F$$ is convex it is required to test all possible slices (which isn’t possible numerically, but is in some circumstances possible analytically).
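
The idea of using random slices to disprove convexity can be sketched as follows. The helper `find_violation` and the two test functions are hypothetical names introduced for illustration. A returned triple is a certificate of non-convexity, whereas `None` only means that no counterexample was found among the sampled chords.

```python
import math
import random

def find_violation(f, dim, trials=2000, seed=0):
    """Search random chords for a violation of the convexity inequality.

    Returns (x, y, theta) with f(theta x + (1-theta) y) > theta f(x) + (1-theta) f(y),
    or None if no violation was found. None does NOT prove convexity.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-3, 3) for _ in range(dim)]
        y = [rng.uniform(-3, 3) for _ in range(dim)]
        theta = rng.random()
        z = [theta * a + (1 - theta) * b for a, b in zip(x, y)]
        if f(z) > theta * f(x) + (1 - theta) * f(y) + 1e-9:
            return x, y, theta
    return None

def quadratic(p):
    """Sum of squares: convex."""
    return sum(c * c for c in p)

def wavy(p):
    """Sum of sines: not convex."""
    return sum(math.sin(c) for c in p)

print(find_violation(quadratic, dim=2))         # no counterexample found
print(find_violation(wavy, dim=2) is not None)  # a counterexample exists
```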

### Differentiable (convex) functions

Definition: First order condition.

If

\begin{equation*}
F : \mathbb{R}^n \rightarrow \mathbb{R}
\end{equation*}

is differentiable, then $$F$$ is convex iff $$\textrm{dom} F$$ is a convex set and $$\forall \Bx, \Bx_0 \in \textrm{dom} F$$

\begin{equation*}
F(\Bx) \ge F(\Bx_0) + \lr{\spacegrad F(\Bx_0)}^\T (\Bx - \Bx_0).
\end{equation*}

This is the first order Taylor expansion. If $$n = 1$$, this is $$F(x) \ge F(x_0) + F'(x_0) ( x - x_0)$$.

The first order condition says a convex function always lies above its first order approximation, as sketched in fig. 2.

fig. 2. First order approximation lies below convex function

When differentiable, the supporting plane is the tangent plane.
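
As an illustration of the first order condition, the sketch below uses log-sum-exp, a standard convex function that is my own choice of example rather than one from this lecture, and checks on random pairs that the function never drops below its tangent plane.

```python
import math
import random

def log_sum_exp(x):
    """F(x) = log(sum_i exp(x_i)), a standard convex function."""
    m = max(x)
    return m + math.log(sum(math.exp(xi - m) for xi in x))

def grad_log_sum_exp(x):
    """Gradient of log-sum-exp (the softmax of x)."""
    m = max(x)
    w = [math.exp(xi - m) for xi in x]
    s = sum(w)
    return [wi / s for wi in w]

def tangent_plane(f, grad_f, x0):
    """First order Taylor approximation of f about x0."""
    f0, g0 = f(x0), grad_f(x0)
    return lambda x: f0 + sum(g * (a - b) for g, a, b in zip(g0, x, x0))

random.seed(2)
smallest_gap = float("inf")
for _ in range(5000):
    x0 = [random.uniform(-4, 4) for _ in range(3)]
    x = [random.uniform(-4, 4) for _ in range(3)]
    approx = tangent_plane(log_sum_exp, grad_log_sum_exp, x0)
    # F(x) - (first order approximation at x0) should never be negative.
    smallest_gap = min(smallest_gap, log_sum_exp(x) - approx(x))
print(smallest_gap)  # non-negative up to rounding
```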

Definition: Second order condition

If $$F : \mathbb{R}^n \rightarrow \mathbb{R}$$ is twice differentiable, then $$F$$ is convex iff $$\textrm{dom} F$$ is a convex set and $$\spacegrad^2 F(\Bx) \ge 0 \,\forall \Bx \in \textrm{dom} F$$.

The Hessian is always symmetric, but is not necessarily positive semi-definite. Recall that the Hessian is the matrix of the second order partials $$(\spacegrad^2 F)_{ij} = \partial^2 F/(\partial x_i \partial x_j)$$.

The scalar case is $$F''(x) \ge 0 \, \forall x \in \textrm{dom} F$$.

One direction of the first order condition (shown here for $$n = 1$$) is that if $$F$$ is convex, then $$F(x) \ge F(x_0) + F'(x_0) (x - x_0) \,\forall x, x_0 \in \textrm{dom} F$$

Since $$F$$ is convex, $$\textrm{dom} F$$ is convex.

Consider any 2 points $$x, y \in \textrm{dom} F$$, and $$\theta \in [0,1]$$. Define

\label{eqn:convexOptimizationLecture6:60}
z = (1-\theta) x + \theta y \in \textrm{dom} F,

where $$z \in \textrm{dom} F$$ because $$\textrm{dom} F$$ is convex. Since $$F$$ is convex,

\label{eqn:convexOptimizationLecture6:80}
F(z) =
F( (1-\theta) x + \theta y )
\le
(1-\theta) F(x) + \theta F(y )

Reordering

\label{eqn:convexOptimizationLecture6:220}
\theta F(y) \ge
\theta F(x) + F(z) - F(x),

or
\label{eqn:convexOptimizationLecture6:100}
F(y) \ge
F(x) + \frac{F(x + \theta(y-x)) - F(x)}{\theta},

which is, in the limit $$\theta \rightarrow 0$$,

\label{eqn:convexOptimizationLecture6:120}
F(y) \ge
F(x) + F'(x) (y - x),

completing one direction of the proof.

To prove the other direction, showing that

\label{eqn:convexOptimizationLecture6:140}
F(x) \ge F(x_0) + F'(x_0) (x - x_0),

implies that $$F$$ is convex. Take any $$x, y \in \textrm{dom} F$$ and any $$\theta \in [0,1]$$. Define

\label{eqn:convexOptimizationLecture6:160}
z = \theta x + (1 -\theta) y,

which is in $$\textrm{dom} F$$ by assumption. We want to show that

\label{eqn:convexOptimizationLecture6:180}
F(z) \le \theta F(x) + (1-\theta) F(y).

By assumption

1. $$F(x) \ge F(z) + F'(z) (x - z)$$
2. $$F(y) \ge F(z) + F'(z) (y - z)$$

Compute

\label{eqn:convexOptimizationLecture6:200}
\begin{aligned}
\theta F(x) + (1-\theta) F(y)
&\ge
\theta \lr{ F(z) + F'(z) (x - z) }
+ (1-\theta) \lr{ F(z) + F'(z) (y - z) } \\
&=
F(z) + F'(z) \lr{ \theta( x - z) + (1-\theta) (y-z) } \\
&=
F(z) + F'(z) \lr{ \theta x + (1-\theta) y - \theta z - (1 -\theta) z } \\
&=
F(z) + F'(z) \lr{ \theta x + (1-\theta) y - z} \\
&=
F(z) + F'(z) \lr{ z - z} \\
&= F(z).
\end{aligned}

### Proof of the 2nd order case for $$n = 1$$

Want to prove that if

\label{eqn:convexOptimizationLecture6:240}
F : \mathbb{R} \rightarrow \mathbb{R}

is a convex function, then $$F''(x) \ge 0 \,\forall x \in \textrm{dom} F$$.

By the first order conditions $$\forall x \ne y \in \textrm{dom} F$$

\label{eqn:convexOptimizationLecture6:260}
\begin{aligned}
F(y) &\ge F(x) + F'(x) (y - x) \\
F(x) &\ge F(y) + F'(y) (x - y)
\end{aligned}

Can combine and get

\label{eqn:convexOptimizationLecture6:280}
F'(x) (y-x) \le F(y) - F(x) \le F'(y)(y-x)

Subtracting the left-hand term from the right-hand term, and dividing by $$(y - x)^2 > 0$$, gives

\label{eqn:convexOptimizationLecture6:340}
\frac{(F'(y) - F'(x))(y - x)}{(y - x)^2} \ge 0,

or
\label{eqn:convexOptimizationLecture6:300}
\frac{F'(y) - F'(x)}{y - x} \ge 0.

In the limit as $$y \rightarrow x$$, this is
\label{eqn:convexOptimizationLecture6:320}
\boxed{
F''(x) \ge 0 \,\forall x \in \textrm{dom} F.
}

Now prove the reverse condition:

If $$F''(x) \ge 0 \,\forall x \in \textrm{dom} F \subseteq \mathbb{R}$$, then $$F : \mathbb{R} \rightarrow \mathbb{R}$$ is convex.

Note that if $$F''(x) \ge 0$$, then $$F'(x)$$ is non-decreasing in $$x$$.

i.e. If $$x < y$$, where $$x, y \in \textrm{dom} F$$, then

\label{eqn:convexOptimizationLecture6:360}
F'(x) \le F'(y).

Consider any $$x,y \in \textrm{dom} F$$ such that $$x < y$$. Then

\label{eqn:convexOptimizationLecture6:380}
F(y) - F(x) = \int_x^y F'(t) dt \ge F'(x) \int_x^y 1 dt = F'(x) (y-x).

This tells us that

\label{eqn:convexOptimizationLecture6:400}
F(y) \ge F(x) + F'(x)(y - x),

which is the first order condition. Similarly, for the same $$x < y$$,

\label{eqn:convexOptimizationLecture6:420}
F(y) - F(x) = \int_x^y F'(t) dt \le F'(y) \int_x^y 1 dt = F'(y) (y-x).

This tells us that

\label{eqn:convexOptimizationLecture6:440}
F(x) \ge F(y) + F'(y)(x - y).

Together these give the first order condition for every pair of points in $$\textrm{dom} F$$, so $$F$$ is convex.

### Vector proof:

$$F$$ is convex iff $$F(\Bx + t \Bv)$$ is convex $$\forall \Bx,\Bv \in \mathbb{R}^n, t \in \mathbb{R}$$, keeping $$\Bx + t \Bv \in \textrm{dom} F$$.

Let
\label{eqn:convexOptimizationLecture6:460}
h(t ; \Bx, \Bv) = F(\Bx + t \Bv)

then $$h(t)$$ satisfies scalar first and second order conditions for all $$\Bx, \Bv$$.

\label{eqn:convexOptimizationLecture6:480}
h(t) = F(\Bx + t \Bv) = F(g(t)),

where $$g(t) = \Bx + t \Bv$$, where

\label{eqn:convexOptimizationLecture6:500}
\begin{aligned}
F &: \mathbb{R}^n \rightarrow \mathbb{R} \\
g &: \mathbb{R} \rightarrow \mathbb{R}^n.
\end{aligned}

This is expressing $$h(t)$$ as a composition of two functions. By the first order condition for scalar functions we know that

\label{eqn:convexOptimizationLecture6:520}
h(t) \ge h(0) + h'(0) t.

Note that

\label{eqn:convexOptimizationLecture6:540}
h(0) = \evalbar{F(\Bx + t \Bv)}{t = 0} = F(\Bx).

Let’s figure out what $$h'(0)$$ is. Recall that for any $$\tilde{F} : \mathbb{R}^n \rightarrow \mathbb{R}^m$$

\label{eqn:convexOptimizationLecture6:560}
D \tilde{F} \in \mathbb{R}^{m \times n},

and
\label{eqn:convexOptimizationLecture6:580}
{D \tilde{F}(\Bx)}_{ij} = \PD{x_j}{\tilde{F_i}(\Bx)}

This is one function per row, for $$i \in [1,m], j \in [1,n]$$. This gives

\label{eqn:convexOptimizationLecture6:600}
\begin{aligned}
\frac{d}{dt} F(\Bx + \Bv t)
&=
\frac{d}{dt} F( g(t) ) \\
&=
\frac{d}{dt} h(t) \\
&= D h(t) \\
&= D F(g(t)) \cdot D g(t)
\end{aligned}

The first matrix is in $$\mathbb{R}^{1\times n}$$ whereas the second is in $$\mathbb{R}^{n\times 1}$$, since $$F : \mathbb{R}^n \rightarrow \mathbb{R}$$ and $$g : \mathbb{R} \rightarrow \mathbb{R}^n$$. This gives

\label{eqn:convexOptimizationLecture6:620}
\frac{d}{dt} F(\Bx + \Bv t)
= \evalbar{D F(\tilde{\Bx})}{\tilde{\Bx} = g(t)} \cdot D g(t).

That first matrix is

\label{eqn:convexOptimizationLecture6:640}
\begin{aligned}
\evalbar{D F(\tilde{\Bx})}{\tilde{\Bx} = g(t)}
&=
\evalbar{
\lr{\begin{bmatrix}
\PD{\tilde{x}_1}{ F(\tilde{\Bx})} &
\PD{\tilde{x}_2}{ F(\tilde{\Bx})} & \cdots &
\PD{\tilde{x}_n}{ F(\tilde{\Bx})}
\end{bmatrix}
}}{ \tilde{\Bx} = g(t) = \Bx + t \Bv } \\
&=
\evalbar{
\lr{ \spacegrad F(\tilde{\Bx}) }^\T
}{
\tilde{\Bx} = g(t)
} \\
&=
\lr{ \spacegrad F(g(t)) }^\T.
\end{aligned}

The second Jacobian is

\label{eqn:convexOptimizationLecture6:660}
D g(t)
=
D
\begin{bmatrix}
g_1(t) \\
g_2(t) \\
\vdots \\
g_n(t) \\
\end{bmatrix}
=
D
\begin{bmatrix}
x_1 + t v_1 \\
x_2 + t v_2 \\
\vdots \\
x_n + t v_n \\
\end{bmatrix}
=
\begin{bmatrix}
v_1 \\
v_2 \\
\vdots \\
v_n \\
\end{bmatrix}
=
\Bv.

so

\label{eqn:convexOptimizationLecture6:680}
h'(t) = D h(t) = \lr{ \spacegrad F(g(t))}^\T \Bv,

and
\label{eqn:convexOptimizationLecture6:700}
h'(0) = \lr{ \spacegrad F(g(0))}^\T \Bv
=
\lr{ \spacegrad F(\Bx)}^\T \Bv.

Finally

\label{eqn:convexOptimizationLecture6:720}
\begin{aligned}
F(\Bx + t \Bv)
&\ge h(0) + h'(0) t \\
&= F(\Bx) + \lr{ \spacegrad F(\Bx) }^\T (t \Bv) \\
&= F(\Bx) + \innerprod{ \spacegrad F(\Bx) }{ t \Bv}.
\end{aligned}

Which is true for all $$\Bx, \Bx + t \Bv \in \textrm{dom} F$$. Note that the quantity $$t \Bv$$ is a shift.
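
The chain rule result $$h'(0) = \lr{\spacegrad F(\Bx)}^\T \Bv$$ is easy to confirm numerically. The function `F` below is an arbitrary smooth example of my own (it is not convex, which does not matter here since only the derivative identity is being checked), and the point and direction are hypothetical values.

```python
import math

def F(x):
    """An arbitrary smooth function R^2 -> R, used only to test the chain rule."""
    return x[0] ** 2 + 3.0 * x[0] * x[1] + math.exp(x[1])

def grad_F(x):
    """Hand-computed gradient of F."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + math.exp(x[1])]

def h(t, x, v):
    """Restriction of F to the line x + t v."""
    return F([xi + t * vi for xi, vi in zip(x, v)])

x = [1.0, -0.5]   # hypothetical base point
v = [0.3, 2.0]    # hypothetical direction
eps = 1e-6
numeric = (h(eps, x, v) - h(-eps, x, v)) / (2.0 * eps)   # central difference for h'(0)
analytic = sum(g * vi for g, vi in zip(grad_F(x), v))    # (grad F(x))^T v
print(numeric, analytic)  # the two values agree closely
```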

### Epigraph

Recall that if $$(\Bx, t) \in \textrm{epi} F$$ then $$t \ge F(\Bx)$$.

\label{eqn:convexOptimizationLecture6:740}
t \ge F(\Bx) \ge F(\Bx_0) + \lr{\spacegrad F(\Bx_0) }^\T (\Bx - \Bx_0),

or

\label{eqn:convexOptimizationLecture6:760}
0 \ge
-(t - F(\Bx_0)) + \lr{\spacegrad F(\Bx_0) }^\T (\Bx - \Bx_0),

In block matrix form

\label{eqn:convexOptimizationLecture6:780}
0 \ge
\begin{bmatrix}
\lr{ \spacegrad F(\Bx_0) }^\T & -1
\end{bmatrix}
\begin{bmatrix}
\Bx - \Bx_0 \\
t - F(\Bx_0)
\end{bmatrix}

With $$\Bw = \begin{bmatrix} \lr{ \spacegrad F(\Bx_0) }^\T & -1 \end{bmatrix}$$, the geometry of the epigraph relation to the half plane is sketched in fig. 3.

fig. 3. Half planes and epigraph.

# References

[1] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

## Disclaimer

Peeter’s lecture notes from class. These may be incoherent and rough.

These are notes for the UofT course ECE1505H, Convex Optimization, taught by Prof. Stark Draper, from [1].

## Last time

• examples of sets: planes, half spaces, balls, ellipses, cone of positive semi-definite matrices
• generalized inequalities
• examples of convexity preserving operations

## Today

• more examples of convexity preserving operations
• separating and supporting hyperplanes
• basic definitions of convex functions
• epigraphs, quasi-convexity, sublevel sets
• first and second order conditions for convexity of differentiable functions.

## Operations that preserve convexity

If $$S_\alpha$$ is convex $$\forall \alpha \in A$$, then the intersection

\label{eqn:convexOptimizationLecture5:40}
\cap_{\alpha \in A} S_\alpha,

is convex.

Example:

\label{eqn:convexOptimizationLecture5:60}
F(\Bx) = A \Bx + \Bb

\label{eqn:convexOptimizationLecture5:80}
\begin{aligned}
\Bx &\in \mathbb{R}^n \\
A &\in \mathbb{R}^{m \times n} \\
F &: \mathbb{R}^{n} \rightarrow \mathbb{R}^m \\
\Bb &\in \mathbb{R}^m
\end{aligned}

1. If $$S \subseteq \mathbb{R}^n$$ is convex, then
\label{eqn:convexOptimizationLecture5:100}
F(S) = \setlr{ F(\Bx) | \Bx \in S }
is convex if $$F$$ is affine.
2. If $$S \subseteq \mathbb{R}^m$$ is convex, then
\label{eqn:convexOptimizationLecture5:120}
F^{-1}(S) = \setlr{ \Bx | F(\Bx) \in S }
is convex if $$F$$ is affine.

Example:

\label{eqn:convexOptimizationLecture5:140}
\setlr{ \By | \By = A \Bx + \Bb, \Norm{\Bx} \le 1}

is convex. Here $$A \Bx + \Bb$$ is an affine function $$F(\Bx)$$. This is the image of a (convex) unit ball, through an affine map.

We saw this earlier when defining ellipses

\label{eqn:convexOptimizationLecture5:160}
\By = P^{1/2} \Bx + \Bx_c

Example :

\label{eqn:convexOptimizationLecture5:180}
\setlr{ \Bx | \Norm{ A \Bx + \Bb } \le 1 },

is convex. This can be seen by writing

\label{eqn:convexOptimizationLecture5:200}
\begin{aligned}
\setlr{ \Bx | \Norm{ A \Bx + \Bb } \le 1 }
&=
\setlr{ \Bx | \Norm{ F(\Bx) } \le 1 } \\
&=
\setlr{ \Bx | F(\Bx) \in \mathcal{B} },
\end{aligned}

where $$\mathcal{B} = \setlr{ \By | \Norm{\By} \le 1 }$$. This is the pre-image (under $$F()$$) of a unit norm ball.

Example:

\label{eqn:convexOptimizationLecture5:220}
\setlr{ \Bx \in \mathbb{R}^n | x_1 A_1 + x_2 A_2 + \cdots + x_n A_n \le B }

where $$A_i \in S^m$$ and $$B \in S^m$$, and the inequality is a matrix inequality. This is a convex set. The constraint is a “linear matrix inequality” (LMI).

This has to do with an affine map:

\label{eqn:convexOptimizationLecture5:240}
F(\Bx) = B - x_1 A_1 - x_2 A_2 - \cdots - x_n A_n \ge 0

(positive semi-definite inequality). This is a mapping

\label{eqn:convexOptimizationLecture5:480}
F : \mathbb{R}^n \rightarrow S^m,

since all $$A_i$$ and $$B$$ are in $$S^m$$.

This $$F(\Bx) = B - A(\Bx)$$ is a constant plus a term linear in $$\Bx$$, so is affine. Can be written

\label{eqn:convexOptimizationLecture5:260}
\setlr{ \Bx | B - A(\Bx) \ge 0 }
=
\setlr{ \Bx | B - A(\Bx) \in S^m_{+} }

This is a pre-image of a cone of PSD matrices, which is convex. Therefore, this is a convex set.
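
A small LMI can be explored directly for $$m = 2$$, where positive semi-definiteness reduces to sign checks on the diagonal and the determinant. The matrices `A` and `B` below are hypothetical data chosen so that the feasible set works out to the unit disc $$x_1^2 + x_2^2 \le 1$$. The sketch samples feasible points and confirms that midpoints of feasible pairs remain feasible.

```python
import random

# Hypothetical 2x2 data for the LMI  x_1 A_1 + x_2 A_2 <= B.
A = [[[1.0, 0.0], [0.0, -1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
B = [[1.0, 0.0], [0.0, 1.0]]

def slack(x):
    """F(x) = B - x_1 A_1 - x_2 A_2, an affine map from R^2 into S^2."""
    return [[B[i][j] - sum(xk * Ak[i][j] for xk, Ak in zip(x, A))
             for j in range(2)] for i in range(2)]

def is_psd_2x2(M, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are >= 0."""
    (a, b), (_, d) = M
    return a >= -tol and d >= -tol and a * d - b * b >= -tol

def feasible(x):
    """True if x satisfies the LMI, i.e. F(x) is PSD."""
    return is_psd_2x2(slack(x))

random.seed(3)
points = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(4000)]
inside = [p for p in points if feasible(p)]

# Every midpoint of two feasible points should again be feasible.
midpoints_ok = all(
    feasible([(p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0])
    for p, q in zip(inside, inside[1:])
)
print(len(inside) > 0, midpoints_ok)
```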

## Separating hyperplanes

Theorem: Separating hyperplanes

If $$S, T \subseteq \mathbb{R}^n$$ are convex and disjoint,
i.e. $$S \cap T = \emptyset$$, then
there exists an $$\Ba \in \mathbb{R}^n$$, $$\Ba \ne 0$$, and a $$b \in \mathbb{R}$$ such that

\begin{equation*}
\Ba^\T \Bx \ge b \, \forall \Bx \in S
\end{equation*}

and
\begin{equation*}
\Ba^\T \Bx \le b \,\forall \Bx \in T.
\end{equation*}

An example of a hyperplane that separates two sets, and of two sets that are not separable, is sketched in fig 1.1

Proof in the book.

Theorem: Supporting hyperplane
If $$S$$ is convex then $$\forall \Bx_0 \in \partial S = \textrm{cl}(S) \setminus \textrm{int}(S)$$, where
$$\partial S$$ is the boundary of $$S$$, $$\exists$$ an $$\Ba \ne 0 \in \mathbb{R}^n$$ such that $$\Ba^\T \Bx \le \Ba^\T \Bx_0 \, \forall \Bx \in S$$.

Here $$\setminus$$ denotes “without”.

An example is sketched in fig. 3, for which

fig. 3. Supporting hyperplane.

• The vector $$\Ba$$ is perpendicular to the tangent plane.
• The inner product satisfies $$\Ba^\T (\Bx - \Bx_0) \le 0$$.

A set with a supporting hyperplane is sketched in fig 4a whereas fig 4b shows that there is not necessarily a unique supporting hyperplane at any given point, even if $$S$$ is convex.

fig 4a. Set with supporting hyperplane.

fig 4b. No unique supporting hyperplane possible.

## Basic definitions of convex functions

Theorem: Convex functions
If $$F : \mathbb{R}^n \rightarrow \mathbb{R}$$ is defined on a convex domain (i.e. $$\textrm{dom} F \subseteq \mathbb{R}^n$$ is a convex set), then $$F$$ is convex if $$\forall \Bx, \By \in \textrm{dom} F$$, $$\forall \theta \in [0,1]$$

\label{eqn:convexOptimizationLecture5:340}
F( \theta \Bx + (1-\theta) \By ) \le \theta F(\Bx) + (1-\theta) F(\By)

An example is sketched in fig. 5.

fig. 5. Example of convex function.

Remarks

• Require $$\textrm{dom} F$$ to be a convex set. This is required so that the function at the point $$\theta u + (1-\theta) v$$ can be evaluated, i.e. so that $$F(\theta u + (1-\theta) v)$$ is well defined. Example: $$\textrm{dom} F = (-\infty, 0] \cup [1, \infty)$$ is not okay, because a convex combination could land in $$(0,1)$$, where $$F$$ is undefined.
• Parameter $$\theta$$ is “how much up” the line segment connecting $$(u, F(u))$$ and $$(v, F(v))$$. This line segment is never below the bottom of the bowl.
• The function is concave if $$-F$$ is convex, i.e. if the convex function is flipped upside down. That is
\label{eqn:convexOptimizationLecture5:360}
F(\theta \Bx + (1-\theta) \By ) \ge \theta F(\Bx) + (1-\theta) F(\By) \,\forall \Bx,\By \in \textrm{dom} F, \theta \in [0,1].
• A “strictly” convex function means $$\forall \theta \in (0,1)$$ and $$\Bx \ne \By$$
\label{eqn:convexOptimizationLecture5:380}
F(\theta \Bx + (1-\theta) \By ) < \theta F(\Bx) + (1-\theta) F(\By).
• Strictly concave function $$F$$ means $$-F$$ is strictly convex.
• Examples:

fig 6a. Not convex or concave.

fig 6b. Not strictly convex

Definition: Epigraph of a function

The epigraph $$\textrm{epi} F$$ of a function $$F : \mathbb{R}^n \rightarrow \mathbb{R}$$ is

\begin{equation*}
\textrm{epi} F = \setlr{ (\Bx,t) \in \mathbb{R}^{n +1} | \Bx \in \textrm{dom} F, t \ge F(\Bx) },
\end{equation*}

where $$\Bx \in \mathbb{R}^n, t \in \mathbb{R}$$.

fig. 7. Epigraph.

Theorem: Convexity and epigraph.
If $$F$$ is convex then $$\textrm{epi} F$$ is a convex set.

Proof:

For a convex function, a line segment connecting any 2 points on the function is above the function. i.e. it is in $$\textrm{epi} F$$.

Many authors will go the other way around, showing the definition of a convex function from the convexity of its epigraph. That is:

Pick any 2 points in $$\textrm{epi} F$$, $$(\Bx,\mu) \in \textrm{epi} F$$ and $$(\By, \nu) \in \textrm{epi} F$$. Consider convex combination

\label{eqn:convexOptimizationLecture5:420}
\theta( \Bx, \mu ) + (1-\theta) (\By, \nu) =
(\theta \Bx + (1-\theta) \By, \theta \mu + (1-\theta) \nu )
\in \textrm{epi} F,

since $$\textrm{epi} F$$ is a convex set.

By definition of $$\textrm{epi} F$$

\label{eqn:convexOptimizationLecture5:440}
F( \theta \Bx + (1-\theta) \By ) \le \theta \mu + (1-\theta) \nu.

Picking $$\mu = F(\Bx), \nu = F(\By)$$ gives
\label{eqn:convexOptimizationLecture5:460}
F( \theta \Bx + (1-\theta) \By ) \le \theta F(\Bx) + (1-\theta) F(\By).

## Extended value function

Sometimes convenient to work with “extended value function”

\label{eqn:convexOptimizationLecture5:500}
\tilde{F}(\Bx) =
\left\{
\begin{array}{l l}
F(\Bx) & \quad \mbox{If $$\Bx \in \textrm{dom} F$$} \\
\infty & \quad \mbox{otherwise.}
\end{array}
\right.

Examples:

• Linear (affine) functions (fig. 8) are both convex and concave.

fig. 8. Linear functions.

• $$x^2$$ is convex, sketched in fig. 9.

• $$\log x$$, with $$\textrm{dom} F = \mathbb{R}_{++}$$, is concave, sketched in fig. 10.

fig. 10. Concave (logarithm.)

• $$\Norm{\Bx}$$ is convex. $$\Norm{ \theta \Bx + (1-\theta) \By } \le \theta \Norm{ \Bx } + (1-\theta) \Norm{\By }$$.
• $$1/x$$ is convex on $$\setlr{ x | x > 0 } = \textrm{dom} F$$, and concave on $$\setlr{ x | x < 0 } = \textrm{dom} F$$. The extended value function for the convex branch is
\label{eqn:convexOptimizationLecture5:520}
\tilde{F}(x) =
\left\{
\begin{array}{l l}
\inv{x} & \quad \mbox{If $$x > 0$$} \\
\infty & \quad \mbox{otherwise.}
\end{array}
\right.
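The examples above can be spot checked numerically. This is only a sketch: the helper name `violates_convexity`, the sample grid, and the tolerance are my own choices, and passing on a finite sample is evidence of convexity, not a proof.

```python
import math

def violates_convexity(f, points, thetas, tol=1e-12):
    """Return True if some sampled (x, y, theta) breaks
    f(theta x + (1 - theta) y) <= theta f(x) + (1 - theta) f(y)."""
    for x in points:
        for y in points:
            for theta in thetas:
                lhs = f(theta * x + (1 - theta) * y)
                rhs = theta * f(x) + (1 - theta) * f(y)
                if lhs > rhs + tol:
                    return True
    return False

# Sample only positive points so that log x and 1/x are defined.
points = [0.1 * k for k in range(1, 51)]
thetas = [0.1 * k for k in range(11)]

square_ok = not violates_convexity(lambda x: x * x, points, thetas)
inverse_ok = not violates_convexity(lambda x: 1.0 / x, points, thetas)
log_is_convex = not violates_convexity(math.log, points, thetas)
neg_log_ok = not violates_convexity(lambda x: -math.log(x), points, thetas)
```

On this sample `x * x`, `1 / x` (for positive x) and `-log x` never violate the inequality, while `log x` does, consistent with it being concave.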

Definition: Sublevel

The sublevel set of a function $$F : \mathbb{R}^n \rightarrow \mathbb{R}$$ is

\begin{equation*}
C(\alpha) = \setlr{ \Bx \in \textrm{dom} F | F(\Bx) \le \alpha }
\end{equation*}

Convex sublevel

Non-convex sublevel.

Theorem:
If $$F$$ is convex then $$C(\alpha)$$ is a convex set $$\forall \alpha$$.

This is not an if and only if condition, as illustrated in fig. 12.

fig. 12. Convex sublevel does not imply convexity.

There $$C(\alpha)$$ is convex, but the function itself is not.

Proof:

Since $$F$$ is convex, then $$\textrm{epi} F$$ is a convex set.

• The set\label{eqn:convexOptimizationLecture5:580}
\mathcal{A} = \setlr{ (\Bx,t) | t = \alpha }
is a convex set (a hyperplane in $$\mathbb{R}^{n+1}$$).
• $$\mathcal{A} \cap \textrm{epi} F$$ is a convex set since it is the intersection of convex sets.
• Project $$\mathcal{A} \cap \textrm{epi} F$$ onto $$\mathbb{R}^n$$ (i.e. the domain of $$F$$). The projection is an affine mapping, and the image of a convex set under an affine mapping is a convex set.

Definition: Quasi-convex.

A function is quasi-convex if all of its sublevel sets are convex.
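To make the gap between convex and quasi-convex concrete, here is a small sketch using $$\sqrt{\Abs{x}}$$, an example function that I picked myself (it is not from the lecture). Its sublevel sets are the intervals $$[-\alpha^2, \alpha^2]$$, yet the function is not convex. The grid and the helper `sublevel_is_interval` are illustrative assumptions.

```python
import math

def f(x):
    # Hypothetical example: convex sublevel sets, but not a convex function.
    return math.sqrt(abs(x))

def sublevel_is_interval(func, grid, alpha):
    """On a sorted 1D grid, check that {x : func(x) <= alpha} has no gaps."""
    inside = [i for i, x in enumerate(grid) if func(x) <= alpha]
    return all(b - a == 1 for a, b in zip(inside, inside[1:]))

grid = [k / 100.0 for k in range(-400, 401)]
alphas = [0.25 * k for k in range(0, 9)]
all_sublevels_convex = all(sublevel_is_interval(f, grid, a) for a in alphas)

# Convexity itself fails at the midpoint of x = 0 and y = 1.
midpoint_value = f(0.5)
chord_value = 0.5 * f(0.0) + 0.5 * f(1.0)
violates_jensen = midpoint_value > chord_value
```

Every sampled sublevel set is a contiguous run of grid points, while the midpoint value exceeds the chord value, so the function is quasi-convex without being convex.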

## Composing convex functions

Properties of convex functions:

• If $$F$$ is convex, then $$\alpha F$$ is convex $$\forall \alpha > 0$$.
• If $$F_1, F_2$$ are convex, then the sum $$F_1 + F_2$$ is convex.
• If $$F$$ is convex, then $$g(\Bx) = F(A \Bx + \Bb)$$ is convex $$\forall \Bx \in \setlr{ \Bx | A \Bx + \Bb \in \textrm{dom} F }$$.

Note: for the last

\label{eqn:convexOptimizationLecture5:620}
\begin{aligned}
g &: \mathbb{R}^m \rightarrow \mathbb{R} \\
F &: \mathbb{R}^n \rightarrow \mathbb{R} \\
\Bx &\in \mathbb{R}^m \\
A &\in \mathbb{R}^{n \times m} \\
\Bb &\in \mathbb{R}^n
\end{aligned}

Proof (of last):

\label{eqn:convexOptimizationLecture5:640}
\begin{aligned}
g( \theta \Bx + (1-\theta) \By )
&=
F( \theta (A \Bx + \Bb) + (1-\theta) (A \By + \Bb) ) \\
&\le
\theta F( A \Bx + \Bb) + (1-\theta) F (A \By + \Bb) \\
&= \theta g(\Bx) + (1-\theta) g(\By).
\end{aligned}

# References

[1] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

## ECE1505H Convex Optimization. Lecture 4: Sets and convexity. Taught by Prof. Stark Draper

January 25, 2017 ece1505 No comments


### Disclaimer

Peeter’s lecture notes from class. These may be incoherent and rough.

These are notes for the UofT course ECE1505H, Convex Optimization, taught by Prof. Stark Draper, covering [1] content.

### Today

• more on various sets: hyperplanes, half-spaces, polyhedra, balls, ellipses, norm balls, cone of PSD
• generalized inequalities
• operations that preserve convexity
• separating and supporting hyperplanes.

## Hyperplanes

Find some $$\Bx_0 \in \mathbb{R}^n$$ such that $$\Ba^\T \Bx_0 = \Bb$$, so

\label{eqn:convexOptimizationLecture4:20}
\begin{aligned}
\setlr{ \Bx | \Ba^\T \Bx = \Bb }
&=
\setlr{ \Bx | \Ba^\T \Bx = \Ba^\T \Bx_0 } \\
&=
\setlr{ \Bx | \Ba^\T (\Bx - \Bx_0) = 0 } \\
&=
\Bx_0 + \Ba^\perp,
\end{aligned}

where

\label{eqn:convexOptimizationLecture4:40}
\Ba^\perp = \setlr{ \Bv | \Ba^\T \Bv = 0 }.

fig. 1. Parallel hyperplanes.

Recall

\label{eqn:convexOptimizationLecture4:60}
\Norm{\Bz}_\conj = \sup_\Bx \setlr{ \Bz^\T \Bx | \Norm{\Bx} \le 1 }

Denote the optimizer of above as $$\Bx^\conj$$. By definition

\label{eqn:convexOptimizationLecture4:80}
\Bz^\T \Bx^\conj \ge \Bz^\T \Bx \quad \forall \Bx, \Norm{\Bx} \le 1

This defines a half space that contains the unit ball

\label{eqn:convexOptimizationLecture4:100}
\setlr{ \Bx | \Bz^\T (\Bx - \Bx^\conj) \le 0 }

Start with the $$l_1$$ norm. The dual of $$l_1$$ is $$l_\infty$$

fig. 2. Half space containing unit ball.

Similar pic for $$l_\infty$$, for which the dual is the $$l_1$$ norm, as sketched in fig. 3.  Here the optimizer point is at $$(1,1)$$

fig. 3. Half space containing the unit ball for l_infinity

and a similar pic for $$l_2$$, which is sketched in fig. 4.

fig. 4. Half space containing the l_2 unit ball.

## Polyhedra

\label{eqn:convexOptimizationLecture4:120}
\begin{aligned}
\mathcal{P}
&= \setlr{ \Bx |
\Ba_j^\T \Bx \le \Bb_j, j \in [1,m],
\Bc_i^\T \Bx = \Bd_i, i \in [1,p]
} \\
&=
\setlr{ \Bx | A \Bx \le \Bb, C \Bx = \Bd },
\end{aligned}

where the final inequality and equality are component wise.

Proving $$\mathcal{P}$$ is convex:

• Pick $$\Bx_1 \in \mathcal{P}$$, $$\Bx_2 \in \mathcal{P}$$
• Pick any $$\theta \in [0,1]$$
• Test $$\theta \Bx_1 + (1-\theta) \Bx_2$$. Is it in $$\mathcal{P}$$?

\label{eqn:convexOptimizationLecture4:140}
\begin{aligned}
A \lr{ \theta \Bx_1 + (1-\theta) \Bx_2 }
&=
\theta A \Bx_1 + (1-\theta) A \Bx_2 \\
&\le
\theta \Bb + (1-\theta) \Bb \\
&=
\Bb.
\end{aligned}
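The same argument can be exercised numerically. The sketch below uses a triangle in the plane as a hypothetical polyhedron (only inequality constraints) and checks that random convex combinations of points in the set stay in the set. The function names and the rejection sampling are my own illustrative choices.

```python
import random

def in_polyhedron(A, b, x, tol=1e-9):
    """Componentwise test of A x <= b (with a small tolerance)."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i + tol
        for row, b_i in zip(A, b)
    )

def sample_point(A, b, rng):
    """Rejection-sample a point of the polyhedron from the unit square."""
    while True:
        p = [rng.random(), rng.random()]
        if in_polyhedron(A, b, p):
            return p

# Hypothetical example: the triangle x >= 0, y >= 0, x + y <= 1.
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
b = [0.0, 0.0, 1.0]

rng = random.Random(0)
all_inside = True
for _ in range(1000):
    x1 = sample_point(A, b, rng)
    x2 = sample_point(A, b, rng)
    theta = rng.random()
    mix = [theta * u + (1 - theta) * v for u, v in zip(x1, x2)]
    all_inside = all_inside and in_polyhedron(A, b, mix)
```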

## Balls

Euclidean ball for $$\Bx_c \in \mathbb{R}^n, r \in \mathbb{R}$$

\label{eqn:convexOptimizationLecture4:160}
\mathcal{B}(\Bx_c, r)
= \setlr{ \Bx | \Norm{\Bx - \Bx_c}_2 \le r },

or
\label{eqn:convexOptimizationLecture4:180}
\mathcal{B}(\Bx_c, r)
= \setlr{ \Bx | \lr{\Bx - \Bx_c}^\T \lr{\Bx - \Bx_c} \le r^2 }.

To show that the ball is convex, let $$\Bx_1, \Bx_2 \in \mathcal{B}(\Bx_c, r)$$, $$\theta \in [0,1]$$

\label{eqn:convexOptimizationLecture4:200}
\begin{aligned}
\Norm{ \theta \Bx_1 + (1-\theta) \Bx_2 - \Bx_c }_2
&=
\Norm{ \theta (\Bx_1 - \Bx_c) + (1-\theta) (\Bx_2 - \Bx_c) }_2 \\
&\le
\Norm{ \theta (\Bx_1 - \Bx_c)}_2 + \Norm{(1-\theta) (\Bx_2 - \Bx_c) }_2 \\
&=
\Abs{\theta} \Norm{ \Bx_1 - \Bx_c}_2 + \Abs{1 -\theta} \Norm{ \Bx_2 - \Bx_c }_2 \\
&=
\theta \Norm{ \Bx_1 - \Bx_c}_2 + \lr{1 -\theta} \Norm{ \Bx_2 - \Bx_c }_2 \\
&\le
\theta r + (1 - \theta) r \\
&= r
\end{aligned}

## Ellipse

\label{eqn:convexOptimizationLecture4:220}
\mathcal{E}(\Bx_c, P)
=
\setlr{ \Bx | (\Bx - \Bx_c)^\T P^{-1} (\Bx - \Bx_c) \le 1 },

where $$P \in S^n_{++}$$.

• Euclidean ball is an ellipse with $$P = I r^2$$
• Ellipse is image of Euclidean ball $$\mathcal{B}(0,1)$$ under affine mapping.

fig. 5. Circle and ellipse.

Given

\label{eqn:convexOptimizationLecture4:240}
F(\Bu) = P^{1/2} \Bu + \Bx_c

\label{eqn:convexOptimizationLecture4:260}
\begin{aligned}
\setlr{ F(\Bu) | \Norm{\Bu}_2 \le r }
&=
\setlr{ P^{1/2} \Bu + \Bx_c | \Bu^\T \Bu \le r^2 } \\
&=
\setlr{ \Bx | \Bx = P^{1/2} \Bu + \Bx_c, \Bu^\T \Bu \le r^2 } \\
&=
\setlr{ \Bx | \Bu = P^{-1/2} (\Bx - \Bx_c), \Bu^\T \Bu \le r^2 } \\
&=
\setlr{ \Bx | (\Bx - \Bx_c)^\T P^{-1} (\Bx - \Bx_c) \le r^2 }
\end{aligned}

## Geometry of an ellipse

Decomposition of positive definite matrix $$P \in S^n_{++} \subset S^n$$ is:

\label{eqn:convexOptimizationLecture4:280}
\begin{aligned}
P &= Q \textrm{diag}(\lambda_i) Q^\T \\
Q^\T Q &= I
\end{aligned},

where $$\lambda_i \in \mathbb{R}$$, and $$\lambda_i > 0$$. The ellipse is defined by

\label{eqn:convexOptimizationLecture4:300}
(\Bx - \Bx_c)^\T Q \textrm{diag}(1/\lambda_i) Q^\T (\Bx - \Bx_c) \le 1.

The term $$(\Bx - \Bx_c)^\T Q$$ projects $$\Bx - \Bx_c$$ onto the columns of $$Q$$. Those columns are perpendicular since $$Q$$ is an orthogonal matrix. Let

\label{eqn:convexOptimizationLecture4:320}
\tilde{\Bx} = Q^\T (\Bx - \Bx_c),

this shifts the origin to $$\Bx_c$$ and $$Q^\T$$ rotates into a new coordinate system. The ellipse is therefore

\label{eqn:convexOptimizationLecture4:340}
\tilde{\Bx}^\T
\begin{bmatrix}
\inv{\lambda_1} & & & \\
& \inv{\lambda_2} & & \\
& & \ddots & \\
& & & \inv{\lambda_n}
\end{bmatrix}
\tilde{\Bx}
=
\sum_{i = 1}^n \frac{\tilde{x}_i^2}{\lambda_i} \le 1.

An example is sketched for $$\lambda_1 > \lambda_2$$ below.

Ellipse with $$\lambda_1 > \lambda_2$$.

• $$\sqrt{\lambda_i}$$ is the length of the $$i$$-th semi-axis (the largest $$\lambda_i$$ gives the semi-major axis).
• Larger $$\lambda_i$$ means $$\tilde{x}_i^2$$ can be bigger and still satisfy constraint $$\le 1$$.
• Volume of ellipse is proportional to $$\sqrt{ \det P } = \sqrt{ \prod_{i = 1}^n \lambda_i }$$.
• When any $$\lambda_i \rightarrow 0$$ a dimension is lost and the volume goes to zero. In that limit $$P$$ is no longer invertible, as the definition requires.
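The geometry above can be checked with a short numerical sketch. The matrix `P` and center `x_c` are made-up values, and NumPy's `eigh` is used for the symmetric eigendecomposition; nothing here is from the lecture itself.

```python
import numpy as np

# Hypothetical positive definite P and center x_c.
P = np.array([[4.0, 1.0], [1.0, 2.0]])
x_c = np.array([1.0, -1.0])

# Spectral decomposition P = Q diag(lambda) Q^T.
lam, Q = np.linalg.eigh(P)
semi_axes = np.sqrt(lam)

# P^{1/2} = Q diag(sqrt(lambda)) Q^T maps the unit circle to the ellipse boundary.
P_half = Q @ np.diag(np.sqrt(lam)) @ Q.T
angles = np.linspace(0.0, 2.0 * np.pi, 50)
U = np.stack([np.cos(angles), np.sin(angles)])  # unit vectors as columns
X = P_half @ U + x_c[:, None]

# Each mapped point should satisfy (x - x_c)^T P^{-1} (x - x_c) = 1.
D = X - x_c[:, None]
quad = np.sum(D * (np.linalg.inv(P) @ D), axis=0)

# In the rotated coordinates x~ = Q^T (x - x_c): sum_i x~_i^2 / lambda_i = 1.
X_tilde = Q.T @ D
quad_rotated = np.sum(X_tilde**2 / lam[:, None], axis=0)
```

The product of `semi_axes` equals $$\sqrt{\det P}$$, which is the volume scaling factor mentioned above.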

Ellipses will be seen a lot in this course, since we are interested in “bowl” like geometries (and the ellipse is the image of a Euclidean ball).

## Norm ball.

The norm ball

\label{eqn:convexOptimizationLecture4:360}
\mathcal{B} = \setlr{ \Bx | \Norm{\Bx} \le 1 },

is a convex set for all norms. Proof:

Take any $$\Bx, \By \in \mathcal{B}$$

\label{eqn:convexOptimizationLecture4:380}
\Norm{ \theta \Bx + (1 - \theta) \By }
\le
\Abs{\theta} \Norm{ \Bx } + \Abs{1 - \theta} \Norm{ \By }
=
\theta \Norm{ \Bx } + \lr{1 - \theta} \Norm{ \By }
\le
\theta + \lr{1 - \theta}
=
1.

This is true for any p-norm $$1 \le p$$, $$\Norm{\Bx}_p = \lr{ \sum_{i = 1}^n \Abs{x_i}^p }^{1/p}$$.

Norm ball.

The shape of a $$p < 1$$ norm unit ball is sketched below (lines connecting points in such a region can exit the region).
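A two-point example shows the failure for $$p < 1$$. This is a minimal sketch, with `p_norm` a helper name of my own. The points $$(1,0)$$ and $$(0,1)$$ lie in the $$p = 1/2$$ “unit ball”, but their midpoint does not.

```python
def p_norm(x, p):
    """(sum |x_i|^p)^(1/p); this is a true norm only for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = (1.0, 0.0)
y = (0.0, 1.0)
mid = tuple(0.5 * a + 0.5 * b for a, b in zip(x, y))

half = p_norm(mid, 0.5)  # (2 sqrt(0.5))^2 = 2, outside the "unit ball"
one = p_norm(mid, 1.0)   # 1, on the boundary of the l_1 ball
two = p_norm(mid, 2.0)   # sqrt(0.5), inside the l_2 ball
```

For $$p = 1$$ and $$p = 2$$ the midpoint stays in the ball, as the triangle inequality proof requires.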

## Cones

Recall that $$C$$ is a cone if $$\forall \Bx \in C, \theta \ge 0, \theta \Bx \in C$$.

An important cone is the cone of PSD matrices

\label{eqn:convexOptimizationLecture4:400}
\begin{aligned}
S^n &= \setlr{ X \in \mathbb{R}^{n \times n} | X = X^\T } \\
S^n_{+} &= \setlr{ X \in S^n | \Bv^\T X \Bv \ge 0, \quad \forall \Bv \in \mathbb{R}^n } \\
S^n_{++} &= \setlr{ X \in S^n_{+} | \Bv^\T X \Bv > 0, \quad \forall \Bv \in \mathbb{R}^n, \Bv \ne 0 } \\
\end{aligned}

These have respectively

• $$\lambda_i \in \mathbb{R}$$
• $$\lambda_i \in \mathbb{R}_{+}$$
• $$\lambda_i \in \mathbb{R}_{++}$$

$$S^n_{+}$$ is a cone if:

$$X \in S^n_{+}$$, then $$\theta X \in S^n_{+}, \quad \forall \theta \ge 0$$

\label{eqn:convexOptimizationLecture4:420}
\Bv^\T (\theta X) \Bv
= \theta \Bv^\T X \Bv
\ge 0,

since $$\theta \ge 0$$ and because $$X \in S^n_{+}$$.

Shorthand:

\label{eqn:convexOptimizationLecture4:440}
\begin{aligned}
X &\in S^n_{+} \Rightarrow X \succeq 0 \\
X &\in S^n_{++} \Rightarrow X \succ 0.
\end{aligned}

Further $$S^n_{+}$$ is a convex cone.

Let $$A \in S^n_{+}$$, $$B \in S^n_{+}$$, $$\theta_1, \theta_2 \ge 0, \theta_1 + \theta_2 = 1$$, or $$\theta_2 = 1 - \theta_1$$.

Show that $$\theta_1 A + \theta_2 B \in S^n_{+}$$ :

\label{eqn:convexOptimizationLecture4:460}
\Bv^\T \lr{ \theta_1 A + \theta_2 B } \Bv
=
\theta_1 \Bv^\T A \Bv
+\theta_2 \Bv^\T B \Bv
\ge 0,

since $$\theta_1 \ge 0, \theta_2 \ge 0, \Bv^\T A \Bv \ge 0, \Bv^\T B \Bv \ge 0$$.

fig. 8. Cone.

Inequalities:

Start with a proper cone $$K \subseteq \mathbb{R}^n$$

• closed, convex
• non-empty interior (“solid”)
• “pointed” (contains no lines)

Such a $$K$$ defines a generalized inequality in $$\mathbb{R}^n$$, written “$$\le_K$$”

Interpreting

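\label{eqn:convexOptimizationLecture4:480}
\begin{aligned}
\Bx \le_K \By &\leftrightarrow \By - \Bx \in K \\
\Bx <_K \By &\leftrightarrow \By - \Bx \in \textrm{int} K
\end{aligned}

Why pointed? We want $$\Bx \le_K \By$$ and $$\By \le_K \Bx$$ together to imply $$\Bx = \By$$. That fails when $$K$$ contains a line, for example when $$K$$ is a half space.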

Example 1: $$K = \mathbb{R}^n_{+}, \Bx \in \mathbb{R}^n, \By \in \mathbb{R}^n$$

fig. 12. K is non-negative “orthant”

\label{eqn:convexOptimizationLecture4:500}
\Bx \le_K \By \Rightarrow \By - \Bx \in K

say:

\label{eqn:convexOptimizationLecture4:520}
\begin{bmatrix}
y_1 - x_1 \\
y_2 - x_2
\end{bmatrix}
\in \mathbb{R}^2_{+}

Also:

\label{eqn:convexOptimizationLecture4:540}
K = \mathbb{R}^1_{+}

(pointed, since it contains no lines)

\label{eqn:convexOptimizationLecture4:560}
\Bx \le_K \By ,

with respect to $$K = \mathbb{R}^n_{+}$$ means that $$x_i \le y_i$$ for all $$i \in [1,n]$$.

Example 2: For $$K = S^n_{+} \subseteq S^n$$ (the PSD cone),

\label{eqn:convexOptimizationLecture4:580}
\Bx \le_K \By ,

means that

\label{eqn:convexOptimizationLecture4:600}
\By - \Bx \in K = S^n_{+}.

• Difference $$\By - \Bx$$ is always in $$S^n$$
• check if in $$K$$ by checking if all eigenvalues $$\ge 0$$.
• $$S^n_{++}$$ is the interior of $$S^n_{+}$$.
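The eigenvalue test in the list above is easy to sketch with NumPy. The helper names `psd_leq` and `psd_lt`, the tolerance, and the example matrices are my own assumptions. The example also shows that this ordering is only partial: some pairs of matrices are not comparable at all.

```python
import numpy as np

def psd_leq(X, Y, tol=1e-10):
    """X <=_K Y for K = S^n_+: all eigenvalues of Y - X are >= 0."""
    return bool(np.all(np.linalg.eigvalsh(Y - X) >= -tol))

def psd_lt(X, Y, tol=1e-10):
    """X <_K Y: Y - X lies in the interior S^n_++ (eigenvalues > 0)."""
    return bool(np.all(np.linalg.eigvalsh(Y - X) > tol))

X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[2.0, 1.0], [1.0, 2.0]])  # Y - X has eigenvalues 0 and 2
Z = np.array([[3.0, 0.0], [0.0, 0.0]])  # Z - X has eigenvalues 2 and -1

x_leq_y = psd_leq(X, Y)        # True: Y - X is PSD
x_lt_y = psd_lt(X, Y)          # False: Y - X is singular
x_z_comparable = psd_leq(X, Z) or psd_leq(Z, X)  # False: neither ordering holds
```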

Interpretation:

\label{eqn:convexOptimizationLecture4:620}
\begin{aligned}
\Bx \le_K \By &\leftrightarrow \By - \Bx \in K \\
\Bx <_K \By &\leftrightarrow \By - \Bx \in \textrm{int} K
\end{aligned}

We’ll use these with vectors and matrices so often that the $$K$$ subscript will usually be dropped, writing instead (for vectors)

\label{eqn:convexOptimizationLecture4:640}
\begin{aligned}
\Bx \le \By &\leftrightarrow \By - \Bx \in \mathbb{R}^n_{+} \\
\Bx < \By &\leftrightarrow \By - \Bx \in \textrm{int} \mathbb{R}^n_{+} = \mathbb{R}^n_{++}
\end{aligned}

and for matrices

\label{eqn:convexOptimizationLecture4:660}
\begin{aligned}
\Bx \le \By &\leftrightarrow \By - \Bx \in S^n_{+} \\
\Bx < \By &\leftrightarrow \By - \Bx \in \textrm{int} S^n_{+} = S^n_{++}.
\end{aligned}

## Intersection

Take the intersection of (perhaps infinitely many) sets $$S_\alpha$$:

If $$S_\alpha$$ is (affine,convex, conic) for all $$\alpha \in A$$ then

\label{eqn:convexOptimizationLecture4:680}
\cap_\alpha S_\alpha

is (affine,convex, conic). To prove in homework:

\label{eqn:convexOptimizationLecture4:700}
\mathcal{P} = \setlr{ \Bx | \Ba_i^\T \Bx \le \Bb_i, \Bc_j^\T \Bx = \Bd_j, \quad \forall i \cdots j }

This is convex since it is the intersection of hyperplanes and half spaces, each of which is convex.

1. If $$S \subseteq \mathbb{R}^n$$ is convex and $$F : \mathbb{R}^n \rightarrow \mathbb{R}^m$$ is an affine function, then\label{eqn:convexOptimizationLecture4:720}
F(S) = \setlr{ F(\Bx) | \Bx \in S }
is convex.
2. If $$S \subseteq \mathbb{R}^m$$ is convex then, for the same affine $$F$$,\label{eqn:convexOptimizationLecture4:740}
F^{-1}(S) = \setlr{ \Bx | F(\Bx) \in S }
is convex. Such a mapping is sketched in fig. 14.

fig. 14. Mapping functions of sets.

# References

[1] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

## ECE1505H Convex Optimization. Lecture 3: Matrix functions, SVD, and types of Sets. Taught by Prof. Stark Draper

### Disclaimer

Peeter’s lecture notes from class. These may be incoherent and rough.

These are notes for the UofT course ECE1505H, Convex Optimization, taught by Prof. Stark Draper.

## Matrix inner product

Given real matrices $$X, Y \in \mathbb{R}^{m\times n}$$, one possible matrix inner product definition is

\label{eqn:convexOptimizationLecture3:20}
\begin{aligned}
\innerprod{X}{Y}
&= \textrm{Tr}( X^\T Y) \\
&= \textrm{Tr} \lr{ \sum_{k = 1}^m X_{ki} Y_{kj} } \\
&= \sum_{k = 1}^m \sum_{j = 1}^n X_{kj} Y_{kj} \\
&= \sum_{i = 1}^m \sum_{j = 1}^n X_{ij} Y_{ij}.
\end{aligned}

This inner product induces a norm on the (matrix) vector space, called the Frobenius norm

\label{eqn:convexOptimizationLecture3:40}
\begin{aligned}
\Norm{X }_F
&= \sqrt{ \textrm{Tr}( X^\T X) } \\
&= \sqrt{ \innerprod{X}{X} } \\
&=
\sqrt{ \sum_{i = 1}^m \sum_{j = 1}^n X_{ij}^2 }.
\end{aligned}
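As a sanity check of the trace form of the inner product, here is a small pure Python sketch with made-up $$2 \times 3$$ matrices. The helpers `transpose`, `matmul` and `trace` are illustrative, not from any library.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

# Hypothetical 2 x 3 example matrices.
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
Y = [[0.5, -1.0, 2.0], [1.5, 0.0, -2.0]]

# Tr(X^T Y) versus the elementwise double sum.
via_trace = trace(matmul(transpose(X), Y))
via_sum = sum(x * y for row_x, row_y in zip(X, Y) for x, y in zip(row_x, row_y))

# Frobenius norm: square root of the inner product of X with itself.
frobenius = math.sqrt(trace(matmul(transpose(X), X)))
```

Both forms give $$-1.5$$ for this pair, and the Frobenius norm of `X` is $$\sqrt{91}$$.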

## Range, nullspace.

Definition: Range: Given $$A \in \mathbb{R}^{m \times n}$$, the range of A is the set:

\begin{equation*}
\mathcal{R}(A) = \setlr{ A \Bx | \Bx \in \mathbb{R}^n }.
\end{equation*}

Definition: Nullspace: Given $$A \in \mathbb{R}^{m \times n}$$, the nullspace of A is the set:

\begin{equation*}
\mathcal{N}(A) = \setlr{ \Bx | A \Bx = 0 }.
\end{equation*}

## SVD.

To understand the operation of $$A \in \mathbb{R}^{m \times n}$$, a representation of a linear transformation from $$\mathbb{R}^n$$ to $$\mathbb{R}^m$$, decompose $$A$$ using the singular value decomposition (SVD).

Definition: SVD: Given $$A \in \mathbb{R}^{m \times n}$$, an operator on $$\Bx \in \mathbb{R}^n$$, a decomposition of the following form is always possible

\begin{equation*}
\begin{aligned}
A &= U \Sigma V^\T \\
U &\in \mathbb{R}^{m \times r} \\
V &\in \mathbb{R}^{n \times r},
\end{aligned}
\end{equation*}

where $$r$$ is the rank of $$A$$, and both $$U$$ and $$V$$ are orthogonal

\begin{equation*}
\begin{aligned}
U^\T U &= I \in \mathbb{R}^{r \times r} \\
V^\T V &= I \in \mathbb{R}^{r \times r}.
\end{aligned}
\end{equation*}

Here $$\Sigma = \textrm{diag}( \sigma_1, \sigma_2, \cdots, \sigma_r )$$, is a diagonal matrix of “singular” values, where

\begin{equation*}
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r.
\end{equation*}

For simplicity consider square case $$m = n$$

\label{eqn:convexOptimizationLecture3:100}
A \Bx = \lr{ U \Sigma V^\T } \Bx.

The first product $$V^\T \Bx$$ is a rotation, which can be checked by looking at the length

\label{eqn:convexOptimizationLecture3:120}
\begin{aligned}
\Norm{ V^\T \Bx}_2
&= \sqrt{ \Bx^\T V V^\T \Bx } \\
&= \sqrt{ \Bx^\T \Bx } \\
&= \Norm{ \Bx }_2,
\end{aligned}

which shows that the length of the vector is unchanged after application of the linear transformation represented by $$V^\T$$ so that operation must be a rotation.

Similarly the operation of $$U$$ on $$\Sigma V^\T \Bx$$ also must be a rotation. The operation $$\Sigma = [\sigma_i]_i$$ applies a scaling operation to each component of the vector $$V^\T \Bx$$.

All linear (square) transformations can therefore be thought of as a rotate-scale-rotate operation. Often the $$A$$ of interest will be symmetric $$A = A^\T$$.
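The rotate-scale-rotate picture can be traced step by step with NumPy's `numpy.linalg.svd`, which returns $$U$$, the singular values, and $$V^\T$$. This is a sketch on a random square matrix of my own choosing. Strictly speaking the orthogonal factors may include reflections as well as rotations.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

# numpy returns U, the singular values s (descending), and V^T.
U, s, Vt = np.linalg.svd(A)

rotated = Vt @ x          # first orthogonal map: length preserved
scaled = s * rotated      # Sigma scales each component
result = U @ scaled       # second orthogonal map

same_as_direct = np.allclose(result, A @ x)
length_preserved = np.isclose(np.linalg.norm(rotated), np.linalg.norm(x))
```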

## Set of symmetric matrices

Let $$S^n$$ be the set of real, symmetric $$n \times n$$ matrices.

Theorem: Spectral theorem: When $$A \in S^n$$ then it is possible to factor $$A$$ as

\begin{equation*}
A = Q \Lambda Q^\T,
\end{equation*}

where $$Q$$ is an orthogonal matrix, and $$\Lambda = \textrm{diag}( \lambda_1, \lambda_2, \cdots \lambda_n)$$. Here $$\lambda_i \in \mathbb{R} \, \forall i$$ are the (real) eigenvalues of $$A$$.

A real symmetric matrix $$A \in S^n$$ is “positive semi-definite” if

\begin{equation*}
\Bv^\T A \Bv \ge 0 \qquad\forall \Bv \in \mathbb{R}^n, \Bv \ne 0,
\end{equation*}
and is “positive definite” if

\begin{equation*}
\Bv^\T A \Bv > 0 \qquad\forall \Bv \in \mathbb{R}^n, \Bv \ne 0.
\end{equation*}

The set of such matrices is denoted $$S^n_{+}$$, and $$S^n_{++}$$ respectively.

Consider $$A \in S^n_{+}$$ (or $$S^n_{++}$$ )

\label{eqn:convexOptimizationLecture3:200}
A = Q \Lambda Q^\T,

possible since the matrix is symmetric. For such a matrix

\label{eqn:convexOptimizationLecture3:220}
\begin{aligned}
\Bv^\T A \Bv
&=
\Bv^\T Q \Lambda Q^\T \Bv \\
&=
\Bw^\T \Lambda \Bw,
\end{aligned}

where $$\Bw = Q^\T \Bv$$. Such a product is

\label{eqn:convexOptimizationLecture3:240}
\Bv^\T A \Bv
=
\sum_{i = 1}^n \lambda_i w_i^2.

So, if $$\lambda_i \ge 0$$ ($$\lambda_i > 0$$ ) then $$\sum_{i = 1}^n \lambda_i w_i^2$$ is non-negative (positive) $$\forall \Bw \in \mathbb{R}^n, \Bw \ne 0$$. Since $$\Bw$$ is just a rotated version of $$\Bv$$ this also holds for all $$\Bv$$. A necessary and sufficient condition for $$A \in S^n_{+}$$ ($$S^n_{++}$$ ) is $$\lambda_i \ge 0$$ ($$\lambda_i > 0$$).

## Square root of positive semi-definite matrix

Real symmetric matrix power relationships such as

\label{eqn:convexOptimizationLecture3:260}
A^2
=
Q \Lambda Q^\T
Q \Lambda Q^\T
=
Q \Lambda^2
Q^\T
,

or more generally $$A^k = Q \Lambda^k Q^\T,\, k \in \mathbb{Z}$$, can be further generalized to non-integral powers. In particular, the square root (non-unique) of a positive semi-definite matrix can be written

\label{eqn:convexOptimizationLecture3:280}
A^{1/2} = Q
\begin{bmatrix}
\sqrt{\lambda_1} & & & \\
& \sqrt{\lambda_2} & & \\
& & \ddots & \\
& & & \sqrt{\lambda_n} \\
\end{bmatrix}
Q^\T,

since $$A^{1/2} A^{1/2} = A$$, regardless of the sign picked for the square roots in question.

## Functions of matrices

Consider $$F : S^n \rightarrow \mathbb{R}$$, and define

\label{eqn:convexOptimizationLecture3:300}
F(X) = \log \det X,

Here $$\textrm{dom} F = S^n_{++}$$. The task is to find $$\spacegrad F$$, which can be done by looking at the perturbation $$\log \det ( X + \Delta X )$$

\label{eqn:convexOptimizationLecture3:320}
\begin{aligned}
\log \det ( X + \Delta X )
&=
\log \det ( X^{1/2} (I + X^{-1/2} \Delta X X^{-1/2}) X^{1/2} ) \\
&=
\log \det ( X (I + X^{-1/2} \Delta X X^{-1/2}) ) \\
&=
\log \det X + \log \det (I + X^{-1/2} \Delta X X^{-1/2}).
\end{aligned}

Let $$X^{-1/2} \Delta X X^{-1/2} = M$$ where $$\lambda_i$$ are the eigenvalues of $$M : M \Bv = \lambda_i \Bv$$ when $$\Bv$$ is an eigenvector of $$M$$. In particular

\label{eqn:convexOptimizationLecture3:340}
(I + M) \Bv =
(1 + \lambda_i) \Bv,

where $$1 + \lambda_i$$ are the eigenvalues of the $$I + M$$ matrix. Since the determinant is the product of the eigenvalues, this gives

\label{eqn:convexOptimizationLecture3:360}
\begin{aligned}
\log \det ( X + \Delta X )
&=
\log \det X +
\log \prod_{i = 1}^n (1 + \lambda_i) \\
&=
\log \det X +
\sum_{i = 1}^n \log (1 + \lambda_i).
\end{aligned}

If $$\lambda_i$$ are sufficiently “small”, then $$\log ( 1 + \lambda_i ) \approx \lambda_i$$, giving

\label{eqn:convexOptimizationLecture3:380}
\log \det ( X + \Delta X )
\approx
\log \det X +
\sum_{i = 1}^n \lambda_i
=
\log \det X +
\textrm{Tr}( X^{-1/2} \Delta X X^{-1/2} ).

Since
\label{eqn:convexOptimizationLecture3:400}
\textrm{Tr}( A B ) = \textrm{Tr}( B A ),

this trace operation can be written as

\label{eqn:convexOptimizationLecture3:420}
\log \det ( X + \Delta X )
\approx
\log \det X +
\textrm{Tr}( X^{-1} \Delta X )
=
\log \det X +
\innerprod{ X^{-1}}{\Delta X},

so
\label{eqn:convexOptimizationLecture3:440}
\spacegrad F = X^{-1}.

To check this, consider the simplest example with $$X \in \mathbb{R}^{1 \times 1}$$, where we have

\label{eqn:convexOptimizationLecture3:460}
\frac{d}{dX} \lr{ \log \det X } = \frac{d}{dX} \lr{ \log X } = \inv{X} = X^{-1}.

This is a nice example demonstrating how the gradient can be obtained by performing a first order perturbation of the function. The gradient can then be read off from the result.

## Second order perturbations

• To get the first order approximation, we found the part that varied linearly in $$\Delta X$$.
• To get the second order part, perturb $$X^{-1}$$ by $$\Delta X$$ and see how that perturbation varies in $$\Delta X$$.

For $$G(X) = X^{-1}$$, this is

\label{eqn:convexOptimizationLecture3:480}
\begin{aligned}
(X + \Delta X)^{-1}
&=
\lr{ X^{1/2} (I + X^{-1/2} \Delta X X^{-1/2} ) X^{1/2} }^{-1} \\
&=
X^{-1/2} (I + X^{-1/2} \Delta X X^{-1/2} )^{-1} X^{-1/2}
\end{aligned}

To be proven in the homework (for “small” $$A$$)

\label{eqn:convexOptimizationLecture3:500}
(I + A)^{-1} \approx I - A.

This gives

\label{eqn:convexOptimizationLecture3:520}
\begin{aligned}
(X + \Delta X)^{-1}
&\approx
X^{-1/2} (I - X^{-1/2} \Delta X X^{-1/2} ) X^{-1/2} \\
&=
X^{-1} - X^{-1} \Delta X X^{-1},
\end{aligned}

or

\label{eqn:convexOptimizationLecture3:800}
\begin{aligned}
G(X + \Delta X)
&= G(X) + (D G) \Delta X \\
&= G(X) + (\spacegrad G)^\T \Delta X,
\end{aligned}

so
\label{eqn:convexOptimizationLecture3:820}
(\spacegrad G)^\T \Delta X
=
- X^{-1} \Delta X X^{-1}.

The Taylor expansion of $$F$$ to second order is

\label{eqn:convexOptimizationLecture3:840}
F(X + \Delta X)
=
F(X)
+
\textrm{Tr} \lr{ (\spacegrad F)^\T \Delta X}
+
\inv{2}
\lr{ (\Delta X)^\T (\spacegrad^2 F) \Delta X}.

The first trace can be expressed as an inner product

\label{eqn:convexOptimizationLecture3:860}
\begin{aligned}
\textrm{Tr} \lr{ (\spacegrad F)^\T \Delta X}
&=
\innerprod{ \spacegrad F }{\Delta X} \\
&=
\innerprod{ X^{-1} }{\Delta X}.
\end{aligned}

The second trace also has the structure of an inner product

\label{eqn:convexOptimizationLecture3:880}
\begin{aligned}
(\Delta X)^\T (\spacegrad^2 F) \Delta X
&=
\textrm{Tr} \lr{ (\Delta X)^\T (\spacegrad^2 F) \Delta X} \\
&=
\innerprod{ (\spacegrad^2 F)^\T \Delta X }{\Delta X},
\end{aligned}

where a no-op trace could be inserted in the second order term since that quadratic form is already a scalar. This $$(\spacegrad^2 F)^\T \Delta X$$ term has essentially been found implicitly by performing the linear variation of $$\spacegrad F$$ in $$\Delta X$$, showing that we must have

\label{eqn:convexOptimizationLecture3:900}
\textrm{Tr} \lr{ (\Delta X)^\T (\spacegrad^2 F) \Delta X}
=
\innerprod{ - X^{-1} \Delta X X^{-1} }{\Delta X},

so
\label{eqn:convexOptimizationLecture3:560}
F( X + \Delta X) = F(X) +
\innerprod{X^{-1}}{\Delta X}
+\inv{2} \innerprod{-X^{-1} \Delta X X^{-1}}{\Delta X},

or
\label{eqn:convexOptimizationLecture3:580}
\log \det ( X + \Delta X) = \log \det X +
\textrm{Tr}( X^{-1} \Delta X )
- \inv{2} \textrm{Tr}( X^{-1} \Delta X X^{-1} \Delta X ).
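The first and second order approximations of $$\log \det$$ can be compared against the exact value numerically. This is a sketch under my own assumptions: a random positive definite `X`, a small symmetric perturbation `dX`, and NumPy's `slogdet` for the log determinant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)   # symmetric positive definite
S = rng.standard_normal((n, n))
dX = 1e-3 * (S + S.T)         # small symmetric perturbation

def log_det(M):
    # slogdet returns (sign, log|det|); the sign is +1 for positive definite M.
    sign, value = np.linalg.slogdet(M)
    return value

X_inv = np.linalg.inv(X)
exact = log_det(X + dX) - log_det(X)
first_order = np.trace(X_inv @ dX)
second_order = first_order - 0.5 * np.trace(X_inv @ dX @ X_inv @ dX)

err1 = abs(exact - first_order)
err2 = abs(exact - second_order)
```

With a perturbation of this size the second order error `err2` should be considerably smaller than the first order error `err1`.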

## Convex Sets

• Types of sets: Affine, convex, cones
• Examples: Hyperplanes, polyhedra, balls, ellipses, norm balls, cone of PSD matrices.

Definition: Affine set:

A set $$C \subseteq \mathbb{R}^n$$ is affine if $$\forall \Bx_1, \Bx_2 \in C$$ then

\begin{equation*}
\theta \Bx_1 + (1 -\theta) \Bx_2 \in C, \qquad \forall \theta \in \mathbb{R}.
\end{equation*}

The affine sum above can be rewritten as

\label{eqn:convexOptimizationLecture3:600}
\Bx_2 + \theta (\Bx_1 - \Bx_2).

Since $$\theta$$ is a scaling, this is the line containing $$\Bx_2$$ in the direction between $$\Bx_1$$ and $$\Bx_2$$.

Observe that the solution to a set of linear equations

\label{eqn:convexOptimizationLecture3:620}
C = \setlr{ \Bx | A \Bx = \Bb },

is an affine set. To check, note that

\label{eqn:convexOptimizationLecture3:640}
\begin{aligned}
A (\theta \Bx_1 + (1 - \theta) \Bx_2)
&=
\theta A \Bx_1 + (1 - \theta) A \Bx_2 \\
&=
\theta \Bb + (1 - \theta) \Bb \\
&= \Bb.
\end{aligned}

Definition: Affine combination: An affine combination of points $$\Bx_1, \Bx_2, \cdots \Bx_n$$ is

\begin{equation*}
\sum_{i = 1}^n \theta_i \Bx_i,
\end{equation*}

such that for $$\theta_i \in \mathbb{R}$$

\begin{equation*}
\sum_{i = 1}^n \theta_i = 1.
\end{equation*}

An affine set contains all affine combinations of points in the set. Examples of a couple affine sets are sketched in fig 1.1

For comparison, a couple of non-affine sets are sketched in fig 1.2

Definition: Convex set: A set $$C \subseteq \mathbb{R}^n$$ is convex if $$\forall \Bx_1, \Bx_2 \in C$$ and $$\forall \theta \in [0,1]$$, the combination

\label{eqn:convexOptimizationLecture3:700}
\theta \Bx_1 + (1 - \theta) \Bx_2 \in C.

Definition: Convex combination: A convex combination of $$\Bx_1, \Bx_2, \cdots \Bx_n$$ is

\begin{equation*}
\sum_{i = 1}^n \theta_i \Bx_i,
\end{equation*}

such that $$\theta_i \ge 0$$ for all $$i$$, and

\begin{equation*}
\sum_{i = 1}^n \theta_i = 1
\end{equation*}

Definition: Convex hull: Convex hull of a set $$C$$ is a set of all convex combinations of points in $$C$$, denoted

\label{eqn:convexOptimizationLecture3:720}
\textrm{conv}(C) = \setlr{ \sum_{i=1}^n \theta_i \Bx_i | \Bx_i \in C, \theta_i \ge 0, \sum_{i=1}^n \theta_i = 1 }.

A non-convex set can be converted into a convex hull by filling in all the combinations of points connecting points in the set, as sketched in fig 1.3.

Definition: Cone: A set $$C$$ is a cone if $$\forall \Bx \in C$$ and $$\forall \theta \ge 0$$ we have $$\theta \Bx \in C$$.

This scales out if $$\theta > 1$$ and scales in if $$\theta < 1$$.

A convex cone is a cone that is also a convex set. A conic combination is

\begin{equation*}
\sum_{i=1}^n \theta_i \Bx_i, \theta_i \ge 0.
\end{equation*}

A convex and non-convex 2D cone is sketched in fig. 1.4

A comparison of properties for different set types is tabulated in table 1.1

## Hyperplanes and half spaces

Definition: Hyperplane: A hyperplane is defined by

\begin{equation*}
\setlr{ \Bx | \Ba^\T \Bx = \Bb, \Ba \ne 0 }.
\end{equation*}

A line and plane are examples of this general construct as sketched in fig. 1.5

An alternate view is possible should one find any specific $$\Bx_0$$ such that $$\Ba^\T \Bx_0 = \Bb$$

\label{eqn:convexOptimizationLecture3:740}
\setlr{\Bx | \Ba^\T \Bx = \Bb }
=
\setlr{\Bx | \Ba^\T (\Bx -\Bx_0) = 0 }

This shows that $$\Bx - \Bx_0 \in \Ba^\perp$$ is perpendicular to $$\Ba$$, or

\label{eqn:convexOptimizationLecture3:780}
\Bx
=
\Bx_0 + \Ba^\perp.

This is the subspace perpendicular to $$\Ba$$ shifted by $$\Bx_0$$, subject to $$\Ba^\T \Bx_0 = \Bb$$. As a set

\label{eqn:convexOptimizationLecture3:760}
\Ba^\perp = \setlr{ \Bv | \Ba^\T \Bv = 0 }.

## Half space

Definition: Half space: The half space is defined as
\begin{equation*}
\setlr{ \Bx | \Ba^\T \Bx \le \Bb }
= \setlr{ \Bx | \Ba^\T (\Bx - \Bx_0) \le 0 }.
\end{equation*}

This can also be expressed as $$\setlr{ \Bx | \innerprod{ \Ba }{\Bx - \Bx_0 } \le 0 }$$.

## ECE1505H Convex Optimization. Lecture 2: Mathematical background. Taught by Prof. Stark Draper

### Disclaimer

Peeter’s lecture notes from class. These may be incoherent and rough.

These are notes for the UofT course ECE1505H, Convex Optimization, taught by Prof. Stark Draper, from [1].

### Topics

• Calculus: Derivatives and Jacobians, Gradients, Hessians, approximation functions.
• Linear algebra, Matrices, decompositions, …

## Norms

Vector space:

A set of elements (vectors) that is closed under vector addition and scaling.

This generalizes the directed arrow concept of vector space (fig. 1) that is familiar from geometry.

Normed vector spaces:

A vector space with a notion of length of any single vector, the “norm”.

Inner product space:
A normed vector space with a notion of a real angle between any pair of vectors.

This course has a focus on optimization in $$\mathbb{R}^n$$. Complex spaces in the context of this course can be considered with a mapping $$\mathbb{C}^n \rightarrow \mathbb{R}^{2 n}$$.

Norm:
A norm is a function operating on a vector

\begin{equation*}
\Bx = ( x_1, x_2, \cdots, x_n )
\end{equation*}

that provides a mapping

\begin{equation*}
\Norm{ \cdot } : \mathbb{R}^{n} \rightarrow \mathbb{R},
\end{equation*}

where

• $$\Norm{ \Bx } \ge 0$$
• $$\Norm{ \Bx } = 0 \qquad \iff \Bx = 0$$
• $$\Norm{ t \Bx } = \Abs{t} \Norm{ \Bx }$$
• $$\Norm{ \Bx + \By } \le \Norm{ \Bx } + \Norm{\By}$$. This is the triangle inequality.

### Example: Euclidean norm

\label{eqn:convex-optimizationLecture2:24}
\Norm{\Bx} = \sqrt{ \sum_{i = 1}^n x_i^2 }

### Example: $$l_p$$-norms

\label{eqn:convex-optimizationLecture2:44}
\Norm{\Bx}_p = \lr{ \sum_{i = 1}^n \Abs{x_i}^p }^{1/p}.

For $$p = 1$$, this is

\label{eqn:convex-optimizationLecture2:64}
\Norm{\Bx}_1 = \sum_{i = 1}^n \Abs{x_i},

For $$p = 2$$, this is the Euclidean norm \ref{eqn:convex-optimizationLecture2:24}.
For $$p = \infty$$, this is

\label{eqn:convex-optimizationLecture2:324}
\Norm{\Bx}_\infty = \max_{i = 1}^n \Abs{x_i}.
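A minimal sketch of these norms in plain Python follows. The function name `lp_norm` and the sample vector are my own. It also illustrates that the $$p$$-norm approaches the $$l_\infty$$ norm as $$p$$ grows.

```python
import math

def lp_norm(x, p):
    """l_p norm of a sequence x for p >= 1, with p = math.inf giving the max norm."""
    if math.isinf(p):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0, 1.0]
norm_1 = lp_norm(x, 1)            # |3| + |-4| + |1| = 8
norm_2 = lp_norm(x, 2)            # sqrt(9 + 16 + 1) = sqrt(26)
norm_inf = lp_norm(x, math.inf)   # max(3, 4, 1) = 4
norm_50 = lp_norm(x, 50)          # already very close to the max norm
```

For this vector the usual ordering $$\Norm{\Bx}_\infty \le \Norm{\Bx}_2 \le \Norm{\Bx}_1$$ is visible directly.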

Unit ball:

\begin{equation*}
\setlr{ \Bx | \Norm{\Bx} \le 1 }
\end{equation*}

The regions of the unit ball under the $$l_1$$, $$l_2$$, and $$l_\infty$$ norms are plotted in fig. 2.

fig. 2. Some unit ball regions.

The $$l_2$$ norm is not only familiar, but can be “induced” by an inner product

\label{eqn:convex-optimizationLecture2:84}
\left\langle \Bx, \By \right\rangle = \Bx^\T \By = \sum_{i = 1}^n x_i y_i,

which is not true for all norms. The norm induced by this inner product is

\label{eqn:convex-optimizationLecture2:104}
\Norm{\Bx}_2 = \sqrt{ \left\langle \Bx, \Bx \right\rangle }

Inner product spaces have a notion of angle (fig. 3) given by

\label{eqn:convex-optimizationLecture2:124}
\left\langle \Bx, \By \right\rangle = \Norm{\Bx} \Norm{\By} \cos \theta,

fig. 3. Inner product induced angle.

and always satisfy the Cauchy-Schwarz inequality

\label{eqn:convex-optimizationLecture2:144}
\left\langle \Bx, \By \right\rangle \le \Norm{\Bx}_2 \Norm{\By}_2.

In an inner product space we say $$\Bx$$ and $$\By$$ are orthogonal vectors $$\Bx \perp \By$$ if $$\left\langle \Bx, \By \right\rangle = 0$$, as sketched in fig. 4.

fig. 4. Orthogonality.

## Dual norm

Let $$\Norm{ \cdot }$$ be a norm in $$\mathbb{R}^n$$. The “dual” norm $$\Norm{ \cdot }_\conj$$ is defined as

\begin{equation*}
\Norm{\Bz}_\conj = \sup_\Bx \setlr{ \Bz^\T \Bx | \Norm{\Bx} \le 1 }.
\end{equation*}

where $$\sup$$ is roughly the “least upper bound”.

The supremum is taken over the unit ball of $$\Norm{\cdot}$$.

### $$l_2$$ dual

The dual of the $$l_2$$ norm is the $$l_2$$ norm.

fig. 5. l_2 dual norm determination.

Proof:

\label{eqn:convex-optimizationLecture2:164}
\begin{aligned}
\Norm{\Bz}_\conj
&= \sup_\Bx \setlr{ \Bz^\T \Bx | \Norm{\Bx}_2 \le 1 } \\
&= \sup_\Bx \setlr{ \Norm{\Bz}_2 \Norm{\Bx}_2 \cos\theta | \Norm{\Bx}_2 \le 1 } \\
&\le \sup_\Bx \setlr{ \Norm{\Bz}_2 \Norm{\Bx}_2 | \Norm{\Bx}_2 \le 1 } \\
&\le
\Norm{\Bz}_2
\Norm{
\frac{\Bz}{ \Norm{\Bz}_2 }
}_2 \\
&=
\Norm{\Bz}_2.
\end{aligned}

### $$l_1$$ dual

For $$l_1$$, the dual is the $$l_\infty$$ norm. Proof:

\label{eqn:convex-optimizationLecture2:184}
\Norm{\Bz}_\conj
=
\sup_\Bx \setlr{ \Bz^\T \Bx | \Norm{\Bx}_1 \le 1 },

but
\label{eqn:convex-optimizationLecture2:204}
\Bz^\T \Bx
=
\sum_{i=1}^n z_i x_i \le
\Abs{
\sum_{i=1}^n z_i x_i
}
\le
\sum_{i=1}^n \Abs{z_i x_i },

so
\label{eqn:convex-optimizationLecture2:224}
\begin{aligned}
\Norm{\Bz}_\conj
&\le
\sup_\Bx \setlr{ \sum_{i=1}^n \Abs{z_i}\Abs{ x_i } | \Norm{\Bx}_1 \le 1 } \\
&\le \sup_\Bx \setlr{ \lr{ \max_{j=1}^n \Abs{z_j} }
\sum_{i=1}^n \Abs{ x_i } | \Norm{\Bx}_1 \le 1 } \\
&\le \lr{ \max_{j=1}^n \Abs{z_j} } \\
&=
\Norm{\Bz}_\infty.
\end{aligned}

### $$l_\infty$$ dual

For $$l_\infty$$, the dual is the $$l_1$$ norm.

fig. 6. l_1 dual norm determination.

fig. 7. l_\infty dual norm determination.

\label{eqn:convex-optimizationLecture2:244}
\Norm{\Bz}_\conj
=
\sup_\Bx \setlr{ \Bz^\T \Bx | \Norm{\Bx}_\infty \le 1 }.

Here
\label{eqn:convex-optimizationLecture2:264}
\begin{aligned}
\Bz^\T \Bx
&=
\sum_{i=1}^n z_i x_i \\
&\le
\sum_{i=1}^n \Abs{z_i}\Abs{ x_i } \\
&\le
\lr{ \max_j \Abs{ x_j } }
\sum_{i=1}^n \Abs{z_i} \\
&=
\Norm{\Bx}_\infty
\sum_{i=1}^n \Abs{z_i}.
\end{aligned}

So
\label{eqn:convex-optimizationLecture2:284}
\Norm{\Bz}_\conj
\le
\sum_{i=1}^n \Abs{z_i}
=
\Norm{\Bz}_1.

Statement from the lecture: I’m not sure where this fits:

\label{eqn:convex-optimizationLecture2:304}
x_i^\conj
=
\left\{
\begin{array}{l l}
+1 & \quad \mbox{$$z_i \ge 0$$} \\
-1 & \quad \mbox{$$z_i \le 0$$}
\end{array}
\right.
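
One possible reading of the statement above, offered here as an assumption and not something recorded from the lecture, is that $$\Bx^\conj$$ with components $$\pm 1$$ matching the signs of $$z_i$$ is the point of the $$l_\infty$$ unit ball at which the bound $$\Norm{\Bz}_1$$ is attained. The following minimal sketch checks that numerically, along with the corresponding maximizers for the $$l_1$$ and $$l_2$$ unit balls. The helper names and the sample vector are illustrative only.

```python
import math

def dual_value(z, x):
    """Inner product z^T x, the quantity maximized in the dual norm definition."""
    return sum(zi * xi for zi, xi in zip(z, x))

def sign(v):
    # Ties at zero are broken toward +1; either choice gives the same z^T x.
    return 1.0 if v >= 0 else -1.0

z = [3.0, -4.0, 1.0]  # hypothetical sample vector

# l_infinity unit ball: x_i = sign(z_i) gives z^T x = sum |z_i| = ||z||_1.
x_inf = [sign(zi) for zi in z]
dual_of_linf = dual_value(z, x_inf)

# l_1 unit ball: put all the weight on the largest |z_i|, giving ||z||_inf.
k = max(range(len(z)), key=lambda i: abs(z[i]))
x_one = [sign(z[i]) if i == k else 0.0 for i in range(len(z))]
dual_of_l1 = dual_value(z, x_one)

# l_2 unit ball: x = z / ||z||_2 gives ||z||_2.
norm2 = math.sqrt(sum(zi * zi for zi in z))
x_two = [zi / norm2 for zi in z]
dual_of_l2 = dual_value(z, x_two)

print(dual_of_linf, dual_of_l1, dual_of_l2)
```

Each candidate $$\Bx$$ lies on the boundary of its unit ball, so these values are lower bounds on the respective dual norms that match the upper bounds derived above.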

# References

[1] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

## Motivation

In class this Friday the Jacobian and Hessian matrices were introduced, but I did not find the treatment terribly clear. Here is an alternate treatment, beginning with the gradient construction from [2], which uses a nice trick to frame the multivariable derivative operation as a single variable Taylor expansion.

## Multivariable Taylor approximation

The Taylor series expansion for a scalar function $$g : {\mathbb{R}} \rightarrow {\mathbb{R}}$$ about the origin is just

\label{eqn:jacobianAndHessian:20}
g(t) = g(0) + t g'(0) + \frac{t^2}{2} g''(0) + \cdots

In particular

\label{eqn:jacobianAndHessian:40}
g(1) = g(0) + g'(0) + \frac{1}{2} g''(0) + \cdots

Now consider $$g(t) = f( \Bx + \Ba t )$$, where $$f : {\mathbb{R}}^n \rightarrow {\mathbb{R}}$$, $$g(0) = f(\Bx)$$, and $$g(1) = f(\Bx + \Ba)$$. The multivariable Taylor expansion now follows directly

\label{eqn:jacobianAndHessian:60}
f( \Bx + \Ba)
= f(\Bx)
+ \evalbar{\frac{df(\Bx + \Ba t)}{dt}}{t = 0} + \frac{1}{2} \evalbar{\frac{d^2f(\Bx + \Ba t)}{dt^2}}{t = 0} + \cdots

The first order term is

\label{eqn:jacobianAndHessian:80}
\begin{aligned}
\evalbar{\frac{df(\Bx + \Ba t)}{dt}}{t = 0}
&=
\sum_{i = 1}^n
\frac{d( x_i + a_i t)}{dt}
\evalbar{\PD{(x_i + a_i t)}{f(\Bx + \Ba t)}}{t = 0} \\
&=
\sum_{i = 1}^n
a_i
\PD{x_i}{f(\Bx)} \\
&=
\Ba \cdot \spacegrad f.
\end{aligned}

Similarly, for the second order term

\label{eqn:jacobianAndHessian:100}
\begin{aligned}
\evalbar{\frac{d^2 f(\Bx + \Ba t)}{dt^2}}{t = 0}
&=
\evalbar{\lr{
\frac{d}{dt}
\lr{
\sum_{i = 1}^n
a_i
\PD{(x_i + a_i t)}{f(\Bx + \Ba t)}
}
}
}{t = 0} \\
&=
\evalbar{
\lr{
\sum_{j = 1}^n
\frac{d(x_j + a_j t)}{dt}
\sum_{i = 1}^n
a_i
\frac{\partial^2 f(\Bx + \Ba t)}{\partial (x_j + a_j t) \partial (x_i + a_i t) }
}
}{t = 0} \\
&=
\sum_{i,j = 1}^n a_i a_j \frac{\partial^2 f}{\partial x_i \partial x_j} \\
&=
\lr{ \Ba \cdot \spacegrad }^2 f.
\end{aligned}

The complete Taylor expansion of a scalar function $$f : {\mathbb{R}}^n \rightarrow {\mathbb{R}}$$ is therefore

\label{eqn:jacobianAndHessian:120}
f(\Bx + \Ba)
= f(\Bx) +
\lr{ \Ba \cdot \spacegrad} f +
\inv{2} \lr{ \Ba \cdot \spacegrad}^2 f + \cdots,

so the Taylor expansion has an exponential structure

\label{eqn:jacobianAndHessian:140}
f(\Bx + \Ba) = \sum_{k = 0}^\infty \inv{k!} \lr{ \Ba \cdot \spacegrad}^k f = e^{\Ba \cdot \spacegrad} f.

Should an approximation of a vector valued function $$\Bf : {\mathbb{R}}^n \rightarrow {\mathbb{R}}^m$$ be desired it is only required to form a matrix of the components

\label{eqn:jacobianAndHessian:160}
\Bf(\Bx + \Ba)
= \Bf(\Bx) +
[\lr{ \Ba \cdot \spacegrad} f_i]_i +
\inv{2} [\lr{ \Ba \cdot \spacegrad}^2 f_i]_i + \cdots,

where $$[.]_i$$ denotes a column vector over the rows $$i \in [1,m]$$, and $$f_i$$ are the coordinates of $$\Bf$$.

## The Jacobian matrix

In [1] the Jacobian $$D \Bf$$ of a function $$\Bf : {\mathbb{R}}^n \rightarrow {\mathbb{R}}^m$$ is defined in terms of the limit of the $$l_2$$ norm ratio

\label{eqn:jacobianAndHessian:180}
\frac{\Norm{\Bf(\Bz) - \Bf(\Bx) - (D \Bf) (\Bz - \Bx)}_2 }{ \Norm{\Bz - \Bx}_2 },

with the statement that the function $$\Bf$$ has a derivative if this limit exists. Here the Jacobian $$D \Bf \in {\mathbb{R}}^{m \times n}$$ must be matrix valued.

Let $$\Bz = \Bx + \Ba$$, so the first order expansion of \ref{eqn:jacobianAndHessian:160} is

\label{eqn:jacobianAndHessian:200}
\Bf(\Bz)
= \Bf(\Bx) + [\lr{ \Bz - \Bx } \cdot \spacegrad f_i]_i
.

With the (unproven) assumption that this Taylor expansion satisfies the norm limit criteria of \ref{eqn:jacobianAndHessian:180}, it is possible to extract the structure of the Jacobian by comparison

\label{eqn:jacobianAndHessian:220}
\begin{aligned}
(D \Bf)
(\Bz - \Bx)
&=
{\begin{bmatrix}
\lr{ \Bz - \Bx } \cdot \spacegrad f_i
\end{bmatrix}}_i \\
&=
{\begin{bmatrix}
\sum_{j = 1}^n (z_j - x_j) \PD{x_j}{f_i}
\end{bmatrix}}_i \\
&=
{\begin{bmatrix}
\PD{x_j}{f_i}
\end{bmatrix}}_{ij}
(\Bz - \Bx),
\end{aligned}

so
\label{eqn:jacobianAndHessian:240}
\boxed{
(D \Bf)_{ij} = \PD{x_j}{f_i}
}

Written out explicitly as a matrix the Jacobian is

\label{eqn:jacobianAndHessian:320}
D \Bf
=
\begin{bmatrix}
\PD{x_1}{f_1} & \PD{x_2}{f_1} & \cdots & \PD{x_n}{f_1} \\
\PD{x_1}{f_2} & \PD{x_2}{f_2} & \cdots & \PD{x_n}{f_2} \\
\vdots & \vdots & & \vdots \\
\PD{x_1}{f_m} & \PD{x_2}{f_m} & \cdots & \PD{x_n}{f_m} \\
\end{bmatrix}
=
\begin{bmatrix}
(\spacegrad f_1)^\T \\
(\spacegrad f_2)^\T \\
\vdots \\
(\spacegrad f_m)^\T
\end{bmatrix}.

In particular, when the function is scalar valued
\label{eqn:jacobianAndHessian:261}
D f = \lr{ \spacegrad f }^\T.

With this notation, the first order Taylor expansion, in terms of the Jacobian matrix, is

\label{eqn:jacobianAndHessian:260}
\boxed{
\Bf(\Bz)
\approx \Bf(\Bx) + (D \Bf) \lr{ \Bz - \Bx }.
}
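
As a rough numeric check of the boxed results, the sketch below compares a hand computed Jacobian with a central difference estimate, and then evaluates the first order approximation near the expansion point. The example function, step size, and helper names are illustrative assumptions, not something from the lecture or the text.

```python
import math

def f(x):
    """Example map R^2 -> R^2 (chosen for illustration only)."""
    return [x[0] * x[1], math.sin(x[0]) + x[1] ** 2]

def jacobian_exact(x):
    # (D f)_{ij} = d f_i / d x_j, worked out by hand for the f above.
    return [[x[1], x[0]],
            [math.cos(x[0]), 2.0 * x[1]]]

def jacobian_fd(func, x, h=1e-6):
    """Central difference estimate of the Jacobian, one column per x_j."""
    m = len(func(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp = list(x)
        xm = list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = func(xp), func(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2.0 * h)
    return J

x = [0.5, -1.0]
J = jacobian_exact(x)
J_fd = jacobian_fd(f, x)
max_diff = max(abs(J[i][j] - J_fd[i][j]) for i in range(2) for j in range(2))

# First order approximation f(z) ~ f(x) + (D f)(z - x) for z near x.
z = [0.51, -0.98]
dz = [z[0] - x[0], z[1] - x[1]]
fx = f(x)
approx = [fx[i] + sum(J[i][j] * dz[j] for j in range(2)) for i in range(2)]
err = max(abs(p - q) for p, q in zip(approx, f(z)))

print(max_diff, err)
```

The remaining error in the approximation is second order in $$\Bz - \Bx$$, which is what the Hessian term accounts for in the scalar case.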

## The Hessian matrix

For scalar valued functions, the text expresses the second order expansion of a function in terms of the Jacobian and Hessian matrices

\label{eqn:jacobianAndHessian:271}
f(\Bz)
\approx f(\Bx) + (D f) \lr{ \Bz - \Bx }
+ \inv{2} \lr{ \Bz - \Bx }^\T (\spacegrad^2 f) \lr{ \Bz - \Bx }.

Because $$\spacegrad^2$$ is the usual notation for a Laplacian operator, this $$\spacegrad^2 f \in {\mathbb{R}}^{n \times n}$$ notation for the Hessian matrix is not ideal in my opinion. Ignoring that notational objection for this class, the structure of the Hessian matrix can be extracted by comparison with the coordinate expansion

\label{eqn:jacobianAndHessian:300}
\Ba^\T (\spacegrad^2 f) \Ba
=
\sum_{r,s = 1}^n a_r a_s \frac{\partial^2 f}{\partial x_r \partial x_s},
\qquad \Ba = \Bz - \Bx,

so
\label{eqn:jacobianAndHessian:280}
\boxed{
(\spacegrad^2 f)_{ij}
=
\frac{\partial^2 f}{\partial x_i \partial x_j}.
}

In explicit matrix form the Hessian is

\label{eqn:jacobianAndHessian:340}
\spacegrad^2 f
=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots &\frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots &\frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots &\frac{\partial^2 f}{\partial x_n \partial x_n}
\end{bmatrix}.
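
A small numeric sketch of the second order expansion follows, assuming an arbitrary smooth example function with its gradient and Hessian worked out by hand. The function and the names `grad`, `hessian`, and `quadratic_model` are hypothetical choices for illustration.

```python
import math

def f(x):
    """Example scalar function on R^2 (illustrative choice)."""
    return math.exp(x[0]) * x[1] + x[1] ** 3

def grad(x):
    # Components of the gradient, i.e. the entries of the row vector D f.
    return [math.exp(x[0]) * x[1], math.exp(x[0]) + 3.0 * x[1] ** 2]

def hessian(x):
    # Entry (i, j) is the second partial d^2 f / (dx_i dx_j).
    e = math.exp(x[0])
    return [[e * x[1], e],
            [e, 6.0 * x[1]]]

def quadratic_model(x, a):
    """f(x) + (D f) a + (1/2) a^T (Hessian) a."""
    g, H = grad(x), hessian(x)
    lin = sum(g[i] * a[i] for i in range(2))
    quad = sum(a[i] * H[i][j] * a[j] for i in range(2) for j in range(2))
    return f(x) + lin + 0.5 * quad

x = [0.2, 0.7]
a = [0.05, -0.03]
exact = f([x[0] + a[0], x[1] + a[1]])
first_order = f(x) + sum(g * ai for g, ai in zip(grad(x), a))
first_order_err = abs(exact - first_order)
second_order_err = abs(exact - quadratic_model(x, a))

print(first_order_err, second_order_err)
```

For this example the Hessian is symmetric, and including the quadratic term reduces the approximation error noticeably for a small displacement $$\Ba$$.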

Is there a similar nice matrix structure for the Hessian of a function $$f : {\mathbb{R}}^n \rightarrow {\mathbb{R}}^m$$?

# References

[1] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

[2] D. Hestenes. New Foundations for Classical Mechanics. Kluwer Academic Publishers, 1999.

## UofT ece1505 convex optimization: Introduction (taught by Prof. Stark Draper)

January 12, 2017

## Peeter’s lecture notes. May be incoherent or rough.

• Science of optimization.
• problem formulation, design, analysis of engineering systems.

## Basic concepts

• Basic concepts: convex sets, functions, problems.
• Theory (about 40 % of the material). Specifically Lagrangian duality.
• Algorithms: gradient descent, Newton’s, interior point, …

Homework will involve computational work (solving problems, …)

## Goals

• Recognize and formulate engineering problems as convex optimization problems.
• To develop (Matlab) code to solve problems numerically.
• To characterize the solutions via duality theory
• NOT a math course, but lots of proofs.
• NOT a communications course, but lots of … (?)
• NOT a CS course, but lots of useful algorithms.

## Mathematical program

\label{eqn:intro:20}
\min_\Bx F_0(\Bx)

where $$\Bx = (x_1, x_2, \cdots, x_m) \in \text{\R{m}}$$ is subject to constraints $$F_i : \text{\R{m}} \rightarrow \text{\R{1}}$$

\label{eqn:intro:40}
F_i(\Bx) \le 0, \qquad i = 1, \cdots, m

The function $$F_0 : \text{\R{m}} \rightarrow \text{\R{1}}$$ is called the “objective function”.

Solving such a problem produces an optimal $$\Bx^\conj$$: a feasible value of $$\Bx$$ that gives the smallest value of the objective function $$F_0$$ among all the feasible $$\Bx$$. A convex objective function is sketched in fig. 1.

fig. 1. Convex objective function.

• A convex objective looks like a bowl, “holds water”.
• If two points on the bowl are connected, the line segment between them lies on or above the bowl.

A non-convex function is illustrated in fig. 2, which has a number of local minimums.

fig. 2. Non-convex (wavy) figure with a number of local minimums.

## Example: Line fitting.

A linear fit of some points distributed around a line $$y = a x + b$$ is plotted in fig. 3. Here $$a, b$$ are the optimization variables $$\Bx = (a, b)$$.

fig. 3. Linear fit of points around a line.

How is the solution for such a best fit line obtained?

### Approach 1: Calculus minimization of a multivariable error function.

Define an error function that describes how far from the line a given point is.

\label{eqn:intro:100}
y_i - (a x_i + b),

Because this can be positive or negative, we can define a squared variant of this, and then sum over all data points.

\label{eqn:intro:120}
F_0 = \sum_{i=1}^n \lr{ y_i - (a x_i + b) }^2.

One way to solve (for $$a, b$$): Take the derivatives

\label{eqn:intro:140}
\begin{aligned}
\PD{a}{F_0} &= \sum_{i=1}^n 2 ( y_i - (a x_i + b) )(-x_i) = 0 \\
\PD{b}{F_0} &= \sum_{i=1}^n 2 ( y_i - (a x_i + b) )(-1) = 0.
\end{aligned}

This yields

\label{eqn:intro:160}
\begin{aligned}
\sum_{i = 1}^n y_i &= \lr{\sum_{i = 1}^n x_i} a + \lr{\sum_{i = 1}^n 1} b \\
\sum_{i = 1}^n x_i y_i &= \lr{\sum_{i = 1}^n x_i^2} a + \lr{\sum_{i = 1}^n x_i} b.
\end{aligned}

In matrix form, this is

\label{eqn:intro:180}
\begin{bmatrix}
\sum x_i y_i \\
\sum y_i
\end{bmatrix}
=
\begin{bmatrix}
\sum x_i^2 & \sum x_i \\
\sum x_i & n
\end{bmatrix}
\begin{bmatrix}
a \\
b
\end{bmatrix}.

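If this matrix is invertible, there is an analytic solution for $$(a^\conj, b^\conj)$$. This is a convex optimization problem because the objective is a sum of squares of the form $$F(x) = x^2$$, each of which is convex. It is an example of a convex “quadratic program”. In general a quadratic program has the structure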

\label{eqn:intro:200}
F(a, b) = (\cdots) a^2 + (\cdots) a b + (\cdots) b^2.

### Approach 2: Linear algebraic formulation.

\label{eqn:intro:220}
\begin{bmatrix}
y_1 \\
\vdots \\
y_n
\end{bmatrix}
=
\begin{bmatrix}
x_1 & 1 \\
\vdots & \vdots \\
x_n & 1
\end{bmatrix}
\begin{bmatrix}
a \\
b
\end{bmatrix}
+
\begin{bmatrix}
z_1 \\
\vdots \\
z_n
\end{bmatrix}
,

or
\label{eqn:intro:240}
\By = H \Bv + \Bz,

where $$\Bz$$ is the error vector. The problem is now reduced to fitting $$H \Bv$$ as closely to $$\By$$ as possible, that is, minimizing the norm of the error vector, or

\label{eqn:intro:260}
\begin{aligned}
\min_\Bv \Norm{ \By - H \Bv }^2_2
&= \min_\Bv \lr{ \By - H \Bv }^\T \lr{ \By - H \Bv } \\
&= \min_\Bv
\lr{ \By^\T \By - \By^\T H \Bv - \Bv^\T H^\T \By + \Bv^\T H^\T H \Bv } \\
&= \min_\Bv
\lr{ \By^\T \By - 2 \By^\T H \Bv + \Bv^\T H^\T H \Bv }.
\end{aligned}

It is now possible to take the derivative with respect to the $$\Bv$$ vector (i.e. the gradient with respect to the coordinates of the parameter vector)

\label{eqn:intro:280}
\PD{\Bv}{}
\lr{ \By^\T \By - 2 \By^\T H \Bv + \Bv^\T H^\T H \Bv }
=
- 2 \By^\T H + 2 \Bv^\T H^\T H
= 0,

or

\label{eqn:intro:300}
(H^\T H) \Bv = H^\T \By,

so, assuming that $$H^\T H$$ is invertible, the optimization problem has solution

\label{eqn:intro:320}
\Bv^\conj =
(H^\T H)^{-1} H^\T \By,

where

\label{eqn:intro:340}
\begin{aligned}
H^\T H
&=
\begin{bmatrix}
x_1 & \cdots & x_n \\
1 & \cdots & 1 \\
\end{bmatrix}
\begin{bmatrix}
x_1 & 1 \\
\vdots & \vdots \\
x_n & 1
\end{bmatrix} \\
&=
\begin{bmatrix}
\sum x_i^2 & \sum x_i \\
\sum x_i & n
\end{bmatrix}
,
\end{aligned}

as seen in the calculus approach.
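
A minimal sketch of this solution for the line fit follows, solving the $$2 \times 2$$ system $$(H^\T H) \Bv = H^\T \By$$ in closed form. The function name `fit_line` and the sample data are hypothetical, chosen only to illustrate the formula.

```python
def fit_line(xs, ys):
    """Solve the 2x2 normal equations for the least squares line y = a x + b."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = sxx * n - sx * sx  # determinant of H^T H
    if det == 0:
        raise ValueError("H^T H is singular (are all x_i equal?)")
    # Apply the explicit inverse of [[sxx, sx], [sx, n]] to [sxy, sy].
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

# Hypothetical data: points near y = 2 x + 1 with small perturbations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b = fit_line(xs, ys)

# At the optimum both derivative conditions hold (up to rounding).
res = [y - (a * x + b) for x, y in zip(xs, ys)]
cond_b = sum(res)
cond_a = sum(r * x for r, x in zip(res, xs))
print(a, b, cond_a, cond_b)
```

For larger problems a library least squares routine would normally be preferred over forming $$H^\T H$$ explicitly, since the explicit inverse can be numerically poorly conditioned.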

## Maximum Likelihood Estimation (MLE).

It is reasonable to ask why the 2-norm was picked for the objective function.

• One justification is practical: Because we can solve the derivative equation.
• Another justification: In statistics the error vector $$\Bz = \By - H \Bv$$ can be modelled as an IID (Independently and Identically Distributed) Gaussian random variable (i.e. noise). Under this model, the use of the 2-norm can be viewed as a consequence of such an ML estimation problem (see [1] ch. 7).

A Gaussian IID model (fig. 4) is given by

\label{eqn:intro:360}
y_i = a x_i + b + z_i

\label{eqn:intro:380}
z_i = y_i - a x_i - b \sim N(0, \sigma^2)

\label{eqn:intro:400}
P_Z(z) = \inv{\sqrt{2 \pi} \sigma} \exp\lr{ -\inv{2} z^2/\sigma^2 }.

fig. 4. Gaussian probability distribution.

### MLE: Maximum Likelihood Estimator

Pick $$(a,b)$$ to maximize the probability of observed data.

\label{eqn:intro:420}
\begin{aligned}
(a^\conj, b^\conj)
&= \arg \max P( x, y ; a, b ) \\
&= \arg \max P_Z( y - (a x + b) ) \\
&= \arg \max \prod_{i = 1}^n P_Z( y_i - (a x_i + b) ) \\
&= \arg \max \prod_{i = 1}^n \inv{\sqrt{2 \pi} \sigma} \exp\lr{ -\inv{2} (y_i - a x_i - b)^2/\sigma^2 }.
\end{aligned}

Taking logs gives
\label{eqn:intro:440}
\begin{aligned}
(a^\conj, b^\conj)
&= \arg \max
\lr{
\textrm{constant}
-\inv{2} \sum_i (y_i - a x_i - b)^2/\sigma^2
} \\
&= \arg \min
\inv{2} \sum_i (y_i - a x_i - b)^2/\sigma^2 \\
&= \arg \min
\sum_i (y_i - a x_i - b)^2/\sigma^2.
\end{aligned}

Here $$\arg \max$$ is not the maximum of the function, but the value of the parameter (the argument) that maximizes the function.
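
The equivalence between the Gaussian ML problem and least squares can be checked numerically. The sketch below assumes the standard Gaussian density normalization $$1/(\sqrt{2\pi}\sigma)$$ and uses made up data and candidate parameters; the function names are illustrative.

```python
import math

def sum_squares(a, b, xs, ys):
    """Least squares objective for the line y = a x + b."""
    return sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))

def neg_log_likelihood(a, b, xs, ys, sigma):
    """Negative log of the product of Gaussian densities of the residuals."""
    n = len(xs)
    const = n * math.log(math.sqrt(2.0 * math.pi) * sigma)
    return const + 0.5 * sum_squares(a, b, xs, ys) / sigma ** 2

# Hypothetical data and two candidate parameter pairs.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
sigma = 0.5

d_nll = (neg_log_likelihood(1.5, 1.5, xs, ys, sigma)
         - neg_log_likelihood(2.0, 1.0, xs, ys, sigma))
d_sq = sum_squares(1.5, 1.5, xs, ys) - sum_squares(2.0, 1.0, xs, ys)

# The two objectives differ only by a constant offset and a positive scale,
# so they rank any pair of candidates identically.
print(d_nll, d_sq / (2.0 * sigma ** 2))
```

Because the offset and scale do not depend on $$(a, b)$$, the minimizer of the sum of squares is also the maximizer of the likelihood.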

### Double sided exponential noise

A double sided exponential distribution is plotted in fig. 5, and has the mathematical form

\label{eqn:intro:460}
P_Z(z) = \inv{2 c} \exp\lr{ -\inv{c} \Abs{z} }.

fig. 5. Double sided exponential probability distribution.

The optimization problem is

\label{eqn:intro:480}
\begin{aligned}
\max_{a,b} \prod_{i = 1}^n P_Z(z_i)
&=
\max_{a,b} \prod_{i = 1}^n
\inv{2 c} \exp\lr{ -\inv{c} \Abs{z_i} } \\
&=
\max_{a,b} \prod_{i = 1}^n
\inv{2 c} \exp\lr{ -\inv{c} \Abs{y_i - a x_i - b} } \\
&=
\max_{a,b}
\lr{\inv{2 c}}^n \exp\lr{ -\inv{c} \sum_{i=1}^n \Abs{y_i - a x_i - b} }.
\end{aligned}

Maximizing this is equivalent to the $$l_1$$ norm minimization problem

\label{eqn:intro:500}
\min_{a,b} \sum_{i = 1}^n \Abs{ y_i - a x_i - b },

i.e.

\label{eqn:intro:520}
\min_\Bv \Norm{ \By - H \Bv }_1.

This is still convex, but has no analytic solution, and is an example of a linear program.

### Solution of linear program

Introduce helper variables $$t_1, \cdots, t_n$$, and minimize $$\sum_i t_i$$, such that

\label{eqn:intro:540}
\Abs{ y_i - a x_i - b } \le t_i.

This is now an optimization problem for $$a, b, t_1, \cdots t_n$$. The resulting linear program is

\label{eqn:intro:560}
\min_{a, b, t_1, \cdots t_n} \sum_i t_i

such that
\label{eqn:intro:580}
\begin{aligned}
y_i - a x_i - b &\le t_i \\
y_i - a x_i - b &\ge -t_i.
\end{aligned}
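
A general purpose linear program solver would normally be used for this problem. As a small self-contained alternative, the sketch below relies on an assumption that is not from the lecture: a linear program with a finite optimum attains it at a vertex, which for this problem corresponds to a line passing through at least two of the data points. It therefore simply tries every such line. The function names and data are hypothetical, and the approach is only sensible for small $$n$$.

```python
from itertools import combinations

def l1_cost(a, b, xs, ys):
    """Sum of absolute residuals, i.e. sum_i t_i with each t_i at its minimum."""
    return sum(abs(y - a * x - b) for x, y in zip(xs, ys))

def l1_fit_bruteforce(xs, ys):
    """Try every line through two data points with distinct x values."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as y = a x + b
        a = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b = ys[i] - a * xs[i]
        cost = l1_cost(a, b, xs, ys)
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best

# Hypothetical data: y = 2 x + 1 with one gross outlier at x = 4.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 30.0]
cost, a, b = l1_fit_bruteforce(xs, ys)
print(a, b, cost)
```

On this data the $$l_1$$ fit recovers the line through the four consistent points and absorbs the outlier entirely into one residual, which is the robustness property usually cited for this norm.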

### Single sided exponential

What if the noise is not double sided, with non-zero probability only for values $$z > 0$$? In that case a single sided probability distribution can be defined, such as that of fig. 6.

fig. 6. Single sided exponential distribution.

\label{eqn:intro:600}
P_Z(z) =
\left\{
\begin{array}{l l}
\inv{c} e^{-z/c} & \quad \mbox{$$z \ge 0$$} \\
0 & \quad \mbox{$$z < 0$$}
\end{array}
\right.

i.e. all $$z_i$$ error values are always non-negative.

\label{eqn:intro:620}
\log P_Z(z) =
\left\{
\begin{array}{l l}
\textrm{const} - z/c & \quad \mbox{$$z \ge 0$$} \\
-\infty & \quad \mbox{$$z < 0$$}
\end{array}
\right.

The problem becomes

\label{eqn:intro:640}
\min_{a, b} \sum_i \lr{ y_i - a x_i - b }

such that
\label{eqn:intro:660}
y_i - a x_i - b \ge 0 \qquad \forall i.

### Uniform noise

Consider noise that is uniformly distributed in a range, such as that of fig. 7, with a density that is constant in the range $$[-c,c]$$ and zero outside that range.

fig. 7. Uniform probability distribution.

\label{eqn:intro:680}
P_Z(z) =
\left\{
\begin{array}{l l}
\inv{2 c} & \quad \mbox{$$\Abs{z} \le c$$} \\
0 & \quad \mbox{$$\Abs{z} > c.$$}
\end{array}
\right.

or

\label{eqn:intro:700}
\log P_Z(z) =
\left\{
\begin{array}{l l}
\textrm{const} & \quad \mbox{$$\Abs{z} \le c$$} \\
-\infty & \quad \mbox{$$\Abs{z} > c.$$}
\end{array}
\right.

The MLE solution is

\label{eqn:intro:720}
\arg\max_{a,b} \prod_{i = 1}^n P_Z( y_i - a x_i - b )
=
\arg\max_{a,b} \sum_{i = 1}^n \log P_Z( y_i - a x_i - b ).

Here the argument is constant if $$-c \le y_i - a x_i - b \le c$$, so an ML solution is any $$(a,b)$$ such that

\label{eqn:intro:740}
\Abs{ y_i - a x_i - b } \le c \qquad \forall i \in 1, \cdots, n.

This is a linear program known as a “feasibility problem”.

\label{eqn:intro:760}
\min d

such that

\label{eqn:intro:780}
\begin{aligned}
y_i - a x_i - b &\le d \\
y_i - a x_i - b &\ge -d.
\end{aligned}

If $$d^\conj \le c$$, then the problem is feasible; if $$d^\conj > c$$, it is infeasible.
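
The sketch below does not solve the minimization over $$(a, b)$$. It only evaluates, for a fixed candidate $$(a, b)$$, the smallest $$d$$ that satisfies both constraints, and compares it with $$c$$. The function names and data are hypothetical.

```python
def smallest_d(a, b, xs, ys):
    """For fixed (a, b), the least d with |y_i - a x_i - b| <= d for all i."""
    return max(abs(y - a * x - b) for x, y in zip(xs, ys))

def is_ml_solution(a, b, xs, ys, c):
    """Under uniform noise on [-c, c], (a, b) is an ML solution iff feasible."""
    return smallest_d(a, b, xs, ys) <= c

# Hypothetical data scattered within about 0.2 of the line y = 2 x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]

feasible_wide = is_ml_solution(2.0, 1.0, xs, ys, c=0.25)
feasible_narrow = is_ml_solution(2.0, 1.0, xs, ys, c=0.1)
print(smallest_d(2.0, 1.0, xs, ys), feasible_wide, feasible_narrow)
```

Any candidate for which this check succeeds is as likely as any other under the uniform model, which is why the ML solution is not unique in this case.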

### Method comparison

The double sided exponential, single sided exponential and uniform probability distributions of fig 1.8 correspond, respectively, to point plots of the form shown in fig 1.9. The double sided exponential samples are distributed on both sides of the line, the single sided samples lie strictly above or on the line, and the uniform samples lie within error bars around the line of best fit.

## References

[1] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.