# Notebook


## Number Theory

Chinese Remainder Theorem. Given a factorization of an integer $M=m_1m_2$ into pairwise coprime factors, there is a bijection (in fact, a ring isomorphism) between $\mathbb{Z}_M$ and the Cartesian product $\mathbb{Z}_{m_1} \times \mathbb{Z}_{m_2}$. It restricts to a bijection between the units modulo $M$ and $U_1 \times U_2$, where $U_i$ is the set of units (i.e., invertible elements) modulo $m_i$, i.e., the set of integers coprime to $m_i$.

More. This bijection takes an element $m \bmod M$ to the tuple $(m \bmod m_1, m\bmod m_2)$. Moreover, given a tuple $(a_1 \bmod m_1, \cdots, a_n \bmod m_n)$, the corresponding element in $\mathbb{Z}_M$ is

$\displaystyle \sum_i{a_i\, \overline{m_i} \left( \overline{m_i}^{-1} \bmod m_i \right) } \bmod M$

where $\overline{m_i} = M/m_i$. Notice that $\overline{m_i} \bmod m_i \neq 0$ since the $\{m_i\}$ are coprime to each other.
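A minimal Python sketch of this reconstruction (the function name `crt` is my choice, not from the text; it relies on Python's built-in three-argument `pow` for the modular inverse):

```python
from math import prod

def crt(residues, moduli):
    """CRT reconstruction: x = sum_i a_i * Mbar_i * (Mbar_i^{-1} mod m_i)  (mod M),
    where Mbar_i = M / m_i and the moduli are pairwise coprime."""
    M = prod(moduli)
    x = 0
    for a_i, m_i in zip(residues, moduli):
        Mbar = M // m_i
        x += a_i * Mbar * pow(Mbar, -1, m_i)  # inverse exists since gcd(Mbar, m_i) = 1
    return x % M

print(crt([2, 3, 2], [3, 5, 7]))  # → 23  (x ≡ 2, 3, 2 mod 3, 5, 7)
```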

Fermat’s Integer Factorization of $N$. Try to find two integers $a,b$ such that $a \not\equiv \pm b \pmod N$ and $a^2 \equiv b^2 \pmod N$. Then $N$ divides the product $(a+b)(a-b)$ but, by assumption, neither factor alone. Consequently, the gcd of $N$ and at least one of these factors must be non-trivial. This gcd is a non-trivial divisor of $N$.
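The classical search for such a pair starts at $a = \lceil \sqrt N \rceil$ and looks for $a^2 - N$ being a perfect square $b^2$, which gives $a^2 \equiv b^2 \pmod N$. A sketch (the helper name `fermat_factor` is mine; it assumes $N$ is odd and composite):

```python
from math import gcd, isqrt

def fermat_factor(N):
    """Search for a with a^2 - N a perfect square b^2; then N = (a - b)(a + b),
    so gcd(a - b, N) is a factor (nontrivial when N is odd and composite)."""
    a = isqrt(N)
    if a * a < N:
        a += 1
    while True:
        b2 = a * a - N
        b = isqrt(b2)
        if b * b == b2:
            return gcd(a - b, N)  # a^2 ≡ b^2 (mod N), so this gcd divides N
        a += 1

print(fermat_factor(5959))  # → 59  (5959 = 59 * 101)
```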

## Algebra

Finite Abelian Groups. Every finite Abelian group $G$ can be written as a direct product

$\displaystyle G \cong \bigoplus_{i=1}^t \mathbb{Z}_{q_i} \cong \bigoplus_{j=1}^s \mathbb{Z}_{k_j}$

where $0 < s \leq t < \infty$, the $q_i$ are powers of (possibly repeating) primes, and $k_1$ divides $k_2$, $k_2$ divides $k_3$, and so on up to $k_s$.

Cyclic Groups. These are Abelian groups generated by a single element. $\mathbb{Z}_n$ is cyclic for every $n$. For every positive divisor $d$ of $n$, the group $G_n := \mathbb{Z}/(n)$ has precisely one subgroup of order $d$: the one generated by the residue class of $n/d$. The number of generators of $G_n$ equals the number of elements coprime to $n$; this number is given by Euler’s totient function $\phi(n)$. The order of the element $d$ in $G_n$ is $n/\gcd(d, n)$. The direct product of two cyclic groups $G_n$ and $G_m$ is cyclic if and only if $n$ and $m$ are coprime.

Every finite abelian group can be expressed as “the” direct sum of cyclic subgroups of prime-power order, and these orders are uniquely determined.

Pontryagin Dual. Let $G$ be an additive finite Abelian group and let $T$ be the unit torus with multiplication as the group operation. Let $\chi : G \rightarrow T$ be a group homomorphism, i.e., $\chi(0) = 1$ and $\chi(x+y) = \chi(x) \chi(y)$. Then, the set $\hat{G}$ of all such homomorphisms forms an Abelian group with the group operation being “point-wise multiplication”

$(\chi \cdot \rho)(x) = \chi(x) \rho(x)\,, \qquad x \in G$ ,

and the identity element being the constant function

$1_G: x \mapsto 1\, , \qquad x \in G$ .

$\hat{G}$ is called the Pontryagin dual of $G$, and the character $1_G$ is called the trivial character.

## Coding Theory

Reed-Muller Code. The order-$r$ Reed-Muller code $R(r,m)$ is the set of (evaluation vectors of) Boolean polynomials in $m$ variables of degree at most $r$.

Reed-Solomon Code. In the Reed-Solomon code $RS_{q, \mathcal{E} }(n,k)$, a polynomial $m(x)$ of degree at most $k-1$ over $GF(q)$ — the message — is encoded into the codeword $(c_0, \cdots, c_{n-1})$ over $GF(q)$, where the $n$ symbols are obtained by evaluating $m(x)$ on a fixed set of distinct points $\mathcal{E} = \{ \alpha_0, \cdots, \alpha_{n-1} \}$, i.e., $c_j = m(\alpha_j)$.

Since a nonzero $m(x)$ has at most $k-1$ zeros over any field, at most $k-1$ of the $c_j$s can be zero. This means the minimum distance $d$ is at least $n - (k-1)$, which matches the Singleton bound. Hence the Reed-Solomon code is MDS.
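A brute-force check of the MDS property on a toy instance (the parameters $q=7$, $n=6$, $k=2$ and the helper `encode` are my choices for illustration):

```python
from itertools import product

q, n, k = 7, 6, 2                 # RS code over GF(7), evaluation points 0..5
points = list(range(n))

def encode(msg):
    """Evaluate the message polynomial m(x) = msg[0] + msg[1] x + ... at each point."""
    return tuple(sum(c * pow(x, i, q) for i, c in enumerate(msg)) % q
                 for x in points)

codewords = [encode(m) for m in product(range(q), repeat=k)]
# minimum Hamming distance over all pairs of distinct codewords
dist = min(sum(a != b for a, b in zip(u, v))
           for i, u in enumerate(codewords) for v in codewords[:i])
print(dist, n - k + 1)  # distance meets the Singleton bound
```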

Singleton Bound. For a linear $[n,k,d]$ code, the distance $d$ is at most $n-k+1$. Any code matching this bound is called MDS (maximum distance separable).

Proof: Suppose we have a $q$-ary code with block length $n$, dimension $k$, and distance $d$. This means each pair of codewords has Hamming distance at least $d$. If we puncture the code by deleting (without loss of generality) the first $d-1$ symbols from each codeword, the resulting pairwise distance among codewords is still at least $1$; that is, all $q^k$ codewords remain distinct in a space of size $q^{n-(d-1)}$. Hence $q^{n-(d-1)} \geq q^k$, which rearranges to

$d \leq n - k + 1$ .

Plotkin Bound. If the distance $d$ of a binary code is strictly larger than half the block length $n$, say $d = n/2 + \alpha$ for some $\alpha > 0$, the number of codewords is at most $\displaystyle 2 \lfloor d/ 2\alpha \rfloor \leq \lfloor d/ \alpha \rfloor$.

Johnson Bound. Consider a binary code whose relative distance is “large” i.e., at least $1/2 - \epsilon$. Then, for every “large” $\alpha > \sqrt{\epsilon}$, only a “small” number of codewords are “close” to any given string. In particular, for any string $x$ (in the target space), the number of codewords within relative distance $1/2 - \alpha$ from $x$ is at most $1/(2\alpha^2)$.

Gilbert-Varshamov Bound. For every $\delta < 1/2$ and all sufficiently large $n$, there exists a binary code with block length $n$, relative distance $\delta$, and dimension at least $n(1-H(\delta))$. Here, $H$ is the binary entropy function.

Hadamard Matrix. Let $H_1 = \left( \begin{array}{cc} 1 & 1 \\ 1 & -1 \end{array}\right)$ . Then

$H := H_{2^n} = \left( \begin{array}{cc} H_{2^{n-1}} & H_{2^{n-1}} \\ H_{2^{n-1}} & -H_{2^{n-1}} \end{array}\right)$

is the Hadamard matrix of order $2^n$.

• The rows/columns of $H$ are indexed by $n$-bit Boolean strings. Let $\odot$ denote the inner-product modulo $2$. Then

$H(x,y) = (-1)^{x \odot y}$.

• The map $H : \{0,1\}^n \mapsto \{0,1\}^{2^n}$ is a $[2^n, n, 2^{n-1}]$ linear code.
• The volume enclosed by the (unit-length) columns of $H$ (i.e., the determinant) is the largest possible for any real matrix of the same order.
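A quick sketch verifying the Sylvester recursion and the entry formula $H(x,y) = (-1)^{x \odot y}$ (standard binary row/column indexing assumed; helper names are mine):

```python
def hadamard(n):
    """Sylvester construction of H_{2^n} as nested lists of ±1 entries."""
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def parity_dot(x, y):
    """Inner product mod 2 of the bit-vectors of x and y."""
    return bin(x & y).count("1") % 2

n = 3
H = hadamard(n)
N = 2 ** n
# entries follow (-1)^{x ⊙ y} ...
assert all(H[x][y] == (-1) ** parity_dot(x, y) for x in range(N) for y in range(N))
# ... and the rows are pairwise orthogonal, so |det H| attains Hadamard's bound
assert all(sum(H[i][t] * H[j][t] for t in range(N)) == 0
           for i in range(N) for j in range(i))
```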

Discrete Fourier Transform. See here.

## Probability

Expectation. Suppose a random variable $X$ takes its value from the domain $\chi$.

$\displaystyle E[X] = \sum_{x \in \chi}{x \Pr[X = x]}$ .

Additionally, if $X$ takes non-negative integer values,

$\displaystyle E[X] = \sum_{x \geq 1}{ \Pr[X \geq x]}$ .
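For non-negative integer-valued $X$, the tail-sum identity reads $E[X] = \sum_{x \geq 1} \Pr[X \geq x]$; a quick exact check with a fair die (using `fractions` to avoid rounding):

```python
from fractions import Fraction

# fair six-sided die: X uniform on {1, ..., 6}
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())
tail_sum = sum(sum(p for y, p in pmf.items() if y >= x) for x in range(1, 7))
print(mean, tail_sum)  # → 7/2 7/2
```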

Variance. In general,

$\displaystyle Var(X) = E[X^2] - E[X]^2$ .

If $X_1, \cdots, X_n$ are independent, then

$\displaystyle Var(\prod_i X_i) = \prod_i E[X_i^2] - \prod_i E[X_i]^2$ .

Moments. For any random variable $X$, its $n$th moment is simply $\mathbb{E} X^n$. These moments are given by the (exponential) moment generating function

$\displaystyle \mathbb{E} e^{t X} = \sum_{n\geq 0}{ t^n \frac{ \mathbb{E} X^n}{n!}}$.

Given an mgf $g(X, t)$, we can calculate the $n$th moment of $X$ by taking the $n$th derivative of $g$ w.r.t. $t$ and evaluating it at $t = 0$.

The $r$th factorial moment of $X$ is defined as

$\displaystyle \mathbb{E} (X)_r = \mathbb{E} X (X-1) (X-2) \cdots (X - r + 1)$ .

The regular moments and the factorial moments are related by

$\displaystyle \mathbb{E}X^n = \sum_{r = 0}^n{ \begin{Bmatrix}n \\ r\end{Bmatrix} \mathbb{E} (X)_r }$

where $\displaystyle \begin{Bmatrix}n \\ r\end{Bmatrix}$ are the Stirling numbers of the second kind.
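A sketch verifying this identity exactly on a small distribution (the helper names `stirling2`, `moment`, and `falling_moment` are mine):

```python
from fractions import Fraction
from functools import lru_cache
from math import prod

@lru_cache(None)
def stirling2(n, r):
    """Stirling numbers of the second kind: S(n,r) = r S(n-1,r) + S(n-1,r-1)."""
    if n == r == 0:
        return 1
    if n == 0 or r == 0:
        return 0
    return r * stirling2(n - 1, r) + stirling2(n - 1, r - 1)

# an arbitrary small distribution on {0, 1, 2, 3}
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def moment(n):
    return sum(p * x ** n for x, p in pmf.items())

def falling_moment(r):
    # E[X (X-1) ... (X-r+1)]; empty product (r = 0) is 1
    return sum(p * prod(x - j for j in range(r)) for x, p in pmf.items())

n = 4
lhs = moment(n)
rhs = sum(stirling2(n, r) * falling_moment(r) for r in range(n + 1))
assert lhs == rhs  # E X^n = sum_r S(n,r) E (X)_r, exactly
```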

Limit superior and limit inferior of a real sequence. A bounded real sequence $(r_n)$ can be visualized as an oscillating curve. The tail supremum, written $\sup_{m \geq n} r_m$, tightly bounds this oscillating curve from above and is non-increasing in $n$; as $n \rightarrow \infty$ it converges to $\limsup_n r_n$.

Likewise, the tail infimum, written $\inf_{m \geq n} r_m$, does the same from below and converges to $\liminf_n r_n$ as $n \rightarrow \infty$. The two limits coincide if and only if the sequence converges, in which case

$\displaystyle \sup_{m \geq n} r_m \geq \limsup_n r_n \geq \lim_{n \rightarrow \infty} r_n \geq \liminf_n r_n \geq \inf_{m \geq n} r_m\, .$

Limit superior/inferior of a sequence of subsets. Consider an infinite sequence of subsets $S = (S_n)$ of elements from some ground set $\Omega$. The limit superior of this sequence, written $\limsup S$, is the set of elements $w \in \Omega$ that belong to infinitely many of the $S_n$. The limit inferior of this sequence, written $\liminf S$, is the set of elements $w \in \Omega$ that are excluded by only finitely many of the $S_n$.

Borel-Cantelli Lemmas. These lemmas establish a connection between a set-sequence and its limit superior.

Specifically, let $\Omega$ be a set and let $E = (E_n)$ be an infinite sequence of events on (i.e., subsets of) $\Omega$. Let $p_n = \Pr[E_n]$. If $T = \sum_n p_n < \infty$ then, almost surely, only finitely many of the events $E_n$ occur. That is to say, the limit superior of the sequence $(E_n)$ has probability zero.

Conversely, suppose the sets $(E_n)$ are independent and let $p$ be the probability of the limit superior of $E$. Then (1) $p = 0$ iff $T < \infty$, and (2) $p = 1$ iff $T = \infty$.

Weak Convergence, or Convergence in Distribution. Let $X_1, X_2, \cdots$ be a sequence of random variables. Let $F_i$ be the cumulative distribution function associated with $X_i$. Then, we say “the sequence $(X_i)$ converges in distribution” (and write $F_i \Rightarrow F$) if there exists some distribution function $F$ such that $\lim_{n \rightarrow \infty}F_n(x) = F(x)$ for all real $x$ at which $F$ is continuous.

Helly’s Theorem. Any sequence of distribution functions has a subsequence that converges pointwise (but it may not be a distribution function).

In addition, $(X_n)$ converges to $X$ in distribution if and only if, for all bounded continuous functions $f$ on $\mathbb{R}$, $\mathbb{E}[f(X_n)]$ converges to $\mathbb{E}[f(X)]$.

Levy’s Continuity Theorem. Probability measures $(\mu_n)$ converge in distribution to $\mu$ if and only if their characteristic functions $(\phi_n)$ converge pointwise to the characteristic function of $\mu$. Moreover, if $(\phi_n)$ converge pointwise to some function $\phi$ that is continuous at the origin, then $\phi$ is, in fact, the characteristic function of some measure $\mu$, and $(\mu_n)$ converge in distribution to $\mu$.

Convergence in Probability. A sequence $(X_n)$ of random variables converges in probability to some random variable $X$ if, for any positive real $\epsilon$, the probability of the event “$|X_n - X|$ is larger than $\epsilon$” tends to zero as $n \rightarrow \infty$.

Strong convergence, or almost sure convergence. A sequence $(X_n)$ of random variables, defined over some ground set $\Omega$, converges almost surely to some random variable $X$ if there exists a set $\mathrm{Bad} \subset \Omega$ of probability zero so that for all $w \not \in \mathrm{Bad}$, $X_n(w) \rightarrow X(w)$ as $n \rightarrow \infty$.

Every sequence that converges in probability contains a subsequence that converges almost surely.

Monotone Convergence Theorem. It tells us when the limit commutes with an integral (or a sum).

If a monotone non-decreasing sequence of non-negative real functions $(f_n)$ converges almost surely to its limit, one can exchange the order of limits and integrals over these functions.

For a monotone non-increasing sequence, the same holds if the functions are integrable. (The reason is, the difference sequence $(f_1 - f_n)$ is non-decreasing and it limits to $f_1 - f$ where $f$ is the limit; we can apply MCT to this sequence but, to deal with the minus sign, we need the individual $f_n$s to be integrable.)

The same is true for a monotone sequence of non-negative random variables and their expectations.

Fatou’s Lemma. Let $(X_n)$ be a sequence of non-negative random variables. (We don’t assume any convergence.) Then

$\displaystyle \liminf_{n \rightarrow \infty} \mathbb{E} X_n \geq \mathbb{E}\left[ \liminf_{n \rightarrow \infty} X_n \right]\,.$

Informally, this states that the expectations of a sequence of random variables oscillate less than the sequence itself.

Dominated/Bounded Convergence Theorem. It tells us when the limit commutes with an integral (or a sum).

Let $(f_n)$ be a sequence of real functions that converges almost everywhere to some function $f$. If there is an integrable function $g$ that dominates every $|f_n|$, then limit and integration commute on $(f_n)$.

Central Limit Theorem, or Lindeberg-Lévy CLT. Let $X_1, \cdots, X_n$ be a sequence of independent and identically distributed random variables, let $S_n \triangleq (\sum X_i)/n$ be the sample average, and let $\mu \triangleq \mathbb{E} X_i$ and $\sigma^2 \triangleq \mathbf{Var}(X_i) < \infty$. Then as $n$ tends to infinity, the limiting distribution of $\dfrac{S_n - \mu}{\sigma/ \sqrt{n} }$ is standard normal $\mathcal{N}(0, 1)$.

If, in addition, the third moment of the random variables is finite, Berry-Esseen theorem gives a quantitative bound on the rate of convergence.

Berry-Esseen CLT. Let $X=(X_1, \ldots, X_n)$ be independent real random variables with zero mean and variance $\sigma_i^2$. Let $\sigma \triangleq \sqrt{\sum \sigma_i^2}$ and $\rho_i = \mathbb{E} |X_i|^3 < \infty$. Define $S_n \triangleq (\sum X_i)/\sigma$. Let $F_n$ and $\Phi$ be the cumulative distribution functions of $S_n$ and the standard normal, respectively. Then for all real $u$,

$|F_n(u) - \Phi(u)|\leq 0.56\, \dfrac{ \sum \rho_i }{\sigma^3} \leq \dfrac{0.56}{ \sigma}\, \max_i \dfrac{\rho_i}{\sigma_i^2} \leq 0.56\, \|X\|_\infty$.

If $X_1, \ldots, X_n$ are identically distributed, then for all real $x$,

$\displaystyle |F_n(x) - \Phi(x)| \leq \frac{0.56\, \sum \rho_i}{(\sum \sigma_i^2)^{3/2}} = 0.56\, \dfrac{n \rho_1}{n^{3/2} \sigma_1^3} = 0.56\, \frac{\rho_1}{\sqrt{n} \sigma_1}$ .

In addition, if $|X_i| < \epsilon$ for some $\epsilon \in (0, 1)$ and $\sigma_1^2 = 1$, then

$\displaystyle |F_n(x) - \Phi(x)| \leq 0.56\, \epsilon$.

That is, the approximation error is of the order $1/\sqrt{n}$ for all $n$.

Multidimensional (Berry-Esseen Type) CLT. Let $X_1, \cdots, X_n$ be independent $d$-dimensional real random variables with zero mean, let $T_n \triangleq \sum X_i$, and let $\Sigma \triangleq \text{Cov}(T_n)$ be the (invertible) covariance matrix of the sum. Let $Z \sim \mathcal{N}(0, \Sigma)$ be a normal random vector on $\mathbb{R}^d$. Then, for all convex sets $A \subset \mathbb{R}^d$,

$\displaystyle | \Pr[ T_n \in A ] - \Pr[Z \in A] |\leq c\, d^{1/4}\, \gamma$ ,

where

$\displaystyle \gamma \triangleq \mathbb{E} \sum_i \| \Sigma^{-1/2} X_i \|_2^3$.

Tower Property of Conditional Expectation.

Let $X$ be a random variable and the sets $A, B$ denote events.

$\displaystyle \mathop{E}_X [X] = \mathop{E}_B[\, \mathop{E}_{B^c} [X \mid B] ]$

where $B^c$ denotes set complement.

Using $X \leftarrow (X \mid A)$,

$\displaystyle \mathop{E}_{A^c} [X \mid A] = \mathop{E}_{A^c}\big[ \mathop{E}_{(A \cup B)^c} [X \mid A, B] \mid A\big]$ .

In particular, if $A \subseteq B$,

$\displaystyle \mathop{E}_{A^c} [X \mid A] = \mathop{E}_{A^c}\big[ \mathop{E}_{B^c} [X \mid B] \mid A\big]$ .

Proof: here.

The Birthday Paradox. Suppose $k = \Omega(\sqrt{n})$. When you randomly draw $k$ times from an $n$-set with replacement, the resulting $k$-subset will contain a repeated element with constant probability.

Explanation. Suppose we choose $k$ independent samples $X=\{ x_1, \cdots, x_k \}$ from the numbers $1, \cdots, n$. For any $i\neq j$, the probability that $x_i=x_j$ is $1/n$. By linearity, the expected number of collisions in $X$ is ${k \choose 2}/n$, which is a constant bounded away from zero whenever $k = \Omega(\sqrt{n})$.
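The exact collision probability can be computed from the product $\prod_{i < k}(1 - i/n)$ for no-collision; a sketch (using the classic $n=365$, $k=23$ instance, where the probability first exceeds $1/2$):

```python
def collision_prob(n, k):
    """Exact probability of at least one repeat among k uniform draws from n items."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (n - i) / n   # i-th draw avoids the i earlier distinct values
    return 1 - p_distinct

print(round(collision_prob(365, 23), 3))  # ≈ 0.507, already past 1/2 at k ≈ sqrt(n)
```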

Coupon Collector’s Problem. Suppose you want to color all items in a set $S$ of size $n$ as follows: at round $1, 2, 3, \cdots$, you pick a random element from $S$; if it is still uncolored, you color it red, put it back, and repeat. This process will require $\Theta(n \log n)$ rounds to color all items.

In particular, let $\tau$ be the number of rounds till the game ends. For any positive $c$,

$\displaystyle \Pr[ \tau \geq n\log n + cn] \leq e^{-c}$ .

There is also an asymptotically matching lower bound.
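The expectation behind the $\Theta(n \log n)$ bound is exactly $n H_n$: after $i-1$ distinct items are colored, coloring a new one takes $n/(n-i+1)$ rounds in expectation. A quick numeric check (helper name mine):

```python
from math import log

def expected_rounds(n):
    """E[tau] = n * H_n, summing the expected wait for each new distinct item."""
    return sum(n / (n - i) for i in range(n))

n = 1000
exact = expected_rounds(n)
print(exact / (n * log(n)))  # ratio tends to 1 as n grows
```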

Balls in Bins. If $n$ balls are thrown into $n$ bins independently and uniformly at random, the “heaviest” bin contains at most $\displaystyle \frac{\ln n}{\ln \ln n} \big(1+o(1) \big)$ balls with high probability. However, if the balls are placed sequentially—each into the “lightest” among $d$ bins sampled uniformly at random—then the largest load is at most $\displaystyle \frac{\ln \ln n}{\ln d} + \Theta(1)$. Thus even if $d=2$, the bins are highly balanced in the sequential case.

Binomial Distribution.  Let $S_n \sim \text{Binomial}(n, p)$ and let $q = 1 - p$. Let $m$ be the integer index of the central term.

$\displaystyle \Pr[S_n \text{ is even}] = \frac{1}{2}(1 + (2p-1)^n)\, .$

$\displaystyle \Pr[S_n \geq r] \leq \dfrac{rq}{(r - np)^2}\quad\text{if}\quad r > np$ .

$\displaystyle \Pr[S_n \leq r] \leq \dfrac{(n - r)p}{(np - r)^2}\quad\text{if}\quad r < np$ .

Normal approximation (DeMoivre-Laplace limit theorem.) Let $\sigma = \sqrt{n pq}$ and $a_k = \Pr[S_n = m + k]$. Let $\mathsf{n}$ be the normal density function, i.e.,

$\displaystyle \mathsf{n}(x) = e^{-x^2/2}/\sqrt{2 \pi}$.

Then

$\displaystyle a_k \sim \dfrac{\mathsf{n}(k/\sigma)}{\sigma}\quad\text{if}\quad k^3 = o(n^2)$.

Specifically, if $n \rightarrow \infty$ and $k^3 = o(n^2)$ then

$\displaystyle \left| \dfrac{\sigma a_k}{\mathsf{n}(k/\sigma)} - 1 \right| < \epsilon\quad\text{for every}\quad \epsilon > 0$.
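A numeric sanity check of the local approximation $a_k \approx \mathsf{n}(k/\sigma)/\sigma$ (the parameters $n=1000$, $p=1/2$ are my choice):

```python
from math import comb, exp, pi, sqrt

n, p = 1000, 0.5
q = 1 - p
m = int(n * p)                    # index of the central term
sigma = sqrt(n * p * q)

def a(k):
    """a_k = Pr[S_n = m + k] for S_n ~ Binomial(n, p)."""
    j = m + k
    return comb(n, j) * p ** j * q ** (n - j)

def normal_density(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

for k in (0, 10, 30):
    approx = normal_density(k / sigma) / sigma
    print(k, a(k) / approx)       # ratios close to 1 while k^3 = o(n^2)
```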

Poisson Distribution. Let $X \in \{0, 1, 2, \cdots\}$ be a discrete random variable. If it has the Poisson distribution $P(\mu)$, its probability mass function at point $k$ is the probability that $X$ is $k$ given $\mathbf{E}\,X = \mu$. That is,

$f(k;\mu) := \frac{\mu^k }{k! e^\mu}.$

The expectation and the variance of $X$ are both $\mu$. The mean absolute deviation of $X$ from its mean is $2\mu\, f( \lfloor \mu \rfloor; \mu)$. A binomial distribution $B(n,p)$ tends to the Poisson distribution $P(\mu)$ when $n \rightarrow \infty$ and $p \rightarrow 0$ with $np \rightarrow \mu$.

Upper Tail. Let $S_n$ be the sum of $n$ i.i.d. Bernoulli random variables, each with expectation $\lambda/n$ so that $\mathbb{E}S_n = \lambda$. Then we have a crude tail bound:

$\displaystyle \Pr[S_n \geq k] \leq \dfrac{\lambda^k}{k!}\,.$

Note: This follows from a union bound over all ${n \choose k}$ subsets of $k$ trials: $\Pr[S_n \geq k] \leq {n \choose k} (\lambda/n)^k \leq \lambda^k/k!$ .

Geometric Distribution. Suppose we are having a number of independent Bernoulli trials with success probability $p$ and failure probability $q = 1-p$. Let $X \in \{1,2,3, \cdots\}$ be the random variable denoting the index of the first success. Then $X$ has a geometric distribution with

• mass function $\Pr[X = n]\, = (1 - p)^{n-1} p$,
• expectation $1/p$,
• variance $(1-p)/p^2$,
• moment generating function $M_X(a) = \dfrac{p e^a}{1 - (1 - p) e^a}$ ,
• For a non-negative real $\lambda$, the $\lambda$th moment

$\mathbb{E} X^\lambda = \dfrac{p}{q}\mathrm{Li}_{-\lambda}(1 - p) \leq \dfrac{\Gamma( \lambda + 1)}{p^\lambda}$

where $\mathrm{Li}$ is the polylogarithm function

$\displaystyle \mathrm{Li}_a(x) := \sum_{k=1}^\infty \frac{x^k}{k^a}$

and $\Gamma$ is the gamma function. Thus for integers $n$,

$\mathbb{E} X^n \leq \dfrac{n!}{p^n}$.

Hypergeometric Distribution. This is the binomial distribution without replacement. In particular, let $h(k; N, K, n)$ be the probability that exactly $k$ “good” items are obtained in $n$ sample-without-replacement trials from a population of $N$ items containing exactly $K$ “good” items in total.

$\displaystyle h(k; N, K, n) = \frac{ {K \choose k} {N-K \choose n-k} }{ {N \choose n} }$ .

The mean of this distribution is, surprisingly, $nK/N$, which equals $\mathbf{E}_\mathrm{binomial}$. The variance is

$\displaystyle n\, \frac{K}{N} \cdot \frac{N-K}{N} \cdot \frac{N-n}{N-1} = \mathbf{Var}_\mathrm{binomial} \cdot (1- o(1) )$ if $n = o(N)$.
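A sketch checking the mean claim exactly with `math.comb` (the helper name `hyper_pmf` is mine):

```python
from fractions import Fraction
from math import comb

def hyper_pmf(k, N, K, n):
    """Probability of exactly k good items in n draws without replacement."""
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))

N, K, n = 50, 5, 10
mean = sum(k * hyper_pmf(k, N, K, n) for k in range(min(K, n) + 1))
assert mean == Fraction(n * K, N)  # same mean as Binomial(n, K/N)
```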

Stochastic Dominance (First Order). Consider two random variables $A, B$. $A$ stochastically dominates $B$ if $A$ “always” has a “fatter right tail” than that of $B$. To be precise, for all $k$ in the (totally-ordered) domain of the variables, $\Pr[A > k] \geq \Pr[B > k]$ with strict inequality for some $k$. $A$ having a fatter right tail is equivalent to $A$ having a thinner left tail.

The Gambler’s Ruin Problem. See here.

The Unbiased Random Walk. See here.

Shearer’s Inequality. Let $S_i \subseteq [N], i \leq M$, be coordinate-sets such that each coordinate $j \in [N]$ is covered by at least $r$ different $S_i$s. Let $X = (X_1, \cdots, X_N)$ be jointly distributed random variables. The entropy of $X$ is at most $1/r$ times the sum of the entropies of its restrictions to the coordinate sets. That is,

$\displaystyle H(X_1, \cdots, X_N) \leq \frac{1}{r} \sum_{i=1}^M{ H(X\rvert_{S_i})}$ .

Total Variation Distance of two distributions $f, g$. The total variation distance of two distributions over a set is equal to

• the maximum absolute difference over all subsets.
• half the sum of absolute pointwise differences.
• the sum of one-sided (i.e., where $f > g$) signed pointwise differences.

That is,

$\displaystyle \Vert f - g \Vert_{tv} = \sup_{S \subseteq X} \vert f(S) - g(S) \vert$.

$\displaystyle \Vert f - g \Vert_{tv} = \sum_{x \in F}{f(x) - g(x)} = \sum_{x \in G}{g(x) - f(x)} = \frac{1}{2} \Vert f - g \Vert_1$,

where $F = \{ x \in X: f(x) \geq g(x) \}$ and $G = \{x \in X: g(x) \geq f(x)\}$.

See here for a proof.
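A sketch checking that the three characterizations agree on a small example (the distributions $f$, $g$ are arbitrary choices; exact arithmetic via `fractions`):

```python
from fractions import Fraction
from itertools import chain, combinations

X = ["a", "b", "c", "d"]
f = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
g = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 4), "d": Fraction(1, 4)}

# (1) maximum absolute difference over all subsets
subsets = chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))
tv_max = max(abs(sum(f[x] - g[x] for x in S)) for S in subsets)

# (2) half the L1 distance
tv_l1 = sum(abs(f[x] - g[x]) for x in X) / 2

# (3) one-sided sum over the points where f > g
tv_onesided = sum(f[x] - g[x] for x in X if f[x] > g[x])

assert tv_max == tv_l1 == tv_onesided
```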

For two fixed distributions $f, g$, the “disagreement probability” $\Pr[X \neq Y]$ of any coupling $X \sim f, Y \sim g$ is at least the tv-distance between $f$ and $g$, and some coupling attains it. That is,

$\displaystyle \Vert f - g \Vert_{tv} = \inf_{X \sim f, Y \sim g} \Pr[X \neq Y]$.

For any distribution $P$ and the uniform distribution $U$ on the finite Abelian group $\mathbb{Z}/p$ for a prime $p$, it is not hard to show that

$\displaystyle \Vert P - U \Vert_{tv}^2 \leq \frac{1}{4} \sum_{0 \neq z \in \mathbb{Z}/p} \left| \hat{P}(z) \right|^2 = \frac{1}{4} \left( \Vert \hat{P}\Vert_2^2 - \hat{P}(0)^2 \right) = \frac{1}{4} \left( \Vert P \Vert_2^2 - 1 \right)$

since the transform is unitary (i.e., norm-preserving) and $\hat{P}(0)^2 = \Vert P \Vert_1^2 = 1$ as $P$ is a probability distribution.

See Equation (7) in the Chung-Graham paper for an easy proof via Cauchy-Schwarz and Plancherel.

Pinsker’s Inequality. The Kullback-Leibler divergence of two distributions is at least twice the square of their total-variation distance. Namely,

$\displaystyle D_\text{KL}(f \Vert g) \geq 2\Vert f - g \Vert_{tv}^2$.

Kullback-Leibler Divergence. Given a probability distribution $p$ on a discrete sample space $X$, let $s(i ; p) := \log \frac{1}{p(i)}$ denote the “amount of surprise” associated with the event $i$ according to the distribution $p$. The Entropy is the expected amount of surprise, namely

$\displaystyle H(p) = \sum_{i \in X}{p(i) s(i; p)}$.

The KL Divergence of $q$ with respect to $p$ is

• the expected amount of surprise “remaining” in $q$ after revealing $p$.
• the entropy of $q$ relative to $p$.
• the expectation (in $p$) of the pointwise-difference of the log-probabilities.

$\displaystyle D_\text{KL}(p \Vert q) = \sum_i{ p(i) \left(s(i;q) - s(i;p) \right) } = \sum_i{ p(i) \log \frac{p(i)}{q(i)}}$.
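A quick numeric check of the divergence formula together with Pinsker's inequality (the example distributions are my choice; KL measured in nats):

```python
from math import log

f = [0.5, 0.25, 0.25]
g = [0.25, 0.25, 0.5]

kl = sum(p * log(p / q) for p, q in zip(f, g))     # D_KL(f || g)
tv = sum(abs(p - q) for p, q in zip(f, g)) / 2     # total variation distance

assert kl >= 2 * tv ** 2   # Pinsker's inequality
print(kl, tv)
```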

Markov Chains.

• A Markov chain is aperiodic if, for example, it has self-loops. It is irreducible (i.e., connected) if every state is reachable from every other state. It is ergodic if it has a unique stationary distribution $\pi$, which happens if and only if the chain is aperiodic and irreducible. Let $P = (p_{ij})$ be the transition probability matrix of the chain. The chain is reversible if $\pi_i p_{ij} = \pi_j p_{ji}$ for all $i,j$. A random walk is lazy if it stays at the same state with a constant probability; a lazy chain is automatically aperiodic. Thus a lazy and irreducible random walk is ergodic.
• In the stationary state, the chain spends a $\pi_i$ fraction of its time at state $i$.
• Let $\tau_{jj}$ be the time taken by the chain, starting from state $j$, return to $j$ for the first time. In the stationary state, in expectation, the chain visits state $j$ once every $\mathbb{E} \tau_{jj}$ steps. Hence $\pi_j = 1/ \mathbb{E} \tau_{jj}$.
• The mixing time $\tau$ of a random walk with an initial distribution $x$ is the number of steps $t$ it takes so that the distribution after $t$ steps is very close to its stationary distribution.
• An ergodic chain mixes rapidly. In particular, for any initial distribution $x$,

$\displaystyle \left\Vert \frac{P^t(x, \cdot)}{\pi(\cdot)} - 1 \right\Vert_1 \leq \frac{\lambda_2^t}{\min_i \pi(i)} \Vert x \Vert_2$

where $\lambda_2$ is the second largest eigenvalue in absolute value. The mixing time in the above case is

$\displaystyle t_{mix} = O\left( \frac{\log \left(1/\min_i \pi(i)\right) }{1 - \lambda_2}\right)$.

For a $d$-regular graph, this simplifies to $O((\log n) / (1 - \lambda_2))$.

• Mixing time in $\ell_\infty$. The probability mass at any state is within an $\epsilon$ of its stationary value after only

$\displaystyle t_{mix} = \frac{ \log(1/\epsilon)}{\gamma}$

steps where $\gamma$ is the spectral gap of the transition matrix $P$. In particular, after $t$ steps, for any states $x, y$,

$\displaystyle \left \vert \frac{P^t(x,y)}{\pi(y)} - 1 \right \vert \leq \lambda_2^t = (1 - \gamma)^t \approx \exp(-\gamma t)$ .

See here and here for nice expositions.

• Ergodic Theorem. Asymptotically, the fraction of times the walk stays at a specific state equals the stationary mass at that state.
• Concentration. See here.

Mixing Time via Coupling. Given two identical walks on the same state-space starting from different initial states $x$ and $y$, let $\displaystyle \tau_{couple}$ be the time when they meet. Then

$\displaystyle t_{mix} \leq 4 \max_{x,y} \mathbb{E} \tau_{couple}$.

• For the lazy random walk on the Hypercube (stay with probability $1/2$), the mixing time is

$\displaystyle t_{mix}(\epsilon) \leq n \log n + n \log (1/\epsilon) = n \log (n/\epsilon)$ .

• For the lazy biased walk on the Cycle $\mathbb{Z}_n$ with

$\Pr[\text{stay}] = 1/2\, , \qquad \Pr[\text{up}] = p/2\, , \qquad \Pr[\text{down}] = (1-p)/2\,$ ,

$\displaystyle \frac{n^2}{32} \leq t_{mix} \leq n^2$ .

• A birth and death process is a lazy Markov chain on $[0, n]$ with possibly different up/down/stay probabilities at each state. Let $\mathbb{E}_a(\tau_b)$ denote the expected time for a walk started at $a$ to hit $b$. Then, for $a < b$,

$\displaystyle \mathbb{E}_{a}(\tau_b) = \sum_{k = a+1}^b \mathbb{E}_{k-1}(\tau_{k})$ .

A special case is when $p = \Pr[\text{up}]$ and $q = \Pr[\text{down}]$ are identical for each state. Assuming $p \neq q$,

$\displaystyle \mathbb{E}_{k-1}(\tau_k) = \frac{1}{p-q} \left[ 1 - \left( \frac{q}{p} \right)^k\right]$ .

In particular, this means

$t_{mix} \leq 4\, \mathbb{E}_0(\tau_n) = \frac{4}{p-q} \left[ n - \frac{q \left( 1-(q/p)^n \right)}{p-q} \right]$ .

If $p = q$, however,

$\displaystyle \mathbb{E}_{k-1}(\tau_k) = \frac{k}{p}$ .

Doob Martingale. For any bounded function $f$ and a sequence of random variables $(X_i) = X_0, X_1, \cdots$, define $\overline{X}_k = X_0, \cdots, X_k$, $f_i := f(\overline{X}_i)$, and

$\displaystyle Z_0 = \mathop{E}_{ X_0, \cdots, X_n }[ f_n ]$ ,

$\displaystyle Z_n = f_n$ , and

$\displaystyle Z_i = \mathop{E}_{X_{i+1}, \cdots, X_n} [f_n \mid \overline{X}_{i}]$ .

Then, the sequence $(Z_i) = Z_0, Z_1, \cdots, Z_n$ is a martingale with respect to the sequence $(X_i)$, i.e.,

$\displaystyle \mathop{E}_{X_i, \cdots, X_n}[Z_i \mid \overline{X}_{i-1} ] = Z_{i-1}$ for every $i$ .

Proof: Using $A = \{X_0, \cdots, X_{i-1}\}$ and $B = \{X_0, \cdots, X_{i}\}$, and using $(\cdot)^c$ to denote complement in the set $\{X_0, \cdots, X_n\}$, the left hand side is

$\displaystyle \mathop{E}_{A^c} [Z_i \mid \underbrace{X_0, \cdots, X_{i-1} }_{A} ]$

$\displaystyle = \mathop{E}_{A^c} \big[ \underbrace{ \big( \mathop{E}_{B^c} [f_n \mid A, X_i ] \big) }_{Z_i} \mid A \big]$

using the def. of $Z_i$.

$\displaystyle = \mathop{E}_{A^c} \big[ \mathop{E}_{B^c} [f_n \mid B] \mid A \big]$    (since $A \subset B$) .

Using the tower property of conditional expectation, this quantity is nothing but

$E_{A^c}[f_n \mid A] = Z_{i-1}$ .

The Exponential Distribution. An exponential distribution with parameter $\lambda$ is the continuous analog of the geometric distribution.

Weak Law of Large Numbers. Let $(X_i)$ be a sequence of independent, mean zero random variables with the second moments bounded by a constant. Let $S_n$ denote the partial sum of first $n$ items. Then $S_n/n \rightarrow 0$ in probability as $n \rightarrow \infty$.

Cantelli’s Strong Law of Large Numbers. Assume the preceding scenario except that instead of second moments, the fourth moments are bounded by a constant. Then $S_n/n \rightarrow 0$ almost surely as $n \rightarrow \infty$.

Kolmogorov’s Law of Large Numbers. Let $(X_i)$ be a sequence of i.i.d. mean zero random variables. Let $S_n$ denote the partial sum of first $n$ items. Then $S_n/n \rightarrow 0$ almost surely as $n \rightarrow \infty$.

Kronecker’s Lemma. Let $(a_k)$ be a sequence of reals such that $\sum_k a_k/k$ converges. Let $S_n$ denote the partial sum of the first $n$ items. Then $S_n/n \rightarrow 0$ as $n \rightarrow \infty$.

## Combinatorics

Pigeonhole Principle.

• Let $n > m > 0$ be two integers. If $n$ balls are placed in $m$ bins, at least one bin receives two balls.
• If a quantity $r \in \mathbb{R}_+$ is distributed into $n \in \mathbb{Z}_+$ parts, then some part receives at least (at most) $\mu := r/n$.
• Let $D$ be a probability distribution on a set $S$. Then there must be some element with probability at least (at most) $1/\vert S \vert$.

Binomial Identities.

Triangle. $\displaystyle {n \choose k} + {n \choose k+1} = {n+1 \choose k+1}$ .

Hockey Stick. $\displaystyle \sum_{r=k}^n { r \choose k} = {n+1 \choose k+1}$ .
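Both identities are easy to spot-check with `math.comb`:

```python
from math import comb

# Pascal's triangle identity: C(n,k) + C(n,k+1) = C(n+1,k+1)
assert all(comb(n, k) + comb(n, k + 1) == comb(n + 1, k + 1)
           for n in range(20) for k in range(n))

# hockey stick: sum_{r=k}^{n} C(r,k) = C(n+1,k+1)
assert all(sum(comb(r, k) for r in range(k, n + 1)) == comb(n + 1, k + 1)
           for n in range(20) for k in range(n + 1))
```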

Probability Generating Functions. Let $A(X) = \sum_{t=0}^\infty{a_t X^t}$ be a formal power series with $a_t \geq 0$. If ${A}(1) = \sum_t{a_t} = 1$ then ${A}$ “generates” the probability distribution $A(t) := \Pr[A = t] = a_t$. Suppose ${B}$ generates the distribution $B(t)$. It is easy to see that the convolution $A \ast B$ generates the probability distribution $C = A + B$ whose mass function is

$\displaystyle C(t) := \sum_{t = x + y}{A(x)\, B(y)}$.
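A sketch of the convolution claim: the coefficient sequence of $A(X)B(X)$, i.e., $\sum_{x+y=t} A(x)B(y)$, is the mass function of the sum of independent draws. Using two fair dice (index $t$ = face value; the helper name `convolve` is mine):

```python
# pgf coefficients for one fair die: a_t = Pr[die shows t], t = 0..6
die = [0] + [1 / 6] * 6

def convolve(A, B):
    """Coefficients of A(X) * B(X): the distribution of the sum of independent draws."""
    C = [0.0] * (len(A) + len(B) - 1)
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            C[i + j] += a * b
    return C

two_dice = convolve(die, die)
assert abs(two_dice[7] - 6 / 36) < 1e-12   # Pr[sum of two dice = 7] = 6/36
assert abs(sum(two_dice) - 1) < 1e-12      # still a probability distribution
```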

Composition of Generating Functions. Let $a_n$ and $b_n$ denote the number of ways to build two structures, respectively of types 1 and 2, on a $n$-element set where $a_0 = 0$ and $b_0 = 1$. Let $c_n$ be the number of ways to divide the set $[n]$ into an unspecified number of intervals, build a type-1 structure on each interval, then build a type-2 structure on the set of intervals. Set $c_0 = 1$. Denote by ${A}(x), {B}(x), {C}(x)$ the generating functions of the sequences $\{a_n\}, \{b_n\}, \{c_n\}$. Then

${C}(x) = {B}\left({A}(x) \right).$

We can expand $A(C(Z))$ as

$\displaystyle \sum_{n\geq 0}{a_n C(Z)^n} = \sum_{m\geq 0}{f(m, A, C) Z^m}$ where

$\displaystyle f(m, A, C) = \sum_{n\geq 0}{ a_n \sum_{\substack{\lambda_1, \cdots, \lambda_n \geq 0 \\ \lambda_1 + \cdots + \lambda_n = m}}{\,\prod_{i=1}^n{c_{\lambda_i}}} }$ .

If $A, C$ are probability generating functions, we can interpret $A(C(Z))$ as the generating function for the sum of $p$ independent draws from $C$ where $p$ is distributed according to $A$.

Sum of first binomial coefficients. Suppose $k \leq n/2$. Then,

$\displaystyle \sum_{i=0}^k {n \choose i} \leq {n \choose k} \frac{n - k+1}{n - 2k + 1}$ .

This bound is due to Michael Lugo. (Aside: This implies, if $k \leq n/3$, the sum is at most $2{n \choose k}$ .)

The above bound is within a factor of $1.5$ as long as $k \leq 0.49 n$ and $n \approx 5000$. When $k \rightarrow n/2$, this multiplicative factor increases.

The following bound is due to Lovasz. It works better when $k \rightarrow n/2$; it stays within a factor of $2$ for $n \approx 5000$.

$\displaystyle \sum_{i=0}^{k-1} {n \choose i} \leq 2^{n-1} \exp\left(-\frac{(n - 2k - 2)^2}{4(n - k - 1)} \right)$ .

Sigma Field. A $\sigma$-field $(\Omega, \mathbb{F})$ consists of a sample space $\Omega$ and a collection of subsets $\mathbb{F}$ satisfying the following conditions:

1. $\emptyset \in \mathbb{F}$ .
2. $\mathcal{E} \in \mathbb{F} \implies \overline{\mathcal{E}} \in \mathbb{F}$ .
3. $\mathcal{E}_1, \mathcal{E}_2, \cdots \in \mathbb{F} \implies \mathcal{E}_1 \cup \mathcal{E}_2 \cup \cdots \in \mathbb{F}$ .

The last condition states closure under countable union; coupled with the second condition (via De Morgan’s laws), it also implies closure under countable intersection.

Ramsey’s Theorem. Let $\displaystyle r \geq {n+m-2 \choose n-1}$. Every edge-two-coloring of $K_r$ induces a monochromatic subgraph which is either a $K_n$ or a $K_m$.

Generally, given an integer $m$ and a tuple $N = (n_1, \cdots, n_m)$, there exists a number $R = R(N; m)$—called the Ramsey number—such that for every $r \geq R$ and every edge-$m$-coloring of $K_r$, there will be a $K_{n_i}$ monochromatic in color $i$ for some $i \in \{1, \cdots, m\}$.

Generalized Arithmetic Progression. Let $Z$ be an additive group, i.e., an abelian group with the operation $+$. Given a vector $N = (N_1, \cdots, N_d)$ of positive integers, let $S = [0,N] = \{(n_1, \cdots, n_d) \in \mathbb{Z}^d : 0 \leq n_i \leq N_i\}$.

The set

$\displaystyle P = P(a, v, Z, d, N) = a + S \cdot v = \left\{ a + \sum_{i=1}^d{n_i v_i} : 0 \leq n_i \leq N_i \right\}$

is called a (generalized) arithmetic progression in $Z$ where

• $d \in \mathbb{Z}_+$ is the rank of $P$
• $N = (N_1, \cdots, N_d) \in \mathbb{Z}_+^d$ is the dimension of $P$,
• $a \in Z$ is the starting point of $P$,
• $v = (v_1, \cdots, v_d) \in Z^d$ are the basis vectors of $P$,
• $\displaystyle vol(P) := \prod_{i \leq d}{(N_i + 1)}$ is the volume of $P$,
• the function $\cdot$ is the usual dot product.

The progression is proper if the map $n \mapsto n \cdot v$ is injective (i.e., one-to-one) on $S = [0, N]$. Equivalently, $vol(P) = \# P$. An improper progression may result if the basis vectors are linearly dependent over $\mathbb{Z}$. The progression $P$ is symmetric if $P = -P$.

Hales-Jewett Theorem. Let $S$ be an alphabet of cardinality $n$. For a “sufficiently large” dimension $d = d(n, k)$ and any $k$-partition of the words $S^d$, there must be a part that contains a “combinatorial line.”

In an arithmetic setting, suppose $k \geq 1, n \geq 1$, and $S = \{0, 1, \cdots, n-1\}$. Then there is an integer $d = d(n, k)$ such that if $S^d \subset \mathbb{Z}^d$ is partitioned into $k$ non-empty sets, then at least one of these parts must contain a proper arithmetic progression of the form

$\displaystyle a + S \cdot v = \{a, a + v, a + 2v, \cdots, a + (n-1) v\}$

where $a \in S^d$ and $v \in \{0,1\}^d$. Like Szemeredi’s Theorem—the one-dimensional specialization of the Hales-Jewett theorem—the Hales-Jewett theorem has a density version.

Van der Waerden’s Theorem. Given integers $k, m \geq 1$, there exists an integer $N(m, k)$—called the van der Waerden number—such that if we color a proper arithmetic progression $P$ of length $N \geq N(m, k)$ using $m$ different colors, there will be a monochromatic proper sub-progression of length at least $k$. The most natural case is when $P = \{1, \cdots, N\}$.

Equivalently, in any partition of $\{1, \cdots, N\}$, $N \geq N(m, k)$, into $m$ subsets, at least one subset contains an arithmetic progression of length $k$.

Szemeredi’s Theorem. Informally, it says that dense subsets of integers contain long arithmetic progressions.

Let $\delta \in (0,1), k \in \mathbb{Z}_{\geq 1}$. Let $r_k(A)$ be the size of the largest subset of an additive set $A$ containing no arithmetic progression of length $k$. The following are equivalent:

1. Every “sufficiently large” set of positive integers must contain arithmetic progressions of length $k$ for every $k > 1$. Here, “sufficiently large” means having a positive upper density.
2. For every $\delta$ and $k$, if $N$ is large enough, every subset of $[N]$ of size at least $\delta N$ must contain a progression of length $k$.
3. $r_k([N]) = o(N)$.

Gowers (2001) proved that

$\displaystyle r_k(N) \leq \frac{N}{ \left( \log \log N\right)^{ 1/2^{2^{k + 9}} } }$.

Setting $\delta = 1/m$ in the second formulation, Szemeredi’s theorem implies van der Waerden’s theorem.

Roth’s Theorem. For any finite additive group $Z$ of odd order we have

$\displaystyle r_3(Z) = o_{\vert Z \vert \rightarrow \infty}(\vert Z \vert)$ .

Triangle Removal Lemma. If a graph has $o(n^3)$ triangles, it is possible to remove all triangles by removing only $o(n^2)$ edges.

Fibonacci Numbers Generating Function.

$\displaystyle x/(1 - x - x^2)$ .

Harmonic Numbers Generating Function.

$\displaystyle \frac{1}{1-x} \log \frac{1}{1-x}$ .

Binomial Coefficients Generating Function.

$\displaystyle \sum_k {n \choose k} x^k = (1+x)^n$ for fixed $n$.

$\displaystyle \sum_n{n \choose k} y^n = \frac{y^k}{(1-y)^{k+1}}$ for fixed $k$.

Bell Numbers Exponential Generating Function. The $n$th Bell number $b_n$ is the number of ways to partition $[n]$.

$\displaystyle B(x) = e^{e^x - 1}$ .

Catalan Numbers. The $n$th Catalan number can be interpreted as the number of ways $2n$ parentheses can be matched. The solution is

$\displaystyle c_n = {2n \choose n} - {2n \choose n+1} = {2n \choose n}/(n+1)$ .

They satisfy the recurrence relation

$\displaystyle c_0 = 1, \qquad c_{n+1} = \sum_{k=0}^n{ c_k c_{n-k}}$ .

This reveals that the generating function for Catalan numbers, $C(Z)$, must satisfy

$\displaystyle C(Z) = Z C(Z)^2 + 1$ .
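A quick sketch in Python checking the recurrence against the closed form:

```python
# Catalan numbers two ways: the closed form C(2n, n)/(n+1) and the
# convolution recurrence c_{n+1} = sum_k c_k c_{n-k}.
import math

def catalan_closed(n):
    # C(2n, n) - C(2n, n+1) = C(2n, n) / (n + 1); division is exact
    return math.comb(2 * n, n) // (n + 1)

def catalan_recurrence(N):
    c = [1]                                  # c_0 = 1
    for n in range(N):
        c.append(sum(c[k] * c[n - k] for k in range(n + 1)))
    return c

print(catalan_recurrence(6))                 # [1, 1, 2, 5, 14, 42, 132]
assert catalan_recurrence(6) == [catalan_closed(n) for n in range(7)]
```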

## Graph Theory

Independent Set and Vertex Cover. A largest independent set is the complement to a minimum vertex cover. The vertex cover problem is the dual of the maximum matching problem. In a bipartite graph, the size of the maximum matching equals the size of the minimum vertex cover (Konig’s theorem).

Crossing Number. $cr(G)$ is the minimum number of edge crossings in a plane drawing of a graph $G$. If $m > 7n$, then $cr(G) \geq m^3/39n^2$. This is the crossing number inequality.

Szemeredi-Trotter Theorem. Suppose we have $n$ points $P$ and $m$ lines $L$ in a two-dimensional plane. Consider any bipartite graph $G=(P,L,E)$ where an edge $e =(p,\ell)$ implies that the point $p$ is on line $\ell$. Then

$\displaystyle \vert E \vert = O\big( ( m^2n^2 )^{1/3} + m + n \big)$ .

Szemeredi Regularity Lemma. Informally,

• Given a positive constant $\epsilon$ and a graph $G$ with a “large enough” vertex set, one can find a “nearly-uniform” partition of the vertices, into “not too many” parts, so that “most” of the part-pairs are $\epsilon$-regular.
• Every graph has a constant size approximate representation where the size of the representation depends only on the quality of the approximation.
• All graphs are mostly composed of random-looking bipartite graphs.

More formally,

• We are given an $\epsilon$ and a positive integer $m \geq 1$ .
• The partition-size is $k, m \leq k \leq m + O_{\epsilon, m}(1)$ .
• At least a $1 - \epsilon$ fraction of the part-pairs are $\epsilon$-regular.

Here, a “nearly-uniform” partition is a partition where the part-sizes differ by at most one. The edge density of two disjoint vertex-sets $A, B$ is defined as

$\displaystyle d(A,B) := \frac{ \# \text{edges between }A\text{ and }B }{\vert A \vert \cdot \vert B \vert }$ .

The disjoint vertex-sets $A, B$ are $\epsilon$-regular if

$d(A^\prime, B^\prime) \in d(A,B) \pm \epsilon$

for all “large” subsets $A^\prime \subseteq A, B^\prime \subseteq B, \vert A^\prime \vert \geq \epsilon \vert A \vert , \vert B^\prime \vert \geq \epsilon \vert B \vert$.

Informally, a bipartite graph is $\epsilon$-regular if its edges are dispersed like the edges of a random graph.

## Algebraic Graph Theory

Expander Mixing Lemma. Let $G$ be a $d$-regular expander graph on $n$ vertices and $m$ edges. Suppose every eigenvalue of its adjacency matrix, other than the largest, is at most $\lambda$ in absolute value. For any pair of vertex-sets $A,B$, let $E_{A,B}$ be the set of “cross edges” between $A$ and $B$. Let $\#A$ denote the cardinality of the set $A$. Then, for all $A, B \subset V$,

$\displaystyle \left \vert \# E_{A,B} - \frac{d}{n} \cdot \# A \#B \right \vert \leq \lambda \sqrt{\# A \cdot \# B} < \lambda n\,.$

In the left-hand side, the first term is the number of cross edges. The second term is the expected number of cross edges in a random graph with edge-probability $d/n$. The left-hand side thus measures how much $G$ deviates from this expectation. The expander mixing lemma says that if $\lambda$ is small, $G$ will be nearly random in this sense.
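The lemma is easy to test numerically. The sketch below (assuming `numpy` is available) uses a 4-regular circulant graph, which is a mediocre expander, but the inequality must hold for it all the same:

```python
# Check the expander mixing lemma on a 4-regular circulant graph
# (vertex i adjacent to i±1 and i±2 mod n). "cross" counts ordered pairs
# (u, v) with u in S, v in T, and uv an edge; for disjoint S, T this is
# exactly the number of cross edges.
import random
import numpy as np

n = 20
A = np.zeros((n, n))
for i in range(n):
    for s in (1, 2):
        A[i][(i + s) % n] = A[i][(i - s) % n] = 1
d = int(A[0].sum())                                  # d = 4
lam = np.sort(np.abs(np.linalg.eigvalsh(A)))[-2]     # second-largest |eigenvalue|

rng = random.Random(0)
for _ in range(100):
    S = rng.sample(range(n), rng.randint(1, n))
    T = rng.sample(range(n), rng.randint(1, n))
    cross = sum(A[u][v] for u in S for v in T)
    assert abs(cross - d * len(S) * len(T) / n) <= lam * (len(S) * len(T)) ** 0.5 + 1e-9
```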

Random Walk on an Expander. Suppose $G$ is a $d$-regular expander whose second-largest eigenvalue is at most $\lambda$ in magnitude. Let $W$ be a set containing a $\rho$ fraction of vertices. The fraction of random walks of length $\ell$, on the vertices of $G$, that stay entirely within $W$, is small; in particular, it is at most

$\displaystyle \rho \left( \rho + (1 - \rho) \frac{\lambda}{d} \right)^\ell\,$.

Isoperimetric Number (Edge Expansion). For any vertex-subset $S$, its boundary is denoted by $\partial S = E_{S, \bar{S}}$. Its isoperimetric ratio is defined as

$\displaystyle \theta(S) := \frac{ \vert \partial S \vert}{\vert S \vert}$

The isoperimetric number of an $n$-vertex graph $G$ is the smallest $\theta(S)$ over all $S$ of size at most $n/2$.

One can show that

$\displaystyle \theta(S) \geq \lambda_2(1-s)$

where $s = \vert S \vert / \vert V \vert$ and $\lambda_2$ is the second smallest eigenvalue of the Laplacian of $G$.

Cheeger Inequality. Suppose $G$ is an $(n,d,\lambda_2)$-expander i.e., a $d$-regular graph on $n$ vertices with the second-largest eigenvalue of its adjacency matrix being $\lambda_2$. Since the largest eigenvalue is $d$, the quantity $d-\lambda_2$ is the spectral gap. Then

$\displaystyle \frac{d-\lambda_2}{2} \leq h(G) \leq \sqrt{ 2d (d-\lambda_2) }$.

Equivalently,

$\displaystyle \frac{h(G)^2}{2d} \leq d- \lambda_2 \leq 2 h(G)$.

The Laplacian Matrix. Let $A$ be the adjacency matrix of the graph $G=(V,E)$ on $n$ vertices and $m$ edges. Let $D$ be the diagonal matrix of vertex degrees. Then the matrix $L = D - A$ is called the (combinatorial) Laplacian matrix of $G$. Let the eigenvalues of $L$ be

$\displaystyle \sigma(L) = \{0 = \lambda_1 \leq \cdots \leq \lambda_n \}$.

Let the eigenvalues of $A$ be

$\displaystyle \sigma(A) = \{\mu_1 \geq \cdots \geq \mu_n \}$ (for a $d$-regular graph, $\mu_1 = d$).

The Laplacian matrix has some amazing properties.

• The rank of $L$ is $n - c$ where $c$ is the number of connected components of $G$. For a connected graph, $rank(L) = n - 1$.
• Since the rows of $L$ sum to zero, the all-ones vector is an eigenvector of $L$ with eigenvalue zero.
• Since $L$ is a real symmetric matrix, all eigenvalues are real.
• If $G$ is $k$-regular, $\lambda_i = k - \mu_i$ for $i \in [n]$.
• The diameter of $G$ is strictly less than the number of distinct eigenvalues of $L$.
• If we remove a cut-vertex from $G$, the size of the largest component is at least $\lambda_2 - 1$.
• Edge Addition. If we add an edge to $G$, the eigenvalues never decrease. In particular, $\lambda_i \leq \lambda^\prime_i$ for all $i \geq 2$ where $\lambda^\prime$ are the new eigenvalues.
• Let $v(G), e(G)$ be the vertex- and edge-connectivity of $G \neq K_n$, respectively. Then

$\displaystyle 2 e(G)(1-\cos(\pi/n)) \leq \lambda_2 \leq v(G)$.

• $L = \sum_e w_e L_e$, where $L_e = b_e b_e^T$ and $b_{(u,v)} = 1_u - 1_v$. We can think of each $L_e$ as a “baby Laplacian” whose two relevant diagonal entries are one and whose two off-diagonal entries are minus one.
• For any vector $x$,

$\displaystyle (Lx)_u = \sum_{v : (u,v)\in E}{ w_{(u,v)}\, (x_u - x_v) }$ .

• Let $x$ be the characteristic vector for a vertex-set $S$. If the graph is unweighted i.e., $w_e = 1$ for all $e\in E$, then

$\displaystyle x^T L x = \sum_e x^T b_e b_e^T x = \sum_{(u,v) \in E}{ (x_u - x_v)^2} = \vert \partial S \vert$

where $\partial S$ is the boundary of $S$, i.e., the set of edges with exactly one endpoint in $S$.

• One can show that $\displaystyle \theta(S) \geq \lambda_2(1-s)$ where $s = \vert S \vert / \vert V \vert$. This means a graph with a large $\lambda_2$ has high edge expansion, i.e., it is “well-connected.”
• If $G \preceq c\cdot H$ then $\lambda_k(G) \leq c \lambda_k(H)$ for all $k$.
• Let $G = (V, E, w)$ and $H=(V, E, z)$ be two weighted graphs that differ in only their weights. Then

$\displaystyle G \succeq \left( \min_{e \in E}{\frac{w(e)}{z(e)}} \right) H$ .

• For path graphs,

$\displaystyle L_{1,n} \preceq \left( \sum_{i=1}^{n-1}{\frac{1}{w_i}} \right) \sum_{i=1}^{n-1}{w_i L_{i, i+1}}$ .

• $\lambda_2(P_n) = \Theta(1/n^2)$, and $\lambda_2(T_n) \in \left[ \frac{1}{(n-1)\log_2 n},\ \frac{2}{n-1} \right]$ for the complete binary tree $T_n$.

The Spectrum of a Graph Product. Suppose $G$ has $m$ eigenvalue-eigenvector pairs $(\lambda_i, \alpha_i)$. Similarly, suppose $H$ has $n$ eigenvalue-eigenvector pairs $(\mu_j, \beta_j)$. Then, for each $i \in [m], j \in [n]$, the Cartesian product $G \times H$ will have an eigenvalue $\lambda_i + \mu_j$ associated with the eigenvector $\gamma_{i,j}$ defined as

$\displaystyle \gamma_{i,j}(u,v) = \alpha_i(u) \beta_j(v)$ for $u \in [m], v \in [n]$.

Hoffman’s Bound on the Size of an Independent Set. The size of an independent set of a $d$-regular graph on $n$ vertices is at most

$\displaystyle n \cdot (-\lambda_{min})/(d - \lambda_{min})$ .
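A minimal numerical check (assuming `numpy` is available), using the even cycle $C_{10}$: it is 2-regular with smallest adjacency eigenvalue $-2$, so the bound gives $10 \cdot 2/(2+2) = 5$, which is exactly the size of its maximum independent set (every other vertex):

```python
# Hoffman's bound evaluated on the even cycle C_10.
import numpy as np

n, d = 10, 2
A = np.zeros((n, n))
for i in range(n):
    A[i][(i + 1) % n] = A[i][(i - 1) % n] = 1    # cycle adjacency
lam_min = np.linalg.eigvalsh(A)[0]               # smallest eigenvalue, -2
bound = n * (-lam_min) / (d - lam_min)
print(round(bound, 6))   # 5.0, matching the alternating independent set
```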

## Linear Algebra

A complex matrix $A$ is unitary if $A^* = A^{-1}$. A real, unitary matrix is orthogonal. $A$ is Hermitian if $A^* = A$. It is normal if $A^*A = AA^*$.

Positive Semidefinite Matrices. A Hermitian matrix $A \in \mathbb{C}^{n \times n}$ is positive semidefinite if and only if any of the following holds:

• all its eigenvalues are non-negative; in particular, its determinant is non-negative. (Recall that the eigenvalues of a Hermitian matrix are real.)
• it is a Gram matrix; that is, there is a set of $n$ vectors $\{u_k\}$ such that $a_{ij} = \langle u_i, u_j \rangle$.
• it can be written as $A = L L^*$.
• $x^* A x \geq 0$ for all $x \in \mathbb{C}^n$.
• all its principal minors are non-negative.

It is positive definite if and only if the strict versions hold: the eigenvalues are positive, the vectors $\{u_k\}$ are linearly independent, the form $x^* A y$ is an inner product, and the leading principal minors are all positive.

Cramer’s Rule. Suppose we want to solve a matrix-vector equation $Ax = b$ where $A$ is invertible. Let $A_i$ denote the matrix obtained by replacing the $i$th column of $A$ by $b$. Then,

$\displaystyle x_i = \frac{\text{det}(A_i)}{\text{det} (A)}\, .$
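A direct sketch of the rule (assuming `numpy` is available); fine for small systems, though `np.linalg.solve` is faster and numerically safer:

```python
# Solve Ax = b via Cramer's rule: x_i = det(A_i) / det(A), where A_i is A
# with its i-th column replaced by b.
import numpy as np

def cramer_solve(A, b):
    det_A = np.linalg.det(A)
    x = np.empty(len(b))
    for i in range(len(b)):
        A_i = A.copy()
        A_i[:, i] = b                 # replace the i-th column by b
        x[i] = np.linalg.det(A_i) / det_A
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
print(cramer_solve(A, b))             # matches np.linalg.solve(A, b)
```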

Spectral Decomposition.  A matrix $A$ is Hermitian i.e., $A = A^*$, if and only if $A$ has an orthonormal set of eigenvectors with real eigenvalues i.e., $A = \sum{\lambda_i v_i v_i^*}$ where $(v_i, \lambda_i)$ are an eigenvector-eigenvalue pair.

See here for a proof of the Hermitian case: orthogonal eigenvectors and real eigenvalues.

Normal Matrices. A matrix $A$ is normal if it commutes with its conjugate transpose, i.e., $A A^* = A^* A$. This happens if and only if $A$ is unitarily diagonalizable, i.e., there exists a diagonal $D$ and unitary $U$ such that $A = U D U^*$.

A block upper-triangular matrix is normal if and only if all on-diagonal blocks are normal and all off-diagonal blocks are zero.

Polar Decomposition. Every matrix $A$ can be written as $A = UR = LU$ where $U$ is unitary, $R,L$ are positive semidefinite, and $L = \sqrt{AA^*}, R = \sqrt{A^* A}$. Moreover, if $A$ is invertible then $U$ is unique.

Singular Value Decomposition. Every square matrix $A$ can be written as $A = U D V$ where $U,V$ are unitary and $D$ has non-negative real diagonal entries.

Proof: By polar decomposition, $A = WR$ where $W$ is unitary and $R$ is positive semidefinite. By the spectral theorem, $R = T D T^*$ for some unitary $T$ and real diagonal $D$. Moreover, since $R$ is PSD, its eigenvalues are non-negative; hence $D$ has non-negative real entries. We set $U := WT, V := T^*$.
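The proof can be run in reverse numerically: starting from the SVD $A = U\,\mathrm{diag}(s)\,V^*$, the polar factors are $W = UV^*$ (unitary) and $R = V\,\mathrm{diag}(s)\,V^*$ (positive semidefinite). A sketch assuming `numpy` is available:

```python
# Recover the polar decomposition A = WR from the SVD of a random matrix.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
U, s, Vh = np.linalg.svd(A)
W = U @ Vh                               # unitary factor
R = Vh.conj().T @ np.diag(s) @ Vh        # PSD factor, equals sqrt(A* A)

assert np.allclose(A, W @ R)
assert np.allclose(W @ W.conj().T, np.eye(4))     # W is unitary
assert np.min(np.linalg.eigvalsh(R)) >= -1e-9     # R is PSD
```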

Cholesky Factorization. Every positive semidefinite complex matrix $A$ can be factored into $A = L L^*$ where $L$ is lower-triangular with real, non-negative diagonal entries (positive when $A$ is positive definite).

Schur Decomposition. Any complex square matrix $A$ can be written as $A = U T U^*$ where $U$ is unitary and $T$ is upper triangular.

Square Root. Every positive semidefinite matrix $A$ (with bounded largest eigenvalue) can be factored into $A = B^* B$ where $B$ is also positive semidefinite.

Hilbert-Schmidt Norm and Inner Product.

Let $a_i$ denote the $i$-th column of the matrix $A$. Then,

$\displaystyle \langle A, B \rangle_{HS} := \mathrm{tr}(A^* B) = \sum_{i}{\langle a_i, b_i \rangle}$.

$\displaystyle \Vert A \Vert_{HS}^2 = \Vert A \Vert_{F}^2 := \mathrm{tr}(A^*A)$.

Trace.

The trace of a matrix, i.e., the sum of its diagonal entries, equals the sum of its eigenvalues.

$\displaystyle \mathrm{tr}(AB) = \mathrm{tr}(BA)$.

$\displaystyle \mathrm{tr}(A x x^*) = \mathrm{tr}(x^* A x) = \langle x \vert A \vert x \rangle$.

Projection Matrices. Any orthogonal projection matrix $P$ can be written as $P = \sum{x x^*}$ where $\{x\}$ is an orthonormal basis of the target subspace. $P$ has eigenvalues in $\{0,1\}$. $P$ is Hermitian i.e. $P^* = P$, as evidenced by the sum formulation. $P$ satisfies $P^2 = P, P^*P = P$.

Frobenius Norm. A generalization of the usual $\ell_2$ norm for matrices.

$\displaystyle \Vert A \Vert_F^2 = \sum_{i,j}{\vert a_{ij} \vert^2} = \mathrm{tr}(A^*A)$.

This norm, like the $\ell_2$ norm, is invariant under rotation.

Tensor Inner Product. The tensor inner product on the space $V \otimes W$ is defined as follows:

$\displaystyle \left\langle \sum_i{\alpha_i v_i \otimes w_i} \bigg\vert \sum_j{\beta_j v^\prime_j \otimes w^\prime_j} \right\rangle \quad := \quad \sum_{ij}{ \alpha_i^* \beta_j \langle v_i \vert v^\prime_j \rangle \langle w_i \vert w^\prime_j \rangle }$ ,

where $\alpha_i, \beta_j$ are scalars and $v_i, v^\prime_j \in V$, and $w_i, w^\prime_j \in W$.

Conjugate Gradient Method (for Solving Symmetric Linear Systems). Suppose we would like to solve $Ax = b$ for $x$ where $A$ is a real symmetric positive definite matrix. The matrix $A$ defines the inner product $\langle x, y \rangle_A = \langle x \vert A \vert y \rangle$. We can find a set of basis vectors $P = \{p_1, \cdots, p_n\}$ spanning $\mathbb{R}^n$ that are mutually orthogonal according to this inner product. Then, the solution $x^*$ can be written as a linear combination $\{ \alpha_i \}$ of these basis vectors.

In practice, the vectors $\{p_i\}$ and corresponding $\alpha_i$ can be found using a process similar to the Gram-Schmidt orthonormalization. Any subset of these basis vectors gives an approximation of $x^*$.
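A bare-bones sketch of the textbook algorithm (no preconditioning, assuming `numpy` is available); the direction update keeps each new $p$ $A$-orthogonal to the previous directions:

```python
# Conjugate gradient for a symmetric positive definite system Ax = b.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x                        # residual
    p = r.copy()                         # first search direction
    rs = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs / (p @ Ap)            # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol ** 2:
            break
        p = r + (rs_new / rs) * p        # next A-orthogonal direction
        rs = rs_new
    return x

M = np.random.default_rng(1).standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)              # symmetric positive definite
b = np.ones(5)
x = conjugate_gradient(A, b)
assert np.allclose(A @ x, b)
```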

Facts about the Matrix Exponential. If $X,Y$ commute, then $e^X e^Y = e^{X+Y}$. In addition, if $X,Y$ commute then $e^X, e^Y$ commute, as do $\log X, \log Y$.

Moreover, for any invertible $Y$, $\displaystyle e^{Y X Y^{-1}} = Y e^X Y^{-1}$. Next, $det(e^X) = e^{tr(X)}$. If $D=diag(a,b,\cdots)$ then $e^D = diag(e^a, e^b, \cdots)$.
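The identity $det(e^X) = e^{tr(X)}$ is easy to check numerically; the sketch below (assuming `numpy` is available) builds $e^X$ from the eigendecomposition of a symmetric matrix:

```python
# Verify det(exp(X)) = exp(tr(X)) for a random symmetric X, computing the
# matrix exponential via the eigendecomposition X = V diag(w) V^T.
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
X = (M + M.T) / 2                        # symmetric, so eigh applies
w, V = np.linalg.eigh(X)
expX = V @ np.diag(np.exp(w)) @ V.T      # matrix exponential of X

assert np.isclose(np.linalg.det(expX), np.exp(np.trace(X)))
```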

Golden-Thompson inequality. For any Hermitian matrices $X,Y$,

$\displaystyle tr(e^{X+Y}) \leq tr(e^X e^Y) = tr\left(\sqrt{e^X}\, e^Y \sqrt{e^X}\right)$.

Lieb’s Theorem. Let $H$ be Hermitian and $A$ positive definite. The function

$\displaystyle A \mapsto \mathrm{tr} \exp \big[H + \log A \big]$

is concave.

Corollary (Tropp). Let $H$ be fixed, $X$ be random, both Hermitian. Then

$\displaystyle \mathbb{E} \mathrm{tr} \exp(H+X) \leq \mathrm{tr} \exp(H + \log \left[ \mathbb{E} e^X \right])$.

See Joel Tropp’s paper (Section 3.4).

Perron-Frobenius Theorem. Suppose $A$ is a real matrix with strictly positive entries. Let $r$ be its largest eigenvalue with corresponding eigenvector $v$.

• $r$ is a positive real with multiplicity one. Every other eigenvalue has a strictly smaller absolute value.
• $r$ is no smaller than the minimum row-sum of $A$ and no larger than the maximum row-sum (and likewise for column-sums).
• Every component of $v$ is a positive real. There is no other positive real eigenvector.

Gershgorin Disk Theorem. Let $A$ be a complex matrix, and let $D_{ii}$ be the closed disk centered at $a_{ii}$ with radius $r_{i} \triangleq \sum_{j \neq i} |a_{ij}|$. Then,

• Every eigenvalue of $A$ lies in at least one of the disks
• If the union of any $k$ disks is disjoint from the union of the remaining $n - k$ disks, then that union contains exactly $k$ of the eigenvalues.
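A quick numerical check of the first statement (assuming `numpy` is available):

```python
# Every eigenvalue of a random complex matrix lies in some Gershgorin disk.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
centers = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centers)   # off-diagonal row sums

for lam in np.linalg.eigvals(A):
    assert any(abs(lam - c) <= r + 1e-9 for c, r in zip(centers, radii))
```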

Weyl Inequalities.

Eigenvalues after a Hermitian Update. Let $A, B$ be order-$n$ Hermitian matrices with eigenvalues $\{\alpha_i\}, \{\beta_i\}$, respectively, arranged in increasing order, and let $\{\gamma_i\}$ be the eigenvalues of $A + B$, likewise arranged. Then

$\displaystyle \gamma_i \in \alpha_i + [\beta_1, \beta_n]$ .

Eigenvalues after a PSD Update. Let $H$ be Hermitian and $P$ be positive semidefinite. Then

$\lambda_i(H) \leq \lambda_i(H+P)$ for all $i$.

Eigenvalues after a Rank-one Update. Let $A$ be Hermitian and $B = A + zz^*$ where $z$ is a vector. Let $\{\alpha_i\}, \{\beta_i\}$ be their eigenvalues, arranged in increasing order. Then the two spectra interlace:

$\alpha_1 \leq \beta_1 \leq \alpha_2 \leq \beta_2 \leq \cdots \leq \alpha_n \leq \beta_n$ .
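A numerical check of the interlacing (assuming `numpy` is available): the update $zz^*$ is PSD, so no eigenvalue decreases, and each $\beta_i$ stays below $\alpha_{i+1}$:

```python
# Interlacing after a rank-one PSD update B = A + z z^T.
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                                 # random symmetric matrix
z = rng.standard_normal(6)
alpha = np.linalg.eigvalsh(A)                     # increasing order
beta = np.linalg.eigvalsh(A + np.outer(z, z))

assert np.all(alpha <= beta + 1e-9)               # eigenvalues can only grow
assert np.all(beta[:-1] <= alpha[1:] + 1e-9)      # beta_i <= alpha_{i+1}
```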

Eigenvalues after a Rank-$r$ Update. In general, if $rank(C) = r$ and $\{\gamma_i\}$ are the eigenvalues of $A+C$, then

$\displaystyle \gamma_i \leq \alpha_{i+r} \leq \gamma_{i+2r} \leq \alpha_{i+3r} \leq \cdots$ .

Additive Bounds. If $j, k \in [n]$ such that $j+k \geq n+1$, then

$\displaystyle \lambda_{(j+k) - n}(A+B) \leq \lambda_j(A) + \lambda_k(B)$ .

Eigenvalues by Bordering. Let

$\displaystyle B = \left[\begin{array}{c|c}A & y \\ \hline y^* & a \end{array}\right]$

where $A$ is Hermitian, $y$ an arbitrary complex vector, and $a$ an arbitrary real. Let $\{\alpha_i\}, \{\beta_i\}$ be their eigenvalues arranged in increasing order. Then

$\beta_1 \leq \alpha_1 \leq \beta_2 \leq \alpha_2 \leq \cdots \leq \alpha_n \leq \beta_{n+1}$ .

Eigenvalues of a Submatrix. Let $A_r$ be an order-$r$ principal submatrix of an order-$n$ Hermitian matrix $A$. Let $\{\rho_i\}, \{\alpha_i\}$ be their eigenvalues, arranged in increasing order. Then

$\displaystyle \alpha_i \leq \rho_i \leq \alpha_{i + n - r}$ for $i \in [r]$.

Majorizing Eigenvalues. The diagonal elements of a Hermitian matrix majorize its eigenvalues in the following sense (this is Schur’s theorem).

Given a set $A$, let $f_k(A)$ be the minimum of the sum of every $k$-subset of $A$. $A$ majorizes $B$ if $f_k(A) \geq f_k(B)$ for every $k \in [n]$ with equality for $k = n$.

Sylvester’s Law of Inertia. If $A$ is symmetric and $B$ is non-singular, then the matrix $BAB^T$ has the same number of positive, negative, and zero eigenvalues as those of $A$.

Rayleigh-Ritz Theorem. For an order-$n$ Hermitian matrix $A$ with eigenvalues $\lambda_1 \leq \cdots \leq \lambda_n$ and all $x \in \mathbb{C}^n$, let

$\displaystyle r_A(x) := \frac{x^* A x}{x^* x}$

be the Rayleigh-Ritz ratio. Then

$\displaystyle r_A(x) \in [\lambda_1, \lambda_n]$ ,

$\displaystyle \lambda_n = \max_{x^*x = 1}{x^* A x}$ , and

$\displaystyle \lambda_1 = \min_{x^*x = 1}{x^* A x}$ .

Courant-Fischer Theorem. The $k$th largest eigenvalue $\lambda_k$ of an order-$n$ Hermitian matrix $A$ is

$\displaystyle \lambda_k = \max_{S \subseteq \mathbb{C}^n,\ \dim(S) = k-1}\ \min_{x \perp S,\ x^*x = 1}\ x^*A x$ ,

$\displaystyle \lambda_k = \min_{T \subseteq \mathbb{C}^n,\ \dim(T) = n-k}\ \max_{x \perp T,\ x^*x = 1}\ x^*A x$ .

Let $v_k$ be the corresponding eigenvector and let $r_A(x)$ be the Rayleigh-Ritz coefficient for a vector $x$. Intuitively, the first statement says that

• $v_k$ is orthogonal to the span of the previous $k-1$ eigenvectors
• For every orthogonal decomposition $\mathbb{C}^n = S \oplus S^\perp$ where $dim(S) = k-1$, first we minimize $r_A(x)$ over $x \in S^\perp$; call this value $\nu(S)$
• Then, $\lambda_k$ is the largest value of $\nu(S)$ over all $(k-1)$-dimensional subspaces $S$, and $v_k$ is the vector achieving it
• Consequently, $\lambda_k$ is smaller than all subsequent eigenvalues and larger than all previous eigenvalues.

Since we can choose the $n-(k-1)$ dimensional subspace $S^\perp$ first, we have the second statement.

## Randomized Algorithms

Generating Pairwise Independent Random Variables. Choose a field $F$ of size $2^n$. Sample two uniformly random elements $x_1, x_2 \in F$ using $2n$ uniformly random bits. Then the following $2^n$ random variables are uniform over $F$ and pairwise independent:

$\displaystyle Y_i := x_1 + i \cdot x_2 \qquad \forall i \in \{0, \cdots, 2^n - 1\}$

See here for details.
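The construction can be verified exhaustively. The sketch below works over the prime field $\mathbb{Z}_7$ instead of $GF(2^n)$ to keep the arithmetic trivial (an assumed simplification; the construction is the same over any finite field): for $i \neq j$, the map $(x_1, x_2) \mapsto (Y_i, Y_j)$ is a bijection, so every value pair occurs exactly once.

```python
# Exhaustively verify pairwise independence of Y_i = x1 + i*x2 over Z_7.
from itertools import product
from collections import Counter

p = 7
for i, j in product(range(p), repeat=2):
    if i == j:
        continue
    counts = Counter()
    for x1, x2 in product(range(p), repeat=2):       # all p^2 seeds
        counts[((x1 + i * x2) % p, (x1 + j * x2) % p)] += 1
    # each value pair (Y_i, Y_j) comes from exactly one seed, so the pair
    # (Y_i, Y_j) is uniform on Z_p x Z_p: uniform and pairwise independent
    assert len(counts) == p * p and all(c == 1 for c in counts.values())
```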

Pairwise Independent Hash Functions. Let $H$ be a family of functions, each having $n$ input bits and $k$ output bits. Suppose the following holds: if we pick $h \in H$ at random, the probability that it maps two arbitrary but distinct input strings to any given pair of output strings is exactly $1/2^{2k}$. Then $H$ is a family of pairwise independent hash functions. In other words,

$\displaystyle \forall x \neq x^\prime \in \{0,1\}^n,\ \forall y, y^\prime \in \{0,1\}^k: \quad \Pr_{h \in_R H}[h(x)=y \wedge h(x^\prime) = y^\prime] = \Pr[h(x)=y] \Pr[h(x^\prime) = y^\prime] = 2^{-2k}$.

Equivalently, the probability that two arbitrary distinct strings map to any given pair of outputs under a random $h \in H$ is the square of the uniform probability on the output space. Intuitively, if we pick a random $h \in H$ then for arbitrary distinct $x, y$ in the domain, the strings $h(x), h(y)$ are independent and uniformly distributed.

Valiant-Vazirani Lemma. Let $\mathcal{H}$ be a collection of pairwise independent hash functions from $\{0,1\}^n \rightarrow \{0,1\}^k$. Let $S \subseteq \{0,1\}^n$ such that $\vert S \vert \in [K/4,K/2]$ where $K = 2^k$. Then, with constant probability, the string $0^k$ has a unique preimage in $S$ under a randomly selected $h \in \mathcal{H}$.

Leftover Hash Lemma. Suppose a random $n$-bit string $x$ has min-entropy at least $n-t$; concretely, suppose $x$ is chosen uniformly from an arbitrary subset $W \subseteq \{0,1\}^n$ with $\vert W \vert \geq 2^{n-t}$. The leftover hash lemma says that it is possible to extract nearly $n-t$ bits from $x$ that will be almost uniformly distributed. The way to do it is to use a uniformly random seed $s$ to choose a member $h_s : \{0,1\}^n \rightarrow \{0,1\}^k$ from a pairwise-independent hash family $H$. The distribution of the concatenated string $s \Vert h_s(x)$ will be at most $\epsilon$ away from the uniform distribution where $\epsilon = \sqrt{ 2^k / \vert W \vert }$.

Hitters. Fix a Boolean function $\sigma$ on $n$-bit inputs. Let $S$ be the set of inputs on which $\sigma$ outputs $1$, and suppose $\vert S \vert = \epsilon 2^n$. An efficient, randomized algorithm $A$, given only oracle access to $\sigma$, is a hitter for $\sigma$ if, with probability at least $1 - \delta$, it samples at most $m$ points from $\{0, 1\}^n$ before it “hits $S$,” that is, before it outputs an element of $S$. We can construct hitters, using random walks on expanders, which take $m = O(\epsilon^{-1} \log(1/\delta))$ samples and use only $n + O(1/\delta)$ random bits. In contrast, sampling uniformly at random requires $mn$ random bits. See the appendix of Goldreich’s survey for details.

## Deviation/Moment Inequalities

Markov’s Inequality. Any nonnegative $X$ with mean $\mu$ satisfies $\displaystyle \Pr[X \geq \alpha \mu ] \leq 1/\alpha$ for any positive $\alpha$.

Chebyshev’s Inequality. Any random variable $X$ with mean $\mu$ and variance $\sigma^2$ satisfies

$\displaystyle \Pr[ \vert X -\mu \vert > \alpha] \leq \frac{\sigma^2}{\alpha^2}$

for any positive $\alpha$. Setting $\alpha \leftarrow \alpha \sigma$, we get

$\displaystyle \Pr[ \vert X -\mu \vert > \alpha \sigma] \leq \frac{1}{\alpha^2}$ .

Note: When studying the sum $\sum X_i$ of a set of random variables, the Chernoff-Hoeffding inequality requires mutual independence of the random variables $\{X_i\}$ while for Chebyshev’s inequality, it suffices that they have only pairwise independence (so that the covariances are zero). See Nick Harvey’s notes here.

Mean-Median Inequality. The median of a random variable is within one standard deviation of its mean.

See here for a proof using Chebyshev’s inequality. See below for a similar statement about the mean-median proximity of 1-Lipschitz functions.

Paley-Zygmund Inequality. Let $Z$ be a non-negative random variable with a finite variance. If $\theta \in [0,1]$ then

$\displaystyle \Pr\bigg[ Z > \theta \mathbb{E} Z \bigg] \geq (1-\theta)^2 \frac{(\mathbb{E} Z)^2}{\mathbb{E} Z^2}$.

Note: The Paley-Zygmund inequality complements the Chebyshev inequality. While the latter says that the right tail is thin, the former says that the left tail is thin.

Hoeffding’s Tail Inequality, Additive Deviation, Symmetric. The probability that the sum of $n$ independent and identically distributed $[a,b]$-bounded random variables deviates at least $\lambda$ away from its expectation is no more than $\displaystyle \exp\left( - 2 \lambda^2 / n(a-b)^2\right)$.

Chernoff Bound, Multiplicative Deviation, Asymmetric. Let $S$ be the sum of many independent Bernoulli trials. Let $\mu = E[S]$ and let $\delta$ be any positive number between zero and one. The probability that $S$ is $\delta \mu$ below the mean is at most $\exp(-\delta^2\mu/2)$, while the probability that $S$ is $\delta\mu$ above the mean is at most $\exp(-\delta^2\mu/3)$.

(Hoeffding-like statement.) Equivalently, the probability that $S$ is $\lambda$ below the mean is at most $\exp(-\lambda^2/2\mu)$, while the probability that $S$ is $\lambda$ above the mean is at most $\exp(-\lambda^2/3\mu)$.

Note. When $\lambda$ is large, the normal approximation of the binomial distribution is less accurate for the positive tail. See the appendix of Alon-Spencer for a commentary.

Above are the plots of the Chernoff-tail and the Hoeffding-tail for both 0-1 and $\pm 1$ random variables. The plotted quantity is the bound for the probability that the sum deviates $t n$ away from the expectation, for $t \in [-0.1, 0.1]$ and $n=1000$.
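The multiplicative bounds can be checked against the exact binomial tails with `math.comb`; a sketch for $S \sim \mathrm{Bin}(n, 1/2)$:

```python
# Compare the Chernoff bounds with the exact tails of Bin(200, 1/2).
import math

n, p = 200, 0.5
mu = n * p
pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

for delta in (0.05, 0.1, 0.2, 0.5):
    upper = sum(pmf[k] for k in range(n + 1) if k >= (1 + delta) * mu)
    lower = sum(pmf[k] for k in range(n + 1) if k <= (1 - delta) * mu)
    assert upper <= math.exp(-delta**2 * mu / 3)   # upper-tail bound
    assert lower <= math.exp(-delta**2 * mu / 2)   # lower-tail bound
```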

Concentration for Subgaussian Random Variables. Let $R$ be a random variable with mean zero, variance one, and a subgaussian tail, i.e., $\Pr[R \geq r] \leq e^{-a r^2}$ for some $a > 0$. Let $x \in \mathbb{R}^n\,, \Vert x \Vert_2 = 1$ and define $Y = \sum_{i=1}^n x_i R_i$ where each $R_i$ is an independent copy of $R$. Then $Y$ has mean zero, variance one, and a subgaussian tail.

Bernstein Inequality. Let $X_1, \ldots, X_n$ be independent random variables where $|X_i| < M$ almost surely. Let $S = \sum X_i$ with mean $\mu$ and variance $\sigma^2$. Then for all $\lambda > 0$,

$\displaystyle \Pr[S - \mu > \lambda]\, \leq \exp\left( - \dfrac{\lambda^2}{2(\sigma^2 + M\lambda/3)}\right) \,.$

Rudelson-Vershynin Theorem on the Operator Norm of Random Projections.  Let $x_1, \cdots, x_n$ be $n$ independent copies of a random vector $x \in \mathbf{R}^d$ with $\mathbf{E}\, x x^T = I$ and magnitude $\Vert x \Vert_2$ at most $R$. Let $\mu := (\sum_i{x_ix_i^T})/n$. For any $\lambda \in(0,1)$,

$\displaystyle \Pr\left[ \left\Vert \mu - I \right\Vert_2 > \lambda \right] < 2d \exp \left( -n\lambda^2/4R^2 \right).$

Note: We can think of $xx^T$ as the projection operator along $x$ scaled by $\Vert x \Vert_2^2$. Then $\mu$ acts as the mean of these scaled projection operators. The theorem says that the largest distortion inflicted by $\mu$ on any vector is no more than $\lambda$ with high probability. Notice also the similarity with Hoeffding’s tail inequality, which would give the exponent as $-2(n\lambda)^2/nR^2=-2n\lambda^2/R^2$ for the one-dimensional case. We lost a factor of 8 somewhere.

Azuma-Hoeffding Inequality for Martingales. Let $X_0, \cdots, X_n$ be real-valued random variables with the (super)martingale property, i.e., $\mathbf{E} \left[ X_{t+1} \mid X_0, \cdots, X_t \right] \leq X_t$, and the bounded-difference property, i.e., $\vert X_t-X_{t-1} \vert \leq c_t$ where $c_t$ may depend on $t$. Then, for all $n \geq 0$ and $\lambda > 0$,

$\displaystyle \Pr[ \vert X_n - X_0 \vert \geq \lambda ] \leq 2 \exp \left( -\frac{\lambda^2}{2 \sum_t{c_t^2} } \right)$ .

In particular, if $c_t = c$ for all $t$,

$\displaystyle \Pr[\vert X_n - X_0 \vert \geq \lambda c \sqrt{n} ] \leq 2 e^{-\lambda^2/2}$ .

The one-sided statement, i.e., for non-absolute difference, is as you have hoped. See Motwani-Raghavan Chapter 4, Theorem 4.16 for details.

Khintchine Inequality. Let $\displaystyle \varepsilon$ be uniformly random in $\displaystyle \{\pm 1 \}^n$ and let $\displaystyle x \in \mathbb{C}^n$ be arbitrary. For any bounded $p > 0$, let $\Vert x \Vert_p^p := \sum_i{ \vert x_i \vert }^p$. Let $m_p(Z)$ be the $p$th moment of $Z$. Define $X := \sum_i{\varepsilon_i x_i}$.  Then,

$\displaystyle \bigg( m_p(\vert X \vert ) \bigg)^{1/p} = \Theta\bigg( \Vert x \Vert_2 \bigg)$,

where the constants hidden by the $\Theta$ notation depend only on $p$. In particular, $\dfrac{\Vert x \Vert_2}{\sqrt{3} } \leq \mathbb{E}\vert X \vert \leq \Vert x \Vert_2$.

Marcinkiewicz-Zygmund Inequality. Let $x_1, \cdots, x_n \in \mathbb{C}$ be independent random variables with zero mean and bounded $p$th moment for any $1 \leq p < \infty$. Let $m_p(Z)$ be the $p$th moment of $Z$.  Then,

$\displaystyle m_p\bigg(\sum_i{x_i} \bigg) = \Theta\bigg(m_p\left( \Vert x \Vert_2 \right) \bigg)$ ,

where the constants hidden under the $\Theta$ notation depend only on $p$.

Concentration for Lipschitz Functions. A Lipschitz function is highly concentrated around its mean. In particular, a multivariate real function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is $k$-Lipschitz if the function value changes by at most $k$ whenever the inputs differ in only one coordinate. Then

$\displaystyle \Pr[ | f - \mathbb{E}f | \geq t ]\, \leq 2 \exp\left(- \frac{2 t^2}{n k^2} \right)$ .

Note: A 1-Lipschitz function on the unit sphere is also concentrated around its median and the median is close to the mean. See Matousek’s book (Ch. 14.3) for details.

Concentration for Markov Chains. Let $M$ be an ergodic Markov chain with state space $[n]$ with

• $\epsilon$-mixing time $T$ for $\epsilon \leq 1/8$;
• stationary distribution $\pi$;
• initial distribution $\phi$.

Let

• $\|\,\|_\pi$ be the $\pi$-norm, i.e., $\displaystyle \|\phi\|_\pi \triangleq \sum_i\frac{\phi(i)^2}{\pi(i)}$ ;
• $f_i : [n] \rightarrow [0,1]$ be arbitrary functions with the same mean $\mu$; these functions assign values to the states;
• $(V_1, \cdots, V_N)$ be an $N$-step random walk on $M$; that is to say, the walk visits vertex $V_i$ at step $i$. Obviously, $V_1 \sim \phi$;
• Define $X \triangleq \sum_{i \leq N} f_i(V_i)$, and notice that $\mathbb{E} X = \mu N$.

Then $X$ is tightly concentrated around $\mu N$. In particular, there exists an absolute constant $c$ such that for all $\delta > 0$,

$\displaystyle \Pr[ X \geq (1+\delta)\mu N]\, \leq \begin{cases} c\, \|\phi\|_\pi\, \exp\left( - \frac{\delta^2 \mu N}{72 T}\right) & \quad\text{if}\quad 0 < \delta \leq 1, \\ c\, \|\phi\|_\pi\, \exp\left( - \frac{\delta \mu N}{72 T}\right) & \quad\text{if}\quad \delta > 1\, . \end{cases}$

$\displaystyle \Pr[X \leq (1-\delta)\mu N]\, \leq c\, \|\phi\|_\pi\, \exp\left( - \frac{\delta^2 \mu N}{72 T}\right) \qquad\text{if}\quad 0 < \delta \leq 1\, .$

## Complexity Theory

Goldreich-Levin Theorem. Every injective one-way function $f$ has a corresponding boolean function $b$ which is hard to predict; predicting $b$ is effectively inverting $f$, which is unlikely when $f$ is an injective one-way function.

Lipton’s Random Self-Reducibility Theorem. A problem $P$ is called “random self-reducible” if you can solve $P$ on an arbitrary input by evaluating $P$ on a sequence of uniformly random inputs. Suppose you have a black-box to solve $P$ on a random input with high probability. Then you can construct a randomized algorithm that can solve $P$ on every input with high probability.

Yao’s Min-Max Lemma. You and I are playing a two-player game. You are the input player and I am the algorithm player. Your move is to select a distribution $\mathcal{D}$ over inputs $\mathcal{X} = \{x\}$. My move is to select a distribution $\mathcal{R}$ over algorithms $\mathcal{A} = \{A\}$. Both $\mathcal{A}, \mathcal{X}$ have finite size. Your goal is to make $cost(A,x)$ high, while my goal is to make it as low as possible. The best value for you is

$\displaystyle \max_{\mathcal{D}} \min_{A \in \mathcal{A}} \underset{x \sim \mathcal{D}}{\mathbb{E} } cost(A, x)$ ,

while the best value for me is

$\displaystyle \min_{ \mathcal{R} } \max_{x \in \mathcal{X}} \underset{A \sim \mathcal{R} }{\mathbb{E} } cost(A, x)$.

Yao’s min-max lemma says that the two values are equal.

BPP $\subseteq$ P/poly.

Proof: Suppose there is an algorithm $A^\prime$ which decides a language in BPP using a random string $r$ of length $poly(n)$. We can construct a majority-vote algorithm $A$ which uses a random string $R$ of length $poly(n)$ but the probability that it errs on a given input string of length $n$ is at most $1/e^n$, thanks to the Chernoff bound. For a fixed $R$, the probability that $A(\cdot, R)$ is incorrect on some input is at most $2^n/e^n < 1$ by the union bound. Then by the probabilistic method, there exists a random string $R^*$ so that $A(x, R^*)$ is correct on every input $x$.

Communication Complexity for Equality. In the simultaneous-message model with private randomness, each party must communicate at least $\Omega(\sqrt{k})$ bits to compute the Boolean predicate $f(x,y) : x \overset{?}{=} y$, where $x, y \in \{0,1\}^k$.

SAT is reducible to 3SAT. It suffices to formulate a 3-CNF formula equivalent to a clause with four variables, say $C = (a \vee b \vee c \vee d)$. We introduce a dummy variable $e$ and write $C^\prime = (a \vee b \vee e) \wedge (c \vee d \vee \bar{e})$. Clearly, if there is an assignment which makes $C$ true, we can choose an assignment to the variable $e$ so that both clauses of $C^\prime$ are true. On the other hand, if an assignment makes $C$ false, then at least one clause of $C^\prime$ will be false regardless of which value we choose for $e$.
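
The same trick extends to clauses of any width by chaining fresh variables. A minimal sketch (helper names are mine), with a brute-force check of equisatisfiability on toy inputs:

```python
from itertools import product

# Clause-splitting sketch: a clause (l1 v l2 v ... v lk) with k > 3
# literals becomes a chain of 3-literal clauses linked by fresh
# variables, generalizing the 4-literal example above. Literals are
# nonzero integers; -v denotes the negation of variable v.
def split_clause(clause, next_var):
    if len(clause) <= 3:
        return [list(clause)], next_var
    out = [[clause[0], clause[1], next_var]]
    cur, next_var = next_var, next_var + 1
    for lit in clause[2:-2]:
        out.append([-cur, lit, next_var])
        cur, next_var = next_var, next_var + 1
    out.append([-cur, clause[-2], clause[-1]])
    return out, next_var

# Brute-force satisfiability, for checking equisatisfiability on toy inputs.
def satisfiable(clauses, n_vars):
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] ^ (l < 0) for l in c) for c in clauses):
            return True
    return False
```

For the clause $(a \vee b \vee c \vee d)$ encoded as `[1, 2, 3, 4]`, splitting with fresh variable 5 yields exactly the two clauses from the note.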

Every $n$-variable Boolean Function has a CNF Formula of size $n 2^n$. Consider an $n$-variable Boolean function $f(x), x \in \{0,1\}^n$. For every assignment $a \in \{0,1\}^n$, we construct an “indicator clause” $C_a$ that is false for $a$ but true for every other assignment. For example, if $a = 010$, the clause $C_a = (x_1 \vee \bar{x_2} \vee x_3)$. Then, the desired CNF formula $C$ is the AND of all the indicator clauses for all “bad assignments” $a$ that make $f(a) = 0$. This formula outputs zero whenever $f(x) = 0$, and one otherwise. This uses at most $2^n$ clauses, each with $n-1$ OR symbols; these clauses are collected using $2^n - 1$ AND symbols. Thus the total number of symbols is at most $(n-1)2^n + (2^n - 1) = n 2^n - 1$.
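
A toy implementation of this indicator-clause construction (function names are mine):

```python
from itertools import product

# Toy version of the construction above: one "indicator clause" per
# assignment a with f(a) = 0; the clause is false exactly at a.
def truth_table_to_cnf(f, n):
    cnf = []
    for a in product([0, 1], repeat=n):
        if f(a) == 0:
            # literal x_i if a_i == 0, else its negation
            cnf.append([i + 1 if a[i] == 0 else -(i + 1) for i in range(n)])
    return cnf

def eval_cnf(cnf, bits):
    return all(any((bits[abs(l) - 1] == 1) == (l > 0) for l in c) for c in cnf)

def agrees_everywhere(f, n):
    cnf = truth_table_to_cnf(f, n)
    return all(eval_cnf(cnf, a) == bool(f(a)) for a in product([0, 1], repeat=n))
```

For instance, the 3-bit parity function produces four clauses (one per even-parity assignment), and the resulting CNF agrees with parity on all eight inputs.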

Impagliazzo’s Hardcore Lemma. If a boolean function is “hard to compute on average” by small circuits, then there exists a set of inputs on which the same function is “extremely hard to compute on average” by slightly smaller circuits.

Yao’s XOR Lemma. Suppose a function $f$ is $\delta$-hard for size $S$, i.e., every circuit of that size fails to compute $f$ on at least a $\delta$ fraction of inputs. Define $f^k(x_1, \cdots, x_k) = f(x_1) \oplus \cdots \oplus f(x_k)$. If $\epsilon > 2(1-\delta)^k$, then $f^k$ is $(1/2-\epsilon)$-hard for circuits of “smaller” size, i.e., size at most

$\displaystyle \frac{ S \epsilon^2 }{ 100 \log (1/\delta \epsilon) }$.

See here for a proof of the XOR lemma from the hardcore lemma.

## Cryptography

Shamir’s $(t,n)$-Threshold Secret Sharing. Suppose you have a secret $\sigma$. If you have a degree-$(t-1)$ polynomial $P(x) = \sum_i{a_i x^i}$ over any field $F$, you can set $a_0=\sigma$ and evaluate the polynomial at $n$ distinct nonzero points of $F$. Any $t$ of these evaluations suffice to reconstruct $P$ via Lagrange interpolation and thus recover $\sigma = a_0$. Any smaller subset of evaluations cannot reconstruct $P$, and thus gets no information about the secret $\sigma$.

Note. See the similarity with the Reed-Solomon code.
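
A minimal sketch over a prime field; the prime and the $(t,n)$ parameters below are illustrative choices, not part of the scheme:

```python
import random

# Toy Shamir (t, n) sharing over GF(P); the prime and all parameters
# are illustrative only.
P = 2 ** 31 - 1  # a Mersenne prime

def share(secret, t, n, seed=42):
    rng = random.Random(seed)
    # a_0 = secret; a_1, ..., a_{t-1} random
    coeffs = [secret % P] + [rng.randrange(P) for _ in range(t - 1)]
    # evaluate P(x) at the n distinct nonzero points x = 1, ..., n
    return [(i, sum(c * pow(i, j, P) for j, c in enumerate(coeffs)) % P)
            for i in range(1, n + 1)]

def reconstruct(points):
    # Lagrange interpolation at x = 0 recovers a_0, the secret.
    secret = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret
```

Any $t = 3$ of the $n = 5$ shares recover the secret; the modular inverse in the interpolation uses Fermat's little theorem.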

Feldman’s Verifiable $(t,n)$-Secret Sharing. Suppose you want to share a secret $\sigma \in \mathbb{Z}_p^*$.
Initialization: Make $p, g$ public where $g$ generates the multiplicative group $\mathbb{Z}_p^*$.

Share generation: Like Shamir’s secret-sharing scheme, choose a (random) polynomial $P(x) = \sum_j{a_j x^j}$ of degree $t-1$, set $a_0=\sigma$, the secret, and distribute shares $\sigma_i = P(i) \bmod p$ to each user $i \in [n]$.

Commitments: Create commitments $c_j = g^{a_j}$ to each coefficient of $P(x)$, and make them public.

Verification: The user $i$ can verify whether

$g^{P(i)} \overset{?}{=}\prod_j{c_j^{i^j}}\,$ because

$g^{P(i)} = g^{\sum_j{a_j i^j}} = \prod_j{g^{a_j i^j}} = \prod_j{c_j^{i^j}}.$

Reconstruction: Lagrange interpolation from any $t$ shares.

Discrete-Log Setting $(G,g,q,p)$. Let $g$ be a generator of a multiplicative group of prime order $q$. An easy way to get one is to select a prime $q$ such that $p = 2q+1$ is also prime. Then, $\mathbb{Z}_p^*$ has a subgroup $G$ of order $q$, and every element of $G$ except the identity is a generator. In particular, we can take $g$ to be a random non-identity element of $G$.

Discrete-Log Assumption. Given $g$ and $g^a$, it is hard to find $a$.

Decisional Diffie-Hellman (DDH) Assumption. Given $g, g^a, g^b$, it is hard to distinguish $g^{ab}$ from random.

Computational Diffie-Hellman (CDH) Assumption. Given $g, g^a, g^b$, it is hard to compute $g^{ab}$.

Bit Commitment from One-Way Permutations. Suppose you want to commit to a bit $b$. Initialization: Let $h$ be a hard-core predicate for a strong one-way permutation $f$. Commit: You sample a random string $r$ and send $\big( f(r), h(r) \oplus b \big)$ to the receiver. Reveal: You send $b$ and $r$.

This scheme is “computationally hiding” because to predict $b$ one has to predict the hard-core $h(r)$ from $f(r)$, which is “difficult” by definition. It is “perfectly binding” because $f$ is a permutation: there is no $r^\prime \neq r$ with $f(r^\prime) = f(r)$.

Bit Commitment from One-Way Functions (Naor). Suppose you want to commit to a bit $b$.
Initialization: Let $G : \{0,1\}^n \rightarrow \{0,1\}^{3n}$ be a PRG. The receiver sends you a random $3n$-bit string $r$.

Commit: You sample a random $n$-bit seed $s$. If $b = 0$ you send $G(s)$ to the receiver; otherwise, you send $G(s) \oplus r$.

Reveal: You send $b$ and $s$ to the receiver.

Recall that one-way functions imply PRGs. The definition of a PRG implies that this scheme is computationally hiding. To cheat, i.e., to open the same commitment as both $0$ and $1$, you would need seeds $s, s^\prime$ such that $G(s) = G(s^\prime) \oplus r$, i.e., $r = G(s) \oplus G(s^\prime)$. At most $2^{2n}$ of the $2^{3n}$ possible values of $r$ have this form, so the receiver’s random choice of $r$ defeats you with probability at least $1 - 2^{-n}$. Hence this scheme is statistically binding.

Pedersen’s Bit Commitment Scheme. Suppose a sender wants to commit to a bit $b$.
Initialization: Everyone agrees on a Discrete-Log setting $(G,g,q,p)$. The receiver sends a random $h \in G$.

Commit: The sender selects a random $r \in \mathbb{Z}_q$. The commitment is $c = g^r h^b \bmod p$.

Reveal: The sender sends $(r,b)$ to the receiver, and he verifies whether $c \overset{?}{=} g^r h^b \bmod p$.

The scheme is information-theoretically hiding; since $g^r$ is random, $c$ is also random regardless of $b$. The scheme is computationally binding: to cheat, one has to find two different opening pairs, say $(r_0, 0)$ and $(r_1, 1)$, such that $c_0 = c_1$ where $c_0 = g^{r_0} \bmod p$ and $c_1 = g^{r_1} h \bmod p$. Suppose you can extract such a pair $r_0, r_1$ from the commitments $c_0, c_1$ and the common knowledge $g, h$. Since $h = g^a$ for some $a \in \mathbb{Z}_q$, you can solve the equation $c_0 = c_1$, or $h g^{r_1} = g^{r_0}$, or $h=g^{r_0-r_1}$, or $a = r_0 - r_1 \bmod q$. But this contradicts the Discrete-Log assumption, since now you know $a$, the discrete log of $h$ with respect to $g$. See here for more schemes. Here’s the original paper.
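
A toy run of the commit/verify flow; the parameters $p = 23$, $q = 11$ and the generator below are illustrative only and far too small for real use:

```python
# Toy Pedersen commitments: q = 11, p = 23 = 2q + 1, and g = 4 generates
# the order-q subgroup of Z_p^* (all parameters illustrative only).
p, q = 23, 11
g = 4
h = pow(g, 7, p)  # the receiver's h = g^a; a = 7 stays unknown to the sender

def commit(b, r):
    # c = g^r h^b mod p, exactly as in the scheme above
    return pow(g, r, p) * pow(h, b, p) % p

def verify(c, b, r):
    return c == commit(b, r)
```

Opening with the wrong bit (or the wrong randomness) fails the check, while the committed value itself is a uniformly distributed group element.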

Chaum-Pedersen Discrete Log Proof-of-Knowledge Scheme.
Setup: Fix a multiplicative group $G$ of a prime order $q$.  Suppose you know some element $\alpha \in \mathbb{Z}_q$  and two elements $h_1, h_2 \in G$ such that $h_1 = g_1^\alpha$ and $h_2 = g_2^\alpha$ for two independently chosen generators $g_1, g_2$. Publish $g_1, h_1, g_2, h_2$.

Goal: you want to convince someone about your knowledge of $\alpha$ without revealing what $\alpha$ is.

Interaction: You select a random exponent $w \in \mathbb{Z}_q$ and give the verifier two elements $a_i = g_i^w$ from $G$ for $i \in [2]$. Then the verifier sends you a random $c \in \mathbb{Z}_q$ as a “challenge”. You then reply with a “response” $r = w - \alpha c \bmod q$.

Verification: The verifier then checks whether $a_i \overset{?}{=} g_i^rh_i^c$; indeed, $g_i^r h_i^c = g_i^{r+\alpha c} = g_i^w = a_i$.
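
A toy transcript in the order-$11$ subgroup of $\mathbb{Z}_{23}^*$ (all concrete numbers are illustrative):

```python
# Toy Chaum-Pedersen transcript (all concrete numbers illustrative).
p, q = 23, 11
g1, g2 = 4, 16                                  # two generators of the order-11 subgroup
alpha = 6                                       # the prover's secret
h1, h2 = pow(g1, alpha, p), pow(g2, alpha, p)   # published

def prove(w, c):
    # prover's commitments (a1, a2) for random w, and response r to challenge c
    a1, a2 = pow(g1, w, p), pow(g2, w, p)
    return a1, a2, (w - alpha * c) % q

def check(a1, a2, c, r):
    # verifier: a_i =? g_i^r h_i^c  (= g_i^(r + alpha*c) = g_i^w)
    return (a1 == pow(g1, r, p) * pow(h1, c, p) % p and
            a2 == pow(g2, r, p) * pow(h2, c, p) % p)
```

An honest transcript verifies; replaying the same response against a different challenge does not.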

Note: The Fiat-Shamir scheme turns a public-coin interactive proof-of-knowledge of the discrete log into a non-interactive proof-of-knowledge where the prover, instead of receiving random coins from the verifier, queries a random oracle to get the random bits himself. However, Goldwasser and Kalai have shown that if Fiat-Shamir is used to make digital signatures, substituting the random oracle with any concrete “hash function” can result in an insecure digital signature scheme.

RSA Encryption Scheme.
Setup: Select two large random primes $p,q$ of similar magnitudes and compute $N = pq$. Let $\phi(N) = \vert \mathbb{Z}_N^* \vert = (p-1)(q-1)$. Pick an element $e$ in the multiplicative group $\mathbb{Z}_{\phi(N)}^*$; that is, $e$ is coprime to, and hence invertible modulo, $\phi(N)$. Let $d$ be this inverse. $(N,e)$ is the public key, $d$ is the private key.

Encryption: message $m \bmod N$ is encrypted as $c = m^e \bmod N$.

Decryption: $c \mapsto c^d \bmod N = m^{ed} \bmod N = m \bmod N$ since $ed = 1 \bmod \phi(N)$.
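
A toy instantiation with the textbook-sized primes $p = 61$, $q = 53$ and $e = 17$ (illustrative only; real RSA needs large random primes and padding):

```python
from math import gcd

# Toy RSA; p = 61, q = 53, e = 17 are illustrative textbook parameters.
p_, q_ = 61, 53
N = p_ * q_                  # 3233
phi = (p_ - 1) * (q_ - 1)    # 3120
e = 17
assert gcd(e, phi) == 1      # e is invertible modulo phi(N)
d = pow(e, -1, phi)          # modular inverse; needs Python 3.8+

def encrypt(m):
    return pow(m, e, N)

def decrypt(c):
    return pow(c, d, N)
```

Encryption and decryption are both single modular exponentiations, and they invert each other because $ed \equiv 1 \bmod \phi(N)$.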

Diffie-Hellman Key Exchange. You and I publicly agree on a prime $p$ and a generator $g$ of the group $G=\mathbb{Z}_p^*$. I randomly pick a private exponent $a \in \mathbb{Z}_{p-1}$ and you randomly pick your private exponent $b \in \mathbb{Z}_{p-1}$. I send you $A = g^a \bmod p$ in public, while you send me $B = g^b \bmod p$ in public. I construct $S = B^a = g^{ab}$ while you construct $A^b = g^{ab}=S$. This $S$ is our shared secret assuming the discrete log problem is hard.

ElGamal Encryption Scheme. First, you and I publicly agree on a prime $p$ and a generator $g$ of the group $G=\mathbb{Z}_p^*$. Then we construct a shared secret $s$ via the Diffie-Hellman key exchange protocol. You encrypt a message $m \in G$ as $c = m\cdot s$. Since $G$ is a group, $s$ has an inverse $s^{-1}$, which I can compute. Finally, I can decrypt your ciphertext as $c \mapsto cs^{-1} = mss^{-1} = m$.

ElGamal Digital Signature Scheme.
Initialization: Select a large prime $p$ and a generator $g$ of the group $G = \mathbb{Z}_p^*$. Select a random secret element $x \in \{2, \cdots, p-3 \}$ and set $y := g^x$. Let $H$ be a cryptographically secure hash function.

Signing: To sign a message $m$, randomly select some $k \in \{2,\cdots, p-2\}$ such that $k$ is invertible modulo $p-1$. Let $r:=g^k \bmod p$. Compute an $s$ such that $H(m) \equiv xr + ks \bmod p-1$. If $s=0$, start over. The tuple $(r,s)$ is the signature of $m$.

Verification: Check that $0 < r < p$, $0 < s < p-1$, and $r^sy^r \equiv g^{H(m)} \bmod p$. For a correct signature, the LHS is $r^sy^r = g^{ks}g^{xr} = g^{H(m)}$.

Schnorr Digital Signature Scheme.
Setup: Everyone agrees on the Discrete-Log setting $(G,g,q,p)$. The signer’s secret signing key is some $x \in \mathbb{Z}_q$, and his public verification key is $v = g^x$. Everyone has access to a hash function $H$.

Signing: To sign the message $M$, the signer chooses a random $k \in \mathbb{Z}_q$, computes $r = g^k$, computes the hash $h = H(r \Vert M)$ where $\Vert$ denotes concatenation. The pair $(k - x h, h)$ is the signature for $M$, where, recall, $x$ is the secret signing key.

Verification: Given $M$ and a purported signature $(s, h^\prime)$, one computes $\displaystyle r^\prime = g^s v^{h^\prime} = g^{s + x h^\prime}$; and verifies that $\displaystyle h^\prime \overset{?}{=} H(r^\prime \Vert M)$.
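
A toy sketch with a small subgroup and SHA-256 reduced mod $q$; the parameters and the hash truncation are my illustrative choices:

```python
import hashlib

# Toy Schnorr signatures in the order-11 subgroup of Z_23^*; parameters
# and the hash truncation (SHA-256 reduced mod q) are illustrative only.
p, q, g = 23, 11, 4
x = 3                  # secret signing key
v = pow(g, x, p)       # public verification key

def H(r, msg):
    digest = hashlib.sha256(str(r).encode() + msg).digest()
    return int.from_bytes(digest, "big") % q

def sign(msg, k):
    r = pow(g, k, p)
    h = H(r, msg)
    return (k - x * h) % q, h

def verify(msg, sig):
    s, h = sig
    r = pow(g, s, p) * pow(v, h, p) % p   # = g^(s + x h) = g^k
    return H(r, msg) == h
```

The verifier never sees $k$ or $x$; it only re-derives $r^\prime = g^s v^h$ and re-hashes.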

Hard-Core Predicate. An easy-to-compute boolean predicate $b : \{0,1\}^n \rightarrow \{0,1\}$ is a hard-core for a function $f$ if no efficient algorithm can predict $b(x)$ with non-negligible success even if they know $f(x)$.

Weak One-Way Functions Imply Strong Ones. If $f$ is a weak one-way function, then $g(x_1, \cdots, x_m) = \big( f(x_1), \cdots, f(x_m) \big)$ is a strong one-way function for a suitably large $m = poly(n)$.

Every Strong One-Way Function has a Hard-Core Predicate. The Boolean inner product $x \odot r$ is a hard-core predicate of the function $g(x,r) := \big( f(x), r \big)$ where $f$ is a strong one-way function and $r$ is a uniformly random string of the same length as $x$.

Intuitively, you have to invert $f$ to predict the inner product.

Oblivious Transfer. The context is a 1-out-of-N multiplexer. The sender holds $N$ values $\{x_i\}$. The receiver holds an index $b$. The receiver wants to know $x_b$ without the sender knowing $b$. The sender wants that the receiver knows nothing other than $x_b$.

1-out-of-2 Oblivious Transfer via Public Key Encryption. This is a variant of Even-Goldreich-Lempel. The sender chooses a random $r$. The receiver chooses a valid key-pair $(sk_b, pk_b) \leftarrow (x, g^x)$ and sets the “orphan” public key $pk_{1-b} \leftarrow r / pk_b$. The idea is that the sender, upon receiving $pk_0, pk_1$, checks that $pk_0 \cdot pk_1 = r$; this forces one of the two public keys to a value for which the receiver does not know the secret key. Next, the sender encrypts two values: for $i \in \{0,1\}$, he sends $c_i \leftarrow Enc_{pk_i}(x_i)$ to the receiver. Then the receiver can find $x_b \leftarrow Dec_{sk_b}(c_b)$.

This gives computational security against the sender (breaking the encryption scheme) and the receiver as well (recovering the secret key from a given public key).

1-out-of-2 Oblivious Transfer via DDH Assumption. The sender and receiver agree on two generators $g_0, g_1$. The receiver chooses a random exponent $r$. He then sends the following to the sender:

$h_0 \leftarrow g_0^r \qquad h_1 \leftarrow g_1^{r+b}$ .

The sender samples four random exponents $a_0, a_1, b_0, b_1$. Then he creates two “encryptions” $c_i \leftarrow (A_i, B_i C_i)$ where

$A_i = g_0^{a_i} g_1^{b_i} \qquad B_i = h_0^{a_i}$ and

$C_0 = h_1^{b_0} x_0 \qquad C_1 = h_1^{b_1}g_1^{-b_1}x_1$ .

The receiver, upon receiving $c_b$, can recover

$x_b \leftarrow B_b C_b A_b^{-r}$ .

Note that if $b = 0$, then $B_0 C_0 A_0^{-r} = h_0^{a_0} \left( h_1^{b_0} x_0 \right) \left( g_0^{-r a_0} g_1^{-r b_0} \right)$ $=g_0^{r a_0} g_1^{r b_0} x_0 g_0^{-r a_0} g_1^{-r b_0} = x_0$ .

On the other hand, if $b = 1$, then $B_1 C_1 A_1^{-r} = h_0^{a_1} \left( h_1^{b_1} g_1^{-b_1} x_1 \right) \left( g_0^{-r a_1} g_1^{-r b_1} \right)$ $= g_0^{r a_1} \left( g_1^{(r+1)b_1} g_1^{-b_1} x_1 \right) \left( g_0^{-r a_1} g_1^{-r b_1} \right) = x_1$ .

This gives information-theoretic security against the receiver since all structure is lost in $c_{1-b}$. On the other hand, this gives computational security against the sender (i.e., under the DDH assumption), since the sender could learn $b$ from $h_1 = g_1^{r+b}$ only if he could compute $r$, the discrete log of $h_0 = g_0^r$.
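
The two cancellation computations above can be machine-checked in a toy subgroup; the parameters are illustrative, and `pow(x, -e, p)` needs Python 3.8+:

```python
# Machine-check of the two cancellations in the DDH-based OT, in the
# toy order-11 subgroup of Z_23^* (illustrative parameters only).
p, q = 23, 11
g0, g1 = 4, 16   # two generators of the subgroup

def ot(b, x0, x1, r, a0, a1, b0, b1):
    # receiver -> sender
    h0, h1 = pow(g0, r, p), pow(g1, r + b, p)
    # sender -> receiver: the pairs (A_i, B_i * C_i)
    A = [pow(g0, a0, p) * pow(g1, b0, p) % p,
         pow(g0, a1, p) * pow(g1, b1, p) % p]
    BC = [pow(h0, a0, p) * pow(h1, b0, p) * x0 % p,
          pow(h0, a1, p) * pow(h1, b1, p) * pow(g1, -b1, p) * x1 % p]
    # receiver recovers x_b = (B_b C_b) * A_b^(-r)
    return BC[b] * pow(A[b], -r, p) % p
```

By the algebra above, the recovered value equals $x_b$ exactly, whichever bit the receiver chose.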

## Distributed Computing

Byzantine Agreement.  BA can be achieved by a deterministic algorithm on $n \geq 3f + 1$ processors in a fully connected unauthenticated network in which $f$ processors are malicious; exactly $f + 1$ communication rounds are required [Pease, Shostak, and Lamport]. This result was proved under the assumption that the network topology is known a priori to each processor, i.e., the identifiers of the processors and of the links between processors are known.

In the authenticated setting, however, we can allow $f$ to be arbitrarily large. [Dolev-Strong]

Micali’s BA protocol works on a completely asynchronous network when at least $2/3$ of the users are honest, and halts after 9 rounds in expectation. It is player-replaceable in the sense that players can be removed/corrupted immediately after sending a message.

The Micali-Vaikuntanathan protocol works on a synchronous network, and allows the honest fraction to be as low as $\alpha = 1/2 + \delta, \delta> 0$. With probability $1 - \epsilon$, the number of rounds is at most

$\displaystyle \frac{\log 1/\epsilon}{\log \frac{2}{2-\alpha}}\, .$

The Impossibility of Byzantine Agreement. It is impossible to attain even a Boolean Byzantine Agreement among deterministic players in an asynchronous network if only one player is faulty [Fischer-Lynch-Paterson].

Moreover, BA is impossible in a partially/fully asynchronous setting, even with randomness or timeouts or authentication, if $f \geq n/3$ [Dwork-Lynch-Stockmeyer].

## Optimization

Newton’s Method. Suppose you want to find a root of a locally-convex function $f(x)$ in the vicinity of the initial point $x_0$. The root will be approached in a sequence of points $x_0, x_1, x_2, \cdots$. Suppose you already know $x_n$. You make a linear approximation of $f(x)$ at $x = x_n$ as follows: $y = f(x_n) + (x_{n+1} - x_n) f^\prime(x_n)$. Because you want to find a root of $f$, you set $y = 0$ and solve for

$\displaystyle x_{n+1} = x_{n} - \frac{f(x_n)}{f^\prime(x_n)}$ .

Because you are doing a linear approximation, the error term decreases quadratically as you approach the true root $x = \alpha$. In particular, Taylor expansion of $f$ around $x_n$ tells us that the approximation error is $(\alpha - x_n)^2 f^{\prime \prime}(x_n)/2! + O((\alpha - x_n)^3)$.
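
A minimal sketch, using $f(x) = x^2 - 2$ (my illustrative choice) so the iteration converges to $\sqrt{2}$:

```python
# Newton iteration sketch; f(x) = x^2 - 2 is an illustrative choice
# whose positive root is sqrt(2). Convergence is quadratic: the number
# of correct digits roughly doubles each step.
def newton(f, fprime, x0, steps):
    x = x0
    for _ in range(steps):
        x -= f(x) / fprime(x)   # x_{n+1} = x_n - f(x_n)/f'(x_n)
    return x

root = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0, 6)
```

Starting from $x_0 = 1$, six iterations already reach machine precision.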

## Series / Function Plots

Exponential Powers.

Taylor Series of a real function $f(x)$ around the point $a$.

$\displaystyle f(x) = \sum_{k=0}^{\infty} f^{(k)}(a) (x-a)^k / k! .$

Logarithms. By definition,

$\displaystyle \log t = \int_{1}^t\dfrac{1}{x}dx$ .

If $|x| < 1$, the Taylor series expansion gives

$\displaystyle \log (1-x) = -x -x^2/2-x^3/3-\cdots$
$\displaystyle \log(1+x) = x - x^2/2 + x^3/3 - \cdots$
$\displaystyle \frac{1}{2} \log\frac{1+x}{1-x} = x + x^3/3 + x^5/5 + \cdots$

Fractional Powers of a Fraction. Plots of:

• $\dfrac{1+x}{1-x}$ for $x \in (0,1)$ .
• $\dfrac{1-x}{1+x}$ for $x \in (0,1)$ .
• $\dfrac{x}{1+x}$ for $x \in (0,1)$ .
• $\dfrac{x}{1-x}$ for $x \in (0,1)$ .
• $\dfrac{\log{x}}{x}$ for $x \in [1,3]$ .
• $\dfrac{1}{(1-x)^k}$ for $k = 1, 1/2, 2, 3$ .
• $\dfrac{1}{1-x^k}$ for $k = 1, 1/2, 2, 3$ .
• $\dfrac{1}{(1-x)^2}$ vs. $\dfrac{x}{(1-x)^2}$ for $x \in [0, 1/2]$ .
• $\dfrac{\log(x)}{x}$ and $\dfrac{x}{\log(x)}$ .
• The growth of $\dfrac{x}{\log(x)}$ .
• The growth of $\dfrac{\log(x)}{x}$ vs. $\dfrac{\log \log(x)}{\log(x)}$ .

Binary Entropy.

## Some Series Sums

If $\alpha \in (0, 1)$ then

$\displaystyle \sum_{k\geq 0} k \alpha^k = \mathrm{Li}_{-1}(\alpha) = \frac{\alpha}{(1-\alpha)^2}$.
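
A quick numerical check at an arbitrary point, $\alpha = 0.3$ (my choice):

```python
# Numerical check of sum_{k >= 0} k * a^k = a / (1 - a)^2 at a = 0.3.
a = 0.3
partial = sum(k * a ** k for k in range(200))  # tail beyond k = 200 is negligible
closed = a / (1 - a) ** 2
```
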

## Inequalities

For a complex random variable $Z$,

$|\mathbb{E} Z| \leq \mathbb{E} |Z|$

where $|\cdot|$ denotes the modulus.

$(1-x)^{1-x} > \exp(-x+x^2/2)$ when $x \in (0, 1)$

The exponential function is steep above zero and flat below zero.

$e^x \geq 1+x$ for $x\in \mathbb{R}$.

Moreover, for any real $x \geq 0$,

$\displaystyle (1 - x) \leq e^{-x} \leq 1 - x + x^2/2$ .

Concave functions are subadditive. Suppose $f(x)$ is concave with $f(0) \geq 0$, and let $\alpha \in [0,1]$. Then

$f(\alpha x) \geq \alpha f(x)$ ,
$f(a)+f(b) \geq f(a+b)$ for $a, b \geq 0$ .

Cauchy-Schwarz Inequality. The magnitude of the inner product is at most the product of individual $\ell_2$-lengths.

$\displaystyle \vert \langle x \vert y \rangle \vert \leq \Vert x \Vert_2 \cdot \Vert y \Vert_2.$

Similarly, for two square-integrable functions $X,Y$, we have

$\displaystyle \mathbb{E} \vert XY \vert \leq \sqrt{\mathbb{E} X^2} \sqrt{\mathbb{E} Y^2}$

with equality if and only if either (1) one of $X, Y$ vanishes a.s., or (2) $X = cY$ for some constant $c$.

Proof sketch: Let $c > 0$. Since $2ab \leq a^2 + b^2$, we write $a = X, b = cY$ and take expectations on both sides to get $\mathbb{E} XY \leq (1/2c) \mathbb{E}X^2 + (c/2) \mathbb{E}Y^2$. The $c$ that minimizes the r.h.s. is $c = \sqrt{ \mathbb{E}X^2}/\sqrt{ \mathbb{E}Y^2}$; thus the claim follows. For non-vanishing $X, Y$, equality is possible iff $\mathbb{E}(X - cY)^2 = 0$, i.e., $X = cY$ a.s.

Holder’s Inequality. Let $1/p + 1/q = 1$ for real $p,q \geq 1$. Let $x, y \in \mathbb{C}^n$. Then

$\displaystyle \sum_k{ \vert x_k y_k \vert} \leq \Vert x \Vert_p \Vert y \Vert_q$ .

For random variables $A, B$,

$\displaystyle \mathbb{E}\vert AB \vert \leq \left(\mathbb{E} |A|^p\right)^{1/p} \left(\mathbb{E} |B|^q\right)^{1/q}$.

Lyapunov’s Inequality. For reals $q > p \geq 1$ and real functions $f$ with finite $q$th moment, define

$\displaystyle \Vert f \Vert_p \triangleq \left(\mathbb{E} [|f(x)|^p] \right)^{1/p}\,.$

Then $\Vert f \Vert_1 \leq \Vert f \Vert_p \leq \Vert f \Vert_q\,.$

$\ell_1$ vs. $\ell_2$ Norm Inequality. Let $x \in \mathbb{C}^n, p, q \in \mathbb{R}^+, 1/p + 1/q = 1$.

$\displaystyle \Vert x \Vert_1 \leq \sqrt{n} \Vert x \Vert_2$ using $y = 1^n$ in Cauchy-Schwartz inequality.

$\displaystyle \Vert x \Vert_1 \leq n^{1/q} \Vert x \Vert_p$ using $y = 1^n$ in Holder’s inequality.

AM-GM Inequality.  Let $\{w_i\}$ denote a convex combination i.e., $\sum{w_i}=1, w_i \geq 0$. Then

$\displaystyle \sum_i{w_i a_i} \geq \prod_i{a_i^{w_i}}$ .

Equality occurs iff $a_1 = a_2 = \cdots$ .

Power Mean Inequality. For a vector $x \in \mathbb{R}^n$ with non-negative entries and positive real $p$, define $\displaystyle \Vert x \Vert_p = \left(\sum_i{x_i^p}\right)^{1/p}$. For positive reals $p \geq q$,

$\displaystyle \left(\frac{\Vert x \Vert_p^p}{n}\right)^{1/p} \geq \left(\frac{\Vert x \Vert_q^q}{n}\right)^{1/q}$.

In other words, $\Vert x \Vert_p$ is “not too short:” for any $t > 0$,

$\displaystyle \Vert x \Vert_{p+t} < \Vert x \Vert_p < \Vert x \Vert_{p+t} \ n^{\frac{t}{p(p+t)}}$.

Chebyshev’s Order Inequality. Essentially, this gives a converse for the Cauchy-Schwarz inequality. Let $f, g$ be non-decreasing functions, $x_1, \ldots, x_n$ be a non-decreasing sequence, and $p_1, \ldots, p_n$ be probabilities (i.e., $p_i \geq 0$ and $\sum p_i = 1$). Then

$\displaystyle \left( \sum f(x_k) p_k \right) \left( \sum g(x_k) p_k \right) \leq \sum f(x_k) g(x_k) p_k\,.$

In other words,

$\displaystyle E[f(X)]\, E[g(X)] \leq E[f(X) g(X)]\,.$

That is, if random variables $Y,Z$ can be written as non-decreasing functions of another random variable $X$, then they are non-negatively correlated. (Recall that two variables are negatively correlated if the expectation of their product is smaller than the product of their expectations.)

Rearrangement Inequality. Let $(a_k)$ and $(b_k)$ be two real lists of size $n$ each, both sorted in non-decreasing order, and let $\sigma : [n] \rightarrow [n]$ be any permutation. Then

$\displaystyle \sum_k a_k b_{n-k+1} \leq \sum_k a_k b_{\sigma(k)} \leq \sum_k a_k b_k\, .$
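
A brute-force check on one small pair of sorted lists (the numbers are arbitrary):

```python
from itertools import permutations

# Brute-force check of the rearrangement inequality: over all
# permutations, the identity maximizes the sum and the reversal
# minimizes it (lists must be sorted in the same order).
a = [1, 3, 4, 7]
b = [2, 2, 5, 9]
all_sums = [sum(ai * b[s] for ai, s in zip(a, sigma))
            for sigma in permutations(range(len(a)))]
sorted_sum = sum(ai * bi for ai, bi in zip(a, b))
reversed_sum = sum(ai * bi for ai, bi in zip(a, reversed(b)))
```
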

Nesbitt’s Inequality. For any $a >0, b > 0, c > 0$, one has

$\displaystyle \dfrac{3}{2} \leq \frac{a}{b + c} + \frac{b}{c + a} + \frac{c}{a + b} \,.$

Kantorovich’s Inequality for Reciprocals. Let $\displaystyle 0 < m \leq x_1 \leq \ldots \leq x_n \leq M < \infty$ and let $\displaystyle p_1, \ldots, p_n$ be probabilities. Then

$\displaystyle \left( \sum p_k x_k \right) \left( \sum p_k/x_k \right) \leq \mu^2/\gamma^2\,,$

where $\displaystyle \mu = (m + M)/2$ and $\displaystyle \gamma = \sqrt{mM}$.

A Logarithmic Inequality. Let $f:x \mapsto \log \dfrac{1}{1-x}$. Then

$\displaystyle y f(x) \leq x f(y) \qquad 0 < x \leq y < 1$.

(Comment: Equivalently, $f(x)/x$ is non-decreasing, which is plausible since $f$ is convex with $f(0)=0$.)

Proof: Let $f$ be a convex function satisfying $f(a) = 0$. Let $a \leq x \leq b$ be points in the domain so that $x = \alpha a + \beta b$ for some $\alpha, \beta \geq 0, \alpha + \beta = 1$. By Jensen, we have $f(x) \leq \alpha f(a) + \beta f(b) = \beta f(b)$ since $f(a)=0$ by assumption.

Now consider the function $f : x \mapsto -\log(1-x)$ which is convex in $[0,1)$. For this function, $a=0$ i.e., $f(0)=0$. This means for any $x$, we are free to choose any $b \geq x$ so that $x = \alpha \cdot 0 + (x/b)\cdot b$. It follows that $f(x) \leq (x/b) f(b)$, or $b f(x) \leq x f(b)$. Writing $y \leftarrow b$ yields the above inequality.

Bernoulli Inequality. This arises from the binomial expansion. In particular, for any reals $q \geq 1$ and $p \leq 1$, we have $\displaystyle (1-p)^q \geq 1 - pq$ .

## Quantum Computing

State Vectors. $n$ qubits can specify $N=2^n$ indices, one for each coordinate of an $N$-dimensional state vector in $\mathbb{C}^N$. An orthonormal basis of an $n$-qubit system is the so-called “computational basis” $\{ \vert k \rangle\}, k \in \{0,1,2,\cdots, N-1\}$, where $\vert k \rangle$ is simply the standard basis vector $e_k \in \{0,1\}^N$.

Pauli Matrices/Gates. Let $\mathcal{P} = \{X,Y,Z\}$ where

$X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \qquad Y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}\qquad Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$.

For each $M \in \mathcal{P}$, it holds that

$\displaystyle M^* = M \text{ (Hermitian)},\qquad M^2 = I \text{ (involutory)}, \qquad M^* M = I\text{ (unitary)}$ .

$\mathrm{tr}(M) = 0$ and $\mathrm{det}(M) = -1$.

Their commutators have a nice structure:

$[X,Y] = 2iZ \qquad [Y,Z] = 2iX \qquad [Z,X] = 2iY$.

The eigenvalues of these matrices are $\lambda(M) = \pm 1$. The corresponding eigenvectors $v_M(\lambda)$ are:

$\displaystyle v_X(\lambda) = \begin{pmatrix} \lambda \\ 1 \end{pmatrix}/\sqrt{2}$ .

$\displaystyle v_Y(\lambda) = \begin{pmatrix} -\lambda i \\ 1 \end{pmatrix}/\sqrt{2}$ .

$\displaystyle v_Z(\lambda) = \begin{pmatrix} 1+\lambda \\ 1-\lambda \end{pmatrix}/2$ .
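
The identities above are easy to machine-check with plain $2 \times 2$ complex matrices, no external libraries needed:

```python
# Check the listed Pauli identities with plain 2x2 complex matrices.
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sub(A, B):
    return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]

def scale(c, A):
    return [[c * A[i][j] for j in range(2)] for i in range(2)]

def comm(A, B):
    # commutator [A, B] = AB - BA
    return sub(mul(A, B), mul(B, A))
```
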

Hadamard Gate. $\displaystyle H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} = (X+Z)/\sqrt{2}$.

$\displaystyle H : \vert b \rangle \mapsto \big( \vert 0 \rangle + (-1)^b \vert 1 \rangle \big)/\sqrt{2}\,, \qquad b \in \{0,1\}$.

Specifically, $\displaystyle H \vert 0 \rangle = \big(\vert 0 \rangle + \vert 1 \rangle \big)/\sqrt{2} = \big( \vert 0 \rangle + \exp(2 \pi i) \vert 1 \rangle \big)/\sqrt{2}$ . Notice that

$\displaystyle (-1)^b = \exp( \pi i b )= \exp( 2 \pi i b/2 ) = \exp( 2 \pi i 0.b )$

where we use the notation

$0.x_1x_2\cdots x_k = x/2^k$ where $x$ is the integer with binary expansion $x_1 x_2 \cdots x_k$, $x_i \in \{0,1\}$ .

Let $x = x_1 \cdots x_n \in \{0,1\}^n$, and $N = 2^n$. An $n$-qubit Hadamard gate is

$\displaystyle H^{\otimes n} : \vert x \rangle \quad \mapsto \quad \frac{1}{\sqrt{N}} \sum_{z \in \{0,1\}^n}{ (-1)^{x \odot z} \vert z \rangle }$ ,

where $\odot$ denotes the Boolean inner product of two strings. It can also be written as the tensor product

$\displaystyle H^{\otimes n} : \vert x \rangle \quad \mapsto \quad \frac{1}{\sqrt{N}} \bigotimes_{\ell=1}^{n}{ \big[ \vert 0 \rangle + \exp(2 \pi i x_\ell/2) \vert 1 \rangle \big] }$ .

Other Quantum Gates.

• Rotate by an angle of $2^{-k}\cdot 2\pi$: $\displaystyle \qquad R_{k} = \begin{bmatrix} 1 & 0\\ 0 & \exp(2\pi i 2^{-k}) \end{bmatrix}$.
• Phase gate i.e., Rotate by $90^\circ$: $\displaystyle \qquad S = R_{2} = \begin{bmatrix} 1 & 0\\ 0 & i \end{bmatrix}$.
• Axis-specific rotations: for $m \in \{x,y,z\}$ with corresponding $M \in \{X,Y,Z\}$, $\displaystyle R_m(\theta) = \exp(-i\theta M/2) = \cos(\theta/2)\, I - i \sin(\theta/2)\, M$.

Measurements. A measurement apparatus $M = \{ M_x \}$ satisfies the completeness relation, i.e., it is a decomposition of the identity $I = \sum_x{M_x^* M_x}$. The probability of observing the outcome $z$ after measuring $\vert \phi \rangle$ is

$\displaystyle p(z) = p(z; \vert \phi \rangle, M) = \langle M_z \phi \vert M_z \phi \rangle = \langle \phi \vert M_z^* M_z \vert \phi \rangle = \Vert M_z \vert \phi \rangle \Vert_2^2$.

The state of the system after the measurement is $\displaystyle \frac{M_z \vert \phi \rangle}{\Vert M_z \vert \phi \rangle \Vert_2} = \frac{M_z \vert \phi \rangle}{\sqrt{p(z)}}$.

Projective Measurements. The components of a projective measurement $M = \{M_x\}$ correspond to the spectral decomposition $\{(e_x, \lambda_x)\}_x$ in that each $M_x = e_x e_x^*$ is the projection onto the eigenspace spanned by $e_x$:

$\displaystyle M = \sum_x{\lambda_x M_x} = \sum_x{\lambda_x e_x e_x^*}$.

This implies that the probability of observing a vector $\vert z \rangle$ after measuring a quantum state $\vert \phi \rangle$ is the squared modulus of the (complex) inner product of these two vectors:

$\displaystyle p(z) = p(z; \vert \phi \rangle, M) = \langle \phi \vert M_z \vert \phi \rangle = \vert \langle \phi \vert e_z \rangle \vert^2$.

The expectation and variance of a projective measurement have a friendly form:

$\langle M \rangle_\phi := \mathbb{E}(M; \vert \phi \rangle) = \sum_x{x p(x ; \vert \phi \rangle)} = \langle \phi \vert M \vert \phi \rangle$.

$\displaystyle [\Delta(M)]^2 := \langle (M - \langle M \rangle)^2 \rangle = \langle M^2 \rangle - \langle M \rangle^2$,

where $\Delta(.)$ is the “standard deviation.”

The Heisenberg Uncertainty Principle. Let $C,D$ be two projective measurements. Then, with respect to any quantum state $\vert \phi \rangle$, the product of the standard deviations (across a large number of trials) is “large”:

$\displaystyle \Delta(C; \vert \phi \rangle) \Delta(D; \vert \phi \rangle) \geq \frac{\big\vert \langle \phi \vert [C, D] \vert \phi \rangle \big\vert}{2}$.

Entangled State. A quantum state of a $k$-qubit system is entangled if it cannot be written as a tensor product of single-qubit states. For example, suppose $k=2$, and let the $k$-bit integers $\{1,2,3,4\}$ index the $2^k$ basis states. The (unnormalized) state $\vert 1 \rangle + \vert 4 \rangle$, i.e., $\vert 00 \rangle + \vert 11 \rangle$, cannot be factored into a product of two one-qubit states; hence it is entangled.

The Bell Basis. Each of the following states is entangled. Together they form an orthonormal basis of the two-qubit space $\mathbb{C}^4$.

$\displaystyle \Phi^{\pm}:=\frac{1}{\sqrt{2}}\left(\vert 00 \rangle \pm \vert 11 \rangle\right)\,, \qquad \Psi^{\pm}:=\frac{1}{\sqrt{2}}\left(\vert 01 \rangle \pm \vert 10 \rangle \right)\, .$

Put another way, let $\vert \phi_{xy} \rangle$ denote the four Bell states indexed by the bits $x,y$. In particular, $y = 0$ corresponds to $\Phi^{\pm}$, and $y = 1$ corresponds to $\Psi^{\pm}$. Then,

$\displaystyle \vert \phi_{xy} \rangle = \big( \vert 0,y \rangle + (-1)^x \vert 1, 1\oplus y \rangle \big)/\sqrt{2}$

Bloch Sphere Representation. A single qubit $q := a \vert 0 \rangle + b \vert 1 \rangle$ can be represented by the tuple $(\theta, \phi)$, as follows: $q$ can be identified with a point $v$ on the 2-sphere in 3-dimensions, with the vector $\vert 0 \rangle$ at the North pole. Then $\theta$ is the angle of the vector $v$ with the $z$-axis, and $\phi$ is the angle between the $x$-axis and the projection of $v$ on the $xy$-plane. The Bloch vector for $v$ is

$\displaystyle v = (\sin \theta \cos \phi, \sin \theta \sin \phi, \cos \theta)$ .

This is equivalent to setting $\displaystyle a := \cos(\theta/2), b := \exp(i \phi) \sin(\theta/2)$ .
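
Equivalently, the Bloch vector is the vector of Pauli expectation values $(\langle X \rangle, \langle Y \rangle, \langle Z \rangle)$ in the state $q$. A small NumPy check, with arbitrarily chosen angles:

```python
import numpy as np

theta, phi = 1.1, 2.3  # arbitrary angles
a, b = np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)
psi = np.array([a, b])

# Pauli matrices
X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1, -1])

# Bloch vector = (<X>, <Y>, <Z>)
v = np.real([psi.conj() @ M @ psi for M in (X, Y, Z)])
expected = np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])
assert np.allclose(v, expected)
```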

Quantum Fourier Transform. Let $n \geq 1$, $N = 2^n$, and $j, k \in \{0, 1, \ldots, N-1\}$; let $i$ denote the imaginary unit. Let $\omega_j = \exp(2 \pi ij /N)$ be the $j$th power of the $N$th root of unity. The QFT is a linear operator that takes

$\displaystyle \vert j \rangle \quad \mapsto \quad \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} \omega_j^k \vert k \rangle$ .

The right hand side equals

$\displaystyle \frac{1}{\sqrt{N}} \bigotimes_{\ell = 1}^n {\big[ \vert 0 \rangle + \underbrace{e^{2\pi i j/2^\ell }}_{\omega_j^{N/2^\ell}} \vert 1 \rangle \big]}$ .
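
Both the unitarity of the QFT and the tensor-product factorization above are easy to check directly for small $n$:

```python
import numpy as np

n = 3
N = 2 ** n

# QFT matrix: |j> -> (1/sqrt(N)) sum_k exp(2*pi*i*j*k/N) |k>
F = np.array([[np.exp(2 * np.pi * 1j * j * k / N) for j in range(N)]
              for k in range(N)]) / np.sqrt(N)

assert np.allclose(F.conj().T @ F, np.eye(N))  # the QFT is unitary

# Tensor-product form: column j is the n-fold Kronecker product
for j in range(N):
    col = np.array([1.0])
    for ell in range(1, n + 1):
        col = np.kron(col, np.array([1.0, np.exp(2 * np.pi * 1j * j / 2 ** ell)]))
    assert np.allclose(F[:, j], col / np.sqrt(N))
```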

## Discrete Geometry

Radon’s Theorem. Any set of $d+2$ points in $\mathbf{R}^d$ can be partitioned into two sets (say, red and blue) whose convex hulls intersect.

Point-Line Incidences. Let $F$ be any field. Consider a set of $n$ points and $n$ lines in the plane $F^2$. The number of point-line incidences is $O(n^{3/2})$. (Over $\mathbb{R}$, the Szemerédi-Trotter theorem improves this to $O(n^{4/3})$.)

Brunn-Minkowski Inequality. For any non-empty compact sets $A, B \subset \mathbb{R}^n$,

$\displaystyle \text{vol}(A)^{1/n} + \text{vol}(B)^{1/n} \leq \text{vol}(A + B)^{1/n}$ .

Isometric Embedding into $\ell_2$.

• Dvoretzky’s Theorem. For every $d, \epsilon > 0$ there is an $n = n(d, \epsilon) = 2^{O(d/\epsilon^2)}$ such that the intersection of any $n$-dimensional centrally symmetric convex body with some $d$-dimensional linear subspace is $(1+\epsilon)$-almost spherical. In other words, $\ell_2^d$ can be $(1+\epsilon)$-embedded into $\ell_p^n$ for every $p$.
• For a finite $X$, $(X, D)$ embeds in $\ell_2$ iff the matrix

$\displaystyle \big( D(x_0, x_i)^2 + D(x_0, x_j)^2 - D(x_i, x_j)^2 \big)_{i, j}$

is positive semidefinite. Moreover, its rank is the smallest dimension for such an embedding.

• SDP. Finding the smallest $c$ such that a given finite $(X, D)$ can be $c$-embedded in $\ell_2$ can be formulated as a semidefinite programming problem and thus solved in polynomial time [Linial-London-Rabinovich], but no similar result is known for embedding in $\ell_2^d$ with $d$ given.

Embedding into $\ell_1$.

• Deciding isometric embeddability in $\ell_1$ is NP-hard.
• An unweighted graph embeds isometrically into $(\{0,1\}^d, \ell_1)$ if and only if it is bipartite and satisfies the pentagonal inequality.
• $\ell_2^d$ embeds isometrically into $\ell_1^{Cd}$ where $C$ is “not too much.”

Embedding into $\ell_\infty$.

• Every $n$-point metric space embeds isometrically into $\ell_\infty^n$ via the Frechet embedding. (See here.)
• Any $n$-point metric space can be embedded in $\ell_\infty^d$ with distortion $c$ where $\displaystyle d = O(b\,n^{1/b}\, \log n)$ and $b > 0$ an integer and $c = 2b-1$. There exists an almost-matching lower bound using graphs with short cycles.
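
The Frechet embedding is short enough to verify numerically: map each point to its vector of distances to all $n$ points; the triangle inequality makes this an isometry into $\ell_\infty^n$. A sketch (using Euclidean points to generate a metric, but any metric works):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
pts = rng.random((7, 3))
# n x n distance matrix of an arbitrary metric space
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)

# Frechet embedding: point i -> row i of D; check it is an isometry into l_inf
for i, j in combinations(range(7), 2):
    linf = np.max(np.abs(D[i] - D[j]))
    assert np.isclose(linf, D[i, j])
```

The assertion holds exactly: $\vert D(i,k) - D(j,k) \vert \leq D(i,j)$ for every $k$ by the triangle inequality, with equality at $k = j$.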

Embedding into $\ell_p$ metrics. All $\ell_p$ metrics embed with bounded distortion into $\ell_q$ metrics when

• $p = q$, or
• $p = 2$, or
• $q = \infty$, or
• $1 \leq q \leq p \leq 2$; in this case, $\ell_p^d$ $(1+\epsilon)$-embeds into $\ell_q^{Cd}$ where $C = C(p, q, \epsilon)$ is “not too big.”

Isometric embedding exists in all these cases, but these embeddings are probabilistic and an explicit one is not known.

All $n$-point metric spaces $O((\log n)/p)$-embed into $\ell_p$. [Matousek]. There are (almost) matching lower bounds. (LLR and Matousek.)

Johnson-Lindenstrauss Flattening Lemma for $\ell_2$. Let $\epsilon \in (0,1)$ be given. There exists a $(1 + \epsilon)$-distortion embedding of any $n$-point set $X \subset \ell_2$ into $\ell_2^d$, where $d = O(\epsilon^{-2} \log n)$.

All known proofs first place the metric under consideration in $\ell_2^n$ and then map it into $\ell_2^d$ by a random linear map. See [Achlioptas] for a discussion.
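
A minimal illustration of such a random linear map (the dimensions and the distortion bound checked below are illustrative, not the sharp constants of the lemma):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, D, d = 50, 500, 300          # n points in R^D, target dimension d (illustrative)

X = rng.standard_normal((n, D))
G = rng.standard_normal((D, d)) / np.sqrt(d)   # random linear map, scaled so that
                                               # lengths are preserved in expectation
Y = X @ G

ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(n), 2)]
# with these parameters the distortion is comfortably small
assert 0.6 < min(ratios) and max(ratios) < 1.4
```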

No JL-type Flattening for $\ell_1$. For every fixed $c > 1$ and every $n$, there is an $n$-point $\ell_1$-metric that cannot be $c$-embedded into $\ell_1^d$ unless $d = n^{\Omega(1/c^2)}$. (Due to [Brinkman-Charikar and Lee-Naor].)

Measure Concentration on the Sphere. Let $P$ be the probability measure on the unit sphere $S^{n-1}$ so that $P[A], A \subset S^{n-1}$ denotes the probability that a random element in $S^{n-1}$ falls in $A$. Let $A_t$ denote the $t$-neighborhood of $A$. Then

$\displaystyle P[\overline{A_t}] \leq 2 e^{-n t^2/2} \quad \text{if} \quad P[A] \geq 1/2\, .$

Levy’s Lemma. Any $1$-Lipschitz function $f : S^{n-1} \rightarrow \mathbb{R}$ is sharply concentrated around its median. That is, for all $\delta \in (0,1)$,

$\displaystyle \Pr[ \vert f - \text{med}(f) \vert \geq \delta] \leq 2 e^{-\delta^2 n /2}$ ,

where $\Pr[f > a]$ means $\Pr[\, \{x \in S^{n-1} \, : \, f(x) > a \}\,]$ and the median is defined as

$\displaystyle \text{med}(f) \triangleq \sup \{t \in \mathbb{R} \, : \, \Pr[f \leq t] \leq 1/2 \}$.

Proof of the first inequality: Let $A = \{x \in S^{n-1} \, : \, f(x) \leq \text{med}(f) \}$ and thus $P[A] \geq 1/2$. Let $A_t$ be a $t$-neighborhood of $A$. Since $f$ is 1-Lipschitz, we have $f(x) \leq \text{med}(f) + t$ for all $x \in A_t$. Taking the complement, we get $\Pr[f(x) \geq \text{med}(f) + t] = P[\overline{A_t}] \leq 2 e^{-t^2 n/2}$ using the measure concentration on the sphere.

The Median of a Lipschitz Function is Close to Its Mean. Let $f : S^{n-1} \rightarrow \mathbb{R}$ be $1$-Lipschitz. Then

$\displaystyle | \text{med}(f) - \mathbb{E}\, f | \leq 12/\sqrt{n}$ .

The Concentration of Length under Random Projection. Let $x$ be chosen uniformly from the $n$-dimensional unit sphere, and let $f(x)$ be the length of the projection of $x$ on the first $k$ coordinates. Alternatively, we can take $x$ fixed and consider a random $k$-dimensional projection of $x$. In either case, $f(x)$ is sharply concentrated around its median $m = m(n, k)$; if $k \geq 10 \log n$, then $m \geq \frac{1}{2}\sqrt{k/n}$.
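
An empirical illustration (the parameters are illustrative): the projection length concentrates tightly around $\sqrt{k/n}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 1000, 100, 2000

# uniform random points on S^{n-1}: normalized Gaussian vectors
X = rng.standard_normal((trials, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# f = length of the projection onto the first k coordinates
f = np.linalg.norm(X[:, :k], axis=1)

# the median is close to sqrt(k/n) and the fluctuation is small
assert abs(np.median(f) - np.sqrt(k / n)) < 0.05
assert f.std() < 0.05
```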

Distortion vs. Dimension. Suppose there is some $d$ and $p$ such that $\ell_p^d$ can host all $n$-point metric spaces with distortion at most $D$. Then $d$ cannot be “small.”

Moreover, as $n \rightarrow \infty$,

$\displaystyle d = \begin{cases} \Omega(n)\, , & \text{for } D < 3 \\ \Omega(\sqrt{n})\, , & \text{for } D < 5 \\ \Omega(n^{1/3})\, , & \text{for } D < 7 \\\end{cases}$ ,

and so on.

However, it is possible to 3-embed all metric spaces into a particular (which?) normed space of dimension close to $\sqrt{n}$. (Matousek)

• For all fixed $d \geq 1$, there are $n$-point metric spaces requiring distortion $\displaystyle \Omega( n^{1/\lfloor (d+1)/2 \rfloor } )$ for embedding into $\ell_2^d$.
• Every $n$-point space $O(n)$-embeds into the real line, and $\displaystyle O(n^{2/d}\, \log^{3/2} n)$-embeds in $\ell_2^d, d \geq 3$.

See Proposition 15.3.3 in Matousek’s book for details.

Lower Bounds for Embedding into $\ell_2^d$.

• It is impossible to embed the Hamming cube $\{0, 1\}^m$ into $\ell_2$ with distortion $D < \sqrt{m} = \sqrt{\log_2(n)}$ where $n = 2^m$. Hence the natural embedding is optimal.
• It is impossible to $D$-embed the shortest-path metric of an $n$-vertex expander into $\ell_2$ with $D \leq c \log n$ where $c$ is independent of $n$.
• Embedding a general $n$-point metric space in $\ell_2$ (or any $\ell_p$ for a fixed $p < \infty$) must endure an $\Omega(\log n)$ distortion.
• The shortest path metric of any $k$-regular graph of girth $g$ requires $\Omega(\sqrt{g})$ distortion for embedding in $\ell_2$.

Embedding Trees into Normed Spaces.

• Every tree has an isometric embedding in $\ell_1^n$, and (therefore) a $O(\sqrt{n})$-distortion embedding in $\ell_2$. It can also be embedded isometrically into $\ell_\infty^{O(\log n)}$.
• Any convex combination of tree metrics embeds isometrically into $\ell_1$.
• A complete binary tree of size $n$ has an $O(\sqrt{\log n})$-embedding into $\ell_2$.
• Every tree has an $\ell_2$ embedding with distortion $O(\log \log n)$, due to Linial, Magen, and Saks, and with distortion $O(\sqrt{\log \log n})$, due to Matousek. A similar result holds for $\ell_p, p > 2$ with distortion $\displaystyle O((\log \log n)^{\min(1/2, 1/p)})$.
• For a fixed $d\geq 1$, any $n$-point tree metric $O(n^{1/(d-1)})$-embeds into $\ell_2^d$. For $d = 2$ and unit-length edges, this improves to $O(\sqrt{n})$.

(Bourgain’s) Embedding of General Metrics into Normed Spaces. (Bourgain (1985); Linial, London, Rabinovich (1995))

$\displaystyle (X, d) \quad \overset{O(\log n)}{\hookrightarrow} \quad \ell_p^{O(\log(n)^2)}$

for all $p \geq 1$ and $n = |X|$.

Embedding Planar Graphs into $\ell_2$. Any $n$-point planar graph is $O(\sqrt{\log n})$-embeddable into $\ell_2$ [Rao]. Similar results hold for graphs with forbidden minors.

Embedding into Sparse Spanners. A $t$-spanner gives a $t$-embedding. For every $t \geq 2$, every $n$-point graph has a $t$-spanner with at most $m(t+1, n)$ edges, where $m(g, n)$ is the maximum possible number of edges of an $n$-vertex graph of girth $g + 1$. It is known that

$\displaystyle \frac{1}{9} n^{1+1/(g - 1)} \leq m(g, n) \leq n^{1+1/\lfloor g/2 \rfloor } + n$ .
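
The greedy construction behind sparse spanners (due to Althöfer et al.) is short: scan the edges in nondecreasing weight and keep an edge only if the spanner built so far does not already provide a path of stretch $t$ between its endpoints; every cycle in the result is then longer than $t+1$. A minimal sketch (helper names are mine):

```python
import heapq
from collections import defaultdict

def shortest_paths(adj, s):
    # Dijkstra from s; adj maps a vertex to a list of (neighbor, weight) pairs
    dist = defaultdict(lambda: float("inf"))
    dist[s] = 0.0
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def greedy_spanner(edges, t):
    # scan edges by nondecreasing weight; keep (u, v, w) iff the current
    # spanner distance between u and v exceeds t*w
    adj = defaultdict(list)
    kept = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if shortest_paths(adj, u)[v] > t * w:
            adj[u].append((v, w))
            adj[v].append((u, w))
            kept.append((u, v, w))
    return kept, adj

# complete graph K6 with unit weights, t = 3
edges = [(u, v, 1.0) for u in range(6) for v in range(u + 1, 6)]
kept, adj = greedy_spanner(edges, 3)
assert len(kept) < len(edges)                 # strictly sparser
full = defaultdict(list)
for u, v, w in edges:
    full[u].append((v, w))
    full[v].append((u, w))
for u in range(6):                            # stretch at most 3 for every pair
    ds, dg = shortest_paths(adj, u), shortest_paths(full, u)
    assert all(ds[v] <= 3 * dg[v] for v in range(6))
```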

Embedding into a Distribution on Trees (due to Fakcharoenphol, Rao, and Talwar). Any $n$-point metric has an $O(\log n)$-embedding into a distribution $D$ of dominating trees.

In the FRT/Bartal constructions, for every $k$, there are some $r = r(k)$ trees $\{ T_i \}$, each a $k$-HST (hierarchically well-separated tree), so that the distortion is $O( k/\log k \cdot \log n)$. In the Alon-Karp-Peleg-West construction, however, these trees are spanning trees of the input graph.

Embedding on the Hypercube.

• All trees are cubical, i.e., they embed isometrically into some hypercube. All cubical graphs are bipartite.
• Rectangular and hexagonal meshes, and even cycles are cubical.

## Analysis

Gamma function. The integral

$\displaystyle \Gamma(z) = \int_0^\infty{ x^{z-1} e^{-x} dx}$,

converges absolutely if $z \in \mathbb{C}, \mathrm{Re}(z) > 0$. By the Bohr-Mollerup theorem, it is the unique log-convex function that satisfies $\Gamma(1) = 1, \Gamma(z+1) = z \Gamma(z)$. In particular, $\Gamma(1/2) = \sqrt{\pi}$. Euler proved that

$\displaystyle \Gamma(z) = \frac{1}{z} \prod_{n=1}^\infty \bigg[ \frac{ \big(1+1/n \big)^z }{ 1+z/n } \bigg]$.

Weierstrass showed that

$\displaystyle \Gamma(z) = \frac{e^{-\gamma z}}{z} \prod_{n=1}^\infty \bigg[ \frac{ e^{z/n} }{ 1+z/n } \bigg]$,

where $\displaystyle \gamma \approx 0.577216$ is the Euler-Mascheroni constant. Euler’s reflection formula says that

$\displaystyle \Gamma(z) \Gamma(1-z) = \frac{ \pi }{\sin (\pi z)}$ if $z \not \in \mathbb{Z}$ .

The duplication formula says that

$\displaystyle \Gamma(z) \Gamma( z + 1/2) = \frac{2 \sqrt{\pi}}{2^{2z}} \Gamma(2z)$ .
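
These identities are easy to check numerically against `math.gamma`, using Euler's product in the form $\Gamma(z) = \frac{1}{z}\prod_n (1+1/n)^z/(1+z/n)$ (the truncation level below is arbitrary):

```python
import math

def euler_product_gamma(z, terms=100000):
    # truncation of Euler's product (note the leading 1/z factor)
    p = 1.0 / z
    for n in range(1, terms + 1):
        p *= (1 + 1 / n) ** z / (1 + z / n)
    return p

assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))            # Gamma(1/2)
assert math.isclose(euler_product_gamma(0.5), math.gamma(0.5), rel_tol=1e-4)
z = 1 / 3
# reflection formula
assert math.isclose(math.gamma(z) * math.gamma(1 - z), math.pi / math.sin(math.pi * z))
# duplication formula
assert math.isclose(math.gamma(z) * math.gamma(z + 0.5),
                    2 * math.sqrt(math.pi) / 4 ** z * math.gamma(2 * z))
```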

Moreover, $\overline{\Gamma(z)} = \Gamma(\overline{z})$. The integer-order derivatives are

$\displaystyle \Gamma^{(n)}(z) = \int_0^\infty{ x^{z-1} e^{-x} (\ln x)^n dx }$ .

Fractional Differentiation of a monomial. Let $y(x) = x^\mu$ for $\mu \in \mathbb{R}$. Then for any real $s > 0$,

$\displaystyle \frac{d^s y}{dx^s} = \frac{\Gamma(\mu + 1)}{\Gamma(\mu - s + 1)} x^{\mu - s}$ .

This agrees with the usual integer-order differentiation if $s \in \mathbb{N}$.
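
A quick check that the formula recovers the classical derivative at integer orders, and that two half-derivatives compose to one full derivative (the helper name is mine):

```python
from math import gamma

def frac_deriv_monomial(mu, s):
    # d^s/dx^s of x^mu: returns (coefficient, exponent)
    return gamma(mu + 1) / gamma(mu - s + 1), mu - s

# s = 1 recovers the classical derivative of x^3, namely 3x^2
c, e = frac_deriv_monomial(3, 1)
assert abs(c - 3) < 1e-12 and e == 2

# two half-derivatives compose to one full derivative
c1, e1 = frac_deriv_monomial(3, 0.5)
c2, e2 = frac_deriv_monomial(e1, 0.5)
assert abs(c1 * c2 - 3) < 1e-9 and abs(e2 - 2) < 1e-12
```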

Cauchy’s Repeated Integration. Let $f$ be a Lebesgue-integrable function defined in $[a,b]$. Define the integration operator

$\displaystyle I_a f (x) := \int_a^x{f(t) dt}$ .

A repeated application of this operator can be expressed as

$\displaystyle I_a^{(n)} f (x) = \int_a^x \int_a^{t_n} \cdots \int_a^{t_2}{ f(t_1) dt_1 \cdots dt_n}$

$\displaystyle = \frac{1}{(n-1)!} \int_a^x{ (x-t)^{n-1} f(t) dt }$.
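
The collapse of the $n$-fold integral into a single weighted integral can be checked numerically; for $f = \cos$ and $a = 0$, integrating three times gives $x - \sin x$:

```python
import math

def repeated_integral(f, a, x, n, steps=2000):
    # Cauchy's formula: (1/(n-1)!) int_a^x (x-t)^(n-1) f(t) dt, midpoint rule
    h = (x - a) / steps
    total = 0.0
    for i in range(steps):
        t = a + (i + 0.5) * h
        total += (x - t) ** (n - 1) * f(t) * h
    return total / math.factorial(n - 1)

# integrating cos three times from 0 gives x - sin(x)
x = 1.7
assert abs(repeated_integral(math.cos, 0.0, x, 3) - (x - math.sin(x))) < 1e-4
```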

Riemann-Liouville Fractional Integral. The natural extension of the order of Cauchy’s repeated integration from an integer $n$ to a positive real $s$ is the Riemann-Liouville fractional integral:

$\displaystyle I_a^{(s)} f(x) = \frac{1}{\Gamma(s)} \int_a^x{f(t) (x-t)^{s-1} dt }$

where $f$ is a locally-integrable function and $s \in \mathbb{C}, \mathrm{Re}(s) > 0$.

It can be shown that for $x\in \mathbb{R}$,

$\displaystyle \widehat{I_{-\infty}^s f}(x) = (2 \pi i x)^{-s} \hat{f}(x)$ ,

where $\hat{\cdot}$ denotes the Fourier transform.

Riemann-Liouville Fractional Derivative. To differentiate to the order $s \in (k-1,k]$, we cannot directly use $-s$ as the order in the Riemann-Liouville integral. Instead, integrate to the order $k-s$ and then differentiate to the integer order $k$:

$\displaystyle D_a^{s} f(x) = D_a^{k} I_a^{k-s} f(x) = \frac{1}{\Gamma(k-s)} \frac{d^k}{dx^k} \int_a^x{f(t) (x-t)^{k-s-1} dt }$.

Note that the value of the derivative depends on all values of $f(t)$ for $t \in [a,x]$, whereas the classical derivative is “local” in that it depends only on the values of $f$ near $x$.

It can be shown that for $x\in \mathbb{R}$,

$\displaystyle \widehat{D_{-\infty}^s f}(x) = (2 \pi i x)^s \hat{f}(x)$ ,

where $\hat{\cdot}$ denotes the Fourier transform.

Fourier Transform. Let $e_s(x) = e^{-2 \pi i s x}$ for any real $s$. Then, for any function $f : \mathbb{R} \rightarrow \mathbb{C}$, its Fourier transform is a function $\hat{f} : \mathbb{R} \rightarrow \mathbb{C}$ defined as

$\displaystyle (\mathcal{F} f)(s) = \hat{f}(s) = \int_{x = -\infty}^\infty{ f(x) e_s(x) dx}$

for any real $s$. The inverse transform is defined as

$\displaystyle (\mathcal{F}^{-1} \hat{f})(x) = f(x) = \int_{s = -\infty}^\infty{ \hat{f}(s) e_{s}(-x) ds}$

for any real $x$. The transform of a convolution is the product of the transforms:

$\displaystyle \mathcal{F}(f \star g)(s) = \mathcal{F} f (s) \cdot \mathcal{F}g (s)$ .

The transform of a derivative is a scaling of the transform:

$\displaystyle \widehat{f^{(n)}}(s) = (2 \pi i s)^n \hat{f}(s)$ .

The Plancherel theorem says that the Fourier transform (of a square-integrable function) preserves the squared $\ell_2$ norm of the function:

$\displaystyle \int{\vert f(x) \vert^2 dx} = \int{\vert \hat{f}(s) \vert^2 ds}$ .

This is generalized by Parseval’s theorem for two square-integrable functions $f,g$:

$\displaystyle \int{ f(x) \overline{g(x)} dx} = \int{ \hat{f}(s) \overline{\hat{g}(s)} ds}$ .
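
The discrete analogs of the convolution theorem and of Plancherel/Parseval can be checked with the DFT (note that numpy's unnormalized FFT convention requires a $1/N$ factor in the Plancherel identity):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64
f, g = rng.standard_normal(N), rng.standard_normal(N)

# convolution theorem, discrete (circular) analog: DFT(f * g) = DFT(f) . DFT(g)
conv = np.array([sum(f[k] * g[(m - k) % N] for k in range(N)) for m in range(N)])
assert np.allclose(np.fft.fft(conv), np.fft.fft(f) * np.fft.fft(g))

# Plancherel, discrete analog (numpy's unnormalized DFT needs a 1/N factor)
assert np.allclose(np.sum(np.abs(f) ** 2), np.sum(np.abs(np.fft.fft(f)) ** 2) / N)
```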

Laplace Transform. Very similar to the Fourier transform, except that the kernel is a function of a complex variable $s$ and the integration runs only over the non-negative reals. For functions $f : \mathbb{R}_+ \rightarrow \mathbb{C}$,

$\displaystyle \hat{f}(s) = \int_{t=0}^\infty{ f(t) e^{-st} dt} \, , \qquad s \in \mathbb{C}$ .

The inverse transform is given by Fourier-Mellin’s inverse formula aka the Bromwich line integral:

$\displaystyle f(t) = \frac{1}{2\pi i} \lim_{T \rightarrow \infty} \int_{\gamma - iT}^{\gamma + iT}{ \hat{f} (s) e^{st} ds}$ ,

where the real number $\gamma$ exceeds the real part of every singularity of $\hat{f}$.

Laplace transform is linear. Moreover,

$\displaystyle \widehat{e^{ct}}(s) = \frac{1}{s-c}$ .

$\displaystyle \widehat{e^{ct} f(t) }(s) = \hat{f}(s-c)$ .

$\displaystyle \widehat{f(t - T) }(s) =e^{-Ts} \hat{f}(s)$ .

$\displaystyle \widehat{ t f(t) }(s) = - \frac{d}{ds} \hat{f}(s)$ .

$\displaystyle \widehat{f^\prime (t) }(s) = s \hat{f}(s) - f(0)$ .

$\displaystyle [\mathcal{L} I_0^t f ](s) =\frac{1}{s} \mathcal{L} f(s)$,

where $\displaystyle I_0^t$ is the integration operator denoting $\displaystyle I_0^t f := \int_{x=0}^t{f(x) dx}$ and $\mathcal{L}$ is the Laplace transform.

If $X$ is a random variable with density $f$, then for any complex $s$, $\hat{f}(s) = \mathbb{E}\ e^{-sX}$ .
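
These transform pairs are easy to verify by truncating the defining integral numerically; `laplace` below is an illustrative midpoint-rule approximation, not a library routine:

```python
import numpy as np

def laplace(f, s, T=60.0, steps=200000):
    # truncated transform: int_0^T f(t) e^{-st} dt, midpoint rule
    t = (np.arange(steps) + 0.5) * (T / steps)
    return np.sum(f(t) * np.exp(-s * t)) * (T / steps)

s, c = 2.0, 0.5
# L{e^{ct}}(s) = 1/(s-c)
assert abs(laplace(lambda t: np.exp(c * t), s) - 1 / (s - c)) < 1e-4
# shift rule with f = cos, using L{cos}(s) = s/(s^2+1)
assert abs(laplace(lambda t: np.exp(c * t) * np.cos(t), s)
           - (s - c) / ((s - c) ** 2 + 1)) < 1e-4
```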

Tonelli’s Theorem. If a two-variable function $f$ is non-negative, then we can change the order of integration in a double integral, i.e., if $f(x,y) \geq 0$ then

$\displaystyle \int_{X,Y} f(x,y)dxdy = \int_X \bigg( \int_Y f(x,y) dy \bigg) dx = \int_Y \bigg( \int_X f(x,y) dx \bigg) dy$ .

Fubini’s Theorem has the same conclusion as Tonelli’s theorem, but instead of non-negativity it stipulates that $\vert f \vert$ has a finite integral over $X \times Y$.

Differentiation under the Integral. Also called “the Leibniz rule.”

$\displaystyle \frac{d}{dx} \int_{a(x)}^{b(x)} f(x,t) dt = f(x, b(x) ) b^\prime(x) - f(x, a(x) ) a^\prime(x) + \int_{a(x)}^{b(x)} \frac{\partial }{\partial x} f(x,t) dt$ .

If $a(x), b(x)$ are constants, the first two terms are zero.
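
A numerical sanity check of the rule, with the illustrative choice $f(x,t) = \sin(xt)$, $a(x) = 0$, $b(x) = x^2$, compared against a central finite difference:

```python
import math

def F(x, steps=4000):
    # F(x) = int_0^{x^2} sin(x t) dt, via the midpoint rule
    b = x * x
    h = b / steps
    return sum(math.sin(x * ((i + 0.5) * h)) for i in range(steps)) * h

def leibniz(x, steps=4000):
    # f(x, b(x)) b'(x) - 0 + int_0^{x^2} d/dx sin(x t) dt,
    # where d/dx sin(x t) = t cos(x t)
    b, db = x * x, 2 * x
    h = b / steps
    integral = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        integral += t * math.cos(x * t) * h
    return math.sin(x * b) * db + integral

x = 0.8
finite_diff = (F(x + 1e-5) - F(x - 1e-5)) / 2e-5
assert abs(finite_diff - leibniz(x)) < 1e-4
```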

Polylogarithm Function. For complex order $s$ and argument $z$,

$\displaystyle \mathrm{Li}_s(z) = \sum_{k=1}^\infty{ \frac{z^k}{k^s}}$ .

It can be expressed as a repeated integral of itself:

$\displaystyle \mathrm{Li}_{s+1}(z) = \int_0^z{\frac{\mathrm{Li}_s(t)}{t} dt}$ .

Moreover,

$\displaystyle \mathrm{Li}_{-n}(z) = {1 \over (1-z)^{n+1}} \sum_{k=0}^{n-1} \left\langle {n \atop k} \right\rangle z^{n-k}$ ,

where $n=1,2,3,\ldots$ and the $\left\langle {n \atop k} \right\rangle$ notation denotes the Eulerian numbers. This expression is strikingly similar to the $n$th moment of a geometrically distributed random variable $X$ with success parameter $1-\alpha$ for some $\alpha \in (0,1)$:

$\displaystyle\mathbb{E}\ X^n = \frac{1}{(1-\alpha)^n} \sum_{k=0}^{n-1} \left\langle {n \atop k} \right\rangle \alpha^{k}$.
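
Both identities can be verified numerically; `eulerian` below uses the standard explicit formula for Eulerian numbers (an assumption not stated above):

```python
from math import comb

def eulerian(n, k):
    # explicit formula for the Eulerian number <n over k>
    return sum((-1) ** j * comb(n + 1, j) * (k + 1 - j) ** n for j in range(k + 1))

def polylog_neg(n, z):
    # the closed form of Li_{-n}(z) via Eulerian numbers
    return sum(eulerian(n, k) * z ** (n - k) for k in range(n)) / (1 - z) ** (n + 1)

def polylog_series(s, z, terms=3000):
    return sum(z ** k / k ** s for k in range(1, terms + 1))

z, alpha = 0.3, 0.4
for n in (1, 2, 3):
    assert abs(polylog_neg(n, z) - polylog_series(-n, z)) < 1e-9
    # nth moment of a geometric r.v. with success probability 1 - alpha
    moment = sum(k ** n * (1 - alpha) * alpha ** (k - 1) for k in range(1, 3000))
    closed = sum(eulerian(n, k) * alpha ** k for k in range(n)) / (1 - alpha) ** n
    assert abs(moment - closed) < 1e-9
```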

The Hypergeometric Series. For a complex $z, \vert z \vert < 1$ and parameters $a, b, c \in \mathbb{C}$ with $c$ not a non-positive integer,

$\displaystyle {}_2F_1(a, b; c; z) = \sum_{n=0}^\infty \frac{ (a)_n (b)_n }{(c)_n} \frac{z^n}{n!}$ ,

where $(a)_n = a(a+1)\cdots (a+n-1)$ is the rising Pochhammer symbol. If either $a$ or $b$ is a non-positive integer, say $a = -m$, the series reduces to the polynomial

$\displaystyle {}_2F_1(-m, b; c; z) = \sum_{n=0}^m (-1)^n {m \choose n} \frac{ (b)_n }{(c)_n} z^n$ .

Laguerre Polynomials.

$\displaystyle L_n(x) = \sum_{k=0}^n{ {n \choose k} \frac{(-1)^k x^k}{k!} }$ .

They satisfy the differential equation

$\displaystyle x L_n^{(2)}(x) + (1-x) L_n^{(1)}(x) + n L_n(x) = 0$ .

They are orthogonal with respect to the inner product

$\displaystyle \langle L_m, L_n \rangle = \int_0^\infty e^{-x} L_m(x) L_n(x) dx = \delta_{mn}$ .
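
The orthogonality relation can be verified with Gauss-Laguerre quadrature, which already incorporates the weight $e^{-x}$ (a sketch; `laguerre` evaluates the defining sum directly):

```python
import numpy as np
from math import comb, factorial

def laguerre(n, x):
    # evaluate L_n(x) from the explicit sum
    return sum(comb(n, k) * (-x) ** k / factorial(k) for k in range(n + 1))

# Gauss-Laguerre quadrature: sum_i w_i f(x_i) ~ int_0^inf e^{-x} f(x) dx,
# exact here since the integrands are low-degree polynomials
nodes, weights = np.polynomial.laguerre.laggauss(20)
for m in range(4):
    for n in range(4):
        val = sum(w * laguerre(m, x) * laguerre(n, x) for x, w in zip(nodes, weights))
        assert abs(val - (1.0 if m == n else 0.0)) < 1e-6
```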

Brouwer Fixed-Point Theorem. Let $I$ be a compact convex set, and let $f : I \rightarrow I$ be continuous. Then there exists some $x \in I$ with $f(x) = x$.

Example 1: If $I$ is a real interval, then every continuous curve $f : I \rightarrow I$ must cross the diagonal straight line $g : x \mapsto x, x \in I$ at some point.

Example 2: Consider the $n$-dimensional space of probability distributions on the set $\{1, \cdots, n\}$, and let us call this space $P$. This space is convex, closed, and bounded (hence compact). A right-stochastic $n\times n$ real matrix $M$ (whose rows sum to one) has the trivial right-eigenvector $\mathbf{1} = (1, \ldots, 1)^T$ with eigenvalue 1. Moreover, by the Gershgorin disk theorem, 1 is the largest possible eigenvalue (in absolute value). Recall that the left and right eigenvalues of a matrix coincide. The matrix $M$ acts on $P$ as the continuous map $x \mapsto xM$ (multiplication by the row vector $x$ from the left). Hence, by Brouwer, $M$ has a fixed point in $P$: a left-eigenvector with eigenvalue 1, i.e., a stationary distribution. See here for details.
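
Example 2 is easy to demonstrate numerically: iterating $x \mapsto xM$ from any starting distribution converges to the promised fixed point (a sketch; the matrix and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.random((5, 5))
M /= M.sum(axis=1, keepdims=True)      # right-stochastic: rows sum to one

# M maps the simplex P into itself via the continuous map x -> xM;
# iterating finds the fixed point promised by Brouwer (a stationary distribution)
x = np.full(5, 0.2)
for _ in range(500):
    x = x @ M
assert np.allclose(x, x @ M)           # fixed point: a stationary distribution
assert np.isclose(x.sum(), 1.0) and (x >= 0).all()
```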

Prime Number Facts.

$\displaystyle \sum_{p \leq n}{\log p} = O(n) .$

$\displaystyle \sum_{p \leq n}\frac{\log p}{p} = \log n + O(1) .$

$\displaystyle \sum_{p \leq n}\frac{1}{p} = \log \log n + O(1) .$

(See here and here for proofs of the above three facts.)

It suffices to sample only $\Theta(n / \log n)$ elements from any distribution to test its entropy and support size. See here.

Let $G$ be an $(n, d, \lambda)$ expander. Let $F \subset E$ contain an $\alpha$ fraction of all edges. Consider a random walk that starts at a random edge from $F$. The probability that it uses an edge from $F$ at the $t$-th step is at most $\alpha + ( \lambda/d)^{t-1}$. See here.