This page is always under construction.


Contents: Number Theory · Coding Theory · Algebra · Linear Algebra · Combinatorics · Deviation Inequalities · Randomized Algorithms · Analysis · Complexity Theory · Cryptography · Distributed Computing · Algebraic Graph Theory · Graph Theory · Discrete Geometry · Optimization · Inequalities/Identities · Series · Quantum Computing · Interesting Facts


Number Theory

Chinese Remainder Theorem. Given a factorization of an integer M=m_1m_2 into pairwise coprime factors, there is a ring isomorphism between \mathbb{Z}_M and the Cartesian product \mathbb{Z}_{m_1} \times \mathbb{Z}_{m_2}. Restricted to units, it gives a bijection between the units of \mathbb{Z}_M and U_1 \times U_2, where U_i is the set of units modulo m_i, i.e., the set of integers z that are coprime to m_i. These are also the invertible elements modulo m_i.

More. This bijection takes an element m \bmod M to the tuple (m \bmod m_1, m \bmod m_2). More generally, write M = m_1 \cdots m_n with pairwise coprime m_i. Given a tuple (a_1 \bmod m_1, \cdots, a_n \bmod m_n), the corresponding element in \mathbb{Z}_M is \sum{a_i N_i H_i} \bmod M where N_i = (\prod{m_j})/m_i = M/m_i and H_i is the multiplicative inverse of N_i modulo m_i.
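A quick sketch of the reconstruction formula in Python (the moduli and residues below are arbitrary examples; `pow(N, -1, m)` computes a modular inverse and needs Python 3.8+):

```python
from math import prod

def crt(residues, moduli):
    # Reconstruct x mod M from residues a_i mod m_i; moduli must be pairwise coprime.
    M = prod(moduli)
    x = 0
    for a, m in zip(residues, moduli):
        N = M // m               # N_i = M / m_i
        H = pow(N, -1, m)        # H_i = multiplicative inverse of N_i mod m_i
        x += a * N * H
    return x % M

# x ≡ 2 (mod 3), x ≡ 3 (mod 5), x ≡ 2 (mod 7) gives x = 23 (Sunzi's classic example).
print(crt([2, 3, 2], [3, 5, 7]))  # 23
```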

Fermat’s Integer Factorization of N. Try to find two integers a,b such that a \not\equiv \pm b \bmod N and a^2 \equiv b^2 \bmod N. Then N divides the product (a+b)(a-b) but, by the first condition, divides neither factor. Consequently, the gcd of N and at least one of these factors must be non-trivial; this gcd is a proper divisor of N.
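A minimal sketch of the method in Python (the input 5959 = 59 · 101 is an arbitrary example):

```python
from math import gcd, isqrt

def fermat_factor(N):
    # Search for a with a^2 - N a perfect square b^2; then a^2 ≡ b^2 (mod N),
    # in fact N = (a - b)(a + b) here, and the gcds expose the two factors.
    a = isqrt(N)
    if a * a < N:
        a += 1
    while True:
        b2 = a * a - N
        b = isqrt(b2)
        if b * b == b2:
            return gcd(a - b, N), gcd(a + b, N)
        a += 1

print(fermat_factor(5959))  # (59, 101)
```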

Algebra


Finite Abelian Groups. Every finite Abelian group G can be written as a direct product

\displaystyle G \cong \bigoplus_{i=1}^t  \mathbb{Z}_{q_i} \cong \bigoplus_{i=1}^s  \mathbb{Z}_{k_i}

where 0 < s \leq t < \infty, the q_i are powers of (possibly repeating) primes, and k_1 divides k_2, k_2 divides k_3, and so on.


Cyclic Groups. These are abelian groups which are generated by a single element; \mathbb{Z}_n is cyclic for every n. For every positive divisor d of n, the group G_n := \mathbb{Z}/(n) has precisely one subgroup of order d, namely the one generated by the residue class of n/d. The number of generators of G_n equals the number of elements coprime to n; this number is given by Euler’s totient function \phi(n). The order of the element d in G_n is n/\gcd(d, n). The direct product of two cyclic groups G_n and G_m is cyclic if and only if n and m are coprime.

Every finite abelian group can be expressed as a direct sum of cyclic subgroups of prime-power order, and these orders are uniquely determined.


Pontryagin Dual. Let G be an additive finite Abelian group and let T be the unit torus with multiplication as the group operation. A character of G is a group homomorphism \chi : G \rightarrow T; it satisfies \chi(0) = 1 and \chi(x+y) = \chi(x) \chi(y). The set \hat{G} of all characters forms an Abelian group with the group operation being “point-wise multiplication”

(\chi \cdot \rho)(x) = \chi(x) \rho(x)\,, \qquad x \in G ,

and the identity element being the constant function

1_G: x \mapsto 1\, , \qquad x \in G .

\hat{G} is called the Pontryagin dual of G, and the character 1_G is called the trivial character.
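For G = \mathbb{Z}_n the characters are explicit: \chi_k(x) = e^{2\pi i kx/n}. A small numerical check of the group structure (n = 6 is an arbitrary choice):

```python
import cmath

# Characters of Z_n: chi_k(x) = exp(2*pi*i*k*x / n); n = 6 is arbitrary.
n = 6
def chi(k):
    return lambda x: cmath.exp(2j * cmath.pi * k * x / n)

# Homomorphism property: chi(x + y) = chi(x) * chi(y) (addition mod n).
c = chi(2)
assert abs(c((3 + 4) % n) - c(3) * c(4)) < 1e-9

# Point-wise product of characters is again a character: chi_j * chi_k = chi_{j+k mod n}.
j, k, x = 1, 4, 5
assert abs(chi(j)(x) * chi(k)(x) - chi((j + k) % n)(x)) < 1e-9

# A nontrivial character sums to zero over the group (orthogonality with 1_G).
assert abs(sum(chi(2)(x) for x in range(n))) < 1e-9
```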



Coding Theory

Reed-Muller Code. The order-r Reed-Muller code R(r,m) is the set of (evaluation vectors of) multilinear Boolean polynomials in m variables of degree at most r.

Reed-Solomon Code. In the Reed-Solomon code RS_{q, \mathcal{E} }(n,k), a polynomial m(x) of degree at most k-1 over GF(q) — the message — is encoded into the codeword (c_0, \cdots, c_{n-1}), where c_j = m(\alpha_j) is the evaluation of m(x) at the j-th point of a fixed set \mathcal{E} = \{ \alpha_0, \cdots, \alpha_{n-1} \} \subseteq GF(q).

Since a nonzero m(x) can have at most k-1 zeros over any field, at most k-1 of the c_j can be zero for a nonzero message; by linearity, the minimum distance d is at least n - (k-1), which matches the Singleton bound. Hence the Reed-Solomon code is MDS.
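A toy Reed-Solomon encoder over a prime field (q = 17, k = 3, n = 7 and the evaluation set are illustrative assumptions):

```python
# Toy Reed-Solomon encoder over the prime field GF(q); q = 17, k = 3, n = 7
# and the evaluation set E are illustrative assumptions.
q, k, n = 17, 3, 7
E = list(range(1, n + 1))          # evaluation points alpha_0, ..., alpha_{n-1}

def rs_encode(msg):
    # msg holds the k coefficients of the degree-(k-1) message polynomial m(x);
    # the codeword is the vector of evaluations of m on E.
    return [sum(m * pow(a, i, q) for i, m in enumerate(msg)) % q for a in E]

c1, c2 = rs_encode([1, 2, 3]), rs_encode([1, 2, 4])
# Distinct codewords agree in at most k-1 positions (their difference is a
# nonzero polynomial of degree < k), so the distance is at least n - (k-1) = 5.
agreements = sum(u == v for u, v in zip(c1, c2))
assert agreements <= k - 1
```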

Singleton Bound. For a linear [n,k,d] code, the distance d is at most n-k+1. Any code matching this bound is called MDS (maximum distance separable).

Plotkin Bound. If the distance d of a binary code is strictly larger than half the block length n, say d = n/2 + \alpha with \alpha > 0, then the number of codewords is at most \displaystyle 2 \lfloor d/2\alpha \rfloor.

Johnson Bound. Consider a binary code whose relative distance is “large” i.e., at least 1/2 - \epsilon. Then, around every string, only a “small” number of codewords lie “close”: for any string x and any \alpha > \sqrt{\epsilon}, the number of codewords within relative distance 1/2 - \alpha of x is at most 1/2\alpha^2.

Gilbert-Varshamov Bound. For every \delta < 1/2 and all sufficiently large n, there exists a code with relative distance \delta that encodes n-bit messages into codewords of block length n/(1-H(\delta)). Here, H is the binary entropy function.

Hadamard Matrix. Let H_1 = \left( \begin{array}{cc} 1 & 1 \\ 1 & -1 \end{array}\right) . Then

H := H_{2^n} = \left( \begin{array}{cc} H_{2^{n-1}} & H_{2^{n-1}} \\ H_{2^{n-1}} & -H_{2^{n-1}} \end{array}\right)

is the Hadamard matrix of order 2^n.

  • The rows/columns of H are indexed by n-bit Boolean strings. Let \odot denote the inner-product modulo 2. Then

H(x,y) = (-1)^{x \odot y}.

  • The map taking x \in \{0,1\}^n to the row H(x, \cdot) (with \pm 1 written as bits) is the Hadamard code, a [2^n, n, 2^{n-1}] linear code.
  • The volume enclosed by the columns of H (i.e., \vert \det H \vert) is the largest possible for any real matrix of the same order with entries in [-1,1].
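A sketch of the recursive (Sylvester) construction, checking the entry formula H(x,y) = (-1)^{x \odot y}:

```python
import itertools

def hadamard(n):
    # Sylvester construction: H_{2^m} = [[H, H], [H, -H]] starting from [[1]].
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

n = 3
H = hadamard(n)

# Entry formula: H(x, y) = (-1)^{x . y} with x, y read as n-bit strings.
for x, y in itertools.product(range(2 ** n), repeat=2):
    dot = bin(x & y).count("1") % 2
    assert H[x][y] == (-1) ** dot
```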

Discrete Fourier Transform. See here.

Probability


Expectation. Suppose a random variable X takes its value from the domain \chi.

\displaystyle E[X] = \sum_{x \in \chi}{x \Pr[X = x]} .

Additionally, if X takes non-negative integer values,

\displaystyle E[X] = \sum_{x \geq 1}{ \Pr[X \geq x]} .

Variance. In general,

\displaystyle Var(X) = E[X^2] - E[X]^2 .

If X_1, \cdots, X_n are independent, then

\displaystyle Var(\prod_i X_i) = \prod_i E[X_i^2] - \prod_i E[X_i]^2 .

The Birthday Paradox. Suppose k = \Omega(\sqrt{n}). When you randomly draw k times from an n-set with replacement, the resulting k-subset will contain a repeated element with constant probability.

Explanation. Suppose we choose k independent samples X=\{ x_1, \cdots, x_k \} from the numbers 1, \cdots, n. For any i\neq j, the probability that x_i=x_j is 1/n, so the expected number of colliding pairs in X is {k \choose 2}/n, which is \Omega(1) whenever k = \Omega(\sqrt{n}). More precisely, the probability of no collision is \prod_{i<k}{(1 - i/n)} \approx e^{-k^2/2n}, so a collision occurs with constant probability when k = \Omega(\sqrt{n}).
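A quick numerical check of both quantities for the classic n = 365, k = 23 instance:

```python
from math import comb

# Exact collision probability for k draws from an n-set, versus the expected
# number of colliding pairs C(k,2)/n; n = 365, k = 23 is the classic instance.
n, k = 365, 23
p_no_collision = 1.0
for i in range(k):
    p_no_collision *= (n - i) / n
p_collision = 1 - p_no_collision            # about 0.507
expected_pairs = comb(k, 2) / n             # 253/365, about 0.693
assert p_collision > 0.5
```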

Coupon Collector’s Problem. Suppose you want to color all items in a set S of size n as follows: at round 1, 2, 3, \cdots, you pick a random element from S; if it is still uncolored, you color it red, put it back, and repeat. This process will require \Theta(n \log n) rounds to color all items.

Balls in Bins. If n balls are thrown into n bins independently and uniformly at random, the “heaviest” bin contains at most \displaystyle \frac{\ln n}{\ln \ln n} \big(1+o(1) \big) balls with high probability. However, if the balls are placed sequentially—each into the “lightest” among d bins sampled uniformly at random—then the largest load is at most \displaystyle \frac{\ln \ln n}{\ln d} + \Theta(1) . Thus even if d=2, the bins are highly balanced in the sequential case.

Poisson Distribution. Let X \in \{0, 1, 2, \cdots\} be a discrete random variable with \mathbf{E}\,X = \mu. If X has the Poisson distribution P(\mu), its probability mass function at the point k is

\displaystyle f(k;\mu) := \frac{\mu^k }{k!\, e^\mu}.

The expectation and variance of X are both \mu. The expected absolute deviation of X from its mean is 2\mu f( \lfloor \mu \rfloor; \mu). A binomial distribution B(n,p) tends to the Poisson distribution P(\mu) as n \rightarrow \infty and p \rightarrow 0 with np = \mu held fixed.

Note: The Poisson distribution arises as the law of the number of occurrences among many independent, rare events.

Hypergeometric Distribution. This is the binomial distribution without replacement. In particular, let h(k;N,K,n) be the probability that exactly k “good” items are obtained in n sample-without-replacement trials from a population of N items containing exactly K “good” items in total.

\displaystyle h(k;N,K,n) = \frac{ {K \choose k} {N-K \choose n-k} }{ {N \choose n} } .

The mean of this distribution is, perhaps surprisingly, nK/N, which equals \mathbf{E}_\mathrm{binomial}. The variance is

\displaystyle n \cdot \frac{K}{N} \cdot \frac{N-K}{N} \cdot \frac{N-n}{N-1} = \mathbf{Var}_\mathrm{binomial} \cdot (1- o(1) ) \quad \text{if } n = o(N).
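A small sanity check of the pmf and its mean (N = 50, K = 20, n = 10 are arbitrary parameters):

```python
from math import comb

def hyper_pmf(k, N, K, n):
    # Probability of exactly k "good" items in n draws without replacement.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 50, 20, 10                        # arbitrary population parameters
support = range(0, min(K, n) + 1)
assert abs(sum(hyper_pmf(k, N, K, n) for k in support) - 1) < 1e-9
mean = sum(k * hyper_pmf(k, N, K, n) for k in support)
assert abs(mean - n * K / N) < 1e-9         # matches the binomial mean nK/N
```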

Probability Generating Functions. Let \mathsf{A}(X) = \sum_{t=0}^\infty{a_t X^t} be a formal power series with a_t \geq 0. If \mathsf{A}(1) = \sum_t{a_t} = 1 then \mathsf{A} “generates” the distribution of a random variable A with mass function A(t) := \Pr[A = t] = a_t. Suppose \mathsf{B} generates the distribution B(t) of a random variable B independent of A. It is easy to see that the convolution A \ast B generates the probability distribution of the sum C = A + B, whose mass function is

\displaystyle C(t) := \sum_{t = x + y}{A(x)\, B(y)}.
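For instance, convolving the pmf of a fair die with itself generates the distribution of the sum of two dice:

```python
# PGF coefficients of a fair die: Pr[die = t] for t = 0..6.
die = [0.0] + [1 / 6] * 6

def convolve(A, B):
    # C(t) = sum over x + y = t of A(x) * B(y): the pmf of the sum A + B.
    C = [0.0] * (len(A) + len(B) - 1)
    for x, a in enumerate(A):
        for y, b in enumerate(B):
            C[x + y] += a * b
    return C

two_dice = convolve(die, die)
assert abs(two_dice[7] - 6 / 36) < 1e-12    # Pr[sum = 7] = 6/36
assert abs(sum(two_dice) - 1) < 1e-12
```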

Composition of Generating Functions. Let a_n and b_n denote the number of ways to build two structures, respectively of types 1 and 2, on a n-element set where a_0 = 0 and b_0 = 1. Let c_n be the number of ways to divide the set [n] into an unspecified number of intervals, build a type-1 structure on each interval, then build a type-2 structure on the set of intervals. Set c_0 = 1. Denote by \mathsf{A}(x), \mathsf{B}(x),\mathsf{C}(x) the generating functions of the sequences \{a_n\}, \{b_n\}, \{c_n\}. Then \mathsf{C}(x) = \mathsf{B}\left(\mathsf{A}(x) \right).

The Gambler’s Ruin Problem. See here.

The Unbiased Random Walk. See here.

Sum of first binomial coefficients. Suppose k \leq n/2. Then,

\displaystyle \sum_{i=0}^k {n \choose i} \leq {n \choose k} \frac{n - k+1}{n - 2k + 1} .

This bound is due to Michael Lugo. (Aside: This implies that if k \leq n/3, the sum is at most 2 {n \choose k} .)

The above bound is within a factor of 1.5 of the true sum as long as k \leq 0.49\, n (for n \approx 5000). As k \rightarrow n/2, this multiplicative factor grows.

The following bound is due to Lovász. It works better when k \rightarrow n/2; it stays within a factor of 2 for n \approx 5000.

\displaystyle  \sum_{i=0}^{k-1} {n \choose i} \leq 2^{n-1} \exp\left(-\frac{(n - 2k - 2)^2}{4(n - k - 1)} \right) .
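A numerical comparison of the two bounds, with n = 5000 and k = 2000 chosen to match the remarks above (the Lovász bound is compared in log-space because the quantities have thousands of digits):

```python
from fractions import Fraction
from math import comb, log

n, k = 5000, 2000

# Lugo's bound on the sum of the first k+1 binomial coefficients.
lhs = sum(comb(n, i) for i in range(k + 1))
lugo = Fraction(comb(n, k)) * Fraction(n - k + 1, n - 2 * k + 1)
assert lhs <= lugo

# Lovász's bound covers the sum up to k-1; compare logs to avoid float overflow.
lhs2 = sum(comb(n, i) for i in range(k))
log_lovasz = (n - 1) * log(2) - (n - 2 * k - 2) ** 2 / (4 * (n - k - 1))
assert log(lhs2) < log_lovasz
```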

Shearer’s Inequality. Let S_1, \cdots, S_M \subseteq [N] be coordinate-sets such that each coordinate i \in [N] is covered by at least r of the S_j. Let X = (X_1, \cdots, X_N) be jointly distributed random variables (independence is not needed). The entropy of X is at most 1/r times the sum of the entropies of its restrictions to the coordinate sets. That is,

\displaystyle H(X_1, \cdots, X_N) \leq \frac{1}{r} \sum_{j=1}^M{ H\big(X\rvert_{S_j}\big)} .

Total Variation Distance of two distributions f, g on a domain X is their largest difference over all subsets of the domain. That is,

\displaystyle \Vert f - g \Vert_{tv} = \sup_{S \subseteq X} \vert f(S) - g(S) \vert.

Equivalently, it is half the \ell_1 distance:

\displaystyle \Vert f - g \Vert_{tv} = \frac{1}{2} \Vert f - g \Vert_1 = \frac{1}{2} \sum_{x \in X}{\vert f(x) - g(x) \vert}.

See here for a proof.

Pinsker’s Inequality. The Kullback-Leibler divergence of two distributions is at least twice the square of their total-variation distance. Namely,

\displaystyle D_\text{KL}(f \Vert g) \geq 2\Vert f - g \Vert_{tv}^2.

Kullback-Leibler Divergence. Given a probability distribution p on a discrete sample space X, let s(i ; p) := \log \frac{1}{p(i)} denote the “amount of surprise” associated with the event i according to the distribution p. The Entropy is the expected amount of surprise, namely

\displaystyle H(p) = \sum_{i \in X}{p(i) s(i; p)}.

The KL Divergence of q with respect to p is

  • the expected amount of surprise “remaining” in q after revealing p.
  • the relative entropy of p with respect to q.
  • the expectation (in p) of the pointwise-difference of the log-probabilities.

\displaystyle D_\text{KL}(p \Vert q) = \sum_i{ p(i) \left(s(i;q) - s(i;p) \right) } = \sum_i{ p(i) \log \frac{p(i)}{q(i)}}.
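A numerical check of these definitions together with Pinsker's inequality (the two distributions are arbitrary):

```python
from math import log

# Two arbitrary distributions on a 3-point space.
f = [0.5, 0.3, 0.2]
g = [0.4, 0.4, 0.2]

tv = 0.5 * sum(abs(a - b) for a, b in zip(f, g))    # half the L1 distance
kl = sum(a * log(a / b) for a, b in zip(f, g))      # D_KL(f || g), natural log
assert kl >= 2 * tv ** 2                            # Pinsker's inequality
```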

Markov Chains.

  • A finite Markov chain is aperiodic if the gcd of the lengths of the cycles through each state is 1; having self-loops suffices. It is irreducible (connected) if every state is reachable from every other. It is ergodic (i.e., it converges to a unique stationary distribution \pi from every starting distribution) if and only if it is aperiodic and irreducible. Let P = (p_{ij}) be the transition probability matrix of the chain. The chain is reversible if \pi_i p_{ij} = \pi_j p_{ji} for all i,j. A random walk is lazy if at each step it stays put with probability 1/2. A lazy, irreducible random walk is ergodic.
  • The mixing time \tau of a random walk is the number of steps t it takes for the distribution after t steps to be very close to the stationary distribution, from any initial distribution x.
  • An ergodic chain with a symmetric transition matrix mixes rapidly: if \lambda = \max\{\lambda_2(P), \vert \lambda_n(P) \vert\}, then

\displaystyle \Vert P^t x - \pi \Vert_2 \leq \lambda^t \Vert x \Vert_2.

  • The mixing time in the above case is \displaystyle O\left( \frac{\log \max_i{(1/\pi_i)} }{1 - \lambda}\right). For a d-regular graph, this simplifies to O\big((\log n) / (1-\lambda)\big).

See here and here for nice expositions.
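A minimal sketch: the lazy random walk on the cycle \mathbb{Z}_8 (an arbitrary choice) is irreducible and aperiodic, and iterating the transition matrix drives any starting distribution to the uniform stationary distribution:

```python
# Lazy random walk on the cycle Z_8: stay put with probability 1/2, else
# move to a uniformly random neighbor. The chain is irreducible and aperiodic,
# hence ergodic, with uniform stationary distribution.
n = 8
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    P[i][i] = 0.5
    P[i][(i + 1) % n] += 0.25
    P[i][(i - 1) % n] += 0.25

dist = [1.0] + [0.0] * (n - 1)          # start concentrated at state 0
for _ in range(100):
    dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

assert all(abs(p - 1 / n) < 1e-6 for p in dist)   # converged to uniform
```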


Graph Theory

Independent Set and Vertex Cover. A largest independent set is the complement of a minimum vertex cover. The vertex cover problem is the dual of the maximum matching problem. In a bipartite graph, the size of a maximum matching equals the size of a minimum vertex cover (Konig’s theorem).

Crossing Number. cr(G) is the minimum number of edge crossings over all plane drawings of a graph G. If m > 7n, then cr(G) \geq m^3/39n^2. This is the crossing number inequality.

Szemeredi-Trotter Theorem. Suppose we have n points P and m lines L in the plane. Consider the bipartite incidence graph G=(P,L,E) with an edge e =(p,\ell) whenever the point p lies on the line \ell. Then

\displaystyle \vert E \vert = O\big( ( m^2n^2 )^{1/3} + m + n \big) .


Algebraic Graph Theory

Expander Mixing Lemma. Let G be a d-regular graph on n vertices. Suppose the second-largest eigenvalue of its adjacency matrix is, in magnitude, at most \lambda. For any pair of vertex-sets A,B, let E_{A,B} be the set of “cross edges” between A and B (counted as ordered pairs). Let \#A denote the cardinality of the set A. Then, for all A, B \subseteq V,

\displaystyle \left \vert \# E_{A,B} - \frac{d}{n} \cdot \# A \#B \right \vert \leq \lambda \sqrt{\# A \cdot \# B} < \lambda n\,.

In the left-hand side, the first term is the number of cross edges. The second term is the expected number of cross edges in a random graph with edge-probability d/n. The left-hand side thus measures how much G deviates from this expectation. The expander mixing lemma says that if \lambda is small, G will be nearly random in this sense.
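A sanity check on the complete graph K_10, whose adjacency eigenvalues are 9 and -1 (so \lambda = 1 here); cross edges are counted as ordered pairs:

```python
import itertools

# K_10: adjacency eigenvalues are 9 (top) and -1, so lambda = 1. Cross edges
# are counted as ordered pairs (u, v) with u in A, v in B, and uv an edge;
# in K_10 every pair of distinct vertices is an edge.
n, d, lam = 10, 9, 1
A, B = set(range(4)), set(range(2, 8))
E_AB = sum(1 for u, v in itertools.product(A, B) if u != v)
deviation = abs(E_AB - (d / n) * len(A) * len(B))
assert deviation <= lam * (len(A) * len(B)) ** 0.5
```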

Random Walk on an Expander. Suppose G is a d-regular expander whose second-largest eigenvalue is at most \lambda in magnitude. Let W be a set containing a \rho fraction of vertices. The fraction of random walks of length \ell, on the vertices of G, that stay entirely within W, is small; in particular, it is at most

\displaystyle \rho \left( \rho + (1 - \rho) \frac{\lambda}{d} \right)^\ell\, .

Cheeger Inequality. Suppose G is an (n,d,\lambda_2)-expander i.e., a d-regular graph on n vertices with the second-largest eigenvalue of its adjacency matrix being \lambda_2. Since the largest eigenvalue is d, the quantity d-\lambda_2 is the spectral gap. The boundary of a vertex set S is denoted by \partial S = E_{S, \bar{S}}. The edge expansion of G is defined as

\displaystyle h(G) := \min_{S \subset V, \vert S \vert \leq n/2}\frac{\vert \partial S \vert}{\vert S \vert}.


The Cheeger inequality states that

\displaystyle \frac{d-\lambda_2}{2} \leq h(G) \leq \sqrt{ 2d (d-\lambda_2) }.

Equivalently,

\displaystyle \frac{h(G)^2}{2d} \leq d- \lambda_2 \leq 2 h(G).


Linear Algebra

A complex matrix A is unitary if A^* = A^{-1}. A real, unitary matrix is orthogonal. A is Hermitian if A^* = A. It is normal if A^*A = AA^*. A Hermitian matrix is Positive Semidefinite if all its eigenvalues are non-negative.

Cramer’s Rule. Suppose we want to solve a matrix-vector equation Ax = b. Let A_i denote the matrix obtained by replacing the ith column of the matrix A by b. Then,

\displaystyle x_i = \frac{\text{det}(A_i)}{\text{det} (A)}\, .
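A direct implementation for tiny systems (Laplace expansion for the determinant; the 2×2 system below is an arbitrary example):

```python
def det(M):
    # Laplace expansion along the first row; fine for tiny matrices.
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def cramer(A, b):
    d = det(A)
    solution = []
    for i in range(len(A)):
        # Replace column i of A by b, per Cramer's rule.
        A_i = [row[:i] + [b[r]] + row[i + 1:] for r, row in enumerate(A)]
        solution.append(det(A_i) / d)
    return solution

print(cramer([[2, 1], [1, 3]], [5, 10]))   # [1.0, 3.0]
```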

Spectral Decomposition.  A matrix A is Hermitian i.e., A = A^*, if and only if A has an orthonormal set of eigenvectors with real eigenvalues i.e., A = \sum{\lambda_i v_i v_i^*} where (v_i, \lambda_i) are an eigenvector-eigenvalue pair.

See here for a proof of the Hermitian case: orthogonal eigenvectors and real eigenvalues.

Normal Matrices. A matrix A is normal if it commutes with its conjugate transpose, i.e., A A^* = A^* A. This happens if and only if A is unitarily diagonalizable, i.e., there exists a diagonal D and unitary U such that A = U D U^*.

A block upper-triangular matrix is normal if and only if all on-diagonal blocks are normal and all off-diagonal blocks are zero.

Polar Decomposition. Every matrix A can be written as A = UR = LU where U is unitary, R,L are positive semidefinite, and L = \sqrt{AA^*}, R = \sqrt{A^* A}. Moreover, if A is invertible then U is unique.

Singular Value Decomposition. Every square matrix A can be written as A = U D V where U,V are unitary and D has non-negative real diagonal entries.

Proof: By polar decomposition, A = WR where W is unitary and R is positive semidefinite. By the spectral theorem, R = T D T^* for some unitary T and real diagonal D. Moreover, since R is PSD, its eigenvalues are non-negative; hence D has non-negative real entries. We set U := WT, V := T^*.

Cholesky Factorization. Every positive definite complex matrix A can be factored uniquely into A = L L^* where L is lower-triangular with real, positive diagonal entries. (A merely positive semidefinite A still factors this way, but with non-negative diagonal entries and without uniqueness.)

Schur Decomposition. Any complex square matrix A can be written as A = U T U^* where U is unitary and T is upper triangular.

Square Root. Every positive semidefinite matrix A can be factored into A = B^* B where B = \sqrt{A} is also positive semidefinite.

Hilbert-Schmidt Norm and Inner Product.

Let a_i and b_i denote the i-th columns of the matrices A and B. Then,

\displaystyle \langle A, B \rangle_{HS} := \mathrm{tr}(A^* B) = \sum_{i}{\langle a_i, b_i \rangle}.

\displaystyle \Vert A \Vert_{HS}^2 = \Vert A \Vert_{F}^2 := \mathrm{tr}(A^*A).


Trace of a matrix is the sum of its diagonal entries, which equals the sum of its eigenvalues.

\displaystyle \mathrm{tr}(AB) = \mathrm{tr}(BA).

\displaystyle \mathrm{tr}(A x x^*) = \mathrm{tr}(x^* A x) = \langle x \vert A \vert x \rangle.

Projection Matrices. Any orthogonal projection matrix P can be written as P = \sum{x x^*} where \{x\} is an orthonormal basis of the target subspace. P has eigenvalues in \{0,1\}. P is always Hermitian i.e. P^* = P, as evidenced by the sum formulation. P satisfies P^2 = P, P^*P = P.
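A quick check of idempotence for the rank-one projection onto a unit vector (the vector (3/5, 4/5) is arbitrary):

```python
# Rank-one projection P = u u^T onto the span of a unit vector u.
u = [3 / 5, 4 / 5]
P = [[u[i] * u[j] for j in range(2)] for i in range(2)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P2 = matmul(P, P)
# Idempotence P^2 = P; symmetry P^T = P holds by construction.
assert all(abs(P2[i][j] - P[i][j]) < 1e-12 for i in range(2) for j in range(2))
```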

Frobenius Norm. A generalization of the usual \ell_2 norm to matrices.

\displaystyle \Vert A \Vert_F^2 = \sum_{ij}{\vert a_{ij} \vert^2} = \mathrm{tr}(A^*A).

This norm, like the \ell_2 norm, is invariant under rotation (indeed, under any unitary change of basis).

Tensor Inner Product. The tensor inner product on the space V \otimes W is defined as follows:

\displaystyle \left\langle \sum_i{\alpha_i v_i \otimes w_i} \bigg\vert   \sum_j{\beta_j v^\prime_j \otimes w^\prime_j} \right\rangle \quad := \quad \sum_{ij}{ \alpha_i^* \beta_j \langle v_i \vert v^\prime_j \rangle \langle w_i \vert w^\prime_j \rangle } ,

where \alpha_i, \beta_j are scalars and v_i, v^\prime_j \in V, and w_i, w^\prime_j \in W.

Conjugate Gradient Method (for Solving Symmetric Linear Systems). Suppose we would like to solve Ax = b for x where A is a real symmetric positive definite matrix. The matrix A defines the inner product \langle x, y \rangle_A = \langle x \vert A \vert y \rangle. We can find a set of basis vectors P = \{p_1, \cdots, p_n\} spanning \mathbb{R}^n that are mutually orthogonal according to this inner product. Then, the solution x^* can be written as a linear combination \{ \alpha_i \} of these basis vectors.

In practice, the vectors \{p_i\} and corresponding \alpha_i can be found using a process similar to the Gram-Schmidt orthonormalization. Any subset of these basis vectors gives an approximation of x^*.

Facts about the Matrix Exponential. If X,Y commute, then e^X e^Y = e^{X+Y}. In addition, if X,Y commute then e^X, e^Y commute, as do \log X, \log Y.

Moreover, for any invertible Y, \displaystyle e^{Y X Y^{-1}} = Y e^X Y^{-1}. Next, \det(e^X) = e^{\mathrm{tr}(X)}. If D=\mathrm{diag}(a,b,\cdots) then e^D = \mathrm{diag}(e^a, e^b, \cdots).

Golden-Thompson inequality. For any Hermitian matrices X,Y,

\displaystyle \mathrm{tr}(e^{X+Y}) \leq \mathrm{tr}(e^X e^Y) = \mathrm{tr}\big(\sqrt{e^X}\, e^Y \sqrt{e^X}\,\big).

Lieb’s Theorem. Let H be Hermitian and A positive definite. The function

\displaystyle A \mapsto \mathrm{tr} \exp \big[H + \log A \big]

is concave.

Corollary (Tropp). Let H be fixed, X be random, both Hermitian. Then

\displaystyle \mathbb{E} \mathrm{tr} \exp(H+X) \leq \mathrm{tr} \exp(H + \log \left[ \mathbb{E} e^X \right]).

See Joel Tropp’s paper (Section 3.4).


Randomized Algorithms

Generating Pairwise Independent Random Variables. Choose a field F of size 2^n. Sample two uniformly random elements x_1, x_2 \in F using 2n uniformly random bits. Then the following 2^n random variables are uniform over F and pairwise independent:

\displaystyle Y_i := x_1 + i \cdot x_2 \qquad \forall i \in \{0, \cdots, 2^n - 1\}

See here for details.
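The same construction can be checked over a small prime field instead of GF(2^n) (p = 5 and the two indices below are illustrative assumptions): every output pair occurs for exactly one seed, which is precisely pairwise independence of uniform variables.

```python
from itertools import product

# Y_i = x1 + i*x2 over the prime field Z_p (the text uses GF(2^n); a prime
# field keeps the arithmetic simple). For any two distinct indices i != j,
# every output pair occurs for exactly one seed (x1, x2).
p = 5
i, j = 1, 3                                  # any two distinct indices
counts = {}
for x1, x2 in product(range(p), repeat=2):   # uniformly random seed
    pair = ((x1 + i * x2) % p, (x1 + j * x2) % p)
    counts[pair] = counts.get(pair, 0) + 1

assert len(counts) == p * p and all(c == 1 for c in counts.values())
```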

Pairwise Independent Hash Functions. Let H be a family of functions, each having n input bits and k output bits. Suppose the following holds: if we pick h \in H at random, then for every pair of distinct inputs x \neq x^\prime and every pair of outputs (y, y^\prime), the probability that h maps x to y and x^\prime to y^\prime is exactly 1/2^{2k}. Then H is a family of pairwise independent hash functions. In other words, for all distinct x,x^\prime \in \{0,1\}^n and all y, y^\prime \in \{0,1\}^k,

\displaystyle \Pr_{h \in_R H}[h(x)=y \wedge h(x^\prime) = y^\prime] = \Pr[h(x)=y] \cdot \Pr[h(x^\prime) = y^\prime] = 2^{-2k}.

Consequently, the probability that two distinct strings collide under a random h \in H is \sum_y{2^{-2k}} = 2^{-k}, exactly as for a uniformly random function. Intuitively, if we pick a random h \in H then for arbitrary distinct x, y in the domain, the strings h(x), h(y) are independent and uniformly distributed.

Valiant-Vazirani Lemma. Let \mathcal{H} be a collection of pairwise independent hash functions from \{0,1\}^n \rightarrow \{0,1\}^k. Let S \subseteq \{0,1\}^n such that \vert S \vert \in [K/4,K/2] where K = 2^k. Then, with constant probability, the string 0^k has a unique preimage in S under a randomly selected h \in \mathcal{H}.

Leftover Hash Lemma. Suppose a random n-bit string x has (min-)entropy at least n-t; concretely, suppose x is drawn uniformly from an arbitrary subset W \subseteq \{0,1\}^n of size at least 2^{n-t}. The leftover hash lemma says that it is possible to extract roughly n-t bits from x that are almost uniformly distributed. The way to do it is to choose a uniformly random seed s indexing a member h_s : \{0,1\}^n \rightarrow \{0,1\}^k of a pairwise-independent hash family H. Then the distribution of the concatenated string s \Vert h_s(x) will be at most \epsilon away from the uniform distribution, where \epsilon = \sqrt{ 2^k / \vert W \vert }.

Hitters. Fix a Boolean function \sigma on n-bit inputs. Let S be the set of inputs on which \sigma outputs 1, and suppose \vert S \vert = \epsilon 2^n. An efficient, randomized algorithm A, given only oracle access to \sigma, is a hitter for \sigma if, with probability at least 1 - \delta, it needs to sample at most m points from \{0, 1\}^n before it “hits S,” that is, it outputs an element from S. We can construct hitters, using random walks on expanders, which take m = O(\epsilon^{-1} \log(1/\delta)) samples and use only n + O(\log(1/\delta)) random bits. In contrast, sampling uniformly at random requires mn random bits. See the appendix of Goldreich’s survey for details.


Deviation/Moment Inequalities

Markov’s Inequality. Any nonnegative X with mean \mu satisfies \Pr[X \geq \alpha \mu ] \leq 1/\alpha for any positive \alpha.

Chebyshev’s Inequality. Any random variable X with mean \mu and variance \sigma^2 satisfies \Pr[ \vert X -\mu \vert > \alpha \sigma] \leq 1/\alpha^2 for any positive \alpha.

Note: While the Chernoff-Hoeffding inequality requires mutual independence of all random variables, pairwise independence suffices for Chebyshev’s inequality. See Nick Harvey’s notes here.

Paley-Zygmund Inequality. Let Z be a non-negative random variable with a finite variance. If \theta \in [0,1] then

\displaystyle \Pr\bigg[ Z > \theta \mathbb{E} Z \bigg] \geq (1-\theta)^2 \frac{(\mathbb{E} Z)^2}{\mathbb{E} Z^2} .

Note: The Paley-Zygmund inequality complements the Chebyshev inequality. While the latter says that large deviations are unlikely (“the tails are thin”), the former says that Z is not too likely to be much smaller than its mean (“the left tail is thin”).

Hoeffding’s Tail Inequality, Additive Deviation, Symmetric. The probability that the sum of n independent and identically distributed [a,b]-bounded random variables deviates at least \lambda away from its expectation is no more than \displaystyle \exp\left( - 2 \lambda^2 / n(a-b)^2\right).

Chernoff Bound, Multiplicative Deviation, Asymmetric. Let S be the sum of many independent Bernoulli trials. Let \mu = E[S] and let \delta be any number between zero and one. The probability that S is \delta \mu below the mean is at most \exp(-\delta^2\mu/2), while the probability that S is \delta\mu above the mean is at most \exp(-\delta^2\mu/3).

(Hoeffding-like statement.) Equivalently, the probability that S is \lambda below the mean is at most \exp(-\lambda^2/2\mu), while the probability that S is \lambda above the mean is at most \exp(-\lambda^2/3\mu).

Note. When \lambda is large, the normal approximation of the binomial distribution is less accurate for the positive tail. See the appendix of Alon-Spencer for a commentary.
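A numerical check of the upper-tail form against the exact binomial tail (n = 200 and \lambda = 20 are arbitrary):

```python
from math import comb, exp

# Exact upper tail of S ~ Bin(n, 1/2) versus the bound exp(-lambda^2 / (3*mu)).
n = 200
mu = n / 2
lam = 20
tail = sum(comb(n, s) for s in range(int(mu + lam), n + 1)) / 2 ** n
assert tail <= exp(-lam ** 2 / (3 * mu))
```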


[Plots omitted.] They showed the Chernoff-tail and the Hoeffding-tail for both 0-1 and \pm 1 random variables: the bound on the probability that the sum deviates tn away from the expectation, for t \in [-0.1, 0.1] and n=1000.

Rudelson-Vershynin Theorem on the Operator Norm of Random Projections.  Let x_1, \cdots, x_n be n independent copies of a random vector x \in \mathbf{R}^d with \mathbf{E}\, xx^T = I and magnitude \Vert x \Vert_2 at most R. Let \mu := (\sum_i{x_ix_i^T})/n. For any \lambda \in(0,1),

\displaystyle \Pr\left[  \left\Vert \mu - I \right\Vert_2 > \lambda \right] < 2d \exp \left( -n\lambda^2/4R^2 \right).

Note: We can think of xx^T as the projection operator along x scaled by \Vert x \Vert_2^2. Then \mu acts as the mean of these scaled projection operators. The theorem says that the largest distortion inflicted by \mu, relative to the identity, on any vector is no more than \lambda with high probability. Notice also the similarity with Hoeffding’s tail inequality, which in the one-dimensional case would give the exponent -2(n\lambda)^2/nR^2=-2n\lambda^2/R^2. We lost a factor of 8 somewhere.

Azuma-Hoeffding Inequality for Martingales. Let X_0, \cdots, X_n be real-valued random variables with the martingale property i.e., \mathbf{E} \left[ X_{t+1} \mid X_0, \cdots, X_t \right] = X_t and the bounded-difference property i.e., \vert X_t-X_{t-1} \vert \leq c for some absolute constant c. Then, the probability that the difference X_n - X_0 is more than \lambda is at most \displaystyle \exp \left( -\lambda^2/2nc^2 \right).

Khintchine Inequality. Let \varepsilon be uniformly random in \{\pm 1 \}^n and let x \in \mathbb{C}^n be arbitrary. For any 0 < p < \infty, let \Vert x \Vert_p^p := \sum_i{ \vert x_i \vert }^p, and let m(Z,p) denote the pth absolute moment of a random variable Z. Define X := \vert \sum_i{\varepsilon_i x_i} \vert.  Then,

\displaystyle \bigg( m(X, p) \bigg)^{1/p}  = \Theta\bigg( \Vert x \Vert_2 \bigg) ,

where the constants hidden by the \Theta notation depend only on p.

Marcinkiewicz-Zygmund Inequality. Let x_1, \cdots, x_n \in \mathbb{C} be independent random variables with zero mean and bounded pth moment for any 1 \leq p < \infty. Let m(Z,p) be the pth moment of Z.  Then,

\displaystyle m\bigg(\sum_i{x_i}, p \bigg) = \Theta\bigg(m\left( \Vert x \Vert_2,p \right) \bigg) ,

where the constants hidden under the \Theta notation depend only on p.


Complexity Theory

Goldreich-Levin Theorem. Every injective one-way function f has a corresponding hard-core Boolean predicate b which is hard to predict given f(x): predicting b is effectively inverting f, which is infeasible when f is an injective one-way function.

Lipton’s Random Self-Reducibility Theorem. A problem P is called “random self-reducible” if you can solve P on an arbitrary input by evaluating P on a sequence of uniformly random inputs. Suppose you have a black-box to solve P on a random input with high probability. Then you can construct a randomized algorithm that can solve P on every input with high probability.

Yao’s Min-Max Lemma. You and I are playing a two-player game. You are the input player and I am the algorithm player. Your move is to select a distribution \mathcal{D} over inputs \mathcal{X} = \{x\}. My move is to select a distribution \mathcal{R} over algorithms \mathcal{A} = \{A\}. Both \mathcal{A}, \mathcal{X} have finite size. Your goal is to make cost(A,x) high, while my goal is to make it as low as possible. The best value for you is

\displaystyle \max_{\mathcal{D}} \min_{A \in \mathcal{A}} \underset{x \sim \mathcal{D}}{\mathbb{E} } cost(A, x) ,

while the best value for me is

\displaystyle \min_{ \mathcal{R} } \max_{x \in \mathcal{X}} \underset{A \sim \mathcal{R} }{\mathbb{E} } cost(A, x).

Yao’s min-max lemma says that the two values are equal.

BPP \subseteq P/poly.

Proof: Suppose there is an algorithm A^\prime which decides a language in BPP using a random string r of length poly(n). Running A^\prime several times and taking a majority vote gives an algorithm A which uses a random string R of length poly(n) but errs on a given input string of length n with probability at most e^{-n}, thanks to the Chernoff bound. For a random R, the probability that A(\cdot, R) is incorrect on some input is, by a union bound, at most 2^n/e^n < 1. Then by the probabilistic method, there exists a string R^* so that A(x, R^*) is correct on every input; hard-wiring R^* as advice gives the P/poly simulation.

Communication Complexity of Equality. The two parties must communicate \Omega(k) bits, deterministically, to compute the Boolean predicate f(x,y) : x \overset{?}{=} y, where x, y \in \{0,1\}^k. (With randomness, O(\log k) bits suffice.)

SAT is reducible to 3SAT. It suffices to formulate a 3-CNF formula equisatisfiable with a clause with four variables, say C = (a \vee b \vee c \vee d); longer clauses follow by iterating. We introduce a dummy variable e and write C^\prime = (a \vee b \vee e) \wedge (c \vee d \vee \bar{e}). Clearly, if there is an assignment which makes C true, we can choose a value for e so that both clauses of C^\prime are true. On the other hand, if an assignment makes C false, then at least one clause of C^\prime will be false regardless of which value we choose for e.

Every n-variable Boolean Function has a CNF Formula of size n 2^n. Consider an n-variable Boolean function f(x), x \in \{0,1\}^n. For every assignment a \in \{0,1\}^n, we construct an “indicator clause” C_a that is false at a but true at every other assignment. For example, if a = 010, the clause is C_a = (x_1 \vee \bar{x_2} \vee x_3). Then, the desired CNF formula C is the AND of the indicator clauses C_a over all “bad” assignments a with f(a) = 0. This formula outputs zero exactly when f(x) = 0, and one otherwise. It uses at most 2^n clauses, each with n-1 OR symbols; these clauses are collected using at most 2^n - 1 AND symbols. Thus the total number of symbols is at most (n-1)2^n + (2^n - 1) = n 2^n - 1.
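A sketch of the construction, checked against a 3-variable XOR (the encoding of literals as index/sign pairs is an implementation choice):

```python
from itertools import product

def cnf_for(f, n):
    # One indicator clause per falsifying assignment a; the clause contains
    # literal x_i (sign = 1) if a_i = 0 and its negation (sign = 0) if a_i = 1,
    # so the clause is false exactly at a.
    return [tuple((i, 1 - a[i]) for i in range(n))
            for a in product([0, 1], repeat=n) if f(a) == 0]

def eval_cnf(clauses, x):
    return int(all(any(x[i] if sign else 1 - x[i] for i, sign in c)
                   for c in clauses))

xor3 = lambda a: a[0] ^ a[1] ^ a[2]
clauses = cnf_for(xor3, 3)
assert all(eval_cnf(clauses, x) == xor3(x) for x in product([0, 1], repeat=3))
```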

Cryptography


Shamir’s (t,n)-Threshold Secret Sharing. Suppose you have a secret \sigma, an element of a field F with more than n elements. Choose a degree-(t-1) polynomial P(x) = \sum_i{a_i x^i} over F with a_0=\sigma and the remaining coefficients uniformly random, and create n evaluations of P at distinct non-zero points of F. Any t of these evaluations can reconstruct P via Lagrange interpolation and thus recover \sigma = a_0 = P(0). Any smaller subset of evaluations is consistent with every possible value of a_0, and thus gives no information about the secret \sigma.

Note. See the similarity with the Reed-Solomon code.
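A toy implementation over the prime field \mathbb{Z}_p (the prime p = 2^31 - 1, the parameters t = 3, n = 5, and the secrets are illustrative assumptions):

```python
import random

# Toy Shamir (t, n) sharing over Z_p; p = 2**31 - 1, t = 3, n = 5 are
# illustrative choices.
p = 2 ** 31 - 1
t, n = 3, 5

def share(secret):
    coeffs = [secret] + [random.randrange(p) for _ in range(t - 1)]
    return [(i, sum(c * pow(i, j, p) for j, c in enumerate(coeffs)) % p)
            for i in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers a_0, the secret.
    secret = 0
    for idx, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for jdx, (xj, _) in enumerate(shares):
            if idx != jdx:
                num = num * (-xj) % p
                den = den * (xi - xj) % p
        secret = (secret + yi * num * pow(den, -1, p)) % p
    return secret

shares = share(42)
assert reconstruct(shares[:t]) == 42       # any t shares suffice
assert reconstruct(shares[2:2 + t]) == 42
```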

Feldman’s Verifiable (t,n)-Secret Sharing. Suppose you want to share a secret \sigma \in \mathbb{Z}_q.
Initialization: Make p, q, g public, where g generates a subgroup of prime order q in \mathbb{Z}_p^*.

Share generation: Like Shamir’s secret sharing scheme, choose a random polynomial P(x) = \sum_j{a_j x^j} of degree t-1 over \mathbb{Z}_q, set a_0=\sigma, the secret, and distribute shares \sigma_i = P(i) \bmod q to each user i \in [n].

Commitments: Create commitments c_j = g^{a_j} to each coefficient of P(x), and make them public.

Verification: The user i can verify whether

g^{P(i)} \overset{?}{=}\prod_j{c_j^{i^j}}\, because

g^{P(i)} = g^{\sum_j{a_j i^j}} = \prod_j{g^{a_j i^j}} = \prod_j{c_j^{i^j}}.

Reconstruction: Lagrange interpolation from any t shares.
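A toy numeric check of the verification identity, with an insecure 5-bit prime and illustrative coefficients of my choosing:

```python
# Toy check of the verification identity g^{P(i)} = prod_j c_j^{i^j}.
p, g = 23, 5                     # 5 generates Z_23^*
coeffs = [9, 3, 7]               # a_0 = 9 is the secret; degree t-1 = 2
commitments = [pow(g, a, p) for a in coeffs]   # c_j = g^{a_j}

def share(i):
    # P(i), kept as a plain integer: exponents of g live modulo ord(g),
    # which is why Feldman really works in a prime-order subgroup.
    return sum(a * i ** j for j, a in enumerate(coeffs))

def verify(i, sigma):
    lhs = pow(g, sigma, p)                      # g^{P(i)}
    rhs = 1
    for j, c in enumerate(commitments):
        rhs = rhs * pow(c, i ** j, p) % p       # prod_j c_j^{i^j}
    return lhs == rhs

assert all(verify(i, share(i)) for i in range(1, 6))
assert not verify(2, share(2) + 1)              # a corrupted share fails
```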

Discrete-Log Setting (G,g,q,p). Let g be a generator of a multiplicative group of prime order q. An easy way to get one is to select a prime q such that p = 2q+1 is also prime. Then, \mathbb{Z}_p^* has a subgroup G of order q, and every element of G other than 1 is a generator. In particular, we can take g to be a random non-identity element of G.

Discrete-Log Assumption. Given g and g^a, it is hard to find a.

Decisional Diffie-Hellman (DDH) Assumption. Given g, g^a, g^b, it is hard to distinguish g^{ab} from random.

Computational Diffie-Hellman (CDH) Assumption. Given g, g^a, g^b, it is hard to compute g^{ab}.

Bit Commitment from One-Way Permutations. Suppose you want to commit to a bit b. Initialization: Let h be a hard-core predicate for a strong one-way permutation f. Commit: You sample a random string r and send \big( f(r), h(r) \oplus b \big) to the receiver. Reveal: You send b and r.

This scheme is “computationally hiding” because to predict b one has to predict the hard-core h(r) using f(r), which is “difficult” by definition. It is “perfectly binding” because, f being a permutation, there can be no r^\prime \neq r such that f(r) = f(r^\prime).

Bit Commitment from One-Way Functions. Suppose you want to commit to a bit b.
Initialization: Let G : \{0,1\}^n \rightarrow \{0,1\}^{3n} be a PRG. The receiver sends you a random 3n-bit string r.

Commit: You sample a random n-bit seed s. If b == 0 you send G(s) to the receiver; otherwise, you send G(s) \oplus r.

Reveal: You send b and s to the receiver.

Recall that one-way functions imply PRGs. The definition of a PRG implies that this scheme is computationally hiding. To cheat, you would have to find a pair (s, s^\prime) such that G(s) \oplus G(s^\prime) = r, so that the same commitment opens to both bits. The number of (s, s^\prime) pairs is 2^{2n}, while r is drawn from a space of size 2^{3n}. Hence the probability (over the receiver’s choice of r) that such a pair exists is at most 2^{-n}. Hence this scheme is statistically binding.

Pedersen’s Bit Commitment Scheme. Suppose a sender wants to commit to a bit b.
Initialization: Everyone agrees on a Discrete-Log setting (G,g,q,p). The receiver sends a random h \in G.

Commit: The sender selects a random r \in \mathbb{Z}_q. The commitment is c = g^r h^b \bmod p.

Reveal: The sender sends (r,b) to the receiver, and he verifies whether c \overset{?}{=} g^r h^b \bmod p.

The scheme is information-theoretically hiding; since g^r is uniformly random in G, c is also uniformly random regardless of b. The scheme is computationally binding: to cheat, one has to find two different opening pairs, say (r_0, 0) and (r_1, 1), such that c_0 = c_1 where c_0 = g^{r_0} \bmod p and c_1 = g^{r_1} h \bmod p. Suppose you can extract such a pair r_0, r_1 from the commitments c_0, c_1 and the common knowledge g, h. Since h = g^a for some a \in \mathbb{Z}_q, you can solve the equation c_0 = c_1, or h g^{r_1} = g^{r_0}, or h=g^{r_0-r_1}, or a = r_0 - r_1 \bmod q. But this contradicts the Discrete-Log assumption, since now you know a, the discrete log of h with respect to g. See here for more schemes. Here’s the original paper.
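A toy sketch of the commit/verify flow. The parameters p = 23, q = 11 are for illustration only, and here we know the discrete log of h, which a real sender must not:

```python
# Toy Pedersen commitment in the order-11 subgroup of Z_23^*.
# In practice h comes from the receiver, with log_g(h) unknown to the sender.
p, q = 23, 11
g = 4                    # 4 generates the order-11 subgroup
h = pow(g, 7, p)         # a = 7 is known to us here, never to a real sender

def commit(b, r):
    return pow(g, r, p) * pow(h, b, p) % p

def verify(c, b, r):
    return c == commit(b, r)

c = commit(1, 5)
assert verify(c, 1, 5)
assert not verify(c, 0, 5)    # the same c cannot be opened to the other bit
```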

Chaum-Pedersen Discrete Log Proof-of-Knowledge Scheme.
Setup: Fix a multiplicative group G of a prime order q.  Suppose you know some element \alpha \in \mathbb{Z}_q  and two elements h_1, h_2 \in G such that h_1 = g_1^\alpha and h_2 = g_2^\alpha for two independently chosen generators g_1, g_2. Publish g_1, h_1, g_2, h_2.

Goal: you want to convince someone about your knowledge of \alpha without revealing what \alpha is.

Interaction: You select a random exponent w \in \mathbb{Z}_q and give the verifier two elements a_i = g_i^w of G for i \in [2]. Then the verifier sends you a random element c \in \mathbb{Z}_q as a “challenge”. You then reply with a “response” r = w - \alpha c \bmod q.

Verification: The verifier then checks whether a_i \overset{?}{=} g_i^rh_i^c; for an honest prover, g_i^rh_i^c = g_i^{r+\alpha c} = g_i^w = a_i.
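One honest round of the protocol can be checked numerically in a toy group (insecure parameters, chosen for illustration):

```python
import random

# One honest round in the order-11 subgroup of Z_23^* (toy parameters).
p, q = 23, 11
g1, g2 = 4, 9                 # two generators of the subgroup
alpha = 6                     # the prover's secret
h1, h2 = pow(g1, alpha, p), pow(g2, alpha, p)

w = random.randrange(q)                    # prover's random exponent
a1, a2 = pow(g1, w, p), pow(g2, w, p)      # commitment
c = random.randrange(q)                    # verifier's challenge
r = (w - alpha * c) % q                    # response

# Verification: a_i ?= g_i^r h_i^c, which equals g_i^{r + alpha c} = g_i^w.
assert a1 == pow(g1, r, p) * pow(h1, c, p) % p
assert a2 == pow(g2, r, p) * pow(h2, c, p) % p
```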

Note: The Fiat-Shamir scheme turns a public-coin interactive proof-of-knowledge of the discrete log into a non-interactive proof-of-knowledge where the prover, instead of receiving random coins from the verifier, queries a random oracle to get the random bits himself. However, Goldwasser and Kalai have shown that if Fiat-Shamir is used to build digital signatures, there are schemes for which substituting the random oracle with any concrete “hash function” results in an insecure digital signature scheme.

RSA Encryption Scheme.
Setup: Select two large primes p,q and compute N = pq. Let \phi(N) = ord(\mathbb{Z}_N^*) = (p-1)(q-1). Pick an arbitrary element e in the multiplicative group \mathbb{Z}_{\phi(N)}^*. This means e is coprime to, and invertible modulo, \phi(N). Let d be this inverse. (N,e) is the public key, d is the private key.

Encryption: message m \bmod N is encrypted as c = m^e \bmod N.

Decryption: c \mapsto c^d \bmod N = m^{ed} \bmod N = m \bmod N since ed = 1 \bmod \phi(N).
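A textbook-sized sketch (the primes 61 and 53 are classic toy values, far too small for real use):

```python
# Textbook RSA with toy primes (real keys use primes of 1024+ bits).
p, q = 61, 53
N = p * q                    # 3233
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # coprime to phi(N)
d = pow(e, -1, phi)          # private exponent: e*d = 1 mod phi(N)

m = 65
c = pow(m, e, N)             # encrypt
assert pow(c, d, N) == m     # decrypt recovers m
```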

Diffie-Hellman Key Exchange. You and I publicly agree on a prime p and a generator g of the group G=\mathbb{Z}_p^*. I randomly pick a private exponent a \in \{1, \cdots, p-2\} and you randomly pick your private exponent b from the same range. I send you A = g^a \bmod p in public, while you send me B = g^b \bmod p in public. I construct S = B^a = g^{ab} while you construct A^b = g^{ab}=S. This S is our shared secret, assuming that computing g^{ab} from g^a and g^b is hard.
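A toy run of the exchange (illustrative group size only):

```python
import random

# Toy Diffie-Hellman over Z_23^* (insecure size, for illustration).
p, g = 23, 5                       # 5 generates Z_23^*
a = random.randrange(1, p - 1)     # my private exponent
b = random.randrange(1, p - 1)     # your private exponent
A = pow(g, a, p)                   # exchanged in public
B = pow(g, b, p)                   # exchanged in public
S_mine, S_yours = pow(B, a, p), pow(A, b, p)
assert S_mine == S_yours           # both equal g^{ab} mod p
```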

ElGamal Encryption Scheme. First, you and I publicly agree on a prime p and a generator g of the group G=\mathbb{Z}_p^*. Then we construct a shared secret s via the Diffie-Hellman key exchange protocol. You encrypt a message m \in G as c = m\cdot s. Since G is cyclic, I can compute the inverse of s, call it s^{-1}. Finally, I can decrypt your cyphertext as c \mapsto cs^{-1} = mss^{-1} = m.

ElGamal Digital Signature Scheme.
Initialization: Select a large prime p and a generator g of the group G = \mathbb{Z}_p^*. Select a random secret element x \in \{2, \cdots, p-3 \} and set y := g^x. Let H be a cryptographically secure hash function.

Signing: To sign a message m, randomly select some k \in \{2,\cdots, p-2\} such that k is invertible modulo p-1. Let r:=g^k \bmod p. Compute an s such that H(m) \equiv xr + ks \bmod p-1. If s=0, start over. The tuple (r,s) is the signature of m.

Verification: Check that 0 < r < p, 0 < s < p-1, and r^sy^r \equiv g^{H(m)} \bmod p. For a correct signature, the LHS is r^sy^r = g^{ks}g^{xr} = g^{H(m)}.

Schnorr Digital Signature Scheme.
Setup: Everyone agrees on the Discrete-Log setting (G,g,q,p). The signer’s secret signing key is some x \in \mathbb{Z}_q, and his public verification key is v = g^x. Everyone has access to a hash function H.

Signing: To sign the message M, the signer chooses a random k \in \mathbb{Z}_q, computes r = g^k, computes the hash h = H(r \Vert M) where \Vert denotes concatenation. The pair (k - x h, h) is the signature for M where, recall, that x is the secret signing key.

Verification: Given M and a purported signature (s, h^\prime), one computes \displaystyle r^\prime = g^s v^{h^\prime} = g^{s + x h^\prime}; and verifies that \displaystyle h^\prime \overset{?}{=} H(r^\prime \Vert M).
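A sketch with toy parameters; the hash-into-\mathbb{Z}_q construction below is an illustrative choice of mine, not part of the original scheme:

```python
import hashlib
import random

# Toy Schnorr signature in the order-1019 subgroup of Z_2039^*
# (illustrative sizes only; g = 4 generates the subgroup).
p, q, g = 2039, 1019, 4
x = random.randrange(1, q)        # secret signing key
v = pow(g, x, p)                  # public verification key v = g^x

def H(r, M):
    """Hash of r || M, reduced into Z_q (an illustrative instantiation)."""
    digest = hashlib.sha256(f"{r}|{M}".encode()).digest()
    return int.from_bytes(digest, "big") % q

def sign(M):
    k = random.randrange(1, q)
    r = pow(g, k, p)              # r = g^k
    h = H(r, M)
    return ((k - x * h) % q, h)   # signature (s, h)

def verify(M, s, h):
    r_prime = pow(g, s, p) * pow(v, h, p) % p   # g^{s + x h} = g^k when honest
    return h == H(r_prime, M)

s, h = sign("hello")
assert verify("hello", s, h)
```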

Hard-Core Predicate. An easy-to-compute Boolean predicate b : \{0,1\}^n \rightarrow \{0,1\} is a hard-core for a function f if no efficient algorithm, given f(x), can predict b(x) with probability non-negligibly better than 1/2.

Weak One-Way Functions Imply Strong Ones. If f is a weak one-way function, then for a suitably large polynomial t(n), the direct product g(x_1, \cdots, x_t) = \big( f(x_1), \cdots, f(x_t) \big) is a strong one-way function.

Every Strong One-Way Function has a Hard-Core Predicate. The Boolean inner product x \odot r is a hard-core predicate of the function g(x,r) := \big( f(x), r \big) where f is a strong one-way function and r is a uniformly random string of the same length as x.

Intuitively, you have to invert f to predict the inner product.

Oblivious Transfer. The context is a 1-out-of-N multiplexer. The sender holds N values \{x_i\}. The receiver holds an index b. The receiver wants to know x_b without the sender knowing b. The sender wants that the receiver knows nothing other than x_b.

1-out-of-2 Oblivious Transfer via Public Key Encryption. This is a variant of Even-Goldreich-Lempel. The sender chooses a random r and sends it over. The receiver chooses a valid key-pair (sk_b, pk_b) \leftarrow (x, g^x) and an “orphan” public key pk_{1-b} \leftarrow r \cdot pk_b^{-1}. The idea is that the sender, upon receiving pk_0, pk_1, verifies that pk_0 \cdot pk_1 = r, so that the “other” public key, pk_{1-b}, is forced to a value whose secret key the receiver cannot know. Next, the sender encrypts two values: for i \in \{0,1\}, he sends c_i \leftarrow Enc_{pk_i}(x_i) to the receiver. Then the receiver can find x_b \leftarrow Dec_{pk_b}(c_b).

This gives computational security against the sender (breaking the encryption scheme) and the receiver as well (recovering the secret key from a given public key).

1-out-of-2 Oblivious Transfer via DDH Assumption. The sender and receiver agree on two generators g_0, g_1. The receiver chooses a random exponent r. He then sends the following to the sender:

h_0 \leftarrow g_0^r \qquad h_1 \leftarrow g_1^{r+b} .

The sender samples four random exponents a_0, a_1, b_0, b_1. Then he creates two “encryptions” c_i \leftarrow (A_i, B_i C_i) where

A_i = g_0^{a_i} g_1^{b_i} \qquad B_i = h_0^{a_i} and

C_0 = h_1^{b_0} x_0 \qquad C_1 = h_1^{b_1}g_1^{-b_1}x_1 .

The receiver, upon receiving c_b = (A_b, B_b C_b), can recover

x_b \leftarrow B_b C_b A_b^{-r} .

Note that if b = 0, then B_0 C_0 A_0^{-r} =  h_0^{a_0} \left( h_1^{b_0} x_0 \right) \left( g_0^{-r a_0} g_1^{-r b_0} \right) =g_0^{r a_0}  g_1^{r b_0} x_0 g_0^{-r a_0} g_1^{-r b_0} = x_0 .

On the other hand, if b = 1, then B_1 C_1 A_1^{-r} =  h_0^{a_1} \left( h_1^{b_1} g_1^{-b_1} x_1 \right) \left( g_0^{-r a_1} g_1^{-r b_1} \right) = g_0^{r a_1}  \left( g_1^{(r+1)b_1} g_1^{-b_1} x_1 \right) \left( g_0^{-r a_1} g_1^{-r b_1} \right) = x_1 .

This gives information-theoretic security against the receiver since all structure is lost in c_{1-b}. On the other hand, this gives computational security (i.e., under the DDH assumption) against the sender: to learn b, the sender must decide from h_0 = g_0^r and h_1 = g_1^{r+b} whether the two exponents are equal, which is precisely a DDH instance.
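The recovery algebra above can be checked numerically in a toy subgroup (parameter names follow the text; the group is far too small to be secure):

```python
import random

# Toy run of the DDH-based OT in the order-11 subgroup of Z_23^*.
p, q = 23, 11
g0, g1 = 4, 9                        # two generators of the subgroup
x = [7, 15]                          # the sender's values x_0, x_1
b = random.randrange(2)              # the receiver's choice bit

r = random.randrange(q)              # receiver's secret exponent
h0, h1 = pow(g0, r, p), pow(g1, r + b, p)

# Sender: random exponents a_i and e_i (the b_i of the text);
# "encryptions" c_i = (A_i, B_i * C_i).
a = [random.randrange(q), random.randrange(q)]
e = [random.randrange(q), random.randrange(q)]
A = [pow(g0, a[i], p) * pow(g1, e[i], p) % p for i in range(2)]
B = [pow(h0, a[i], p) for i in range(2)]
C = [pow(h1, e[0], p) * x[0] % p,
     pow(h1, e[1], p) * pow(g1, -e[1] % q, p) * x[1] % p]

# Receiver recovers x_b = (B_b C_b) * A_b^{-r}.
recovered = B[b] * C[b] % p * pow(pow(A[b], r, p), -1, p) % p
assert recovered == x[b]
```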

Edit | Back to top

Distributed Computing

Byzantine Agreement. Micali’s BA protocol works on a completely asynchronous network when at least 2/3 of the users are honest, and halts after 9 rounds in expectation. It is player-replaceable in the sense that players can be removed/corrupted immediately after sending a message.

The Micali-Vaikuntanathan protocol works on a synchronous network, and allows the honest fraction to be as low as \alpha = 1/2 + \delta, \delta> 0. With probability 1 - \epsilon, the number of rounds is at most

\displaystyle \frac{\log 1/\epsilon}{\log \frac{2}{2-\alpha}}\, .

The Impossibility of Byzantine Agreement. It is impossible to attain Byzantine Agreement among deterministic players in an asynchronous network.

Edit | Back to top


Newton’s Method. Suppose you want to find a root of a locally-convex function f(x) in the vicinity of the initial point x_0. The root will be approached in a sequence of points x_0, x_1, x_2, \cdots. Suppose you already know x_n. You make a linear approximation of f(x) at x = x_n as follows: y = f(x_n) + (x_{n+1} - x_n) f^\prime(x_n). Because you want to find a root of f, you set y = 0 and solve for

\displaystyle x_{n+1} = x_{n} - \frac{f(x_n)}{f^\prime(x_n)} .

Because the approximation is linear, the error decreases quadratically as you approach the true root x = \alpha. In particular, Taylor expansion of f around x_n shows that \alpha - x_{n+1} = -(\alpha - x_n)^2 \, f^{\prime \prime}(x_n)/\big(2 f^\prime(x_n)\big) + O\big((\alpha - x_n)^3\big).
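A minimal implementation of the iteration, here used to approximate \sqrt{2}:

```python
def newton(f, fprime, x0, iters=8):
    """Iterate x <- x - f(x)/f'(x) starting from x0."""
    x = x0
    for _ in range(iters):
        x -= f(x) / fprime(x)
    return x

# Root of f(x) = x^2 - 2 starting near it; convergence is quadratic.
root = newton(lambda t: t * t - 2, lambda t: 2 * t, x0=1.0)
assert abs(root - 2 ** 0.5) < 1e-12
```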

Edit | Back to top


Taylor Series of a real function f(x) around the point a.

\displaystyle f(x) = \sum_{k=0}^{\infty}  f^{(k)}(a) (x-a)^k / k! .

\displaystyle \log (1-x) = -x -x^2/2-x^3/3-\cdots
\displaystyle \log(1+x) = x - x^2/2 + x^3/3 - x^4/4 + \cdots

Edit | Back to top


(1-x)^{1-x} > \exp(-x+x^2/2) when x \in (0, 1)

The exponential function is steep above zero and flat below zero.

e^x \geq 1+x for x\in \mathbb{R}.

Concave functions are subadditive. Suppose f(x) is concave with f(0) \geq 0, and \alpha \in [0,1]. Then

f(\alpha x) \geq \alpha f(x) ,
f(a)+f(b) \geq f(a+b) .

Cauchy-Schwarz Inequality. The inner product is at most the product of the individual lengths.

\displaystyle \vert \langle x \vert y \rangle \vert \leq \Vert x \Vert_2 \cdot \Vert y \Vert_2.

AM-GM Inequality.  Let \{w_i\} denote a convex combination i.e., \sum{w_i}=1, w_i \geq 0. Then

\displaystyle \sum_i{w_i a_i} \geq \prod_i{a_i^{w_i}} .

Equality occurs iff a_1 = a_2 = \cdots .

Power Mean Inequality. For a vector x \in \mathbb{R}^n and positive real p, define \displaystyle \Vert x \Vert_p = \left(\sum_i{x_i^p}\right)^{1/p}. For positive real p \geq q,

\displaystyle \left( \frac{\Vert x \Vert_p^p}{n} \right)^{1/p} \geq \left( \frac{\Vert x \Vert_q^q}{n} \right)^{1/q}.

In other words, \Vert x \Vert_p is “not too short:” for any t > 0,

\displaystyle \Vert x \Vert_{p+t} < \Vert x \Vert_p < \Vert x \Vert_{p+t} \ n^{\frac{t}{p(p+t)}}.

A Logarithmic Inequality. Let f:x \mapsto \log \dfrac{1}{1-x}. Then

\displaystyle y f(x) \leq x f(y) \qquad 0 < x \leq y \leq 1 .

Proof: Let f be a convex function satisfying f(a) = 0. Let x \in (a, b) so that x = \alpha a + \beta b for some \alpha, \beta >0, \alpha + \beta = 1. By Jensen, we have f(x) \leq \alpha f(a) + \beta f(b) = \beta f(b) since f(a)=0 by assumption.

Now consider the function f : x \mapsto -\log(1-x) which is convex in [0,1). For this function, a=0 i.e., f(0)=0. This means for any x, we are free to choose any b \geq x so that x = \alpha \cdot 0 + (x/b)\cdot b. It follows that f(x) \leq (x/b) f(b), or b f(x) \leq x f(b). Writing y \leftarrow b yields the above inequality.

Edit | Back to top

Quantum Computing

State Vectors. n qubits can specify N=2^n indices, one for each coordinate of an N-dimensional state vector in \mathbb{C}^N. An orthonormal basis of an n-qubit system is the so-called “computational basis” \{ \vert k \rangle\}, k \in \{0,1,2,\cdots, N-1\}, where \vert k \rangle is simply the standard basis vector e_k \in \mathbb{C}^N.

Pauli Matrices/Gates. Let \mathcal{P} = \{X,Y,Z\} where

X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \qquad Y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}\qquad Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}.

For each M \in \mathcal{P}, it holds that

\displaystyle M^* = M \text{ (Hermitian)},\qquad M^2 = I \text{ (involutory)}, \qquad M^* M = I\text{ (unitary)} .

\mathrm{tr}(M) = 0 and \mathrm{det}(M) = -1.

Their commutators have a nice structure:

[X,Y] = 2iZ \qquad [Y,Z] = 2iX \qquad [Z,X] = 2iY.

The eigenvalues of these matrices are \lambda(M) = \pm 1. The corresponding eigenvectors v_M(\lambda) are:

\displaystyle v_X(\lambda) = \begin{pmatrix} \lambda \\ 1 \end{pmatrix}/\sqrt{2} .

\displaystyle  v_Y(\lambda) = \begin{pmatrix} -\lambda i \\ 1 \end{pmatrix}/\sqrt{2} .

\displaystyle v_Z(\lambda) = \begin{pmatrix} 1+\lambda \\ 1-\lambda \end{pmatrix}/2 .

Hadamard Gate.

\displaystyle \qquad H = \frac{1}{\sqrt{2}}  \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} = (X+Z)/\sqrt{2} .

\displaystyle H : \vert b \rangle \mapsto \big( \vert 0 \rangle + (-1)^b \vert 1 \rangle \big)/\sqrt{2}\,, \qquad b \in \{0,1\} .

Specifically, \displaystyle H \vert 0 \rangle = \big(\vert 0 \rangle + \vert 1 \rangle \big)/\sqrt{2} = \big( \vert 0 \rangle + \exp(2 \pi i) \vert 1 \rangle \big)/\sqrt{2} . Notice that

\displaystyle (-1)^b = \exp(  \pi i b )= \exp(  2 \pi i b/2 ) = \exp(  2 \pi i 0.b )

where we use the notation

0.x_1x_2\cdots x_k = x/2^k where x is the integer with binary expansion x_1 x_2 \cdots x_k, x_i \in \{0,1\} .

Let x = x_1 \cdots x_n \in \{0,1\}^n, and N = 2^n. An n-qubit Hadamard gate is

\displaystyle H^{\otimes n} : \vert x \rangle \quad \mapsto \quad \frac{1}{\sqrt{N}} \sum_{z \in \{0,1\}^n}{ (-1)^{x \odot z} \vert z \rangle } ,

where \odot denotes the Boolean inner product of two strings. It can also be written as the tensor product

\displaystyle H^{\otimes n} : \vert x \rangle \quad \mapsto \quad \frac{1}{\sqrt{N}} \bigotimes_{\ell=1}^{n}{ \big[ \vert 0 \rangle + \exp(2 \pi i x_\ell/2) \vert 1 \rangle \big] } .
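A pure-Python check that the Kronecker-product construction of H^{\otimes n} matches the (-1)^{x \odot z}/\sqrt{N} formula (the kron helper is mine):

```python
# Check H^{(tensor)n} column-by-column against the (-1)^{x.z}/sqrt(N) formula.
def kron(A, B):
    """Kronecker product of two square matrices given as nested lists."""
    return [[a * b for a in rowA for b in rowB]
            for rowA in A for rowB in B]

s = 2 ** -0.5
H1 = [[s, s], [s, -s]]          # the single-qubit Hadamard gate

n = 3
H = H1
for _ in range(n - 1):          # H^{tensor n} by repeated Kronecker products
    H = kron(H, H1)

N = 2 ** n
for xx in range(N):
    for z in range(N):
        parity = bin(xx & z).count("1") % 2     # Boolean inner product x . z
        assert abs(H[z][xx] - (-1) ** parity / N ** 0.5) < 1e-12
```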

Other Quantum Gates.

  • Rotate by an angle of 2^{-k}\cdot 2\pi: \displaystyle \qquad R_{k} =  \begin{bmatrix} 1 &  0\\ 0 & \exp(2\pi i 2^{-k}) \end{bmatrix}.
  • Phase gate i.e., Rotate by 90^\circ: \displaystyle \qquad S =  R_{2} = \begin{bmatrix} 1 &  0\\ 0 & i \end{bmatrix}.
  • Axis-specific rotations: for m \in \{x,y,z\} and the corresponding M \in \{X,Y,Z\}, \displaystyle R_m(\theta) = \exp(-i\theta M/2)

Measurements. A measurement apparatus M = \{ M_x \} satisfies the completeness relation, i.e., it is a decomposition of the identity I = \sum_x{M_x^* M_x}. The probability of observing the outcome z after measuring \vert \phi \rangle is

\displaystyle p(z) = p(z; \vert  \phi \rangle, M) = \langle M_z \phi \vert M_z \phi \rangle = \langle \phi \vert M_z^* M_z \vert \phi \rangle = \Vert M_z \vert \phi \rangle \Vert_2^2.

The state of the system after the measurement is \displaystyle \frac{M_z \vert \phi \rangle}{\Vert M_z \vert \phi \rangle \Vert_2} = \frac{M_z \vert \phi \rangle}{\sqrt{p(z)}}.

Projective Measurements. The components of a projective measurement M = \{M_x\} correspond to the spectral decomposition \{(e_x, \lambda_x)\}_x in that each M_x = e_x e_x^* is the projection onto the corresponding eigenspace:

\displaystyle M = \sum_x{\lambda_x M_x} = \sum_x{\lambda_x e_x e_x^*}.

This implies that the probability of observing a vector \vert z \rangle after measuring a quantum state \vert \phi \rangle is the squared modulus of the (complex) inner product of the two vectors:

\displaystyle p(z) = p(z; \vert  \phi \rangle, M) = \langle \phi \vert M_z \vert \phi \rangle = \vert \langle \phi \vert e_z \rangle \vert^2.

The expectation and variance of a projective measurement have a friendly form:

\langle M \rangle_\phi := \mathbb{E}(M; \vert \phi \rangle) = \sum_x{x p(x ; \vert \phi \rangle)} = \langle \phi \vert M \vert \phi \rangle.

\displaystyle [\Delta(M)]^2 := \langle (M - \langle M \rangle)^2 \rangle = \langle M^2 \rangle - \langle M \rangle^2,

where \Delta(.) is the “standard deviation.”

The Heisenberg Uncertainty Principle. Let C,D be two projective measurements. Then, with respect to any quantum state \vert \phi \rangle, the product of the standard deviations (across a large number of trials) is “large”:

\displaystyle \Delta(C; \vert \phi \rangle) \Delta(D; \vert \phi \rangle) \geq \frac{\big\vert \langle \phi \vert [C, D] \vert \phi \rangle \big\vert}{2}.

Entangled State. A quantum state of a k-qubit system is entangled if it cannot be written as a tensor product of single-qubit states. For example, suppose k=2, and let the k-bit integers \{1,2,3,4\} index the 2^k basis states. The (unnormalized) state \vert 1 \rangle + \vert 4 \rangle = \vert 00 \rangle + \vert 11 \rangle cannot be factored into a product of two single-qubit states; hence it is an entangled state.

The Bell Basis. Each of the following states is entangled. They form an orthonormal basis of \mathbb{C}^4.

\displaystyle \Phi^{\pm}:=\frac{1}{\sqrt{2}}\left(\vert 00 \rangle \pm \vert 11 \rangle\right)\,, \qquad \Psi^{\pm}:=\frac{1}{\sqrt{2}}\left(\vert 01 \rangle \pm \vert 10 \rangle \right)\, .

Put another way, let \vert \phi_{xy} \rangle denote the four Bell states indexed by the bits x,y. In particular, y = 0 corresponds to \Phi^{\pm}, and y = 1 corresponds to \Psi^{\pm}. Then,

\displaystyle \vert \phi_{xy} \rangle = \big( \vert 0,y \rangle + (-1)^x \vert 1, 1\oplus y \rangle \big)/\sqrt{2}

Bloch Sphere Representation. A single qubit q := a \vert 0 \rangle + b \vert 1 \rangle can be represented by the tuple (\theta, \phi), as follows: q can be identified with a point v on the 2-sphere in 3 dimensions, with the vector \vert 0 \rangle at the North pole. Then \theta is the angle of the vector v with the z-axis, and \phi is the angle between the x-axis and the projection of v on the xy-plane. The Bloch vector for v is

\displaystyle v = (\sin \theta \cos \phi, \sin \theta \sin \phi, \cos \theta) .

This is equivalent to setting \displaystyle a := \cos(\theta/2), b := \exp(i \phi) \sin(\theta/2) .

Quantum Fourier Transform. Let N = 2^n, let j, k \in \{0, 1, \cdots, N-1\}, and let i be the imaginary unit. Let \omega_j = \exp(2 \pi ij /N) be the jth power of the Nth root of unity. The QFT is a linear operator that takes

\displaystyle \vert j \rangle \quad \mapsto \quad \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} \omega_j^k \vert k \rangle .

The right hand side equals

\displaystyle \frac{1}{\sqrt{N}} \bigotimes_{\ell = 1}^n {\big[ \vert 0 \rangle + \underbrace{e^{2\pi i j/2^\ell }}_{\omega_j^{N/2^\ell}} \vert 1 \rangle \big]} .
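A quick numeric check that the sum form and the tensor-product form agree (pure Python; no quantum library assumed):

```python
import cmath

def qft_state(j, n):
    """QFT of |j> on n qubits: amplitude omega_j^k / sqrt(N) at |k>."""
    N = 2 ** n
    return [cmath.exp(2 * cmath.pi * 1j * j * k / N) / N ** 0.5
            for k in range(N)]

def qft_tensor(j, n):
    """The same state assembled from the tensor-product form."""
    state = [1.0]
    for l in range(1, n + 1):   # one factor per qubit, most significant first
        w = cmath.exp(2 * cmath.pi * 1j * j / 2 ** l)
        state = [amp * f for amp in state for f in (2 ** -0.5, w * 2 ** -0.5)]
    return state

a, b = qft_state(5, 3), qft_tensor(5, 3)
assert all(abs(u - w) < 1e-12 for u, w in zip(a, b))
```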

Edit | Back to top

Discrete Geometry

Radon’s Theorem. Any set of d+2 points in \mathbb{R}^d can be partitioned into red and blue points so that the respective convex hulls intersect.

Edit | Back to top


Gamma function. The integral

\displaystyle \Gamma(z) = \int_0^\infty{ x^{z-1} e^{-x} dx} ,

converges absolutely if z \in \mathbb{C}, \mathrm{Re}(z) > 0. It satisfies \Gamma(1) = 1, \Gamma(z+1) = z \Gamma(z); by the Bohr-Mollerup theorem, it is the unique log-convex function on the positive reals with these properties. In particular, \Gamma(1/2) = \sqrt{\pi}. Euler proved that

\displaystyle \Gamma(z) = \prod_{n=1}^\infty \bigg[ \frac{  \big(1+1/n \big)^z }{ 1+z/n } \bigg] .

Weierstrass showed that

\displaystyle \Gamma(z) = \frac{e^{-\gamma z}}{z} \prod_{n=1}^\infty \bigg[ \frac{  e^{z/n} }{ 1+z/n } \bigg] ,

where \displaystyle \gamma \approx 0.577216 is the Euler-Mascheroni constant. Euler’s reflection formula says that

\displaystyle \Gamma(z) \Gamma(1-z) = \frac{ \pi }{\sin (\pi z)} if z \not \in \mathbb{Z} .

The duplication formula says that

\displaystyle \Gamma(z) \Gamma( z + 1/2) = \frac{2 \sqrt{\pi}}{2^{2z}} \Gamma(2z) .

Moreover, \overline{\Gamma(z)} = \Gamma(\overline{z}). The derivatives are

\displaystyle \Gamma^{(n)}(z) = \int_0^\infty{ x^{z-1} e^{-x} (\ln x)^n dx } .
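These identities are easy to sanity-check numerically with the standard library (real arguments only, since math.gamma does not accept complex inputs):

```python
import math

z = 0.3
# Reflection formula: Gamma(z) Gamma(1-z) = pi / sin(pi z)
assert abs(math.gamma(z) * math.gamma(1 - z)
           - math.pi / math.sin(math.pi * z)) < 1e-9

# Duplication formula: Gamma(z) Gamma(z + 1/2) = 2 sqrt(pi) 4^{-z} Gamma(2z)
assert abs(math.gamma(z) * math.gamma(z + 0.5)
           - 2 * math.sqrt(math.pi) * 4 ** -z * math.gamma(2 * z)) < 1e-9

# Functional equation and Gamma(1/2)
assert abs(math.gamma(z + 1) - z * math.gamma(z)) < 1e-12
assert abs(math.gamma(0.5) - math.sqrt(math.pi)) < 1e-12
```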

Fractional Differentiation of a monomial. Let y(x) = x^\mu for \mu \in \mathbb{R}. Then for any real s > 0,

\displaystyle \frac{d^s y}{dx^s} = \frac{\Gamma(\mu + 1)}{\Gamma(\mu - s + 1)} x^{\mu - s} .

This agrees with ordinary differentiation when s \in \mathbb{N}.
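As a sanity check, applying the half-derivative twice to a monomial should reproduce the ordinary derivative; the helper below simply evaluates the \Gamma-ratio formula:

```python
import math

def frac_deriv_monomial(mu, s):
    """d^s/dx^s applied to x^mu: returns (coefficient, new exponent)."""
    return math.gamma(mu + 1) / math.gamma(mu - s + 1), mu - s

# Half-differentiating x^2 twice should give (x^2)' = 2x.
c1, e1 = frac_deriv_monomial(2.0, 0.5)   # -> c1 * x^1.5
c2, e2 = frac_deriv_monomial(e1, 0.5)    # -> c1 * c2 * x^1.0
assert abs(c1 * c2 - 2.0) < 1e-9
assert abs(e2 - 1.0) < 1e-12
```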

Cauchy’s Repeated Integration. Let f be a Lebesgue-integrable function defined in [a,b]. Define the integration operator

\displaystyle I_a f (x) := \int_a^x{f(t) dt} .

A repeated application of this operator can be expressed as

\displaystyle I_a^{(n)} f (x) = \int_a^x \int_a^{t_n} \cdots \int_a^{t_2}{ f(t_1) dt_1 \cdots dt_n}

\displaystyle = \frac{1}{(n-1)!} \int_a^x{ (x-t)^{n-1} f(t) dt } .

Riemann-Liouville Fractional Integral. The natural extension of Cauchy’s repeated integration’s order from an integer n to a positive real s is the Riemann-Liouville fractional integral:

\displaystyle I_a^{(s)} f(x) = \frac{1}{\Gamma(s)} \int_a^x{f(t) (x-t)^{s-1} dt }

where f is a locally-integrable function and s \in \mathbb{C}, \mathrm{Re}(s) > 0.

It can be shown that for x\in \mathbb{R},

\displaystyle \widehat{I_{-\infty}^s f}(x) = (i x)^{-s} \hat{f}(x) ,

where \hat{\cdot} denotes the Fourier transform.

Riemann-Liouville Fractional Derivative. To differentiate to the order s \in (k-1,k], we cannot directly use -s as the order in Riemann-Liouville integral. Instead, integrate to the order k-s and then differentiate to the integral order k.

\displaystyle D_a^{s} f(x) = D_a^{k} I_a^{k-s} f(x) = \frac{1}{\Gamma(k-s)} \frac{d^k}{dx^k} \int_a^x{f(t) (x-t)^{k-s-1} dt } .

Note that the value of the derivative depends on all values of f(t) for t \in [a,x], whereas classical derivatives are “local” in that they depend only on the behavior of f near x.

It can be shown that for x\in \mathbb{R},

\displaystyle \widehat{D_{-\infty}^s f}(x) = (i x)^s \hat{f}(x) ,

where \hat{\cdot} denotes the Fourier transform.

Fourier Transform. Let e_s(x) = e^{-2 \pi i s x}. Then, for any function f : \mathbb{R} \rightarrow \mathbb{C}, its Fourier transform is a function \hat{f} : \mathbb{R} \rightarrow \mathbb{C} defined as

\displaystyle (\mathcal{F} f)(s)  = \hat{f}(s) = \int_{-\infty}^\infty{  f(x) e_s(x) dx}.

The inverse transform is defined as

\displaystyle (\mathcal{F}^{-1} \hat{f})(x)  = f(x) = \int_{-\infty}^\infty{  \hat{f}(s) e_s(-x) ds}.

The transform of a convolution is the product of the transforms:

\displaystyle \mathcal{F}(f  \star g)(s) = \mathcal{F} f (s) \cdot \mathcal{F}g (s) .

The transform of a derivative is a scaling of the transform:

\displaystyle \widehat{f^{(n)}}(s) = (2 \pi i s)^n \hat{f}(s) .

The Plancherel theorem says that the Fourier transform (of a square-integrable function) preserves  the squared-\ell_2 norm of the function:

\displaystyle \int{\vert f(x) \vert^2 dx} = \int{\vert \hat{f}(s) \vert^2 ds} .

This is generalized by Parseval’s theorem for two square-integrable functions f,g:

\displaystyle \int{  f(x) \overline{g(x)} dx} = \int{ \hat{f}(s) \overline{\hat{g}(s)} ds} .


Laplace Transform. Very similar to the Fourier transform, except the coefficients are complex and the integration runs only on the non-negative reals. For functions f : \mathbb{R}_+ \rightarrow \mathbb{C},

\displaystyle \hat{f}(s) = \int_0^\infty{ f(t) e^{-st} dt} \, , \qquad s \in \mathbb{C} .

The inverse transform is given by Fourier-Mellin’s inverse formula aka the Bromwich line integral:

\displaystyle f(t) = \frac{1}{2\pi i} \lim_{T \rightarrow \infty} \int_{\gamma - iT}^{\gamma + iT}{ \hat{f} (s) e^{st} ds} .

Laplace transform is linear. Moreover,

\displaystyle \widehat{e^{ct}}(s) = \frac{1}{s-c} .

\displaystyle \widehat{e^{ct} f(t) }(s) = \hat{f}(s-c) .

\displaystyle \widehat{f(t - T) }(s) =e^{-Ts} \hat{f}(s) .

\displaystyle \widehat{ t f(t) }(s) = - \frac{d}{ds} \hat{f}(s) .

\displaystyle \widehat{f^\prime (t) }(s) = s \hat{f}(s) - f(0) .

\displaystyle [\mathcal{L} I_0^t f ](s) =\frac{1}{s} \mathcal{L} f(s),

where \displaystyle I_0^t is the integration operator denoting \displaystyle I_0^t f := \int_{x=0}^t{f(x) dx} and \mathcal{L} is the Laplace transform.

If X is a random variable with density f, then for any complex s, \hat{f}(s) = \mathbb{E}\ e^{-sX} .



Tonelli’s Theorem. If a two-variable function f is non-negative, then we can change the order of integration in a double integral, i.e., if f(x,y) \geq 0 then

\displaystyle \int_{X,Y} f(x,y)dxdy = \int_X \bigg( \int_Y f(x,y) dy \bigg) dx = \int_Y \bigg( \int_X f(x,y) dx \bigg) dy .

Fubini’s Theorem has the same conclusion as Tonelli’s theorem, but it stipulates that f has a finite integral over X \times Y.

Differentiation under the Integral. Also called “the Leibniz rule.”

\displaystyle \frac{d}{dx} \int_{a(x)}^{b(x)} f(x,t) dt = f(x, b(x) ) b^\prime(x) - f(x, a(x) ) a^\prime(x) + \int_{a(x)}^{b(x)} \frac{\partial }{\partial x} f(x,t) dt .

If a(x), b(x) are constants, the first two terms are zero.


Polylogarithm Function. For complex order s and argument z,

\displaystyle \mathrm{Li}_s(z) = \sum_{k=1}^\infty{ \frac{z^k}{k^s}} .

It can be expressed as a repeated integral of itself:

\displaystyle \mathrm{Li}_{s+1}(z) = \int_0^z{\frac{\mathrm{Li}_s(t)}{t} dt} .


\displaystyle \mathrm{Li}_{-n}(z) = {1 \over (1-z)^{n+1}} \sum_{k=0}^{n-1} \left\langle {n \atop k} \right\rangle z^{n-k} ,


where n=1,2,3,\ldots and the \left\langle {n \atop k} \right\rangle notation denotes the Eulerian numbers. This expression is strikingly similar to the nth moment of a geometrically distributed random variable X with success parameter 1-\alpha for some \alpha \in (0,1). Then

\displaystyle\mathbb{E}\ X^n = \frac{1}{(1-\alpha)^n} \sum_{k=0}^{n-1} \left\langle {n \atop k} \right\rangle \alpha^{k} .


The Hypergeometric Series. For a complex z, \vert z \vert < 1 and complex parameters a, b, c where c is not a non-positive integer,

\displaystyle {}_2F_1(a, b; c; z) = \sum_{n=0}^\infty \frac{ (a)_n (b)_n }{(c)_n} \frac{z^n}{n!} ,

where (a)_n = a(a+1)\cdots (a+n-1) is the rising Pochhammer symbol. If a (or b) is a non-positive integer -m, the series terminates in the polynomial

\displaystyle {}_2F_1(-m, b; c; z) = \sum_{n=0}^m (-1)^n {m \choose n} \frac{  (b)_n }{(c)_n} z^n .



Laguerre Polynomials.

\displaystyle L_n(x) = \sum_{k=0}^n{ {n \choose k} \frac{(-1)^k x^k}{k!} } .

They satisfy the differential equation

\displaystyle x L_n^{(2)}(x) + (1-x) L_n^{(1)}(x) + n L_n(x) = 0 .

They are orthogonal with respect to the inner product

\displaystyle \langle L_m, L_n \rangle = \int_0^\infty e^{-x} L_m(x) L_n(x) dx =\delta_{mn}.


Edit | Back to top

Interesting Results

The union of two random spanning trees of any graph is an expander. Moreover, it is a linear-size cut sparsifier for random graphs. Goyal-Vempala 2004

It suffices to sample only \Theta(n / \log n) elements from any distribution to test its entropy and support size. See here.

Let G be an (n, d, \lambda) expander. Let F \subset E contain an \alpha fraction of all edges. Consider a random walk that starts at a random edge from F. The probability that it uses an edge from F at the t-th step is at most \alpha + ( \lambda/d)^{t-1}. See here.

For certain self-reducible problems, approximate counting is equivalent to almost uniform sampling. See here.

Edit | Back to top