# Impagliazzo’s Hardcore Lemma: a Proof

Informally speaking, Impagliazzo’s hardcore lemma says that if a boolean function is “hard to compute on average” by small circuits, then there exists a set of inputs on which the same function is “extremely hard to compute on average” by slightly smaller circuits.

In this post, I am going to explain how I understand the proof of the hardcore lemma presented in the Arora-Barak complexity book. Because the formal proof can be found in the book, I intend to write informally. I think some subtleties are involved in turning the setting of the lemma into a suitable two-player zero-sum game. Doing so enables one to use von Neumann’s minimax theorem to effectively “exchange the quantifiers” in the contrapositive statement of the lemma. Although the Arora-Barak proof mentions these subtleties, I am going to explore them in more detail and, I hope, in a way more accessible to a beginner like me.

Please let me know about any mistakes in this post. I make a lot of mistakes.

This post is organized as follows. First, we informally present the idea of hard-on-average functions. Then we state John von Neumann’s minimax theorem for two-player zero-sum games and mention some of its implications. Next, we state the hardcore lemma and proceed to prove it by contradiction. The contradictory assumption leads us to construct a two-player zero-sum game between a circuit player and an input player, both playing with pure (deterministic) strategies. Then we allow the players to use mixed strategies (distributions over the pure strategies). This enables us to use von Neumann’s minimax theorem to show that there exists a certain distribution over smaller circuits: circuits drawn from this distribution perform well, on average, on every suitable input distribution. By combining samples from this distribution, we construct a slightly larger circuit that correctly computes the supposedly average-case-hard function on a large fraction of inputs, contradicting the hardness assumption on that function.

## Average-Case Hard Functions

Consider a function $f$ that is $\delta$-hard on average: that is, every small circuit $C$ intended to compute $f$ makes an error on at least a $\delta$ fraction of inputs. Why small circuits? Because if we are allowed arbitrarily large circuits, we can always compute $f$ exactly, for instance with a circuit that hard-wires the entire truth table.

The hardness quantified in the above manner is called “average-case hardness”.
In contrast, if every small circuit fails to compute $f$ on at least one input, $f$ is called “worst-case hard”. Keep in mind that both average-case and worst-case hardness are parameterized by an upper bound on the size of the circuits being considered.

Why is $f$ hard? Or rather, given a fixed circuit, why does it succeed in computing $f$ on some inputs but fail on others? How is the “hardness” of $f$ distributed over its inputs? When we are restricted to small circuits, are all inputs equally hard, or are some inputs harder than others? An answer to these questions is given by Impagliazzo’s hardcore lemma.

## von Neumann’s Minimax Theorem on Two-Player Zero-Sum Games

Suppose two players $A$ and $B$ are playing a game. $A$ can pick a move $i \in \{1, 2, 3, \cdots, m\}$ whereas $B$ can pick a move $j \in \{1, 2, 3, \cdots, n\}$. They pick their moves without knowing the opponent’s choice. The “value” of the game, after $A$ and $B$ have made their respective choices $i$ and $j$, is denoted by $P(i,j)$. $A$ always prefers a high value, so he is the “max player”; $B$ always prefers a low value, and hence he is the “min player”. Whatever amount $A$ wins, $B$ loses, so we can model the game as $A$ trying to maximize $P(i,j)$ while $B$ tries to minimize it.

If $A$ has chosen move $i$, the minimum gain he can expect is $\min_j P(i,j)$. The worst-case gain for $A$ over the entire game is then $v_A=\max_i \min_j P(i,j)$. Similarly, the worst-case loss for $B$ is $v_B = \min_j \max_i P(i,j)$. The game is said to be in equilibrium if $v_A = v_B$, which means no player has any incentive to change his move as long as the other player does not change his. With deterministic moves such an equilibrium need not exist; von Neumann’s 1928 theorem says that an equilibrium always exists if the two players are allowed to select their moves from distributions over the candidate moves instead of committing to a single move. That is, suppose $\mathcal{I}$ is a distribution on $\{1, 2, \cdots, m\}$ and $\mathcal{J}$ is a distribution on $\{1, 2, \cdots, n\}$, and let $\mathbf{E}_x[f(x)]$ denote the expected value of $f(x)$ when $x$ is drawn from the relevant distribution. Then the minimax theorem says:

$\max_{\mathcal{I}} \min_{\mathcal{J}} \mathbf{E}_{i\sim \mathcal{I},j\sim\mathcal{J}}[P(i,j)] = \min_{ \mathcal{J}} \max_{\mathcal{I}} \mathbf{E}_{i\sim \mathcal{I},j\sim\mathcal{J}}[P(i,j)].$

Some points to note:

1. The minimax theorem says that each player has an optimal mixed strategy, one that best limits the other player’s amount of success no matter how the other player responds.
2. It also says that the equilibrium value of the game for both players is the same regardless of who plays first.
3. For an equilibrium to exist, it does not matter what the actual distributions are; the optimal mixed strategies may even be degenerate, putting all their probability on a single move. What matters is that the players are allowed to mix.
4. The minimax theorem is equivalent to the strong duality of linear programming. In fact, both sides of the above equation can be formulated as linear programs, each dual of the other. The proof of equality then directly follows from the fact that the optimum values attained by both the primal and the dual programs are equal.
5. The domains of the distributions $\mathcal{I}$ and $\mathcal{J}$ have to be finite. In other words, the players must choose from a finite number of moves: von Neumann’s theorem is stated for finite matrix games, in which both the number of players and the number of moves available to each of them are finite.
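To make the theorem concrete, here is a minimal Python sketch using matching pennies (a standard textbook game, not from this post): with pure strategies the two worst-case values $v_A$ and $v_B$ disagree, while the uniform mixtures achieve the same value from both sides.

```python
from fractions import Fraction

# Payoff matrix P[i][j] for matching pennies: the max player (rows)
# wins 1 if the two coins match and loses 1 otherwise.
P = [[1, -1],
     [-1, 1]]

# Pure strategies only: max_i min_j and min_j max_i do not coincide,
# so no pure-strategy equilibrium exists.
v_A = max(min(row) for row in P)                                # = -1
v_B = min(max(P[i][j] for i in range(2)) for j in range(2))     # = 1
assert v_A == -1 and v_B == 1

# Mixed strategies: against the uniform mixture (1/2, 1/2) of the max
# player, every pure response of the min player yields expectation 0,
# hence so does every mixture (by linearity). The game's value is 0.
half = Fraction(1, 2)
expected = [half * P[0][j] + half * P[1][j] for j in range(2)]
assert expected == [0, 0]
```

Here the equilibrium mixtures happen to be uniform; in general they come from solving the two dual linear programs mentioned above.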

## Knowing the Opponent’s Mixed Strategy is No Advantage

Notice that the minimax principle is a worst-case loss-aversion policy. Suppose the first player is the min player. From his perspective, the value of the game would be

$v= \min_{\mathcal{J}} \max_{\mathcal{I}} \mathbf{E}_{i\sim \mathcal{I},j\sim\mathcal{J}}[P(i,j)]$.

This is a minimization problem for the first (min) player over all possible responses of the second (max) player. The inner maximization problem is bounded and feasible: the expected payoff is linear in the second player’s mixture, and the feasible region is the simplex whose vertices are his pure strategies. Hence the optimum is attained at some vertex of this polytope, that is, at a pure strategy. Thus we can safely assume that the second player ultimately plays a pure strategy rather than a proper convex combination of several pure strategies. Hence

$v = \min_{\mathcal{J}} \max_{\mathcal{I}} \mathbf{E}_{i\sim \mathcal{I},j\sim\mathcal{J}}[P(i,j)] = \min_{\mathcal{J}} \max_{1 \leq i \leq m} \mathbf{E}_{j\sim\mathcal{J}}[P(i,j)].$

Similarly, when the first player is the max player, the equilibrium value would be

$v = \max_{\mathcal{I}} \min_{\mathcal{J}} \mathbf{E}_{i\sim \mathcal{I},j\sim\mathcal{J}}[P(i,j)] = \max_{\mathcal{I}} \min_{1 \leq j \leq n} \mathbf{E}_{i\sim \mathcal{I}}[P(i,j)].$

Fact. Even if the first player knows the distribution used by the second player, he does not gain any extra information or advantage from it.
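This reduction is easy to check numerically. Below is a small sketch on a made-up $3 \times 3$ payoff matrix (the numbers are arbitrary): once the second player’s mixture is fixed, the expected payoff of any mixed response is a convex combination of the pure-response payoffs, so no mixture can beat the best pure move.

```python
import random

# A hypothetical 3x3 payoff matrix, chosen only for illustration.
P = [[3, 1, 4],
     [1, 5, 9],
     [2, 6, 5]]

# Fix (and reveal) a mixed strategy q for the min (column) player.
q = [0.2, 0.5, 0.3]

# Expected payoff of each pure row i against q.
pure_payoffs = [sum(P[i][j] * q[j] for j in range(3)) for i in range(3)]
best_pure = max(pure_payoffs)

# Any mixed row strategy p yields a convex combination of pure payoffs,
# so it can never exceed the best pure response.
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(3)]
    s = sum(w)
    p = [x / s for x in w]
    mixed_payoff = sum(p[i] * pure_payoffs[i] for i in range(3))
    assert mixed_payoff <= best_pure + 1e-12
```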

## Back to the Hardcore Lemma

Informally, the hardcore lemma says that given an average-case hard function $f$, there is always a “large” set of inputs on which no small circuit can do much better than guessing. In other words, inputs from this set are hard for all small circuits that intend to compute $f$. This is why this set is called the hard core (or hardcore) of $f$.

### Definitions

If a probability distribution $H_\delta$ assigns probability at most $1/(\delta|D|)$ to every object in its domain $D$, it is called a $\delta$-density distribution. If a probability distribution is uniform on exactly a $\delta$ fraction of the objects in its domain, it is called a $\delta$-flat distribution, or just a $\delta$-flat for short. Note that for a fixed $\delta$ and a finite domain, there are only finitely many $\delta$-flats.
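These two definitions can be spelled out in a short Python sketch (the domain size $8$ and $\delta = 1/2$ below are arbitrary choices for illustration):

```python
from fractions import Fraction

def is_delta_density(p, delta):
    """A delta-density distribution gives no point more than 1/(delta*|D|)."""
    assert sum(p) == 1
    cap = 1 / (delta * len(p))
    return all(x <= cap for x in p)

def is_delta_flat(p, delta):
    """A delta-flat is uniform on exactly a delta fraction of the domain."""
    k = delta * len(p)                 # support size of a delta-flat
    support = [x for x in p if x != 0]
    return len(support) == k and all(x == 1 / k for x in support)

delta = Fraction(1, 2)
# Uniform on half of an 8-point domain: a (1/2)-flat.
flat = [Fraction(1, 4)] * 4 + [Fraction(0)] * 4
# Obeys the cap 1/4 but is not uniform on its support: density, not flat.
dense = [Fraction(1, 4)] * 2 + [Fraction(1, 8)] * 4 + [Fraction(0)] * 2

assert is_delta_density(flat, delta) and is_delta_flat(flat, delta)
assert is_delta_density(dense, delta) and not is_delta_flat(dense, delta)
```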

Suppose we are dealing with functions $f$ that map an $n$-bit input to a single output bit. Suppose there is a size bound $S(n)$ such that we refer to circuits of size at most $S(n)$ as small circuits. Additionally, we refer to circuits of size at most $S(n) / \mathrm{poly}(n)$ as smaller circuits. The idea is that if we combine polynomially many smaller circuits, we still get a small circuit.

Recall that a function $f$ is $\delta$-hard on average if every small circuit $C$ intended to compute $f$ makes an error on at least $\delta$ fraction of inputs $x\in \{0,1\}^n$.

If a circuit $C$ is able to compute $f$ on inputs $x$ drawn from a distribution $H_\delta$ with probability at least $1/2 + \epsilon$, we say $C$ performs well on $H_\delta$; otherwise, we say $C$ performs poorly on $H_\delta$. For the following discussion, let us fix an arbitrary positive real $\epsilon < 1/2$.

## Statement of the Hardcore Lemma

Consider the two statements below.

Statement $A$. $f$ is $\delta$-hard for small circuits, which means every small circuit fails to compute $f$ on at least $\delta$ fraction of inputs.

Statement $B$. There exists a $\delta$-density distribution $H_\delta$ on $n$-bit strings such that all smaller circuits $C$ perform poorly on $H_\delta$.

The hardcore lemma states that $A \implies B$.

See the definitions above for the phrases “performs poorly” and “small / smaller circuits”.

## Proof of the Hardcore Lemma

We are going to prove $A\implies B$ by contradiction. That is, we are going to assume that $A$ is true but $B$ is false, and then show that $\bar{B} \implies \bar{A}$, contradicting the assumption that $A$ is true.

Consider the statement $\bar{B}$, the negation of statement $B$, as follows.

Statement $\bar{B}$. For every $\delta$-density distribution $H_\delta$, there exists some smaller circuit $C$ which performs well on $H_\delta$.

Keep in mind that for the sake of contradiction, we are assuming that Statement $\bar{B}$ is true.

## A Finite Two-Player Zero-Sum Game on Pure Strategies

Statements $B$ and $\bar{B}$ smell a lot like a game between an input player and a circuit player. The input player moves first and selects an input distribution $H_\delta$. The circuit player moves second and selects a smaller circuit $C$. The input player then pays the circuit player the amount

$P(H_\delta, C) = \mathbf{Pr}_{x\sim H_\delta} [C(x)=f(x)].$

The circuit player wants this payoff to be large (at least $1/2 + \epsilon$, i.e., to perform well) while the input player wants it to be small. Therefore, we can think of it as a two-player zero-sum game.
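As a toy illustration (the function $f$, the “circuit”, and the flat distributions below are made up solely for this example), the payoff is just a probability-weighted count of agreements:

```python
from itertools import product
from fractions import Fraction

n = 3
inputs = list(product([0, 1], repeat=n))

def f(x):        # a stand-in "hard" function: parity of the three bits
    return x[0] ^ x[1] ^ x[2]

def C(x):        # a toy "circuit" that ignores the last bit
    return x[0] ^ x[1]

def payoff(H):   # P(H, C) = Pr_{x ~ H}[C(x) = f(x)]
    return sum(pr for x, pr in H.items() if C(x) == f(x))

# Two (1/2)-flat distributions: uniform on the inputs with x[2] == 0,
# and uniform on the inputs with x[2] == 1.
H_easy = {x: (Fraction(1, 4) if x[2] == 0 else Fraction(0)) for x in inputs}
H_hard = {x: (Fraction(1, 4) if x[2] == 1 else Fraction(0)) for x in inputs}

# C agrees with f exactly when x[2] == 0, so the input player can drive
# this particular circuit's payoff all the way from 1 down to 0.
assert payoff(H_easy) == 1 and payoff(H_hard) == 0
```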

Consider the input player selecting a single $\delta$-density distribution. This is a pure strategy for him, but there exist uncountably many $\delta$-density distributions on the set of inputs $\{0, 1\}^n$. This seemingly implies that his number of choices is not finite. However, there is an important fact relating every $\delta$-density distribution to the finite set of $\delta$-flat distributions.

Fact 1. Every $\delta$-density distribution is a convex combination of the $\delta$-flat distributions.

See the section “Omitted Proofs” for a proof.

Thus selecting a particular $H_\delta$ is equivalent to taking a convex combination of the finitely many $\delta$-flats. We may therefore take the input player’s pure strategies to be the $\delta$-flats.

Now the game is finite, since both the number of smaller circuits and the number of $\delta$-flats are finite. The strategies (moves) are pure because both players select their moves deterministically.

Statement $\bar{B}$ implies that for every $H_\delta$ produced by the input player, there exists a circuit $C^*$ that performs well on $H_\delta$. Suppose the payoff for this particular circuit is

$v^* \geq 1/2 + \epsilon$.

Since the circuit player is maximizing across all suitable circuits, he can guarantee himself a payoff of at least $v^*$.

## When the Input Player Uses a Mixed Strategy

What happens if we allow the input player to have randomized (so-called mixed) strategies? Suppose we allow the input player to select an arbitrary distribution $\mathcal{H}$ over all possible $\delta$-density distributions. After that, he draws a specific $H_\delta \sim \mathcal{H}$. The circuit player still holds the advantage: thanks to statement $\bar{B}$, he has some well-performing circuit $C$ for every $H_\delta$ that the input player may draw.

The value attained by the circuit player at this stage is

$v = \min_{\mathcal{H}} \max_{C} \mathbf{Pr}_{H_\delta\sim \mathcal{H}, x\sim H_\delta} [C(x)=f(x)] = v^* \geq 1/2 + \epsilon$.

However, there is a problem: a mixed strategy $\mathcal{H}$ is a distribution on all possible $\delta$-density distributions, and the number of distinct $H_\delta$ is uncountably infinite. How does the input player choose his mixed strategy from an infinite set of pure strategies? Luckily, the following fact saves the day.

Fact 2. Any distribution over $\delta$-density distributions induces, by two-stage sampling, a distribution on inputs that is itself a $\delta$-density distribution.

See the section “Omitted Proofs” for a proof.

Thus drawing a $\delta$-density distribution $H_\delta \sim \mathcal{H}$ and then an input $x \sim H_\delta$ is the same as sampling from a single $\delta$-density distribution which, according to Fact 1, is equivalent to a convex combination of finitely many $\delta$-flat distributions.

## When Both Players Use Mixed Strategies

Next, suppose we also allow the circuit player to select an arbitrary distribution $\mathcal{C}$ on smaller circuits, from which he draws a circuit $C\sim\mathcal{C}$ to play. Now both players are using mixed strategies and hence von Neumann’s minimax theorem applies. It tells us that an equilibrium with mixed strategies exists for both players, and moreover, the value of the game is the same no matter which player plays first.

## Value of the Game when the Circuit Player Plays Second

Recall the fact that, in a two-player zero-sum game with mixed strategies, knowing the distribution used by the second player yields no advantage to the first player. Since the circuit player is playing second, picking his circuit $C$ from a certain distribution $\mathcal{C}$, the value of the game is the same as in the game where he used only pure strategies, i.e., picked a single circuit $C$ instead of drawing $C \sim \mathcal{C}$. In other words,

$v=\min_{ \mathcal{H}} \max_{\mathcal{C}} \mathbf{Pr}_{H_\delta\sim \mathcal{H}, x\sim H_\delta, C\sim \mathcal{C}} [C(x)=f(x)]$

$= \min_{ \mathcal{H}} \max_{C} \mathbf{Pr}_{H_\delta\sim \mathcal{H}, x\sim H_\delta} [C(x)=f(x)] = v^* \geq 1/2 + \epsilon$.

This gives us the following fact.

Fact 3. Assuming statement $\bar{B}$ is true, for every distribution $H_\delta$ put forth by the input player, the circuit player has some circuit $C$ that performs well on it. In other words, playing second, the circuit player always wins a payoff of at least $1/2 + \epsilon$.

We can also see this in the following way. When the circuit player picks a distribution instead of a single circuit, he has more power, and since he is maximizing, his payoff cannot decrease from what it was when he was restricted to pure strategies.

## Value of the Game when the Circuit Player Plays First

By von Neumann’s minimax theorem, when both players use the mixed strategies that lead to the equilibrium, the value obtained by the circuit player playing first is the same as the value he would have obtained playing second. But statement $\bar{B}$ implies that he can always secure at least $v^* \geq 1/2 + \epsilon$ playing second. Therefore, he secures the same payoff playing first. Isn’t it magic?

Fact 4. Assuming statement $\bar{B}$ is true, there exists a distribution $\mathcal{C}$ on smaller circuits for the circuit player such that, for every $\delta$-density distribution $H_\delta$ put forth by the input player, a circuit drawn as $C\sim \mathcal{C}$ computes $f$ on $x \sim H_\delta$ with probability at least $1/2 + \epsilon$.

## Constructing a Small Circuit which Correctly Computes the Hard Function on a Large Fraction of Inputs

Suppose the circuit player has a distribution $\mathcal{C}$ as in Fact 4 above. We can independently sample $t = \mathrm{poly}(n)$ smaller circuits $C_1, C_2, \cdots, C_t$ from $\mathcal{C}$ and construct a slightly larger circuit $C(x) = \mathrm{majority}(C_1(x), C_2(x), \cdots, C_t(x))$.

If a circuit drawn as $C\sim \mathcal{C}$ computes $f(x)$ correctly with probability at least $1/2 +\epsilon$ (over the draw of $C$), we say that the input $x$ is good for the distribution $\mathcal{C}$; otherwise, we say that $x$ is bad for $\mathcal{C}$. Note that being good or bad is a property of an input and the circuit distribution $\mathcal{C}$; it does not depend on any specific input distribution.

Fact 5. Assuming $\bar{B}$ is true, the total number of bad inputs must be strictly less than $\delta 2^n$.

Proof. Suppose otherwise, i.e., there are at least $\delta 2^n$ bad inputs. Then the input player can pick a $\delta$-flat distribution $H^*_\delta$ that is uniform on $\delta 2^n$ of the bad inputs for $\mathcal{C}$. By the definition of a bad input, a circuit $C\sim \mathcal{C}$ would compute $f$ on $x \sim H^*_\delta$ with probability strictly less than $1/2 + \epsilon$, contradicting the value promised by the minimax theorem under the assumption $\bar{B}$ (Fact 4). $\square$

Let $x$ be a good input. Each sampled circuit $C_i$ succeeds in computing $f(x)$ independently with probability at least $1/2 + \epsilon$. By a Chernoff bound, the probability that the majority vote $C(x)$ differs from $f(x)$ is at most $e^{-2\epsilon^2 t}$; choosing $t = O(n/\epsilon^2)$ makes this smaller than $2^{-n}$. Because there are at most $2^n$ good inputs, a union bound shows that the probability that $C$ errs on some good input is strictly less than $1$. Hence, with positive probability, the sampled circuits yield a circuit $C$ that correctly computes $f(x)$ on every good $x$; fix such a $C$. Moreover, $C$ combines $t=\mathrm{poly}(n)$ smaller circuits (each of size at most $S(n)/\mathrm{poly}(n)$) with a majority gate, so it is a small circuit. Since the number of bad inputs is strictly less than $\delta 2^n$, this small circuit $C$ correctly computes $f$ on strictly more than a $1-\delta$ fraction of inputs. But this violates the given statement that $f$ is $\delta$-hard on average.
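The amplification step can be simulated by modeling each sampled circuit as an independent coin that is correct with probability $1/2+\epsilon$ on a good input and taking a majority vote (the values of $\epsilon$ and $t$ below are illustrative):

```python
import random

random.seed(1)
eps = 0.1    # each sampled circuit's advantage on a good input (illustrative)
t = 501      # number of independent circuits; odd so the majority is well defined

def majority_correct():
    """One run of the combined circuit on a good input: each C_i is
    correct independently with probability 1/2 + eps; take the majority."""
    correct = sum(random.random() < 0.5 + eps for _ in range(t))
    return correct > t // 2

trials = 2000
err = sum(not majority_correct() for _ in range(trials)) / trials
# Hoeffding/Chernoff bound: Pr[majority wrong] <= exp(-2 * eps**2 * t),
# about 4e-5 here, so majority errors should essentially never occur.
assert err < 0.01
```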

Thus we have reached a contradiction: The statement $\bar{B}$ cannot be true and hence the claim of the hardcore lemma follows. $\square$

## Omitted Proofs

Fact 1. Every $\delta$-density distribution is a convex combination of the $\delta$-flat distributions.

Proof. We will only outline the proof, which is given as a hint to an exercise in the Arora-Barak book. Consider embedding the distributions on $\{0,1\}^n$ as points in $2^n$-dimensional Euclidean space. Every such point lies inside the unit hypercube in the all-positive orthant. Consider the polytope $K$ formed by taking the convex hull of all the $\delta$-flat distributions; this polytope is a proper subset of the hypercube. All points inside $K$ are convex combinations of the $\delta$-flats, and thus are distributions themselves.

Our claim states that every $\delta$-density distribution must lie inside the polytope $K$.

Suppose not. Then there exists a $\delta$-density distribution $\bar{p}$ lying outside $K$. By the separating hyperplane theorem (a relative of Farkas’s lemma), a point either lies inside a closed convex body, or there exists a hyperplane separating the point from the body. Let $\bar{r}$ be the normal vector of that separating hyperplane. Then there exists a scalar $t$ such that $\langle \bar{x}, \bar{r} \rangle \leq t$ for all $\bar{x} \in K$, while $\langle \bar{p}, \bar{r}\rangle > t$.

Consider the inner product $\langle \bar{p}, \bar{r}\rangle = \sum_{1 \leq i \leq 2^n}{p_i r_i}$. Note that although $0 \leq p_i \leq 1/(\delta 2^n)$ for all $i$, the $r_i$s can be arbitrary reals. Now reorder the coordinates so that the $r_i$s appear in decreasing order, and scan the terms from large $r_i$ to small. If we find a coordinate $j$ with $0 < p_j < 1/(\delta 2^n)$, we “move” some weight from a later nonzero coordinate $p_{j'}$ (with $j' > j$) to $p_j$ until $p_j$ reaches $1/(\delta 2^n)$. Specifically, with $d = \min\big(1/(\delta 2^n) - p_j,\ p_{j'}\big)$, we set $p_j \gets p_j + d$ and $p_{j'} \gets p_{j'} - d$. This keeps the total weight equal to $1$, so the moved point remains a distribution, and it never violates the $\delta$-density cap. Moreover, the inner product with $\bar{r}$ does not decrease, because weight only moves toward coordinates with larger $r$-values; hence the point stays on the far side of the separating hyperplane. Repeating this, $\bar{p}$ becomes a distribution in which every nonzero coordinate has weight exactly $1/(\delta 2^n)$, i.e., a $\delta$-flat. Hence we have found a $\delta$-flat distribution outside $K$, contradicting the fact that $K$ is the convex hull of all the $\delta$-flats to begin with. $\square$

Fact 2. Any distribution over $\delta$-density distributions induces, by two-stage sampling, a distribution on inputs that is itself a $\delta$-density distribution.

Proof. Let $D_1, D_2, D_3, \cdots$ be $\delta$-density distributions on the strings $\{0, 1\}^n$, so that $D_i(x) \leq 1/(\delta 2^n)$ for every $i$ and every $x$. Suppose $\mu$ is a distribution on the $D_i$s, that is, $\mu(D_i) \geq 0$ and $\sum_i{\mu(D_i)}=1$. Consider the following two-stage strategy for sampling a single string $x \in \{0, 1\}^n$: first select a distribution $D_i \sim \mu$, then select $x\sim D_i$. The probability that a particular $x$ is drawn by this process is $\sum_i{\mu(D_i)D_i(x)} \leq \frac{1}{\delta 2^n} \sum_i{\mu(D_i)} = \frac{1}{\delta 2^n}$. Thus sampling in this way gives rise to a $\delta$-density distribution. Because $\mu$ was an arbitrary distribution, our claim follows. $\square$
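Here is a small numerical spot-check of this two-stage sampling computation on an 8-element domain with $\delta = 1/2$ (all the parameters and mixtures below are arbitrary): the induced distribution never exceeds the $1/(\delta|D|)$ cap.

```python
import random
from itertools import combinations
from fractions import Fraction

random.seed(0)
N = 8                              # domain size (stands in for 2^n)
delta = Fraction(1, 2)
cap = 1 / (delta * N)              # the delta-density cap 1/(delta*|D|) = 1/4

# All (1/2)-flat distributions on N points: uniform on each 4-element subset.
flats = []
for S in combinations(range(N), N // 2):
    p = [Fraction(0)] * N
    for i in S:
        p[i] = Fraction(2, N)      # = 1/4 on the support
    flats.append(p)

def mixture(dists):
    """A random convex combination of the given distributions."""
    w = [Fraction(random.randint(1, 10)) for _ in dists]
    s = sum(w)
    return [sum((wi / s) * d[x] for wi, d in zip(w, dists)) for x in range(N)]

# Build a few delta-density distributions as mixtures of flats (the easy
# direction of Fact 1), then mix them once more with outer weights mu.
Ds = [mixture(random.sample(flats, 3)) for _ in range(5)]
induced = mixture(Ds)

assert sum(induced) == 1                   # still a distribution
assert all(px <= cap for px in induced)    # still delta-density
```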