Finite Sample Expressivity of Neural Networks

— 12 January 2017

In this blog post I would like to present a slightly altered version of the proof of a theorem in (Zhang et al., 2016). The statement concerns the finite-sample expressivity of neural networks. One can take the same approach but present it in an arguably more direct and compact fashion, explicitly writing down the desired network. But let us first take a look at said theorem.

Theorem (see Theorem 1 in (Zhang et al., 2016)). There exists a two-layer neural network with ReLU activations and $2n+d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.

That means if we choose $n$ mutually distinct samples $z_1, \ldots, z_n \in \mathbb{R}^d$ and real-valued labels $y_1, \ldots, y_n \in \mathbb{R}$, there is a $2$-layer neural network $C$ depending on only $2n+d$ weight parameters such that for $i=1, \ldots , n$ we have

\[C(z_i) = y_i.\]

Moreover the network is of the form $\mathbb{R}^d \stackrel{F}{\to} \mathbb{R}^n \stackrel{w}{\to} \mathbb{R} $ with activation function given by \(\alpha(x) = \max \{ x ,0 \}\).

Remark and Question. What is somewhat interesting is that $d$ of the parameters can be chosen generically, i.e. a random choice will almost certainly give the desired result. This is not immediately clear from the presentation in (Zhang et al., 2016). If someone has a comment on this I’d be happy to hear about it!
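To illustrate the remark, here is a small NumPy sketch (my own addition; the sizes are made up). Projecting randomly drawn samples along a randomly drawn direction almost surely keeps them mutually distinct:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
z = rng.normal(size=(n, d))    # n mutually distinct samples in R^d
a = rng.normal(size=d)         # a randomly chosen projection direction
x = z @ a                      # x_i = a . z_i
assert len(np.unique(x)) == n  # the n projections are (almost surely) distinct
```

The set of directions $a$ for which two projections collide is a finite union of hyperplanes, hence has measure zero, which is why a random draw works.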

Alternative version of the proof

First we are going to reduce the problem to the case where the samples are chosen from $\mathbb{R}$. Choose $a \in \mathbb{R}^d$ such that $x_i := a \cdot z_i$ are mutually distinct, i.e. we have

\[x_i \neq x_j \text{, for $i \neq j$.}\]

Relabel the data points and add a value $x_0$ such that we have

\[x_0 < x_1 < \ldots < x_n.\]

(Note that a generic $a$ will do the job.) We are now in the one-dimensional case and proceed by defining a family $\{f_i\}$ of affine-linear functions depending on only $n$ parameters, which, combined with the previous projection and the rectifier, will give the first layer of the desired network. We define $f_i \colon\thinspace \mathbb{R} \to \mathbb{R}$ by

\[f_i(x) := \tfrac{1}{x_i - x_{i-1}} (x - x_{i-1}).\]
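These are simple increasing ramps. As a quick numerical sanity check (my own addition, with randomly drawn breakpoints), the following NumPy sketch verifies that $f_i(x_i) = 1$ and that $f_i$ is nonpositive at all breakpoints below $x_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
x = np.sort(rng.uniform(size=n + 1))  # breakpoints x_0 < x_1 < ... < x_n

def f(i, t):
    # f_i(x) = (x - x_{i-1}) / (x_i - x_{i-1}), affine-linear in x
    return (t - x[i - 1]) / (x[i] - x[i - 1])

for i in range(1, n + 1):
    assert np.isclose(f(i, x[i]), 1.0)             # f_i(x_i) = 1
    assert all(f(i, x[j]) <= 0 for j in range(i))  # f_i(x) <= 0 for x <= x_{i-1}
```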

Note that we have $f_i(x_i) = 1$, and $f_i(x) \leq 0$ for $x \leq x_{i-1}$. We are now ready to define the final layer of the network. Set $w_1 := y_1$ and define iteratively

\[w_j := y_j - \sum_{i=1}^{j-1} f_i(x_j)\cdot w_i.\]

One easily computes that

\[y_j = \sum_{i=1}^{n} \max\big\{ f_i(a \cdot z_j) , 0 \big\} \cdot w_i.\]

To derive the definition of the $w_i$, make the ansatz $$ y_i = C(z_i) = \sum_{j=1}^{n} (\alpha \circ f_j)(x_i) \cdot w_j = w_i + \sum_{j=1}^{i-1} f_j(x_i) \cdot w_j, $$ where the last equality uses $f_i(x_i) = 1$ and $f_j(x_i) \leq 0$ for $j > i$, and solve for $w_i$.

With $F(z) = [ f_1(a \cdot z), \ldots, f_n(a \cdot z) ] $ this can be expressed as \([ (\alpha \circ F)(z_j)]^T \cdot w = y_j\). Note that $F$ is an affine-linear function $\mathbb{R}^d \to \mathbb{R}^n$ that depends on $d + n$ weights, namely those coming from $a$ and $x_1,\ldots,x_n$. With the additional $n$ weights defining $w$ we conclude the proof. QED
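To convince myself that the construction works end to end, here is a small NumPy sketch (my own addition; the sizes, data, and variable names are made up) that builds $F$ and $w$ as above and checks $C(z_i) = y_i$ on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 8
z = rng.normal(size=(n, d))  # n samples in R^d
y = rng.normal(size=n)       # arbitrary real-valued labels

a = rng.normal(size=d)       # generic projection direction, d weights
t = z @ a
order = np.argsort(t)
z, y, x = z[order], y[order], t[order]  # relabel so that x_1 < ... < x_n
b = np.concatenate([[x[0] - 1.0], x])   # b = (x_0, x_1, ..., x_n), x_0 < x_1

def F(sample):
    s = sample @ a
    return (s - b[:-1]) / (b[1:] - b[:-1])  # (f_1(a.z), ..., f_n(a.z))

# The recursion for w_j is a forward substitution: since f_i(x_j) <= 0 for
# i > j and f_j(x_j) = 1, the matrix [max(f_i(x_j), 0)] is lower triangular
# with unit diagonal.
w = np.zeros(n)
for j in range(n):
    w[j] = y[j] - np.maximum(F(z[j])[:j], 0.0) @ w[:j]

C = lambda sample: np.maximum(F(sample), 0.0) @ w  # the 2-layer network
assert np.allclose([C(zi) for zi in z], y)         # C(z_i) = y_i for all i
```

The weight count matches the theorem: $d$ entries of `a`, $n$ breakpoints $x_1, \ldots, x_n$ defining the first layer, and $n$ output weights `w`, for $2n + d$ in total.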


  1. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.