In the present blog post I would like to present a slighlty altered version of the proof of a theorem in (Zhang et al., 2016). The statement is about the *finite-sample expressivity* of neural networks. You can basically take the approach and present it in a arguably more direct and compact fashion, and explicitly writing down the desired network. But let us first take a look at said theorem.

** Theorem** (see Theorem 1 in (Zhang et al., 2016)). There exists a two-layer neural network with ReLU activations and $2n+d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.

That means if we choose $n$ mutually distinct samples $z_1, \ldots, z_n \in \mathbb{R}^d$ and real valued labels $y_1, \ldots, y_n \in \mathbb{R}$ there is a $2$-layer neural network $C$ depending only on $2n+d$ weight parameters such that for $i=1, \ldots , n$ we have

\[C(z_i) = y_i.\]Moreover the network is of the form $\mathbb{R}^d \stackrel{F}{\to} \mathbb{R}^n \stackrel{w}{\to} \mathbb{R} $ with activation function given by \(\alpha(x) = \max \{ x ,0 \}\).

**Remark and Question.** What is somewhat interesting is that $d$ parameters can be chosen generically, i.e. a random choice will almost certainly give the desired result. This is not immediately clear from the presentation in (Zhang et al., 2016). If someone has a comment on this I’d be happy to hear about it!

# Alternative version of the proof

First we are going to reduce the problem to the case where the samples are chosen from $\mathbb{R}$. Choose $a \in \mathbb{R}^d$ such that $x_i := a \cdot z_i$ are mutually distinct, i.e. we have

\[x_i \neq x_j \text{, for $i \neq j$.}\]Relabel the data points and add a value $x_0$ such that we have

\[x_0 < x_1 < \ldots < x_n.\](Note that a generic $a$ will do the job.) We are now in the one-dimensional case, and will proceed by defining a family ${f_i}$ of affine-linear functions, that will only depend on $n$ parameters, which combined with the previous projection, and the rectifier will give the first layer of the desired network. We define $f_i \colon\thinspace \mathbb{R} \to \mathbb{R}$ by

\[f_i(x) := \tfrac{1}{x_i - x_{i-1}} (x - x_{i-1}).\]Note that we have $f_i(x_i) = 1$, and $f_i(x) \leq 0$ for $x \leq x_{i-1}$. We are now ready to define the final layer of the network. Set $w_1 := y_1$ and define iteratively

\[w_j := y_j - \sum_{i=1}^{j-1} f_i(x_j)\cdot w_i.\]One easily computes that

\[y_j = \sum_{i=1}^{n} \max\big\{ f_i(a \cdot z_j) , 0 \big\} \cdot w_i.\]With $F(z) = [ f_1(a \cdot z), \ldots, f_n(a \cdot z) ] $ this can be expressed as \([ (\alpha \circ F)(z_j)]^T \cdot w = y_j\). Note that $F$ is a affine linear function $\mathbb{R}^d \to \mathbb{R}^n$ that depends on $d + n$ weights, namely those coming from $a$ and $x_1,\ldots,x_n$. With the additional $n$ weights defining $w$ we conclude the proof. **QED**

# References

- Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization.
*ArXiv Preprint ArXiv:1611.03530*.