# Finite Sample Expressivity of Neural Networks

— 12 January 2017

In the present blog post I would like to present a slighlty altered version of the proof of a theorem in (Zhang et al., 2016). The statement is about the finite-sample expressivity of neural networks. You can basically take the approach and present it in a arguably more direct and compact fashion, and explicitly writing down the desired network. But let us first take a look at said theorem.

Theorem (see Theorem 1 in (Zhang et al., 2016)). There exists a two-layer neural network with ReLU activations and $2n+d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.

That means if we choose $n$ mutually distinct samples $z_1, \ldots, z_n \in \mathbb{R}^d$ and real valued labels $y_1, \ldots, y_n \in \mathbb{R}$ there is a $2$-layer neural network $C$ depending only on $2n+d$ weight parameters such that for $i=1, \ldots , n$ we have

$C(z_i) = y_i.$

Moreover the network is of the form $\mathbb{R}^d \stackrel{F}{\to} \mathbb{R}^n \stackrel{w}{\to} \mathbb{R}$ with activation function given by $\alpha(x) = \max \{ x ,0 \}$.

Remark and Question. What is somewhat interesting is that $d$ parameters can be chosen generically, i.e. a random choice will almost certainly give the desired result. This is not immediately clear from the presentation in (Zhang et al., 2016). If someone has a comment on this I’d be happy to hear about it!

# Alternative version of the proof

First we are going to reduce the problem to the case where the samples are chosen from $\mathbb{R}$. Choose $a \in \mathbb{R}^d$ such that $x_i := a \cdot z_i$ are mutually distinct, i.e. we have

$x_i \neq x_j \text{, for i \neq j.}$

Relabel the data points and add a value $x_0$ such that we have

$x_0 < x_1 < \ldots < x_n.$

(Note that a generic $a$ will do the job.) We are now in the one-dimensional case, and will proceed by defining a family ${f_i}$ of affine-linear functions, that will only depend on $n$ parameters, which combined with the previous projection, and the rectifier will give the first layer of the desired network. We define $f_i \colon\thinspace \mathbb{R} \to \mathbb{R}$ by

$f_i(x) := \tfrac{1}{x_i - x_{i-1}} (x - x_{i-1}).$

Note that we have $f_i(x_i) = 1$, and $f_i(x) \leq 0$ for $x \leq x_{i-1}$. We are now ready to define the final layer of the network. Set $w_1 := y_1$ and define iteratively

$w_j := y_j - \sum_{i=1}^{j-1} f_i(x_j)\cdot w_i.$

One easily computes that

$y_j = \sum_{i=1}^{n} \max\big\{ f_i(a \cdot z_j) , 0 \big\} \cdot w_i.$
To derive the definition of the $w_i$ make the ansatz $$y_i = F(z_i) = \sum_j (\alpha \circ f_j)(x_i) \cdot w_j = w_i + \sum_{j=i+1} f_j(x_i) \cdot w_j$$ and solve for $w_i$.

With $F(z) = [ f_1(a \cdot z), \ldots, f_n(a \cdot z) ]$ this can be expressed as $[ (\alpha \circ F)(z_j)]^T \cdot w = y_j$. Note that $F$ is a affine linear function $\mathbb{R}^d \to \mathbb{R}^n$ that depends on $d + n$ weights, namely those coming from $a$ and $x_1,\ldots,x_n$. With the additional $n$ weights defining $w$ we conclude the proof. QED

# References

1. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. ArXiv Preprint ArXiv:1611.03530.