So far, we have seen how a neural network is a series of linear transformations interleaved
with non-linear activations.
Here’s a simple example: the one-layer network we used in Part 1, Representing Layers and Connections.
Take a look at the formula for the linear transformations that we defined
in that article:
\(\mathbf{h} = W\mathbf{x}\)
Each \(h_i\) is a dot product of the respective row of \(W\) and the input.
\begin{gather*}
h^{(1)}_1 = w_{11}\times{} x_1 + w_{12}\times{} x_2\\
h^{(1)}_2 = w_{21}\times{} x_1 + w_{22}\times{} x_2\\
h^{(1)}_3 = w_{31}\times{} x_1 + w_{32}\times{} x_2\\
h^{(1)}_4 = w_{41}\times{} x_1 + w_{42}\times{} x_2\\
\end{gather*}
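In code, this is nothing more than a matrix-vector product. Here is a minimal sketch, assuming the Neanderthal functions ge, vctr, and mv that we use throughout this series, and assuming a 4x2 weight matrix with its entries supplied in Neanderthal’s default column-major order (the same numbers as the first-layer weights in the example below).

(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.neanderthal.core :refer [ge vctr mv]]
         '[uncomplicate.neanderthal.native :refer [native-float]])

;; W is 4x2, entries given column by column: first [w11 w21 w31 w41],
;; then [w12 w22 w32 w42]. x is the input (0.3, 0.9).
(with-release [w (ge native-float 4 2 [0.3 0.1 0.9 0.0
                                       0.6 2.0 3.7 1.0])
               x (vctr native-float [0.3 0.9])]
  (mv w x))
;; => a vector of four entries; each h_i is the dot product of row i of W
;;    with x, for example h_1 = 0.3 * 0.3 + 0.6 * 0.9 = 0.63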
Until now, we have set the initial weights and inputs to be in the range \([0, 1]\),
as in the following example that you have seen many times by now.
(with-release [x (ge native-float 2 2 [0.3 0.9 0.3 0.9])
               y (ge native-float 1 2 [0.50 0.50])
               inference (inference-network
                          native-float 2
                          [(fully-connected 4 tanh)
                           (fully-connected 1 sigmoid)])
               inf-layers (layers inference)
               training (training-network inference x)]
  (transfer! [0.3 0.1 0.9 0.0 0.6 2.0 3.7 1.0] (weights (inf-layers 0)))
  (transfer! [0.7 0.2 1.1 2] (bias (inf-layers 0)))
  (transfer! [0.75 0.15 0.22 0.33] (weights (inf-layers 1)))
  (transfer! [0.3] (bias (inf-layers 1)))
  (sgd training y quadratic-cost! 2000 0.05)
  (transfer (inference x)))
nil#RealGEMatrix[float, mxn:1x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.50    0.50
   ┗                       ┛
As all operands are between 0 and 1, the dot products \(h\) are likely to stay in that range,
or at least not grow much larger.
If any of the weights \(w\) or inputs \(x\) are large numbers, \(h\) can become large, too.
If the network didn’t have non-linear activations, the inputs to each following layer could
grow uncontrollably, and that growth would propagate through the network. Some activation functions,
such as ReLU, are linear in the positive domain, so with them the growth would reach the output unchecked.
The sigmoid and hyperbolic tangent activation functions would saturate: at the upper bound they
approach \(1\), and at the lower bound they approach \(0\) (sigmoid) or \(-1\) (tanh).
Although saturation would keep the inputs to the next layer within the \([-1, 1]\) range,
it would make learning difficult, since saturated functions have an almost flat slope
and therefore have trouble propagating the gradients backwards.
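To get a feeling for how little gradient a saturated neuron lets through, here is a tiny sketch in plain Clojure (no Neanderthal needed) that prints tanh and its derivative \(1 - \tanh^2(x)\) for a few magnitudes of the input.

;; For large inputs tanh is pinned near 1, and its derivative 1 - tanh(x)^2
;; is practically zero, so almost no gradient flows back through the neuron.
(doseq [x [0.5 2.0 5.0 10.0]]
  (let [t (Math/tanh x)]
    (println (format "x = %5.1f  tanh(x) = %.6f  derivative = %.8f"
                     x t (- 1.0 (* t t))))))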
We are still using a trivial example, which easily illustrates this problem
(that’s why I’m still keeping it, despite it being silly).
Just change the weights to numbers larger than one. Even though we are only
chasing one input/output example (\((0.3, 0.9) \mapsto 0.5\)), where there is nothing even remotely
challenging to learn, our algorithm gets stuck in the saturation zone right away.
(with-release [x (ge native-float 2 2 [0.3 0.9 0.3 0.9])
               y (ge native-float 1 2 [0.50 0.50])
               inference (inference-network
                          native-float 2
                          [(fully-connected 4 tanh)
                           (fully-connected 1 sigmoid)])
               inf-layers (layers inference)
               training (training-network inference x)]
  (transfer! [3 1 9 0 6 20 37 10] (weights (inf-layers 0)))
  (transfer! [7 2 11 2] (bias (inf-layers 0)))
  (transfer! [75 15 22 33] (weights (inf-layers 1)))
  (transfer! [3] (bias (inf-layers 1)))
  (sgd training y quadratic-cost! 2000 0.05)
  (transfer (inference x)))
nil#RealGEMatrix[float, mxn:1x2, layout:column, offset:0]
   ▥       ↓       ↓       ┓
   →       0.00    0.00
   ┗                       ┛
It is obvious that we should keep the average absolute value of the weights below 1. But how small should they be?
If the weights are too small, the signal will be feeble. A feeble signal might not cause problems
in a small network, but when passed through a large number of layers it gets dampened before
reaching the output. I’d need a larger example to illustrate this properly, so for now I’ll have to ask
you to trust me (or glance at the toy sketch below).
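Still, a scalar caricature can hint at what goes wrong. The snippet below is my own toy illustration, not a real network: it treats each "layer" as multiplying the signal by a single small weight and applying tanh, which ignores the fact that a real dot product sums many inputs.

;; Toy illustration in plain Clojure: each "layer" scales the signal by a
;; small weight (0.1 here) and applies tanh. The signal fades by roughly
;; an order of magnitude per layer, so after ten layers almost nothing
;; reaches the output.
(take 11 (iterate (fn [signal] (Math/tanh (* 0.1 signal))) 1.0))
;; => approximately (1.0 0.0997 0.00996 9.96E-4 ... 1.0E-10)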
Let’s say, then, that 0.001 is not too small, and yet not too large.
Why don’t we pick one universally good value and set all the weights to it? The problem with that approach
is that all the neurons in a layer would behave in exactly the same manner. We wouldn’t have the variability
among neurons that is needed for proper learning.
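It is easy to see why. If every weight in a layer has the same value, every row of \(W\) is identical, so every neuron computes exactly the same dot product and produces the same output; later on they would also tend to receive the same gradient updates and stay identical. A minimal sketch, again assuming the Neanderthal functions we use in this series:

(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.neanderthal.core :refer [ge vctr entry! mv]]
         '[uncomplicate.neanderthal.native :refer [native-float]])

;; All weights set to the same constant 0.001: every h_i equals
;; 0.001 * (x_1 + x_2), so the four neurons are indistinguishable.
(with-release [w (entry! (ge native-float 4 2) 0.001)
               x (vctr native-float [0.3 0.9])]
  (mv w x))
;; => a 4-element vector whose entries are all approximately 0.0012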
Although there is no universal best strategy for setting weights, a few things are certain enough:
- 1) it should be done automatically
- 2) the values should be small enough to avoid saturation
- 3) the values should be large enough to be able to propagate the signal
- 4) the initial weights should be random
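Putting these four requirements together, the simplest thing that could possibly work is to fill each weight matrix with small random numbers centered around zero. The following is only a sketch of that idea, using a hypothetical init-weights! helper built from plain Clojure’s rand and Neanderthal’s transfer!; it is not the initialization scheme we will ultimately settle on.

(require '[uncomplicate.commons.core :refer [with-release]]
         '[uncomplicate.neanderthal.core :refer [ge mrows ncols transfer!]]
         '[uncomplicate.neanderthal.native :refer [native-float]])

;; Hypothetical helper, not part of this series' API: fill matrix w with
;; random values uniformly distributed in [-scale, scale).
(defn init-weights! [scale w]
  (transfer! (repeatedly (* (mrows w) (ncols w))
                         #(* scale (dec (* 2.0 (rand)))))
             w))

(with-release [w (ge native-float 4 2)]
  (init-weights! 0.001 w))
;; => a 4x2 matrix of small random weights, different on every run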