
Principle of Neural Networks

Inspiration

Neural networks draw their inspiration from the neurons of the human brain.
(figure: neurons in the human brain)

In the AI world, this firing behaviour becomes a binary property: a unit is either activated or not, which motivates the sigmoid function below.

Sigmoid Function

\[h_\theta(X)=g(\theta^TX)=\frac{1}{1+e^{-\theta^TX}}=\frac{1}{1+e^{-z}}=P(y=1|x)\]


where \(z=\theta^TX\).

(figure: the sigmoid curve)
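
As a quick sketch (using numpy; the helper names `sigmoid` and `hypothesis` are my own), the hypothesis can be computed as:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(X) = g(X theta) = P(y=1|x), for X of shape (n_samples, n_features)."""
    return sigmoid(X @ theta)
```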

Cost Function

MLE (maximum likelihood estimation)

According to the definition of likelihood:

\[L(\theta)=\prod_{i=1}^{n}P(y_i|x_i;\theta)\]

and since \(y\) obviously follows a Bernoulli distribution,

\[P(y|x;\theta)=h_\theta(x)^{y}(1-h_\theta(x))^{1-y}\]

Taking the log on both sides:

\[\log L(\theta)=\sum_{i=1}^{n}[y_i\log h_\theta(x_i)+(1-y_i)\log(1-h_\theta(x_i))]\]

The total cost is to maximize the log-likelihood

\[\max_\theta \sum_{i=1}^{n}[y_i\log h_\theta(x_i)+(1-y_i)\log(1-h_\theta(x_i))]\]

or, equivalently, to minimize

\[J(\theta)=-\frac{1}{n}\sum_{i=1}^{n}[y_i\log h_\theta(x_i)+(1-y_i)\log(1-h_\theta(x_i))]\]
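
As a minimal sketch of this cost (reusing the `hypothesis` helper above; the name `cost` is my own), averaged over the \(n\) samples:

```python
def cost(theta, X, y):
    """Negative average log-likelihood (cross-entropy) J(theta) for logistic regression."""
    h = hypothesis(theta, X)   # predicted P(y=1|x) for each sample
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```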

Gradient Descent

If the negative log-likelihood (just a different form of the \(J(\theta)\) above) is:

\[NLL(D,\mathbf{W})=-\sum_{i=1}^{n}[(1-y_i)\log(1-\sigma(\mathbf{W}^TX_i)) + y_i\log\sigma(\mathbf{W}^TX_i)]\]

then the gradient descent process is

\[\mathbf{W}_j := \mathbf{W}_j - \alpha\frac{\partial NLL(D,\mathbf{W})}{\partial\mathbf{W}_j}\]

where \(\alpha\) is the learning rate, and the gradient is derived below using the fact that the sigmoid derivative is \(\sigma'(z)=\sigma(z)(1-\sigma(z))\).

Then \(\forall j \in \{1, 2, \dots, m\}\):

\(\frac{\partial NLL(D, \mathbf{W})}{\partial\mathbf{W}_j}= -\frac{\partial\sum_{i=1}^{n}[(1-y_i)\log(1-\sigma(\mathbf{W}^TX_i)) + y_i\log\sigma(\mathbf{W}^TX_i)]}{\partial\mathbf{W}_j}\)
\(\frac{\partial NLL(D, \mathbf{W})}{\partial\mathbf{W}_j}= -\sum_{i=1}^{n}[y_i\frac{1}{\sigma(\mathbf{W}^TX_i)}\frac{\partial\sigma(\mathbf{W}^TX_i)}{\partial\mathbf{W}_j} - (1-y_i)\frac{1}{1-\sigma(\mathbf{W}^TX_i)}\frac{\partial\sigma(\mathbf{W}^TX_i)}{\partial\mathbf{W}_j}]\)
\(\frac{\partial NLL(D, \mathbf{W})}{\partial\mathbf{W}_j}= -\sum_{i=1}^{n}[y_i\frac{1}{\sigma(\mathbf{W}^TX_i)} - (1-y_i)\frac{1}{1-\sigma(\mathbf{W}^TX_i)}]\frac{\partial\sigma(\mathbf{W}^TX_i)}{\partial\mathbf{W}_j}\)
\(\frac{\partial NLL(D, \mathbf{W})}{\partial\mathbf{W}_j}= -\sum_{i=1}^{n}[y_i\frac{1}{\sigma(\mathbf{W}^TX_i)} - (1-y_i)\frac{1}{1-\sigma(\mathbf{W}^TX_i)}]\sigma(\mathbf{W}^TX_i)(1-\sigma(\mathbf{W}^TX_i))\frac{\partial\mathbf{W}^TX_i}{\partial\mathbf{W}_j}\)
\(\frac{\partial NLL(D, \mathbf{W})}{\partial\mathbf{W}_j}= -\sum_{i=1}^{n}[y_i(1-\sigma(\mathbf{W}^TX_i)) - (1-y_i)\sigma(\mathbf{W}^TX_i)]X_i^j\)
\(\frac{\partial NLL(D, \mathbf{W})}{\partial\mathbf{W}_j}= -\sum_{i=1}^{n}[y_i - \sigma(\mathbf{W}^TX_i)]X_i^j\)
\(\frac{\partial NLL(D, \mathbf{W})}{\partial\mathbf{W}_j}= \sum_{i=1}^{n}[\sigma(\mathbf{W}^TX_i) - y_i]X_i^j\)

where \(X_i^j\) is the \(j_{th}\) feature (component) of sample \(X_i\).
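
The final line of the derivation translates directly into a vectorized expression; a sketch (reusing the `sigmoid` helper above):

```python
def nll_gradient(W, X, y):
    """Gradient of NLL w.r.t. W: sum_i [sigma(W^T X_i) - y_i] X_i."""
    errors = sigmoid(X @ W) - y   # shape (n_samples,)
    return X.T @ errors           # shape (n_features,)
```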

Stochastic Gradient Descent (SGD)

Updating \(\mathbf{W}_t\) based on a single pair \((X_i, y_i)\), the SGD step is:

\[\mathbf{W}_{t+1} = \mathbf{W}_t - \alpha[\sigma(\mathbf{W}_t^TX_i) - y_i]X_i\]

With L2 norm regularization (\(J= NLL + \mu\Vert{\mathbf{W}}\Vert_2^2\)), the SGD step becomes:

\[\mathbf{W}_{t+1} = \mathbf{W}_t - \alpha([\sigma(\mathbf{W}_t^TX_i) - y_i]X_i + 2\mu\mathbf{W}_t)\]
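
A minimal sketch of this single-sample update (the learning rate `alpha` and regularization strength `mu` are assumed hyperparameters):

```python
def sgd_step(W, x_i, y_i, alpha=0.1, mu=0.0):
    """One SGD update from a single (x_i, y_i) pair, with optional L2 penalty mu * ||W||^2."""
    grad = (sigmoid(x_i @ W) - y_i) * x_i   # per-sample gradient of NLL
    grad += 2.0 * mu * W                    # gradient of the L2 term
    return W - alpha * grad
```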

Neural Network Basic Form

A basic neural network structure has one input layer, some number of hidden layers, and one output layer. In the illustration below, each arrow corresponds to an activation function; one popular choice of activation is logistic regression (the sigmoid unit). Each activation neuron in a layer is the output of an activation function applied to the previous layer's activation neurons.
(figure: a simple neural network)

Neural Network Forward Propagation

In the above example, each layer of activation neurons (the input can be viewed as the \(0_{th}\) layer of activation neurons) is fed into a set of logistic regressions that produce the next layer of neurons, until the final binary or multiclass output.

Formally, forward propagation does the following:

\[z^{(l+1)}=\Theta^{(l)}a^{(l)},\qquad a^{(l+1)}=g(z^{(l+1)}),\qquad a^{(1)}=x\]

where \(a^{(l)}\) is the vector of activation neurons in layer \(l\), \(\Theta^{(l)}\) is the weight matrix mapping layer \(l\) to layer \(l+1\), and \(g\) is the activation (sigmoid) function.
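
A sketch of forward propagation under these definitions (my own helper; each `Thetas[l]` is assumed to include a bias column, so a bias unit of 1 is prepended to every layer's activations):

```python
def forward_propagate(x, Thetas):
    """Return the activations a^(1), ..., a^(L) for a single input x."""
    a = np.asarray(x, dtype=float)
    activations = [a]
    for Theta in Thetas:
        a_bias = np.concatenate(([1.0], a))   # prepend the bias unit
        z = Theta @ a_bias                    # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)                        # a^(l+1) = g(z^(l+1))
        activations.append(a)
    return activations
```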

Cost Function

We know each activation function in the network is a single logistic regression, which takes its input from the previous layer's neurons and sends its output to the next layer's neurons.

One form of the cost function for a multiclass neural network, with the total cost including the L2 norm, is:

\[J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}[y_k^{(i)}\log(h_\Theta(x^{(i)}))_k+(1-y_k^{(i)})\log(1-(h_\Theta(x^{(i)}))_k)]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2\]

where \(m\) is the number of training samples, \(K\) is the number of output classes, \(L\) is the number of layers, \(s_l\) is the number of units in layer \(l\), and \(\lambda\) is the regularization strength.
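
A sketch of this multiclass cost with the L2 penalty (assuming the `forward_propagate` helper above, one-hot rows in `Y`, and a bias column in each `Theta` that is excluded from regularization):

```python
def nn_cost(Thetas, X, Y, lam):
    """Multiclass cross-entropy cost with L2 regularization on non-bias weights."""
    m = X.shape[0]
    J = 0.0
    for x, y in zip(X, Y):                     # y is a one-hot vector of length K
        h = forward_propagate(x, Thetas)[-1]   # output-layer activations
        J += -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
    J /= m
    J += lam / (2.0 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J
```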

Another form of the cost function, without regularization, is:

\[J(\Theta)=\frac{1}{m}\sum_{i=1}^{m}cost(i),\qquad cost(i)=-\sum_{k=1}^{K}[y_k^{(i)}\log(a_k^{(L)})+(1-y_k^{(i)})\log(1-a_k^{(L)})]\]

with the same notation as above.

Backward Propagation

Notation

(figure: a 4-layer neural network)

Given the 4-layer NN structure above, and following the previous notation conventions, we define:

  * \(a_j^{(l)}\): the activation of unit \(j\) in layer \(l\)
  * \(z_j^{(l)}\): the weighted input to unit \(j\) in layer \(l\), with \(a_j^{(l)}=g(z_j^{(l)})\)
  * \(\Theta^{(l)}\): the weight matrix mapping layer \(l\) to layer \(l+1\), with entries \(\Theta_{ij}^{(l)}\)
  * \(L\): the total number of layers (here \(L=4\))

Gradient Computation

‘Error’ term

Intuition: $\delta_j^{(l)}$ is the ‘error’ of node $j$ in layer $l$.

For the output layer (\(.*\) is element-wise vector multiplication; everything here is in vector form):

\[\delta^{(4)}=a^{(4)}-y\]

and for the hidden layers:

\[\delta^{(3)}=[(\Theta^{(3)})^T\delta^{(4)}].*g'(z^{(3)}),\qquad \delta^{(2)}=[(\Theta^{(2)})^T\delta^{(3)}].*g'(z^{(2)})\]

Gradient Derivation

  1. \[\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=a_j^{(l)}\delta_i^{(l+1)}\]

derivation (ignoring regularization):

$\frac{\partial{L}}{\partial{\Theta_{ij}^{(l)}}}=\frac{\partial{L}}{\partial{z_{i}^{(l+1)}}}\frac{\partial{z_{i}^{(l+1)}}}{\partial{\Theta_{ij}^{(l)}}}=\delta_i^{(l+1)}a_j^{(l)}$

  2. \[\delta_{j}^{(l)}=\frac{\partial}{\partial{z_{j}^{(l)}}}cost(i)=[(\delta^{(l+1)})^T\Theta_j^{(l)}]g'(z_j^{(l)})\]

derivation:

\[\delta_{j}^{(l)}=\frac{\partial cost(i)}{\partial{z_{j}^{(l)}}}=\sum_{k}\frac{\partial cost(i)}{\partial{z_{k}^{(l+1)}}}\frac{\partial{z_{k}^{(l+1)}}}{\partial{z_{j}^{(l)}}}=\sum_{k}\delta_k^{(l+1)}\Theta_{kj}^{(l)}g'(z_j^{(l)})\]

Be careful with the sum over \(k\) here: because of the summed error terms in the cost function, \(z_j^{(l)}\) affects every unit \(k\) in layer \(l+1\), so the errors from all of those units must be summed.

Thus, in vector form, we have:

\[\delta^{(l)}=[(\Theta^{(l)})^T\delta^{(l+1)}].*g'(z^{(l)})\]

Backward propagation computation process

Set \(\Delta_{ij}^{(l)}=0\) (the error accumulators) for all layers \(l\) and indices \(i, j\). Then, for \(i=1\) to \(m\):

  * set \(a^{(1)}=x^{(i)}\) and run forward propagation to compute \(a^{(l)}\) for \(l=2,\dots,L\)
  * compute \(\delta^{(L)}=a^{(L)}-y^{(i)}\), then \(\delta^{(L-1)},\dots,\delta^{(2)}\) using the backward recurrence above
  * accumulate \(\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\)

Finally:

\[\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta)=D_{ij}^{(l)}=\begin{cases}\frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\Theta_{ij}^{(l)} & j\neq 0\\ \frac{1}{m}\Delta_{ij}^{(l)} & j=0\end{cases}\]
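
A sketch of this accumulation loop (reusing the `forward_propagate` helper above; sigmoid activations, so \(g'(z)=a(1-a)\), and `lam` is an assumed regularization strength):

```python
def backprop_gradients(Thetas, X, Y, lam):
    """Return gradients D^(l) of J with respect to each Theta^(l)."""
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]
    for x, y in zip(X, Y):
        acts = forward_propagate(x, Thetas)          # a^(1), ..., a^(L)
        delta = acts[-1] - y                         # delta^(L) = a^(L) - y
        for l in reversed(range(len(Thetas))):
            a_bias = np.concatenate(([1.0], acts[l]))
            Deltas[l] += np.outer(delta, a_bias)     # Delta^(l) += delta^(l+1) (a^(l))^T
            if l > 0:
                # delta^(l) = (Theta^(l))^T delta^(l+1) .* g'(z^(l)), dropping the bias entry
                delta = (Thetas[l].T @ delta)[1:] * acts[l] * (1.0 - acts[l])
    Ds = [Delta / m for Delta in Deltas]
    for D, T in zip(Ds, Thetas):
        D[:, 1:] += (lam / m) * T[:, 1:]             # regularize all but the bias column
    return Ds
```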

BP Intuition

(figure: backward propagation through the network)

Similar to the intuition for FP, \(\delta_{j}^{(l)}\) is the 'error' of the cost for \(a_{j}^{(l)}\); it is the weighted sum of the next layer's 'errors', weighted by this layer's corresponding \(\theta\)s and multiplied by the first derivative of the activation function. The gradient for each \(\theta\) is the product of the next layer's corresponding error and this layer's corresponding activation unit value.

The key idea of using gradient descent to optimize \(\Theta\) in a neural network over \(m\) training samples is to understand the error terms and gradients computed by backward propagation.

Random Initialization

If \(\Theta\) is initialized to all zeros (or any identical values), the gradient updates will also be identical and the hidden units remain symmetric, so \(\Theta\) needs to be initialized with small random values.
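
A common sketch (the bound `epsilon` is an assumed small constant; some references pick it based on the layer sizes):

```python
def random_init(shape, epsilon=0.12):
    """Initialize a weight matrix uniformly in [-epsilon, epsilon] to break symmetry."""
    return np.random.uniform(-epsilon, epsilon, size=shape)
```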
