## Back-propagation of gradient for simplest network, cont'd

Summary of notation:

| symbol | meaning |
|---|---|
| $c$ | number of classes |
| $j, k$ | general class indices |
| $n$ | number of training images |
| $i$ | general image index |
| $hw$ | number of pixels per image |
| $q$ | general pixel index |
| $W^T$ | array of weights ($c$ by $hw$) |
| $X$ | array of flattened images ($hw$ by $n$) |
| $S$ | array of image class scores ($c$ by $n$), $S = W^T X$ |
| $P$ | array of image class "probabilities" (output of softmax) ($c$ by $n$) |
| $l$ | array of individual image losses (length $n$) |
| $L$ | overall (scalar) loss |
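
As a quick check on these shapes, here is a tiny NumPy sketch (all dimensions here are made up for illustration):

```python
import numpy as np

c, hw, n = 3, 784, 10                  # e.g. 3 classes, 28x28 = 784 pixels, 10 images
rng = np.random.default_rng(0)

WT = np.zeros((c, hw))                 # W^T: one row of weights per class
X = rng.random((hw, n))                # one flattened image per column
S = WT @ X                             # image class scores
assert S.shape == (c, n)               # one column of scores per image
```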

## Numpy implementation of loss and gradient

Plan:

1. Write code that computes the loss and also the gradient via back-prop (see the first sketch after this list).
   - Let's make it as vectorized as possible.
   - Let's develop the code with small random data.

2. Check it for correctness by comparing with a finite-difference approximation (see the second sketch after this list).

3. Train the network by gradient descent (a training-loop sketch follows the guidelines below).
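
For step 1, here is one possible vectorized sketch. It assumes the labels come as an integer vector `y` (not part of the notation table above) and takes $L$ to be the mean of the per-image losses:

```python
import numpy as np

def loss_and_grad(WT, X, y):
    """Vectorized softmax loss and gradient.

    WT : (c, hw) weights, one row per class
    X  : (hw, n) flattened images, one column per image
    y  : (n,)   integer class label of each image
    """
    n = X.shape[1]
    S = WT @ X                                # scores, (c, n)
    S = S - S.max(axis=0, keepdims=True)      # shift scores for safety; P is unchanged
    E = np.exp(S)
    P = E / E.sum(axis=0, keepdims=True)      # softmax "probabilities", (c, n)
    l = -np.log(P[y, np.arange(n)])           # per-image losses, (n,)
    L = l.mean()                              # overall scalar loss
    # back-prop: dL/dS_{ji} = (P_{ji} - [j == y_i]) / n
    dS = P.copy()
    dS[y, np.arange(n)] -= 1.0
    dS /= n
    dWT = dS @ X.T                            # dL/dW^T, (c, hw)
    return L, dWT
```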
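
For step 2, the back-prop gradient can be checked against a central finite-difference approximation. This sketch reuses `loss_and_grad` from above on small random data (the dimensions are made up):

```python
def numerical_grad(f, WT, eps=1e-5):
    """Central finite differences: df/dWT, one entry at a time."""
    g = np.zeros_like(WT)
    it = np.nditer(WT, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        old = WT[idx]
        WT[idx] = old + eps; fp = f(WT)       # f at WT + eps in this entry
        WT[idx] = old - eps; fm = f(WT)       # f at WT - eps in this entry
        WT[idx] = old                         # restore
        g[idx] = (fp - fm) / (2 * eps)
    return g

# small random data for development
rng = np.random.default_rng(0)
c, hw, n = 3, 8, 5
X = rng.random((hw, n))
y = rng.integers(0, c, size=n)
WT = 0.01 * rng.standard_normal((c, hw))

L, dWT = loss_and_grad(WT, X, y)
dWT_num = numerical_grad(lambda W: loss_and_grad(W, X, y)[0], WT)
print(np.max(np.abs(dWT - dWT_num)))          # should be tiny, around 1e-9
```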

A few guidelines (a training-loop sketch combining them follows this list):

- invert the images in X
- normalize X by its sum, say, to avoid plugging large numbers into the exponential function
- start with all weights 0 to avoid prejudice
- a journey begins with a single step - see where a single step takes us
- it helps to picture the rows of $W^T$: each has length $hw$, so it can be displayed as an image
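
Putting the guidelines together, a training loop might look like the sketch below, continuing from the earlier sketches. `X_raw`, the learning rate, and the step count are hypothetical stand-ins, and I'm assuming 8-bit images, so that "invert" means `255 - pixel`:

```python
# hypothetical raw data: replace with real images and labels
rng = np.random.default_rng(1)
c, hw, n = 3, 784, 100
X_raw = rng.integers(0, 256, size=(hw, n)).astype(float)   # fake 8-bit images
y = rng.integers(0, c, size=n)

X = 255.0 - X_raw           # invert the images
X = X / X.sum()             # normalize by the sum: keeps exp() arguments small
WT = np.zeros((c, hw))      # start with all weights 0

lr = 1.0                    # made-up learning rate
for step in range(100):     # made-up number of steps
    L, dWT = loss_and_grad(WT, X, y)
    if step == 0:
        print("initial loss:", L)   # log(c) with zero weights
    WT -= lr * dWT          # one gradient-descent step

# to "see where a single step takes us", stop after one step and display
# each row of WT reshaped back to the original image shape
```

One sanity check on the zero initialization: with all weights 0, every class gets probability $1/c$, so the loss before the first step should come out to exactly $\log c$.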