# Appendix

## Vanishing Gradients

***Vanishing gradients*** occur when the gradients (partial derivatives of the loss function with respect to each weight) become very small as they propagate backward through a deep neural network during training. 

This is particularly common in networks with many layers, making it difficult for the model to learn.

As gradients move backward from the output to the input layer, they can diminish exponentially, leading to the *vanishing gradients* problem.

<img src='./img/stanford/7-vanishingGradients.png' width='600' height='300'>


👉 In very deep networks, <mark>repeated multiplication of small gradients across layers causes the gradient to vanish</mark> as it propagates.


For example, in magnitude: $\frac{\partial J^{(4)}}{\partial h^{(1)}} < \frac{\partial J^{(4)}}{\partial h^{(2)}} < \frac{\partial J^{(4)}}{\partial h^{(3)}} < \frac{\partial J^{(4)}}{\partial h^{(4)}}$. The more steps the gradient has to travel back through, the more it shrinks: the longest-distance gradients (those for the weights farthest from $J$) tend to vanish as they approach 0.
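
The shrinking chain of derivatives can be reproduced with a toy model. The sketch below is only an illustration, assuming a scalar RNN $h^{(t)} = \tanh(w\,h^{(t-1)} + x^{(t)})$ with a hypothetical weight `w = 0.5` and constant inputs; the function name `gradient_through_time` is made up for this example. It computes $\left|\partial h^{(T)} / \partial h^{(t)}\right|$, the factor that scales the loss gradient on its way back to step $t$.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_t| for t = 1..T in the scalar RNN h_t = tanh(w*h_{t-1} + x_t)."""
    # Forward pass: record every hidden state h_1 .. h_T.
    states = []
    h = h0
    for x in inputs:
        h = math.tanh(w * h + x)
        states.append(h)

    # Backward pass via the chain rule: d h_t / d h_{t-1} = w * (1 - h_t**2).
    grads = [1.0]  # d h_T / d h_T
    for h_t in reversed(states[1:]):
        grads.append(grads[-1] * w * (1.0 - h_t ** 2))
    grads.reverse()  # now grads[t-1] = d h_T / d h_t
    return [abs(g) for g in grads]

grads = gradient_through_time(w=0.5, inputs=[0.1] * 20)
print(f"step 20: {grads[-1]:.3e}  step 10: {grads[9]:.3e}  step 1: {grads[0]:.3e}")
```

Every backward step multiplies by a factor smaller than 0.5 here, so the contribution of the earliest step is several orders of magnitude smaller than that of the latest one.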

<img src='./img/stanford/7-whyVGaProblem.png' width='600' height='300'>


Because of vanishing gradients:

- **Slow or Stalled Learning:** Layers receive very small updates, slowing down or halting learning altogether.

- **Difficulty in Capturing Long-Range Dependencies:** The network struggles to learn relationships between distant elements, a key challenge in tasks like language modeling.



The longer the sequence, the worse this gets. It is a major problem for a vanilla RNN (the RNN introduced so far): the state is overwritten at every time step, which makes it hard or even impossible for the model to preserve long-distance dependencies. In other words, the farther back a piece of information lies, the harder it is for the model to retain it.


<img src='./img/stanford/7-VGEffectonRNN-LM0.png' width='600' height='300'>


## Exploding Gradient

***Exploding gradients*** refer to a problem in the training of deep neural networks where the gradients (partial derivatives of the loss function with respect to each weight) become excessively large. This causes unstable updates to the network's weights, leading to erratic behavior during training.

Similar to the vanishing gradients problem, exploding gradients occur during the backpropagation process, where gradients are propagated from the output layer back to the input layer to update the network's weights.

<mark>Unlike vanishing gradients, which cause gradients to become too small, exploding gradients cause them to become excessively large, often leading to numerical instability.</mark>


<img src='./img/stanford/7-explodingGradient.png' width='600' height='300'>


Because of exploding gradients, we can face:

- **Numerical Instability:** Extremely large gradients can cause numerical issues, such as overflow, resulting in `NaN` (Not a Number) values in the network's parameters.
- **Diverging Loss:** Instead of minimizing the loss function, the network's loss may increase uncontrollably, preventing the model from learning effectively.
- **Oscillating Weights:** Weight updates can become so large that the network oscillates around the minimum of the loss function without converging.


**Several techniques address exploding gradients; the most common is gradient clipping:**


<img src='./img/stanford/7-GradientClipping.png' width='600' height='300'>

Reference: ["Deep Learning", Goodfellow, Bengio and Courville, 2016. Chapter 10.11.1. Sequence Modeling: Recurrent and Recursive Nets](https://www.deeplearningbook.org/contents/rnn.html)
  
<img src='./img/stanford/7-GradientClipping2.png' width='600' height='300'>
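
Clipping by norm can be sketched in a few lines. This is a minimal NumPy illustration, not a framework implementation: the helper name `clip_by_global_norm` and the threshold value are assumptions made for the example. If the joint L2 norm of all gradients exceeds the threshold, every gradient is rescaled by the same factor, so the update direction is kept and only its length changes.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their joint L2 norm is at most `threshold`."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm <= threshold:
        # Small enough already: return unchanged copies.
        return [g.copy() for g in grads], total_norm
    scale = threshold / total_norm
    return [g * scale for g in grads], total_norm

# Hypothetical gradients for two parameter tensors; joint norm = sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
print(norm_before, clipped)
```

Deep learning frameworks ship their own version of this step (for instance `torch.nn.utils.clip_grad_norm_` in PyTorch), which is normally preferable to a hand-written helper.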

👉 More sophisticated optimizers, such as Adam, Adagrad, or RMSprop, can also help mitigate exploding gradients.



### General solutions to vanishing/exploding gradient 

Vanishing/exploding gradients are a problem not only for RNNs but for all neural networks (including feed-forward and convolutional ones), especially deep ones. **For RNNs, however, these problems are more serious because of their design (i.e., the repeated multiplication by the same weight matrix)**.

See: "Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf.
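
The effect of multiplying by the same weight matrix at every step can be checked numerically. The sketch below is a simplification that ignores the nonlinearity: it assumes a weight matrix of the form `scale * Q` with `Q` orthogonal, so each backward step rescales the gradient by exactly `scale`. The function name and all values are hypothetical.

```python
import numpy as np

def norm_after_repeated_multiplication(scale, steps, size=4, seed=0):
    """Norm of a unit-length gradient after `steps` multiplications by W^T, with W = scale * Q."""
    rng = np.random.default_rng(seed)
    # Q is orthogonal, so it preserves length; W changes it by `scale` per step.
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    w = scale * q
    grad = np.ones(size) / np.sqrt(size)  # unit-length starting gradient
    for _ in range(steps):
        grad = w.T @ grad
    return float(np.linalg.norm(grad))

shrinking = norm_after_repeated_multiplication(scale=0.9, steps=50)  # equals 0.9**50
growing = norm_after_repeated_multiplication(scale=1.1, steps=50)    # equals 1.1**50
print(shrinking, growing)
```

A scale slightly below 1 makes the gradient vanish and a scale slightly above 1 makes it explode, which is why long sequences are so sensitive to the recurrent weights.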

**<mark>Causes and solutions</mark>:** 

- Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it backpropagates
- Thus lower layers are learnt very slowly (hard to train)
- **Solution**: lots of new deep feedforward/convolutional architectures that add more direct connections (thus allowing the gradient to flow)

#### <mark>Residual connections</mark>

- **Reference**: [He et al., 2015. Deep Residual Learning for Image Recognition.](https://arxiv.org/pdf/1512.03385.pdf)
- This is a very general trick
  
<img src='./img/stanford/7-ResNet.png' width='600' height='300'>
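
A residual connection computes $y = x + F(x)$, so the Jacobian is $I + \partial F / \partial x$ and the gradient always has an identity path to flow through. The snippet below is a minimal sketch of that idea; the two-layer form of `transformation` and the function names are assumptions for illustration, not the exact block from the paper.

```python
import numpy as np

def transformation(x, w1, w2):
    """F(x): a small two-layer transformation with a ReLU in between."""
    return w2 @ np.maximum(0.0, w1 @ x)

def plain_block(x, w1, w2):
    # Without a skip connection the output is F(x) alone.
    return transformation(x, w1, w2)

def residual_block(x, w1, w2):
    # With a skip connection the input is added back: y = x + F(x).
    return x + transformation(x, w1, w2)

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))
print(plain_block(x, zeros, zeros))     # the signal is lost when F contributes nothing
print(residual_block(x, zeros, zeros))  # the input passes through unchanged
```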


#### <mark>Dense connections</mark>

- **Reference**: "Densely Connected Convolutional Networks", Huang et al, 2017. https://arxiv.org/pdf/1608.06993.pdf
- This is more specific to CNN
  
<img src='./img/stanford/7-DenseNet.png' width='600' height='300'>


#### <mark>Highway connections</mark>

- **Reference**: "Highway Networks", Srivastava et al, 2015. https://arxiv.org/pdf/1505.00387.pdf
- Highway connections aka "HighwayNet"
- Similar to residual connections, but the balance between the identity connection and the transformation layer is controlled by a dynamic gate
- Inspired by LSTMs, but applied to deep feedforward/convolutional networks
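
The gating idea can be written as $y = T(x) \odot H(x) + (1 - T(x)) \odot x$, where $H$ is the transformation and $T$ is the transform gate. The sketch below assumes a `tanh` transformation and the coupled carry gate $1 - T$; parameter names such as `w_t` and `b_t` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with an elementwise gate T."""
    h = np.tanh(w_h @ x + b_h)   # candidate transformation H(x)
    t = sigmoid(w_t @ x + b_t)   # transform gate T(x), each entry in (0, 1)
    return t * h + (1.0 - t) * x

size = 3
x = np.array([0.2, -0.4, 0.6])
rng = np.random.default_rng(0)
w_h, b_h = rng.standard_normal((size, size)), np.zeros(size)
w_t = np.zeros((size, size))

carry = highway_layer(x, w_h, b_h, w_t, np.full(size, -50.0))     # gate closed: output is x
transform = highway_layer(x, w_h, b_h, w_t, np.full(size, 50.0))  # gate open: output is H(x)
print(carry, transform)
```

With the gate closed the layer behaves like an identity connection, and with the gate open it behaves like an ordinary layer; training picks a mix of the two per unit.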

## Long Short-Term Memory (LSTM)

**Reference**: [Hochreiter and Schmidhuber, 1997. "Long short-term memory".](https://www.bioinf.jku.at/publications/older/2604.pdf)

<img src='./img/stanford/7-LSTMDesc.png' width='600' height='300'>


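👉 <mark>The forget gate is loosely comparable to Dropout in deep neural networks in that it suppresses part of the stored information, but it is learned rather than random. When it stays close to 1, the cell state is carried forward almost unchanged, which reduces the risk of vanishing gradients.</mark>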
<br>
<br>

<img src='./img/stanford/7-LSTMDesc2.png'>


<img src='./img/stanford/7-LSTMDiag.png' width='600' height='300'>
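
One LSTM time step can be sketched as follows. This assumes the commonly used formulation with forget, input, and output gates acting on the concatenation of the previous hidden state and the current input; the parameter dictionary `p` and its key names are hypothetical and may not match the notation in the figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; weights have shape (hidden, hidden + input), biases (hidden,)."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate: how much old cell state to keep
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate: how much new content to write
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate: how much cell state to expose
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate cell content
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c

hidden, inputs = 3, 2
p = {f"W_{g}": np.zeros((hidden, hidden + inputs)) for g in "fioc"}
p.update({f"b_{g}": np.zeros(hidden) for g in "fioc"})
p["b_f"] += 50.0  # forget gate ~1: keep everything
p["b_i"] -= 50.0  # input gate ~0: write nothing

c = np.array([0.5, -1.0, 2.0])
h = np.zeros(hidden)
for _ in range(100):
    h, c = lstm_step(np.ones(inputs), h, c, p)
print(c)  # the cell state survives 100 steps
```

Because the cell state is updated additively, a forget gate near 1 lets information (and its gradient) pass through many steps, which a vanilla RNN cannot do.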


## Gated Recurrent Units (GRU)

**Reference**: "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al. 2014, https://arxiv.org/pdf/1406.1078v3.pdf

<img src='./img/stanford/7-GRUDesc.png' width='600' height='300'>
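
A GRU step can be sketched in the same style. The version below assumes the convention $h^{(t)} = (1 - u) \odot h^{(t-1)} + u \odot \tilde{h}^{(t)}$ for the update gate $u$ (some papers swap the roles of $u$ and $1 - u$); parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step; weights have shape (hidden, hidden + input), biases (hidden,)."""
    z = np.concatenate([h_prev, x])
    u = sigmoid(p["W_u"] @ z + p["b_u"])  # update gate: how much of the state to replace
    r = sigmoid(p["W_r"] @ z + p["b_r"])  # reset gate: how much old state feeds the candidate
    z_reset = np.concatenate([r * h_prev, x])
    h_tilde = np.tanh(p["W_h"] @ z_reset + p["b_h"])  # candidate hidden state
    return (1.0 - u) * h_prev + u * h_tilde

hidden, inputs = 3, 2
rng = np.random.default_rng(0)
p = {f"W_{g}": rng.standard_normal((hidden, hidden + inputs)) for g in "urh"}
p.update({f"b_{g}": np.zeros(hidden) for g in "urh"})
p["W_u"][:] = 0.0
p["b_u"] -= 50.0  # update gate ~0: keep the previous state

h_start = np.array([0.5, -0.9, 0.25])
h = h_start.copy()
for _ in range(100):
    h = gru_step(np.ones(inputs), h, p)
print(h)  # unchanged after 100 steps
```

There is no separate cell state and only three weight matrices instead of the four in the LSTM sketch, which is where the smaller parameter count comes from.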


### LSTM vs GRU

👉 Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely used

👉 The biggest difference is that **GRU is quicker** to compute and has fewer parameters

👉 There is **no conclusive evidence** that one consistently performs better than the other

👉 **LSTM is a good default choice** (especially if your data has particularly long dependencies, or you have lots of training data)

👉 **Rule of thumb**: start with LSTM, but switch to GRU if you want something more efficient

## Bidirectional RNNs

**Contextual representation**

👉 <mark>Look in both directions</mark>

<img src='./img/stanford/7-BiRNN.png' width='600' height='300'>


### Structure 
<img src='./img/stanford/7-BiRNN2.png' width='600' height='300'>
<img src='./img/stanford/7-BiRNN3.png' width='600' height='300'>
<img src='./img/stanford/7-BiRNN4.png' width='600' height='300'>
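
The structure can be sketched as two independent RNNs, one reading left to right and one right to left, whose hidden states are concatenated at each position. The code assumes simple `tanh` RNN cells and random, untrained weights; all names are hypothetical.

```python
import numpy as np

def rnn_states(xs, w_h, w_x, b):
    """Hidden states of a simple tanh RNN run over the sequence `xs` in the given order."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(w_h @ h + w_x @ x + b)
        states.append(h)
    return states

def bidirectional_states(xs, fwd_params, bwd_params):
    forward = rnn_states(xs, *fwd_params)
    # Run the second RNN on the reversed sequence, then realign its states.
    backward = rnn_states(xs[::-1], *bwd_params)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

def random_params(rng, hidden, inputs):
    return (0.5 * rng.standard_normal((hidden, hidden)),
            0.5 * rng.standard_normal((hidden, inputs)),
            np.zeros(hidden))

rng = np.random.default_rng(0)
hidden, inputs, steps = 3, 2, 5
fwd = random_params(rng, hidden, inputs)
bwd = random_params(rng, hidden, inputs)
xs = [rng.standard_normal(inputs) for _ in range(steps)]
states = bidirectional_states(xs, fwd, bwd)

# Changing only the last input affects the backward half of the first state.
xs_changed = xs[:-1] + [xs[-1] + 1.0]
states_changed = bidirectional_states(xs_changed, fwd, bwd)
print(states[0], states_changed[0])
```

The first position's representation changes when the last input changes, which shows that each state sees context from both sides. It also shows why the whole input sequence has to be available up front.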


### Restrictions

<img src='./img/stanford/7-BiRNNRestriction.png' width='600' height='300'>



## Multi-layer RNNs (Stacked)

👉 RNNs are already "deep" on one dimension (they unroll over many timesteps)

👉 We can also make them "deep" in another dimension by applying multiple RNNs – this is a multi-layer RNN.

👉 This allows the network to compute more complex representations

👉 The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.

👉 Multi-layer RNNs are also called stacked RNNs.

👉 This can be bidirectional provided that the entire input sentence is accessible.
  
<img src='./img/stanford/7-MultiLRNN.png' width='600' height='300'>
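
Stacking amounts to feeding the hidden-state sequence of one RNN to the next RNN as its input sequence. The sketch below assumes simple `tanh` cells, random untrained weights, and no skip connections; function and variable names are hypothetical.

```python
import numpy as np

def rnn_layer(xs, w_h, w_x, b):
    """Hidden states of a simple tanh RNN run over the sequence `xs`."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(w_h @ h + w_x @ x + b)
        states.append(h)
    return states

def stacked_rnn(xs, layers):
    """Pass the sequence through each layer in turn; layer k reads the states of layer k-1."""
    sequence = xs
    for w_h, w_x, b in layers:
        sequence = rnn_layer(sequence, w_h, w_x, b)
    return sequence

rng = np.random.default_rng(0)
hidden, inputs, steps, depth = 4, 2, 6, 3
layers = []
for k in range(depth):
    in_size = inputs if k == 0 else hidden  # upper layers read hidden states, not raw inputs
    layers.append((0.5 * rng.standard_normal((hidden, hidden)),
                   0.5 * rng.standard_normal((hidden, in_size)),
                   np.zeros(hidden)))

xs = [rng.standard_normal(inputs) for _ in range(steps)]
top_states = stacked_rnn(xs, layers)
print(len(top_states), top_states[-1].shape)
```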


### In practice 

**Reference**: "Massive Exploration of Neural Machine Translation Architectures", Britz et al, 2017. https://arxiv.org/pdf/1703.03906.pdf

👉 <mark>Skip connections are usually used heavily.</mark>

<img src='./img/stanford/7-MultiLRNNInPractice.png' width='600' height='300'>

## T5 Models

__Text-to-Text Transfer Transformer (T5)__ is a large language model (LLM). Many groups work in this field, and it is difficult to tell whose methods perform better when so many variables change at once. The T5 paper set out to see how far the currently available tools could be taken.

In the paper, the authors built their own dataset (C4, available through TensorFlow Datasets) to test how pretraining helps.

![image.png](./img/t5_pipeline.PNG) <br>
_Figure 1. T5 pipeline._