
Please read the first part before getting into this blog: LLMs - 1 [Week 8]

All About Attention

We covered the data preprocessing part earlier on; now we must learn a very essential part of an LLM - attention. Let's take a deep dive through history to see how we came across attention and why we even needed it in the first place!

This blog will be broken into 8 parts:


Why did we need Attention

The leading technology before attention was Seq2Seq. We need to figure out how it failed.

[Figure: Seq2Seq encoder-decoder architecture, with the encoder compressing the input into a single context vector passed to the decoder]

It has two RNNs - an encoder and a decoder. The encoder processes the input and generates a context vector $c$, which is meant to hold all the essential information from the input, and also produces an initial decoder state $s_0$. The decoder uses this context vector to produce the output sequence step by step, relying on its previous outputs and the context vector at each step. [We will not go into too much detail as it will take up a lot of blog time] This kind of solution works fine if we are working with smaller samples, but with bigger samples the context vector is not able to hold all the meaningful information from the encoder. This is called the bottleneck problem - the loss of information caused by squeezing the whole input into a fixed-size context vector!
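To make the bottleneck concrete, here is a minimal sketch of a Seq2Seq encoder-decoder in PyTorch. This is an illustrative assumption, not the original Seq2Seq implementation; the sizes, token ids, and names (hidden_size, context, the <sos> id of 0) are made up for the example. The thing to notice is that the whole input ends up inside one fixed-size tensor, context, which is all the decoder ever gets to see.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        embedded = self.embed(src)               # (batch, src_len, embed_size)
        _, hidden = self.rnn(embedded)           # hidden: (1, batch, hidden_size)
        return hidden                            # fixed-size context vector c (also used as s_0)

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, hidden):       # prev_token: (batch, 1)
        embedded = self.embed(prev_token)        # (batch, 1, embed_size)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.out(output.squeeze(1))     # (batch, vocab_size)
        return logits, hidden

# Toy usage: encode a "sentence" of token ids, then decode step by step.
encoder = Encoder(vocab_size=100, embed_size=32, hidden_size=64)
decoder = Decoder(vocab_size=100, embed_size=32, hidden_size=64)

src = torch.randint(0, 100, (1, 10))             # 10 input tokens
context = encoder(src)                           # everything compressed into 64 numbers

token = torch.zeros(1, 1, dtype=torch.long)      # assumed <sos> token id = 0
hidden = context
for _ in range(5):                               # generate 5 output tokens greedily
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1, keepdim=True)

Whether the input is 10 tokens or 10,000, context stays the same size here, so longer inputs have to be squeezed harder - that is exactly the bottleneck described above.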

A sentence like "Mohan is a boy, what gender is Mohan?" will pass through Seq2Seq just fine, but something like

"Mohan is a boy. He loves playing football with his friends in the park every evening. His favorite subject in school is mathematics, and he enjoys solving challenging problems. Mohan also helps his younger sister with her homework. On weekends, he visits his grandparents and listens to their stories. What gender is Mohan?"