made by <https://cneuralnets.netlify.app/>

We have all seen this image from the Attention Is All You Need paper. Looks scary, right? Let’s try to understand how it actually works using a single example, and make this journey as simple as possible!

[Figure: the Transformer architecture diagram from the paper]

Part 1 (Preprocessing)

Dataset

First, we need a dataset to work with throughout our journey. For example, the dataset used to train GPT-3 was 570 GB! We obviously can’t use that here as an example, so let’s make a short dataset with only 3 sentences.

"I'm not crazy. My mother had me tested."
"Our babies will be smart and beautiful."
"I'm an astronaut. I work for NASA."

This will be our dataset (some lines from my favorite TV show ~ The Big Bang Theory. Extra points if you can guess who said them :D)

Vocabulary

First, let’s break the dataset down into individual words, called tokens.

dataset = [
    "I'm", "not", "crazy", "My", "mother", "had", "me", "tested",
    "Our", "babies", "will", "be", "smart", "and", "beautiful",
    "I'm", "an", "astronaut", "I", "work", "for", "NASA"
]
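
To make this step concrete, here is one way to do it in Python ~ a minimal sketch that assumes simple whitespace splitting, where sentences is just our three dataset lines written out:

# A minimal tokenization sketch: split each sentence on whitespace.
sentences = [
    "I'm not crazy My mother had me tested",
    "Our babies will be smart and beautiful",
    "I'm an astronaut I work for NASA",
]

dataset = [word for sentence in sentences for word in sentence.split()]
print(dataset)
# ["I'm", 'not', 'crazy', 'My', 'mother', 'had', 'me', 'tested', ...]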

We must build our vocabulary now! It’s nothing but the set of unique words in the dataset.

$$ vocab=set(dataset) $$

The vocab will look something like this:

vocab = [
    "I'm", "not", "crazy", "My", "mother", "had", "me", "tested",
    "Our", "babies", "will", "be", "smart", "and", "beautiful",
    "an", "astronaut", "I", "work", "for", "NASA"
]

We can easily find the vocab size by:

$$ vocab\ size = count(set(dataset)) = 21 $$
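
In code this is just a one-liner ~ a quick sketch assuming dataset is the token list from above (using dict.fromkeys instead of set only to keep the first-seen order shown in the vocab list):

# Build the vocabulary: unique tokens, keeping their first-seen order.
vocab = list(dict.fromkeys(dataset))
vocab_size = len(vocab)
print(vocab_size)  # 21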

Encoding

Let’s assign a unique number to each word in the vocab.
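
A simple sketch of this mapping, assuming the vocab list built above ~ each word’s position in the vocab becomes its id:

# Map every vocab word to a unique integer id, and back.
word_to_id = {word: idx for idx, word in enumerate(vocab)}
id_to_word = {idx: word for word, idx in word_to_id.items()}

# Encode the whole dataset as a sequence of ids.
encoded = [word_to_id[word] for word in dataset]
print(word_to_id["crazy"])  # 2
print(encoded[:5])          # [0, 1, 2, 3, 4]

These integer ids are what the model will actually work with from here on.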

That’s all the data preprocessing we will need. Now let’s delve into the transformer architecture itself!