Introduction

You have probably used ChatGPT for your work, but how do you make your own ChatGPT? Systems like ChatGPT are built on what we call large language models (LLMs).

An LLM is a neural network designed to understand, generate, and respond to human-like text. These models are deep neural networks trained on massive amounts of text data, sometimes encompassing large portions of the entire publicly available text on the internet.

We are going to study LLMs in detail over the course of 4 blogs. Below, you can see how an LLM is built, ending with a working prototype that you can adapt for your own purpose!

In this blog, we are going to cover the data preparation part only! It is divided into 6 parts:

Word Embeddings
Tokenization
Build a BPE Tokenizer from Scratch
Input-Target for Training
Token Embeddings
Positional Embeddings

Word Embeddings

The model can’t understand raw input directly. Whatever form the input is in (text, audio, images), we must first convert it into a vector of numbers that the model can work with. This conversion process is called embedding!
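
To make the idea concrete, here is a minimal sketch of turning words into vectors with a lookup table. The vocabulary, the vector size of 4, and the `embed` helper are all assumptions made up for illustration, and the vectors are random placeholders; in a real model these numbers are learned during training.

```python
import random

# Toy vocabulary (hypothetical example words): each known word gets an integer ID.
vocab = {"the": 0, "cat": 1, "sat": 2}
embedding_dim = 4  # assumed size; real models use hundreds or thousands of dimensions

# Embedding table: one row of numbers per word in the vocabulary.
# Random values stand in for the values a real model would learn.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocab
]

def embed(word):
    """Look up the vector for a word in the toy table (hypothetical helper)."""
    return embedding_table[vocab[word]]

sentence = ["the", "cat", "sat"]
vectors = [embed(word) for word in sentence]

# Each word is now a list of 4 numbers the model could take as input.
for word, vec in zip(sentence, vectors):
    print(word, [round(x, 3) for x in vec])
```

The key point is that the model never sees the word itself, only the row of numbers stored for it, so the same word always maps to the same vector.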