made by https://cneuralnets.netlify.app/

You must have used ChatGPT, but how does it work? How does Google Translate work? Both are built on Natural Language Processing. Today, we embark on a long journey inspired by a tweet from Saurabh Bhaiya [Link to Tweet - https://x.com/drummatick/status/1853123831793127810]. Let’s dive deep into the study of NLP.

This blog is broken into four parts.

Text Preprocessing

We will be using the NLTK library for this task; make sure to import it at the top of your script with import nltk.

Lowercasing

This step is obvious: we do it to ensure consistency. It also reduces the complexity of the text data we are going to input. We can simply convert every character in the text data to its lowercase form.

text = "I Am NeuralNets"  # your input data goes here
text = text.lower()
print(text)  # prints: i am neuralnets
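A small caveat: str.lower() is fine for English, but Python also offers str.casefold(), which applies more aggressive Unicode case folding (for example, the German 'ß' becomes 'ss') and is the safer choice if your corpus is multilingual.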

Tokenization

Imagine you want to teach your kid English: instead of starting with Shakespeare, you would first teach them the alphabet, then words, then sentences.

Here your child is the machine. We break sentences down into meaningful tokens that still carry the original context, which makes pattern recognition easier.

Let’s say we have the sentence "I am neuralnets". We can break it down into ["I", "am", "neuralnets"]. This is word tokenization, which breaks sentences into individual words.

Now let’s go one step further: ['I', ' ', 'a', 'm', ' ', 'n', 'e', 'u', 'r', 'a', 'l', 'n', 'e', 't', 's']. This is character tokenization, mostly used for tasks like spelling correction.
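In Python, character tokenization is just a list() call over the string; a quick sketch:

text = "I am neuralnets"
char_tokens = list(text)  # every character, including spaces, becomes its own token
print("Characters:", char_tokens)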

We can also break the word neuralnets into ["neural", "nets"]. This is called subword tokenization, and it is useful for languages that form meaning by combining smaller units. A minimal sketch of the idea follows, assuming a toy hand-picked vocabulary; real systems learn the vocabulary from data with algorithms such as Byte Pair Encoding (BPE).
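def subword_tokenize(word, vocab):
    # Greedy longest-match segmentation against a known vocabulary.
    tokens = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary entry matches; fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

vocab = {"neural", "nets"}  # toy vocabulary, chosen only for this example
print(subword_tokenize("neuralnets", vocab))  # ['neural', 'nets']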

Here is word tokenization with NLTK:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # fetch the pretrained tokenizer models (needed once)

tokens = word_tokenize("I am neuralnets.")
print("Tokens:", tokens)

Stopword Removal