made by
https://cneuralnets.netlify.app/
These are the questions I practiced from time to time to prepare for my interview.
A few of the questions (4-5 of them) are not yet filled in. Some were answered by other people where I felt I lacked the knowledge to answer them myself, and some were restructured into tables by Perplexity.
Answer
We first preprocess the text: remove stopwords, lemmatize the corpus, and lowercase everything.
Then we find out TF and IDF using these formulae
$$ \text{TF} = \frac{\text{number of times the word appears in the doc}}{\text{total number of words in that doc}} $$
$$ \text{IDF} = \log\left(\frac{\text{total number of docs}}{\text{number of docs containing that word} + 1}\right) $$
TF-IDF is simply their product.
$$ \text{TF-IDF} = \text{TF} \cdot \text{IDF} $$
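As a rough illustration, the formulae above can be sketched in plain Python. This assumes the documents are already preprocessed into token lists; the function name `tf_idf` and the toy corpus are made up for the example, and the natural log is an assumption since the formula does not fix a base.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF per document using the formulae above.

    docs: list of token lists (assumed already preprocessed).
    Returns one {word: score} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: in how many docs does each word appear?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total_words = len(doc)
        scores.append({
            # TF = count / total words, IDF = log(N / (df + 1))
            word: (count / total_words) * math.log(n_docs / (doc_freq[word] + 1))
            for word, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, already lowercased and tokenized.
docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
    ["bird", "flew", "away"],
]
scores = tf_idf(docs)
```

One thing to keep in mind with this exact IDF formula: because of the +1 in the denominator, a word that appears in every document gets a slightly negative IDF, and a word that appears in all but one document gets an IDF of zero. Libraries often use a differently smoothed variant to avoid this.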
Answer
Normalization in TF-IDF ensures that scores are comparable across documents of varying lengths and distributions.
Longer documents naturally have higher raw term counts, which can skew the TF-IDF scores. Normalization corrects this bias by rescaling the values so that they reflect the relative importance of terms regardless of document length.
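A minimal sketch of what this can look like, assuming L2 (Euclidean) normalization, which is one common choice but not the only one. The helper name `l2_normalize` and the two example vectors are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a {word: score} vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector.values()))
    if norm == 0:
        # All-zero (or empty) vector: nothing to scale.
        return dict(vector)
    return {word: v / norm for word, v in vector.items()}


# Hypothetical scores: the long doc has the same word proportions,
# just ten times larger because the words occur more often.
short_doc = {"cat": 1.0, "dog": 2.0}
long_doc = {"cat": 10.0, "dog": 20.0}

short_unit = l2_normalize(short_doc)
long_unit = l2_normalize(long_doc)
```

After normalization both vectors have length 1 and point in the same direction, so the two documents look identical in terms of term importance even though one is much longer.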