made by <https://cneuralnets.netlify.app/>
short blog here. i’ll just discuss how i set up my pipeline to get models performing better on low-resource languages. most of this comes from things i learnt during random experiments
let’s consider nepali. i dub it a low-resource language: since it is written in the devanagari script, models can easily confuse it with hindi. while hindi has complex gendered nouns and foreign influences, nepali has a simpler, generally non-gendered system, different pronunciation, and unique grammatical structures.
some examples below —

start from a very good multilingual model which can’t do nepali but is trained on a considerable amount of devanagari text.
for example, use models like https://huggingface.co/sarvamai/sarvam-translate or https://huggingface.co/ai4bharat/indictrans2-indic-indic-1B
usually you will be constrained on gpu resources (i guess), so please use lora finetuning instead of a full finetune.
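a minimal sketch of what that lora setup could look like with the `peft` library — the `target_modules` names are an assumption here and depend on the base model’s architecture, so check the model’s layer names before copying this:

```python
from peft import LoraConfig

# sketch of a lora adapter config; r and alpha follow the
# rank = alpha / 2 heuristic discussed in this post
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed names; verify for your model
    task_type="SEQ_2_SEQ_LM",
)
```

you would then wrap your base model with `peft.get_peft_model(model, lora_config)` and train as usual.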
my suggestion would be to set the lora rank to half of the lora alpha (at least). the logic i use for this is below —

paper for the above excerpt - https://arxiv.org/pdf/2410.21228v1
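to make the rank/alpha relationship concrete, here is a tiny pure-python sketch of the scaling factor lora applies to its low-rank update (the specific numbers are just illustrative):

```python
# lora adds a low-rank update to a frozen weight W:
#   W' = W + (alpha / r) * (B @ A)
# with rank r = alpha / 2, the scaling factor is exactly 2,
# so the adapter's contribution is amplified rather than shrunk.

def lora_scaling(alpha: float, r: int) -> float:
    """Effective multiplier applied to the low-rank update B @ A."""
    return alpha / r

print(lora_scaling(alpha=32, r=16))  # rank = half of alpha -> 2.0
print(lora_scaling(alpha=32, r=32))  # rank = alpha         -> 1.0
```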
go to huggingface and search by language
