made by <https://cneuralnets.netlify.app/>

short blog here. i'll just discuss how i set up my pipeline to make models better at low resource languages. most of this comes from things i learnt during random experiments

problem statement

let’s consider nepali. i dub it a low resource language — it is written in the devanagari script, the same script as hindi, so models often confuse the two. while hindi has complex gendered nouns and foreign influences, nepali has a simpler, generally non-gendered system, different pronunciation, and unique grammatical structures.

some examples below —

[image: example nepali vs hindi sentences]
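the script overlap is easy to verify yourself. a minimal check (my own illustration, not from the post) that nepali and hindi words live in the same unicode block, which is why a model can mistake one language for the other at the script level:

```python
# both words below resolve to the DEVANAGARI unicode block,
# even though one is nepali and the other is hindi.
import unicodedata

nepali_word = "नेपाली"
hindi_word = "हिन्दी"

for word in (nepali_word, hindi_word):
    # first word of each character's unicode name is its block/script name
    blocks = {unicodedata.name(ch).split()[0] for ch in word}
    print(word, blocks)  # both print {'DEVANAGARI'}
```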

which base model to choose

start from a very good multilingual model which can’t yet do nepali, but was trained on a considerable amount of devanagari text.

for example use models like https://huggingface.co/sarvamai/sarvam-translate or https://huggingface.co/ai4bharat/indictrans2-indic-indic-1B
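a quick heuristic i'd use here (my own addition, not from the post): before committing to a base model, check what fraction of its tokenizer vocab contains devanagari characters. the tiny vocab below is a mock; in practice you'd pass the real `tokenizer.get_vocab()` keys.

```python
# heuristic sketch: estimate a tokenizer's devanagari coverage.
def devanagari_fraction(vocab):
    """fraction of vocab entries containing at least one devanagari char."""
    def has_devanagari(token):
        # the devanagari unicode block is U+0900..U+097F
        return any("\u0900" <= ch <= "\u097f" for ch in token)
    return sum(1 for tok in vocab if has_devanagari(tok)) / len(vocab)

# mock vocab for illustration only
mock_vocab = ["the", "नेपाल", "हिन्दी", "cat"]
print(devanagari_fraction(mock_vocab))  # 0.5
```

a model whose vocab barely tokenizes devanagari will shred nepali text into byte fallbacks, which makes finetuning much harder.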

lora or full finetune?

usually you will be constrained by gpu resources (i guess), so please use lora finetuning.

my suggestion would be to set the lora rank to (at least) half of the lora alpha. the logic i use for this is below —

[image: excerpt on lora rank vs alpha]

paper for the above excerpt - https://arxiv.org/pdf/2410.21228v1
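the rank/alpha logic can be sketched numerically (my own illustration): lora adds delta_W = (alpha / r) * (B @ A) to the frozen weight, so with alpha fixed, the low-rank update is multiplied by alpha / r. keeping r at half of alpha (or less) keeps that multiplier at 2 or more.

```python
# tiny numeric sketch of the lora scaling factor.
def lora_scale(alpha, r):
    """multiplier applied to the low-rank product B @ A."""
    return alpha / r

print(lora_scale(32, 16))  # rank = alpha / 2 -> 2.0
print(lora_scale(32, 32))  # rank = alpha     -> 1.0
print(lora_scale(32, 64))  # rank > alpha     -> 0.5, update gets damped
```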

where to get the data?

case 1 : internet has digitized data

go to huggingface and search by language

[image: huggingface dataset search filtered by language]
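the same search can be done from code with `huggingface_hub`. a sketch under my own assumptions — `build_query` is a helper i made up for illustration, and the actual network call is guarded under main so the example reads offline:

```python
# hypothetical sketch: querying the hub for nepali datasets.
def build_query(language_code, task=None):
    """assemble keyword arguments for HfApi.list_datasets."""
    query = {"language": language_code}
    if task is not None:
        query["task_categories"] = task
    return query

if __name__ == "__main__":
    from huggingface_hub import HfApi
    api = HfApi()
    # "ne" is the ISO 639-1 code for nepali
    for ds in api.list_datasets(**build_query("ne"), limit=5):
        print(ds.id)
```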

case 2 : image data exists but is not digitized