made by <https://cneuralnets.netlify.app/>

short blog here. i'll just discuss how i set up my pipeline to make models better at low resource languages. most of this comes from things i learnt during random experiments

problem statement

let’s consider nepali. i dub it a low resource language — it is written in the devanagari script, the same script as hindi, so models often confuse the two. while hindi has complex gendered nouns and foreign influences, nepali has a simpler, generally non-gendered system, different pronunciation, and unique grammatical structures.

some examples below —

[image: example nepali vs hindi sentences]
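the script overlap is easy to verify yourself. a minimal check (my own illustration, not from the post) that nepali and hindi words live in the same unicode block, which is why a model can mistake one language for the other at the script level:

```python
# both words below resolve to the DEVANAGARI unicode block,
# even though one is nepali and the other is hindi.
import unicodedata

nepali_word = "नेपाली"
hindi_word = "हिन्दी"

for word in (nepali_word, hindi_word):
    # first word of each character's unicode name is its block/script name
    blocks = {unicodedata.name(ch).split()[0] for ch in word}
    print(word, blocks)  # both print {'DEVANAGARI'}
```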

which base model to choose

start from a very good multilingual model which can’t yet do nepali, but was trained on a considerable amount of devanagari text.

for example use models like https://huggingface.co/sarvamai/sarvam-translate or https://huggingface.co/ai4bharat/indictrans2-indic-indic-1B
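a quick heuristic i'd use here (my own addition, not from the post): before committing to a base model, check what fraction of its tokenizer vocab contains devanagari characters. the tiny vocab below is a mock; in practice you'd pass the real `tokenizer.get_vocab()` keys.

```python
# heuristic sketch: estimate a tokenizer's devanagari coverage.
def devanagari_fraction(vocab):
    """fraction of vocab entries containing at least one devanagari char."""
    def has_devanagari(token):
        # the devanagari unicode block is U+0900..U+097F
        return any("\u0900" <= ch <= "\u097f" for ch in token)
    return sum(1 for tok in vocab if has_devanagari(tok)) / len(vocab)

# mock vocab for illustration only
mock_vocab = ["the", "नेपाल", "हिन्दी", "cat"]
print(devanagari_fraction(mock_vocab))  # 0.5
```

a model whose vocab barely tokenizes devanagari will shred nepali text into byte fallbacks, which makes finetuning much harder.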

lora or full finetune?

usually you will be constrained by gpu resources (i guess), so please use lora finetuning.

my suggestion would be to set the lora rank to (at least) half of the lora alpha. the logic i use for this is below —

[image: excerpt on lora rank vs alpha]

paper for the above excerpt - https://arxiv.org/pdf/2410.21228v1
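the rank/alpha logic can be sketched numerically (my own illustration): lora adds delta_W = (alpha / r) * (B @ A) to the frozen weight, so with alpha fixed, the low-rank update is multiplied by alpha / r. keeping r at half of alpha (or less) keeps that multiplier at 2 or more.

```python
# tiny numeric sketch of the lora scaling factor.
def lora_scale(alpha, r):
    """multiplier applied to the low-rank product B @ A."""
    return alpha / r

print(lora_scale(32, 16))  # rank = alpha / 2 -> 2.0
print(lora_scale(32, 32))  # rank = alpha     -> 1.0
print(lora_scale(32, 64))  # rank > alpha     -> 0.5, update gets damped
```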

where to get the data?

case 1 : internet has digitized data

go to huggingface and search by language

[image: huggingface dataset search filtered by language]
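the same search can be done from code with `huggingface_hub`. a sketch under my own assumptions — `build_query` is a helper i made up for illustration, and the actual network call is guarded under main so the example reads offline:

```python
# hypothetical sketch: querying the hub for nepali datasets.
def build_query(language_code, task=None):
    """assemble keyword arguments for HfApi.list_datasets."""
    query = {"language": language_code}
    if task is not None:
        query["task_categories"] = task
    return query

if __name__ == "__main__":
    from huggingface_hub import HfApi
    api = HfApi()
    # "ne" is the ISO 639-1 code for nepali
    for ds in api.list_datasets(**build_query("ne"), limit=5):
        print(ds.id)
```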

case 2 : image data exists but is not digitized