
Bringing Sindhi into the AI Era: Our Journey in Speech Technology

Fahad Maqsood Qazi


In August last year, we at Flis Technologies embarked on a journey to build an automated video translation tool that could translate any video from any language into Sindhi. We thought it would be a simple task, but what we learned shocked us.


It wasn’t the technology that was difficult to build, no! The underlying systems that any language needs in order to grow in AI simply weren’t there. Essential models like text-to-speech and speech-to-text had never been developed for Sindhi. This was shocking for us: there are 40 million Sindhis in the world, yet no one had built these essential AI systems for their language.


We turned to academia to see whether anyone was developing such systems, and found that research papers had been theorizing techniques for building them for decades. It was such a face-palm moment: why theorize your own methods when these systems have already been built for other languages? Why not apply those existing techniques to finally build them for Sindhi?


Being true to our tagline, “Where others see limits, we see possibilities,” we set out to accomplish these tasks ourselves. The plan was to first build a text-to-speech (TTS) model, then a speech-to-text (STT) model, then voice cloning, and finally the entire dubbing system. Then we realized that Sindhi had no datasets available for training any kind of model. This was only mildly surprising because, by this point, we expected that nobody had done anything for Sindhi in the field of AI. So we started collecting videos from YouTube and transcribing them by hand. We compiled a dataset of several hours of audio-transcript pairs, and then we learned that a Google employee, Asad Memon, had added Sindhi to Mozilla’s Common Voice project and was funding its progress.


We felt relieved, because dataset collection was such a tiring task, and we were lucky that the first release of the Common Voice Sindhi dataset was just around the corner. We kept building our own dataset, and when Common Voice released its Sindhi data, we merged the two for more effective training.
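Merging the two sources boils down to normalizing both into a single audio-transcript manifest and dropping any duplicate clips. Here is a minimal sketch in Python, assuming Common Voice’s TSV layout with `path` and `sentence` columns; the file names and transcripts below are made up for illustration:

```python
import csv
import io

def load_manifest(tsv_text):
    """Parse a tab-separated manifest into (audio_path, transcript) pairs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["path"], row["sentence"].strip()) for row in reader]

def merge_manifests(*manifests):
    """Concatenate manifests, keeping the first entry seen for each audio path."""
    seen, merged = set(), []
    for manifest in manifests:
        for path, sentence in manifest:
            if path not in seen:
                seen.add(path)
                merged.append((path, sentence))
    return merged

# Hypothetical in-memory snippets standing in for the two sources.
COMMON_VOICE_TSV = (
    "client_id\tpath\tsentence\n"
    "abc\tcv_0001.mp3\tسلام\n"
    "abc\tcv_0002.mp3\tسنڌي ٻولي\n"
)
OUR_TSV = (
    "path\tsentence\n"
    "yt_0001.wav\tسلام\n"
    "cv_0001.mp3\tسلام\n"  # overlapping clip; dropped by the dedup step
)

merged = merge_manifests(load_manifest(COMMON_VOICE_TSV), load_manifest(OUR_TSV))
```

In practice you would read the TSV files from disk and also normalize transcripts (punctuation, digits, Unicode forms) in the same pass, but the dedup-by-path idea is the core of the merge.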


Sindhi also didn’t have a tokenizer, so we built our own and started training. We trained one model after another for the same tasks to compare their effectiveness and speed. We ultimately chose a text-to-speech model that is very fast even on a CPU, with quality good enough given our limited dataset. We then fine-tuned Whisper for Sindhi; it now works well when speech is clear enough, though its inference speed on a CPU is a bit slow. For a few months we focused on other things, but now we feel it’s time to announce these models to the world.
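To give a feel for what “building a tokenizer” means at its simplest, here is an illustrative character-level sketch in pure Python. This is a stand-in for explanation, not our production tokenizer, and the sample corpus is made up:

```python
class CharTokenizer:
    """Minimal character-level tokenizer: builds a vocabulary from a corpus
    and maps text to integer ids and back."""

    PAD, UNK = "<pad>", "<unk>"

    def __init__(self, corpus):
        chars = sorted(set("".join(corpus)))
        self.vocab = [self.PAD, self.UNK] + chars
        self.ids = {ch: i for i, ch in enumerate(self.vocab)}

    def encode(self, text):
        # Unseen characters fall back to the <unk> id.
        return [self.ids.get(ch, self.ids[self.UNK]) for ch in text]

    def decode(self, ids):
        # Skip the two special tokens when reconstructing text.
        return "".join(self.vocab[i] for i in ids if i > 1)

# Tiny illustrative Sindhi corpus.
tok = CharTokenizer(["سنڌي ٻولي", "سلام"])
ids = tok.encode("سلام")
```

Real speech models typically use subword tokenizers (e.g. BPE) trained on much larger text, but the interface, encode text to ids for the model and decode ids back to text, is the same.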


You can try text-to-speech here and speech-to-text here.

 

 
 
 
