OpenSpeaks/NLP

Natural Language Processing (NLP), Artificial Intelligence (AI) and Machine Learning (ML) are different technologies that are applied to language for different needs. AI and ML require building large datasets and training language models on them. The companies and others who train language models usually do not create the text, video or audio that goes into these datasets. They collect it from the internet without asking the creators or their communities. Yet these communities often have no access to the collected data and are asked to pay for services developed with it. They also lack the resources to collect such data themselves. This raises ethical concerns.

This resource is for community activists who want to build their own technological infrastructure for studying and creating solutions with AI and ML. The work draws on experiences with language communities in India that have limited resources. It has produced a large collection of speech data covering over 65,000 words. This data is released for anyone to use and is ready for research and development in speech synthesis. The session will showcase this work and provide an opportunity to discuss future projects together.

What
This resource focuses on audio and video data in languages with fewer resources. Speakers of such languages cannot afford to spend much time and money building data and the NLP, AI or ML tools that depend on it. This resource also focuses on languages that can be written with a script (writing system).

The first step is to create a list of unique words written or spoken in a language. The pronunciation of each word is then recorded. The recordings help one understand how different words are spoken naturally by people. Words are pronounced differently in different situations, and different people speak the same word in their own way. Still, most people say a particular word in a particular situation in a similar manner. The same person, on the other hand, says a word differently when asking a question, when surprised, or in another situation. Training AI and ML models on such data is meant to identify these similarities and differences, and recordings of sentences are required for this purpose. Similarly, a digital or online dictionary gives the meanings of words, and a reader should also be able to hear how each word in that dictionary is spoken. For this purpose, the pronunciation of individual words needs to be recorded.
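
The first step, building a list of unique words, can be sketched in a few lines of Python. This is only an illustration and not the method used in this project: the function name unique_words and the sample sentence are hypothetical, and the sketch assumes the source text separates words with spaces. It splits on whitespace and trims punctuation by Unicode category instead of using a pattern such as \w+, because that pattern can break words apart at vowel signs in many Indic scripts.

```python
import unicodedata


def unique_words(text):
    """Return a sorted list of the unique words found in text.

    Assumption: words are separated by whitespace. Case is left
    unchanged because many scripts have no upper/lower case.
    """
    words = set()
    for token in text.split():
        # Normalize so that visually identical strings compare equal.
        token = unicodedata.normalize("NFC", token)
        # Trim leading and trailing punctuation or symbols
        # (Unicode general categories starting with P or S).
        start, end = 0, len(token)
        while start < end and unicodedata.category(token[start])[0] in "PS":
            start += 1
        while end > start and unicodedata.category(token[end - 1])[0] in "PS":
            end -= 1
        word = token[start:end]
        if word:
            words.add(word)
    return sorted(words)


if __name__ == "__main__":
    # Hypothetical sample text; replace with text in the target language.
    sample = "Hello, world! Hello again."
    for word in unique_words(sample):
        print(word)
```

The resulting list can then be reviewed by speakers of the language before recording, since automatic splitting will not catch spelling variants or words joined without spaces.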

There are two useful web tools that can help with this recording process: Lingua Libre for words and phrases and Mozilla Common Voice for sentences.