The Masakhane natural language processing (NLP) research project, an African machine translation web interface, has won the inaugural 2021 Wikimedia Foundation Research Award of the Year. This system was developed by the Data Science for Social Impact (DSFSI) research group in the University of Pretoria’s Department of Computer Science, in collaboration with researchers from the African Master’s in Machine Intelligence programme of the African Institute for Mathematical Sciences in Ghana.
Although 2 000 of the world’s languages are African, African languages are barely represented in technology. The continent’s colonialist past has exacerbated this under-representation: it has been devastating for African languages in terms of their support, preservation and integration, and has resulted in a technological space that does not understand African names, cultures, places or history.
The idea for a machine learning tool to assist in the translation of 50 of the regional languages on the continent was developed at the #SautiYetu African NLP Unconference 2020. This event is linked to the Deep Learning Indaba, an organisation focused on strengthening African machine learning (ML) and supporting Africans to be owners of technology advances and artificial intelligence (AI). An objective of the Indaba is to create leadership and recognise excellence in the development of ML and AI across Africa.
A two-year participatory research project spanning different countries in Africa was launched in 2019 with the assistance of funding received from the Mozilla Open Source Support (MOSS) Foundation. Dr Vukosi Marivate, the holder of the Absa Chair of Data Science at the University of Pretoria, is one of the chief investigators on this project. Research outputs included two published journal articles and an application similar to Google Translate that focuses specifically on the African languages not catered for by existing machine translation tools. “Some of the regional languages do not even feature on Google Translate,” says Dr Marivate.
The research that formed part of the development of the Masakhane machine translation tool attempted to fundamentally change how the challenge of “low-resourced languages” is approached in Africa. It describes a novel approach to machine translation for African languages, illustrating how these languages can overcome the challenges they face in joining the web and in gaining access to some of the technologies from which other languages benefit today. The Wikimedia Foundation calls this research an inspiring example of work towards knowledge equity, which is one of the two main pillars of the 2030 Wikimedia Movement Strategy.
Collaborating with Dr Marivate on this project are his colleague in the University of Pretoria’s Department of Computer Science, Abiodun Modupe, and Catherine Gitau and Salomon Kabenamualu from the African Master’s in Machine Intelligence programme of the African Institute for Mathematical Sciences in Ghana. Dr Marivate considers this award from the Wikimedia Foundation a great honour, which also shows that we can innovate “even in the ways we conduct research”.
The project also gave rise to the establishment of the Masakhane community, a grassroots community that aims to develop NLP systems for Africa by Africans. Masakhane, which roughly translates as “we build together” in isiZulu, wants Africans to shape and own technological advances towards human dignity, wellbeing and equity through inclusive community building, open participatory research and multidisciplinarity.
Members of the Masakhane community provide data for the research project and also assist in building models and testing sample translations to improve the accuracy of the machine translation tool. They represent the African countries and languages that form part of the research and comprise individuals from a range of relevant professions, including data scientists, researchers, language practitioners, translators and software developers.
The funding from the MOSS Foundation was also used to build the machine translation tool. The beta version of the tool, which was launched first, made provision for feedback from test users, as well as from members of the Masakhane community. The final tool has since been launched and is available as an open-source resource. It can be accessed at http://translate.masakhane.io.
According to Dr Marivate, several related developments are in the pipeline to support African language translation, including new machine learning models and a speech-to-text translation tool. More information on the activities of the research group can be found at https://dsfsi.github.io.
Source: University of Pretoria