Open access platform to save the Odia Indian language


Opensource.com

In February 2014, the Government of India declared Odia, a South Asian language, the sixth classical language of India. Odia is one of the 22 scheduled languages of India and has a literary heritage of more than 5,000 years: documents cover more than 3,500 of those years, and the rest survives as undocumented oral history. Native Odia speakers became hopeful that many language-related projects would be implemented to extend this long literary heritage, and that the language would be used and spoken globally, not just in literature but also in computer and mobile games, interactive applications, and other digital media, reaching the masses as a communicative language.

So far, few federal initiatives have been put in place, and not a single policy-level change has been made, to implement a standard as simple as Unicode for easy access to information. There are also very few mobile apps that offer concise, easy-to-digest content. Overall, there is not much content online in a standard format that is easy to search, access, and reproduce.

Wikisource is here to change that and is working to open up a whole new world of online resources for readers.

More than 40 million native Odia speakers live in the Indian state of Odisha, its neighboring states, and a diaspora spread across the rest of the world, primarily in countries like the US, UK, UAE, and many South and East Asian countries. Yet comparatively little content in the Odia language has been made available on the Internet. The most content is on Odia Wikipedia, with 8,441 articles created by October 2014. A bigger problem is that, although a few websites publish Unicode content, government portals do not, so their content is neither searchable nor reusable. A non-profit, Srujanika, with support from two other institutions, has digitized around 740 books under the Open Access to Oriya Books (OAOB) project, most of which were published between 1850 and 1950. This remains the largest digital archive so far for the Odia language, yet all of the books are scanned PDFs, which restricts searchability of the content.

Odia Wikisource is a project that aims to digitize rare books that are out of copyright. The project also allows authors and publishers to donate their copyrighted work by re-licensing it under CC0 or CC BY-SA licenses. The goal is to open up access to large volumes of books and manuscripts and to create more Open Educational Resources (OERs). The single biggest advantage of the Wikisource project at large is that it makes the text of books available in the Unicode standard, which makes it searchable on the web and lets readers copy it and use it elsewhere. Most conventional archival systems lack this important feature.
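To illustrate why Unicode text is searchable in a way scanned page images are not, here is a minimal sketch in Python. The function name `find_all` and the sample strings are assumptions made for this illustration and are not part of Wikisource. It also normalizes both strings first, because the same Odia letter can be stored in more than one equivalent Unicode form (for example, the letter RRA as one code point or as DDA plus a nukta).

```python
import unicodedata


def find_all(text, query):
    """Return the start positions of query in text, after normalizing both.

    Positions refer to the NFC-normalized text, not the raw input.
    """
    text = unicodedata.normalize("NFC", text)
    query = unicodedata.normalize("NFC", query)
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        start = text.find(query, start + 1)
    return positions


# The word "Odia" spelled two canonically equivalent ways:
odia_single = "\u0B13\u0B5C\u0B3F\u0B06"        # uses ORIYA LETTER RRA
odia_nukta = "\u0B13\u0B21\u0B3C\u0B3F\u0B06"   # uses DDA + ORIYA SIGN NUKTA

# Hypothetical page text containing both spellings.
page = odia_single + " x " + odia_nukta

# Both spellings are found by a single query once the text is normalized.
matches = find_all(page, odia_nukta)
print(len(matches))
```

A scanned PDF offers nothing comparable to search against until its pages have been retyped or run through OCR.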

Wikisource is run by volunteers and communities who often retype the books or prepare them using Optical Character Recognition (OCR), a technique that converts scanned images of books into text. Participate and contribute to Odia Wikisource by visiting or.wikisource.org. The project is open to all who want to help!

As a Wikimedia project, Odia Wikisource went through a long and thorough approval process lasting about 1 year and 9 months as an active incubator project, first by the Language Committee and then by the Wikimedia Foundation's Board. During this incubation phase, the project digitized three books completely and one partially, thanks to individual contributors. An educational institution, the Kalinga Institute of Social Sciences (KISS), in collaboration with the Wikimedia-funded Centre for Internet and Society's Access To Knowledge (CIS-A2K), is in the process of digitizing 9 books by the author Dr. Jagannath Mohanty that were re-licensed under CC BY-SA 3.0 earlier this year.

Four new Wikisource contributors joined the project in response to a tweet and a Facebook post by the author about digitizing The Odia Bhagabata, classic literature compiled in the 14th century. "Content that has already been typed in fonts with various non-Unicode encodings can now be converted by (this), like it was done for The Odia Bhagabata, which was typed and available on the community-hosted website Odia.org. New contributors did not face the problem of retyping," says Manoj Sahukar, who along with the author designed a converter that reads the text and transforms it into Unicode for The Odia Bhagabata.
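The article does not describe how that converter works internally. As a rough illustration of the general idea, the Python sketch below maps characters from a legacy font encoding to Odia Unicode code points and moves a vowel sign that legacy fonts type before the consonant to its Unicode position after it. The mapping table is entirely hypothetical: the ASCII keys are invented for this example and do not match any real Odia font or the converter used for The Odia Bhagabata.

```python
# Hypothetical legacy-font table: the ASCII keys are invented for illustration.
LEGACY_TO_UNICODE = {
    "K": "\u0B15",  # ORIYA LETTER KA
    "B": "\u0B2C",  # ORIYA LETTER BA
    "a": "\u0B3E",  # ORIYA VOWEL SIGN AA
    "i": "\u0B3F",  # ORIYA VOWEL SIGN I
    "e": "\u0B47",  # ORIYA VOWEL SIGN E (drawn to the left of the consonant)
}

# Signs that legacy fonts place before the consonant, in visual order,
# but that Unicode stores after the consonant, in logical order.
PRE_BASE_SIGNS = {"\u0B47"}


def convert(legacy_text):
    """Convert text in the hypothetical legacy encoding to Unicode."""
    out = []
    pending = None  # a pre-base sign waiting for the next character
    for ch in legacy_text:
        uni = LEGACY_TO_UNICODE.get(ch, ch)  # unmapped characters pass through
        if uni in PRE_BASE_SIGNS:
            pending = uni
            continue
        out.append(uni)
        if pending is not None:
            # Simplification: attach the sign to whatever character follows it.
            out.append(pending)
            pending = None
    if pending is not None:
        out.append(pending)  # a trailing sign with nothing after it is kept
    return "".join(out)


# "eK" in the legacy visual order becomes KA followed by VOWEL SIGN E.
print(convert("eK") == "\u0B15\u0B47")
```

A real converter would need a full table for each legacy font and would also have to handle conjuncts and multi-part vowel signs, which this sketch leaves out.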

Questions for early contributors to Odia Wikisource

Subhashish Panigrahi (SP): You have been with Odia Wikisource since its inception. How do you think it will help other Odias?
Mrutyunjaya Kar, a long-time Wikimedian who proofreads the books on Odia Wikisource: Odias around the globe will have access to a vast number of old as well as new books and manuscripts online, at their fingertips. Knowing more about the long and glorious history of Odisha will become easier.

SP: Do you think any particular section of society is going to benefit from this?
Nasim Ali, the oldest active Odia Wikimedian and Wikisource writer: Books contain the gist of all human knowledge. The ease of access and spread of books are the markers of the intellectual status of a society. And in this e-age, Wikisource can be helpful by not just providing easy access to a plethora of books under free licenses but also aiding the spread of basic education in developing economies. Wikisource together with cheaper internet could catalyze a Renaissance of the 21st century.

SP: How does it feel to be one of the few contributors digitizing the Odia Bhagabata? How do you want to get involved in the future?
Nihar Kumar Dalai, a Wikisource writer: This is a proud opportunity for me to be a part of the digitization of such old literature. At times, I wish I could get involved with this full time!

SP: You have digitized almost two books, are the highest contributor to the project, and are also one of the main reasons Odia Wikisource got approved. What are your plans to grow it and take it to the masses?
Pankajmala Sarangi, a Wikisource writer: I would be happy to contribute by typing more books in Odia so that they can be stored and made available to all. We can take this to the masses through social, print, and audio-visual media and by organizing meetings and discussions.

Somewhere in Mumbai in a moving local train.
Subhashish Panigrahi (@subhapa) is the founder of OpenSpeaks, an award-winning project that helps grow open resources to digitally document marginalized languages. He co-founded O Foundation (OFDN), a nonprofit that works on issues at the cusp of people, culture, and technology, with openness at its core.

4 Comments

Sorry, your comment "So far, not many federal initiatives have been put into place, nor a single policy level change has been made, to implement a standard as simple as like Unicode for easy access of information." is plain wrong. In 1991, the Department of Electronics (now part of MCIT) worked in concert with C-DAC and BIS to standardize ISCII, a forerunner of 16-bit Unicode. Significant work had been done in Indian languages with federal funding. One of the big issues was state-level support: all states were trying to develop their own standards for keyboard entry and so on. Yes, the work on popularizing the tools was not done in earnest. It took C-DAC far too long to open its technologies; unfortunately, that was a result of some bad decisions taken earlier on.

Today, the biggest problem I feel users face is the lack of input devices. We had the Brahmi keyboard, which is very good, but sadly it is not bundled in the OS, and it is difficult to get keyboard stickers too :(

Dear Randompie, I appreciate your interest in this topic and your sharing useful information. But let me clarify that I was referring to policy-level changes after the classical language declaration for Odia. The policy reforms of the 90s are not relevant in this particular case. It is a different matter that, despite everything you've mentioned, there exist multiple Unicode standards which do not talk to each other. I face this real problem every single day. Anyway, that's not the matter under discussion. Talking about the openness and transparency of the government agency, let me share two facts. A CD containing language tools has many blank folders compared to the proprietary, Windows-specific software. A well-funded Odia OCR project has never been made public in the last 8 years for users to test and give feedback; it has only given personal name and fame to the person heading it. But assuming good faith, I do not intend to argue here about the matters you raised, as they are unrelated. I would love to discuss these over email (psubhashishatgmaildotcom). My last pie for @randompie: ISCII is history, and let's respect what happened then. But in the age of Unicode, talking about ISCII is pointless. Thanks and period.

In reply to by randompie

Dear SP,

I do agree with you that the government agency does need to do a lot towards transparency and openness.

As for your point about ISCII: well, isn't it a fact that Unicode for Indian scripts is largely ISCII?

Anyway, can you point me to a good FOSS OCR for Indian languages?

Thanks

Hi, I agree with your point about ISCII and Unicode. I have seen a demonstration of a proprietary Kannada OCR myself. As it is not available to download, is not open source, and the developer clearly stated that he is not even selling the package, I will refrain from giving details about it. It goes against my fundamental motivation for contributing to open source. The other widely known OCR is Tesseract. It of course needs a lot of training and collaboration. I do not have personal or professional time to spend on this at the moment, but I will continue to reach out to people. Who knows, there might be someone who can take it to some level?

In reply to by randompie

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.