CORPUS OF WRITTEN TATAR

About the project

The Corpus of Written Tatar is a collection of electronic texts in the Tatar language.

Work on the corpus began in 2010. The project grew out of the authors' discussions about two directions of research:

the development of software for machine translation (MT) of Tatar texts into one of its kindred languages, and from this language back into Tatar;
the creation of a system for automatic recognition of Tatar speech within a restricted semantic domain.

By studying the relevant literature we became aware that modern systems of MT and automatic recognition of speech rely on national corpora of the languages in question, applying the “hypothesis — check” method. This fact urged us to commit ourselves to the creation of a similar corpus of the Tatar language.

The Corpus of Written Tatar is based mainly on materials available on the web. Texts from different sources were processed automatically before being included in the corpus: HTML tags were deleted, sentences in foreign languages were removed, the text encoding was converted to UTF-8, sentence boundaries were marked, etc.
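The processing steps listed above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the function name clean_page, the regular expressions and the default source encoding are assumptions made for illustration, and the removal of foreign-language sentences is left out because the method used for it is not described here.

```python
import html
import re

# Assumption: tags are simple enough to strip with a regular expression.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a sentence ends at ., !, ? or an ellipsis followed by whitespace.
SENTENCE_END_RE = re.compile(r"(?<=[.!?…])\s+")


def clean_page(raw: bytes, encoding: str = "utf-16") -> list[str]:
    """Decode a downloaded page, strip HTML tags and split it into sentences."""
    # Decode from the source encoding; the resulting str is Unicode and can
    # be written out as UTF-8 later with str.encode("utf-8").
    text = raw.decode(encoding, errors="replace")
    text = TAG_RE.sub(" ", text)               # delete HTML tags
    text = html.unescape(text)                 # turn entities such as &amp; into characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return [s for s in SENTENCE_END_RE.split(text) if s]


if __name__ == "__main__":
    page = "<p>Мин укыйм.</p> <p>Син язасың!</p>".encode("utf-16")
    for sentence in clean_page(page):
        print(sentence)
```

A real pipeline would need a proper HTML parser and a sentence splitter that handles abbreviations and quotations; the regular expressions here only show the order of the steps.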

Today the corpus aims to give a balanced picture of the Tatar language as it is actually used. Most of the texts included in the corpus belong to three styles: journalism (≈ 60%), fiction (≈ 35%) and scientific literature in the humanities (≈ 5%).

Work on collecting and processing materials is ongoing. Having learned about the existence of the Corpus of Written Tatar, many writers and scholars have provided us with electronic versions of their books and articles. In practice, we update the published version of the corpus when newly acquired contributions reach 5-6 million word occurrences. The user interface is updated at the same time.

New contributions to the Corpus of Tatar are welcomed with gratitude. If you want to help us, please send electronic versions of your own books, articles and other documents to us for inclusion in the corpus.

The Corpus of Written Tatar can also be regarded as an enormous reference book, giving the user an orderly view into the world of the Tatar language.

The basic purpose of the Corpus of Written Tatar is to assist research into the Tatar lexicon. Furthermore, the corpus can be used in language learning, and as a source of models for various types of documents.

The user interface of the Tatar language corpus makes it possible to perform the following operations:

searching for specific words (the results include frequency data);
finding out which words can occur before or after a given word (its left and right contexts), with frequencies given in the results;
looking for examples of the use of words that are difficult for the learner;
finding out whether a certain word, or a word form, occurs in the language.
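As an illustration of the left and right context search, the following Python sketch counts the words that occur immediately before and after a target word in a list of sentences. It is not the corpus's own search engine: the function name context_frequencies, the simple regular-expression tokenizer and the sample sentences are assumptions made for this example.

```python
import re
from collections import Counter


def context_frequencies(sentences, target):
    """Count the immediate left and right neighbours of `target`."""
    left, right = Counter(), Counter()
    target = target.lower()
    for sentence in sentences:
        # Assumption: a token is any run of letters or digits; punctuation is dropped.
        tokens = re.findall(r"\w+", sentence.lower())
        for i, token in enumerate(tokens):
            if token != target:
                continue
            if i > 0:
                left[tokens[i - 1]] += 1      # word before the target
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1     # word after the target
    return left, right


if __name__ == "__main__":
    sample = ["Мин китап укыйм.", "Ул китап яза.", "Мин китап яратам."]
    left, right = context_frequencies(sample, "китап")
    print(left.most_common())
    print(right.most_common())
```

The same counters also answer the last question in the list: a word form that never appears as a token in any sentence is absent from the sample.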

The searches described above allow the following tasks to be accomplished:

compiling a frequency dictionary of Tatar words;
carrying out research into the probabilistic-statistical modeling of Tatar texts;
studying the restrictions in the combinability of the lexical and syntactic units of the language;
compiling a reverse frequency dictionary of Tatar words, which is necessary for research into the morphological system of the language.
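The two dictionaries mentioned in the list can be sketched in Python as follows. This is an illustration under assumptions, not the project's method: the function names, the tokenizer and the sample sentences are hypothetical. A reverse dictionary is taken here to mean a word list sorted by spelling read from right to left, so that word forms sharing an ending stand next to each other.

```python
import re
from collections import Counter


def frequency_dictionary(sentences):
    """Map each word form to its number of occurrences."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: a token is any run of letters or digits.
        counts.update(re.findall(r"\w+", sentence.lower()))
    return counts


def reverse_dictionary(counts):
    """Return (word, frequency) pairs sorted by the reversed spelling.

    Sorting uses Unicode code points, which is only an approximation of
    Tatar alphabetical order.
    """
    return sorted(counts.items(), key=lambda item: item[0][::-1])


if __name__ == "__main__":
    sample = ["китап укыйм", "китаплар яза", "дәфтәрләр яза"]
    counts = frequency_dictionary(sample)
    for word, freq in reverse_dictionary(counts):
        print(f"{word:>12} {freq}")
```

Printing the words right-aligned makes the shared endings line up, which is the property that makes a reverse dictionary useful for studying suffixes.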

The list of applications of the corpus given above is, of course, not exhaustive. Electronic corpus materials are also indispensable for work on automatic speech recognition and machine translation.

In order to protect the copyrights of the authors, texts are stored in the corpus as individual sentences, which means that it is not possible to extract whole texts from the corpus. Each sentence is provided with a link to the literary work in question.

All texts of the Corpus of Written Tatar on this site are only made available for non-commercial scientific or educational use (Article 19 of the Russian Copyright Law). No text on the site can be downloaded and/or read in full.

If you quote text excerpts retrieved from the Corpus of Written Tatar, please cite the Corpus of Written Tatar as the source, together with the author and the title of the text in question.

Using the Corpus of Written Tatar is free of charge.

See the list of project members.