One of the aims of the project was to build a linguistically annotated database of written and spoken sources in Udmurt, Tundra Nenets, Synja Khanty and Surgut Khanty, making research on Uralic–Russian language contact possible.

In order to observe syntactic change in minority languages under contact influence, we aimed to process texts collected at different times. The oldest sources included in our project originate from the beginning of the 20th century. In addition, we gathered modern data from published and/or electronically accessible sources from the 21st century. We focused on selecting texts by as many authors as possible, covering different social classes, ages, genders, dialects and genres. Furthermore, our activities included fieldwork, during which we collected contemporary spoken language material; the database thus represents both the written and the spoken varieties of the languages.

Unfortunately, the Latin-based transcription systems used by Uralists are not standardized or unified, even within a single language. For this reason, it is important to publish the texts using the International Phonetic Alphabet (IPA), as it makes them readable for linguists working outside Uralic studies. Thus, the database contains each text at least in the original transcription used by the language documenter and in IPA transliteration. Moreover, since the writing systems of the languages concerned are based on the Cyrillic alphabet, we preserve the original Cyrillic script where it is available. Sentence-aligned English, Hungarian, German and Russian translations of the original texts are also available in the database.

Part of the corpus is also annotated at the morphological level. In these text samples, the lemma, the part-of-speech tag and the English and/or Hungarian glosses are added to each token. We follow the glossing conventions and abbreviations of the Leipzig Glossing Rules with some minor additions.

The morphologically analyzed version of each text is a tab-separated text file with a fixed set of columns, which contains all token-level information. A single hyphen in a cell marks a piece of information that is not available. Sentence boundaries are marked by empty lines. The columns are as follows:

1. Cyrillic token

2. Munkácsi token

3. Wichmann token

4. Steinitz token

5. SzOCh token

6. RME token

7. Hajdú token

8. Mus token

9. IPA token

10. segmented token

11. lemma

12. Hungarian gloss

13. English gloss

14. POS tag

15. RUS/-
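
The file layout described above can be read with a few lines of Python. The following is only a minimal sketch, not a tool shipped with the database: the column identifiers in `COLUMNS` and the function name `read_sentences` are made up for illustration, and it assumes that every token line has exactly fifteen tab-separated cells. A cell holding a single hyphen is mapped to `None`, in the last column as well.

```python
from typing import Dict, Iterable, Iterator, List, Optional

# Illustrative names for the fifteen columns, in the order listed above.
# These identifiers are assumptions; the files themselves carry no header.
COLUMNS = [
    "cyrillic", "munkacsi", "wichmann", "steinitz", "szoch", "rme",
    "hajdu", "mus", "ipa", "segmented", "lemma", "gloss_hu", "gloss_en",
    "pos", "rus",
]

Token = Dict[str, Optional[str]]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of token dictionaries per sentence."""
    sentence: List[Token] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cells = line.split("\t")
        if len(cells) != len(COLUMNS):
            raise ValueError(
                f"line {number}: expected {len(COLUMNS)} cells, got {len(cells)}"
            )
        # A single hyphen marks an unavailable piece of information.
        sentence.append(
            {name: (None if cell == "-" else cell)
             for name, cell in zip(COLUMNS, cells)}
        )
    if sentence:
        # The last sentence may not be followed by an empty line.
        yield sentence
```

Since the function accepts any iterable of lines, it can be fed an open file handle, for example one obtained with `open(path, encoding="utf-8")`, where `path` points to one of the analyzed text files.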

Additionally, an .eaf file is available for each source; it contains all token-level and sentence-level information, aligned to the audio data. The .eaf files can be opened and searched using the ELAN software package.
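
Because .eaf files are XML documents, their annotations can also be read without ELAN. The sketch below uses only the Python standard library and relies on the general structure of the ELAN format (`TIER` elements with a `TIER_ID` attribute, each holding `ANNOTATION_VALUE` elements). The function name `read_tiers` is hypothetical, and nothing is assumed here about which tier names the project's files actually use.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def read_tiers(source) -> Dict[str, List[str]]:
    """Map each tier ID to the list of its annotation values, in file order.

    `source` may be a file name or an open file object.
    """
    root = ET.parse(source).getroot()
    tiers: Dict[str, List[str]] = {}
    for tier in root.iter("TIER"):
        # Both time-aligned and reference annotations keep their text
        # in an ANNOTATION_VALUE element; empty values become "".
        values = [value.text or "" for value in tier.iter("ANNOTATION_VALUE")]
        tiers[tier.get("TIER_ID", "")] = values
    return tiers
```

This ignores the time alignment stored in the file's time slots; it is meant only for a quick look at which tiers a file contains and what they hold.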

Rendering the Uralic texts properly requires the Charis SIL font package. In order to see the Uralic characters in your browser, you must choose View/Text encoding/UTF-8 (the exact menu names vary from browser to browser; the important part is UTF-8).


