Research Group for Language Technology


     Chair: Tamás Váradi, Senior Research Fellow

     Secretary: Ágnes Talián

     E-mail: talian.agnes[at]

     Phone: (36-1) 3214-830/191



The Department of Language Technology was created in 1997, as a formal recognition of several years of research and development in the field of language technology. The department has since accumulated significant research experience and has made remarkable achievements, especially in the development of linguistic resources. It has participated in several successful international projects which aimed, on the one hand, to adopt for the analysis of Hungarian certain processes that were developed for western European languages and are now considered part of the standard (Multext-East, Gramlex) and, on the other hand, to develop new standards for creating linguistic resources (electronic dictionary databases, CONCEDE). The researchers at the department have acquired significant knowledge about the computerized language processing systems and technologies developed or applied in these projects, and have played an active role in adapting these to the needs of Hungarian.

An extended version of the Hungarian National Corpus, a reference corpus of present-day written Hungarian, has recently been completed at the department. It now consists of 187 million words and also includes language variants from Slovakia, Subcarpathia, Transylvania and Vojvodina. The processes and programs already applied successfully in processing the corpus (i.e., those used for tokenizing or for disambiguating on the basis of statistical data), together with the technologies used in international projects for building the lexical database (e.g., SGML/XML editors, validating programs and descriptive grammars), have given the researchers at the department an opportunity to test and develop important language processing applications for Hungarian.

It can be considered a sign of great international recognition of the department as well as of the institute that Budapest was awarded the right to organize the 2003 conference of the European Association of Computational Linguistics (EACL’03).  The department played a central role both in preparing the application (which was evaluated on the basis of very strict criteria) and in organizing the event itself.

In summary, the Department of Corpus Linguistics has accumulated a decade of experience in computational linguistics. As a result of its participation in several international projects in the field of language technology, and its regular and active presence at leading international conferences and workshops from the 1990s onwards, the department has acquired the status of a leading intellectual base for Hungarian language technology.

Main research topics:

Natural language processing. Computer-based analysis of the morphology and syntax of Hungarian. Development of language resources, especially:

The Hungarian National Corpus. One of the most important tasks of the department is the development of the corpus, which at the moment contains 187 million words, with morphological analysis and automatic part-of-speech tagging. The corpus is available through the Internet. The corpus includes texts representing five varieties of written language: the language of the press, of fiction, of popular science, as well as official and personal writings.

The development of dictionaries and lexical databases on the basis of a large amount of data reflecting language use.

Current international and national projects:

"Cross-language Access to Catalogues And On-line libraries" (CACAO)
Duration: 2007-2009
Funding: eContentPlus programme of the European Union
* Xerox Research Centre Europe (XRCE), France -- coordinator
* Centre Georges Pompidou, France
* INESC-ID (Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa) Portugal
* The Portuguese National Digital Library, Portugal
* CELI, (Centro per l'Elaborazione del Linguaggio e dell'Informazione) Italy
* Bolzano University Library, Italy
* Freie Universität Bozen / Libera Università di Bolzano, Italy
* Kornik Library, Poland
* National Szechenyi Library, Hungary
* Research Institute for Linguistics, Hungarian Academy of Sciences, Hungary
* Göttingen University, Germany
* The European Library

CACAO offers an innovative approach for accessing, understanding and navigating multilingual textual content in digital libraries and OPACs, enabling European users to better exploit the available European electronic content. By coupling Natural Language Processing techniques with available information retrieval systems and with tools for facilitating the maintenance of multilingual resources, we aim to deliver a non-intrusive infrastructure that can be integrated with current OPACs and digital libraries. As a result of this integration, users will be able to type in queries in their own language and retrieve volumes and documents in any available language.



"Comparative Evaluation of the Hungarian and Slovene Wordnet in Machine Translation"  2009-2010  

The project aims to evaluate the Slovene and the Hungarian WordNet in Slovene-English and Hungarian-English Machine Translation. The tool we plan to use is the language-independent tool developed by the IXA Group, which performs word sense disambiguation (WSD) with the help of WordNets. Besides improvements in the field of WSD, we expect better results in cases where no translation equivalent is found for a source word or phrase, as the semantic database may provide a hypernym.
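The hypernym fallback mentioned above can be illustrated with a minimal sketch. It is not the IXA Group tool or the project's implementation: the two small dictionaries and the function name `translate_with_fallback` are hypothetical stand-ins for a WordNet and a bilingual lexicon.

```python
# Toy data: every entry below is hypothetical and only illustrates the idea.
# Hypernym links of a tiny Hungarian "WordNet" (word -> more general concept).
HYPERNYMS = {
    "puli": "kutya",    # a herding dog breed -> dog
    "kutya": "állat",   # dog -> animal
}

# Tiny Hungarian-English lexicon; note that "puli" has no entry of its own.
LEXICON = {
    "kutya": "dog",
    "állat": "animal",
}


def translate_with_fallback(word, lexicon=LEXICON, hypernyms=HYPERNYMS):
    """Return (translation, concept_used).

    If the word itself has no translation equivalent, climb the hypernym
    chain until a translatable concept is found. Returns (None, None) when
    nothing on the chain can be translated.
    """
    seen = set()
    current = word
    while current is not None and current not in seen:
        if current in lexicon:
            return lexicon[current], current
        seen.add(current)               # guard against cyclic links
        current = hypernyms.get(current)
    return None, None


if __name__ == "__main__":
    print(translate_with_fallback("puli"))
```

The result is less specific than a direct equivalent would be, which is the trade-off such a fallback accepts in exchange for producing some translation at all.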


International and national projects already completed

Construction of the Hungarian WordNet Ontology and its Application in Information Extraction Systems
Project type: Economic Competitiveness Operative Program (GVOP) 2004-05-191 project
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
Duration: April 2005 - July 2007
Consortium members:
* University of Szeged, Department of Informatics, HLT Group (coordinator)
* MorphoLogic Ltd. Budapest
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics

The development of computer applications for Hungarian calls for a Hungarian vocabulary database that can be managed by automated processes. In computational linguistics, an ontology can be defined as a data structure of formally defined concepts and relations, by means of which semantic inferences can be drawn. The so-called language ontologies form an important sub-class of computational ontologies.
The objective of the project was to create a semantically structured, general-purpose Hungarian concept set on the basis of the results and formalism of the EuroWordNet language ontology. A further aim was to supplement the resulting ontology with a special sub-language already examined by the consortium and with a domain-specific ontology including expressions of business language. Finally, we wished to present a potential application of the resulting concept network in the field of information extraction.
The main result of the project is the development of a large, strictly structured natural language concept set (ontology), which helps in finding solutions to several important scientific and technological problems. Regarding scientific achievements, it is important to emphasize that the developments concern the semantics of Hungarian, i.e. of a language which differs significantly, both typologically and morphologically, from the other European languages investigated.
Further scientific and technical objectives of the project included:
(1) research and development of machine learning algorithms to support automatic, heuristic-based ontology building (algorithms help reduce manual work to validation);
(2) research in fields of word sense disambiguation and anaphora resolution;
(3) development of an ontology-based information extraction software prototype for the domain of business news, which is capable of demonstrating the advantages of the application of the concept network.
As the structure of WordNet ontologies is much more complex than that of any simple lexicon or thesaurus, their potential applications are far richer. As a mental encyclopedia of native speakers of Hungarian, a Hungarian WordNet ontology could, to a large extent, assist language teaching in schools. Its standardised interconnection with the other WordNets guarantees its applicability in teaching foreign languages as well. The proper acquisition of the lexical material of the studied foreign language, for example, may significantly contribute to the learner's clear understanding of the differences and similarities between his/her native language and the target language. Apart from this, the concept network of WordNet may play a great role in psycho-linguistic experiments concerning the Hungarian language.
Beyond purely scientific applicability, language technology applications of a Hungarian WordNet may also open new vistas. The efficiency of search engines is greatly increased if these tools have reliable access to the semantic environment of the search expression. This may lead to future search engines that are capable of satisfying user needs to a greater extent. It may also increase the efficiency of information extraction and machine translation technologies by providing information about the semantic attributes of the analysed text. Automated processes supported by ontologies can handle the context of the information that has to be extracted or translated, and are therefore likely to produce more reliable results than mere pattern matching or word-by-word translation methods.
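How access to the semantic environment of a search expression can help a search engine may be sketched as simple query expansion. The concept network below is a hypothetical toy structure, not the Hungarian WordNet, and the function names are chosen only for this illustration.

```python
# Hypothetical toy concept network: each term lists synonyms and more
# specific concepts (hyponyms). A real WordNet is far richer than this.
CONCEPT_NETWORK = {
    "car": {"synonyms": ["automobile"], "hyponyms": ["taxi", "limousine"]},
    "taxi": {"synonyms": ["cab"], "hyponyms": []},
}


def expand_query(term, network=CONCEPT_NETWORK):
    """Return the term followed by its synonyms and hyponyms, without duplicates."""
    entry = network.get(term)
    if entry is None:
        return [term]
    expanded = [term]
    for related in entry["synonyms"] + entry["hyponyms"]:
        if related not in expanded:
            expanded.append(related)
    return expanded


def search(term, documents, network=CONCEPT_NETWORK):
    """Return documents containing the term or any semantically related word."""
    terms = set(expand_query(term, network))
    return [doc for doc in documents if terms & set(doc.lower().split())]


if __name__ == "__main__":
    docs = ["A taxi arrived", "The weather is fine"]
    print(search("car", docs))
```

A plain keyword search for "car" would miss the first document; the expanded query finds it because "taxi" is listed as a more specific concept.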

Hungarian-English Machine Translation System
Project type: National Research and Development Programme (NKFP) 2/008/2004 project
Duration: 01. January 2005 - 31. May 2007
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
Consortium members:
* MorphoLogic Ltd. Budapest (coordinator)
* University of Szeged Department of Informatics, HLT Group
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics

The aim of the project was to implement a Hungarian-English machine translation (MT) system. Three application prototypes can be built upon it: an example sentence translator, software supporting the understanding of free text strings, and a form-filler translator. The long-term aim of the project is to enhance the Hungarian language infrastructure and increase the competitiveness of economic entities. The system helps users to fill in official forms in English and to translate business letters into English, and helps Hungarian firms to enter international markets. Interested non-Hungarians can gain access to information on Hungarian organisations and events for which no English description would otherwise be available. The translation service of the European Commission uses MT systems for translating less sensitive documents (among the languages of the "old" members); software of this type would therefore also be of great use in the EC.
It is common practice in EU institutions to use MT for documents requiring fast and cheap, comprehensible but not too stylish translation (as in the case of proposals, comments and inter-institution mail), the output of which is then corrected by human translators. Its cost per page is less than one-fifth of that of a human translation. No automated translation system with Hungarian as the source language exists at present, so we suppose that the English-Hungarian MT system already under development and the Hungarian-English MT system to be developed would be used by EU institutions for quick translation of documents. International institutions other than the EU also use MT, so the system could find its place either as a product or as a service in both the Hungarian and the international markets.
The project aimed, therefore, at developing a Hungarian-English MT system that places a great emphasis on facilitating the international integration of Hungary. Through this, it increases the competitiveness of the economic entities in the international market, makes EU development resources more available, thus it encourages the innovation activities of small- and medium-sized enterprises as well as state-financed organisations, which entails the visible improvement of the country's R&D potential. The focus areas of the development were:

* translating tender forms and schematised international contact correspondence into English;
* familiarising foreigners with Hungarian enterprises or, to be more precise, facilitating the appearance of Hungarian enterprises in international markets;
* satisfying the requirements of the EU translation organisations, especially the Commission's Translation Service (Service de Traduction).

The quality of MT is obviously far from a human translation but, since its costs are also considerably lower, its application can be justified in certain areas. Documents translated by computers are not intended for publication: their prime objective is to support the understanding of foreign language documents and the reader is left to his/her own intelligence to filter out and understand confusing misinterpretations that are trivial for human wit yet, at the present stage of artificial intelligence research, irresolvable for computers. For this reason the better the reader knows the domain of the text the more useful is the output of the translation software for him/her.
In machine translation, the targeted enhancement of translation in a given domain, subsequent to the development of the core system, results in a remarkable increase in translation quality in the case of deterministic translation systems. It is well worth designating a few areas in which the translation system is to generate translations of above-average quality. Taking the above priorities into consideration, the translation system to be developed will be optimized for translation in the public administration and economic domains.
From a technical point of view, the system merges the advantages of direct and transfer translation mechanisms and incorporates corpus linguistic methods as well; thus it is capable of pattern-based translation, too. It is based on the concept that the translation process is carried out not in two strictly separated phases of analysis and synthesis but rather in one single phase, practically simultaneously with analysis. The system analyses the texts and constructs the translation during analysis not (only) through abstract rules but on the basis of lexically more or less specified or under-specified patterns.

Examination of National and Ethnic Identity by Means of Computerised Content-analysis of Narratives pertaining to Historic Events
Project type: Ányos Jedlik Programme 6/074/2005 project
Duration: 01. January 2006 - 12. December 2008
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
Consortium members:
* University of Pécs, Institute of Psychology (coordinator)
* Hungarian Academy of Sciences, Research Institute of Psychology
* University of Szeged, Department of Informatics, Human Language Technology Group
* MorphoLogic Ltd. Budapest
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics

This research explored the historically changing strategies of identity construction in historical narratives of traumatic events of the Hungarian historical past (Trianon, World War II, Holocaust, 1956) with the help of automated language analysis methods. The analyses explored the processes, in which different qualities of the Hungarian national identity are shaped. They also enabled us to map the trends and psychological conditions of change, and the knowledge of the processes of coping with negative historical events.
Another important element of the research was the analysis of parallel stories, e.g., the history of the Austro-Hungarian Monarchy from Austrian and Hungarian viewpoints in the field of group references, evaluation perspectives, and the comparison of group aims. We identified, on the level of the text, formal and informal groups separated in historical memory, language patterns of group agency, group aims, subjectivity, inter-group relations, struggle and emotional identification, and their connection with agents. To this effect, we prepared a language analyser for the following concrete psychological processes: group agency, group proximity and abduction, emotional evaluation, group struggle, group viewpoint, and change of viewpoint, time continuity and discontinuity.
The project had a twofold aim. On the one hand, we created content analysis software that is able to analyse content above sentence level. On the other hand, we deepened the present knowledge of Hungarian national identity, of the modes and components of identity construction, and of the factors influencing change. Within the latter topic, we intended to compare stable and changing social psychological constructions, analyse the relationship between competing representations, and check the generational hypotheses referring to the representation of traumatic historical events. Our further aim was the description of the strategies for coping with loss, shame, and the sense of guilt, and the examination of the appearance of different perspectives in historical narratives: the distribution of responsibilities, the distribution of agency vs. submission, the appearance of endangerment vs. safety and of solitude vs. interdependence, and the appearance of the evaluation pattern of acceptability and unacceptability from the viewpoint of the group as well as of the narrower and wider social environment.

Unified Hungarian Ontology
Project type: National Research and Development Programme (NKFP) 2/042/2004 project
Duration: 01. October 2004 - 31. October 2006
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
Consortium members:

* Budapest University of Technology and Economics, Department of Sociology and Communication (coordinator)
* Budapest University of Technology and Economics, Department of Telecommunication and Media Informatics
* MorphoLogic Ltd. Budapest
* Scriptum Informatics Corporation
* Applied Logic Laboratory
* University of Szeged, Department of Informatics, HLT Group
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics

In the public services practice of companies, valuable knowledge arises on a daily basis, which is worth keeping track of in a company knowledge-base so that any PR co-worker can utilise it on the next occasion. The operation of a continuously growing knowledge-base requires ontology-based knowledge management skills that can ensure the integration and systematisation of the practical, factual information it contains. The immediate objective of the project was the intelligent computational support of such public service activities in the field of telecommunication. A successfully developed ontology infrastructure can be used in any other domain, provided that a domain-specific knowledge-base and ontology are available.
In order to achieve the immediate objectives, the project had to carry out developments in such a way that their results could be used in a wider circle and for other purposes as well. Therefore, the indirect objective of the project was the creation of a unified national ontology framework that contains a freely available top ontology and a domain ontology of public telecommunication services. By this means, the consortium wished to create an open, freely available ontology infrastructure containing an ontology management methodology, ontology handling tools, a practical guide and the necessary cooperative system for the maintenance of the framework.
The term 'ontology' first appeared in the world of data modelling and artificial intelligence; it was only later that it was used in an increasing number of other fields, e.g., cognitive psychology and natural language processing. The current and still growing popularity of the term is due to the international Semantic Web initiative. For some it may seem that this category, which has emerged in the past few decades, is the product of informatics, but ontology has always been a special field of philosophy. Therefore, if we want to thoroughly understand the activities of information technology concerning ontology, it is worth separating philosophical ontologies and the so-called industrial ontologies from each other. Whatever we call them, the real value of this separation for us is a more precise and less ambiguous description of the inner structure and features of applicable ontologies.
Ontology building requires strict methodology, adequate ontology management skills and the establishment of a robust infrastructure. We have to be prepared that, in a short while, the unified ontology framework might have the function of loosely connecting different domain ontologies. This will require skills in comparing ontologies and matching them loosely. One possible tool for making ontologies comparable is connection through top categories, which requires formal logic tools and methodologies.

E-vocabulary -- Educational aid for examining contemporary Hungarian literature and its vocabulary from multiple angles
(IHM-ITP-11 /106)
Duration: 2004-2005
In the framework of this project we have morphologically analysed and disambiguated a 33 million word corpus containing texts of contemporary Hungarian literature, forming part of the "Digital Literary Academy". We have developed a related intelligent query interface, tables showing all possible word forms of words, as well as word form-, word stem- and part-of-speech based frequency lists. These developments are what the electronic curriculum in Sulinet STD is based on.
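Word form-, stem- and part-of-speech based frequency lists of this kind can be derived from an analysed corpus along the following lines. This is only a minimal sketch: the token triples and the tag names are hypothetical and do not reproduce the project's actual annotation format.

```python
from collections import Counter

# Hypothetical output of morphological analysis and disambiguation:
# one (word form, stem, part-of-speech tag) triple per corpus token.
ANALYSED_TOKENS = [
    ("házak", "ház", "NOUN"),    # "houses"
    ("házban", "ház", "NOUN"),   # "in (the) house"
    ("áll", "áll", "VERB"),      # "stands"
    ("házak", "ház", "NOUN"),
]


def frequency_lists(tokens):
    """Return three Counters: by word form, by stem and by part of speech."""
    forms = Counter(form for form, _, _ in tokens)
    stems = Counter(stem for _, stem, _ in tokens)
    tags = Counter(pos for _, _, pos in tokens)
    return forms, stems, tags


if __name__ == "__main__":
    forms, stems, tags = frequency_lists(ANALYSED_TOKENS)
    print(forms.most_common())
    print(stems.most_common())
    print(tags.most_common())
```

Counting by stem groups together inflected forms that a word form list keeps apart, which matters for a morphologically rich language such as Hungarian.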

Intelligent multilingual document classification in EUROVOC system (ITEM 2003/000165)
Duration: January 2004 - December 2004
* MorphoLogic Ltd. Budapest

The project aimed at developing a multilingual system which automatically classifies documents according to their content following the categories of the EUROVOC categorisation system (thesaurus), which is regularly used in the European Union. During the project the Hungarian version of the whole EUROVOC system was developed, along with the technology with which the automatic content-based classification of texts, primarily in Hungarian, English, German and French, can be accomplished.

Intelligent electronic dictionary and lexical database (INLEX) 48/ 2002 ITEM project
Duration: 2003-2004
The aim of the project was to develop an up-to-date, electronic, machine-readable dictionary and lexical database which satisfies the needs of the information society, and to make it searchable via the internet. The database was created by means of a technology which follows and applies international standards. It was sufficiently explicit and practical, conforming to the needs of computer-based applications, and thus it could provide up-to-date information which flexibly adapts to the needs of language technology, scientific research, education, or those of the general public. The project was essentially based on one technological and one content source. In the course of the CONCEDE (Consortium for Central European Dictionary Encoding) project, in which the Department of Corpus Linguistics also took part, a representation formalism was developed which, being based on international standards but taking into account the specific features of individual languages, including Hungarian, is capable of coding and storing lexical information in a way which satisfies the above requirements. The INLEX project aimed to use this technological basis and to develop it further. The source of the content of the electronic dictionary is the Concise Hungarian Explanatory Dictionary, compiled at the Research Institute for Linguistics, which was also the basis for the CONCEDE project, although the processing of the whole dictionary could not be carried out in the framework of the latter, and only some parts of it were used as test data.

Machine learning of syntax rules (application of machine learning methods for the generation of Hungarian syntactic rules)
Project type: Info-communication Technologies and Applications (IKTA) 37/2002 RTD project
Duration: 01. October 2002 - 31. October 2004
Funding: Ministry of Education
Consortium members:

* University of Szeged, Department of Informatics, HLT Group (coordinator)
* MorphoLogic Ltd. Budapest
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics

Parsing, or syntactic analysis of texts plays a key role in natural language processing (NLP). Similarly to many other languages, Hungarian heavily relies on the use and interrelation of suffixes (morphemes) and elementary word structures (syntagmas). The recognition of syntagmas and identification of their relation to each other is essential in NLP systems. Lacking this, semantic analysis of natural language sentences would not be executable. Also, artificial intelligence programs could work much more efficiently by the introduction of a thorough syntactical analysis. Promising fields of application include machine translation, automatic information extraction, and text analysis for scientific or commercial purposes.
Research groups studying the structure of Hungarian sentences have made a great effort to produce a consistent syntax rule system, yet these have not been adaptable to practical, computer-related purposes so far. This implies that there is a strong demand for the development of a technology that would be able to divide a Hungarian sentence into syntactic segments, recognize their structure and, based on this recognition, assign an annotated tree representation to each sentence. Such so-called treebank representations have already been developed for most West European languages, and for some Central and East European languages as well.
In relation to the above, the project's main goal was twofold. On the one hand, we aimed to develop a general-purpose syntactic parser for Hungarian, with the support of machine learning algorithms. An inevitable precondition of the technology behind a syntactic parser that has the required efficiency is the existence of a syntactically annotated Hungarian language corpus of suitable size (a treebank), which can serve as a learning database for the machine learning system, and also as a basic reference for future similar research. Therefore, another aim of the project was to develop such a treebank.
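To make the notion of an annotated tree representation concrete, the following sketch stores one sentence as a nested structure and prints it in bracketed form. The node labels (S, NP, VP, DET, N, V) and the data layout are assumptions made for this illustration; they are not the annotation scheme of the treebank developed in the project.

```python
# A node is a (label, content) pair. For a leaf, content is the word itself;
# for an inner node, content is a list of child nodes.
# Example sentence: "A fiú olvas" ("The boy reads").
TREE = ("S", [
    ("NP", [("DET", "A"), ("N", "fiú")]),
    ("VP", [("V", "olvas")]),
])


def to_brackets(node):
    """Render a tree in the bracketed notation commonly used for treebanks."""
    label, content = node
    if isinstance(content, str):
        return f"({label} {content})"
    children = " ".join(to_brackets(child) for child in content)
    return f"({label} {children})"


def leaves(node):
    """Return the words of the sentence in their original order."""
    _, content = node
    if isinstance(content, str):
        return [content]
    words = []
    for child in content:
        words.extend(leaves(child))
    return words


if __name__ == "__main__":
    print(to_brackets(TREE))
    print(" ".join(leaves(TREE)))
```

A collection of such trees is what a machine learning system would consume as training data when inducing syntactic rules.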

Information Extraction from Short Business News
Project type: National Research and Development Programme (NKFP) 2/17/2001 project
Duration: 01. July 2001 - 31. July 2003
Funding: Ministry of Education
Consortium members:

* MorphoLogic Ltd. Budapest (coordinator)
* University of Szeged, Department of Informatics, HLT Group
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics

The central aim of the project was to develop a technology which is capable of content-analysis and information-retrieval, with the help of which the relevant information could be obtained in a structured form from texts (from short business news). During the IE process, first textual data (natural language text) had to be parsed for relevant information, then the identified information had to be extracted and stored in a pre-defined structure. It was important that the system disregards irrelevant information, and that the structured data can be easily managed and queried by automated means. To accomplish this goal, participants represented the most typical events of business life by so-called semantic frames. The recognition of semantic frames was supported by shallow syntactic parsing methods. Consortium members applied machine learning algorithms for determining shallow syntactic rules. The learning process was conducted on the Szeged Treebank 1.0 already containing hierarchic noun phrase (NP) annotation and the marking of clause boundaries.
A by-product of the project was an annotated corpus of Hungarian, which serves as a reference for future linguistic and language technological research.
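The extraction steps described for this project (find the relevant information, then store it in a pre-defined structure while disregarding the rest) can be sketched as follows. The frame type `AcquisitionFrame`, its slots and the regular expression are hypothetical; in particular, the regular expression is only a simple stand-in for the shallow syntactic parsing and learned rules that the project actually relied on.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AcquisitionFrame:
    """Hypothetical semantic frame for one type of business event."""
    buyer: str
    target: str
    amount: Optional[str] = None


# Stand-in for frame recognition: a surface pattern over English text.
PATTERN = re.compile(
    r"(?P<buyer>[A-Z]\w+) (?:acquired|bought) (?P<target>[A-Z]\w+)"
    r"(?: for (?P<amount>[\d.,]+ (?:million|billion) \w+))?"
)


def extract_frames(text: str) -> List[AcquisitionFrame]:
    """Fill one frame per matching event; text that matches nothing is ignored."""
    return [
        AcquisitionFrame(m["buyer"], m["target"], m["amount"])
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    news = ("Yesterday Alfa acquired Beta for 12 million euros. "
            "The weather was fine. Gamma bought Delta.")
    for frame in extract_frames(news):
        print(frame)
```

Because each event ends up in a fixed set of named slots, the results can be stored and queried by automated means, which is the property the project description emphasises.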

2000-2002: MATCHPAD (Machine Translation for the Czech, Polish and Hungarian Public Administration)

1998-2000: CONCEDE (Consortium for Central European Dictionary Encoding) COPERNICUS project

1997-2002: TELRI (Trans European Language Resources Infrastructure) project

1995-1998: MULTEXT-EAST (Multilingual Text Tools and Corpora) COPERNICUS project