EvaluationParCorp

Parallel Corpora for Grammar Evaluation

At our recent DELPH-IN meeting in Berlin, many participants agreed on a joint exercise: creating parallel corpora/treebanks for multiple languages in order to facilitate cross-lingual grammar evaluation. As a first step, participants may edit this page to link to their collected texts and provide a basic description of the data (source of the texts, license status, level of cross-lingual parallelism, etc.). You are also welcome to contribute your ideas under the section Individual Reflections. Please bear the following timetable in mind.

Language              Participant Group    Description    Data Link
Catalan (ca)          Barcelona            TBA
Chinese (zh)          Saarbrücken          TBA
English (en)          Stanford/Oslo        TBA
French (fr)           Toulouse             TBA
German (de)           Saarbrücken          TBA
Greek, Modern (el)    Saarbrücken/Athens   TBA
Japanese (ja)         Kyoto                TBA
Korean (ko)           Seoul                TBA
Norwegian (no)        Trondheim            TBA
Portuguese (pt)       Lisbon               TBA
Spanish (es)          Barcelona            TBA
Swedish (sv)          Linköping            TBA

There was some discussion of this subject at Fefor in 2006: FeforParCorp. That page lists several candidate corpora and some desiderata for choosing a text.

Individual Reflections

Stephan Oepen

Obviously, establishing a parallel corpus within DELPH-IN would have many advantages, and it would likely pull participants and grammar development more closely together. Each group has its own history, interests, and constraints imposed by how it funds its work, so we cannot expect everyone to focus their efforts on the same parallel corpus. However: (a) in some cases grammarians are relatively free in deciding on their target domain, genre, etc., and it would be great if DELPH-IN could take advantage of such freedom and have multiple efforts work on comparable data; (b) designed as resource grammars, typical DELPH-IN efforts tend to avoid specialization to a single domain or genre, so even a project working on its own domain may benefit from devoting some additional effort to a different target text or texts, e.g. ones taken from the DELPH-IN parallel corpus; and, finally, (c) with a growing interest in machine translation among participants, it will be a lot easier to build prototype systems and compare MRSs across grammars where there has been at least some development effort on a parallel corpus.

I would recommend the following criteria in looking for candidate texts:

At the Fefor and Berlin meetings, various text sources were proposed. There is at least one open-source advocacy text, The Cathedral and the Bazaar, freely available in many languages, though it is not clear what application one could envision around this kind of text. Open-source software documentation is another candidate source of parallel text, and being able to process it would have obvious applied value. However, (computer) manuals are often not produced as direct translations, so parallelism may be limited; moreover, linguistic variation may be somewhat restricted.

Works of art, specifically ones whose copyright has expired, could be candidate sources, e.g. Pride and Prejudice or A Doll's House, although it is not obvious to me which translations exist (freely), and how relevant variation in national copyright laws would be. Also, a large number of free translations of the Bible are available, some probably in contemporary language variants.

My personal favorites are tourism-related texts, e.g. the materials produced for international events (say the Olympics or the World Cup) or large cities (Athens, Barcelona, Berlin, Lisboa, Oslo, Paris, San Francisco, Seoul, Kyoto, you name it). These are instructional texts (How to get there?, Where to stay?, How to get around?) that are often prepared and translated with great care (and at great expense). The producers want such texts to be widely distributed, so getting permission should be possible. And, over time and around the world, such texts are produced in many source languages.

Francis Bond

I would support work on multiple corpora if we can find them. If we can't find anything better, then let's start with The Cathedral and the Bazaar; it fulfills our most important requirements (multilinguality, availability), even though I agree it is hard to link it to an application.

I am afraid that finding a corpus that covers all the languages for which we currently have grammars is extremely difficult. The OPUS corpus's approach of using open-source manuals and documentation seems the most promising, but our experience with the KDE corpus was that the Japanese translations were not well aligned at all, normally being one or two versions behind the English. Because annotation is so expensive, I think it is important to look for an application whose documentation translations are up to date for most of the languages we want. One candidate is the Emacs tutorial (emacs/etc/tutorial), which is about 600 sentences long. It is available for most, but not all, of the languages we want (bg, cn, cs, de, es, en, fr, it, ja, ko, nl, pl, pt_BR, ro, ru, sk, sl, sv, th, zh; missing: ca, el, no, pt); translating it into more would be a public service. Like most such documents, it has a fair bit of hairy non-text.
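The coverage check for a candidate text can be reproduced mechanically. The following is only a minimal sketch: the target set is taken from the language codes in the participant table on this page, the available set from the Emacs tutorial codes listed above, and the names TARGET_LANGUAGES, TUTORIAL_LANGUAGES, and missing_languages are made up for illustration. It assumes codes must match exactly, so pt_BR is not counted as covering pt.

```python
# Language codes from the participant table on this page.
TARGET_LANGUAGES = {
    "ca", "zh", "en", "fr", "de", "el", "ja", "ko", "no", "pt", "es", "sv",
}

# Codes listed above for the Emacs tutorial translations.
TUTORIAL_LANGUAGES = {
    "bg", "cn", "cs", "de", "es", "en", "fr", "it", "ja", "ko",
    "nl", "pl", "pt_BR", "ro", "ru", "sk", "sl", "sv", "th", "zh",
}


def missing_languages(targets, available):
    """Return the target codes with no exact match in `available`, sorted."""
    return sorted(targets - available)


missing = missing_languages(TARGET_LANGUAGES, TUTORIAL_LANGUAGES)
covered = sorted(TARGET_LANGUAGES & TUTORIAL_LANGUAGES)

print("covered:", ", ".join(covered))
print("missing:", ", ".join(missing))  # missing: ca, el, no, pt
```

The same two sets could be swapped out to compare other candidate corpora against the list of DELPH-IN grammars.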

I would welcome a freely redistributable corpus of tourism-related texts, but am slightly skeptical of our chances of finding a collection including Catalan, Norwegian and Korean, even without worrying about licenses.

last edited 2007-11-20 20:20:54 by FrancisBond
