FeforParCorp

Parallel Corpora for DELPH-IN

  1. Parallel Corpora for DELPH-IN
    1. Collections/Samples of available parallel corpora
      1. Europarl Corpus
      2. OPUS: Technical Documentation (plus Europarl and European Constitution)
      3. The Sofie Treebank
      4. The JRC-Acquis Multilingual Parallel Corpus
      5. Cathedral and the Bazaar
      6. Universal Declaration of Human Rights
      7. Scroogled
    2. Some criteria for choosing a corpus

Collections/Samples of available parallel corpora

Europarl Corpus

OPUS: Technical Documentation (plus Europarl and European Constitution)

- URL: http://logos.uio.no/opus/

The Sofie Treebank

- The treebank was developed by the participants of the Nordic Treebank Network, in which academic institutions from Denmark, Estonia, Finland, Iceland, Norway, and Sweden took part. Information about status, availability, formats and analyses can be found on the project page.
- URL: http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.html

This is not redistributable.

Translations in other languages exist (including Japanese), which we may be able to get permission for.

The JRC-Acquis Multilingual Parallel Corpus

Cathedral and the Bazaar

We decided to use this as a corpus; the full description is now up at MatrixMrsCatb.

Universal Declaration of Human Rights

The preamble (a single sentence spanning several paragraphs) is impossible to parse, but apart from that the text isn't too difficult, and it provides some nice universal quantifiers and modals. It is a little short (65 sentences), but there are many other declarations. There are 369 different translations (4 more than last year), most of excellent quality; the multilinguality is the main selling point. It is freely available. There is a little synergy, as it is the de facto standard for testing Unicode fonts, so it should print nicely.

Scroogled

- URL: http://craphound.com/?p=1902

A short story with many free translations. It is a bit short: about 500 sentences.

Some criteria for choosing a corpus

  1. difficulty --- we need to have some hope of parsing it

  2. size --- to build statistical models it has to be a certain size

  3. quality --- the language should be natural (often a problem for translations)

  4. availability --- we need to be able to share the data

  5. multilinguality --- it would be nice to have existing translations

  6. relevance --- the genre should be one you are interested in

  7. synergy --- it is nice to reuse/complement existing markup

  8. diversity --- it can be interesting to experiment with a mixture of corpora of different text types
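As a rough illustration of the size criterion, the sketch below filters candidate corpora by sentence count. The sentence counts are the approximate figures quoted on this page; the `Corpus` class, the `big_enough` helper and the threshold of 100 sentences are hypothetical, chosen only for the example, and are not part of any DELPH-IN tool.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    name: str
    sentences: int  # approximate sentence count, as quoted on this page


def big_enough(corpora, min_sentences):
    """Return the corpora with at least min_sentences sentences."""
    return [c for c in corpora if c.sentences >= min_sentences]


# Figures taken from the descriptions above; both are approximate.
corpora = [
    Corpus("Universal Declaration of Human Rights", 65),
    Corpus("Scroogled", 500),
]

# The threshold is an arbitrary assumption for illustration only.
for corpus in big_enough(corpora, min_sentences=100):
    print(corpus.name, corpus.sentences)
```

In practice the other criteria (availability, multilinguality, relevance) would need to be weighed alongside size; a short corpus with many translations may still be the better choice.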

last edited 2008-08-05 08:17:18 by FrancisBond