JRC-Acquis Parallel Corpus freely available

JRC-Acquis: a large aligned parallel corpus in 21 languages, freely
available


Readers on this list may be interested in the availability of the
'JRC-Acquis' parallel corpus:

SIZE AND FORMAT

- 21 languages (all 20 official EU languages plus Romanian)
- Average corpus size: 8.8 million words per language
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
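As a rough illustration of working with UTF-8-encoded TEI-style XML, the sketch below reads numbered paragraphs from a small inline sample. The element names (TEI.2, text, body, p) and the "n" numbering attribute are assumptions made for the example, not a confirmed description of the corpus files; check a downloaded file for the real layout before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The tag names and the "n" attribute are
# assumptions for illustration, not the confirmed JRC-Acquis schema.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="sample-en" lang="en">
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph.</p>
    </body>
  </text>
</TEI.2>"""


def read_paragraphs(xml_text):
    """Return a dict mapping each paragraph's "n" attribute to its text."""
    # Parse from bytes so the UTF-8 encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {p.get("n"): (p.text or "").strip() for p in root.iter("p")}


paragraphs = read_paragraphs(SAMPLE)
for number, text in paragraphs.items():
    print(number, text)
```

Keying paragraphs by a number makes it straightforward to look up the matching paragraph in another language once an alignment file says which numbers correspond.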

LANGUAGES

Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish.

TEXT TYPES

- Documents on the contents, principles and political objectives of the
  EU Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.

PARAGRAPH ALIGNMENT

- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- Two alternative alignments, produced with the Vanilla and HunAlign aligners
- Ca. 270,000 alignments per language pair.
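The figure of 210 language pairs follows from choosing 2 languages out of 21 without regard to order: 21 x 20 / 2 = 210. A minimal sketch that checks this against the language list in this announcement and derives an approximate overall alignment count from the "ca. 270,000 per pair" figure (the total is a back-of-the-envelope estimate, not a number stated by the corpus authors):

```python
from itertools import combinations

# The 21 languages listed in the announcement.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]

# Unordered pairs: (Czech, Danish) and (Danish, Czech) count once.
pairs = list(combinations(LANGUAGES, 2))
pair_count = len(pairs)

# Rough estimate only: the per-pair figure is itself approximate.
ALIGNMENTS_PER_PAIR = 270_000
approx_total_alignments = pair_count * ALIGNMENTS_PER_PAIR

print(pair_count)
print(approx_total_alignments)
```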

MANUAL SUBJECT DOMAIN CLASSIFICATION

- Manually classified according to EUROVOC subject domains
- Labels are selected from 6000 hierarchically organised, wide-coverage classes.

USE / DOWNLOAD

- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.

FOR MORE DETAILS

Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
http://langtech.jrc.it/#Publications.


CONTACT FOR FURTHER INFORMATION

Ralf Steinberger (Ralf.Steinberger@jrc.it)
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology
URL: http://langtech.jrc.it, http://press.jrc.it/NewsExplorer
T.P. 267, Via Fermi 1
21020 Ispra (VA), Italy
Tel: +39 0332 78-6271
Fax: +39 0332 78-5154