2 legal-domain corpora

tiger

1. JRC-Acquis: a large aligned parallel corpus in 21 languages, freely
available

SIZE AND FORMAT

- 21 languages (all 20 official EU languages plus Romanian)
- Average corpus size: 8.8 million words per language
- XML format conforming to TEI P4, UTF-8 encoded
- Modular: download the languages you need.

LANGUAGES

Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish.

TEXT TYPES

- Documents on the contents, principles and political objectives of the
EU Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.

PARAGRAPH ALIGNMENT

- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- Two alternative alignments, produced with Vanilla and with HunAlign
- Ca. 270,000 alignments per language pair.
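
The figure of 210 language pairs follows from the 21 languages listed above: every unordered pair of distinct languages is counted once, i.e. 21 x 20 / 2. A minimal sketch to check this, assuming nothing about the corpus files themselves (the variable names are only illustrative, and the total is just a rough estimate from the "ca. 270,000" figure):

```python
from itertools import combinations

# The 21 languages named in the LANGUAGES list of this post.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]

# Each unordered pair of distinct languages counts once.
pairs = list(combinations(LANGUAGES, 2))

# Rough estimate only: the post gives ca. 270,000 alignments per pair.
approx_alignments_per_pair = 270_000
approx_total_alignments = len(pairs) * approx_alignments_per_pair

print(len(LANGUAGES), "languages")
print(len(pairs), "language pairs")
print(approx_total_alignments, "alignments in total (rough estimate)")
```

Both alignment versions (Vanilla and HunAlign) cover the same set of pairs, so the pair count applies to each of them.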

MANUAL SUBJECT DOMAIN CLASSIFICATION

- Manually classified according to EUROVOC subject domains
- Classes are selected from a wide-coverage set of 6000 hierarchically
organised classes.

USE / DOWNLOAD

- Download from http://langtech.jrc.it/JRC-Acquis.html
- Free to use for research purposes.


2. HOLJ Corpus: built within the SUM project in Edinburgh
(http://www.ltg.ed.ac.uk/SUM/index.html).
It contains annotated court decisions by the House of Lords and can be
downloaded for free.