The ltGLOBE Corpus: A balanced collection of contemporary written Lithuanian

xujiajin · 2022-12-08

The ltGLOBE Corpus

INTRODUCTION

The ltGLOBE Corpus is a balanced collection of contemporary written Lithuanian texts, totalling approximately one million words.

The text samples in the corpus were gathered and cleaned up by Yiran Wang, Yitong Zhang, Shuning Zhao, and Yutong Guan at the School of European Languages and Cultures, Beijing Foreign Studies University (BFSU), China.

The online version of the ltGLOBE Corpus is available on the BFSU CQPweb Corpus Portal (http://114.251.154.212/cqp/). Both the user ID and the passcode are ‘test’.

KEY INFORMATION

Project leader: Yiran Wang of the School of European Languages and Cultures, BFSU

Text collectors: Yitong Zhang, Shuning Zhao, Yiran Wang, and Yutong Guan (School of European Languages and Cultures, BFSU); Mingchen Sun (National Research Centre for Foreign Language Education, BFSU)

Time of compilation: September 2021 – November 2022

Size: Approximately one million words

Language: Contemporary Lithuanian

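Number of texts/samples: 500 samples of at least 2,000 words each. Where the source texts are shorter than 2,000 words, several short texts are combined to make up one sample; each short text is saved as a separate file, and A, B, C, etc. is appended to the filename to show that the files belong to the same sample.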
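The part-letter convention described above can be illustrated with a minimal sketch that regroups part files into samples and counts the words in each. The filename pattern used here (a sample ID ending in a digit, an optional part letter, and a `.txt` extension, e.g. `G07A.txt`) is a hypothetical assumption for illustration only, as are the function names; the notice does not specify the actual naming scheme, and whitespace splitting is only a rough word count.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme (NOT taken from the corpus documentation):
# "<sample id ending in a digit><optional part letter>.txt",
# e.g. "A01.txt" for a single-file sample, "G07A.txt" and "G07B.txt" for parts.
PART_PATTERN = re.compile(r"^(?P<sample>.+?\d)(?P<part>[A-Z])?\.txt$")


def group_parts(filenames):
    """Map each sample ID to the sorted list of files that make it up."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        match = PART_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed scheme
        groups[match.group("sample")].append(name)
    return dict(groups)


def sample_word_counts(texts_by_filename):
    """Sum whitespace-delimited word counts over all parts of each sample."""
    groups = group_parts(texts_by_filename)
    return {
        sample: sum(len(texts_by_filename[name].split()) for name in parts)
        for sample, parts in groups.items()
    }
```

Under these assumptions, `sample_word_counts` could be used to check that every regrouped sample reaches the 2,000-word target.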

Versions of the corpus: Three versions are available: raw texts, part-of-speech (POS) tagged texts, and lemmatised texts. The texts were POS-tagged and lemmatised using the Lithuanian Tagger in spaCy.
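As a rough illustration of how the POS-tagged and lemmatised versions could be derived from a raw text with spaCy, consider the sketch below. It is not the compilers' actual script: the notice does not say which Lithuanian pipeline was used, so the model name `lt_core_news_sm` is an assumption, and the `word_POS` output format and the function names are hypothetical.

```python
def annotate(nlp, text):
    """Return (pos_tagged, lemmatised) strings for one raw text.

    `nlp` is any spaCy-style pipeline: a callable that yields tokens with
    `text`, `pos_`, `lemma_`, and `is_space` attributes.
    """
    doc = nlp(text)
    tokens = [tok for tok in doc if not tok.is_space]
    # Hypothetical output format: "word_POS" pairs separated by spaces.
    pos_tagged = " ".join(f"{tok.text}_{tok.pos_}" for tok in tokens)
    lemmatised = " ".join(tok.lemma_ for tok in tokens)
    return pos_tagged, lemmatised


def load_lithuanian_pipeline(model_name="lt_core_news_sm"):
    """Load a spaCy Lithuanian pipeline.

    Assumption: the model name is a guess; the corpus notice only says
    "the Lithuanian Tagger in spaCy".
    """
    import spacy  # imported lazily so annotate() can be used without spaCy

    return spacy.load(model_name)
```

With spaCy installed and the model downloaded (`python -m spacy download lt_core_news_sm`), calling `annotate(load_lithuanian_pipeline(), raw_text)` would produce the two annotated strings for one raw text.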

Period: The texts were published between 2009 and 2022.

Released in: November 2022

BACKGROUND

On 29 December 2021, Jiajin Xu launched the GLOBE (Global Languages Out of BFSU Expertise) Corpus project, an initiative that aims to collect present-day written texts in all 101 languages taught at BFSU. The corpora follow the sampling frame of the Brown Corpus so that the multilingual GLOBE corpus family is comparable with the Brown family of corpora. The immediate intended application of the GLOBE corpora is corpus-based dictionary compilation. The first batch of corpora covers about 30 languages.

The ltGLOBE Corpus is a sub-project of the BFSU-funded GLOBE Corpus projects (Ref. 2022SYLZD015 and 2022SYLPY004), whose principal investigator is Prof. Jiajin Xu at the National Research Centre for Foreign Language Education, BFSU.

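The ltGLOBE Corpus (translated from the Chinese version)

INTRODUCTION

The ltGLOBE Corpus is a balanced corpus of contemporary Lithuanian, with a total size of approximately one million words.

The text samples in ltGLOBE were collected and processed jointly by Yitong Zhang, Shuning Zhao, Yutong Guan, and Yiran Wang of the School of European Languages and Cultures, Beijing Foreign Studies University (BFSU), and Mingchen Sun of the National Research Centre for Foreign Language Education, BFSU.

The corpus can be accessed online through the BFSU CQPweb multilingual corpus platform: http://114.251.154.212/cqp/. Both the user ID and the password are ‘test’.

KEY INFORMATION

Project leader: Yiran Wang (School of European Languages and Cultures, BFSU)

Main text collectors: Yitong Zhang, Shuning Zhao, Yutong Guan, and Yiran Wang (School of European Languages and Cultures, BFSU); Mingchen Sun (National Research Centre for Foreign Language Education, BFSU)

Time of compilation: September 2021 to November 2022

Size: Approximately one million words

Language: Contemporary Lithuanian

Number of texts: 500 texts of 2,000 words each (Where several texts shorter than 2,000 words make up one sample, A, B, C, etc. is appended to the filenames to show that they belong to the same 2,000-word text.)

Versions of the corpus: The ltGLOBE Corpus is available in three versions: raw texts, POS-tagged texts, and lemmatised texts. POS tagging and lemmatisation of the Lithuanian texts were carried out with the Lithuanian tagging module of the spaCy package for Python.

Original publication years: All collected texts were published between 2009 and 2022.

Released in: November 2022

BACKGROUND

On 29 December 2021, BFSU launched the "BFSU Global Corpus Cluster" project, also known as the GLOBE Corpus project. The full English name of GLOBE is Corpus of Global Languages Out of BFSU Expertise. The cluster aims to build corpora of contemporary written language for the 101 languages taught at BFSU.

The monolingual balanced corpora in the cluster adopt the sampling scheme of the Brown Corpus, which makes them comparable with the existing Brown family corpora and enables contrastive studies between each language and English or Chinese. The primary intended application of the series is corpus-based multilingual dictionary compilation. The first batch of GLOBE family corpora covers about 30 languages.

The ltGLOBE balanced corpus of Lithuanian is a sub-project of the BFSU "Double First-Class" project "BFSU Global Corpus Cluster" (project numbers 2022SYLZD015 and 2022SYLPY004), led by Prof. Jiajin Xu of the National Research Centre for Foreign Language Education, BFSU.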

The ltGLOBE Corpus (translated from the Lithuanian version)

INTRODUCTION

The ltGLOBE Corpus is a carefully designed one-million-word collection of contemporary written Lithuanian texts.

The samples for the corpus were selected and prepared by Yiran Wang, a lecturer at the School of European Languages and Cultures, Beijing Foreign Studies University (BFSU), China, and by the students Yutong Guan, Yitong Zhang, and Shuning Zhao.

The online version of the ltGLOBE Corpus can be found on the BFSU CQPweb Corpus Portal (http://114.251.154.212/cqp/). Both the user name and the password are ‘test’.

KEY INFORMATION

Project leader: Yiran Wang (School of European Languages and Cultures, Beijing Foreign Studies University)

Texts collected by: Yitong Zhang, Shuning Zhao, Yutong Guan, and Yiran Wang (School of European Languages and Cultures, Beijing Foreign Studies University) and Mingchen Sun (National Research Centre for Foreign Language Education, Beijing Foreign Studies University)

Time of text collection: September 2021 to November 2022

Size: About one million words

Language: Contemporary Lithuanian

Number of texts/samples: 500 samples of 2,000 or more words each. Short texts are combined into one 2,000-word text. However, all the texts that make up the corpus are saved separately and marked A, B, C, etc. in the filenames.

Versions of the corpus: Three versions are available: raw text, part-of-speech tagged text, and lemmatised text. The texts were POS-tagged and lemmatised using the spaCy Lithuanian Tagger.

Period: The texts were published between 2009 and 2022.

Release date: November 2022

BACKGROUND

On 29 December 2021, Jiajin Xu launched the GLOBE (Global Languages Out of BFSU Expertise) Corpus project, an initiative that aims to collect present-day written texts in all 101 foreign languages taught at Beijing Foreign Studies University. To make the multilingual GLOBE corpora comparable with the Brown family corpora, the Brown sampling frame was followed. The main aim of the GLOBE project is corpus-based dictionary compilation. The first version covers about 30 languages.

The ltGLOBE Corpus is a sub-project of the GLOBE Corpus projects (Nos. 2022SYLZD015 and 2022SYLPY004) funded by Beijing Foreign Studies University. The principal investigator is Prof. Jiajin Xu of the National Research Centre for Foreign Language Education, Beijing Foreign Studies University.
