Torch Corpus: Texts of Recent Chinese (2013 summer edition), a one-million-word balanced Chinese corpus (2009)

xujiajin

Administrator
Staff member
#1
Torch Corpus: Texts of Recent Chinese (2013 summer edition)

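  Torch2009 is a corpus of modern Chinese built jointly by 115 teachers, master's students, and doctoral students from more than 64 universities across China, who collected and proofread the texts. The corpus contains 1,087,619 words and 1,703,635 characters (on average, one word corresponds to about 1.57 Chinese characters). The great majority of the texts were published in 2009. The name Torch is an acronym of Texts Of Recent CHinese. We hope that a new version can be released every few years on a similar model, so that the ongoing development of modern Chinese can be studied. We therefore intend the corpus to become a series, with Torch2009 as its first member. The word Torch also expresses our hope that the series will be passed on from hand to hand, like a torch, and continue without interruption.
  Together with the previously built Crown and CLOB corpora (see http://icame.uib.no/ij37/Pages_175-184.pdf ), this corpus forms a set of English-Chinese comparable corpora that can be used for English-Chinese contrastive research.
  Crown, CLOB, and Torch2009 can all be searched online through the BFSU CQPweb corpus platform (http://111.200.194.212/cqp/ ).
  A salient feature of the corpus is that it is "jointly built and shared". It was completed by more than a hundred teachers and students, and it is a new type of corpus that is shared free of charge with its builders and with the wider community of corpus researchers.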

Download Torch 2009 at http://www.bfsu-corpus.org/channels/corpus

Download Crown and CLOB at http://www.bfsu-corpus.org/channels/corpus

This is a brief description of Torch, the 2009 Chinese member of the Brown family of corpora. Torch is an acronym of Texts Of Recent CHinese. The project was initiated under the name CC2009, meaning Chinese Corpus 2009. The new name Torch was proposed by Xu Jiajin and went through several rounds of email discussion among members of the Corpus Research Group at Beijing Foreign Studies University. Most members seemed to agree that it is a memorable and meaningful name, although the vote for Torch was not unanimous.

The corpus contains 671 texts covering 15 text types (Press: Reportage, Press: Editorial, Press: Reviews, Religion, Skills and hobbies, Popular lore, Belles-lettres, Miscellaneous: Government & house organs, Learned, Fiction: General, Fiction: Mystery, Fiction: Science, Fiction: Adventure, Fiction: Romance, and Humour).

The corpus contains 1,087,619 tokenised words, counted by our Chinese word definition (regular expression [\u4e00-\u9fa5a-za-zA-ZA-Z0-90-9\.%%]+), and 1,703,635 Chinese characters, counted by our token definition (regular expression [\u4e00-\u9fa5]|[a-zA-Za-zA-Z0-90-9\.%%]+).
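As a rough illustration of how the two definitions differ, here is a minimal Python sketch that applies both regular expressions to one line of segmented text. The function names and the sample sentence are hypothetical and not part of the Torch distribution. The repeated ranges in the printed expressions are collapsed here, which does not change what an ASCII-only character class matches; if the original expressions also listed full-width letters, digits, and percent signs, those ranges would need to be added back.

```python
import re

# Word definition: a run of Chinese characters, Latin letters, digits, '.' or '%'.
# Duplicate ranges from the printed expression are collapsed (same behaviour).
WORD_RE = re.compile(r"[\u4e00-\u9fa5a-zA-Z0-9.%]+")

# Character (token) definition: each Chinese character counts on its own,
# while a run of Latin letters, digits, '.' or '%' counts as one unit.
CHAR_RE = re.compile(r"[\u4e00-\u9fa5]|[a-zA-Z0-9.%]+")


def count_words(segmented_text: str) -> int:
    """Count words in whitespace-segmented text using the word definition."""
    return len(WORD_RE.findall(segmented_text))


def count_chars(segmented_text: str) -> int:
    """Count characters using the character (token) definition."""
    return len(CHAR_RE.findall(segmented_text))


# Hypothetical segmented line, invented for this example.
sample = "我们 在 2009 年 收集 了 95% 的 文本"
print(count_words(sample), count_chars(sample))  # prints: 9 12
```

Under both definitions, "2009" and "95%" each count as a single unit, and punctuation is not counted at all.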

Most texts in the corpus were published in 2009.

This edition of the Torch corpus was tokenised (segmented) with ICTCLAS, via the YACSI interface. A manual check of the tokenisation shows an accuracy rate of over 95%.

This edition is called ‘TORCH 2013 summer edition’. It accepts the ICTCLAS-tokenised texts on an as-is basis; in other words, mis-tokenised words were not corrected.

Later, all problematic tokenisations will be corrected by human analysts, thus yielding an updated edition of Torch Corpus. The new edition will be made available through BFSU CQPweb (http://111.200.194.212/cqp/).

You can cite the corpus as:

Xu, Jiajin. 2013. Torch Corpus: Texts of Recent Chinese (2013 summer edition).

Acknowledgments
Special thanks go to the following text collectors and proofreaders, without whom the completion of the Torch2009 corpus would not have been possible.

More than 116 college teachers and graduate students from over 64 universities participated in the Torch2009 text collection and proofreading. Each text collector proofread the texts they had collected, and 刘燕 and 吉洁 then each checked all the texts thoroughly, one after the other.
Texts collected Collector University
72 刘燕 BFSU
35 丁晓阳 BFSU
22 章柏成 BFSU
19 尚延延 SDAU
16 章柏成 BFSU
14 陈真真 BUPT
12 韩菁菁 JUST
11 周琨 JUST
11 吴良平 HNUC
11 乔伟 BFSU
11 连美丽 BNU
11 胡蓉菁 NCHU
10 弭宁 QFNU
10 刘运锋 AHPU
10 李广伟 USC
10 陈功 UIBE
9 谢雪锋
9 邱广民 HEUET
9 陈珺 HZNU
7 李平 BUPT
7 黄艳 WHCSC
6 王冬梅 CUPL
6 田耀收 HNU
6 聂平俊 BUCEA
6 刘运锋 AHPU
6 陈常清
5 邹积铭 BFSU
5 张淑静 HNNU
5 张俭 DLOU
5 袁艳玲 USC
5 余敏 ZKY
5 于海岩 BSU
5 徐萍 CCZU
5 向冰 LZMC
5 吴进善 HNU
5 陶军海 ZJOU
5 唐红英 CZU
5 沈忆宁 NEDU
5 苗昱佶 HUU
5 麻亚东 BFSU
5 刘曦 SCFAI
5 刘伟 TSTC
5 刘磊 BFSU
5 李宏霞 GUET
5 黎雁 LNIT
5 吉丹丹 HUEB
5 付晶晶 NCEPU
5 段红燕 NEDU
5 杜爱玲 HNU
5 丁爱群 SQC
5 陈颖 CZU
5 陈令君 ZZU
5 曹霞 HRBEU
5 吉洁 BFSU
4 朱蕴 BFSU
4 周慧璇 BFSU
4 余敏 ZKY
4 要文静 SXJZ
4 徐明琦 HUST
4 熊淋宵 BUCM
4 韦储学 GUET
4 陶雪城 AHPU
4 齐东武 HNNU
4 门冬梅 DTU
4 路露 BFSU
4 龙宇 BNU
4 李志君 HQU
4 赖惟芝 HQU
4 胡燕 PUMC
4 冯倩倩 SYNU
4 戴丽琴 JXCTMI
4 程爽 JZU
4 陈晦 ZAFU
3 钟峰 CHDU
3 杨威 TUST
3 杨东焕 LUIBE
3 辛鑫 SXNU
3 万丽芳 AHUT
3 孙晓蕊 IMNU
3 舒婧娟 HBMY
3 石志亮 ZYUT
3 毛莹莹 BFSU
3 毛毳 SDUT
3 龙江 BFSU
3 刘座雄 HZAU
3 刘艳丽 TUST
3 刘进军 BHU
3 刘江涛 HDU
3 黄沭云 HYTC
3 和伟 ZHZHU
3 高媛 HEUET
3 陈珺 HZNU
2 周卫京 JUST
2 周榕 GMU
2 于建 SDUT
2 许家金 BFSU
2 吴金华 BNU
2 万丽芳 AHUT
2 孙迪辉 NUDT
2 史润霞 HZAU
2 聂平俊 BUCEA
2 李志君 HQU
2 韩菁菁 JUST
2 段红燕 NEDU
1 徐明琦 HUST
1 邹积铭 BFSU
1 陶雪城 AHPU
1 孙晓蕊 IMNU
1 史润霞 HZAU
1 齐东武 HNNU
1 门冬梅 DTU
1 李广伟 USC
1 赖惟芝 HQU
1 戴丽琴 JXTCMI
1 陈晦 ZAFU
1 要文静 SXJZ
 

xujiajin

Administrator
Staff member
#2
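Re: Torch Corpus: Texts of Recent Chinese (2013 summer edition), a one-million-word balanced Chinese corpus (2009)

The corpus contains 1,087,619 words and 1,703,635 characters, so one word corresponds to 1.57 Chinese characters on average. This is also roughly the conversion rate between English words and Chinese characters.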
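
The 1.57 figure follows directly from the two totals quoted in this thread. A minimal check in Python (the variable names are only illustrative):

```python
# Totals as reported for Torch2009.
words = 1_087_619
chars = 1_703_635

# Average number of Chinese characters per word.
ratio = chars / words
print(f"{ratio:.2f}")  # prints: 1.57
```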
 

chrisyang

Regular member
#4
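Re: Torch Corpus: Texts of Recent Chinese (2013 summer edition), a one-million-word balanced Chinese corpus (2009)

Thanks to Dr. Xu and the BFSU Corpus Research Group for their hard work and selfless dedication! Thank you for sharing!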
 

mayerniu

Junior member
#5
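Re: Torch Corpus: Texts of Recent Chinese (2013 summer edition), a one-million-word balanced Chinese corpus (2009)

Thanks to the teachers and students of the Torch Corpus development team for their hard work, and to Dr. Xu for sharing the information! The completion of this corpus is good news for the many corpus enthusiasts out there. Thank you!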
 