Chinese Gigaword Second Edition


Haiyang
2005-09-26, 07:48 PM
Chinese Gigaword Second Edition is a comprehensive archive of Chinese
newswire text that the LDC has acquired over several years. This
release includes all of the contents of the first release of the
Chinese Gigaword corpus (LDC2003T09), material from one new source,
and new material from the other two sources. The corpus thus contains
three distinct international sources of Chinese newswire: Central News
Agency (Taiwan), Xinhua News Agency, and Zaobao. Some minor updates
have been made to the documents from the first release.

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14

xusun575
2005-09-26, 08:16 PM
Could you please define what Gigaword is?

Haiyang
2005-09-26, 08:45 PM
Chinese Gigaword is a commercial corpus.
I believe "Gigaword" here just indicates the corpus size, though I'm not quite sure about the original idea.

xujiajin
2008-11-28, 03:03 AM
An update

Tagged Chinese Gigaword
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T03

Chinese Gigaword Third Edition
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T38

The Gigaword family includes several corpora of news media language:
Catalog Number Corpus Name
LDC2003T12 Arabic Gigaword
LDC2006T02 Arabic Gigaword Second Edition
LDC2007T40 Arabic Gigaword Third Edition
LDC2003T09 Chinese Gigaword
LDC2005T14 Chinese Gigaword Second Edition
LDC2007T38 Chinese Gigaword Third Edition
LDC2003T05 English Gigaword
LDC2005T12 English Gigaword Second Edition
LDC2007T07 English Gigaword Third Edition
LDC2006T17 French Gigaword First Edition
LDC2006T12 Spanish Gigaword First Edition
LDC2007T03 Tagged Chinese Gigaword

English Gigaword Third Edition
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07