[Help] Where can I download a free pinyin-to-character corpus?

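I am hoping to find a corpus that can be downloaded and used free of charge for research. The content should be Chinese characters with their corresponding pinyin; entries of the form "pin1 拼" would be enough. Any pointers would be much appreciated. Thank you.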
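
For illustration only, here is a minimal Python sketch of reading entries in that shape. It assumes one entry per line: a tone-numbered syllable, whitespace, then a single character. The name `parse_entry` and the sample lines are hypothetical and not taken from any particular corpus.

```python
import re

# Assumed entry shape: "<syllable><tone digit 1-5> <one character>", e.g. "pin1 拼".
ENTRY = re.compile(r"^([a-zü:]+)([1-5])\s+(\S)$")

def parse_entry(line):
    """Return (syllable, tone, character), or None if the line does not match."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    syllable, tone, char = m.groups()
    return syllable, int(tone), char

# Hypothetical sample lines; the last one is deliberately malformed.
sample = ["pin1 拼", "yin1 音", "not an entry"]
parsed = [parse_entry(s) for s in sample]
```

Lines that do not fit the assumed shape come back as `None`, so they can be counted or inspected rather than silently dropped.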
 
Re: [Help] Where can I download a free pinyin-to-character corpus?

The one-million-word Lancaster Corpus of Mandarin Chinese (LCMC) has a Pinyin version and a character version. The corpus is freely available for academic and educational use from the European Language Resources Association (ELRA); just search the ELRA/ELDA catalogue for No. W0037. It is also available from the Oxford Text Archive (OTA).
 
Re: [Help] Where can I download a free pinyin-to-character corpus?

Hi Dr. Xiao,
Interestingly, ELRA only has a CD-ROM version, and what's more fun, the OTA zipped file is empty (0 bytes)! I have informed OTA about this and hope they can fix it.
I am also interested in this corpus. A paper I read recently mentioned that there are mistakes (pinyin errors in polyphones) in the LCMC corpus. Do you intend to release an updated version?

 
Re: [Help] Where can I download a free pinyin-to-character corpus?

LCMC is not designed for that purpose. The Pinyin version was published simply to help the many non-native speakers who read Chinese through romanisation. The character-to-Pinyin conversion was done automatically, which means the romanisation was produced character by character without taking account of tone sandhi etc. New editions of LCMC only add richer linguistic analyses and correct tagging errors; they do not address the Pinyin issue.
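
To make the limitation concrete, here is a rough Python sketch of what character-by-character conversion looks like in general. It is not the tool actually used for LCMC; the lookup table `DEFAULT_READING` and the function `naive_pinyin` are hypothetical, and the table holds a single assumed "default" reading per character.

```python
# Toy lookup table with one "default" reading per character (hypothetical data).
DEFAULT_READING = {
    "银": "yin2",
    "行": "xing2",  # 行 is also read hang2, as in 银行 "bank"
    "走": "zou3",
}

def naive_pinyin(text):
    """Convert each character independently, ignoring its neighbours."""
    return " ".join(DEFAULT_READING.get(ch, ch) for ch in text)

print(naive_pinyin("行走"))  # xing2 zou3 (correct)
print(naive_pinyin("银行"))  # yin2 xing2 (the usual reading is yin2 hang2)
```

Because each character is looked up in isolation, a polyphone such as 行 always gets the same reading, which is the kind of error a context-free conversion cannot avoid.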
 
Re: [Help] Where can I download a free pinyin-to-character corpus?

That makes sense :)

So do you know of any corpus with Chinese/pinyin alignment?
Sentence-level alignment would be perfect.
 