Tan Songbo: Chinese Text Classification Corpus


动态语法
2005-07-17, 04:03 AM
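What do you use to open the Chinese texts in this corpus? The texts seem to have been selected by content,
rather than by the register-based classification that is generally used.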

http://lcc.software.ict.ac.cn/~tansongbo/corpus1.php




[This post was edited by the author on 2005-07-17 at 04:05:49]

xujiajin
2005-07-17, 07:40 PM
They seem to be in dat format, right? You can just open them with Notepad.

动态语法
2005-07-18, 03:03 AM
I used Notepad and Access to open the files, but still didn't see anything other than a bunch of text names and numbers.

xiaoz
2005-08-26, 07:51 AM
You will not be able to open those very large files with a MAT file extension using Access (shouldn't such extensions be reserved?). They are poorly named: the MAT extension marks them as Access table shortcut files, which in fact they are not. Because of that name, you cannot even open them directly with Notepad or WordPad.

Anyway, I have opened all of the files downloaded from that site. They are not the corpus itself, but a wordlist, files indicating text categories, and files of matrix data. The only useful thing is the wordlist, which you can get here:

http://www.corpus4u.org/upload/forum/2005082607495230.rar

After downloading this wordlist, you can now remove that rubbish taking up your disk space.
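
For anyone who would rather look inside the downloaded files than rely on the extension, here is a minimal Python sketch. It reads the first few kilobytes in binary mode and tries a couple of text encodings. The function name `peek_text` and the candidate encodings (UTF-8, then GB18030) are my own assumptions; nothing in this thread says which encoding the files actually use, and a sample that happens to decode is not proof that the file is text.

```python
import codecs


def peek_text(path, n_bytes=4096, encodings=("utf-8", "gb18030")):
    """Return (encoding, first lines) for a file, ignoring its extension.

    Returns (None, []) if none of the candidate encodings decodes the sample.
    The candidate encodings are guesses, not something documented for these files.
    """
    # Binary mode: the file extension plays no role in how the bytes are read.
    with open(path, "rb") as fh:
        sample = fh.read(n_bytes)

    for enc in encodings:
        decoder = codecs.getincrementaldecoder(enc)(errors="strict")
        try:
            # final=False tolerates a multi-byte character that was cut off
            # at the end of the sample.
            text = decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            continue
        return enc, text.splitlines()[:10]

    return None, []
```

Calling `peek_text("some_file.mat")` (a hypothetical file name) returns the encoding that worked together with up to ten lines, which should be enough to tell a wordlist or category file from matrix data.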

tiger
2005-08-26, 10:47 AM
useless

tansongbo
2008-04-10, 01:07 AM
The raw corpus can be downloaded after applying for access: http://www.searchforum.org.cn/tansongbo/corpus.htm