A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

Glad it works.

A small tip for using NoteTab Light to work with ACWT:

Delete (or compress and save) all the other clip libraries in the
...\Notetab Light\Libraries directory
that come with the NoteTab Light program. Just keep the ACWT library files in it,
i.e., just keep !TK_Start.clb, 01_TextUtl.clb, 02_WdL_Conc.clb, 03_DiscTag.clb, 04_Trans.clb, and 05_Links.clb.

This way your ...\Libraries dir will not be cluttered and the desktop will be clean as well
when you run NoteTab Light (and ACWT). If you really want to use the system libraries
you can always put them back in.
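
For anyone who would rather script this cleanup than do it by hand, here is a minimal sketch. It assumes the six ACWT file names listed above and moves every other .clb file into a backup folder, so nothing is deleted and the system libraries can be put back later. The function name and the backup location are illustrative only, not part of ACWT or NoteTab.

```python
import shutil
from pathlib import Path

# The six ACWT library files named in the tip above.
ACWT_FILES = {
    "!TK_Start.clb", "01_TextUtl.clb", "02_WdL_Conc.clb",
    "03_DiscTag.clb", "04_Trans.clb", "05_Links.clb",
}


def park_other_libraries(libraries_dir, backup_dir):
    """Move every non-ACWT .clb file into backup_dir and return the moved names."""
    libraries = Path(libraries_dir)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    # Compare names case-insensitively, as Windows file names are.
    keep = {name.lower() for name in ACWT_FILES}
    moved = []
    for path in sorted(libraries.iterdir()):
        is_library = path.is_file() and path.suffix.lower() == ".clb"
        if is_library and path.name.lower() not in keep:
            shutil.move(str(path), str(backup / path.name))
            moved.append(path.name)
    return moved
```

Call it with the path to your own ...\Libraries directory and any folder outside it as the backup; to restore the system libraries, copy the files back.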
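May I ask the experts here why, when I process Chinese text, the word frequencies, the punctuation counts, and the Chinese character count all come out wrong? The character count in particular differs far too much from the figure MS Word gives. MS Word has a problem of its own, though: it counts punctuation marks as characters. So my plan was to subtract the punctuation count reported by ACWT from the MS Word count to get the number of Chinese characters alone. Unfortunately...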
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

Can you describe in more detail what your texts look like?
Thank you for your concern about my problem with ACWT. I find that ACWT has trouble identifying some, though not all, Chinese characters. To be specific, when I have segmented the Chinese text and then run Text Statistics on it (under the Tools menu), some Chinese characters are not recognised as Chinese characters at all. My pressing question is: how can I make ACWT identify these so-called unidentifiable Chinese characters, or how can I improve ACWT in some way, given that my computer skills are rather limited?
Your timely response will be highly appreciated. Thank you again!
I found that the tag-stripping function in ACWT doesn't work well on the stu6.txt file of the CELC. I ran ACWT over it 3 or 4 times, only to have it report 'out of memory' followed by a series of other errors. What's wrong with the function? Could anyone give me a hint on this? Thanks a lot!
A Screenshot to Illustrate My Point
What's wrong with my ACWT? The same error message appears when I try to strip the tags off an English .txt file as well, i.e. st6.txt from CELC.

[This post was edited by the author on 2005-11-01 at 01:35:40]
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

ACWT uses a regular expression to strip off the tags. For the error
not to appear, you need to do one or both of the following:
1) make your text short;
2) increase your system's memory (RAM).

If the problem persists, I suggest that you use the ICTCLAS tokenizer. It has the
option of segmenting your text without adding POS tags to the plain text.
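
As an illustration of point 1), here is a rough sketch of stripping tags one line at a time, so that only a single line is ever held in memory. This is not ACWT's own regular expression. The tag shape (a slash plus a short alphanumeric tag glued to each token, as in "word/n") and the gbk default encoding are both assumptions; adjust the pattern and the encoding to whatever your tagger actually writes.

```python
import re

# ASSUMED tag shape: "/" plus a short alphanumeric tag attached to a token,
# e.g. "word/n". Change this pattern to match your tagger's real output.
TAG_PATTERN = re.compile(r"/[A-Za-z][A-Za-z0-9]*(?=\s|$)")


def strip_tags_line(line):
    """Return one line of text with the assumed /tag suffixes removed."""
    return TAG_PATTERN.sub("", line)


def strip_tags_file(src, dst, encoding="gbk"):
    """Copy src to dst line by line, stripping tags as each line passes through."""
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(strip_tags_line(line))
```

Because the file is never loaded whole, memory use stays roughly constant however long the text is.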
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

What's wrong with my ACWT? The same error message appears when I try to strip the tags off an English .txt file as well, i.e. st6.txt from CELC.


What do you mean by an 'English text'? Is the English text tagged by NEUCSP?
NEUCSP is a Chinese tagger. The tag stripper is designed for this particular tagger
because its tag format is very specific.
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

Quoting yinghuang's post of 2005-10-28 14:10:15:
Thank you for your concern about my problem with ACWT. I find that ACWT has trouble identifying some, though not all, Chinese characters. To be specific, when I have segmented the Chinese text and then run Text Statistics on it (under the Tools menu),

Sounds like you are using the Tools clip that comes with NoteTab Light. It is not
part of ACWT. All ACWT clip libraries (except !TK_Start) are marked with a number:

01_TextUtl, 02_WdL_Conc, 03_DiscTag, 04_Trans, 05_Links

You should try the clips under 02_... or 03_...

(Also see an earlier post I did as a tip for using ACWT: delete all the system clip
libraries so that you don't get confused by them and ACWT files.)

And it's good that you have had your text segmented first.
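
If you want a count that is independent of both MS Word and NoteTab, a few lines of Python can serve as a cross-check. This is only a sketch: it treats a character as Chinese when it falls in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rare characters in the extension blocks are not counted, and it treats anything in a Unicode punctuation category as punctuation. The function name is made up for this example.

```python
import unicodedata


def count_chinese(text):
    """Count Chinese characters and punctuation marks separately."""
    han = 0
    punctuation = 0
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            # Basic CJK Unified Ideographs block only (an assumption).
            han += 1
        elif unicodedata.category(ch).startswith("P"):
            # All Unicode punctuation categories: Po, Ps, Pe, Pd, and so on.
            punctuation += 1
    return {"han": han, "punctuation": punctuation}


print(count_chinese("你好,世界。ACWT 2005!"))
# {'han': 4, 'punctuation': 3}
```

Letters, digits, and spaces fall into neither bucket, so the "han" figure is the punctuation-free character count the earlier question was after.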
Thanks a lot! I will try to solve it as you suggested. What I mean by 'English text' is the stu subcorpus from CELC. Yesterday I attempted a couple of times to strip the tags from the corpus, only to find myself down and out. The problem probably has something to do with the Light version of NoteTab, since you once said, as I remember, that ACWT cannot process a large file with ease.
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

Please read post #132 on page 14 of this thread.
I don't know why the text combination function in ACWT doesn't work when asked to process the 911 report downloaded from our site, even though I selected only .txt files in the folder. What's wrong with the files? The function has been pretty efficient at combining files, yet this time it fell short of my expectations. Sigh!
I see! The document combination function doesn't work well on Unicode big-endian files. It is probably the Achilles' heel of ACWT. Am I right, Mr. 动态语法?
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

You are right. NoteTab Light is not working well with Unicode documents,
and I'm not sure if the Pro version does.
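
One possible workaround, sketched here rather than tested against ACWT itself, is to convert the Unicode files to a single-byte encoding before combining them. The sketch assumes each file starts with a byte-order mark, which is what lets Python's "utf-16" codec tell big-endian from little-endian, and it assumes cp1252 is a suitable target for English text. The function name is made up for this example.

```python
from pathlib import Path


def convert_utf16_file(src, dst, target_encoding="cp1252"):
    """Re-encode a UTF-16 file (with BOM) into target_encoding."""
    # The "utf-16" codec reads the byte-order mark and drops it, so
    # big-endian and little-endian files are both handled.
    text = Path(src).read_bytes().decode("utf-16")
    # Characters the target encoding cannot represent become "?".
    Path(dst).write_bytes(text.encode(target_encoding, errors="replace"))
```

Working on raw bytes leaves the line endings exactly as they were. For Chinese text a different target such as gbk would be needed.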
I have the NoteTab Pro version. Leave me a message if any of you would like me to test this Unicode problem with it.