"CLEC (Chinese Learner English Corpus)" sub-forum now open

xujiajin

Administrator
Staff member
#1
CLEC homepage
http://www.clal.org.cn/baseinfo/achievement/Achievement1.htm
CLEC online search
http://www.clal.org.cn/corpus
http://www.clal.org.cn/corpus/EngSearchEngine.aspx

We set up a sub-forum under Learner Corpora to invite more information and discussion on CLEC. You are very welcome to post CLEC-related research information and problems on this forum.
 
#3
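Congratulations from me too on the opening of the sub-forum! I have a rather naive question. The book contains many misspelled word forms. How can we make full use of them? Forgive me, I have been thinking about this for a while but have not come up with a good idea. Should we build a word list from these forms? I would be grateful for advice from the more experienced members. Thanks!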
 

xianx

Junior member
#5
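A humble question. Take "in addition to" in CLEC, for example: No. of words 1070602, No. of hits 18, frequency per million words 16.81. For "addition" there is no doubt: No. of words 1070602, No. of hits 173, frequency 161.59.

"in addition to" is not a single word, yet the total of 1070602 is counted in individual words. Is a frequency calculated this way correct?

Also, the word frequency counter can only list single words; it cannot list chunks or phrases.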

[This post was edited by the author on 30 August 2005 at 18:57:43]
 

tiger

Senior member
#6
Re: "CLEC (Chinese Learner English Corpus)" sub-forum now open

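Quoting xianx, 2005-8-30 15:58:57:
A humble question. Take "in addition to" in CLEC, for example: No. of words 1070602, No. of hits 18, frequency per million words 16.81.
"in addition to" is not a single word, yet the total of 1070602 counts individual words. Is a frequency calculated this way correct? For "addition" there is no doubt: No. of words 1070602, No. of hits 173, frequency 161.59.
Also, the word frequency counter can only list single words; it cannot list chunks or phrases.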
Can you reword your first question to make it clearer to me?
As to the second, lists of word chunks or clusters can only be extracted and sorted with special programs, for example, the cluster function of WordSmith Tools.
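To give a rough idea of what a cluster list involves, here is a minimal Python sketch that counts contiguous three-word sequences. It is only an illustration, not how WordSmith Tools works internally; the `tokenize` and `count_clusters` helpers and the sample sentences are made up for this post.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def count_clusters(tokens, n):
    """Count every contiguous run of n tokens (an n-word cluster)."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Hypothetical sample text, not taken from CLEC
sample = (
    "In addition to English, she studies French. "
    "In addition to French, he studies German."
)
tokens = tokenize(sample)
clusters = count_clusters(tokens, 3)

# List the most frequent three-word clusters first
for cluster, freq in clusters.most_common(3):
    print(cluster, freq)
```

A real tool would also let you set a minimum frequency and would usually stop clusters at sentence boundaries, which this sketch does not do.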
 

xiaoz

Permanent super administrator
Staff member
#7
It all depends upon the tagger used. For example, the UCREL CLAWS tagger tags "in addition to" as one token, as in the BNC. But most POS taggers annotate the words in phrases like this separately.
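To see how that choice changes the token count, here is a minimal Python sketch that joins a listed multiword unit into a single token. It does not reproduce what CLAWS or any real tagger does; `merge_multiword_units` and the unit list are hypothetical names used only for illustration.

```python
def merge_multiword_units(tokens, units):
    """Join each listed multiword unit into one token (greedy, left to right)."""
    # Try longer units first so "in addition to" wins over a shorter overlap
    ordered = sorted(
        (u.split() for u in units if u.split()), key=len, reverse=True
    )
    merged, i = [], 0
    while i < len(tokens):
        for unit in ordered:
            if tokens[i:i + len(unit)] == unit:
                merged.append("_".join(unit))
                i += len(unit)
                break
        else:
            # No unit starts here, so keep the word as its own token
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical example sentence
words = "in addition to english she studies french".split()
merged = merge_multiword_units(words, ["in addition to"])

print(len(words), words)
print(len(merged), merged)
```

The same text yields a smaller token total once the phrase is merged, so the corpus size used for normalisation depends on the tokenisation.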
 

xianx

初级会员
#8
Then if it's an unannotated corpus, the frequency cannot be counted in that way?
How about the taggers used in CLEC? Many thanks.
 

xiaoz

Permanent super administrator
Staff member
#9
CLEC is not POS-tagged, only error-tagged. For an unannotated corpus, and indeed for a POS-tagged corpus, the easiest and most natural way to indicate the size of a corpus is by counting the tokens in it. The conventional practice, therefore, is to normalise frequencies on the basis of words instead of phrases or sentences (for a parallel corpus perhaps people are also interested in aligned sentence pairs).

In your case, I do not see a problem in normalising the frequency of "in addition to" on a per-word basis. However, if you are comparing corpora or samples of different sizes, the common basis for normalisation should be reasonable. For an explanation see my posting "Making statistical claims" at

http://www.corpus4u.com/upload/forum/2005052307351613.pdf
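For reference, the per-million figures quoted earlier in this thread can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the `per_million` helper is a name made up here, and the counts are the ones reported in the thread.

```python
def per_million(hits, corpus_size):
    """Normalise a raw hit count to a frequency per million words."""
    return hits * 1_000_000 / corpus_size

# Corpus size and hit counts as reported earlier in this thread
CLEC_WORDS = 1_070_602

phrase_pm = per_million(18, CLEC_WORDS)   # "in addition to"
word_pm = per_million(173, CLEC_WORDS)    # "addition"

print(round(phrase_pm, 2))  # 16.81
print(round(word_pm, 2))    # 161.59
```

The phrase and the single word are both divided by the same word total, which is what makes the two figures directly comparable.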
 