Another question about word frequency counts

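Why are the results different from those in the book when I use the word list function in AntConc to count word frequencies? For example, the frequency list on the CD that accompanies the Chinese Learner English Corpus (《中国学习者英语语料库》) shows token=1070602, while AntConc reports token=1172732.
Also, how should I set it up so that words like I’m and she's are displayed whole, rather than split into I, m and she, s?
Thanks!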
 

清风出袖

Senior member
Re: Another question about word frequency counts

You can adjust the relevant options under Token Setting in Global Setting.
 
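Re: Another question about word frequency counts

How exactly should it be set? I have tried everything I could think of and it still doesn't work: what shows up is still s, m and so on.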
 

laohong

Administrator
Staff member
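Re: Another question about word frequency counts

...Also, how should I set it up so that words like I’m and she's are displayed whole, rather than split into I, m and she, s? Thanks!
They are two words to begin with, so why insist on keeping them together?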
 
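Re: Another question about word frequency counts

Hmm, fair point.
But shouldn't there still be some kind of restriction? Otherwise AntConc counts the sp and wd in </sp> or [wd-t] as words.
How should that be set?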
 

清风出袖

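Senior member
Re: Another question about word frequency counts

Here is how I set it, and I have checked that it finds I've: in Token (Word) Definition, tick Letter, plus Other under Punctuation. Give it a try.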
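For anyone who wants to see the effect of that setting outside AntConc, here is a minimal Python sketch. It only illustrates the idea and is not AntConc's internal rule: the two regular expressions and the `word_list` helper are assumptions made up for this example. Note that the original question mixes a curly apostrophe (I’m) with a straight one (she's), so the pattern has to accept both.

```python
import re
from collections import Counter

# Assumption: a token is a run of ASCII letters. The second pattern also lets
# a straight (') or curly (’) apostrophe join two runs of letters.
# Neither pattern is claimed to reproduce AntConc's own token definition.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")
WITH_APOSTROPHE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def word_list(text, pattern):
    """Return a frequency Counter of lower-cased tokens matched by pattern."""
    return Counter(m.group(0).lower() for m in pattern.finditer(text))

# Made-up sample sentence, not taken from any corpus.
sample = "I’m sure she's right, and I've said so."

split_list = word_list(sample, LETTERS_ONLY)     # contains "m", "s", "ve"
whole_list = word_list(sample, WITH_APOSTROPHE)  # contains "i’m", "she's", "i've"

print(sum(split_list.values()), sum(whole_list.values()))
```

The same sentence yields a different total depending on the definition, which is the kind of gap discussed in this thread.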
 
Re: Another question about word frequency counts

Thanks a lot!
But the number of tokens I have counted in COLSEC so far (753169) is still much higher than the figure given in the book by Yang Huizhong (杨惠中) and Wei Naixing (卫乃兴) (723299).
I have excluded the <> and [] tags, and I don't know whether there are any other symbols I should exclude.
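As a rough cross-check of how much the tags contribute, the markup can be stripped before counting. The sketch below assumes tags look like `<...>` or `[...]` with no nesting, as in the `</sp>` and `[wd-t]` examples mentioned earlier in this thread. The `count_tokens` helper and the sample line are hypothetical, and the word pattern is only one possible token definition.

```python
import re

# Assumption: markup is either <...> or [...] and is never nested.
TAG = re.compile(r"<[^<>]*>|\[[^\[\]]*\]")
# Assumption: a word is a run of letters, optionally joined by apostrophes.
WORD = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def count_tokens(text, strip_tags=True):
    """Count word tokens, optionally replacing markup with a space first."""
    if strip_tags:
        text = TAG.sub(" ", text)
    return len(WORD.findall(text))

# Made-up line in the style of a tagged learner corpus.
sample = "<sp>I very like [wd-t] it.</sp>"

print(count_tokens(sample, strip_tags=False))  # tag names counted as words
print(count_tokens(sample))                    # tags removed before counting
```

If a count is still too high after the tags are removed, the remaining difference is more likely to come from the token definition itself (apostrophes, hyphens, digits) than from leftover symbols.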
 

清风出袖

Senior member
Re: Another question about word frequency counts

Perhaps there is a difference in the definition of a token, or something like that. If you get an answer, please post it here. Thanks.
 

xiaoz

Permanent super administrator
Staff member
Re: Another question about word frequency counts

Unsurprising results, as different programs may use different counting algorithms. They have different settings as to whether special characters are allowed in words. For example, are the following one word or two: I'll, can't, gonna, so-called?
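To make this concrete, here is a small Python sketch that counts those four items under three token definitions. The definitions are hypothetical and chosen only for illustration. None of them is claimed to match AntConc or the program used for the published figures.

```python
import re

# Three hypothetical token definitions, from strictest to most permissive.
DEFINITIONS = {
    "letters only": r"[A-Za-z]+",
    "letters + apostrophe": r"[A-Za-z]+(?:'[A-Za-z]+)*",
    "letters + apostrophe + hyphen": r"[A-Za-z]+(?:['-][A-Za-z]+)*",
}

def token_counts(text):
    """Map each definition name to the number of tokens it finds in text."""
    return {name: len(re.findall(rx, text)) for name, rx in DEFINITIONS.items()}

counts = token_counts("I'll, can't, gonna, so-called")
for name, n in counts.items():
    print(f"{name}: {n}")
```

The four items come out as 7, 5 and 4 tokens respectively. Over a corpus of several hundred thousand words, differences of this kind can plausibly add up to a gap of tens of thousands of tokens.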
 