Question: tokens

xujiajin

Administrator
Staff member
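If your stoplist covers punctuation, numbers, and the like, then for English the number of tokens should simply be the number of words, and for Chinese it is the number of words as well (provided your corpus has been word-segmented, i.e. tokenized).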
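To make the point concrete, here is a minimal Python sketch. The function name `count_tokens` and the filtering rule (drop anything that is pure punctuation or pure digits) are my own assumptions for illustration, not the behaviour of any particular corpus tool.

```python
import re

def count_tokens(text):
    """Count tokens, treating punctuation and numbers as stoplisted (assumed rule)."""
    # Split into runs of word characters or single punctuation marks.
    items = re.findall(r"\w+|[^\w\s]", text)
    # Keep only items that start with a word character and are not pure digits.
    kept = [t for t in items if re.match(r"\w", t) and not t.isdigit()]
    return len(kept)

english = "The cat sat on the mat in 2006."
segmented_chinese = "我 喜欢 语料库 语言学 。"    # already word-segmented
unsegmented_chinese = "我喜欢语料库语言学。"      # no segmentation

print(count_tokens(english))
print(count_tokens(segmented_chinese))
print(count_tokens(unsegmented_chinese))
```

With this rule the English sentence gives 7 tokens and the segmented Chinese sentence gives 4, while the unsegmented Chinese sentence collapses into a single token, which is why the segmentation step matters.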
 

asan82

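Senior member
I didn't do this myself. I came across the term in an article and wanted to understand it, so I'm asking everyone here.
Thanks for your answer!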
 

cncorpus

Regular member
Please take a look at Nation's book chapter below about its definition:

http://assets.cambridge.org/052180/0927/sample/0521800927ws.pdf
 

csli

Junior member
Perhaps this passage from the WordSmith Tools 4 Help will be useful to you:
"If a text is 1,000 words long, it is said to have 1,000 "tokens". But a lot of these words will be repeated, and there may be only say 400 different words in the text. "Types", therefore, are the different words."


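In other words, punctuation is not included. Interestingly, under the default settings numbers are not counted as words, whereas under custom settings numbers can be counted as words.
So this is not the same concept as the "character count" we usually talk about.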
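A rough Python sketch of the token/type distinction follows. It is not WordSmith's actual implementation; the function `tokens_and_types` and its `count_numbers` flag are hypothetical names that only mimic the default versus custom treatment of numbers described above.

```python
import re

def tokens_and_types(text, count_numbers=False):
    """Return (tokens, types) for English text; punctuation is never counted."""
    # Lowercase so that "The" and "the" are the same type (an assumption).
    words = re.findall(r"[a-z]+(?:'[a-z]+)?|\d+", text.lower())
    if not count_numbers:
        # Mimic the described default: numbers are not counted as words.
        words = [w for w in words if not w.isdigit()]
    return len(words), len(set(words))

sample = "The cat saw the dog in 2006, and the dog saw the cat."
print(tokens_and_types(sample))
print(tokens_and_types(sample, count_numbers=True))
```

For this sample sentence the first call gives 12 tokens and 6 types, and the second gives 13 tokens and 7 types, because "2006" is then counted as a word.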
 

asan82

Senior member
Re: Question: tokens

Quoting csli's post of 2006-1-21 23:37:19:
Perhaps this passage from the WordSmith Tools 4 Help will be useful to you:
"If a text is 1,000 words long, it is said to have 1,000 "tokens". But a lot of these words will be repeated, and there may be only say 400 different words in the text. "Types", therefore, are the different words."


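In other words, punctuation is not included. Interestingly, under the default settings numbers are not counted as words, whereas under custom settings numbers can be counted as words.
So this is not the same concept as the "character count" we usually talk about.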
I see.
Thank you.
 