Question: which figures are needed to compute chi-square for a colligation?

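Could the experts here please help: I want to compare two corpora and test whether a given colligation is significantly overused or underused.
For example, the colligation "V+N" occurs 184 times in CLEC and 138 times in BNC. How do I use SPSS to calculate χ² and the p-value? Exactly which numbers do I enter into SPSS?
Sorry to trouble everyone; I urgently need a reply.
Thanks!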
 
Re: Question: which figures are needed to compute chi-square for a colligation?

Sorry, the most I can offer is the account of how keyness is calculated, taken from the WordSmith 5.0 manual.
The following is by Mike Scott.
.............................................................
How Key Words are Calculated

The "key words" are calculated by comparing the frequency of each word in the word-list of the text you're interested in with the frequency of the same word in the reference word-list. All words which appear in the smaller list are considered, unless they are in a stop list.

If "the" occurs, say, 5% of the time in the small word-list and 6% of the time in the reference corpus, it will not turn out to be "key", though it may well be the most frequent word. If the text concerns the anatomy of spiders, it may well turn out that the names of the researchers, and the items spider, leg, eight, etc. may be more frequent than they would otherwise be in your reference corpus (unless your reference corpus only concerns spiders!).

To compute the "key-ness" of an item, the program therefore computes

its frequency in the small word-list
the number of running words in the small word-list
its frequency in the reference corpus
the number of running words in the reference corpus

and cross-tabulates these.

Statistical tests include:

the classic chi-square test of significance with Yates correction for a 2 X 2 table

Ted Dunning's Log Likelihood test, which gives a better estimate of keyness, especially when contrasting long texts or a whole genre against your reference corpus.
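
To make the cross-tabulation concrete, here is a minimal Python sketch of both tests computed from the four numbers listed above. It uses the textbook formulas for a 2 x 2 table and is not taken from WordSmith itself, so small differences from the program's output are possible. The function names `keyness` and `p_value_df1` are made up for this illustration.

```python
import math

def keyness(freq_study, total_study, freq_ref, total_ref):
    """Return (chi-square with Yates correction, log-likelihood G2).

    The 2 x 2 table is:
                     study corpus   reference corpus
        the item          a                b
        everything else   c                d
    """
    a, b = freq_study, freq_ref
    c, d = total_study - a, total_ref - b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 0.0  # degenerate table: an empty row or column

    # Chi-square with Yates continuity correction.
    chi2 = n * max(abs(a * d - b * c) - n / 2, 0.0) ** 2 / denom

    # Log-likelihood: G2 = 2 * sum(O * ln(O / E)) over the four cells.
    cells = [
        (a, (a + b) * (a + c) / n),
        (b, (a + b) * (b + d) / n),
        (c, (c + d) * (a + c) / n),
        (d, (c + d) * (b + d) / n),
    ]
    g2 = 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)
    return chi2, g2

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))
```

Both statistics are compared against the chi-square distribution with one degree of freedom, so `p_value_df1` can be applied to either result.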
 
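Re: Question: which figures are needed to compute chi-square for a colligation?

Wouldn't SPSS be too much work?

Surely you don't intend to count the tokens of every V+N phrase one by one?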
 
Re: Question: which figures are needed to compute chi-square for a colligation?

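For a colligation, I don't understand which raw data I need to enter into SPSS. Is the token count you mention what is used to compare colligations? Thanks.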
 
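Re: Question: which figures are needed to compute chi-square for a colligation?

Hello. Based on the information you gave and my understanding of chi-square,
this is how I would do it: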

To compute the "key-ness" of an item, the program therefore computes

its frequency in the small word-list (184)
the number of running words in the small word-list (total tokens in CLEC)
its frequency in the reference corpus (138)
the number of running words in the reference corpus (total tokens in BNC)
and cross-tabulates these.

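But I don't understand why the BNC figure is the smaller one; perhaps you need to swap the figures for the two corpora.
As for the p-value, I think the significance threshold can be set as you prefer.

My knowledge is limited, so I can't guarantee this advice is correct.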
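
As a rough illustration of what would go into SPSS, the sketch below lays the four numbers out as four weighted rows, one per cell of the 2 x 2 table. This is the layout the Crosstabs procedure expects once cases are weighted by the count column. The two totals are HYPOTHETICAL placeholders, since the thread never states the real denominators; replace them with the actual token totals (or colligation totals) before drawing any conclusion. The helper names are invented for this example, and the chi-square here is the uncorrected Pearson version.

```python
import math

def crosstab_rows(freq_a, total_a, freq_b, total_b, labels=("CLEC", "BNC")):
    """Four weighted rows (corpus, category, count), one per table cell."""
    return [
        (labels[0], "V+N", freq_a),
        (labels[0], "other", total_a - freq_a),
        (labels[1], "V+N", freq_b),
        (labels[1], "other", total_b - freq_b),
    ]

def pearson_chi2(rows):
    """Pearson chi-square (no continuity correction) and its p-value, df = 1."""
    (_, _, a), (_, _, c), (_, _, b), (_, _, d) = rows
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, df = 1
    return chi2, p

# HYPOTHETICAL denominators, NOT from the thread: substitute the real totals.
TOTAL_CLEC = 1000
TOTAL_BNC = 1000

rows = crosstab_rows(184, TOTAL_CLEC, 138, TOTAL_BNC)
chi2, p = pearson_chi2(rows)
for row in rows:
    print(row)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

The result depends entirely on the denominators chosen, which is why the choice between total tokens and colligation totals matters.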
 
Re: Question: which figures are needed to compute chi-square for a colligation?

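Let me add one more detail: 184 and 138 are figures I obtained after sampling. So in the calculation, should the "token" figure you mention be the total size of the corpus, or the total number of all colligations of the sampled word?
Thank you!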
 
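Re: Question: which figures are needed to compute chi-square for a colligation?

"So in the calculation, should the 'token' figure you mention be the total size of the corpus, or the total number of all colligations of the sampled word?"

That is exactly why I hesitated to answer when I first saw your question; the same issue had occurred to me. I think you can still use the total corpus size as the basis for the calculation. At least that gives the two corpora a consistent basis for comparison. Besides, working from part-of-speech categories and testing whether words of different classes are significantly more or less frequent than one another gets very complicated: words like "the" and "a" are inherently frequent, so how would you handle them? It is harder still to talk about whole categories being relatively more or less frequent. Comparing against plain word tokens is admittedly imprecise, but at least it yields comparable figures.

"184 and 138 are figures I obtained after sampling"
In that case you need something like normalization, bringing the denominators to the same level.
For example, if 184 means 184 occurrences per 10,000, then the total token count used in the calculation should be 10,000.
Or scale up the numerator: if the corpus total is 1,000,000 tokens, multiply 184 proportionally by 100 before calculating.

This is rather awkward; we still need guidance from the experts on this forum.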
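
The normalization described above amounts to a single proportional rescaling. A minimal sketch, with helper names made up for illustration:

```python
def scale_count(count, sample_size, target_size):
    """Rescale a raw count from a sample of sample_size tokens to target_size tokens."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return count * target_size / sample_size

def per_n(count, total_tokens, base=10_000):
    """Normalized frequency: occurrences per `base` tokens."""
    return scale_count(count, total_tokens, base)

# The example from the reply: 184 per 10,000 tokens, scaled to 1,000,000 tokens.
print(scale_count(184, 10_000, 1_000_000))  # 18400.0
```

One caution: the chi-square statistic grows with the absolute size of the counts, so a test run on scaled-up counts will not give the same result as one run on the raw sample counts with the sample size as denominator.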
 
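Re: Question: which figures are needed to compute chi-square for a colligation?

The issue Jeremy raised is crucial.

In theory, when comparing frequency differences for a colligation, the base should not be the total number of word tokens but the frequency of the corresponding colligation/PoS category.

In practice, some studies do use total word tokens as the reference for comparison.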
 