Help with the chi-squared test

zwq763 · 2012-06-05

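I am comparing the use of a particular word in MICASE, mainly to see whether its use differs across age groups. Age is divided into three groups, and the frequencies per thousand words for the three groups come out as 4.58, 6.25, and 7.1. Can I use a chi-squared test to check whether the differences between them are significant? Also, my approach requires assigning the speakers in each of the corpus's 152 texts to an age group one by one and then counting the total number of words spoken by each age group, which is very tedious. Is there another way to do this?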

qhdjason · 2012-06-07

Re: Help with the chi-squared test

For your first question: yes, you can use a chi-squared test.

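You may want to organize your data in the following way, using raw counts (not per-thousand rates) in every cell:

AgeGroup  WordFrequency  OtherWordsFrequency
ag1       a              b
ag2       c              d
ag3       e              f

Here a, c, and e are the number of times the word occurs in each age group, and b, d, and f are the counts of all other words in that group. This is a 3 x 2 contingency table, and you can feed it into your favorite statistical package to run a chi-squared test.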
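As a rough sketch of how a to f could be filled in: the word totals per age group below are made-up placeholders (the real ones would come from the MICASE speaker metadata), chosen only so that the hit counts reproduce the 4.58, 6.25, and 7.1 per thousand quoted in the question. The names `words`, `hits`, and `tab` are illustrative.

```r
# Hypothetical word totals per age group (NOT real MICASE figures)
words <- c(ag1 = 600000, ag2 = 800000, ag3 = 400000)
# Hypothetical raw hits of the target word, chosen to match the quoted rates
hits <- c(ag1 = 2748, ag2 = 5000, ag3 = 2840)

# Column 1: occurrences of the word; column 2: all other words
tab <- cbind(WordFrequency = hits, OtherWordsFrequency = words - hits)
print(tab)

# Sanity check: per-thousand rates recovered from the raw counts
per_thousand <- 1000 * tab[, "WordFrequency"] / rowSums(tab)
print(round(per_thousand, 2))
```

The point of going back to raw counts is that the test needs to know how much text stands behind each rate; the same 4.58 per thousand means something different in 10,000 words than in 600,000.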

The null hypothesis of this test is:
The distribution of the specific word is independent of age group
or, formulated mathematically,

P(i, j) = P(i)P(j)

Here, i = 1, 2, 3 indexes the rows (age groups) and j = 1, 2 indexes the columns (the word vs. all other words).
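To make the independence hypothesis concrete: under P(i, j) = P(i)P(j), the expected count in each cell is (row total x column total) / grand total. A small sketch with invented counts (illustration only, not MICASE data), checked against the expected counts that `chisq.test` itself reports:

```r
# Invented 3 x 2 table of raw counts (illustration only)
tab <- matrix(c(30, 970,
                45, 955,
                60, 940),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("ag1", "ag2", "ag3"), c("word", "other")))

n <- sum(tab)
p_row <- rowSums(tab) / n   # estimated P(i)
p_col <- colSums(tab) / n   # estimated P(j)

# Expected counts under independence: n * P(i) * P(j)
expected <- n * outer(p_row, p_col)
print(expected)

# chisq.test stores the same expected counts in its result
fit <- chisq.test(tab)
print(all.equal(expected, fit$expected, check.attributes = FALSE))
```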

To interpret the chi-squared test better, you can compute the residuals for each cell of the table.

Take my favorite statistical package, R, as an example.
First, enter the data like this:
data <- matrix(c(a, b, c, d, e, f), nrow = 3, byrow = TRUE)
Then run the chi-squared test:
fit <- chisq.test(data)
Finally, get the standardized residual for each cell:
fit$stdres

If the standardized residual for a cell is greater than 1.96 or less than -1.96, that cell deviates from what independence predicts at roughly the 5% significance level, so it is one of the cells driving a significant result.
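Putting those three steps together, here is a minimal runnable sketch. The counts are invented for illustration, and the `dimnames` and the `flagged` variable are additions that were not part of the original snippet.

```r
# Invented raw counts (illustration only): word hits vs. all other words
data <- matrix(c(30, 970,
                 45, 955,
                 60, 940),
               nrow = 3, byrow = TRUE,
               dimnames = list(AgeGroup = c("ag1", "ag2", "ag3"),
                               Type = c("word", "other")))

fit <- chisq.test(data)
print(fit)          # test statistic, degrees of freedom (2 here), p-value
print(fit$stdres)   # standardized residual for each cell

# Cells whose standardized residual lies outside +/- 1.96
flagged <- abs(fit$stdres) > 1.96
print(flagged)
```

In a table with only two columns, the two residuals in a row have the same size and opposite signs, so reading the word column alone is enough.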

lisang · 2012-06-10

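Re: Help with the chi-squared test

It seems you could also try a proportion test (when the frequencies are large, the p-value from chisq.test often comes out significant):
> prop.test()
and then compare the confidence intervals.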
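A hedged sketch of what that might look like, again with invented counts. One thing to be aware of: `prop.test` only returns a confidence interval when given one or two groups, so with three age groups the intervals have to come from pairwise calls (the choice of ag1 vs. ag3 below is arbitrary).

```r
# Invented counts (illustration only)
hits  <- c(ag1 = 30, ag2 = 45, ag3 = 60)         # occurrences of the word
words <- c(ag1 = 1000, ag2 = 1000, ag3 = 1000)   # total words per age group

# Overall test that the three proportions are equal
overall <- prop.test(hits, words)
print(overall)

# Confidence interval for the difference in proportions, ag1 vs. ag3
pair <- prop.test(hits[c("ag1", "ag3")], words[c("ag1", "ag3")])
print(pair$conf.int)

# The overall statistic matches chisq.test on the hits / non-hits table
chi <- chisq.test(cbind(hits, words - hits))
print(all.equal(unname(overall$statistic), unname(chi$statistic)))
```

If several pairs are compared this way, the p-values should be adjusted for multiple comparisons; `pairwise.prop.test` in base R does that.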

zwq763 · 2012-06-10

Re: Help with the chi-squared test

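First of all, thank you very much for the explanation, but there is one thing I don't understand. I only want to compare one specific word, so what is the other-word frequency for here? Can't I just compare the three numbers directly?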

qhdjason · 2012-06-10

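Re: Help with the chi-squared test

Sure, you can compare them directly: draw a bar chart. That is descriptive statistics.

The chi-squared test is inferential statistics. The first thing to understand is that it compares differences in proportions.
Say a word occurs 100 times in corpus A, which has 10,000 words, and 90 times in corpus B, which has 8,000 words, and you want to know whether the word's share differs between the two corpora.
In chi-squared tests the proportion is usually expressed as odds. For example, the odds of the word occurring in corpus A are
(100/10,000) / (9,900/10,000) = 1/99
meaning that in this sample a token is 99 times as likely not to be the word as to be it.
In corpus B the odds are 90/7,910.

The chi-squared test infers from the two sample odds whether the odds are the same in the population.

Because odds are what is being compared, you need the other-word frequency.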
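The two-corpus example can be checked in R along these lines. The counts are the ones given above; the variable names and the odds-ratio step are added for illustration.

```r
hits  <- c(A = 100, B = 90)       # occurrences of the word
words <- c(A = 10000, B = 8000)   # corpus sizes
other <- words - hits             # 9900 and 7910 other words

# Odds of the word in each corpus: 100/9900 (= 1/99) and 90/7910
odds <- hits / other
print(odds)

# Odds ratio of corpus B relative to corpus A
odds_ratio <- odds[["B"]] / odds[["A"]]
print(odds_ratio)

# 2 x 2 table for the test; chisq.test applies Yates' continuity
# correction to 2 x 2 tables by default (correct = TRUE)
tab <- cbind(word = hits, other = other)
fit <- chisq.test(tab)
print(fit)
```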
