Help with the chi-squared test

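I am comparing the use of one particular word in MICASE, mainly to see whether its use differs across age groups. Age is divided into three groups, and the frequencies I calculated per thousand words for the three groups are 4.58, 6.25, and 7.1. Can I use a chi-squared test to check whether the differences between them are significant? Also, my method requires assigning each speaker in the corpus's 152 texts to an age group one by one and then totaling the number of words spoken by each age group, which is very tedious. Is there another way to do this?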
 
Re: Help with the chi-squared test

For your first question: yes, you can use a chi-squared test.

You may want to organize your data as follows, using raw counts rather than per-thousand rates.

AgeGroup   WordFrequency   OtherWordsFrequency
ag1        a               b
ag2        c               d
ag3        e               f

This is a 3 x 2 contingency table. You can feed these data into your favorite statistical package and run a chi-squared test.

The null hypothesis of this test is:
The occurrence of the specific word is independent of age group
or, formulated mathematically,

P(i, j) = P(i)P(j)

Here i = 1, 2, 3 indexes the rows (age groups) and j = 1, 2 indexes the columns (the word vs. all other words).
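To make the independence formula concrete, here is a minimal sketch in R of how the expected counts and the test statistic follow from P(i, j) = P(i)P(j). The counts are invented for illustration only (they are not from MICASE); they were chosen so that the three rates come out at 4.58, 6.25, and 7.1 per thousand words.

```r
# Hypothetical raw counts, invented for illustration only.
# Columns: occurrences of the target word, all other word tokens.
counts <- matrix(c(2290, 497710,
                   2500, 397500,
                   2130, 297870),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(AgeGroup = c("ag1", "ag2", "ag3"),
                                 Token = c("word", "other")))

n <- sum(counts)
p_row <- rowSums(counts) / n   # P(i): share of all tokens spoken by each age group
p_col <- colSums(counts) / n   # P(j): share of tokens that are / are not the word

# Under independence P(i, j) = P(i)P(j), so the expected count is n * P(i) * P(j)
expected <- n * outer(p_row, p_col)

# Pearson's statistic sums the scaled squared gaps between observed and expected
x2 <- sum((counts - expected)^2 / expected)
```

The values of expected and x2 should agree with what chisq.test(counts) reports in its expected and statistic components.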

To interpret the chi-squared test better, you can compute the residuals for each cell of the table.

Take R, my favorite statistical package, as an example.
First, enter the data (replacing a to f with your raw counts):
data <- matrix(c(a,b,c,d,e,f),nrow=3,byrow=TRUE)
Then run the chi-squared test:
fit <- chisq.test(data)
Finally, get the standardized residual for each cell:
fit$stdres

If any standardized residual is greater than 1.96 or less than -1.96, that cell deviates from what the null hypothesis predicts at roughly the 5% significance level.
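The steps above can be run end to end with made-up numbers. In this sketch the group totals (500,000, 400,000, and 300,000 words) are assumptions chosen only so that the rates match the 4.58, 6.25, and 7.1 per thousand from the question; the real totals have to come from the corpus.

```r
# Hypothetical raw counts (NOT real MICASE figures).
# Row totals are assumed to be 500,000 / 400,000 / 300,000 words.
counts <- matrix(c(2290, 497710,
                   2500, 397500,
                   2130, 297870),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(AgeGroup = c("ag1", "ag2", "ag3"),
                                 Token = c("word", "other")))

# Sanity check: recover the per-thousand rates from the raw counts
rates <- 1000 * counts[, "word"] / rowSums(counts)

# Chi-squared test of independence on the 3 x 2 table
fit <- chisq.test(counts)
print(fit)

# Standardized residuals, and which cells pass the +/-1.96 rule of thumb
print(fit$stdres)
flagged <- abs(fit$stdres) > 1.96
print(flagged)
```

In a two-column table the two residuals in each row are equal in size and opposite in sign, so it is enough to read the "word" column.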
 
Re: Help with the chi-squared test

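It looks like you could also try a proportion test (when the frequencies are large, the p-value from chisq.test will often come out significant even for small differences)
>prop.test()
and then compare the confidence intervals.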
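A minimal sketch of that suggestion, again with invented counts and group totals (assumptions, not MICASE figures). With three groups, prop.test() returns an overall test but no confidence interval, so this sketch calls it once per group to get an interval for each rate; that per-group step is one possible reading of "compare the confidence intervals".

```r
# Hypothetical counts: occurrences of the word (x) and total words (n) per group
x <- c(ag1 = 2290, ag2 = 2500, ag3 = 2130)
n <- c(ag1 = 500000, ag2 = 400000, ag3 = 300000)

# Overall test that the three proportions are equal
overall <- prop.test(x, n)
print(overall)

# One-sample prop.test per group, keeping only the 95% confidence interval
ci <- t(sapply(names(x), function(g) prop.test(x[[g]], n[[g]])$conf.int))
colnames(ci) <- c("lower", "upper")

# Express the intervals per thousand words so they are comparable to the rates
ci_per_thousand <- 1000 * ci
print(ci_per_thousand)
```

Intervals that do not overlap suggest the groups differ, and their width shows how precisely each rate is estimated, which a p-value alone does not.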

 
Re: Help with the chi-squared test

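First of all, thank you very much for the explanation, but one thing puzzles me: I only want to compare one specific word, so what is the other-word frequency for here? Can't I just compare those three numbers directly?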
 
Re: Help with the chi-squared test

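Sure, you can compare them directly: draw a bar chart. That is descriptive statistics.

The chi-squared test is inferential statistics. The first thing to understand is that it compares differences in proportions.
Suppose a word occurs 100 times in corpus A, which has 10,000 words, and 90 times in corpus B, which has 8,000 words, and you want to know whether the word's share of the text differs between the two corpora.
In a chi-squared test the proportion is usually expressed as odds. For example, the odds of the word occurring in corpus A are
(100/10,000) / (9900/10,000) = 1/99
meaning that in this sample a token is 99 times as likely not to be the word as to be the word.
In corpus B the odds are 90/7910.

The chi-squared test uses the two odds obtained from the samples to infer whether the odds are the same in the population.

Because the test works with odds, you need the other-word frequency.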
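The two-corpus example can be checked in R with the numbers given above. This is only a sketch: the object names are my own, and correct = FALSE is a choice made here to turn off the continuity correction that chisq.test applies to 2 x 2 tables by default.

```r
# Counts from the example: corpus A has 100 hits in 10,000 words,
# corpus B has 90 hits in 8,000 words; "other" = total minus hits
tab <- matrix(c(100, 9900,
                 90, 7910),
              nrow = 2, byrow = TRUE,
              dimnames = list(Corpus = c("A", "B"),
                              Token = c("word", "other")))

# Odds of a token being the word in each corpus: 1/99 for A, 90/7910 for B
odds <- tab[, "word"] / tab[, "other"]
print(odds)

# Odds ratio: how the odds in B compare with the odds in A
odds_ratio <- odds[["B"]] / odds[["A"]]
print(odds_ratio)

# Chi-squared test of whether the two corpora share the same underlying odds
fit <- chisq.test(tab, correct = FALSE)
print(fit)
```

Equal odds and equal proportions are the same null hypothesis, so the "other" column is needed either way: without it the test cannot tell how large each corpus is.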
 