
[Help] Why does the Z-score in SARA differ from what other methods give?


ibid
2005-10-08, 12:19 PM
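I used SARA to compute z-scores for some collocates in the BNC and then checked them in CalcZ, a program dedicated to computing Z-scores. The results are far apart. What is going on? Any expert advice would be much appreciated!!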

[This post was edited by xujiajin on 2005-10-11 at 09:30:56]

动态语法
2005-10-08, 12:57 PM
Can you provide us here with the raw data you used and perhaps also how you calculated the Z-score?

Without this kind of information, how can anyone help you?

[This post was edited by the author on 2005-10-08 at 13:16:12]

刘语料
2005-10-08, 05:08 PM
Perhaps different tools define "word" differently. What matters most is the ranking of the words you get from the same tool, with which one can compare the collocability of a specific word with its collocates. You can consult Dr. Barlow's analysis.

ibid
2005-10-08, 11:39 PM
to 动态语法: I use the BNC as the control corpus for my comparative study. Since it's monstrously large, I selected only 500 instances at random out of around 17700 occurrences of the word under discussion. But all of these are processed by SARA, and the typicality of the randomly selected concordances is questionable. So far, I haven't found a way to copy the concordance lines from SARA to a Word file. I'm sorry, I can't provide the raw data. The software I use to calculate the Z-score requires 5 pieces of information, viz. C1 (number of co-occurrences of the node word and the collocate), C2 (frequency of the collocate), S (default 10), Cs (total number of words in the corpus), and n (frequency of the node word). It's a simple piece of software, really, meant to save laborious manual work.

ibid
2005-10-08, 11:44 PM
to 刘语料: I'm sorry, but I'm not quite clear about what you were referring to. Can you explain more specifically? Thank you!

xiaoz
2005-10-08, 11:59 PM
Which version of the BNC are you using: Version 2 or the World Edition? The online SARA uses the old Version 2. Are you using the World Edition?

If you let me know which word you are studying in your example, perhaps I will be able to check it for you.

动态语法
2005-10-09, 12:34 AM
Quoting ibid's post of 2005-10-8 23:39:25:
to 动态语法: I use the BNC as the control corpus for my comparative study. Since it's monstrously large, I selected only 500 instances at random out of around 17700 occurrences of the word under discussion. But all of these are processed by SARA, and the typicality of the randomly selected concordances is questionable. So far, I haven't found a way to copy the concordance lines from SARA to a Word file. I'm sorry, I can't provide the raw data. The software I use to calculate the Z-score requires 5 pieces of information, viz. C1 (number of co-occurrences of the node word and the collocate), C2 (frequency of the collocate), S (default 10), Cs (total number of words in the corpus), and n (frequency of the node word). It's a simple piece of software, really, meant to save laborious manual work.



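No, I am not interested in your concordance lines; I meant raw numbers.
Can you give these numbers: C1 (number of co-occurrences of the node word and the collocate), C2 (frequency of the collocate),
S (default 10), Cs (total number of words in the corpus), n (frequency of the node word)? (By the way, what is "S, default
10"? Window span? Do both programs have it as the default setting?)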

Just like scientists doing experiments, you want others to have the exact same
raw data to be able to replicate your results. Without the raw data we don't know
what you are talking about and how to respond to you.

ibid
2005-10-09, 02:15 PM
to xiaoz and 动态语法: I'm using the BNC World Edition released in 2000. Let's take 'everyone' as an example: the collocate 'else' co-occurs 1153 times with 'everyone'; 'everyone' and 'else' occur 12786 and 19931 times respectively; S is the window span, which we set to 10 (5 left, 5 right); and the total number of words in the BNC is 100,000,000. The Z-score given by BNC is 237.1, while putting all these data into the Z-score software CalcZ gives 150.3. That is my question.
Besides, if I choose 500 concordances at random from the BNC, 'else' co-occurs 36 times with 'everyone', and when I calculate the Z-score within the downloads only, the number I get is 37.3. Why is it such a far cry from 237.1, the value for the whole corpus?

ibid
2005-10-09, 02:19 PM
Another question: does anyone know how to save concordance lines in a Word file? At the moment I can only save the lines in XML format and then copy them into a Word file, but there are a lot of tags I need to get rid of.

动态语法
2005-10-09, 03:12 PM
Quoting ibid's post of 2005-10-9 14:15:35:
to xiaoz and 动态语法: I'm using the BNC World Edition released in 2000. Let's take 'everyone' as an example: the collocate 'else' co-occurs 1153 times with 'everyone'; 'everyone' and 'else' occur 12786 and 19931 times respectively; S is the window span, which we set to 10 (5 left, 5 right); and the total number of words in the BNC is 100,000,000. The Z-score given by BNC is 237.1, while putting all these data into the Z-score software CalcZ gives 150.3. That is my question.

You need to look at the documentation about CalcZ (what is it, by the way?) and
see what formula the author uses. I got a different score (with ACWT), too, based
on the formula given by BNCWeb.

http://forum.corpus4u.org/upload/forum/2005100915002419.jpg

The corpus size may be the problem: does the program use
100,000,000 or something else as the corpus size? This may cause different
results.

Now question #2:
Quoting ibid's post of 2005-10-9 14:15:35:

Besides, if I choose 500 concordances at random from the BNC, 'else' co-occurs 36 times with 'everyone', and when I calculate the Z-score within the downloads only, the number I get is 37.3. Why is it such a far cry from 237.1, the value for the whole corpus?


Well, when you change the sample size, you get different frequency info, and yet
you assume the same base numbers (corpus size, freq. of node, freq. of collocate,
etc.). The results are bound to be different. It's almost like asking: when you have
a total of 10 bananas, why do 100 monkeys each get fewer bananas than 10 monkeys do,
and why can't they have the same number of bananas?



[This post was edited by the author on 2005-10-09 at 15:28:32]

刘语料
2005-10-09, 05:04 PM
The calculation of the Z-score needs five figures; if any of them changes, the result will be different.
Tools like WordSmith 4.0, Tact 2.15 and PHC 1.02 define "word" differently; therefore, even if users use the identical corpus, they tend to get different results.
Also, one tool may use 4/4 as the span while others use 5/5, so the results differ.
Dr. Barlow thinks that one should pay attention to the ranks of the collocates of a certain word rather than to the values themselves.
I used the above tools to study the same word, "provide", and the results were quite different.
Tools are tools; we should interpret the results according to linguistic facts.
Besides, the Z-score has its disadvantages, so combining the Z-score, T-score, MI-score and other scores is of great importance.

xiaoz
2005-10-09, 09:54 PM
The frequencies of everyone and else are the same as my result (BNC World), but the z score is 301.37 for +/-3. When you created a collocate database with the default settings (+/-5 span), it actually shows collocates within a span of +/-3. Try your method with S=6 to see if your result is the same as mine. [Actually if you try +/-5, else is not on the list at all.]

http://forum.corpus4u.org/upload/forum/2005100921531960.jpg

ibid
2005-10-09, 09:54 PM
Many thanks to 动态语法 and xiaoz. To be honest, I don't have the documentation for CalcZ. My friend just sent me the software. I've tried to google it, but found no useful info. I've noticed there is a discussion about ACWT on this site; I'll take a look at that later. As for WordSmith, I have version 3.0, but it doesn't seem to be able to calculate the Z-score. Or maybe I just don't know how.

xiaoz
2005-10-09, 10:05 PM
WST 4 can compute z scores.

ibid
2005-10-09, 10:07 PM
uh-huh! ic. i'll try it now.

ibid
2005-10-09, 10:25 PM
my result is almost the same as yours: 299.9 with 1121 occurrences of "else". the other word forms else' (twice), else- (once), elses (3 times) are not taken into account.
now for the next question: can i only set S=6? (why don't I have an interface like the one you showed above? there are several items under "query" at the top of the interface; i choose collocation, and calculate the collocates one by one.)

xiaoz
2005-10-09, 10:39 PM
I am using BNCweb hosted at Zurich.

You can also try the BNC Online at the British Library (but that is the old version, a bit larger than the World Edition).

http://thetis.bl.uk/

hancunxin
2005-10-09, 10:46 PM
Quoting ibid's post of 2005-10-9 14:19:56:
Another question: does anyone know how to save concordance lines in a Word file? At the moment I can only save the lines in XML format and then copy them into a Word file, but there are a lot of tags I need to get rid of.

it depends on what kind of concordancer you are using.

ibid
2005-10-09, 10:49 PM
My study follows my supervisor's criteria, but I guess he hasn't used the BNC for his studies. He used to define the span as +/-5 and the significance threshold for the
Z-score as 2.0. If, for my study, I set S=6, then what should the threshold be?

ibid
2005-10-09, 10:52 PM
to hancunxin: do u have better ideas?

xiaoz
2005-10-09, 11:14 PM
the threshold value is independent of window span. A z score above 3.0 should be significant.

[This post was edited by the author on 2005-10-11 at 21:37:58]

ibid
2005-10-09, 11:45 PM
C1=10 C2=100 N=80 S=10 CS=100,000
Z=6.42223128

C1=10 C2=100 N=80 S=6 CS=100,000
Z=8.79039915

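I ran a small comparison: if S is changed from 10 to 6, the Z value is different! I'm sorry, but until I sort this out I really can't select collocates for my study, or I might end up excluding collocates that are worth studying. Please explain a bit more!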
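
The two results in this post can be reproduced with a short sketch. CalcZ's documentation is not available in this thread, so the following is only an assumption about what it computes: P = C2/CS, S read as the number of words on each side (so the window is 2S+1 words), E = P * N * (2S+1), and Z = (C1 - E)/sqrt(E(1 - P)). The function name `z_score` is illustrative, not CalcZ's.

```python
import math

def z_score(c1, c2, n, s, cs):
    """Z-score of a collocate, ASSUMING a window of 2*s + 1 words per node occurrence."""
    p = c2 / cs                   # probability of the collocate in the corpus
    e = p * n * (2 * s + 1)       # expected co-occurrences inside all windows
    sd = math.sqrt(e * (1 - p))   # standard deviation
    return (c1 - e) / sd

# Figures from the post above
print(round(z_score(10, 100, 80, 10, 100_000), 4))  # 6.4222
print(round(z_score(10, 100, 80, 6, 100_000), 4))   # 8.7904
# 'everyone' / 'else' figures quoted earlier in the thread
print(round(z_score(1153, 19931, 12786, 10, 100_000_000), 1))  # 150.3
```

If this reading is right, entering S=10 means a 21-word window, which is wider than a 5-left/5-right window, and a wider window raises E and lowers Z.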

xiaoz
2005-10-10, 12:55 AM
There are many discussions of collocation statistics at this site. Search for the relevant postings.

动态语法
2005-10-10, 02:01 AM
Quoting ibid's post of 2005-10-9 23:45:32:
C1=10 C2=100 N=80 S=10 CS=100,000
Z=6.42223128

C1=10 C2=100 N=80 S=6 CS=100,000
Z=8.79039915

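I ran a small comparison: if S is changed from 10 to 6, the Z value is different! I'm sorry, but until I sort this out I really can't select collocates for my study, or I might end up excluding collocates that are worth studying. Please explain a bit more!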


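It is to be expected that the Z values differ, and that is nothing to worry about. The key point is that Z values are generally used to compare several collocation pairs; as long as
the base figures (S, CS, etc.) are the same for all the items being compared, the final result is meaningful to some extent.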

cathy
2005-10-10, 03:58 PM
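A question:
I am reading Pu Jianzhong's article 《英语词汇教学中的类连接、搭配及词块》 (on colligation, collocation and chunks in English vocabulary teaching; 2003, issue 6), and there are a few things I don't understand. How were the chi-square values and the significance P values in Table 1 on page 440 calculated? Does WordSmith have a tool for this? Which books or articles explain what these values mean? In Table 5, "the colligation the ADJn has a frequency of 238": was this counted by hand, or is there dedicated software?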

ibid
2005-10-10, 10:52 PM
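to cathy: I don't know how Prof. Pu calculated them. When I counted colligations in the past, I did it all by hand. Chi-square tests whether a difference is significant, and the p value is a cut-off criterion; you would learn these in statistics. I can't explain it very clearly; I have only used them before. I use WordSmith 3.0, which doesn't seem to have this function.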

ibid
2005-10-10, 10:58 PM
X2
0.028<1
0.006<1
3.820
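These are figures from a small study I did earlier: 0.028, 0.006 and 3.820 are all computed X2 values. With p set to 1, those greater than 1 count as significant differences.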
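
For reference, a chi-square value is normally judged by its p-value, or against a critical value for a chosen significance level (about 3.84 at p = 0.05 with one degree of freedom), not against 1. The sketch below converts the three X2 values above into p-values, assuming one degree of freedom (as in a 2x2 comparison); the post does not state the degrees of freedom, so this is an assumption, and the function name is illustrative.

```python
import math

def chi2_p_value_df1(x2):
    """Upper-tail p-value of a chi-square statistic, ASSUMING 1 degree of freedom."""
    # For df = 1, P(X > x2) = erfc(sqrt(x2 / 2)).
    return math.erfc(math.sqrt(x2 / 2))

# X2 values quoted in the post above
for x2 in (0.028, 0.006, 3.820):
    print(f"X2 = {x2:.3f}  p = {chi2_p_value_df1(x2):.4f}")
```

With more than one degree of freedom the critical values are larger, so the table for the right degrees of freedom should be consulted.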

ibid
2005-10-10, 11:06 PM
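to 动态语法: My fault for not being clear earlier. I am doing a contrastive study of Chinese students and native speakers, so I also need a learner corpus, which I process with WordSmith 3.0. But 3.0 cannot compute z-scores, so I use another small program for that. When comparing with the BNC figures, or when extracting collocates, the two should be on the same basis.
xiaoz suggested that I combine other statistical measures, but the one I am familiar with and have always used is the z-score; the others I only know of and have never used.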

动态语法
2005-10-10, 11:33 PM
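Quoting ibid's post of 2005-10-10 23:06:36:
to 动态语法: My fault for not being clear earlier. I am doing a contrastive study of Chinese students and native speakers, so I also need a learner corpus, which I process with WordSmith 3.0. But 3.0 cannot compute z-scores, so I use another small program for that. When comparing with the BNC figures, or when extracting collocates, the two should be on the same basis.
xiaoz suggested that I combine other statistical measures, but the one I am familiar with and have always used is the z-score; the others I only know of and have never used.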


My personal view: it is best to obtain matching raw data from the different corpora and then compute with the same statistical software.

动态语法
2005-10-10, 11:37 PM
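Quoting cathy's post of 2005-10-10 15:58:54:
A question:
I am reading Pu Jianzhong's article 《英语词汇教学中的类连接、搭配及词块》 (on colligation, collocation and chunks in English vocabulary teaching; 2003, issue 6), and there are a few things I don't understand. How were the chi-square values and the significance P values in Table 1 on page 440 calculated? Does WordSmith have a tool for this? Which books or articles explain what these values mean? In Table 5, "the colligation the ADJn has a frequency of 238": was this counted by hand, or is there dedicated software?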


Do you mean chi-square? ACWT has links to an online X2 calculator as well as the
statistical tables for X2.

xujiajin
2005-10-10, 11:43 PM
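Quoting cathy's post of 2005-10-10 15:58:54:
A question:
I am reading Pu Jianzhong's article 《英语词汇教学中的类连接、搭配及词块》 (on colligation, collocation and chunks in English vocabulary teaching; 2003, issue 6), and there are a few things I don't understand. How were the chi-square values and the significance P values in Table 1 on page 440 calculated? Does WordSmith have a tool for this? Which books or articles explain what these values mean? In Table 5, "the colligation the ADJn has a frequency of 238": was this counted by hand, or is there dedicated software?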


Pu Jianzhong obtained them with the keyword tool in WordSmith. Both WS3 and WS4 have it. The keyness value there should be the chi-square value, I think.

dzhigner
2005-10-11, 02:40 AM
The difference in the computed Z-scores may be due to different formulas.
For example, here are two different Z-score formulas:
http://forum.corpus4u.org/upload/forum/2005101101455573.jpg
http://forum.corpus4u.org/upload/forum/2005101101462292.jpg
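The two formulas rest on the same principle, but differ in two places:
1. Probability of the collocate
In the BNCweb formula: probability of the collocate = frequency of the collocate / (total text length - frequency of the node word)
In the formula in the Introduction (《导论》, Yang's 'An Introduction to Corpus Linguistics'): probability of the collocate = frequency of the collocate / total text length

2. Span of the small text
S in the BNCweb formula means something different from S in the Introduction formula; the former corresponds to 2S in the latter.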
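
The two differences listed above can be tried out separately. The sketch below is only an illustration based on the description in this post (the linked formula images are not reproduced here): both variants use E = P * node frequency * window and Z = (O - E)/sqrt(E(1 - P)), and differ in how P is defined. The function names and the explicit `window` parameter are hypothetical; the figures are the 'everyone'/'else' numbers quoted earlier in the thread.

```python
import math

def z_score(observed, p, node_freq, window):
    """Z = (O - E) / sqrt(E(1 - P)), with E = P * node_freq * window."""
    expected = p * node_freq * window
    return (observed - expected) / math.sqrt(expected * (1 - p))

def p_bncweb(colloc_freq, node_freq, corpus_size):
    # Collocate probability as described for BNCweb: node occurrences excluded.
    return colloc_freq / (corpus_size - node_freq)

def p_intro(colloc_freq, corpus_size):
    # Collocate probability as described for the Introduction formula.
    return colloc_freq / corpus_size

O, C, N, W = 1153, 19931, 12786, 100_000_000
for window in (10, 11, 21):
    zb = z_score(O, p_bncweb(C, N, W), N, window)
    zi = z_score(O, p_intro(C, W), N, window)
    print(f"window={window:2d}  BNCweb P: {zb:.2f}  Introduction P: {zi:.2f}")
```

With these figures the two definitions of P change the result only slightly at a fixed window, whereas changing the window size changes it a great deal.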

In another thread I have tentatively offered an explanation of the statistical nature of the Z-score:
http://www.corpus4u.com/forum_view.asp?forum_id=34&view_id=882


[This post was edited by the author on 2005-10-16 at 02:00:35]

dzhigner
2005-10-11, 04:01 AM
In addition, I humbly also think that the Z-score is a relatively good way of revealing collocational strength.
http://forum.corpus4u.org/upload/forum/2005101102530042.jpg
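MI is in essence the ratio of the observed value to the expected value. MI3 is an excellent refinement that to some extent overcomes MI's tendency to overemphasise low-frequency collocations. The T-score gives full weight to frequency and takes the population as a whole into account, and the Z-score calculation reflects all of the above factors. The chart above shows this clearly. Just as various examination systems favour standard scores, the Z-score, which works on a similar principle, is a good way of measuring collocational strength.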
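
To make the comparison concrete, here is a minimal sketch of the four measures using their commonly cited definitions: MI = log2(O/E), MI3 = log2(O^3/E), T = (O - E)/sqrt(O), Z = (O - E)/sqrt(E(1 - P)). The chart in the post may have been produced with slightly different variants; the function name, the explicit `window` parameter and the choice of P = collocate frequency / corpus size are assumptions made for illustration. The input is the 'everyone'/'else' data quoted earlier in the thread.

```python
import math

def association_scores(observed, colloc_freq, node_freq, window, corpus_size):
    """Return MI, MI3, T-score and Z-score for one node/collocate pair."""
    p = colloc_freq / corpus_size          # probability of the collocate
    expected = p * node_freq * window      # expected co-occurrences in the windows
    return {
        "MI": math.log2(observed / expected),
        "MI3": math.log2(observed ** 3 / expected),
        "T": (observed - expected) / math.sqrt(observed),
        "Z": (observed - expected) / math.sqrt(expected * (1 - p)),
    }

scores = association_scores(1153, 19931, 12786, window=10, corpus_size=100_000_000)
for name, value in scores.items():
    print(f"{name:>3}: {value:.2f}")
```

Note that MI3 differs from MI by exactly 2*log2(O), which is how it gives extra weight to frequent pairs.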

xiaoz
2005-10-11, 09:03 AM
Can you try some very frequent and very infrequent items to see if the result of this comparison is also true? Thanks.

dzhigner
2005-10-14, 12:35 PM
What does "very frequent and very infrequent items" refer to?

ibid
2005-10-14, 11:29 PM
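The first calculation method is the one I am familiar with, and the software I use to calculate the Z-score also uses the first method.
I'm not good at statistics, and there is one point I didn't follow: "S in the BNCweb formula means something different from S in the Introduction (《导论》) formula; the former corresponds to 2S+1 in the latter." Does that mean BNC's S is more than twice the Introduction's S?
Also, "Comparison also shows that when the frequency of the node word is not very high, the results are nearly identical." What if the node frequency is very high, in the thousands or tens of thousands?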

dzhigner
2005-10-18, 05:07 AM
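Before replying, I must solemnly declare: my research attitude was not rigorous enough, and I spoke rashly before reaching a correct conclusion. I will reflect on this behind closed doors and mend my ways. What follows is tentative and speculative; I welcome criticism and correction from everyone.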

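After much hard thought, I have concluded that my earlier views on the Z-score are highly questionable. Although I did find the differences between the two algorithms of the Introduction (《导论》) and BNCweb, I fell into a dubious line of reasoning based on an unverified assumption: after unifying S in the two formulas I found the results nearly the same and stopped there, putting forward my view as correct without further verification. I am deeply ashamed. The following is the conclusion I reached after reflecting again; I hope everyone will criticise, correct and discuss it.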

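The collocate probability P in BNCweb does differ from that in the Introduction formula. The two points of difference should be connected, so the choice of S deserves careful thought.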

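The two formulas differ only in the meaning of S. Comparing results shows that S in BNCweb is simply the value of the SPAN, whereas S in the Introduction formula is the number of words on one side, so 2S+1 gives the span of the small text. A simple derivation shows that the two are much the same.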

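In the Introduction formula E = P*M; in BNCweb E = P*F(n)*S, which, leaving aside the difference in S, is exactly the same as the calculation of E in the Introduction formula.
Denominator of the BNCweb formula:
SQRT(E(1-P)) = SQRT(F(n) * S * P * (1 - P) ) = SQRT(P*(1-P)*M)
In both cases Z is finally computed as (F(c) - E)/SD, so apart from the different calculation of P and the different definition of S, there is no other difference.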
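
The derivation above can be written out as follows. This is only a restatement of the post in LaTeX, under two notational assumptions: O stands for the observed co-occurrence frequency (which the post writes as F(c)), and M = F(n) S is the total size of the windows around the node word.

```latex
\[ E = P \cdot M, \qquad M = F(n) \cdot S \]
\[ SD = \sqrt{M \, P \, (1 - P)} = \sqrt{E \, (1 - P)} \]
\[ Z = \frac{O - E}{SD} \]
% The two formulas, as described in this thread, differ only in P and in the window:
\[ P_{\mathrm{BNCweb}} = \frac{\text{collocate frequency}}{W - F(n)}, \qquad
   P_{\mathrm{Introduction}} = \frac{\text{collocate frequency}}{W} \]
```

Here W is the total number of words in the corpus, and for the Introduction formula the window S is replaced by 2S+1.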

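The differences in P and S do indeed make the two Z-score algorithms give different results. So what exactly is the relationship between the calculation of P and the determination of S? BNCweb defines S as follows: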
S: the span (window-size), i.e. the number of items on either side of the node considered as its environment
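On analysis, I believe that S here is the span excluding the node word, and it goes with the probability of the collocate where the node word does not occur, i.e. the probability of the collocate in the text excluding the node word. The 2S+1 in the Introduction (《导论》) formula is a symmetric span that includes the node word, and it goes with the probability of the collocate in the whole text including the node word.
Calling the SPAN in the BNCweb formula S1 and the SPAN in the Introduction S2, we have 2 * S2 + 1 = S1 + 1. I also conjecture that in some specific problems an asymmetric span could be used, for example when examining only the collocational strength between the node word and the collocates in the first position to its left. In that case S1 should be set to 1, and if the Introduction formula is applied, a single variable should replace 2S+1; for instance, letting S stand in for the original 2S+1, S should take the value 2.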
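
The relation between the two span conventions proposed in this post can be checked with a few lines. This is a sketch of the poster's interpretation only, not of either program's actual behaviour, and both function names are hypothetical.

```python
def bncweb_span(left, right):
    """S1 in the post: number of window positions, node word excluded."""
    return left + right

def intro_window(left, right):
    """The value playing the role of 2S+1 in the post: window positions plus the node word."""
    return left + right + 1

# Symmetric windows, and the asymmetric "first word to the left" case
for left, right in [(5, 5), (3, 3), (1, 0)]:
    s1 = bncweb_span(left, right)
    w2 = intro_window(left, right)
    print(f"left={left} right={right}: S1={s1}, Introduction window={w2}")
```

For a symmetric window with S2 words per side, this gives S1 = 2 * S2 and an Introduction window of 2 * S2 + 1, matching 2 * S2 + 1 = S1 + 1.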

动态语法
2005-10-18, 06:20 AM
Mr. D's scholarship is rigorous and his investigation thorough; we should all learn from him.

xujiajin
2005-10-18, 07:34 AM
Support!

dzhigner
2005-10-19, 01:25 AM
Worksheet: analysing SPAN in the Introduction (《导论》) Z-score formula and the BNCweb z-score formula
http://forum.corpus4u.org/upload/forum/2005101901231286.xls

清风出袖
2005-10-19, 12:08 PM
A model for us junior scholars, and admired by us all. Thank you!

ibid
2005-10-24, 02:01 PM
I've finally given up trying to make out the difference between the Z-scores calculated by SARA and by the method proposed in Yang's book, 'An Introduction to Corpus Linguistics'. I took 动态语法's suggestion, "it is best to obtain matching raw data from the different corpora and then compute with the same statistical software", and left this issue unsolved.
However, after making a few tests, I don't believe that the difference is caused by different delimitations of span. I use the clearest way to define S: left 5 and right 5, or left 4 and right 4, etc. Listed below are my tests:

C': co-occurrence with the node
C: occurrence of the collocate
W: total number of words in BNC
N: occurrence of the node word

ibid
2005-10-24, 02:34 PM
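I meant to upload a file but couldn't find where to do so, and couldn't be bothered to keep looking. Here are the data without a proper table; please make do.
N (anything) = 27487, C (else) = 19931

Span               C'      Z-score (BNC)   Z-score (proposed method)
left=5 right=5     2295    321.4           287.9
left=4 right=4     2265    356.1           315.6
left=3 right=3     2223    405.2           352.8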
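
The second Z-score column (the proposed method) can be reproduced with the sketch below. It assumes W = 100,000,000, the corpus size ibid stated earlier in the thread, and a window of 2S+1 for S words on each side; the function name `z_proposed` is illustrative. The BNC column is not reproduced here, because the thread does not document the formula or corpus size SARA uses.

```python
import math

W = 100_000_000   # total number of words, as stated earlier in the thread
N = 27487         # occurrences of the node word 'anything'
C = 19931         # occurrences of the collocate 'else'

def z_proposed(c_prime, s):
    """Z-score with P = C/W and a window of 2*s + 1 (an assumed reading of the method)."""
    p = C / W
    e = p * N * (2 * s + 1)
    return (c_prime - e) / math.sqrt(e * (1 - p))

# (words per side, co-occurrences C') for the three tests listed above
for s, c_prime in [(5, 2295), (4, 2265), (3, 2223)]:
    print(f"left={s} right={s}  C'={c_prime}  Z={z_proposed(c_prime, s):.1f}")
# Z=287.9, Z=315.6, Z=352.8
```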

ibid
2005-10-24, 02:36 PM
That didn't come out well. Where can I upload? I'll upload my file!

dzhigner
2005-11-04, 10:13 AM
Click on "回复" (Reply) instead of editing your post in "快速回复" (Quick Reply), and you'll see the upload widget.

ibid
2005-11-04, 01:29 PM
thanks, dzhigner. http://forum.corpus4u.org/upload/forum/2005110413272140.doc

happyw
2006-03-26, 02:17 PM
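I used SARA to compute z-scores for some collocates in the BNC and then checked them in CalcZ, a program dedicated to computing Z-scores. The results are far apart. What is going on?

I've heard that the formula in the SARA software may have flaws.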