[Help] Question about the corpus word-frequency distribution on p. 13 of 《中国学习者语料库》 (Chinese Learner Corpus): cumulative frequency

dzhigner

Moderator
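"Herdan holds that the distribution of word frequencies follows a lognormal model (Herdan, 1960; Carroll, 1967): if the cumulative percentage of the sample (expressed in tokens) is plotted against the logarithm of the corresponding word-type frequency, with the former on the Y axis and the latter on the X axis, the distribution is normal."
Could someone who knows this well please explain the two concepts "cumulative percentage of the sample" and "corresponding word-type frequency"?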

[This post was edited by xujiajin on 2005-09-22 at 23:19:25]
 

tiger

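Senior Member
I think it should be understood this way: in a list of word types sorted by frequency, the "cumulative percentage of the sample" is the cumulative percentage reached at each word type, and the "corresponding word-type frequency" is the frequency of the word type that goes with each cumulative percentage.

I'm not sure whether this reading is correct.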
 

dzhigner

Moderator
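Re: [Help] Question about the corpus word-frequency distribution on p. 13 of 《中国学习者语料库》

Quoting tiger's post of 2005-9-21 19:10:57:
I think it should be understood this way: in a list of word types sorted by frequency, the "cumulative percentage of the sample" is the cumulative percentage reached at each word type, and the "corresponding word-type frequency" is the frequency of the word type that goes with each cumulative percentage.

I'm not sure whether this reading is correct.
Is the cumulative percentage the running total of each word type's share of the total number of word types?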
 

dzhigner

Moderator
Columns: Word Frequency, Number of Words, Cumulative Vocabulary, Cumulative Word Count, Percentage Vocabulary, Percentage Word Count
1 14856 14856 14856 45.33830 3.08546
2 4929 19785 24714 60.38087 5.13288
3 2592 22377 32490 68.29127 6.74789
4 1697 24074 39278 73.47026 8.15770
5 1153 25227 45043 76.98904 9.35504
6 874 26101 50287 79.65636 10.44417
7 664 26765 54935 81.68279 11.40952
8 503 27268 58959 83.21787 12.24527
9 464 27732 63135 84.63393 13.11259
10 369 28101 66825 85.76006 13.87897
11 329 28430 70444 86.76412 14.63060
12 275 28705 73744 87.60338 15.31598
13 259 28964 77111 88.39381 16.01528
14 207 29171 80009 89.02554 16.61717
15 195 29366 82934 89.62065 17.22466
16 176 29542 85750 90.15778 17.80952
17 148 29690 88266 90.60945 18.33207
18 149 29839 90948 91.06418 18.88910
19 132 29971 93456 91.46702 19.40999
20 116 30087 95776 91.82104 19.89183
21 122 30209 98338 92.19337 20.42394
22 114 30323 100846 92.54128 20.94483
23 119 30442 103583 92.90445 21.51328
24 102 30544 106031 93.21574 22.02171
25 96 30640 108431 93.50871 22.52017
26 92 30732 110823 93.78948 23.01696
27 68 30800 112659 93.99701 23.39829
28 67 30867 114535 94.20148 23.78791
29 61 30928 116304 94.38765 24.15532
30 74 31002 118524 94.61348 24.61639
31 53 31055 120167 94.77523 24.95763
32 49 31104 121735 94.92477 25.28329
33 53 31157 123484 95.08652 25.64654
34 57 31214 125422 95.26048 26.04905
35 38 31252 126752 95.37645 26.32528
36 33 31285 127940 95.47716 26.57201
37 39 31324 129383 95.59618 26.87171
38 34 31358 130675 95.69994 27.14005
39 42 31400 132313 95.82812 27.48025
40 38 31438 133833 95.94409 27.79594
41 38 31476 135391 96.06006 28.11952
42 41 31517 137113 96.18519 28.47717
43 37 31554 138704 96.29810 28.80760
44 33 31587 140156 96.39882 29.10917
45 33 31620 141641 96.49953 29.41759
46 23 31643 142699 96.56972 29.63733
47 24 31667 143827 96.64296 29.87161
48 29 31696 145219 96.73147 30.16071
49 22 31718 146297 96.79861 30.38460
50 21 31739 147347 96.86270 30.60268
51 17 31756 148214 96.91458 30.78275
52 17 31773 149098 96.96646 30.96635
53 17 31790 149999 97.01834 31.15348
54 15 31805 150809 97.06412 31.32171
55 16 31821 151689 97.11295 31.50447
56 18 31839 152697 97.16788 31.71383
57 21 31860 153894 97.23197 31.96243
58 19 31879 154996 97.28996 32.19131
59 12 31891 155704 97.32658 32.33835
60 13 31904 156484 97.36625 32.50035
61 13 31917 157277 97.40593 32.66505
62 11 31928 157959 97.43950 32.80670
63 19 31947 159156 97.49748 33.05530
64 19 31966 160372 97.55547 33.30786
65 20 31986 161672 97.61650 33.57786
66 21 32007 163058 97.68059 33.86572
67 12 32019 163862 97.71722 34.03270
68 11 32030 164610 97.75079 34.18805
69 13 32043 165507 97.79046 34.37435
70 14 32057 166487 97.83319 34.57789
71 15 32072 167552 97.87896 34.79908
72 9 32081 168200 97.90643 34.93366
73 11 32092 169003 97.94000 35.10044
74 12 32104 169891 97.97662 35.28487
75 11 32115 170716 98.01019 35.45621
76 17 32132 172008 98.06207 35.72455
77 9 32141 172701 98.08954 35.86848
78 10 32151 173481 98.12006 36.03048
79 5 32156 173876 98.13532 36.11252
80 13 32169 174916 98.17499 36.32852
81 6 32175 175402 98.19330 36.42946
82 5 32180 175812 98.20856 36.51461
83 10 32190 176642 98.23908 36.68699
84 9 32199 177398 98.26655 36.84401
85 8 32207 178078 98.29096 36.98524
86 10 32217 178938 98.32148 37.16385
87 9 32226 179721 98.34895 37.32647
88 7 32233 180337 98.37031 37.45441
89 5 32238 180782 98.38557 37.54683
90 6 32244 181322 98.40388 37.65899
91 5 32249 181777 98.41914 37.75349
92 9 32258 182605 98.44661 37.92546
93 5 32263 183070 98.46187 38.02203
94 6 32269 183634 98.48018 38.13917
95 5 32274 184109 98.49544 38.23782
96 7 32281 184781 98.51680 38.37739
97 12 32293 185945 98.55342 38.61914
98 7 32300 186631 98.57479 38.76162
99 4 32304 187027 98.58699 38.84387
100 2 32306 187227 98.59310 38.88540
.
.
.

2058 1 32744 324618 99.92981 67.42031
2098 1 32745 326716 99.93286 67.85605
2191 1 32746 328907 99.93591 68.31110
2438 1 32747 331345 99.93896 68.81745
2495 1 32748 333840 99.94201 69.33564
2587 1 32749 336427 99.94507 69.87293
2720 1 32750 339147 99.94812 70.43785
2787 1 32751 341934 99.95117 71.01669
3025 1 32752 344959 99.95422 71.64496
3096 1 32753 348055 99.95727 72.28797
3208 1 32754 351263 99.96033 72.95424
3475 1 32755 354738 99.96338 73.67597
3496 1 32756 358234 99.96643 74.40206
3581 1 32757 361815 99.96948 75.14580
3845 1 32758 365660 99.97253 75.94437
4733 1 32759 370393 99.97559 76.92737
5032 1 32760 375425 99.97864 77.97248
5860 1 32761 381285 99.98169 79.18955
10626 1 32762 391911 99.98474 81.39647
11703 1 32763 403614 99.98779 83.82708
12449 1 32764 416063 99.99084 86.41263
13398 1 32765 429461 99.99390 89.19528
17967 1 32766 447428 99.99695 92.92687
34056 1 32767 481484 100.00000 100.00000

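The above is the word frequency profile of the Brown corpus, generated in Simple Concordance Program.
Should the first column (Word Frequency) go on the X axis and the last column (Percentage Word Count) on the Y axis?

The resulting scatter plot is shown above.
I'm not sure that's what is meant; I'm completely lost.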
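One reading of the quoted passage can be sketched in Python: rebuild the cumulative and percentage columns from the first two columns of the profile, and take the logarithm of the word frequency for the X axis. This is only a sketch under assumptions: the function name `profile` and the dictionary keys are made up for illustration, base-10 logarithms are an arbitrary choice, and only the first five rows of the listing are used, with the two totals taken from its last row. If the lognormal model holds, the Y values plotted against `log10_freq` would be expected to trace an S-shaped normal cumulative curve rather than a straight line.

```python
import math
from itertools import accumulate

# (word frequency, number of word types with that frequency):
# the first five rows of the profile listed above.
rows = [(1, 14856), (2, 4929), (3, 2592), (4, 1697), (5, 1153)]
TOTAL_TYPES = 32767    # final value of the Cumulative Vocabulary column
TOTAL_TOKENS = 481484  # final value of the Cumulative Word Count column


def profile(rows, total_types, total_tokens):
    """Rebuild the cumulative and percentage columns (hypothetical helper)."""
    cum_types = accumulate(n for _, n in rows)
    # Each frequency class contributes frequency * number-of-types tokens.
    cum_tokens = accumulate(f * n for f, n in rows)
    out = []
    for (f, _), ct, ck in zip(rows, cum_types, cum_tokens):
        out.append({
            "log10_freq": math.log10(f),            # candidate X value
            "cum_types": ct,
            "cum_tokens": ck,
            "pct_types": 100 * ct / total_types,
            "pct_tokens": 100 * ck / total_tokens,  # candidate Y value
        })
    return out


table = profile(rows, TOTAL_TYPES, TOTAL_TOKENS)
for r in table:
    print(f"{r['log10_freq']:.3f} {r['cum_tokens']} {r['pct_tokens']:.5f}")
```

The rebuilt `cum_tokens` and `pct_tokens` values can be compared row by row with the Cumulative Word Count and Percentage Word Count columns of the listing.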

[This post was edited by the author on 2005-09-21 at 23:21:03]
 

tiger

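Senior Member
The cumulative percentage should be the running total of each word type's token count as a share of the total token count, for example: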
1 the 20 50%
2 at 10 75%
3 in 10 100%

That is the only reading that works.
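The toy example above can be reproduced with a few lines of Python. This is just a sketch: the function name `cumulative_percentages` is made up for illustration, and the list is assumed to be already sorted by descending frequency.

```python
from itertools import accumulate


def cumulative_percentages(counts):
    """Running share (in percent) of the total token count."""
    total = sum(counts)
    return [100.0 * running / total for running in accumulate(counts)]


# (word type, token count), sorted by frequency as in the example.
freq_list = [("the", 20), ("at", 10), ("in", 10)]
cum = cumulative_percentages([n for _, n in freq_list])

for rank, ((word, n), pct) in enumerate(zip(freq_list, cum), start=1):
    print(rank, word, n, f"{pct:.0f}%")
```

The total is 40 tokens, so the running totals 20, 30 and 40 give 50%, 75% and 100%.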
 

xujiajin

Administrator
Staff member
Here is another example of cumulative frequency:
Variables in Names Files:
name
freq = Frequency in percent
cum.freq = Cumulative Frequency in percent
rank
---------------------------------------
First ten entries in dist.all.last
---------------------------------------
name freq cum.freq rank
SMITH 1.006 1.006 1
JOHNSON 0.810 1.816 2
WILLIAMS 0.699 2.515 3
JONES 0.621 3.136 4
BROWN 0.621 3.757 5
DAVIS 0.480 4.237 6
MILLER 0.424 4.660 7
WILSON 0.339 5.000 8
MOORE 0.312 5.312 9
TAYLOR 0.311 5.623 10
---------------------------------------
First ten entries in dist.female.first
---------------------------------------
name freq cum.freq rank
MARY 2.629 2.629 1
PATRICIA 1.073 3.702 2
LINDA 1.035 4.736 3
BARBARA 0.980 5.716 4
ELIZABETH 0.937 6.653 5
JENNIFER 0.932 7.586 6
MARIA 0.828 8.414 7
SUSAN 0.794 9.209 8
MARGARET 0.768 9.976 9
DOROTHY 0.727 10.703 10
---------------------------------------
First ten entries in dist.male.first
---------------------------------------
name freq cum.freq rank
JAMES 3.318 3.318 1
JOHN 3.271 6.589 2
ROBERT 3.143 9.732 3
MICHAEL 2.629 12.361 4
WILLIAM 2.451 14.812 5
DAVID 2.363 17.176 6
RICHARD 1.703 18.878 7
CHARLES 1.523 20.401 8
JOSEPH 1.404 21.805 9
THOMAS 1.380 23.185 10
The parts in bold show clearly how the cumulative frequency is obtained.
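As a quick check, the `cum.freq` column of the `dist.all.last` listing can be recomputed as a running sum of the `freq` column. The sketch below only copies the ten rows shown above; the variable names are made up for illustration.

```python
from itertools import accumulate

# freq and cum.freq (both in percent) copied from the dist.all.last rows above.
freq = [1.006, 0.810, 0.699, 0.621, 0.621, 0.480, 0.424, 0.339, 0.312, 0.311]
published = [1.006, 1.816, 2.515, 3.136, 3.757, 4.237, 4.660, 5.000, 5.312, 5.623]

# Running sum of freq, rounded to the three decimals used in the listing.
recomputed = [round(x, 3) for x in accumulate(freq)]
max_gap = max(abs(a - b) for a, b in zip(recomputed, published))

print(recomputed)
print(f"largest difference: {max_gap:.3f}")
```

The recomputed values agree with the listing to within 0.001. The one visible gap, at the MILLER row, is presumably because the published figures were accumulated before rounding.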
 