Jem Clear
The two statistical measures of significance which are used by the
collocations feature of the CobuildDirect service are explained below in
layman's terms. It is not really possible to explain the complete
statistical background to the use of Mutual Information and t-scores here.
Let us work through some example data (taken from a 20m word corpus) for
the word "post".
It co-occurs with many words, among which are "the", "office" and
"mortem".
The observable facts are that "post" has an overall corpus freq of 2579
(let's refer to this as f(post)=2579) and also
f(office) = 5237
f(the) = 1019262
f(mortem) = 51
We also observe the number of times these words co-occurred with "post"
(for shorthand I'll write j(the) = 1583 to mean that "the" occurred with
"post" 1583 times: this is the "joint" frequency). So
j(the) = 1583
j(office) = 297
j(mortem) = 51
Now if we were to list the collocates of "post" by raw frequency of
co-occurrence we would order them according to j(x), as above. Of course,
a full collocation listing of "post" in this form would have many other
words with intermediate frequencies -- we are just focussing on these
three words for the moment. But the ordering shown above doesn't tell us
anything much about the strength of association between "post" and these
other words: it is simply a reflection of the basic overall frequency of
the collocating words (i.e. "the" is much more frequent than "office"
which is much more frequent than "mortem"). We just showed that in the
f(x) list! This is true in general: ordering collocates by j(x) simply
places words like "the", "a", "of", "to" at the top of every collocate
list. What we would like to know is:
------------------------------------------------------------------------------------------------------------
IMPORTANT QUESTION: to what extent does the word "post" condition its
lexical environment by selecting particular words with which it will
co-occur?
------------------------------------------------------------------------------------------------------------
We can compare the relative frequencies of what we observed with what we
would expect under the null hypothesis:
------------------------------------------------------------------------------------------------------------
NULL HYPOTHESIS: the word "post" has no effect whatsoever on its lexical
environment and the frequencies of words surrounding "post" will be
exactly (give or take random fluctuation) the same whether "post" is
present or not.
------------------------------------------------------------------------------------------------------------
That is, if "the" has an overall relative frequency of 1 in 20 (about 1m
occurrences in a 20m word corpus -- see f(the) above) then we can expect
"the" to occur with the same relative frequency in a subset of the corpus
which is the 4 words either side of "post": hence under the null
hypothesis we would expect j(the) to be
(f(post) * span ) * relative_freq(the)
which is
(2579 * 8) * (1 / 20) = 20632 / 20 = 1031.6, i.e. about 1031
So under the null hypothesis we would expect j(the) to be 1031. We
actually observed j(the) to be 1583, which is rather higher, and we could
simply express the difference as a ratio (of observed to expected joint
frequency) thus:
1583/1031
This ratio is the basis of the Mutual Information score (strictly, MI is
usually reported as the base-2 logarithm of this ratio, which preserves the
ordering) and it expresses the extent to which observed frequency of
co-occurrence differs from expected (where we mean "expected under the null
hypothesis"). Of course, big differences indicate massive divergence from
the null hypothesis and indicate that "post" is exerting a strong influence
over its lexical environment.
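The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the figures quoted in the text; the function and variable names are invented here and are not part of CobuildDirect. Note that it uses the exact f(the) rather than the rounded "1 in 20", so the expected j(the) comes out at about 1051 rather than 1031. The base-2 logarithm of the ratio, the form in which MI is commonly reported, is printed alongside.

```python
import math

CORPUS_SIZE = 20_000_000  # the "20m word corpus" from the text
SPAN = 8                  # 4 words either side of the node word
F_POST = 2579             # f(post)

# collocate -> (overall frequency f(x), joint frequency j(x)), as quoted above
COUNTS = {
    "the": (1019262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
}


def expected_joint(f_node, f_collocate, span=SPAN, corpus_size=CORPUS_SIZE):
    """Expected j(x) under the null hypothesis: (f(node) * span) * relative_freq(x)."""
    return f_node * span * (f_collocate / corpus_size)


def observed_over_expected(j_observed, f_node, f_collocate):
    """Ratio of observed to expected joint frequency."""
    return j_observed / expected_joint(f_node, f_collocate)


for word, (f_x, j_x) in COUNTS.items():
    expected = expected_joint(F_POST, f_x)
    ratio = observed_over_expected(j_x, F_POST, f_x)
    print(f"{word:8s} expected={expected:10.4f} "
          f"ratio={ratio:8.2f} log2={math.log2(ratio):6.2f}")
```

On these figures "mortem" gets by far the largest ratio, "office" a substantial one, and "the" a ratio only a little above 1, which is the reverse of the raw j(x) ordering.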
BUT BUT BUT! there is a Big Problem with Mutual Information: suppose the
word "egregious" appears just once with "post" (not an unreasonable event)
in the corpus. And "egregious" may have a very low overall freq:
f(egregious) = 3
Now we carry out the sums to calculate the expected j(egregious) figure. I
can assure you it will be a small number! It is:
(f(post) * span ) * relative_freq(egregious)
(2579 * 8) * ( 3 / 20000000)
= 0.0030948
Now you'll see that even if "egregious" occurs just once in the vicinity
of "post" the observed j(egregious) will be 323 times more than the
expected joint frequency, and the mutual information value will be high.
Common sense tells us that words cannot appear 0.0030948 times -- they
occur either zero times or once, nothing in between -- so claiming that
"post"+"egregious" is a significant collocation is rather dubious.
In general, the comparison of observed j(x) and expected j(x) will be very
unreliable when values of j(x) are low; this is common sense, too. Just
because I've seen these two words together once in 20m words doesn't give
me much confidence that they are strongly associated: I'd need to see them
together several times at least before I could start to feel at all secure
in claiming that they have some sort of significant association.
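To see how fragile the ratio is at low counts, here is a small Python sketch of the "egregious" case. It assumes the figures given in the text (f(egregious) = 3, span of 8, 20m words); the names are made up for illustration.

```python
CORPUS_SIZE = 20_000_000  # 20m word corpus
SPAN = 8                  # 4 words either side of "post"
F_POST = 2579             # f(post)
F_EGREGIOUS = 3           # f(egregious), the hypothetical rare word


def expected_joint(f_collocate):
    """Expected joint frequency with "post" under the null hypothesis."""
    return F_POST * SPAN * f_collocate / CORPUS_SIZE


expected = expected_joint(F_EGREGIOUS)
print(f"expected j(egregious) = {expected:.7f}")

# A single occurrence more or less swings the ratio enormously.
for j_observed in (0, 1, 2):
    print(f"observed={j_observed}  observed/expected={j_observed / expected:8.1f}")
```

One chance co-occurrence moves the ratio from 0 to about 323, and a second doubles it again, which is why the ratio alone cannot be trusted when j(x) is this small.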
Now here comes T-score. We can calculate a second-order statistic which
is, crudely, this:
------------------------------------------------------------------------------------------------------------
IMPORTANT QUESTION: how confident can I be that the association that I've
measured between "post" and "egregious" is true and not due to the
vagaries of chance?
------------------------------------------------------------------------------------------------------------
T-score answers this question. It takes account of the size of j(x) and
weights its value accordingly. A high T-score says: it is safe (very
safe/pretty safe/extremely secure etc according to value) to claim that
there is some non-random association between these two words. So t-scores
are higher when the figure j(x) is higher. In the case of "egregious" we
would get a very low t-score. In the case of "the" the t-score might be
quite high, but not huge because "the" doesn't have that strong an
association with "post". "office" gets a really high t-score because not
only is the observed j(office) way higher than expected, but we have seen a
goodly number of such co-occurrences, enough to be pretty damn sure that
this can't be due to some freak of chance.
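The text does not give a formula for the t-score, so the Python sketch below ASSUMES a formulation that is widely used in corpus linguistics: (observed - expected) / sqrt(observed). Whether CobuildDirect computes it in exactly this way is not stated here, so treat the numbers as indicative only. The names are invented for illustration, and the "egregious" counts are the hypothetical ones from the discussion above.

```python
import math

CORPUS_SIZE = 20_000_000  # 20m word corpus
SPAN = 8                  # 4 words either side of "post"
F_POST = 2579             # f(post)

# collocate -> (overall frequency f(x), joint frequency j(x))
COUNTS = {
    "the": (1019262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
    "egregious": (3, 1),  # hypothetical rare word from the text
}


def expected_joint(f_collocate):
    """Expected joint frequency with "post" under the null hypothesis."""
    return F_POST * SPAN * f_collocate / CORPUS_SIZE


def t_score(j_observed, j_expected):
    """ASSUMED formulation: (observed - expected) / sqrt(observed)."""
    if j_observed <= 0:
        raise ValueError("t-score is undefined here for a zero joint frequency")
    return (j_observed - j_expected) / math.sqrt(j_observed)


for word, (f_x, j_x) in COUNTS.items():
    print(f"{word:10s} t={t_score(j_x, expected_joint(f_x)):6.2f}")
```

Under this assumption the ordering agrees with the description above: "office" scores highest, "the" is fairly high, "mortem" lower, and "egregious" comes out below 1.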
In practical terms, raw frequency or j(x) won't tell you much at all about
collocation: you'll simply discover what you already knew, namely that
"the" is a *very* frequent word and seems to co-occur with just about
everything. MI is the proper measure of strength of association: if the MI
score is high, then observed j(x) is massively greater than expected, BUT
you've got to
watch out for the low j(x) frequencies because these are very likely to be
freaks of chance, not consistent trends. t-score is best of the lot,
because it highlights those collocations where j(x) is high enough not to
be unreliable and where the strength of association is distinctly
measurable.
Try the different measures: you'll soon see the difference. Raw freq often
picks out the obvious collocates ("post office" "side effect") but you
have no way of distinguishing these objectively from frequent
non-collocations (like "the effect" "an effect" "effect is" "effect it" etc).
MI will highlight the technical terms, oddities, weirdos, totally fixed
phrases, etc ("post mortem" "Laurens van der Post" "post-menopausal"
"prepaid post"/"post prepaid" "post-grad"). T-score will get you
significant collocates which have occurred frequently ("post office"
"Washington Post" "post-war", "by post" "the post").
If a collocate appears in the top of both MI and t-score lists it is
clearly a humdinger of a collocate, rock-solid, typical, frequent,
strongly associated with its node word, recurrent, reliable, etc etc etc.
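As a toy illustration of comparing the lists, the Python sketch below ranks the four words discussed above by raw frequency, by the observed/expected ratio, and by a t-score, then reports which words sit near the top of both the MI and t-score lists. The t-score formula, (observed - expected) / sqrt(observed), is an ASSUMPTION (a common formulation, not one given in this text), the "egregious" counts are hypothetical, and all names are invented. With only four words it shows the idea rather than a realistic collocate list.

```python
import math

CORPUS_SIZE = 20_000_000  # 20m word corpus
SPAN = 8                  # 4 words either side of "post"
F_POST = 2579             # f(post)

# collocate -> (overall frequency f(x), joint frequency j(x))
COUNTS = {
    "the": (1019262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
    "egregious": (3, 1),  # hypothetical rare word from the text
}


def scores(f_x, j_x):
    """Return the three measures for one collocate of "post"."""
    expected = F_POST * SPAN * f_x / CORPUS_SIZE
    return {
        "raw": j_x,
        "ratio": j_x / expected,                  # basis of the MI score
        "t": (j_x - expected) / math.sqrt(j_x),   # ASSUMED t-score formula
    }


def ranked(measure):
    """Collocates ordered from highest to lowest on the given measure."""
    table = {w: scores(f, j)[measure] for w, (f, j) in COUNTS.items()}
    return sorted(table, key=table.get, reverse=True)


def top_in_both(n):
    """Collocates appearing in the top n of both the ratio and t-score lists."""
    return set(ranked("ratio")[:n]) & set(ranked("t")[:n])


for measure in ("raw", "ratio", "t"):
    print(f"{measure:6s}", ranked(measure))
print("top 3 of both:", sorted(top_in_both(3)))
```

Here "office" and "mortem" are the words that make the top three on both measures, while "the" drops out of the ratio list and "egregious" drops out of the t-score list.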
Jem Clear
June 1995