The two statistical measures of significance which are used by the

collocations feature of the CobuildDirect service are explained below in

layman's terms. It is not really possible to explain the complete

statistical background to the use of Mutual Information and t-scores here.

Let us work through some example data (taken from a 20m word corpus) for

the word "post".

It co-occurs with many words, among which are "the", "office" and

"mortem".

The observable facts are that "post" has an overall corpus freq of 2579

(let's refer to this as f(post)=2579) and also

f(office) = 5237

f(the) = 1019262

f(mortem) = 51

We also observe the number of times these words co-occurred with "post"

(for shorthand I'll write j(the) = 1583 to mean that "the" occurred with

"post" 1583 times: this is the "joint" frequency). So

j(the) = 1583

j(office) = 297

j(mortem) = 51

Now if we were to list the collocates of "post" by raw frequency of

co-occurrence we would order them according to j(x), as above. Of course,

a full collocation listing of "post" in this form would have many other

words with intermediate frequencies -- we are just focussing on these

three words for the moment. But the ordering shown above doesn't tell us

anything much about the strength of association between "post" and these

other words: it is simply a reflection of the basic overall frequency of

the collocating words (i.e. "the" is much more frequent than "office"

which is much more frequent than "mortem"). We just showed that in the

f(x) list! This is true in general: ordering collocates by j(x) simply

places words like "the", "a", "of", "to" at the top of every collocate

list. What we would like to know is:

------------------------------------------------------------------------------------------------------------

IMPORTANT QUESTION: to what extent does the word "post" condition its

lexical environment by selecting particular words with which it will

co-occur?

------------------------------------------------------------------------------------------------------------

We can compare the relative frequencies of what we observed with what we

would expect under the null hypothesis:

------------------------------------------------------------------------------------------------------------

NULL HYPOTHESIS: the word "post" has no effect whatsoever on its lexical

environment and the frequencies of words surrounding "post" will be

exactly (give or take random fluctuation) the same whether "post" is

present or not.

------------------------------------------------------------------------------------------------------------

That is, if "the" has an overall relative frequency of 1 in 20 (about 1m

occurrences in a 20m word corpus -- see f(the) above) then we can expect

"the" to occur with the same relative frequency in a subset of the corpus

which is the 4 words either side of "post": hence under the null

hypothesis we would expect j(the) to be

(f(post) * span) * relative_freq(the)

which is

(2579 * 8) * (1 / 20) = 20632 / 20 = 1031.6, or roughly 1031

So under the null hypothesis we would expect j(the) to be 1031. We

actually observed j(the) to be 1583, which is rather higher, and we could

simply express the difference as a ratio (of observed to expected joint

frequency) thus:

1583 / 1031 = 1.54 (approximately)

This ratio is the basis of the Mutual Information score (strictly, MI is

the base-2 logarithm of the ratio), and it expresses the extent to which

observed frequency of co-occurrence differs from expected (where we mean

"expected under the null hypothesis"). Of course, big differences indicate

massive divergence from the null hypothesis and indicate that "post" is

exerting a strong influence over its lexical environment.
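
The sums above can be sketched in a few lines of Python. This is only an
illustration, not the code behind the CobuildDirect service: the function
and variable names are made up here, and the corpus is assumed to hold
exactly 20,000,000 words. It prints the plain observed/expected ratio for
the three example collocates.

```python
# Figures quoted in the text; the corpus size is assumed to be exactly 20m.
CORPUS_SIZE = 20_000_000
SPAN = 8            # 4 words either side of the node word
F_POST = 2579       # f(post)

f = {"the": 1019262, "office": 5237, "mortem": 51}   # overall frequencies f(x)
j = {"the": 1583, "office": 297, "mortem": 51}       # joint frequencies j(x)


def expected_joint(f_x):
    """Expected j(x) under the null hypothesis (hypothetical helper)."""
    return (F_POST * SPAN) * (f_x / CORPUS_SIZE)


def observed_to_expected(word):
    """Ratio of observed to expected joint frequency for one collocate."""
    return j[word] / expected_joint(f[word])


for word in f:
    print(f"{word:8s} expected={expected_joint(f[word]):10.4f} "
          f"ratio={observed_to_expected(word):8.2f}")
```

Because this uses the exact f(the) rather than the rounded 1-in-20 figure,
the expected j(the) comes out near 1051 instead of 1031. The ratio for
"the" is still about 1.5, against roughly 55 for "office" and roughly 970
for "mortem".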

BUT BUT BUT! there is a Big Problem with Mutual Information: suppose the

word "egregious" appears just once with "post" (not an unreasonable event)

in the corpus. And "egregious" may have a very low overall freq:

f(egregious) = 3

Now we carry out the sums to calculate the expected j(egregious) figure. I

can assure you it will be a small number! It is:

(f(post) * span) * relative_freq(egregious)

= (2579 * 8) * (3 / 20000000)

= 0.0030948

Now you'll see that even if "egregious" occurs just once in the vicinity

of "post", the observed j(egregious) will be about 323 times the expected

joint frequency (1 / 0.0030948), and the mutual information value will be

high. Common sense tells us that a word cannot appear 0.0030948 times (it

occurs zero times or once, nothing in between), so claiming that

"post"+"egregious" is a significant collocation is rather dubious.

In general, the comparison of observed j(x) and expected j(x) will be very

unreliable when values of j(x) are low; this is common sense, too. Just

because I've seen these two words together once in 20m words doesn't give

me much confidence that they are strongly associated: I'd need to see them

together several times at least before I could start to feel at all secure

in claiming that they have some sort of significant association.
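
To see how quickly a single chance co-occurrence inflates the ratio, here
is a small Python sketch. Only the f(x) = 3 case comes from the text; the
other frequencies are hypothetical values chosen to show the trend, and the
helper name is made up for illustration.

```python
CORPUS_SIZE = 20_000_000   # corpus size, assumed to be exactly 20m words
WINDOW_WORDS = 2579 * 8    # f(post) * span, as in the text


def ratio_for_single_hit(f_x):
    """Observed/expected ratio when a word is seen near "post" exactly once."""
    expected = WINDOW_WORDS * (f_x / CORPUS_SIZE)
    return 1 / expected


# f(x) = 3 is the "egregious" case from the text; the other values are
# hypothetical frequencies, included only to show the trend.
for f_x in (3, 30, 300, 3000):
    print(f"f(x)={f_x:5d}  ratio for one co-occurrence = "
          f"{ratio_for_single_hit(f_x):8.2f}")
```

The rarer the word, the larger the ratio produced by one stray
co-occurrence, which is exactly why low j(x) values make the comparison
unreliable.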

Now here comes T-score. We can calculate a second-order statistic which

is, crudely, this:

------------------------------------------------------------------------------------------------------------

IMPORTANT QUESTION: how confident can I be that the association that I've

measured between "post" and "egregious" is true and not due to the

vagaries of chance?

------------------------------------------------------------------------------------------------------------

T-score answers this question. It takes account of the size of j(x) and

weights its value accordingly. A high T-score says: it is safe (very

safe/pretty safe/extremely secure etc according to value) to claim that

there is some non-random association between these two words. So t-scores

are higher when the figure j(x) is higher. In the case of "egregious" we

would get a very low t-score. In the case of "the" the t-score might be

quite high, but not huge because "the" doesn't have that strong an

association with "post". "office" gets a really high t-score because not

only is the observed j(office) way higher than expected, but we have seen a

goodly number of such co-occurrences, enough to be pretty damn sure that

this can't be due to some freak of chance.
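
The text does not give the t-score formula. One approximation often used in
corpus work is (observed - expected) / sqrt(observed). The Python sketch
below assumes that formula, which may not be exactly what the CobuildDirect
service computes, and applies it to the figures quoted above. The names are
made up for illustration.

```python
import math

CORPUS_SIZE = 20_000_000   # corpus size, assumed to be exactly 20m words
WINDOW_WORDS = 2579 * 8    # f(post) * span, as in the text

# (f(x), j(x)) pairs quoted in the text.
collocates = {
    "the": (1019262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
    "egregious": (3, 1),
}


def t_score(f_x, j_x):
    """Assumed approximation: (observed - expected) / sqrt(observed)."""
    expected = WINDOW_WORDS * (f_x / CORPUS_SIZE)
    return (j_x - expected) / math.sqrt(j_x)


t_scores = {w: t_score(f_x, j_x) for w, (f_x, j_x) in collocates.items()}
for word, t in t_scores.items():
    print(f"{word:10s} t = {t:6.2f}")
```

Under this assumed formula "office" scores roughly 17, "the" roughly 13,
"mortem" roughly 7 and "egregious" just under 1, which matches the
qualitative picture described above.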

In practical terms, raw frequency or j(x) won't tell you much at all about

collocation: you'll simply discover what you already knew: that "the" is a

*very* frequent word and seems to co-occur with just about everything. MI

is the proper measure of strength of association: if the MI score is high,

then observed j(x) is massively greater than expected, BUT you've got to

watch out for the low j(x) frequencies because these are very likely to be

freaks of chance, not consistent trends. T-score is the best of the lot,

because it highlights those collocations where j(x) is high enough not to

be unreliable and where the strength of association is distinctly

measurable.

Try the different measures: you'll soon see the difference. Raw freq often

picks out the obvious collocates ("post office" "side effect") but you

have no way of distinguishing these objectively from frequent

non-collocations (like "the effect", "an effect", "effect is", "effect it" etc).

MI will highlight the technical terms, oddities, weirdos, totally fixed

phrases, etc ("post mortem" "Laurens van der Post" "post-menopausal"

"prepaid post"/"post prepaid" "post-grad"). T-score will get you

significant collocates which have occurred frequently ("post office"

"Washington Post" "post-war", "by post" "the post").

If a collocate appears at the top of both MI and t-score lists it is

clearly a humdinger of a collocate, rock-solid, typical, frequent,

strongly associated with its node word, recurrent, reliable, etc etc etc.
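
To try the different measures on the four example collocates of "post"
discussed earlier, here is a rough Python sketch that ranks them three
ways. It is an illustration only: the names are made up, the corpus is
assumed to hold exactly 20,000,000 words, "mi" is the plain
observed/expected ratio (taking a logarithm would not change the order),
and the t-score formula (observed - expected) / sqrt(observed) is an
assumption, since the text does not state one.

```python
import math

CORPUS_SIZE = 20_000_000   # corpus size, assumed to be exactly 20m words
WINDOW_WORDS = 2579 * 8    # f(post) * span, as in the text

# (f(x), j(x)) pairs quoted in the text.
collocates = {
    "the": (1019262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
    "egregious": (3, 1),
}


def measures(f_x, j_x):
    """Return raw frequency, observed/expected ratio and assumed t-score."""
    expected = WINDOW_WORDS * (f_x / CORPUS_SIZE)
    return {
        "raw": j_x,
        "mi": j_x / expected,                      # ratio, not its logarithm
        "t": (j_x - expected) / math.sqrt(j_x),    # assumed approximation
    }


def ranked(measure):
    """List the collocates from highest to lowest on the given measure."""
    scores = {w: measures(f_x, j_x)[measure]
              for w, (f_x, j_x) in collocates.items()}
    return sorted(scores, key=scores.get, reverse=True)


for name in ("raw", "mi", "t"):
    print(f"{name:3s}: {', '.join(ranked(name))}")
```

On these four words raw frequency puts "the" first, the ratio puts "mortem"
and "egregious" first, and the assumed t-score puts "office" first.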

Jem Clear

June 1995