100 Million Corpus: registers, WordNet, synonyms


Date: 04-May-2005
From: Mark Davies <mark_davies@byu.edu>
Subject: 100 Million Corpus: registers, WordNet, synonyms

There is a free resource that may be of interest -
"Variation in English Words and Phrases" found at:


This is a new interface to the 100-million-word British National Corpus,
probably the best-known corpus of English. One can carry out the
following types of searches -- most of which are not possible with any
other interface:

1. Quickly find the frequency of words and phrases in any combination of
more than 70 registers that you define (spoken, academic, poetry, medical,
tabloids, email, etc.); e.g.:
-- the most common nouns in natural sciences texts, adjectives in
engineering texts, or verbs in medical texts
-- which collocates (co-occurring words) occur more in one register than
another; e.g. the collocates of [chair] in fiction vs. academic texts
-- variation in grammatical constructions across registers; e.g. the
relative frequency of the passive in academic vs. spoken texts, the
relative frequency of [whom] in all 70 registers, etc.

2. Compare synonyms and other semantically related words. One simple
search, for example, shows the most frequent nouns that appear with one of
[sheer], [complete], or [utter] but not with the other two (sheer nonsense,
complete account, utter dismay). Another looks for adjectives that occur
with [woman] but not with [man] or [child].

3. You can also input information from WordNet (a semantically organized
lexicon of English) directly into the search form. This allows you to find
the frequency and distribution of words with similar, more general, or more
specific meanings (e.g. the frequency of synonyms of [world], or the
frequency of more specific words for [jump]).

4. Search for words and phrases by exact word or phrase, wildcard or part
of speech, or combinations of these (e.g. *ly good/bad [n*]: really good
time, extremely bad idea).

5. Use anchors and targets for fuzzy matches (e.g. all nouns somewhere near
[paper], all adjectives near [woman], or all nouns near [spin]).
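
As a rough illustration of the kind of query described in item 4, the
sketch below matches a pattern such as *ly good/bad [n*] against a small
part-of-speech-tagged sample. It is not the site's actual implementation:
the function names, the toy sentence, and the tags (loosely modeled on the
CLAWS5 labels used in the BNC) are assumptions made for this example, as is
the convention that a bracketed slot is matched against the tag rather than
the word.

```python
from fnmatch import fnmatchcase

# Hypothetical tagged sample: (word, part-of-speech tag) pairs.
# The tags imitate CLAWS5-style labels (AV0 adverb, AJ0 adjective, NN1 noun).
SAMPLE = [
    ("we", "PNP"), ("had", "VHD"), ("a", "AT0"), ("really", "AV0"),
    ("good", "AJ0"), ("time", "NN1"), ("but", "CJC"), ("it", "PNP"),
    ("was", "VBD"), ("an", "AT0"), ("extremely", "AV0"), ("bad", "AJ0"),
    ("idea", "NN1"), ("and", "CJC"), ("a", "AT0"), ("fairly", "AV0"),
    ("good", "AJ0"), ("one", "PNI"),
]


def slot_matches(slot, word, tag):
    """Test one query slot against one (word, tag) token.

    A slot in square brackets, e.g. [n*], is matched against the tag;
    anything else is matched against the word. Alternatives are
    separated by '/', and '*' is a wildcard.
    """
    if slot.startswith("[") and slot.endswith("]"):
        patterns, target = slot[1:-1].split("/"), tag.lower()
    else:
        patterns, target = slot.split("/"), word.lower()
    return any(fnmatchcase(target, p.lower()) for p in patterns)


def search(query, tokens):
    """Return every run of consecutive tokens that satisfies the query."""
    slots = query.split()
    hits = []
    for start in range(len(tokens) - len(slots) + 1):
        window = tokens[start:start + len(slots)]
        if all(slot_matches(s, w, t) for s, (w, t) in zip(slots, window)):
            hits.append(" ".join(w for w, _ in window))
    return hits


RESULTS = search("*ly good/bad [n*]", SAMPLE)
print(RESULTS)
```

On this toy sample the search returns "really good time" and "extremely
bad idea"; "fairly good one" is skipped because its last word is not tagged
as a noun.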

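For the register comparisons in item 1, raw counts are hard to compare
directly because registers differ in size, so frequencies are usually
normalized, for instance to occurrences per million words. The sketch below
shows that arithmetic only; the counts and register sizes are invented for
illustration and are not figures from the BNC or from the interface.

```python
# Invented counts, for illustration only: occurrences of one word in two
# registers, with the (also invented) size of each register in words.
REGISTERS = {
    "academic": {"hits": 150, "size": 10_000_000},
    "spoken": {"hits": 320, "size": 16_000_000},
}


def per_million(hits, size):
    """Normalize a raw count to occurrences per million words."""
    return hits * 1_000_000 / size


RATES = {name: per_million(r["hits"], r["size"]) for name, r in REGISTERS.items()}

# List registers from the highest normalized rate to the lowest.
for name, rate in sorted(RATES.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} {rate:8.2f} per million words")
```
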
Please feel free to email me with any questions that you might have.

Mark Davies
Dept. Linguistics, Brigham Young University

Linguistic Field(s): Computational Linguistics
Discourse Analysis
Text/Corpus Linguistics