UCREL: University Centre for Computer Corpus Research on Language

xujiajin

University Centre for Computer Corpus Research on Language (UCREL)

http://www.comp.lancs.ac.uk/ucrel/

UCREL is a research centre of Lancaster University. It draws upon the expertise of the Department of Linguistics and Modern English Language and the Department of Computing.

For more than two decades, we have led the way in an approach to natural language processing that is based upon information derived from large bodies of naturally occurring text. These bodies of text are stored on computer and are known as corpora (sg. corpus).

The vast majority of UCREL's work is carried out within this corpus-based paradigm. The corpora are used to derive empirical knowledge about language, which can supplement, and frequently supplant, information from reference sources and introspection (Leech, 1991; 1992).

Because they are well suited to quantitative analysis, corpora can provide information about the relative frequencies of many aspects of language. These frequencies can then be employed in probabilistic analysis techniques, which are another major feature of UCREL's work.
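As a rough illustration of what "relative frequencies" means here, the following minimal Python sketch counts tokens in a tiny invented corpus and divides each count by the total. The function name `relative_frequencies` and the toy sentence are assumptions made for this example only; real corpora are far larger and are usually annotated.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Return each token's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Toy "corpus", invented purely for illustration.
toy_corpus = "the cat sat on the mat and the dog sat too".split()
freqs = relative_frequencies(toy_corpus)

# Show the three most frequent tokens with their relative frequency.
for word, share in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{word}: {share:.3f}")
```

Frequencies of this kind are the raw material that a probabilistic analysis would draw on.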

Rather than using hard-and-fast rules, probabilistic systems use frequency data along with sophisticated statistical models to make a 'best guess' about the correct analysis of a piece of language (Sampson, 1987b). Although probabilistic systems make mistakes, they often perform at a very high degree of accuracy (in the high nineties per cent). Compared with rule-based systems, they are exceptionally robust, and can analyze 'real' language containing performance errors (as opposed to idealized invented examples) where rule-based systems would often fail. Because of this robustness and overall accuracy, mainstream computational linguists are now taking an increased interest in probabilistic methods and corpora.
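The 'best guess' idea can be sketched with a deliberately simple baseline: for each word, pick the tag it most often carried in a tagged training sample. This is only an illustration under stated assumptions, not the statistical models UCREL actually uses (the text above does not specify them). The tag names, the toy training data, and the functions `train` and `best_guess` are all hypothetical.

```python
from collections import Counter, defaultdict

def train(tagged_tokens):
    """Count how often each word appears with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return counts

def best_guess(word, counts, fallback="NOUN"):
    """Return the most frequent tag seen for the word.

    Unknown words receive the fallback tag instead of causing a failure,
    which is one simple way such a system stays robust on unseen input.
    """
    tags = counts.get(word.lower())
    if not tags:
        return fallback
    return tags.most_common(1)[0][0]

# Invented training sample: "run" is seen twice as a verb, once as a noun.
toy_training = [
    ("the", "DET"),
    ("run", "NOUN"),
    ("run", "VERB"),
    ("run", "VERB"),
    ("dogs", "NOUN"),
]
model = train(toy_training)

print(best_guess("run", model))    # most frequent tag for a known word
print(best_guess("xyzzy", model))  # unknown word falls back to the default
```

A guess of this kind will sometimes be wrong ("run" is occasionally a noun), which mirrors the trade-off described above: no hard rule is violated by unexpected input, at the cost of some errors.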

UCREL's work is very much focussed on practical outcomes. We have engaged in corpus-based research contributing to such practical applications as:

speech synthesis
speech recognition
machine-aided translation
dictionary publishing
social survey interview analysis
computer-aided language teaching

Our work focusses on:

English - we were a leading partner in the British National Corpus consortium and are now exploiting the BNC to arrive at new, data-grounded analyses of present-day British speech and writing. We are also involved in corpus-based work on the historical development of the English language, as well as on learner English.

Modern foreign languages - we have built, annotated, and exploited corpora of modern languages such as French and Spanish, and we are presently involved (in collaboration with the University of Lodz) in producing a major corpus of contemporary Polish.

Minority, endangered, and ancient languages - we have pioneered corpus work on non-indigenous minority languages in the UK (e.g. Chinese, Hindi, Punjabi), and we are now extending this work to European indigenous minority languages. We have also carried out computer-aided linguistic research on ancient languages such as Latin.

We are always looking for new ways to apply our expertise and are interested in research projects and individual or team-based consultancy, as well as sharing ideas or techniques.

If there is a natural language topic you would like to explore with us, please contact us at the address shown on the home page.
 