[Discussion] Sample corpus or monitor corpus?

xiaoz · 2005-05-23

Sample corpus or monitor corpus?

Richard Xiao

Most existing corpora are sample corpora, which try to sample language data in a balanced way and remain static once they are created. In contrast, a monitor corpus is constantly (e.g. annually, monthly or even daily) supplemented with fresh material and keeps increasing in size, though the proportion of text types included in the corpus remains constant. Corpora of this type are typically much larger than sample corpora. The Bank of English (BoE) is widely acknowledged to be an example of a monitor corpus. It has increased in size progressively since its inception in the 1980s (Hunston 2002: 15) and is around 524 million words at present. The Global English Monitor Corpus, which was started in late 2001 as an electronic archive of the world’s leading newspapers in English, is expected to reach billions of words within a few years. The corpus aims at monitoring language use and semantic change in English as reflected in newspapers so as to allow for research into whether the English language discourses in Britain, the United States, Australia, Pakistan and South Africa are convergent or divergent over time. Another example of corpora of this kind is AVIATOR (Analysis of Verbal Interaction and Automated Text Retrieval), developed at the University of Birmingham, which automatically monitors language change, using a series of filters to identify and categorize new word forms, new word pairs or terms, and change in meaning. However, as a monitor corpus does not have a finite size, some corpus linguists have argued that it is an ‘ongoing archive’ (Leech 1991: 10) rather than a true corpus.
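
To make the idea of filtering for new word forms and new word pairs concrete, here is a minimal sketch in Python. It is not how AVIATOR itself is implemented; the tokenizer, the function names and the "never seen before" criterion are all assumptions made for illustration, and a real system would need far more careful filters (for typos, proper names, frequency thresholds and so on).

```python
import re
from collections import Counter

# Hypothetical, deliberately crude tokenizer: lowercase alphabetic word forms,
# allowing internal apostrophes and hyphens.
TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def tokenize(text):
    """Lowercase the text and return its word forms in order."""
    return TOKEN_RE.findall(text.lower())


def new_forms_and_pairs(known_texts, fresh_texts):
    """Count word forms and adjacent word pairs that occur in fresh_texts
    but never in known_texts (the existing part of the corpus)."""
    known_forms, known_pairs = set(), set()
    for text in known_texts:
        tokens = tokenize(text)
        known_forms.update(tokens)
        known_pairs.update(zip(tokens, tokens[1:]))

    new_forms, new_pairs = Counter(), Counter()
    for text in fresh_texts:
        tokens = tokenize(text)
        new_forms.update(t for t in tokens if t not in known_forms)
        new_pairs.update(p for p in zip(tokens, tokens[1:]) if p not in known_pairs)
    return new_forms, new_pairs


if __name__ == "__main__":
    forms, pairs = new_forms_and_pairs(
        ["The corpus is static."],
        ["The blog corpus is dynamic."],
    )
    print(sorted(forms))
    print(sorted(pairs))
```

Detecting change in meaning, the third thing the filters are said to do, is a much harder problem and is not attempted in this sketch.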

There was an impromptu debate at a joint conference of the Association for Literary and Linguistic Computing (ALLC) and the Association for Computers and the Humanities (ACH) at Christ Church, Oxford in 1992, with Quirk and Leech arguing in favour of the balanced sample corpus model and Sinclair and Meijs speaking in favour of the monitor corpus model (cf. Geoffrey Leech, personal communication). Whilst the monitor corpus team won the debate in 1992, it is now clear that the sample corpus model has won the wider debate: it is the dominant tradition in modern corpus building, and the majority of corpora built to date are balanced sample corpora, as exemplified by the pioneering English corpora Brown and LOB and, more recently, by the British National Corpus (BNC). Nonetheless, the idea of the monitor corpus is still important and deserves a review here.

The monitor corpus approach was first developed in Sinclair (1991a: 24-26). Sinclair argued against static sample corpora like Brown and in favour of an ever-growing dynamic monitor corpus. The ideas expressed in Sinclair (1991a) mirror the way people were thinking about corpora two decades ago. Unsurprisingly, Sinclair has amended his views ‘as new advances come on stream’ and no longer holds many of the positions held in his 1991a work (Sinclair, personal communication). However, the arguments expressed therein still have some currency, both in the writing of Sinclair (e.g. Sinclair 2004a: 187-191) and others (e.g. Tognini-Bonelli 2001). So it is worth reviewing the arguments presented by Sinclair against sample corpora. The major arguments relate to the overall corpus size (one million words) and the sample size (2,000 words). These concerns have largely been neutralized by an increase in both computer power and the availability of electronic texts (in many languages). As Aston and Burnard (1998: 22) comment, ‘The continued growth in the size of corpora has generally implied an increase in sample sizes as well as the number of samples.’ The overall size of the BNC, for example, is 100 million words. Accordingly, the sample size in the BNC has also increased to 40,000-50,000 words while the number of samples has increased to over 4,000. Samples of this size can no longer be said to be ‘too brief’, and sub-categories composed of such texts can indeed ‘stand as samples themselves.’ Biber (1988, 1993) shows that even in corpora consisting of 2,000-word samples, frequent linguistic features are quite stable both within samples and across samples in different text categories. For relatively rare features and for vocabulary, though, Sinclair’s warning is still valid.

A monitor corpus undergoes a partial self-recycling after reaching some sort of saturation, i.e. the inflow of the new data is subjected to an automatic monitoring process which will only admit new material to the corpus where that material shows some features which differ significantly from the stable part of the corpus (cf. Váradi 2000: 2). There are a number of difficulties with the monitor corpus model, however. First, as this approach rejects any principled attempt to balance a corpus, depending instead upon sheer size to deal with the issue, Leech (1991: 10) refers to a monitor corpus as an ‘ongoing archive’, while Váradi (2000: 2) would label it as ‘opportunistic’. As such, monitor corpora are a less reliable basis than balanced sample corpora for quantitative (as opposed to qualitative) analysis. Second, as this approach argues strongly in favour of whole texts, text availability becomes a difficulty in the already sensitive context of copyright. Third, it is quite confusing to indicate a specific corpus version with its word count. Under such circumstances it is only the corpus builders, not the users, who know what is contained in a specific version. Fourth, as a monitor corpus keeps growing in size, results cannot easily be compared. A monitor corpus thus loses its value as ‘a standard reference’ (McEnery and Wilson 2001: 32). Gronqvist (2004) suggests, rightly in our view, that ‘[a] system where it is possible to restore the corpus as it was at a specific time would be worth a lot’ for a monitor corpus. This suggestion would mean that a dynamic monitor corpus should in effect consist of a series of static corpora over time. This is not current practice.

One final concern with the dynamic model is that, even if rapid advances in computing technology make the huge corpus size and the required processing speed a non-issue, there is no guarantee that the same criteria will be used to filter in new data in the long term (e.g. in 200 years). Even if a diachronic archive of the sort suggested is established, the comparability of the archived versions of the corpus would therefore be in doubt.
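
The idea of a monitor corpus as ‘a series of static corpora over time’ can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing system: the class and field names are hypothetical, and the sketch assumes that texts are only ever appended with the date they were added and are never removed (so it does not model the self-recycling described above).

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Text:
    """One text in the corpus (hypothetical record layout)."""
    added: date      # date the text entered the corpus
    text_type: str   # e.g. "newspaper", "fiction"
    words: tuple     # whitespace-separated word forms


@dataclass
class MonitorCorpus:
    """Append-only corpus: every past state can be recovered by date."""
    texts: list = field(default_factory=list)

    def add(self, added, text_type, content):
        self.texts.append(Text(added, text_type, tuple(content.split())))

    def snapshot(self, as_of):
        """Return the static corpus as it stood at the end of day as_of."""
        return [t for t in self.texts if t.added <= as_of]


def word_count(texts):
    """Total number of word forms in a list of Text records."""
    return sum(len(t.words) for t in texts)


if __name__ == "__main__":
    corpus = MonitorCorpus()
    corpus.add(date(2003, 5, 1), "newspaper", "prices rose sharply in May")
    corpus.add(date(2004, 2, 9), "newspaper", "the blogosphere reacted quickly")
    corpus.add(date(2005, 1, 3), "fiction", "she texted him at dawn")

    end_2004 = corpus.snapshot(date(2004, 12, 31))
    print(len(end_2004), word_count(end_2004))
```

With such a design a user could cite a version by date rather than by word count, and two studies citing the same date would be working on the same data. It does nothing, of course, to guarantee that the criteria for admitting texts stay the same over time.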

Monitor corpora are potentially useful, however. A monitor corpus is primarily designed to track changes from different periods (cf. Hunston 2002: 16). In this sense, a monitor corpus is similar to a diachronic corpus. However, a monitor corpus normally covers a shorter span of time but in a much more fine-grained fashion than the diachronic corpora discussed so far. It is possible, using a monitor corpus, to study relatively rapid language change, such as the development and the life cycle of neologisms. In principle, if a monitor corpus is in existence for a very long period of time (30 years, for example), it may also be possible to study change happening at a much slower rate, e.g. grammatical change. At present, however, no monitor corpus has been in existence long enough to support the type of research that has been carried out using diachronic corpora like the Helsinki corpus or sample corpora such as LOB and FLOB (e.g. Leech 2002).

What do you think about the debate? You are welcome to contribute to the discussion.
