http://www.bossenglish.com/ReadNews.asp?NewsID=443
Early Corpus Linguistics and the Chomskyan Revolution
"Early corpus linguistics" is a term we use here to describe linguistics before the advent of Chomsky. Field linguists, for example Boas (1940) who studied American-Indian languages, and later linguists of the structuralist tradition all used a corpus-based methodology. However, that does not mean that the term "corpus linguistics" was used in texts and studies from this era. Below is a brief overview of some interesting corpus-based studies predating 1950.
Language acquisition
The studies of child language in the diary studies period of language acquisition research (roughly 1876-1926) were based on carefully composed parental diaries recording the child's locutions. These primitive corpora are still used as sources of normative data in language acquisition research today, e.g. Ingram (1978). Corpus collection continued and diversified after the diary studies period: large sample studies covered the period roughly from 1927 to 1957 - data was gathered from a large number of children with the express aim of establishing norms of development. Longitudinal studies have been dominant from 1957 to the present - again based on collections of utterances, but this time with a smaller sample (approximately 3) of children who are studied over long periods of time (e.g. Brown (1973) and Bloom (1970)).
Spelling conventions
Kading (1897) used a large corpus of German - 11 million words - to collate frequency distributions of letters and sequences of letters in German. The corpus, by size alone, is impressive for its time, and compares favourably in terms of size with modern corpora.
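The kind of count described here can be sketched in a few lines of Python. This is only an illustration of collating frequency distributions of letters and letter sequences, not a reconstruction of Kading's procedure; the function name letter_ngrams and the German sample phrase are invented for the example.

```python
from collections import Counter

def letter_ngrams(text, n):
    """Count sequences of n letters inside words, ignoring case."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation does not enter the counts.
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts

# Invented sample phrase, standing in for a real corpus.
sample = "Der Hund und die Katze"

single_letters = letter_ngrams(sample, 1)   # distribution of letters
letter_pairs = letter_ngrams(sample, 2)     # distribution of two-letter sequences

print(single_letters.most_common(3))
print(letter_pairs.most_common(3))
```

Run over millions of words rather than five, the same loop gives the sort of distribution that had to be compiled by hand in 1897.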
Language pedagogy
Fries and Traver (1940) and Bongers (1947) are examples of linguists who used the corpus in research on foreign language pedagogy. Indeed, as noted by Kennedy (1992), the corpus and second language pedagogy had a strong link in the first half of the twentieth century, with vocabulary lists for foreign learners often being derived from corpora. The word counts derived from such studies as Thorndike (1921) and Palmer (1933) were important in defining the goals of the vocabulary control movement in second language pedagogy.
Other examples
Early corpus-based work in comparative linguistics, syntax and semantics is discussed in Chapter 1, page 3 of "Corpus Linguistics".
Chomsky
Chomsky changed the direction of linguistics away from empiricism and towards rationalism in a remarkably short space of time. In doing so he apparently invalidated the corpus as a source of evidence in linguistic enquiry. Chomsky suggested that the corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance.
Competence is best described as our tacit, internalised knowledge of a language.
Performance is external evidence of language competence, and is usage on particular occasions when, crucially, factors other than our linguistic competence may affect its form.
Competence both explains and characterises a speaker's knowledge of a language. Performance, however, is a poor mirror of competence. For example, factors as diverse as short-term memory limitations or whether or not we have been drinking can alter how we speak on any particular occasion. This brings us to the nub of Chomsky's initial criticism: a corpus is by its very nature a collection of externalised utterances - it is performance data and is therefore a poor guide to modelling linguistic competence.
Further to that, if we are unable to measure linguistic competence, how do we determine from any given utterance what are linguistically relevant performance phenomena? This is a crucial question, for without an answer to this, we are not sure that what we are discovering is directly relevant to linguistics. We may easily be commenting on the effects of drink on speech production without knowing it.
However, this was not the only criticism that Chomsky had of the early corpus linguistics approach.
The non-finite nature of language
All the work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions:
The sentences of a natural language are finite.
The sentences of a natural language can be collected and enumerated.
The corpus was seen as the sole source of evidence in the formation of linguistic theory - "This was when linguists...regarded the corpus as the sole explicandum of linguistics" (Leech, 1991).
To be fair, not all linguists at the time made such bullish statements - Harris (1951) is probably the most enthusiastic exponent of this point, while Hockett (1948) did make weaker claims for the corpus, suggesting that the purpose of the linguist working in the structuralist tradition "is not simply to account for utterances which comprise his corpus" but rather to "account for utterances which are not in his corpus at a given time."
The number of sentences in a natural language is not merely arbitrarily large - it is potentially infinite. This is because of the sheer number of choices, both lexical and syntactic, which are made in the production of a sentence. Also, sentences can be recursive. Consider the sentence "The man that the cat saw that the dog ate that the man knew that the..." This type of construct is referred to as centre embedding and can give rise to infinite sentences. (This topic is discussed in further detail in "Corpus Linguistics" Chapter 1, pages 7-8).
The only way to account for a grammar of a language is by description of its rules - not by enumeration of its sentences. It is the syntactic rules of a language that Chomsky considers finite. These rules in turn give rise to infinite numbers of sentences.
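The point that a finite set of rules can give rise to an unbounded set of sentences can be illustrated with a small sketch. The toy grammar below is an assumption made for the example (its word lists and its single recursive rule are invented and are not Chomsky's own formalism): a noun phrase may contain another noun phrase, so each extra level of embedding yields a new, longer sentence.

```python
from itertools import islice

# Hypothetical toy vocabulary; any finite word lists would do.
NOUNS = ["the man", "the cat", "the dog"]
VERBS = ["saw", "knew", "chased"]

def noun_phrase(depth):
    """Rule: NP -> N | N 'that' NP V. The rule refers to itself."""
    noun = NOUNS[depth % len(NOUNS)]
    if depth == 0:
        return noun
    verb = VERBS[depth % len(VERBS)]
    # The embedded noun phrase sits in the centre of the larger one.
    return f"{noun} that {noun_phrase(depth - 1)} {verb}"

def sentences():
    """Yield one sentence per embedding depth, without end."""
    depth = 0
    while True:
        yield f"{noun_phrase(depth)} slept"
        depth += 1

for sentence in islice(sentences(), 3):
    print(sentence)
# the man slept
# the cat that the man knew slept
# the dog that the cat that the man knew chased slept
```

The generator never runs out of sentences, although the grammar itself is only a few lines long. No corpus, however large, could list its output in full.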
The value of introspection
Even if language were a finite construct, would corpus methodology still be the best method of studying language? Why bother waiting for the sentences of a language to enumerate themselves, when by the process of introspection we can delve into our own minds and examine our own linguistic competence? At times intuition can save us time in searching a corpus.
Without recourse to introspective judgements, how can ungrammatical utterances be distinguished from ones that simply haven't occurred yet? If our finite corpus does not contain the sentence:
*He shines Tony books
how do we conclude that it is ungrammatical? Indeed, there may be persuasive evidence in the corpus to suggest that it is grammatical if we see sentences such as:
He gives Tony books
He lends Tony books
He owes Tony books
Introspection seems a useful and good tool for cases such as this. But early corpus linguistics denied its use.
Also, ambiguous structures can only be identified and resolved with some degree of introspective judgement. An observation of physical form only seems inadequate. Consider the sentences:
Tony and Fido sat down - he read a book of recipes.
Tony and Fido sat down - he ate a can of dog food.
It is only with introspection that this pair of ambiguous sentences can be resolved: we know that Fido is the name of a dog, and that it was therefore Fido who ate the dog food and Tony who read the book.
Other criticisms of corpus linguistics
Apart from Chomsky's theoretical criticisms, there were problems of practicality with corpus linguistics. Abercrombie (1963) summed up the corpus-based approach as being composed of "pseudo-procedures". Can you imagine searching through an 11-million-word corpus such as that of Kading (1897) using nothing more than your eyes? The whole undertaking becomes prohibitively time consuming, not to say error-prone and expensive.
Whatever Chomsky's criticisms were, Abercrombie's were undoubtedly correct. Early corpus linguistics required data processing abilities that were simply not available at that time.
The impact of the criticisms levelled at early corpus linguistics in the 1950s was immediate and profound. Corpus linguistics was largely abandoned during this period, although it never totally died.
Chomsky re-examined
Although Chomsky's criticisms did discredit corpus linguistics, they did not stop all corpus-based work. For example, in the field of phonetics, naturally observed data remained the dominant source of evidence, with introspective judgements never making the impact they did on other areas of linguistic enquiry. Also, in the field of language acquisition the observation of naturally occurring evidence remained dominant. Introspective judgements are not available to the linguist/psychologist who is studying child language acquisition - try asking an eighteen-month-old child whether the word "moo-cow" is a noun or a verb! Introspective judgements are only available to us when our meta-linguistic awareness has developed, and there is no evidence that a child at the one-word stage has meta-linguistic awareness. Even Chomsky (1964) cautioned against the rejection of performance data as a source of evidence for language acquisition studies.
The revival of corpus linguistics
It is a common belief that corpus linguistics was abandoned entirely in the 1950s, and then adopted once more almost as suddenly in the early 1980s. This is simply untrue, and does a disservice to those linguists who continued to pioneer corpus-based work during this interregnum.
For example, Quirk (1960) planned and executed the construction of his ambitious Survey of English Usage (SEU) which he began in 1961. In the same year, Francis and Kucera began work on the now famous Brown corpus, a work which was to take almost two decades to complete. These researchers were in a minority, but they were not universally regarded as peculiar and others followed their lead. In 1975 Jan Svartvik started to build on the work of the SEU and the Brown corpus to construct the London-Lund corpus.
During this period the computer slowly started to become the mainstay of corpus linguistics. Svartvik computerised the SEU, and as a consequence produced what some, including Leech (1991), still believe to be "to this day an unmatched resource for studying spoken English".
The availability of the computerised corpus and the wider availability of institutional and private computing facilities do seem to have provided a spur to the revival of corpus linguistics. The table below (from Johansson, 1991) shows how the number of corpus-based studies grew during the latter half of the twentieth century.
Date         Studies
To 1965      10
1966-1970    20
1971-1975    30
1976-1980    80
1981-1985    160
1985-1991    320
The machine readable corpus
The term corpus is almost synonymous with the term machine-readable corpus. The computer interests the corpus linguist because it can carry out processes which, when required of humans, could only be described as pseudo-techniques. The type of analysis that Kading waited years for can now be achieved in a few moments on a desktop computer.
Processes
Given the marriage of machine and corpus, it is worth considering in slightly more detail the processes that allow the machine to aid the linguist. The computer can search for a particular word, sequence of words, or perhaps even a part of speech in a text. So if we are interested, say, in the usages of the word however in a text, we can simply ask the machine to search for it. The computer's ability to retrieve all examples of this word, usually in context, is a further aid to the linguist.
The machine can find the relevant text and display it to the user. It can also calculate the number of occurrences of the word so that information on the frequency of the word may be gathered. We may then be interested in sorting the data in some way - for example, alphabetically on words appearing to the right or left. We may even sort the list by searching for words occurring in the immediate context of the word. We may take our initial list of examples of however presented in context (usually referred to as a concordance), and extract from this another list, say of all the examples of however followed closely by the word we, or followed by a punctuation mark.
The processes described above are often included in a concordance program. This is the tool most often implemented in corpus linguistics to examine corpora. Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy.
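The processes described in this section (searching, counting, sorting on context, and extracting a sub-list) can be sketched in Python as follows. This is a minimal illustration rather than a description of any particular concordance program: the functions tokenize and concordance, the context width of three tokens, and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    return lines

# Invented sample text, standing in for a real corpus.
text = ("However, we left early. The plan, however, failed. "
        "However we tried, it rained. It was, however we look at it, a loss.")

tokens = tokenize(text)
lines = concordance(tokens, "however")

# Frequency of the search word in the text.
frequency = Counter(tokens)["however"]

# Sort alphabetically on the first token to the right of the keyword.
by_right = sorted(lines, key=lambda line: line[2][:1])

# Extract sub-lists: "however" followed closely by "we" (within two
# tokens), or followed immediately by a punctuation mark.
followed_by_we = [line for line in lines if "we" in line[2][:2]]
followed_by_punct = [line for line in lines
                     if line[2] and not line[2][0].isalnum()]

for left, keyword, right in by_right:
    print(f"{' '.join(left):>20} [{keyword}] {' '.join(right)}")
```

Each output line shows one occurrence of however with its context on either side, which is the keyword-in-context layout a concordance normally takes.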
Goals and conclusion
In this section we have:
seen the failure of early corpus linguistics
examined Chomsky's criticisms
seen the failings of introspective data
seen how corpus linguistics was revived
In the remaining sections we will see:
how corpus linguists study syntactic features (Section 2)
how corpus linguistics balances enumeration with introspection (Section 3)
how corpora can be used in language studies (Section 4)