LOCNESS corpus: US-UK students' writing

Re: LOCNESS corpus

Thank you very much. I can open it with WordSmith 3.
But word clusters still cannot be extracted from LOCNESS without the texts. I'll be patient and wait for the compiler's reply.
 
There are some high-frequency bigram and trigram lists based on FLOB, LOCNESS and CLEC in the Native Corpora section which you may find useful.
 
Re: LOCNESS corpus

Yes. I've noticed that. Many thanks.
I also need clusters of 4, 5, 6, 7 and 8 words. These are hard to find.
 
Lexical bundles, also called lexical chains or multiword units, are closely associated with collocations and have been an important topic in lexical studies (e.g. Stubbs 2002). More recently, Biber found that lexical bundles are also a reliable indicator of register variation (e.g. Biber and Conrad 1999; Biber 2003). Biber and Conrad (1999), for example, showed that the structural types of lexical bundles in conversation are markedly different from those in academic prose. Biber’s (2003) comparative study of the distribution of 15 major types of 4-word lexical bundles (technically known as 4-grams) in the registers of conversation, classroom teaching, textbooks and academic prose indicates that lexical bundles are significantly more frequent in the two spoken registers. The distribution of lexical bundles in different registers also varies across structural types. In conversation, nearly 90% of lexical bundles are declarative or interrogative clause segments. In contrast, the lexical bundles in academic prose are basically phrasal rather than clausal. Of the four registers in Biber’s study, lexical bundles are considerably more frequent in classroom teaching because this register uses the types of lexical bundles associated with both conversation and academic prose.
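If you have the raw texts, the 4-word bundles (4-grams) that Biber works with can be counted with a short script. Below is a minimal sketch in Python; the file name "essay.txt" and the crude regex tokenisation are illustrative assumptions only, not part of Biber's method.

from collections import Counter
import re

# Minimal sketch: count 4-word lexical bundles (4-grams) in one plain-text file.
# "essay.txt" and the simple lowercase regex tokenisation are assumptions
# made purely for illustration.
with open("essay.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

n = 4
bundles = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# print the 20 most frequent 4-grams with their raw frequencies
for bundle, freq in bundles.most_common(20):
    print(freq, bundle, sep="\t")

Changing n to 5, 6, 7 or 8 gives the longer clusters mentioned above.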

References:

Stubbs, M. 2002. ‘Two quantitative methods of studying phraseology in English’. International Journal of Corpus Linguistics 7/2: 215-244.

Biber, D. and Conrad, S. 1999. ‘Lexical bundles in conversation and academic prose’ in H. Hasselgard and S. Oksefjell (eds.) Out of Corpora: Studies in Honour of Stig Johansson, pp. 181-189. Amsterdam: Rodopi.

Biber, D. 2003. ‘Lexical bundles in academic speech and writing’ in B. Lewandowska-Tomaszczyk (ed.) Practical Applications in Language and Computers, pp. 165-178. Frankfurt: Peter Lang.
 
Re: LOCNESS corpus

Dear xiaoz, I don't know how to express my gratitude to you. You're really great!
But can you do me another favour and upload the bigram and trigram lists for use with WordSmith 3? The bigram and trigram lists found here are all incomplete.
 
Re: LOCNESS corpus

Quoting tiger's post of 2005-7-9 14:02:55:
Dear xiaoz, I don't know how to express my gratitude to you. You're really great!
But can you do me another favour and upload the bigram and trigram lists for use with WordSmith 3? The bigram and trigram lists found here are all incomplete.

Have sent them to your email address. Files too large to upload.

...but they were returned. Is your email address valid?

 
Re: LOCNESS corpus

Then would you please send them one by one to Xingbingliutiger@163.com? Something must have gone wrong.
 
Two questions

First, how do I sum up the selected column in WordSmith 3? The way I know is too complicated and inconvenient: copy it into a .txt file, then copy the list into an Excel file, and finally sum the column.
[Screenshot: 2005071009024745.jpg]


Second, how do I set a threshold frequency when doing an n-gram search with WordSmith 3?
 
Re: LOCNESS corpus

1) You can select "File - Statistics" in the menu, or click the sigma icon on the toolbar, to see statistics about your wordlist, bigram/trigram list, etc. What you want in your posting is shown as Token; the number of items is shown as Type.
[Screenshot: 2005071021023117.jpg]


2) The threshold value (minimum frequency) can be set before you make a wordlist, bigram list, etc. (Settings - Adjust settings - Wordlist).
[Screenshot: 2005071021052045.jpg]


Or you can re-sort the list in terms of frequency to cut off all items below a value.
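If the cutoff ends up being done outside WordSmith, a short filter does the same job. A sketch, where bundle_counts stands in for whatever n-gram counts you have built and min_freq = 5 is an arbitrary example value:

from collections import Counter

# Hypothetical n-gram counts; keep only items at or above the threshold.
bundle_counts = Counter({"on the other hand": 12, "as a matter of fact": 3})
min_freq = 5
frequent = {gram: f for gram, f in bundle_counts.items() if f >= min_freq}
print(frequent)   # {'on the other hand': 12}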
 
Re: LOCNESS corpus

I know how to set the threshold frequency for clusters now. Thank you.
But it seems that the "Tokens" row in the n-gram list window of my WordSmith 3 always shows the total number of single-word tokens in all the texts in the subcorpus. What is wrong with my WordSmith or with my settings?
The following is the statistics window for the 8-gram list. The "Types" row shows the right figure, but the "Tokens" figure is the total number of single-word tokens in the subcorpus.
[Screenshot: 2005071022250892.jpg]


The following is the statistics window for the 7-gram list. The same thing happens there.
[Screenshot: 2005071022272574.jpg]



 
 
Sorry for having mistakenly posted the same message again.

 
Right. It appears that the number of tokens in Statistics is the number of 1-gram tokens (i.e. the wordlist total). Try copying the row of frequencies into Excel to get the total.
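If Excel feels clumsy, a script can total the frequencies from a saved list instead. A sketch, assuming the cluster list has been saved as plain text with the frequency in the second tab-separated column ("clusters.txt" is just a made-up name; adjust the column index to match your own export):

# Sketch: sum the frequency column of an n-gram list saved as plain text.
# Assumes tab-separated lines with the frequency in the second column.
total = 0
with open("clusters.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[1].isdigit():
            total += int(parts[1])
print("total n-gram tokens:", total)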
 
The corpora I extracted for comparison with LOCNESS in terms of n-grams amount to 117,600 tokens each, but LOCNESS totals 324,203 tokens. How do I standardize the n-gram frequencies from LOCNESS so that they are comparable with those from the other corpora?

And the "tokens" row in your screen dump at No. 32 also shows the total of tokens in locness corpus rather than those of the 2-grams in locness corpus.
 
Yes. The token number shown in WordSmith is the number of single-word tokens (i.e. the number of running words in the corpus), not the number of n-grams.

See my posting "Statistics won't bite" for normalization.
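In short, scale the raw LOCNESS frequencies to a common basis. A minimal sketch using the token counts mentioned above (raw_freq = 50 is just an invented example figure, not taken from either corpus):

# Normalize a raw LOCNESS frequency to the size of the comparison corpora,
# or to a fixed basis such as per 100,000 words.
locness_size = 324203     # tokens in LOCNESS
reference_size = 117600   # tokens in each comparison corpus
raw_freq = 50             # invented example frequency

scaled = raw_freq * reference_size / locness_size
per_100k = raw_freq * 100000 / locness_size
print(round(scaled, 2), round(per_100k, 2))   # 18.14 15.42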
 
Yes, normalization is what I need.
Many thanks.

 
Recently I have been unable to get in touch with Granger or De Cock. Where in China can I get hold of LOCNESS? Anyone who can help, please contact me. I have some other corpora of my own and would be happy to exchange resources with fellow researchers.
 