View full version: LOCNESS corpus: US-UK students' writing
xiaoz
2005-06-15, 03:14 AM
LOCNESS: Louvain Corpus of Native English Essays
Use this corpus to compare English produced by foreign learners of English (e.g. ICLE, Longman Learners Corpus, and CLEC) and by native speaker students.
http://www.corpus4u.org/upload/forum/2005061503142048.pdf
xujiajin
2005-06-17, 02:12 AM
LOCNESS is a corpus of native English essays made up of:
- British pupils' A level essays: 60,209 words
- British university students' essays: 95,695 words
- American university students' essays: 168,400 words
Total number of words: 324,304 words
To order a copy of the corpus, contact Sylviane Granger (granger@lige.ucl.ac.be) or
Sylvie De Cock (decock@lige.ucl.ac.be)
xiaoz
2005-06-17, 04:36 AM
I have a copy of the corpus which I POS-tagged using CLAWS. If you can prove that you have obtained a copy of the raw corpus from Granger or De Cock, I will be happy to provide you with my tagged version.
xujiajin
2005-06-20, 12:37 AM
Is LOCNESS payable or free (from Granger or De Cock)?
xiaoz
2005-06-20, 12:59 AM
should be free of charge
xiaoz
2005-06-20, 05:10 AM
A frequency list based on the LOCNESS corpus
http://www.corpus4u.org/upload/forum/2005062005112020.zip
xujiajin
2005-06-21, 01:01 AM
Sylviane Granger (granger@lige.ucl.ac.be)
Sylvie De Cock (decock@lige.ucl.ac.be)
Haiyang
2005-06-21, 11:31 PM
Wonderful.
I'm currently doing research on connectors based on a learner corpus, so this will be really helpful to me.
Very interested in this corpus.
Thanks Richard and Jiajin.
Thanks a lot! Is it really free of charge? Great!
tiger
2005-07-04, 06:01 PM
Where in China can I get hold of this corpus? Anyone interested, please contact me.
tiger
2005-07-04, 06:15 PM
Is there anyone who has got the corpus free of charge?
xiaoz
2005-07-05, 04:46 AM
I think it is more advisable to get it from its owners, as mentioned in Jiajin's reply above.
xiaoz
2005-07-05, 04:53 PM
88888
tiger
2005-07-06, 08:11 AM
I sent an email to Sylviane Granger at granger@lige.ucl.ac.be and Sylvie De Cock at decock@lige.ucl.ac.be to order a copy of this corpus, but there has been no reply.
xiaoz
2005-07-06, 08:27 AM
Be patient. Maybe they are on holidays.
Is it free of charge?
xiaoz
2005-07-07, 09:35 AM
See No. 5 above.
tiger
2005-07-08, 01:41 PM
xiaoz, can you upload an unlemmatised wordlist of LOCNESS in WordSmith wordlist format, please?
xiaoz
2005-07-08, 09:53 PM
Quoting tiger's post of 2005-7-8 13:41:01:
xiaoz, can you upload an unlemmatised wordlist of LOCNESS in WordSmith wordlist format, please?
Here you are:
http://www.corpus4u.org/upload/forum/2005070821513779.zip
For use with WordSmith 3. If you are using WST4, there is a tool in version 4 that helps you to convert WST3 data to WST4 format.
xujiajin
2005-07-08, 10:03 PM
Richard really is a kind soul who never turns down a request!
tiger
2005-07-09, 12:13 AM
Thank you very much. I can open it with WordSmith 3.
But word clusters of LOCNESS still cannot be extracted without the texts. I'll be patient and wait for the compiler's reply.
tiger
2005-07-09, 12:14 AM
He really does answer every request. My respects.
xiaoz
2005-07-09, 12:22 AM
There are some high-frequency bigram and trigram lists based on FLOB, LOCNESS and CLEC in the Native Corpora section which you may find useful.
tiger
2005-07-09, 12:40 AM
Yes. I've noticed that. Many thanks.
I also need clusters of 4, 5, 6, 7 and 8 words. These are hard to find.
xiaoz
2005-07-09, 01:04 AM
LOCNESS 4-8-gram lists (zipped, for WST3)
http://www.corpus4u.org/upload/forum/2005070901052455.zip
xiaoz
2005-07-09, 01:12 AM
Lexical bundles, also called lexical chains or multiword units, are closely associated with collocations and have been an important topic in lexical studies (e.g. Stubbs 2002). More recently, Biber found that lexical bundles are also a reliable indicator of register variation (e.g. Biber and Conrad 1999; Biber 2003). Biber and Conrad (1999), for example, showed that the structural types of lexical bundles in conversation are markedly different from those in academic prose. Biber’s (2003) comparative study of the distribution of 15 major types of 4-word lexical bundles (technically known as 4-grams) in the registers of conversation, classroom teaching, textbooks and academic prose indicates that lexical bundles are significantly more frequent in the two spoken registers. The distribution of lexical bundles in different registers also varies across structural types. In conversation, nearly 90% of lexical bundles are declarative or interrogative clause segments. In contrast, the lexical bundles in academic prose are basically phrasal rather than clausal. Of the four registers in Biber’s study, lexical bundles are considerably more frequent in classroom teaching because this register uses the types of lexical bundles associated with both conversation and academic prose.
References:
Stubbs, M. 2002. ‘Two quantitative methods of studying phraseology in English’. International Journal of Corpus Linguistics 7/2: 215-244.
Biber, D. and Conrad, S. 1999. ‘Lexical bundles in conversation and academic prose’ in H. Hasselgard and S. Oksefjell (eds.) Out of Corpora: Studies in Honour of Stig Johansson, pp. 181-189. Amsterdam: Rodopi.
Biber, D. 2003. ‘Lexical bundles in academic speech and writing’ in B. Lewandowska-Tomaszczyk (ed.) Practical Applications in Language and Computers, pp. 165-178. Frankfurt: Peter Lang.
tiger
2005-07-09, 02:02 PM
Dear xiaoz, I don't know how to express my gratitude to you. You're terribly great!
But can you do me another favour and upload the bigram and trigram lists for use with WordSmith 3? The bigram and trigram lists found here are all incomplete.
xiaoz
2005-07-09, 09:17 PM
Quoting tiger's post of 2005-7-9 14:02:55:
Dear xiaoz, I don't know how to express my gratitude to you. You're terribly great!
But can you do me another favour and upload the bigram and trigram lists for use with WordSmith 3? The bigram and trigram lists found here are all incomplete.
I have sent them to your email address, as the files are too large to upload.
...but the message was returned. Is your email address valid?
tiger
2005-07-09, 10:02 PM
Then would you please send them one by one to Xingbingliutiger@163.com? Something must have gone wrong.
tiger
2005-07-09, 10:06 PM
Got it. Millions of thanks.
tiger
2005-07-10, 09:09 AM
First, how can I sum the selected column in WordSmith 3? The only way I know is complicated and inconvenient: copy it into a .txt file, then copy the list into an Excel file, and finally sum the column.
http://forum.corpus4u.org/upload/forum/2005071009024745.jpg
Secondly, how can I set a threshold frequency when doing an n-gram search in WordSmith 3?
xiaoz
2005-07-10, 09:06 PM
1) You can select "File - statistics" in the menu, or click on the sigma icon on the toolbar, to see statistics about your wordlist or bigram/trigram list. What you want in your posting is shown as Token. The number of items is shown as Type.
http://forum.corpus4u.org/upload/forum/2005071021023117.jpg
2) The threshold value (minimum frequency) can be set before you make a wordlist/bigram etc (Settings - Adjust settings - Wordlist)
http://forum.corpus4u.org/upload/forum/2005071021052045.jpg
Or you can re-sort the list in terms of frequency to cut off all items below a value.
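For anyone working with an exported list outside WordSmith, the cut-off described above can be sketched in a few lines of Python. This is only an illustration: the function name `apply_threshold` and the toy frequencies are made up here and are not taken from LOCNESS or from WordSmith itself.

```python
from collections import Counter


def apply_threshold(freqs, min_freq):
    """Keep only items whose frequency is at least min_freq."""
    return {item: n for item, n in freqs.items() if n >= min_freq}


# Toy frequency list (invented numbers, not taken from LOCNESS).
toy = Counter({"of the": 12, "in the": 9, "on the other": 3, "at a": 1})

kept = apply_threshold(toy, min_freq=3)

# Print the surviving items from most to least frequent.
for item, n in sorted(kept.items(), key=lambda kv: -kv[1]):
    print(item, n)
```

Setting the threshold before building the list and cutting the sorted list afterwards should give the same items, as long as the same minimum value is used.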
tiger
2005-07-10, 10:27 PM
I know how to set the threshold frequency for clusters now. Thank you.
But it seems that the "tokens" row in the n-gram list window in my WordSmith 3 always shows the total number of single-word tokens in all the texts of the subcorpus. Is something wrong with my WordSmith or with my settings?
The following is the statistics window for the 8-gram list. The "types" row shows the right figure, but the number of "tokens" is the total number of single-word tokens in the subcorpus.
http://forum.corpus4u.org/upload/forum/2005071022250892.jpg
The following is the statistics window for the 7-gram list. The same thing has happened.
http://forum.corpus4u.org/upload/forum/2005071022272574.jpg
xiaoz
2005-07-10, 10:59 PM
Right. It appears that the number of tokens in Statistics is the number of 1-grams (wordlist). Try copying the row of frequencies into Excel to get the total.
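The difference between the two totals can be illustrated with a small Python sketch. It is an assumption-laden toy: the helper `ngram_counts` is hypothetical, the sentence is invented, and WordSmith's own cluster computation may treat punctuation and text boundaries differently.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Invented toy text, not taken from any corpus.
words = "the cat sat on the mat and the cat sat on the rug".split()

bigrams = ngram_counts(words, 2)
word_tokens = len(words)               # single-word tokens (what Statistics reports)
bigram_types = len(bigrams)            # distinct bigrams
bigram_tokens = sum(bigrams.values())  # total bigram occurrences (sum of the frequency column)

print(word_tokens, bigram_types, bigram_tokens)
```

Summing the frequency column, as suggested above, corresponds to `bigram_tokens` here, which differs from the single-word token count.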
tiger
2005-07-10, 11:07 PM
The corpora I extracted for comparison with LOCNESS in terms of n-grams amount to 117,600 tokens each, but the total number of tokens in LOCNESS is 324,203. How, then, can I standardize the n-gram values from LOCNESS so that they are comparable with those from the other corpora?
And the "tokens" row in your screen dump at No. 32 also shows the total number of tokens in the LOCNESS corpus rather than the number of 2-grams in it.
xiaoz
2005-07-10, 11:21 PM
Yes. The token number shown in WordSmith is the number of single tokens (i.e. the number of words in a corpus), not the number of n-grams.
See my posting "Statistics won't bite" for normalization.
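As a rough illustration of normalization (not necessarily the exact procedure in that posting), a raw frequency can be converted to a rate per common base, such as per 100,000 words. The function name `normalise` and the raw counts below are invented for the example; only the two corpus sizes come from this thread.

```python
def normalise(raw_freq, corpus_size, base=100_000):
    """Convert a raw frequency into a rate per `base` words."""
    return raw_freq / corpus_size * base


# Corpus sizes as quoted in this thread; the raw counts are invented.
learner_rate = normalise(50, 117_600)
locness_rate = normalise(90, 324_304)

print(f"{learner_rate:.1f} vs {locness_rate:.1f} per 100,000 words")
```

Once both figures are expressed per 100,000 words, they can be compared directly even though the corpora differ in size.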
tiger
2005-07-10, 11:54 PM
Yes. It is normalization.
Many thanks.
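I have been unable to get in touch with Granger or De Cock recently. Where in China can I get hold of LOCNESS? Anyone interested, please contact me. I have some other corpora and would be happy to exchange resources with colleagues.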
tiger
2005-07-25, 10:24 PM
I also emailed them, but have received no reply so far.
tiger
2005-08-10, 12:28 AM
Is this corpus really freely available?
I emailed them a month ago, but I still have had no reply.
xiaoz
2005-08-10, 01:03 AM
Maybe they do not really wish to distribute the corpus to the general public, but it is still there:
http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/locness1.htm
tiger
2005-08-11, 08:32 PM
Quoting 刘语料's post of 2005-8-11 15:56:38:
Where in China can I get hold of this corpus? Anyone interested, please contact me.
Please contact me too; we can exchange resources.
tiger
2005-08-11, 08:43 PM
Quoting xiaoz's post of 2005-8-10 1:03:08:
Maybe they do not really wish to distribute the corpus to the general public, but it is still there:
http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/locness1.htm
Mr. Xiao, this link is not valid. Are there other links?
xiaoz
2005-08-11, 08:58 PM
The link above only gives a description of the corpus.
The corpus was never there for downloading.
I have contacted the compilers of this corpus twice and have heard nothing at all! Why?
tiger
2005-08-12, 11:23 AM
Quoting xiaoz's post of 2005-8-11 20:58:06:
The link above only gives a description of the corpus.
The corpus was never there for downloading.
I C. thanx.
hancunxin
2005-09-09, 08:53 PM
Quoting xiaoz's post of 2005-6-15 3:14:30:
LOCNESS: Louvain Corpus of Native English Essays
Use this corpus to compare English produced by foreign learners of English (e.g. ICLE, Longman Learners Corpus, and CLEC) and by native speaker students.
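The article 《中国英语专业学生make 的使用特点调查报告》 ("A survey of how Chinese English majors use make", Journal of PLA University of Foreign Languages, 2002, No. 4) uses LOCNESS as the native-speaker corpus for comparison with the subcorpus of TEM-4 and TEM-8 test essays by English majors. Is that acceptable?
My question is: should LOCNESS be compared with the whole of CLEC, or only with the English-major subcorpora?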
xiaoz
2005-09-09, 09:10 PM
LOCNESS is composed of writing by senior high school students and junior university students in the UK and US, so it should be comparable to CLEC in terms of student age. Data produced by TEM-4/8 students and non-English majors in China are both learner data.
In the absence of a more suitable dataset, the TEM-4/8 corpus can be compared with LOCNESS. I have even found people comparing CLEC with LOB and Brown!
hancunxin
2005-09-09, 09:17 PM
Quoting xiaoz's post of 2005-9-9 21:10:34:
LOCNESS is composed of writing by senior high school students and junior university students in the UK and US, so it should be comparable to CLEC in terms of student age. Data produced by TEM-4/8 students and non-English majors in China are both learner data.
In the absence of a more suitable dataset, the TEM-4/8 corpus can be compared with LOCNESS. I have even found people comparing CLEC with LOB and Brown!
Do you mean that LOCNESS can't be compared with the middle school students' writing in CLEC, and that it is better compared with TEM-4/8 than with COLEC?
hancunxin
2005-09-09, 09:47 PM
Quoting xiaoz's post of 2005-9-9 21:10:34:
LOCNESS is composed of writing by senior high school students and junior university students in the UK and US, so it should be comparable to CLEC in terms of student age. Data produced by TEM-4/8 students and non-English majors in China are both learner data.
Dr. Xiao, I have read your post, but I still don't quite understand. My thesis tries to find out whether Chinese students overuse or underuse the present perfect in their writing. Should I compare LOCNESS with the full CLEC or with the TEM-4/8 subcorpus only? What can I do to make my study more representative of Chinese students? If I choose only TEM-4/8, what about the middle school students and non-English majors? I am very confused and look forward to your reply!
xiaoz
2005-09-09, 10:03 PM
As I said, LOCNESS consists of data from both high school and university students while both English and non-English major writings are learner data. So you should compare the whole CLEC corpus with LOCNESS. But normalise raw frequencies in your comparison as the two corpora are of different sizes.
hancunxin
2005-10-02, 06:25 PM
LOCNESS is a learner corpus. Does that mean that the students' English is not completely correct, i.e. that there are some mistakes in it?
xujiajin
2005-10-02, 06:40 PM
Can we say Chinese students' Chinese compositions are not native Chinese?
LOCNESS is not a learner corpus as learner corpora in most cases refer to L2 learners' linguistic output.
hancunxin
2005-10-02, 07:23 PM
Quoting xujiajin's post of 2005-10-2 18:40:02:
Can we say Chinese students' Chinese compositions are not native Chinese?
LOCNESS is not a learner corpus as learner corpora in most cases refer to L2 learners' linguistic output.
Yes, you are right, Dr. Xu. LOCNESS is a native corpus instead. What is the difference between the BNC and LOCNESS? I think they are different. The English in the BNC has been revised or edited; for example, some of it is taken from newspapers or literary masterpieces. LOCNESS, by contrast, is composed of students' writing, in which they may make many errors. What do you think?
xujiajin
2005-10-02, 08:36 PM
LOCNESS and CLEC are comparable corpora.
Comparative studies can be done with the two databases, while BNC is for general purpose linguistic studies.
Slips of the pen are possible in LOCNESS, but this will not affect the nativeness of the corpus in general.
xiaoz
2005-10-02, 09:39 PM
LOCNESS is what I call an L1 "developmental corpus", as opposed to an L2 learner corpus.
hancunxin
2005-10-03, 04:46 PM
Quoting xiaoz's post of 2005-10-2 21:39:53:
LOCNESS is what I call an L1 "developmental corpus", as opposed to an L2 learner corpus.
I can't agree more. I think such a distinction should be made. So what we refer to as a "learner corpus" is actually an "L2 learner corpus".
laohong
2005-11-08, 11:17 PM
Quoting asan82's post of 2005-10-11 13:40:29:
I sent her an email but got no reply. LOCNESS may be good, but... (sob)
Quoting bravo's post of 2005-10-11 14:07:43:
I have emailed the holders several times, but received no reply.
Quoting oscar3's post of 2005-10-11 14:05:22:
So did I. DEPRESSING.
In a casual chat with the developers, it seems that they are not really happy to see the "popularity" of LOCNESS in some countries, coz they are surprised to find that there are some people working with LOCNESS corpus though these people got the corpus NOT "via official channels". This, of course, is against the term in the agreement form, "no part of the corpus is to be distributed to a third party without specific authorization from CECL". Maybe this explains why some enquiries about the corpus were not entertained.
BTW, the corpus is not free, the license fee is 100 Euro per copy. Pls see the LOCNESS form below:
LOUVAIN CORPUS OF NATIVE ENGLISH ESSAY WRITING
The LOCNESS corpus is a corpus of native English essay writing (university level and A-level) and is available under the following conditions:
(1) the corpus is to be used for non-commercial purposes only
(2) all publications on research partly or wholly based on the corpus should give credit to the Centre for English Corpus Linguistics (CECL), Université Catholique de Louvain, Belgium. A copy or offprint of the publication should also be sent to CECL
(3) no part of the corpus is to be distributed to a third party without specific authorization from CECL.
The corpus can only be used by the person signing the licence form and researchers working in close collaboration with him/her or students under his/her supervision, attached to the same institution.
(4) Payment of 100 EURO. Method of payment: by international money order or Western Union (in EURO) to
S. Granger. Centre for English Corpus Linguistics.
If you are interested in the corpus and agree to the above conditions, please complete and sign the form and return it to:
Professor Sylviane Granger
Centre for English Corpus Linguistics
Université Catholique de Louvain
1, place Blaise Pascal
B-1348 Louvain-la-Neuve, Belgium
NAME:
INSTITUTION:
DEPARTMENT:
ADDRESS:
E-MAIL:
TELEPHONE:
FAX:
TOPIC OF RESEARCH:
TYPE OF PUBLICATION: MA Dissertation / PhD / Conference paper/ Article / Book
I agree to the above-mentioned conditions stated and enclose an international money order
----------------------- ----------------------------------
(date) (signature)
Yet it is essential for them to tell anyone interested in the corpus how to gain access to it, if possible. Wider access is good for the corpus as well as for the promotion of corpus linguistics at large. Unfortunately, nothing has been done about it. It is a pity! Under the circumstances, going through unofficial channels is reasonable and sensible!!
asan82
2005-11-16, 01:50 PM
So that's how it is.
chrisyang
2005-11-16, 04:51 PM
I've got a licensed copy of LOCNESS, which cost 100 euros. Could you send me your tagged version of it? Many thanks!
asan82
2005-12-06, 04:41 PM
That's about eight or nine hundred RMB.
It seems research really can't be done without funding.
slgg6985
2006-02-08, 09:34 PM
I can't download it. It's only a pdf file introducing LOCNESS.
seanxpq
2006-03-21, 10:35 PM
Quoting xiaoz's post of 2005-9-9 21:10:34:
In the absence of a more suitable dataset, the TEM-4/8 corpus can be compared with LOCNESS. I have even found people comparing CLEC with LOB and Brown!
Dr. Xiao,
Do you mean that (without a more suitable dataset) it is not good enough to compare CLEC with LOB, Brown or ICE-GB? Thanks!
xiaoz
2006-03-21, 10:43 PM
There is an over-30-year gap between CLEC and LOB/Brown, which is long enough for language change. ICE contains both spoken (60%) and written data but CLEC is only written. I think you can compare CLEC with the written part of ICE if you do not have more comparable data.
jackzch
2006-03-23, 12:13 AM
Many thanks, Dr Xiao.
jackzch
2006-06-10, 08:28 PM
Quoting xiaoz's post of 2005-6-20 0:59:47:
should be free of charge
Dr. Xiao, if that's free, how could we get a copy? Many of us are in need of LOCNESS.
xiaoz
2006-06-10, 08:48 PM
I'm afraid it is not free. I think it costs 100 euros.
seanxpq
2006-07-09, 11:19 PM
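It would be great if we could organise another group purchase (though a group purchase really is a lot of trouble for the administrators)! Or a few of us could pool some money, buy it and share it!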
majorlv511
2006-08-15, 11:00 AM
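So expensive! I have been looking for this corpus all along, and I also emailed the authors, but never got a reply!
Only today did I learn the real situation. How frustrating!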
hedy_sisu
2006-08-29, 01:32 PM
Does anyone have the LOCNESS corpus? I need one part of it: University of Michigan (codes: ICLE-US-MICH-0001.1-45.1)
43 essays (16,502 words)
Timed essays
Argumentative
No reference tools used
All students are fully English native speakers (except for - nr23 & nr34)
No indication of NL for nr36
Age: 19-23
Topic: Great inventions and discoveries of 20th century and their impact on people’s lives (one per interview - computer, television, etc.)
I only need the raw material. Could some kind person help me out? Thank you very, very much! I am also willing to pay. Please email hedy_sisu@163.com. Thanks! Thanks!!
wangshuhua
2006-09-19, 09:08 AM
I too hope some kind person can share it with me; paying for it is only fair. If you can help, please contact w_sh04@163.com
xujiajin
2006-09-19, 09:26 AM
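That should not be done. If money is to change hands, the corpus should be bought from the original authors.
Corpus4U has never encouraged the circulation of copyrighted resources, software and the like.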
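A question for those who have LOCNESS: what are the codes for the British pupils' A-level essays?
The copy of LOCNESS I received has no subcorpora; everything is in one file. I only know that ICLE-US marks the American university students' essays; I am not sure about the others, such as ESME and ESLE, and they are hard to tell apart.
Could any teacher or student who has this corpus please explain? Many thanks! :)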
lilyxu7988
2008-07-11, 10:35 PM
Have sent them to your email address. Files too large to upload.
...but returned. Is your email address valid?
Would you please send them to me, sunny200385@yahoo.com.cn
Thanks