Is the POS-tagged CLEC reliable?

hancunxin

Moderator
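POS tagging, I think, only works on correct grammar and well-formed sentences. Does it also work on erroneous language?
The student essays in CLEC contain many errors. Can a POS tagger correctly tag these non-standard, incorrect sentences and words? I doubt it. Take these two sentences, for example: I sorrow to you. He courage me back school. Both are errors that students may well make. A POS tagger will very likely fail to tag them correctly, because the sentences do not follow conventional grammar. By the same reasoning, can the POS-tagged CLEC be trusted?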
 

xujiajin

Administrator
Staff member
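According to frankliang, tagging accuracy on learner data is, surprisingly, even higher than on native-speaker data, because learners rarely use words in flexible or creative ways and their vocabulary is relatively simple.

Of course, it is still best to check the tags manually after tagging.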
 

xiaoz

Permanent Super Administrator
Staff member
From: owner-corpora@lists.uib.no on behalf of Xiaotian Guo
Sent: Fri 22/07/2005 09:51
To: CORPORA@uib.no; Adam Kilgarriff
Subject: Re: [Corpora-List] POS-tagging for spoken English and learner English


Hi, Adam and colleagues

I agree with Paul in that "For learner data … POS tagging accuracy
depends on how advanced the learners are".

I have tried to have a native speaker corpus, LOCNESS and a learner
corpus COLEC, as I call it, POS tagged. It works perfectly well with
LOCNESS. But unfortunately, I was let down by the inaccuracy of the
tagging to COLEC due to the special features of the learners errors. I
am not a computer person, but I speculate that when a tagging system
is devised, it would be based on the syntax rules most native speakers
abide by. However, non-native speakers, especially those with an
intermediate level or below would not produce the language in the way
native speakers produce. You can hardly imagine how messy learner
English could be. That would cause a huge problem to the POS tagging
to a learner corpus and very likely indeed would disable the whole
tagging system. Granger discussed this point in her article in

Granger S., Hung J. and Petch-Tyson S. (eds) 2002. Computer Corpora,
Second Language Acquisition and Foreign Language Teaching. Amsterdam:
John Benjamins Publishing Company.


Of course, it does not mean there will be no solutions to this. If
people try hard enough, they may come up with a better accuracy rate.
As far as I can see (pardon me if I am talking nonsense), at least the
tagging system should not be based on the native speaker syntax rules.
Perhaps the tagging system should be trained with adequate learner
English data? But the problem is that it is hard to find a set of
syntax rules to learner English. Anyway, I will keep all my fingers
crossed for those who are dealing with this part of tagging system
design.

All the best

Xiaotian Guo
PhD Candidate
The Department of English
The University of Birmingham
 

hancunxin

Moderator
I couldn't agree more. But this differs from what xujiajin said above: according to frankliang, tagging accuracy on a learner corpus was found to be higher.

[This post was edited by the author on 2005-09-13 at 18:57:03]
 

frankliang

Regular Member
Re: Is the POS-tagged CLEC reliable?

I conducted a pilot study in which I tagged a set of 16 learner-written essays with both the Brill tagger and CLAWS. The result surprised me: the transformation-based tagger (Brill) reached an accuracy of roughly 93%, while the probability-based tagger (CLAWS) exceeded 98%.
I would guess that accuracy on spoken data will be much lower, since learners' spoken sentences can be a real mess.
The limitations of my pilot study are clear, though: the data comprise essays written by students at one of the best universities in China. On less proficient learner writing, the accuracy may be lower.
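
For anyone who wants to repeat this kind of check, accuracy figures like these are usually token-level: the share of tokens whose automatic tag matches a hand-corrected tag. Here is a minimal sketch of that calculation. The function name and the Penn-style tags are illustrative assumptions, not actual Brill or CLAWS output.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must align token by token")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Hypothetical hand-corrected vs. automatic tags for
# "He courage me back school ." (illustrative tags only).
gold = ["PRP", "VBD", "PRP", "RB", "NN", "."]
predicted = ["PRP", "NN", "PRP", "RB", "NN", "."]

accuracy = tagging_accuracy(gold, predicted)
print(f"{accuracy:.1%}")  # prints 83.3%
```

One mis-tagged token out of six already drops a short sentence to 83.3%, which is why accuracy is normally reported over a sample of several thousand tokens.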

Further discussion is welcome!
 

xiaoz

Permanent Super Administrator
Staff member
Automatic POS tagging of a learner corpus: the influence of learner error on tagger accuracy

Bertus van Rooy & Lande Schäfer

In Corpus Linguistics 2003

Abstract:

In order to decide which taggers to use to part of speech (POS) tag the Tswana Learner English Corpus (TLEC), the performance of three taggers (TOSCA-ICLE, Brill tagger, and CLAWS7) was evaluated on a 5000-word sample of the corpus. Tagger accuracy was just more than 95% for CLAWS7 and slightly below 90% for the other two taggers.

An evaluation of the sample indicated that learner spelling errors contribute substantially to tagging errors, more than any other type of learner error. Spelling errors were classified into three different categories: non-word errors, real word errors and errors relating to spaces between words. Our results showed that the guessing modules of the taggers fared reasonably well when dealing with non-words, i.e. words that do not occur in the lexicon of English, and therefore not in the lexicons of the taggers. CLAWS7 guessed the POS tag of much more than 50% of the non-words correctly, while the other two guessed correctly just on or below 50%. None of the other types of misspelled words were tagged correctly with any consistency. A number of possible explanations are offered.
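
The three spelling-error categories in the abstract can be illustrated with a small sketch. The classification rule, the function name, and the toy lexicon below are my own assumptions for illustration. The paper's actual procedure is not described in the abstract.

```python
def classify_spelling_error(written, intended, lexicon):
    """Assign a misspelling to one of three categories (illustrative rule).

    spacing   : written and intended forms differ only in spaces
    real-word : the written form is itself a word in the lexicon
    non-word  : the written form is not in the lexicon at all
    """
    if written != intended and written.replace(" ", "") == intended.replace(" ", ""):
        return "spacing"
    if written.lower() in lexicon:
        return "real-word"
    return "non-word"


# Toy lexicon and hypothetical (written, intended) pairs.
lexicon = {"a", "lot", "there", "their", "because"}
examples = [("becuase", "because"), ("there", "their"), ("alot", "a lot")]

for written, intended in examples:
    print(written, "->", classify_spelling_error(written, intended, lexicon))
```

Only the non-word case gives a tagger's guessing module something to work with. A real-word error looks like a legitimate token to the tagger, which fits the abstract's finding that such errors were rarely tagged correctly.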

All spelling errors were then corrected manually and the edited sample was retagged. With the spelling errors removed, the accuracy of all three taggers improved by 2-3%. After the learner errors were classified, the effect of the various different categories of learner error on tagging accuracy was determined. Our results indicated that just more than a third of the remaining tag errors in CLAWS7 were due to learner error, while this figure dropped to about one-fifth for the other two taggers. However, some error types, such as the omission of words, incorrect conjugation of the verb and the use of incorrect lexical items had a serious detrimental effect on tagger accuracy. On the other hand, learner errors related to the use of articles, prepositions and incorrect assignment of number to common nouns seldom led to incorrect POS tags. Possible reasons for the differential effect of learner errors on tagger accuracy are explored.

Full paper available here:
http://forum.corpus4u.org/upload/forum/2005092023174960.pdf
 

xiaoz

Permanent Super Administrator
Staff member
Tagging spoken/learner/historical data using CLAWS

Source: Corpora List archive-

Folks in UCREL at Lancaster and elsewhere have got some experience of running CLAWS over corpora such as the spoken part of the BNC, MICASE, ICLE, and historical corpora (Nameless Shakespeare). My impression in general is that the statistical HMM component of the tagger provides the robustness you need for these kinds of tasks, but you need to accompany that with tweaks to the other components such as the 'idiom' lists and tokenisation.

Here's some more detail:

1. In the BNC project, the CLAWS transition probabilities were retrained on spoken data. Also there were lexicon additions, special treatment of contractions, truncated words and repetition, all closely tied to the transcription and encoding formats in the BNC spoken corpus. For more detail, see:

Garside, R. (1995) Grammatical tagging of the spoken part of the British National Corpus: a progress report. In Leech, G., Myers, G. and Thomas, J. (eds) (1995), Spoken English on Computer: Transcription, Mark-up and Application. pp.161-7.

Garside, R., and Smith, N. (1997) A hybrid grammatical tagger: CLAWS4, in Garside, R., Leech, G., and McEnery, A. (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 102-121.

Also see Nicholas Smith and Geoff Leech's manual for BNC version 2, which has error analysis comparing written and spoken:

http://www.comp.lancs.ac.uk/ucrel/bnc2/bnc2error.htm
http://www.comp.lancs.ac.uk/ucrel/claws/

2. I don't have figures for MICASE which we tagged with CLAWS or an ICLE sub-corpus, but came away with the general impression as above that the probability matrix provides robustness in these types of text which you might expect to cause problems for automatic POS annotation. For learner data, of course, POS tagging accuracy depends on how advanced the learners are. You could have a look at

Bertus van Rooy and Lande Schäfer: An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus

comparing TOSCA-ICLE, Brill tagger, and CLAWS on their data. This was presented at the learner corpus workshop at Corpus Linguistics 2003. The abstract is at
http://tonolab.meikai.ac.jp/~tono/cl2003/lcdda/abstracts/rooy.html
and the full paper is in the CL2003 proceedings.

3. In collaboration with Martin Mueller at Northwestern, we've recently been applying CLAWS to the Nameless Shakespeare corpus and looking at error rates and problems. There are other things which upset CLAWS (and would most likely do the same for other POS taggers) such as different capitalisation and variant spellings. Our approach has been to pre-process these as much as possible, retaining original variants, but fooling CLAWS, if you like, into tagging a version with modern equivalents. See:

Rayson, P., Archer, D. and Smith, N. (2005) VARD versus Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora. In proceedings of Corpus Linguistics 2005.

Our experience with Nameless Shakespeare was that CLAWS' current statistical language model copes pretty well in data from that time, but we expect that the probability matrix will need to be retrained if we attempt tagging data much earlier than 1550/1600.
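
The pre-processing idea in point 3 (tag a modernised version, keep the original spelling) can be sketched roughly as follows. This is not VARD or CLAWS code: the variant table, the toy tagger, and all names here are hypothetical stand-ins.

```python
def tag_with_normalisation(tokens, variants, tagger):
    """Tag normalised tokens but return the original forms alongside the tags.

    variants: dict mapping a lower-cased variant spelling to a modern form
    tagger:   any callable taking a list of tokens and returning a list of tags
    """
    normalised = [variants.get(tok.lower(), tok) for tok in tokens]
    tags = tagger(normalised)
    if len(tags) != len(tokens):
        raise ValueError("tagger must return exactly one tag per token")
    # Each result keeps the original token, the form the tagger saw, and the tag.
    return list(zip(tokens, normalised, tags))


def toy_tagger(tokens):
    """Stand-in for a real tagger: dictionary lookup with an UNK fallback."""
    toy_lexicon = {"I": "PRON", "love": "VERB", "thee": "PRON"}
    return [toy_lexicon.get(tok, "UNK") for tok in tokens]


variants = {"loue": "love", "haue": "have"}  # hypothetical variant table
result = tag_with_normalisation(["I", "loue", "thee"], variants, toy_tagger)
print(result)
```

Because the original token travels with its tag, nothing is lost from the source text. The same pattern could in principle be tried on learner misspellings, provided a correction table is available.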
 

frankliang

Regular Member
Re: Is the POS-tagged CLEC reliable?

Thank you, Dr. xiaoz, for all the information you provided.
My pilot study with learner spoken data shows that tagging accuracy is about 88%.
 

xiaoz

Permanent Super Administrator
Staff member
Re: Is the POS-tagged CLEC reliable?

Thanks. 88% is nearly 10 percentage points lower than for native data.

Expected tagging accuracies:

written data > spoken data
native data > learner data
higher proficiency level learner data > lower proficiency level learner data

The accuracy for learner spoken data is expected to be low. CLAWS4 was retrained on spoken data when the BNC was tagged.
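
To make "retraining" concrete: for a probabilistic tagger it roughly means re-estimating tag-transition probabilities from hand-checked data of the target variety. Below is a minimal sketch assuming a plain bigram model with add-one smoothing. It is not the actual CLAWS procedure, and the tags and sentences are made up.

```python
from collections import Counter, defaultdict


def estimate_transitions(tagged_sentences, smoothing=1.0):
    """Estimate P(tag | previous tag) from sentences of (word, tag) pairs."""
    tagset = sorted({tag for sent in tagged_sentences for _, tag in sent})
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for _, tag in sent:
            counts[prev][tag] += 1
            prev = tag
    probs = {}
    for prev in ["<s>", *tagset]:
        # Add-one style smoothing keeps unseen transitions above zero.
        total = sum(counts[prev].values()) + smoothing * len(tagset)
        probs[prev] = {t: (counts[prev][t] + smoothing) / total for t in tagset}
    return probs


# Made-up spoken-style training data with a repetition ("I I think").
spoken = [
    [("er", "UH"), ("I", "PRON"), ("I", "PRON"), ("think", "VERB")],
    [("yeah", "UH"), ("I", "PRON"), ("know", "VERB")],
]
transitions = estimate_transitions(spoken)
print(round(transitions["PRON"]["PRON"], 3))  # prints 0.333
```

A model trained only on written text would give PRON followed by PRON a very low probability. After re-estimation on spoken data, repetitions like this become less surprising to the tagger.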


Quoting frankliang's post of 2005-9-21 22:57:53:
Thank you, Dr. xiaoz, for all the information you provided.
My pilot study with learner spoken data shows that tagging accuracy is about 88%.
 

frankliang

Regular Member
Re: Is the POS-tagged CLEC reliable?

Thanks, xiaoz. The quality of the transcription probably has a great effect on the accuracy rate.
It seems natural that tagging accuracy on learner spoken data is the lowest, while that on native speaker written data is the highest.
 