View full version: Grammar and Discourse: POS tags in written and spoken Chinese


xiaoz
2005-11-10, 12:52 AM
Statistics for your reference -

Distribution of POS tags in written and spoken Chinese

http://forum.corpus4u.org/upload/forum/2005111000531539.doc


[This post was edited by the author on 2005-11-11 at 21:53:39]

xujiajin
2005-11-10, 12:11 PM
How can you be sure that a POS tagset informed by written Chinese can be applied to LLSCC?

I don't think ICTCLAS can be trusted for the POS tagging results without any careful hand editing.

yinghuang
2005-11-10, 12:31 PM
I agree with Dr Xu. Part of speech is still unsettled both in theory and in practice. It does need careful hand editing.

laohong
2005-11-10, 01:35 PM
Quoting xiaoz, 2005-11-10 0:52:13:
Statistics for your reference -
Distribution of POS tags in written and spoken Chinese
http://forum.corpus4u.org/upload/forum/2005111000531539.doc



The distribution pattern looks quite interesting, but would you like to share more with us about how you tagged the two corpora? Or please point us to any related references.

xujiajin
2005-11-10, 09:40 PM
I guess Richard used freeICTCLAS.
http://www.corpus4u.com/forum_view.asp?view_id=557&forum_id=8
ICTCLAS, the Chinese lexical analysis system from the Institute of Computing Technology, Chinese Academy of Sciences

xiaoz
2005-11-10, 09:54 PM
Tagsets can of course vary from tagger to tagger, or even for the same tagger (e.g. BNC C1-8 tagsets for CLAWS). But I think a tagset for a particular language can apply to both written and spoken registers, but a tagger trained with written data must be adjusted to tag spoken data. For example, CLAWS was adjusted when the spoken BNC was tagged (but using the same tagset). A tagger may also need adjusting when learner data is tagged.

ICTCLAS achieved an accuracy rate of over 97% for written general Chinese, particularly news texts with which it was trained. But for spoken Chinese, my experiments showed an accuracy rate of 85-95%, varying across spoken genres. In the frequencies posted, the written corpus was hand checked but the spoken corpus was not.
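
As a rough illustration of how accuracy figures of this kind could be computed, here is a minimal sketch that compares automatic tags against a hand-checked sample. The function name, the toy sentence and the tag letters are assumptions made for illustration only; this is not the procedure actually used for the figures above.

```python
def tagging_accuracy(gold, predicted):
    """Return the share of tokens whose predicted tag matches the gold tag.

    Both arguments are lists of (word, tag) pairs aligned token by token,
    i.e. the sketch assumes the two versions share the same segmentation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    if not gold:
        raise ValueError("cannot score an empty sample")
    correct = sum(1 for (_, g_tag), (_, p_tag) in zip(gold, predicted)
                  if g_tag == p_tag)
    return correct / len(gold)


# Toy hand-checked sample (invented for illustration, not corpus data).
gold = [("我", "r"), ("就", "d"), ("不", "d"), ("知道", "v"), ("了", "y")]
auto = [("我", "r"), ("就", "d"), ("不", "d"), ("知道", "v"), ("了", "u")]

print(tagging_accuracy(gold, auto))
```

In practice the score would be computed per subcorpus, since the accuracy is said to vary across spoken genres.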

Quoting xujiajin, 2005-11-10 12:11:38:
How can you be sure that written Chinese informed POS tagset can be applied to LLSCC?

I don't think ICTCLAS can be trusted for the POS tagging results without any careful hand editing.

xujiajin
2005-11-10, 10:04 PM
Quoting xiaoz, 2005-11-10 21:54:43:
But I think a tagset for a particular language can apply to both written and spoken registers, but ...
----Cannot agree.


ICTCLAS achieved an accuracy rate of over 97% for written general Chinese, particularly news texts with which it was trained. But for spoken Chinese, my experiments showed an accuracy rate of 85-95%, varying across spoken genres.

-----Did you run a benchmark test? As I see it, the lowest accuracy rate of 85 percent is a big difference from 97 percent in a one-million-word corpus.

xiaoz
2005-11-10, 10:07 PM
Well, the view that "Part of speech is still unsettled both in theory and in practice" is an overstatement. For many languages including Chinese, POS tagging has met with great success, with a typical error rate of 3% for general written language. Automatically POS-tagged data is sufficiently reliable for many applications.

Of course, as wzli said in another post, raw texts/transcripts are useful. The POS tags in a corpus can be easily removed if you prefer a plain text corpus. But for Chinese, it appears that most corpus exploration tools require tokenised data, which is not plain at all, because tokenisation typically goes through a process similar to POS tagging (the former is the basis of the latter). That means that unless you totally reject Chinese corpus data, some processing which is "unsettled both in theory and in practice" is inevitable.
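
To make the point about removing tags concrete, here is a minimal sketch. It assumes a hypothetical `word/tag` layout with tokens separated by spaces; the actual output format of a given tagger may differ, and the function name is made up for this example.

```python
def strip_tags(tagged_line, sep="/"):
    """Drop the POS tag from each word/tag item and return the bare tokens."""
    tokens = []
    for item in tagged_line.split():
        # Split on the LAST separator so a word containing "/" survives.
        word, _, _tag = item.rpartition(sep)
        # Items without any separator are kept unchanged.
        tokens.append(word if word else item)
    return tokens


tagged = "他/r 是/v 很/d 厉害/a"   # toy line, tags are placeholders
tokens = strip_tags(tagged)
print(" ".join(tokens))   # tags removed, word boundaries kept
print("".join(tokens))    # plain text with the segmentation discarded as well
```

Note that the first printed line is still tokenised text: removing the tags does not undo the segmentation decisions made before tagging.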

Quoting yinghuang, 2005-11-10 12:31:52:
I agree with Dr Xu. Part of speech is still unsettled both in theory and in practice. It does need careful hand editing.

xujiajin
2005-11-10, 10:10 PM
I would prefer a character-based tokenization instead of a word-based one.

xiaoz
2005-11-10, 10:10 PM
Yes, that's right.

Quoting xujiajin, 2005-11-10 21:40:48:
I guess Richard used freeICTCLAS.
http://www.corpus4u.com/forum_view.asp?view_id=557&forum_id=8
ICTCLAS, the Chinese lexical analysis system from the Institute of Computing Technology, Chinese Academy of Sciences

xiaoz
2005-11-10, 10:25 PM
1) I assume the POS categories are the same within one language, but the algorithms/rules of a tagger trained with written data must be retrained/rewritten for speech. Can you give some examples of POS categories that exist only in writing but not in speech, or vice versa?

2) The significantly lower accuracy rate (85%) of ICTCLAS applies to one spoken genre in my corpus, namely the Callhome Mandarin component. This is because the LDC had already marked up this part for proper nouns and many spoken features, which I preferred to retain in the POS-tagged version. While this preprocessing on the part of the LDC seriously affected the accuracy rate of POS tagging, it is useful for studies of spoken Chinese. For the other subcorpora in LLSCC, the tagging accuracy is very close to that for the written language.

Quoting xujiajin, 2005-11-10 22:04:33:
Quoting xiaoz, 2005-11-10 21:54:43:
But I think a tagset for a particular language can apply to both written and spoken registers, but ...
----Cannot agree.

ICTCLAS achieved an accuracy rate of over 97% for written general Chinese, particularly news texts with which it was trained. But for spoken Chinese, my experiments showed an accuracy rate of 85-95%, varying across spoken genres.

-----Did you run a benchmark test? As I see it, the lowest accuracy rate of 85 percent is a big difference from 97 percent in a one-million-word corpus.

laohong
2005-11-10, 11:50 PM
Your discussion here is very informative. Could Richard please explain a bit more about how to map the ICTCLAS tagset (for Chinese text) to the CLAWS tagset (for English text)?

xiaoz
2005-11-11, 12:06 AM
Well, there may not be direct correspondences between tagsets for different languages. While some POS categories are shared by English and Chinese, others are not (e.g. articles in English and 助词 (particles) in Chinese). I used the CLAWS tagger and different versions of its associated tagsets to show that different tagsets can be applied to one language, even with the same tagger. But within one language, a well-designed tagset can apply to both written and spoken registers.

Quoting laohong, 2005-11-10 23:50:54:
Your discussion here is very informative. Could Richard please explain a bit more about how to map the ICTCLAS tagset (for Chinese text) to the CLAWS tagset (for English text)?

xujiajin
2005-11-11, 12:19 AM
Quoting xiaoz, 2005-11-10 22:25:58:
Can you give some examples of POS categories that exist only in writing but not in speech or vice versa?

I picked a few examples at random from my SCOUT; I am not sure whether they illustrate the point.
1. 算错了,[就]加上[就是]喽
2. 他[是]很厉害
3. [那][1.0]我就不知道了
4. [呵呵],咦,谁要非要给你加
5. 那就142唉,[好]高噢…… (here [好] is an adverb)
6. 你要再[嗯],你要the嗯阿,那个的话,你就你就[闪] ([闪] is a new colloquial word)
7. 小玲:70岁,打打麻将,[啊]?
小峰:[嗯]
小玲:80岁,晒晒太阳
小峰:[嗯]
小玲:90岁,躺在床上,一百岁,挂在墙上
小峰:[嗯]
小玲:[嗯]
小峰:(hums a tune)
8.
小玲:真的啊?
小峰:啊,放心,难不倒我,[真是],随便划划,刚唱(inaudible)好好做吧
小玲:嘻嘻

xiaoz
2005-11-11, 12:23 AM
An example that best illustrates the need to retrain the tagging algorithm and rewrite the tagging rules is mm in English, which can be a noun (a unit of measurement) or an interjection. In the public release of the BNC, which was tagged using a version of CLAWS retrained for spoken English, only 10 instances of mm were tagged as a noun in the four-million-word demographically sampled component of the corpus; in the same part of the corpus tagged using the standard version of CLAWS, 2271 instances were tagged as a noun.
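
A comparison of this kind can be sketched as follows. The two toy token lists and the tag labels NOUN, NUM and ITJ are placeholders invented for the example; they are not BNC data, and the function is only an assumed way of counting how one word form is tagged in two versions of the same text.

```python
from collections import Counter


def tag_counts(tagged_tokens, word):
    """Count the tags assigned to one word form in a list of (word, tag) pairs."""
    return Counter(tag for w, tag in tagged_tokens if w.lower() == word)


# The same toy transcript tagged twice (illustrative only).
standard = [("mm", "NOUN"), ("yes", "ITJ"), ("mm", "NOUN"),
            ("5", "NUM"), ("mm", "NOUN")]
spoken = [("mm", "ITJ"), ("yes", "ITJ"), ("mm", "ITJ"),
          ("5", "NUM"), ("mm", "NOUN")]

print("standard tagger:", dict(tag_counts(standard, "mm")))
print("spoken-adjusted tagger:", dict(tag_counts(spoken, "mm")))
```

Run over two full versions of a corpus, the same count would show how far retraining shifts the analysis of an ambiguous form such as mm.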

xujiajin
2005-11-11, 12:25 AM
The so-called disfluency abounds in natural speech.

xiaoz
2005-11-11, 12:27 AM
In reply to 14:
Are you sure such usages do not exist in written Chinese (ICTCLAS tags 嗯 as e)?

xiaoz
2005-11-11, 12:32 AM
Isn't disfluency mirrored by repetitions, omissions, pauses etc.? Such features can be marked up but are NOT POS categories. They can affect the accuracy of tagging designed for mostly 'correct' and fluent language data. That's why I said there is a need for retraining for such data.

Quoting xujiajin, 2005-11-11 0:25:04:
The so-called disfluency abounds in natural speech.

xujiajin
2005-11-11, 12:36 AM
Disfluency is not "wrong" in the sense of natural language.

xujiajin
2005-11-11, 12:39 AM
Quoting xiaoz, 2005-11-11 0:32:32:
Isn't disfluency mirrored by repetitions, omissions, pauses etc.? Such features can be marked up but are NOT POS categories. They can affect the accuracy of tagging designed for mostly 'correct' and fluent language data. That's why I said there is a need for retraining for such data.

Are there any words in spoken data which cannot be POS-tagged? If so, they become outlaws of the natural language. Why are they discriminated against and expelled from grammatical analysis?

xujiajin
2005-11-11, 12:43 AM
Quoting xiaoz, 2005-11-11 0:27:40:
In reply to 14:
Are you sure such usages do not exist in written Chinese (ICTCLAS tags 嗯 as e)?


Yes, this is exactly the problem with ICTCLAS. Not all cases of 嗯 can be categorized like the English filled pauses uh, um, etc. Chinese 嗯 has some ten types of discourse functions. As such, its grammatical identity can be very hard to determine.

xiaoz
2005-11-11, 12:44 AM
I would suggest that we do not conflate POS tagging and discourse tagging. The latter can only be reliably annotated by hand. In English, the discourse functions of "oops" include, e.g. expressing mild apology, shock, or dismay. But for the POS tagging purpose, it is tagged as an interjection (ITJ in the BNC).

Quoting xiaoz, 2005-11-11 0:27:40:
In reply to 14:
Are you sure such usages do not exist in written Chinese (ICTCLAS tags 嗯 as e)?

Yes, this is exactly the problem with ICTCLAS. Not all cases of 嗯 can be categorized like the English filled pauses uh, um, etc. Chinese 嗯 has some ten types of discourse functions. As such, its grammatical identity can be very hard to determine.

xiaoz
2005-11-11, 12:52 AM
Very likely. ICTCLAS often leaves some uncommon words untagged (which I picked up and tagged by hand in my corpora). But this happens to spoken as well as written data.

Quoting xujiajin, 2005-11-11 0:39:23:
Quoting xiaoz, 2005-11-11 0:32:32:
Isn't disfluency mirrored by repetitions, omissions, pauses etc.? Such features can be marked up but are NOT POS categories. They can affect the accuracy of tagging designed for mostly 'correct' and fluent language data. That's why I said there is a need for retraining for such data.

Are there any words in spoken data which cannot be POS-tagged? If so, they become outlaws of the natural language. Why are they discriminated against and expelled from grammatical analysis?

xujiajin
2005-11-11, 12:55 AM
Then is there any good way to tag the items in square brackets when machines finally fail us?
1. 算错了,[就]加上[就是]喽
2. 他[是]很厉害
3. [那][1.0]我就不知道了
4. [呵呵],咦,谁要非要给你加
5. 那就142唉,[好]高噢…… (here [好] is an adverb)
6. 你要再[嗯],你要the嗯阿,那个的话,你就你就[闪] ([闪] is a new colloquial word)
7. 小玲:70岁,打打麻将,[啊]?
小峰:[嗯]
小玲:80岁,晒晒太阳
小峰:[嗯]
小玲:90岁,躺在床上,一百岁,挂在墙上
小峰:[嗯]
小玲:[嗯]
小峰:(hums a tune)
8.
小玲:真的啊?
小峰:啊,放心,难不倒我,[真是],随便划划,刚唱(inaudible)好好做吧
小玲:嘻嘻

xiaoz
2005-11-11, 01:04 AM
I was using "correct" for learner data. Spoken and learner data types are difficult for taggers.

Quoting xujiajin, 2005-11-11 0:36:15:
Disfluency is not "wrong" in the sense of natural language.

xujiajin
2005-11-11, 01:04 AM
Quoting xiaoz, 2005-11-11 0:32:32:
Isn't disfluency mirrored by repetitions, omissions, pauses etc.? Such features can be marked up but are NOT POS categories. They can affect the accuracy of tagging designed for mostly 'correct' and fluent language data. That's why I said there is a need for retraining for such data.

Disfluency (including repetitions, omissions, pauses and many other slips of the tongue) is ill-formed in syntactic terms, but it is psychologically real.

xujiajin
2005-11-11, 01:08 AM
Quoting xiaoz, 2005-11-11 1:04:11:
I was using "correct" for learner data. Spoken and learner data types are difficult for taggers.

Yes, Chomsky also finds it difficult to discuss natural speech, so he never gives even a glance at sloppy linguistic performance. With corpus data, however, we cannot turn a blind eye to it.

xiaoz
2005-11-11, 01:54 AM
In reply to 24.

No good idea. The only option is to manually tag the words the machine fails on, as I did in LCMC.

Also, if you want to differentiate between different discourse functions, I would suggest developing an annotation scheme and searching for and tagging relevant items by hand in a text editor. I tagged all aspect markers in LCMC using my own scheme instead of the ICTCLAS tags.

xiaoz
2005-11-11, 02:00 AM
In reply to 26 -
Agreed, these are natural phenomena in speech. Another spoken feature that is hard to process automatically is truncation. A truncated word in Chinese can become another word or a word-forming morpheme, while a truncated word in English can become another word or a non-word. I recall that in ICE-GB the parser ignores such disfluency for parsing purposes, but for the study of spoken language, all such features are important.

xiaoz
2005-11-11, 02:07 AM
In reply to 27 -

Yes, agreed. We cannot ignore "inconvenient data" in corpora. In the case of ICE-GB, pragmatism takes over by ignoring such data in parsing. It must be accepted that whether to annotate a corpus, and what types of annotation to include, are determined by the research questions a corpus is intended to address. Such decisions are also affected by existing technologies. A balance must be struck between perfection and pragmatism.

A little endnote - I am not like Chomsky, who turns a blind eye to performance data.

xiaoz
2005-11-11, 02:15 AM
The debate over whether POS categories apply to both written and spoken registers reminds me of the "differentness" vs. "sameness" approaches to the study of English grammar. The "differentness" approach is taken by the Nottingham School, while the "sameness" approach is taken by Biber et al (1999) - see another post in the forum. But still, I think POS categories are different from grammatical categories, with the former dealing with words/tokens and the latter with grammatical structures. Grammatical structures can be vastly different in writing and speech, but words are the same in both.

xujiajin
2005-11-11, 12:09 PM
I just came across another example. How should 什么 here be POS-tagged?

都是[什么]乱七八糟的。

xiaoz
2005-11-11, 07:50 PM
Similar uses are in fact not rare in written Chinese. If distinctions between discourse functions are to be made, they can be made in both written and spoken registers.

闯进 来 的 人 一 脸 凶恶 , 你 也 不 看看 这 是 什么 地方 !
听 小伙子 这么 一 说 , 红杏 大爷 才 联想 起 这 几 天 听 她 闺女 晚上 回家 唠叨 的 , 什么 抢购 风 什么 的 。
画 个 其他 什么 不伦不类 的 图形 呢 ?
再说 , 彩电 价格 那么 高 , 老百姓 买不起 骂娘 , 谈 什么 稳定 ?
倘若 这 这样 的 两 个 具体 问题 都 解决 不 了 , 还 何 谈 什么 真抓实干 ?
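
Examples like these can be pulled out of a space-segmented corpus with a simple keyword-in-context search. The sketch below assumes one sentence per line with tokens separated by spaces, as in the lines above; the function name and the context width are illustrative choices, not part of any particular tool.

```python
def kwic(lines, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    Each line is expected to be tokenised already, with tokens
    separated by whitespace.
    """
    hits = []
    for line in lines:
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, tok, right))
    return hits


lines = [
    "画 个 其他 什么 不伦不类 的 图形 呢 ?",
    "再说 , 彩电 价格 那么 高 , 老百姓 买不起 骂娘 , 谈 什么 稳定 ?",
]
for left, key, right in kwic(lines, "什么"):
    print(f"{left} [{key}] {right}")
```

The hits can then be sorted and annotated by hand if distinctions between discourse functions are wanted.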

xujiajin
2005-11-11, 09:41 PM
Some naive questions:

Again words like 什么 here cannot be pigeonholed grammatically, can they?

In other words, can they be deleted and the sentence still make good sense?

Or can we say words in written and spoken language are either grammatical or discoursal?

xiaoz
2005-11-11, 10:01 PM
In my view, grammar and discourse are two separate perspectives on linguistic analysis. Grammatical functions and discourse functions must not be conflated. POS annotation is grammatical whereas discourse annotation is discoursal, though both types of analysis can be undertaken in the same corpus.

Quoting xujiajin, 2005-11-11 21:41:51:
Some naive questions:

Again words like 什么 here cannot be pigeonholed grammatically, can they?

In other words, can they be deleted and the sentence still make good sense?

Or can we say words in written and spoken language are either grammatical or discoursal?

动态语法
2005-11-15, 12:35 AM
My 2 cents:

-Theoretically there should be separate taggers for written and spoken language; in reality, however, it is very difficult, if not impossible, to come up with these taggers. The main problems are that 1) there is far less research on spoken discourse than on written discourse, and 2) spoken and written are relative anyway.

-Currently, if we use a tagger that is based on the written language, the main problem to me is that many multi-word expressions that function as a single word will be dismantled because they are not common in the written language. Examples include 就是说, 真是的, 那什么 and 那谁: a tagger can easily split them into multiple words, while in reality they function as single words.
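
One possible workaround is to re-join such expressions after tagging. The sketch below is an assumption about how this could be done, not a feature of any existing tagger: it scans the token list and greedily merges the longest run of adjacent tokens that matches an entry in a hand-built lexicon. The lexicon holds the four expressions mentioned above; the sample sentence is invented.

```python
def merge_multiword(tokens, lexicon, max_len=4):
    """Greedily merge adjacent tokens that together form a lexicon entry."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in lexicon:
                merged.append(candidate)
                i += n
                break
        else:
            # No multi-token match starts here: keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged


lexicon = {"就是说", "真是的", "那什么", "那谁"}
tokens = ["就是", "说", "他", "不", "来"]   # invented example of an over-split line
print(merge_multiword(tokens, lexicon))
```

A merge of this kind is blind to context, so it would also join sequences that only look like the expression; hand checking would still be needed.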

xujiajin
2005-11-15, 08:51 AM
Quoting 动态语法, 2005-11-15 0:35:14:
-Currently, if we use a tagger that is based on the written language, the main problem to me is that many multi-word expressions that function as a single word will be dismantled because they are not common in the written language. Examples include 就是说, 真是的, 那什么 and 那谁: a tagger can easily split them into multiple words, while in reality they function as single words.


Agreed. That is why I prefer character-based tokenization for spoken corpora, with multi-word units treated as Chinese words or phrases.
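
For reference, character-based tokenization can be sketched in a few lines. This is only one possible reading of the idea: every non-space character becomes a token, except that runs of ASCII letters and digits (such as "the" or "142" in the transcripts above) are kept together. That exception is an assumption of the sketch, not something stated in the thread.

```python
import re

# A run of ASCII letters/digits, or else any single non-space character.
_TOKEN = re.compile(r"[A-Za-z0-9]+|\S")


def char_tokenize(text):
    """Split text into single characters, keeping ASCII alphanumeric runs whole."""
    return _TOKEN.findall(text)


print(char_tokenize("那就142唉"))
print(char_tokenize("他是很厉害"))
```

Multi-word units could then be built on top of these character tokens, for example by lexicon lookup.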