Research exercise: a corpus-based study of separable words (离合词)

动态语法

Administrator
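Chinese has what are known as separable words (离合词). For example:

站岗 (stand guard) --> 我站了一会儿岗就回去了。 ("I stood guard for a while and then went back.")
生气 (get angry) --> 你这是生的哪门子的气呀? ("What on earth are you angry about?")
睡觉 (sleep) --> 今天总算睡了个好觉。 ("I finally got a good sleep today.")

If we approach this with a corpus:
1) How can separable words be found in a corpus by automatic or semi-automatic means?
2) Has anyone done this kind of research?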
 
Re: Research exercise: a corpus-based study of separable words

The best way to do this kind of study, in my view, is to prepare a list in MS Word of, say, the 40 most commonly used words of this category (40 is chosen because the search/context string must be less than 80 characters, including the / separators). Then do the following:

1) Add a space or tab between the separable elements of each word (suppose there are only two elements);
2) Select All and convert the selection into a table;
3) Select the first column, cut and paste it into Notepad, and save it as a Unicode file (part1.txt);
4) Convert the second column back into text (now one item per line);
5) Find and Replace all ^p (new line character in Word) with / (now item1/item2/item3/);
6) Remove the final /;
7) Start WST4 and load the Chinese corpus;
8) Start Concord and use file-based concordances;
9) Type in Path\part1.txt (Path = the folder where you saved part1.txt), e.g. C:\part1.txt, and press Load;
10) In advanced search, copy the string item1/item2/item3... into the box for context;
11) Set L=0 and R=n (n>0; the greater n is, the noisier the results);
12) Search as usual.

So you get something like the result below, if this is what you want:

[Image: 2005080603570982.jpg]
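For readers without WST4, the same idea (first element as the search word, second element as context within n characters to the right) can be roughly approximated in Python. This is only a sketch under my own assumptions, not what WordSmith does internally: the function name `find_separated`, the `max_gap` parameter (standing in for R=n) and the rule that a match may not cross sentence-final punctuation are all hypothetical choices for illustration.

```python
import re

def find_separated(text, pairs, max_gap=6):
    """Return (first, inserted, second) for each separated use found in text.

    max_gap plays the role of R=n: the larger it is, the noisier the hits.
    """
    hits = []
    for first, second in pairs:
        # Shortest gap of 1..max_gap characters that does not cross
        # sentence-final punctuation or a line break (an assumption of this sketch).
        pattern = re.compile(
            re.escape(first)
            + r"([^。!?!?\n]{1," + str(max_gap) + r"}?)"
            + re.escape(second)
        )
        for m in pattern.finditer(text):
            hits.append((first, m.group(1), second))
    return hits

# The three example words from the opening post, split into their two elements.
pairs = [("站", "岗"), ("生", "气"), ("睡", "觉")]
sample = "我站了一会儿岗就回去了。你这是生的哪门子的气呀?今天总算睡了个好觉。"
for hit in find_separated(sample, pairs):
    print(hit)
```

Because the gap must contain at least one character, the unseparated form (站岗) is not reported; only uses with inserted material are.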
 
Re: Research exercise: a corpus-based study of separable words

1) Add a space or tab between separable element (suppose there are only two elements);
2) Select All and convert the selection into a table;
3) Select the first column and cut and paste into Notepad and save as a Unicode file (part1.txt);
4) Convert the second column back into text (now one item per line);
5) Find and Replace all ^p (new line character in Word) with / (now item1/item2/item3/);
6) Remove the final /;

No need to convert to a table. Hold Alt and drag to select a column (here the left, i.e. first, column).
 
Very impressive!

5) Find & Replace all ^p (new line character in Word) with / (now item1/item2/item3/)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Then all the second-column characters end up on one line. What is the point of doing this?

6) Remove the final /
7) Start WST4 and load the Chinese corpus
^^^^^^^^^^^^^^^^^^^^^^^^^^^
It has to be a corpus with one space after each character, not a corpus segmented into words or phrases.
 
Re: Research exercise: a corpus-based study of separable words

This is one of those occasions where space insertion within a word
processor comes in handy.
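If a word processor is not to hand, the space insertion can also be scripted. A minimal sketch, assuming plain Unicode text and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the name `space_out` is made up for illustration:

```python
import re

# Basic CJK Unified Ideographs block only; extension blocks are ignored in this sketch.
HAN = re.compile(r"([\u4e00-\u9fff])")

def space_out(text):
    """Put one space after every Chinese character and collapse repeated spaces."""
    spaced = HAN.sub(r"\1 ", text)
    return re.sub(r" {2,}", " ", spaced).strip()

print(space_out("今天总算睡了个好觉。"))  # 今 天 总 算 睡 了 个 好 觉 。
```

Punctuation is left as it is, so a full stop stays attached to the space that follows the last character.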
 
Re: Research exercise: a corpus-based study of separable words

Point 5:

WST allows file-based search in a simple concordance. In advanced contextual search, however, the CONTEXT text box does not accept a text file of alternative items as the main search word box does (I will suggest to Mike that he include this, if it does not disrupt the whole program). In this window, therefore, you must use / (or) to link the alternative items.
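The /-linked string can also be built outside Word. The sketch below is only an illustration: `build_context_string` and its `limit` parameter are hypothetical names, and the limit of 80 is taken from the "less than 80 characters including /" remark earlier in the thread. Lists that do not fit are split into several strings to be searched one at a time.

```python
def build_context_string(items, limit=80):
    """Join alternatives with '/' into chunks shorter than `limit` characters.

    A single item that is itself too long is still returned as its own chunk.
    """
    chunks, current = [], ""
    for item in items:
        candidate = item if not current else current + "/" + item
        if len(candidate) < limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = item
    if current:
        chunks.append(current)
    return chunks

second_elements = ["岗", "气", "觉"]
print(build_context_string(second_elements))  # ['岗/气/觉']
```

With single-character items, 40 of them joined by 39 slashes come to 79 characters, which is why a list of 40 just fits.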

Point 7:

I assume that the corpus is tokenised, not necessarily with one space between each character.

(In reply to xujiajin's post of 2005-8-6 4:31:53.)
 
Re: Research exercise: a corpus-based study of separable words

Indeed, column selection can be used in step 2).

(In reply to xujiajin's post of 2005-8-6 4:19:18.)
 
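A Corpus-Based Study of Separable Words (基于语料库的离合词研究)
【Author】 王春霞 (Wang Chunxia)
【Supervisor】 陈小荷 (Chen Xiaohe)
【Degree-granting institution】 北京语言文化大学 (Beijing Language and Culture University)
【Discipline】 Linguistics and Applied Linguistics
【Degree year】 2001
【Thesis level】 Master's
【Submitted for online publication by】 北京语言文化大学 (Beijing Language and Culture University)
【Online submission date】 2002-09-23
【Abstract】 Through an examination and analysis of large-scale corpus data, this thesis establishes how separable words, a rather special linguistic form, occur in text and what regularities govern the elements inserted into them. These regularities are summarized to obtain combination patterns (组配模式) for separable words; for separable words for which no pattern was obtained, insertion rules were written by hand. On this basis, an algorithm combining rules and statistics was designed, and separable-word tagging was evaluated in a closed test and an open test. Open-test results: precision (正确率) 81.74%, recall 98.27%. The thesis has six parts.
Part 1: Introduction. Defines some concepts relating to separable words, sets out the goals and methods of the topic, states the value and significance of the study, reviews the state and standing of separable-word research in linguistics and in natural language information processing together with the insights drawn from it, and describes the corpus used.
Part 2: Analysis of the difficulty of tagging separable words. The corpus gave us a general picture of separable words; it was preprocessed to obtain example sentences containing separable words, and a statistical analysis was carried out. Based on the examples, the difficulties and the favorable factors in separable-word research are pointed out.
Part 3: Acquisition and analysis of insertion rules for separable words. This part is the basis for determining the algorithm. We summarized the combination patterns, used them to derive a number of effective rules and other data from a large number of example sentences, and, to compensate for data sparsity, also summarized rules for some separable words by hand.
Part ...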
 
Re: Research exercise: a corpus-based study of separable words

Just design a simple algorithm and then count the results.
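One possible reading of "a simple algorithm", offered as a sketch only: for each word, count how often it occurs joined and how often its two elements occur with something inserted between them. The function name `count_uses`, the `max_gap` window and the sample text are hypothetical, and a pattern match this crude will pick up noise that needs manual checking.

```python
import re
from collections import Counter

def count_uses(text, pairs, max_gap=6):
    """Count joined and separated occurrences of each (first, second) pair."""
    counts = Counter()
    for first, second in pairs:
        word = first + second
        counts[(word, "joined")] = text.count(word)
        # 1..max_gap inserted characters, not crossing sentence-final punctuation.
        gap = r"[^。!?\n]{1,%d}?" % max_gap
        separated = re.findall(re.escape(first) + gap + re.escape(second), text)
        counts[(word, "separated")] = len(separated)
    return counts

# Made-up sample text for illustration, not corpus data.
text = "他在门口站岗。我站了一会儿岗就回去了。今天总算睡了个好觉。"
for key, n in count_uses(text, [("站", "岗"), ("睡", "觉")]).items():
    print(key, n)
```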
 
Re: Research exercise: a corpus-based study of separable words

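Does anyone have example sentences for the 40 separable words in the Chinese graded grammar syllabus (levels 甲, 乙, 丙 and 丁)?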
 