ParaConc usage


xudekuan
2006-08-11, 08:40 AM
ParaConc requires the corpus to be aligned in advance. What format should the files be in? Could anyone provide a sample?

laohong
2006-08-12, 12:17 PM
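Yes, once the texts are pre-aligned, searching becomes much easier. Here is an example for reference:

Chinese source text (10 sentences; best saved from Notepad as plain text, GB encoding):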

<seg>那時又將天晚,趙姨娘的聲音只管喑啞起來了,居然鬼嚎一般。</seg>
<seg>無人敢在她跟前,只得叫了幾個有膽量的男人進來坐著。</seg>
<seg>趙姨娘一時死去,隔了些時,又回過來,整整的鬧了一夜。</seg>
<seg>到了第二天,也不言語,只裝鬼臉,自己拿手撕開衣服,露出胸膛,好像有人剝她的樣子。</seg>
<seg>可憐趙姨娘雖說不出來,其痛苦之狀,實在難堪。正在危急,大夫來了,也不敢診脈,只囑咐辦後事罷。說了,起身就走。</seg>
<seg>那送大夫的家人再三央告,說「請老爺看看脈,小的好回稟家主。」</seg>
<seg>那大夫用手一摸,已無脈息。</seg>
<seg>賈環聽了,然後大哭起來。眾人只顧賈環,誰料理趙姨娘。</seg>
<seg>只有周姨娘心裏苦楚。</seg>
<seg>想到做偏房側室的下場頭,不過如此。</seg>


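English text (the corresponding sentences, saved from Notepad as a plain-text file; as pasted here there are 11 segments, and the first one has no counterpart in the Chinese above):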

<seg>She was a terrifying sight , and no one now dared go near her . </seg>
<seg>By evening her voice began to grow hoarse and she sounded more and more like a croaking harpy . </seg>
<seg>None of the women could bear to be in her presence , and they deputed some of the more courageous menfolk to come in and keep watch on her . </seg>
<seg>One minute she seemed to be gone , then she came round again , and so it went on all night . </seg>
<seg>By the next morning she was incapable of speech , her face was horribly contorted and she began rending her clothes and baring her bosom , as if someone else was stripping her naked . </seg>
<seg>She was now totally inarticulate , and the torment she was undergoing was terrible to behold . She seemed to have reached a final crisis , when the doctor arrived . He would not take her pulse , but gave orders at once for her last things to be made ready and himself prepared to leave without further ado . </seg>
<seg>The servant who had brought him entreated him to stay and take her pulse , so that he could at least return with a satisfactory report to his master , and in the end the doctor relented . </seg>
<seg>He felt her pulse once , and pronounced that there was no sign of life . </seg>
<seg>Hearing this , Jia Huan burst out wailing , and immediately everyone 's attention was turned to him and no one spared another thought for Aunt Zhao , lying dead on the kang , her feet bare , her hair in disarray . </seg>
<seg>Only Aunt Zhou seemed affected . </seg>
<seg>She thought morbidly to herself that such is the end of a concubine ! </seg>

Note: the start and end of each sentence in these texts are already marked with <seg></seg>, so when you use Load Corpus in ParaConc, choose Start/End Tags under Align Format. Good luck!
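
Before loading, it may help to confirm that both files contain the same number of segments, since with tag-based alignment a mismatch would presumably throw off every later pair. A minimal Python sketch, assuming the <seg>...</seg> markup used above; the name count_segments and the two short sample strings are only for illustration:

```python
import re

# Matches one <seg>...</seg> unit; DOTALL lets a segment span several lines.
SEG_RE = re.compile(r"<seg>(.*?)</seg>", re.DOTALL)


def count_segments(text: str) -> int:
    """Return the number of <seg>...</seg> units found in text."""
    return len(SEG_RE.findall(text))


# Two tiny stand-ins for the Chinese and English files.
chn = "<seg>那大夫用手一摸,已無脈息。</seg>\n<seg>只有周姨娘心裏苦楚。</seg>\n"
eng = (
    "<seg>He felt her pulse once . </seg>\n"
    "<seg>Only Aunt Zhou seemed affected . </seg>\n"
)

if count_segments(chn) == count_segments(eng):
    print("segment counts match")
else:
    print("segment counts differ; check the alignment")
```

For real files, read each one into a string first (with the encoding it was saved in) and pass the strings to count_segments.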

xudekuan
2006-08-12, 12:57 PM
Thank you very much, laohong!
How do you add the <seg> tags to the original text, manually or automatically? If automatically, is there a free tool for this?

Thanks again.

laohong
2006-08-12, 05:03 PM
Yes, they are added automatically, but you have to split the text into sentences first.
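
The sentence splitting can be roughed out with a regular expression. A naive Python sketch, assuming sentences end in 。!? (Chinese) or . ! ? (English); it will mis-split abbreviations such as "Dr." and needs a manual check afterwards. The name split_sentences is hypothetical:

```python
import re

# One sentence = a run of non-terminal characters, then the terminal
# punctuation plus any closing quote marks, or the end of the text.
SENT_RE = re.compile(r"[^。!?.!?]+(?:[。!?.!?]+[」』”’]*|$)")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the final punctuation attached."""
    pieces = (m.group().strip() for m in SENT_RE.finditer(text))
    return [p for p in pieces if p]


for sentence in split_sentences("那大夫用手一摸,已無脈息。只有周姨娘心裏苦楚。"):
    print(sentence)
```

Writing the result out one sentence per line gives the input that the tagging step expects.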

laohong
2006-12-05, 05:52 PM
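The sample text in post #2 can be saved as two plain-text files, one named Chn Seg.txt and the other Eng Seg.txt. Then load these two files into ParaConc, and you get the screen below: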

http://forum.corpus4u.org/attachment.php?attachmentid=67&stc=1&d=1165312146

Click the Options button next to Align Format, and you get the dialog below:

http://forum.corpus4u.org/attachment.php?attachmentid=68&stc=1&d=1165312225

This gives you an aligned corpus that is ready for searching.

If you replace <seg> and </seg> in the texts with <p> and </p>, just change the tags in the Align Format Options to <p> and </p> accordingly.

xudekuan
2006-12-06, 07:27 PM
Thanks, Lao Hong.
Could you tell me how to add the <seg> </seg> tags automatically?

laohong
2006-12-07, 03:34 PM
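Here is a simple method that needs no programming:

1. First prepare the Chinese and English texts, one sentence per line. Note that the two texts should have the same number of lines; if you want to be able to search for Chinese words or characters, it is best to run word segmentation first or insert spaces between the characters;

2. Open the text in EditPlus (available from http://www.editplus.com);

3. Open the Search menu, choose Replace, and tick Regular Expression;

4. In Find what, enter the characters inside the quotation marks (do not copy the quotation marks themselves) "\n", and in Replace with, enter the characters inside the quotation marks (again without the quotation marks) "</seg>\n<seg>"; place the cursor at the very beginning of the text, then click Replace All;

5. Finally, move the <seg> left on the last line of the text to the beginning of the first line, and you are done (this assumes the text ends with a line break, which is what leaves that stray <seg> at the end).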
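
For anyone who would rather script it, the same result can be sketched in Python. This is only an illustration under a few assumptions: the input already has one sentence per line, the function names add_seg_tags, space_cjk and tag_file are made up here, and gb18030 is assumed as the "GB" encoding (adjust it to whatever your files actually use):

```python
import re

# Basic CJK ideograph range; punctuation such as 。 is outside it.
CJK_RE = re.compile(r"([\u4e00-\u9fff])")


def add_seg_tags(text: str) -> str:
    """Wrap every non-empty line in <seg>...</seg>."""
    lines = (ln.strip() for ln in text.splitlines())
    return "\n".join(f"<seg>{ln}</seg>" for ln in lines if ln) + "\n"


def space_cjk(line: str) -> str:
    """Put spaces around each Chinese character so each can be searched as a word."""
    spaced = CJK_RE.sub(r" \1 ", line)
    return re.sub(r"\s+", " ", spaced).strip()


def tag_file(src: str, dst: str, encoding: str = "gb18030") -> None:
    """Read src (one sentence per line) and write the tagged version to dst."""
    with open(src, encoding=encoding) as f:
        tagged = add_seg_tags(f.read())
    with open(dst, "w", encoding=encoding) as f:
        f.write(tagged)


print(add_seg_tags("那大夫用手一摸,已無脈息。\n只有周姨娘心裏苦楚。\n"), end="")
print(space_cjk("已無脈息。"))
```

Unlike the Replace All approach, this wraps each line directly, so there is no leftover <seg> to move by hand. A call such as tag_file("Chn.txt", "Chn Seg.txt") would process a whole file; the input file name here is hypothetical.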

armstrong
2006-12-08, 12:49 AM
Thanks a lot, Dr. Hong.