Urgent: how do I strip the annotation from tagged BNC texts?

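For my research I extracted part of the BNC, but because the texts are annotated I don't know how to strip the markup and save them as plain text.
I have read that the Text Converter in WordSmith can be used to clean texts, and that it needs a conversion file, namely c:\wsmith\convert.txt, but it keeps popping up I/O error 103 and I don't know why. Any pointers would be appreciated, thanks!
Someone also mentioned EditPad Pro, but I don't know where to download it or how to use it. Can it clean an entire (fairly large) corpus?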
 

armstrong

Senior Member
Re: Urgent: how do I strip the annotation from tagged BNC texts?

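First you need to understand how the BNC texts you want to clean are structured, and then write a conversion file to match; only then will it work.
As for EditPad Pro, what you actually use is its replace function, with regular expressions.
Whichever of the two you use, you first have to understand the structure of the BNC texts.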
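
As a rough illustration of the regular-expression replacement described above, here is a minimal sketch in Python (not EditPad Pro and not a WordSmith conversion file). It assumes the markup is SGML-style, with every tag enclosed in angle brackets, such as `<w NN1>` or `<s n="1">`. The sample line is invented for illustration; real BNC files also contain headers and character entities, which this sketch does not handle.

```python
import re

# Assumption: every tag is an angle-bracket element such as <w NN1> or <s n="1">.
TAG = re.compile(r"<[^>]+>")

def strip_tags(line: str) -> str:
    """Remove angle-bracket tags, then collapse the whitespace left behind."""
    text = TAG.sub("", line)
    return re.sub(r"\s+", " ", text).strip()

# Invented sample in the style of word-class-tagged text.
sample = '<s n="1"><w AT0>The <w NN1>cat <w VVD>sat<c PUN>.'
print(strip_tags(sample))
```

The same pattern, `<[^>]+>`, replaced with nothing, is the kind of expression one would type into an editor's regex replace dialog, but only after checking a sample file to confirm the tags really look like this.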
 

xusun575

Senior Member
Re: Urgent: how do I strip the annotation from tagged BNC texts?

Post a sample so we can diagnose it.
 

xujiajin

Administrator
Staff member
Re: Urgent: how do I strip the annotation from tagged BNC texts?

Here is the program I used to detag the BNC before retagging it with the C7 tagset. It removes everything other than the original texts and transcripts. To use it you will need to install Perl, which is free. Then follow the steps below:

1) Make a new directory on the machine;
2) Copy the selected files to the directory;
3) Unzip the Perl script into the same directory;
4) Double-click the program file.

A new file will be created for each BNC file, ending in .txt. These new files are what you want.

Warning: This program only works with BNC files.
http://www.corpus4u.org/upload/forum/2005122923225827.zip

Perl script written by xiaoz.
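
For readers who want to see the shape of the workflow without opening the script, the sketch below does the equivalent in Python. It is not xiaoz's Perl program: the function name `detag_directory`, the angle-bracket tag pattern, and the latin-1 encoding are all assumptions made for illustration. Like the steps above, it writes one new `.txt` file per extensionless input file and leaves the inputs untouched.

```python
import re
from pathlib import Path

TAG = re.compile(r"<[^>]+>")  # assumption: tags are angle-bracket elements

def detag_directory(folder: str) -> list:
    """Write a detagged copy, <name>.txt, of every extensionless file in folder."""
    written = []
    for src in sorted(Path(folder).iterdir()):
        # Skip sub-folders and any file that already has an extension.
        if not src.is_file() or src.suffix:
            continue
        dst = src.with_name(src.name + ".txt")
        if dst.exists():  # never overwrite an existing file
            continue
        text = src.read_text(encoding="latin-1")
        dst.write_text(TAG.sub("", text), encoding="latin-1")
        written.append(dst.name)
    return written
```

Calling `detag_directory("bnc_sample")` (a hypothetical folder name) returns the list of files it created. Skipping files that already have an extension, and refusing to overwrite, are deliberate choices so that a second run cannot damage anything.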
 
Re: Urgent: how do I strip the annotation from tagged BNC texts?

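I followed the steps above, but I am stuck at step 4,
4) Double click the program file, that is, BNCdetag.pl. How do I open a program ending in .pl?
I looked it up and read that DzSoft Perl Editor is needed, so I downloaded it, but the script still would not run; apparently ActivePerl is required as well, and ActivePerl is even bigger than the Perl Editor! I have been at it for ages and still cannot get it installed. What should I do?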
 

xujiajin

Administrator
Staff member
Re: Urgent: how do I strip the annotation from tagged BNC texts?

You do have to install ActivePerl; once it is installed, the script will run. It is indeed large.
 
Re: Urgent: how do I strip the annotation from tagged BNC texts?

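ActivePerl is installed now, but as soon as I ran BNCdetag all my files were wiped: only empty shells are left in the folder, and every file shows a size of 0 bytes. I'm stunned! What went wrong? (The attachment is a portable version of the Perl editor.)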
 

Attachment

Re: Urgent: how do I strip the annotation from tagged BNC texts?

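A reminder for beginners: always back up the original corpus before processing it like this. It is also wise to pull out a small sample and test on that first, and to process everything only once the test succeeds; otherwise you may well suffer a serious loss.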
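
The advice above can be sketched in a few lines of Python. The function names `back_up` and `make_trial_copy` and the `_backup` / `_trial` folder suffixes are made up for this example. Both functions only copy files, and both stop with an error rather than overwrite a folder that already exists.

```python
import shutil
from pathlib import Path

def back_up(folder: str) -> Path:
    """Copy folder to a sibling '<folder>_backup'; fail rather than overwrite an old backup."""
    src = Path(folder)
    dst = src.with_name(src.name + "_backup")
    shutil.copytree(src, dst)  # raises FileExistsError if dst already exists
    return dst

def make_trial_copy(folder: str, n: int = 3) -> Path:
    """Copy the first n files (sorted by name) into '<folder>_trial' for a test run."""
    src = Path(folder)
    trial = src.with_name(src.name + "_trial")
    trial.mkdir()  # raises FileExistsError if it already exists
    for f in sorted(p for p in src.iterdir() if p.is_file())[:n]:
        shutil.copy2(f, trial / f.name)
    return trial
```

The idea is to run any detagging tool on the `_trial` folder first, inspect the results, and only then point it at the full corpus, with the `_backup` folder kept untouched throughout.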
 

xiaoz

Permanent Super Administrator
Staff member
Re: Urgent: how do I strip the annotation from tagged BNC texts?

I'm sorry this has happened, which is most unfortunate. It happened because the original filenames of the BNC files have been changed: original BNC filenames look like fb4, with no .txt extension, rather than fb4.txt.

To process the BNC files with names like fb4.txt, you can modify lines 4 and 15 of the script as follows:

Line 4:
@files=grep (/\b[A-Z0-9]{3}\.txt\b/i, readdir (DIR));

Line 15:
$output="new_".$fn;

Then you will get resulting files with names like new_fb4.txt, which are what you want.
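
To see what the two modified lines accept and produce, here is a Python rendering of them. It is a sketch for checking file names only, not a replacement for the Perl script, and the list of sample names is invented. The pattern is copied from the modified line 4, with Perl's `/i` flag expressed as `re.IGNORECASE`.

```python
import re

# Same pattern as the modified line 4; /i in Perl becomes re.IGNORECASE here.
BNC_TXT = re.compile(r"\b[A-Z0-9]{3}\.txt\b", re.IGNORECASE)

def select_files(names):
    """Keep only names that contain a three-character ID followed by .txt."""
    return [n for n in names if BNC_TXT.search(n)]

def output_name(fn):
    """Mirror the modified line 15: prefix the input name with 'new_'."""
    return "new_" + fn

# Invented directory listing for illustration.
names = ["fb4.txt", "A00.txt", "fb4", "notes.doc", "BNCdetag.pl"]
for fn in select_files(names):
    print(fn, "->", output_name(fn))
```

Because the output name always differs from the input name, the input files are never the ones being written to. In this Python version the pattern also does not match `new_fb4.txt` itself, since the underscore counts as a word character and so there is no word boundary before `fb4`.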


 
Re: Urgent: how do I strip the annotation from tagged BNC texts?

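I'm not sure whether this meets your needs. Criticism is welcome. I'm away from home at the moment; otherwise I could adjust my own macro and send it to you.
Work out what needs to be removed, then adjust the tool accordingly.
That came out very well; plain text is all I need. But how exactly is it done, and how do I adjust the macro and so on? Please advise!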
 
Re: Urgent: how do I strip the annotation from tagged BNC texts?

Sorry, I still don't get it. There is more than one file, and each file has more than one line. How can I modify so many files and so many lines by hand? Is there a more convenient way to convert the format once and for all?
 

xusun575

Senior Member
Re: Urgent: how do I strip the annotation from tagged BNC texts?

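How to do it is hard to explain properly online. A single file and hundreds or thousands of files are handled in the same way, and the larger job takes hardly any more time. Our detagger is handy and does the job, but since it has neither technical sophistication nor good looks, it had better stay at home :p
If you don't mind, send the files over, all at once or in batches, and I'll process them for you.

PS: I have both the BNC World Edition and the Baby Edition, but I don't know which disk they are on; I have no use for them and can't be bothered to look. I mention this just to put your mind at ease, haha :D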
 
Re: Urgent: how do I strip the annotation from tagged BNC texts?

Professor Sun is always so humorous.
We all hope your detagger does get to leave home.
 

xiaoz

Permanent Super Administrator
Staff member
Re: Urgent: how do I strip the annotation from tagged BNC texts?

You do not need to modify many files, only the program file named detag.pl (lines 4 and 15, as described above). Then proceed as instructed.

 
Re: Urgent: how do I strip the annotation from tagged BNC texts?

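Sorry, I have the same problem, though with ICE-HK rather than the BNC. I want to remove all of its headings and tags. I tried the detagger tool from the forum: it works on a single document but not in batch mode, and if I first merge the files with a document-organizing tool and then detag, it hangs and never responds again. I have tried several times and am close to despair...
I'd be grateful for help from everyone here. I read the posts above carefully, but I am hopeless with computers, and many of the programs and terms mentioned are over my head. Please point me to a basic method.
Waiting anxiously; many thanks.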
 

xusun575

Senior Member
Re: Urgent: how do I strip the annotation from tagged BNC texts?

That should be easy to solve. How about posting a sample for diagnosis too? :p
 