(English version) Detagging Tool

Thanks for sharing it! By the way, have you finished your previous software for CLEC? Would you kindly send us a beta version if it is done?
 

[This post was edited by the author on 2006-06-29 at 13:00:56]

[This post was edited by the author on 2006-06-30 at 14:14:28]
 
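Ha ha! Still so mysterious. Keeping us in suspense? There are too many tools these days, and people have practically become slaves to their tools. The tools we truly needed already exist; yours is at best icing on the cake. Let the boss keep it to himself.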


[This post was edited by the author on 2006-06-30 at 09:11:57]
 
Re: (English version) Detagging Tool

Thanks for sharing.

The program works very well for short files. But when I tested it on a file of
about 2.0 MB, it ran for about 5 minutes and still hadn't finished. I couldn't tell
whether it was still working, since there is no progress report showing how much has
been processed. Some kind of progress indicator would be nice.
 
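This software's fault tolerance is poor. Incomplete tags such as "[sn8-" cause problems. I also have not optimized how large files are handled yet, so it will be very slow on them. The tool is only thirty-odd lines of code and still very crude, so please bear with me. I will keep improving it based on everyone's feedback.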
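
One possible way to tolerate incomplete tags while reading the text only once is sketched below. It is an illustration, not the code of the posted tool: `StripPaired` is a made-up name, and the rule that an opener without a matching closer is kept as plain text is an assumption about the desired behavior.

```pascal
program StripDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Hypothetical helper (not part of the posted tool): removes every
// OpenCh...CloseCh span in a single pass over S.
function StripPaired(const S: string; OpenCh, CloseCh: Char): string;
var
  i, k, n, tagStart: Integer;
  inTag: Boolean;
begin
  SetLength(Result, Length(S));      // the output is never longer than the input
  n := 0;
  tagStart := 0;
  inTag := False;
  for i := 1 to Length(S) do
    if not inTag then
    begin
      if S[i] = OpenCh then
      begin
        inTag := True;               // remember where the tag began
        tagStart := i;
      end
      else
      begin
        Inc(n);
        Result[n] := S[i];           // ordinary text: copy it
      end;
    end
    else if S[i] = CloseCh then
      inTag := False                 // complete tag: discard it
    else if S[i] = OpenCh then
    begin
      // Assumption: the earlier opener was never closed, so keep it as text.
      for k := tagStart to i - 1 do
      begin
        Inc(n);
        Result[n] := S[k];
      end;
      tagStart := i;
    end;
  // An opener still pending at the end of the text is kept as well.
  if inTag then
    for k := tagStart to Length(S) do
    begin
      Inc(n);
      Result[n] := S[k];
    end;
  SetLength(Result, n);
end;

begin
  // prints: He [sn8- said hello there
  WriteLn(StripPaired('He [sn8- said <s n="1">hello</s> there', '<', '>'));
end.
```

Because each character is written to the output buffer at most once, the running time grows roughly in proportion to the file size, instead of re-scanning and shifting the whole string for every tag.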
 
A real detagger should be able to remove everything other than the textual data. Try your tool on a BNC file to see if it can remove the corpus header in addition to everything in <>.
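
A rough sketch of the header step follows. It assumes the corpus header is wrapped in a `<teiHeader>` element, as in the XML edition of the BNC; other editions may mark the header differently. `DropHeader` is a hypothetical name, and the sample string is invented for the demonstration.

```pascal
program HeaderDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Hypothetical helper: removes the first <teiHeader>...</teiHeader> element.
// The element name is an assumption about the file format.
function DropHeader(const S: string): string;
const
  OpenTag = '<teiHeader';
  CloseTag = '</teiHeader>';
var
  a, b: Integer;
begin
  Result := S;
  a := Pos(OpenTag, Result);
  b := Pos(CloseTag, Result);
  // Delete only when both markers exist and appear in the expected order.
  if (a > 0) and (b > a) then
    Delete(Result, a, b - a + Length(CloseTag));
end;

const
  // Invented sample, much shorter than a real corpus file.
  Sample = '<bncDoc><teiHeader><title>Demo</title></teiHeader>' +
           '<wtext><s>Hello world.</s></wtext></bncDoc>';

begin
  // prints: <bncDoc><wtext><s>Hello world.</s></wtext></bncDoc>
  WriteLn(DropHeader(Sample));
end.
```

The header text has to be removed before the angle-bracket tags are stripped; otherwise the titles and other metadata inside the header would survive as if they were corpus text.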
 
Source code of this tool
----------------------------------------------------------------

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, StdCtrls, ComCtrls, ShellCtrls, Grids, Outline, DirOutln,
  FileCtrl, ShellAPI, WinSkinData;

type
  TForm1 = class(TForm)
    Button1: TButton;
    FileListBox1: TFileListBox;
    DirectoryListBox1: TDirectoryListBox;
    DriveComboBox1: TDriveComboBox;
    GroupBox1: TGroupBox;
    CheckBox1: TCheckBox;
    CheckBox2: TCheckBox;
    CheckBox3: TCheckBox;
    CheckBox4: TCheckBox;
    Label1: TLabel;
    Button2: TButton;
    Label2: TLabel;
    Label3: TLabel;
    SkinData1: TSkinData;
    procedure DriveComboBox1Change(Sender: TObject);
    procedure Button1Click(Sender: TObject);
    procedure DirectoryListBox1Click(Sender: TObject);
    procedure Button2Click(Sender: TObject);
    procedure Button3Click(Sender: TObject);
    procedure Label2Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

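// Remove paired tags: <...> and [...]
function DeLabel1(str, strBegin, strEnd: String): String;
var
  i, j: Integer;
begin
  i := Pos(strBegin, str);
  while i > 0 do
  begin
    Application.ProcessMessages;
    // Search for the closing delimiter only after the opening one; otherwise
    // a stray closing delimiter ahead of the opener would make Delete a no-op
    // and the loop would never end.
    j := Pos(strEnd, Copy(str, i + Length(strBegin), MaxInt));
    if j = 0 then
      Break; // unclosed tag such as "[sn8-": leave the remainder as it is
    Delete(str, i, Length(strBegin) + j - 1 + Length(strEnd));
    i := Pos(strBegin, str);
  end;

  // Collapse runs of spaces into a single space
  while Pos('  ', str) > 0 do
    str := StringReplace(str, '  ', ' ', [rfReplaceAll]);

  Result := str;
end;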

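// Remove unpaired tags marked by _ or /
function DeLabel2(str, strBegin, strEnd: String): String;
var
  hs: String;
  p: Integer;
  strLst: TStringList;
begin
  Result := '';
  if Length(str) = 0 then // nothing to do
    Exit;
  strLst := TStringList.Create;
  try
    p := Pos(strEnd, str);
    while p <> 0 do
    begin
      hs := Copy(str, 1, p - 1); // token up to the separator
      // Keep only the part of the token before the tag marker
      if Pos(strBegin, hs) > 0 then
        hs := Copy(hs, 1, Pos(strBegin, hs) - 1);
      strLst.Add(hs); // add to the list
      Delete(str, 1, p); // drop the token and its separator
      p := Pos(strEnd, str); // find the next separator
    end;
    if Length(str) > 0 then
      strLst.Add(str); // add whatever is left

    // Join the tokens back together with spaces
    Result := StringReplace(strLst.Text, #13#10, ' ', [rfReplaceAll]);
  finally
    strLst.Free; // release the list even if an exception occurs
  end;
end;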




procedure TForm1.DriveComboBox1Change(Sender: TObject);
begin
  Form1.DirectoryListBox1.Drive := Form1.DriveComboBox1.Drive;
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  i: Integer;
  strLst: TStringList;
  str: String;
begin
  // An output folder must have been chosen
  if Trim(Label1.Caption) = '' then
    Exit;
  // At least one tag type must be selected
  if not (CheckBox1.Checked or CheckBox2.Checked or
          CheckBox3.Checked or CheckBox4.Checked) then
    Exit;

  strLst := TStringList.Create;
  try
    for i := 0 to FileListBox1.Count - 1 do
    begin
      FileListBox1.ItemIndex := i;
      strLst.LoadFromFile(FileListBox1.FileName);
      str := ' ' + strLst.Text + ' ';
      strLst.Clear;
      if CheckBox1.Checked then str := DeLabel1(str, '[', ']');
      if CheckBox2.Checked then str := DeLabel1(str, '<', '>');
      if CheckBox3.Checked then str := DeLabel2(str, '_', ' ');
      if CheckBox4.Checked then str := DeLabel2(str, '/', ' ');
      strLst.Add(Trim(str));
      // Save under the same file name in the output folder
      strLst.SaveToFile(Label1.Caption + '\' + FileListBox1.Items[i]);
      strLst.Clear;
    end;
  finally
    strLst.Free;
  end;

  ShowMessage('OK!');
end;

procedure TForm1.DirectoryListBox1Click(Sender: TObject);
begin
  FileListBox1.Directory := DirectoryListBox1.Directory;
end;

procedure TForm1.Button2Click(Sender: TObject);
var
  dir: String;
begin
  // Enable processing only once an output folder has been chosen
  if SelectDirectory(dir, [], 500) then
  begin
    Label1.Caption := dir;
    Button1.Enabled := True;
  end;
end;

procedure TForm1.Button3Click(Sender: TObject);
var
  i: Integer;
begin
  for i := 0 to FileListBox1.Count - 1 do
  begin
    FileListBox1.ItemIndex := i;
    ShowMessage(FileListBox1.FileName);
  end;
end;

procedure TForm1.Label2Click(Sender: TObject);
begin
  ShellExecute(Handle, nil, PChar('mailto:JYL_JAVA@126.com'), nil, nil, SW_SHOWNORMAL);
end;

end.
 
Re: (English version) Detagging Tool

It seems it doesn't work on every BNC XML file. CBF.xml is still detagginnnnnnnnnnnnnnnng
 