How to convert the GoTagger tagging format to the CLAWS format? Perl code

singer

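Regular member
A question for everyone: when searching files tagged by GoTagger with WordSmith Tools 3.0, how should I fill in the "Tags to Ignore" dialog box under Settings?
If that is not possible, how can I convert the GoTagger tagging format to the CLAWS format?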
 

xiaoz

Permanent super administrator
Staff member
You can ignore POS tags in Wordsmith. But first you need to post a paragraph tagged using gotagger.
 

xujiajin

Administrator
Staff member
You do not need to adjust the setting in WS.
Alternatively, replace every "_" with "<" and every space with ">" plus a space.
 

singer

Regular member
Thanks, Dr. Xu, your method works well.
I will put some sentences here.
before replacing:
A_DT Healthy_NNP Diet_NNP
A_DT healthy_JJ diet_NN is_VBZ very_RB important_JJ for_IN people_NNS ._. It_PRP contains_VBZ some_DT fat_JJ ,_,
after replacing:
A<DT>Healthy<NNP>Diet<NNP
A<DT>healthy<JJ>diet<NN>is<VBZ>very<RB>important<JJ>for<IN>people<NNS>.<.>

We can see there are a few "<.>" here. How can I remove them? Thanks for your reply.
 

xiaoz

Permanent super administrator
Staff member
I have actually uploaded a number of programs to this site that convert between different POS styles.
 

xujiajin

Administrator
Staff member
Re: [Help] How to convert the GoTagger tagging format to the CLAWS format?

Quoting singer, 2006-5-16 19:58:36:
Thanks, Dr. Xu, your method works well.
I will put some sentences here.
before replacing:
A_DT Healthy_NNP Diet_NNP
A_DT healthy_JJ diet_NN is_VBZ very_RB important_JJ for_IN people_NNS ._. It_PRP contains_VBZ some_DT fat_JJ ,_,
after replacing:
A<DT>Healthy<NNP>Diet<NNP
A<DT>healthy<JJ>diet<NN>is<VBZ>very<RB>important<JJ>for<IN>people<NNS>.<.>

We can see there are a few "<.>" here. How can I remove them? Thanks for your reply.
You're supposed to replace all spaces with > plus a SPACE.
There is no point in removing all the periods. If you wish to get rid of their tags anyway, replace all <.> with nothing.
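For anyone who prefers a script to a manual find-and-replace, the two replacements can be sketched in Perl roughly as follows. The sub name `retag_line` and its second argument are hypothetical, and the step that closes the last tag on a line is an added assumption, since a final token has no following space to replace.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the plain replacements described above to one line.
# $drop_period_tags is an assumed optional flag; when true, every "<.>" is deleted.
sub retag_line {
    my ($line, $drop_period_tags) = @_;
    $line =~ s/_/</g;                  # every "_" becomes "<"
    $line =~ s/ /> /g;                 # every space becomes ">" plus a space
    $line =~ s/(<[^<>\s]+)$/$1>/;      # assumption: close a final tag that has no space after it
    $line =~ s/<\.>//g if $drop_period_tags;   # optional: remove the "<.>" tags
    return $line;
}

print retag_line("A_DT healthy_JJ diet_NN ._. It_PRP"), "\n";
# prints: A<DT> healthy<JJ> diet<NN> .<.> It<PRP>
print retag_line("A_DT healthy_JJ diet_NN ._. It_PRP", 1), "\n";
# prints: A<DT> healthy<JJ> diet<NN> . It<PRP>
```

Note that a word which itself contains an underscore would be split in the wrong place by this plain replacement.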
 

xujiajin

Administrator
Staff member
Re: [Help] How to convert the GoTagger tagging format to the CLAWS format?

Quoting xiaoz, 2006-5-16 23:09:19:
I have actually uploaded a number of programs to this site that convert between different POS styles.
$line=~s/(\S+)\/(\S+)/<w POS="$2">$1<\/w>/g;
Richard, I saved the above line as a .pl file and put the dummy text in the same folder, but it didn't do any conversion. What else should I do?
 

xiaoz

Permanent super administrator
Staff member
Here is a Perl script that converts the word_tag style into the word<tag> style. I have commented the script for those interested. You can save the text below as format.pl.

use strict;
use warnings;

opendir (my $dir, ".") or die "Could not open the current directory: $!"; #open the current dir
my @files=readdir ($dir); #read all entries in the dir and save them in an array
closedir ($dir); #close the dir

foreach my $file (sort (@files)) #loop dealing with each file
{
    next unless (-f $file); #skip '.', '..' and anything else that is not a plain file
    next if ($file=~/format\.pl/); #skip the perl script itself
    next if ($file=~/^new_/); #skip output files left over from an earlier run

    open (my $in, "<", $file) or die "Could not open $file: $!"; #open the file or exit

    my $output="new_".$file; #name the output file

    open (my $out, ">", $output) or die "Could not open $output: $!"; #create output file

    while (my $line=<$in>) #read each line in the file
    {
        $line=~s/(\S+)_(\S+)/$1<$2>/g; #turn every word_TAG into word<TAG>
        $line=~s/ +/ /g; #squeeze runs of spaces into a single space
        print $out $line; #print to the output file
    }
    close ($in); #close the input file
    close ($out); #close the output file
}
 

xujiajin

Administrator
Staff member
The above script works perfectly with _TAG to <TAG> conversion.

But when I replaced the line "$line=~s/(\S+)_(\S+)/$1<$2>/g;" with "$line=~s/(\S+)\/(\S+)/<w POS="$2">$1<\/w>/g;", it failed to convert word_TAG to the XML style.
 

xiaoz

Permanent super administrator
Staff member
That's because your line matches word/tag, which does not occur in your text. Changing that line as follows:

$line=~s/(\S+)_(\S+)/<w POS="$2">$1<\/w>/g;

will do the trick.
 

xiaoz

Permanent super administrator
Staff member
If by the BNC style you mean <tag>word, change the matching line to:

$line=~s/<(\S+)>(\S+)/<w POS="$1">$2<\/w>/g;
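To compare the three substitutions quoted in this thread side by side, here is a small self-contained sketch. The lookup table `%to_xml`, its style names (`underscore`, `slash`, `bnc`) and the sub `convert_to_xml` are hypothetical names chosen for illustration; only the regular expressions come from the posts above.

```perl
use strict;
use warnings;

# Hypothetical lookup table: input style name => code that rewrites one line
# into the <w POS="TAG">word</w> style using the substitutions from this thread.
my %to_xml = (
    underscore => sub { my $l = shift; $l =~ s/(\S+)_(\S+)/<w POS="$2">$1<\/w>/g;  $l },  # word_TAG
    slash      => sub { my $l = shift; $l =~ s/(\S+)\/(\S+)/<w POS="$2">$1<\/w>/g; $l },  # word/TAG
    bnc        => sub { my $l = shift; $l =~ s/<(\S+)>(\S+)/<w POS="$1">$2<\/w>/g; $l },  # <TAG>word
);

# Convert one line from the named input style, or die on an unknown style.
sub convert_to_xml {
    my ($style, $line) = @_;
    my $converter = $to_xml{$style} or die "Unknown style: $style";
    return $converter->($line);
}

print convert_to_xml("underscore", "A_DT healthy_JJ diet_NN"), "\n";
print convert_to_xml("slash",      "A/DT healthy/JJ diet/NN"), "\n";
print convert_to_xml("bnc",        "<DT>A <JJ>healthy <NN>diet"), "\n";
```

All three calls should print the same line, `<w POS="DT">A</w> <w POS="JJ">healthy</w> <w POS="NN">diet</w>`. The sketch does no XML escaping, so words containing characters such as `&` or `<` would need extra handling.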
 

刘语料

Banned user
Dr. Xiao, I have another question to ask you: how can I extract the combinations of nouns modified by adjectives from the following XML text? Could you tell me the search pattern?
Thank you in advance.

<w POS="AT">the</w> <w POS="NP">Fulton</w> <w POS="NN">County</w> <w POS="JJ">Grand</w> <w POS="NN">Jury</w> <w POS="VBD">said</w> <w POS="NR">Friday</w> <w POS="AT">an</w> <w POS="NN">investigation</w> <w POS="IN">of</w> <w POS="NP$">Atlanta's</w> <w POS="JJ">recent</w> <w POS="NN">primary</w> <w POS="NN">election</w> <w POS="VBD">produced</w> <w POS="AT">no</w> <w POS="NN">evidence</w> <w POS="CS">that</w> <w POS="DTI">any</w> <w POS="NNS">irregularities</w>
 

xiaoz

Permanent super administrator
Staff member
As your corpus is in XML, you can index the corpus using Xaira, declaring the w element as the Addkey. After indexing, open the corpus in the Xaira client, and use the Query Builder to search for all occurrences of an adjective (JJ) followed immediately (declare the link type as Next) by a noun (NN).
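If Xaira is not at hand, the same adjective-plus-noun search can be approximated with a Perl regular expression over the XML text. This is only a rough sketch and not the Xaira method described above; the sub name `adj_noun_pairs` is hypothetical, and it assumes each word sits in a `<w POS="...">` element with nothing but whitespace between neighbouring elements.

```perl
use strict;
use warnings;

# Hypothetical helper: collect "adjective noun" pairs where a word tagged JJ
# is immediately followed by a word tagged NN.
sub adj_noun_pairs {
    my ($text) = @_;
    my @pairs;
    while ($text =~ m{<w POS="JJ">([^<]+)</w>\s*<w POS="NN">([^<]+)</w>}g) {
        push @pairs, "$1 $2";    # $1 is the adjective, $2 is the noun
    }
    return @pairs;
}

my $sample = '<w POS="NN">County</w> <w POS="JJ">Grand</w> <w POS="NN">Jury</w> '
           . '<w POS="VBD">said</w> <w POS="NP$">Atlanta\'s</w> '
           . '<w POS="JJ">recent</w> <w POS="NN">primary</w> <w POS="NN">election</w>';

print "$_\n" for adj_noun_pairs($sample);
# prints:
# Grand Jury
# recent primary
```

To include plural nouns or comparative adjectives, the tag tests could be widened, for example `POS="NNS?"`, depending on the tagset in use.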
 