Question: problem extracting word clusters with AntConc

Thread starter: Pennylin
Date: 2011-06-12

Pennylin

2011-06-12

#1

The corpus text looks like this:

<B> <overlap /> hi </B>

<B> yeah <overlap /> all right </B>

<B> I think that's the topic I'm interested in or I'd like to tell you about which is a country I have visited and: it is a country which has impressed me . obviously (erm) . to describe my visit and say why I found that country particularly impressive </B>

<B> (eh) . well it was . Mexico (eh) that I visited for<?> some four years ago and it was .. around Christmas time . so . it was really . (er) . quite . an experience for me . (eh) it was not my first flight but it was my first overseas flight . (erm) . just at Christmas that is December the twenty-fourth and we had turkey <starts laughing> and champagne <stops laughing> </B>

<B> on the plane yes so it started from there no it started in Spain . cos I (eh) . left from here I went to see my friend in Spain <overlap /> and then I spent some time </B>

<B> with her there and I left . I I took the . the flight (em) . on Christmas day so this was the first time and . (eh) . it still is my first time that I've . (eh) ... had one and the same day twice </B>
<B> <overlap /> <XXX> . </B>

But what gets extracted is clusters of letters, as follows:

1 33 h e r e
2 25 w e l l
3 24 t h e r
4 23 y e a h
5 22 t h a t
6 21 t h i s
7 20 v e r y
8 18 d t h e
9 18 n d t h
10 18 t h e s
11 17 t h i n
12 14 a n d t
13 14 i g h t
14 13 e t h e

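I don't know what is causing this. I have tried several different Language Encodings settings and the result is always the same.
I hope someone can give me some guidance. Thank you very much!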

frankli1

2011-06-12

#2

Re: Question: problem extracting word clusters with AntConc

I gave it a try and it works perfectly well; I did not see the problem you describe.

Pennylin

2011-06-13

#3

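Re: Question: problem extracting word clusters with AntConc

frankli1 wrote:
I gave it a try and it works perfectly well; I did not see the problem you describe.

Yes, I have no problems with other corpora either. It only happens with this corpus. Could it have something to do with the font of the corpus files? I changed it to Times New Roman and to SimSun, but the problem still occurs. Where might the problem be?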

seanxpq

corpus explorer

2011-06-13

#4

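Re: Question: problem extracting word clusters with AntConc

Pennylin wrote:
Yes, I have no problems with other corpora either. It only happens with this corpus. Could it have something to do with the font of the corpus files? I changed it to Times New Roman and to SimSun, but the problem still occurs. Where might the problem be?

I ran the text you posted above through AntConc and had no problems.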

xiaoz

Permanent super administrator

Staff member

2011-06-13

#5

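Re: Question: problem extracting word clusters with AntConc

It is probably related to the language settings. They may have been set to Chinese, so that each character is treated as a word.

Pennylin wrote:
Yes, I have no problems with other corpora either. It only happens with this corpus. Could it have something to do with the font of the corpus files? I changed it to Times New Roman and to SimSun, but the problem still occurs. Where might the problem be?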

frankli1

2011-06-13

#6

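Re: Question: problem extracting word clusters with AntConc

I tried again and found that if this text is saved in Unicode format, the problem you describe appears. Saving the text as UTF-8 fixes it. I am not quite sure why.
Unicode is an encoding scheme that can accommodate every script in the world, so this is rather odd.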
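
One possible explanation can be sketched in Python. The sketch assumes that the "Unicode" save option produces UTF-16 little-endian with a byte order mark, and that the reading program decodes the bytes with a single-byte encoding (Latin-1 is used here purely as a stand-in). The letters-only tokenizer and the `ngrams` helper are hypothetical and are not AntConc's actual implementation.

```python
import re

text = "and the"

# Assumed file contents: a UTF-16LE byte order mark followed by UTF-16LE text.
raw = "\ufeff".encode("utf-16-le") + text.encode("utf-16-le")

# Misreading the bytes as a single-byte encoding leaves a NUL after every letter.
misread = raw.decode("latin-1")

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical tokenizer: a token is a run of ASCII letters.
tokens_wrong = re.findall(r"[A-Za-z]+", misread)
tokens_right = re.findall(r"[A-Za-z]+", raw.decode("utf-16"))

print(ngrams(tokens_wrong, 4))  # ['a n d t', 'n d t h', 'd t h e']
print(tokens_right)             # ['and', 'the']
```

Under these assumptions every letter becomes its own token, so four-token clusters are four letters that run across word boundaries, which resembles entries such as "n d t h" and "d t h e" in the list from post #1.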

Last edited: 2011-06-13

Pennylin

2011-06-14

#7

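Re: Question: problem extracting word clusters with AntConc

frankli1 wrote:
I tried again and found that if this text is saved in Unicode format, the problem you describe appears. Saving the text as UTF-8 fixes it. I am not quite sure why.
Unicode is an encoding scheme that can accommodate every script in the world, so this is rather odd.

Strange, it seems that a .txt file cannot be saved as UTF-8? Is it chosen under "Font"? If not, where is it?

Thank you all for the pointers!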

frankli1

2011-06-14

#8

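Re: Question: problem extracting word clusters with AntConc

Pennylin wrote:
Strange, it seems that a .txt file cannot be saved as UTF-8? Is it chosen under "Font"? If not, where is it?

Thank you all for the pointers!

When you use Save As, just pick it in the Encoding box.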

Pennylin

2011-06-14

#9

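Re: Question: problem extracting word clusters with AntConc

frankli1 wrote:
When you use Save As, just pick it in the Encoding box.

Thanks! I tried that and also set AntConc's Language Encodings to utf-8, but it still does not work. I tried many other fonts and none of them work either. What should I do?!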

LAnthony

2011-06-14

#10

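Re: Re: Question: problem extracting word clusters with AntConc

Pennylin wrote:
Thanks! I tried that and also set AntConc's Language Encodings to utf-8, but it still does not work. I tried many other fonts and none of them work either. What should I do?!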

This is almost definitely a problem with the character encoding. I suggest you open the file in a browser and then make sure that the character encoding is really UTF-8. Then, if it is, open it in AntConc, set the language encoding there to UTF-8 and things will work properly.
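
As an alternative to a browser, the check can be roughed out in a few lines of Python. This is only a sketch: `sniff_encoding` is a hypothetical helper, it looks only at a byte order mark and at whether the bytes decode as UTF-8, and it does not tell UTF-32 or legacy code pages apart.

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Guess an encoding from raw file bytes (hypothetical helper, heuristic only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # A UTF-16 byte order mark in either byte order.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        # Plain ASCII also passes this check, since ASCII is valid UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"

print(sniff_encoding("yeah".encode("utf-16")))  # utf-16
print(sniff_encoding("yeah".encode("utf-8")))   # utf-8
```

To test a real file, pass it the bytes read in binary mode, for example `sniff_encoding(open(path, "rb").read())`, where `path` is whichever corpus file is misbehaving.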

If you still have problems, can you attach a single original file (not copied and pasted)?

Laurence.

Last edited: 2011-06-14

xujiajin

Administrator

Staff member

2011-06-14

#11

Re: Question: problem extracting word clusters with AntConc

I think many corpus4u members, like me, are more than surprised that Laurence could understand Chinese posts.

Pennylin

2011-06-15

#12

Re: Question: problem extracting word clusters with AntConc

xujiajin wrote:
I think many corpus4u members, like me, are more than surprised that Laurence could understand Chinese posts.

Agree!

Laurence, it works properly now. Sorry for my carelessness. The problem is that I have too many files here and it is too time-consuming to convert them to UTF-8 manually one by one. I've attached an original file here. Could you check it and suggest which language encoding to choose in AntConc so that it works properly without converting the files to UTF-8? Thank you.

xiaoz

Permanent super administrator

Staff member

2011-06-15

#13

Re: Question: problem extracting word clusters with AntConc

Your sample is in Unicode (UTF16). You can download UltraCodingSwitch or Multilingual Corpus Tools (http://www.lancs.ac.uk/fass/projects/corpus/cbls/resources.asp) to convert them into UTF-8 or ANSI in a batch.
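
For anyone who prefers a script to those tools, a batch conversion could look roughly like the Python sketch below. It assumes every input file is UTF-16 with a byte order mark; `convert_to_utf8` and the directory arguments are hypothetical names, and the originals are left untouched because the UTF-8 copies go into a separate folder.

```python
from pathlib import Path

def convert_to_utf8(src_dir: str, dst_dir: str, pattern: str = "*.txt") -> int:
    """Write UTF-8 copies of the UTF-16 files in src_dir into dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(Path(src_dir).glob(pattern)):
        # The "utf-16" codec uses the byte order mark to pick the byte order
        # and removes it from the decoded text.
        text = src.read_text(encoding="utf-16")
        (out / src.name).write_text(text, encoding="utf-8")
        count += 1
    return count
```

A call such as `convert_to_utf8("corpus_utf16", "corpus_utf8")` returns the number of files converted; both folder names are placeholders.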

Pennylin wrote:
Agree!

Laurence, it works properly now. Sorry for my carelessness. The problem is that I have too many files here and it is too time-consuming to convert them to UTF-8 manually one by one. I've attached an original file here. Could you check it and suggest which language encoding to choose in AntConc so that it works properly without converting the files to UTF-8? Thank you.

Pennylin

2011-06-15

#14

Re: Question: problem extracting word clusters with AntConc

Thank you very much, Xiaoz. I've downloaded ultracodingswitch and my problem is solved. I've also downloaded MLCT and tried to install it via its run_mlct_concordance_jar. However, it seems that it can't be installed. What's the problem?

Thank you again for your help and thanks to all who have kindly offered me help these days!

xiaoz

Permanent super administrator

Staff member

2011-06-15

#15

Re: Question: problem extracting word clusters with AntConc

You'll need to install Java to use MLCT.

Pennylin

2011-06-16

#16

Re: Question: problem extracting word clusters with AntConc

Thanks!

LAnthony

2011-06-16

#17

Re: Re: Question: problem extracting word clusters with AntConc

Pennylin wrote:
Thanks!

You don't need to do anything special. Just choose UCS-2LE in the language encoding options of AntConc and everything will work as expected. (I don't think the encoding is UTF-16, as someone else wrote.)

However, because the file name will be encoded in the legacy encoding of your system, it will appear garbled. This is why UTF-8 is now preferred: its first 128 characters are identical to ASCII.
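
The relationship between these encodings can be checked with a short Python sketch. The sample string is made up, and since Python ships no UCS-2 codec the UCS-2LE bytes are built by hand with `struct`, so this illustrates the byte layouts rather than what AntConc does internally.

```python
import struct

sample = "yeah 词块"  # made-up sample; every character is in the Basic Multilingual Plane

# UCS-2LE stores each code point as one 16-bit little-endian unit.
ucs2le = b"".join(struct.pack("<H", ord(ch)) for ch in sample)

# For Basic Multilingual Plane characters, UTF-16LE yields the same bytes.
same_as_utf16le = ucs2le == sample.encode("utf-16-le")

# UTF-8 output equals ASCII output for code points 0 to 127.
ascii_compatible = "yeah".encode("utf-8") == "yeah".encode("ascii")

# Above 127, UTF-8 switches to multi-byte sequences.
e_acute_length = len("é".encode("utf-8"))

print(same_as_utf16le, ascii_compatible, e_acute_length)  # True True 2
```

For text like this corpus, then, the two labels UCS-2LE and UTF-16LE describe the same bytes, which may be why both were suggested for the same file.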

I hope that helps.

Laurence.
