View full version: [Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese


lngzlz
2005-10-03, 11:19 PM
Version 4 still seems to need spaces added to Chinese text before Concord will work?

[This post was edited by the author on 2005-10-03 at 23:19:54]

清风出袖
2005-10-03, 11:25 PM
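Right! You can use ACWT's Chinese word segmenter to insert the spaces and then process the text. This issue was also discussed at http://www.corpus4u.com/forum_view.asp?forum_id=7&view_id=420 (that thread is about how to handle Chinese in WordSmith 3), which may be a useful reference. I envy you for having something as good as WordSmith 4!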
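For readers without a segmenter to hand, the following is a minimal Python sketch of the crudest form of "adding spaces": it puts a space between individual Chinese characters. It is not what ACWT or any real word segmenter does (those split text into words, not single characters), and the function name `space_out_cjk` and the restriction to the basic CJK ideograph range are assumptions made only for illustration.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF)
# is treated as "Chinese"; extension blocks and punctuation are left alone.
_CJK_CHAR = re.compile(r"([\u4e00-\u9fff])")


def space_out_cjk(text: str) -> str:
    """Surround every Chinese character with spaces, line by line.

    This is character-level spacing, not real word segmentation.
    """
    spaced_lines = []
    for line in text.split("\n"):
        spaced = _CJK_CHAR.sub(r" \1 ", line)
        # Collapse the doubled spaces created between adjacent characters.
        spaced_lines.append(re.sub(r"[ \t]+", " ", spaced).strip())
    return "\n".join(spaced_lines)


if __name__ == "__main__":
    print(space_out_cjk("WordSmith处理中文"))
```

Character-level spacing makes every character a separate "word" for Wordlist, so frequency lists built this way count characters rather than words.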

xiaoz
2005-10-03, 11:40 PM
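If you only use Concord, you should not need to add spaces, but Wordlist and Keyword do require them. First of all, though, the data must be converted to Unicode.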


Quoting lngzlz's post of 2005-10-3 23:19:34:
Version 4 still seems to need spaces added to Chinese text before Concord will work?

[This post was edited by the author on 2005-10-03 at 23:19:54]

lngzlz
2005-10-04, 08:30 PM
Thanks! How do I convert the data to Unicode?

xujiajin
2005-10-04, 08:53 PM
4. Which software can convert between encodings automatically (GB/BIG5/UTF-8/UNICODE=UTF-16)?

a) Multilingual Corpus Tool by Scott Piao, batch conversion
http://www.lancs.ac.uk/staff/piaosl/research/download/download.htm

b) WordSmith Tools 4, GB/BIG5 -> UNICODE (UTF-16), batch conversion

c) NJStar (南极星) text converter, one file at a time
http://www.njstar.com

d) Chinese Annotation Tool, processes Simplified Chinese text online, one file at a time
http://www-rohan.sdsu.edu/~chinese/annotate.html
Perl version: http://www.mandarintools.com/segmenter.html

e) MS Word/Notepad, one file at a time

Find more at
http://www.corpus4u.org/showthread.php?t=699

Character encoding in corpus construction
http://www.corpus4u.org/showthread.php?t=416

lngzlz
2005-10-04, 11:32 PM
Dr. Xu, I cannot find "WordSmith Tools 4, GB/BIG5 -> UNICODE (UTF-16), batch conversion". Could you help me locate it in WordSmith Tools 4?

xiaoz
2005-10-05, 12:31 AM
In WordSmith 4, go to "Utilities - Text converter" in the main menu.

Check "Text conversion Activated".

Select the folder to be converted and make other adjustments as desired (for example, whether to keep the original data and create an extra copy in the temp directory).

Under Conversion type, select "into Unicode based on", and then select "Chinese (People's Republic of)".

Click on the OK button at the bottom.
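For anyone who prefers to do the same conversion outside WordSmith, here is a minimal Python sketch of a batch GB-to-Unicode conversion. It is not how the WS4 Text Converter works internally; the function names `transcode` and `convert_folder`, the `*.txt` pattern, and the default of GBK as the source encoding are all assumptions. Python's `"utf-16"` codec writes a byte order mark followed by the text in the machine's native byte order.

```python
from pathlib import Path


def transcode(data: bytes, src_encoding: str = "gbk",
              dst_encoding: str = "utf-16") -> bytes:
    """Decode bytes from src_encoding and re-encode them as dst_encoding.

    Raises UnicodeDecodeError if the input is not valid src_encoding,
    which is a useful early warning that the source guess was wrong.
    """
    return data.decode(src_encoding).encode(dst_encoding)


def convert_folder(src_dir: str, dst_dir: str, pattern: str = "*.txt") -> int:
    """Convert every matching file in src_dir, writing copies to dst_dir.

    The originals are left untouched. Returns the number of files written.
    """
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob(pattern)):
        (out / path.name).write_bytes(transcode(path.read_bytes()))
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical folder names, for illustration only.
    print(convert_folder("corpus_gbk", "corpus_utf16"))
```

Passing `dst_encoding="utf-8"` to `transcode` gives UTF-8 output instead, for tools that expect that encoding.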

lngzlz
2005-10-05, 06:34 PM
XiaoZ, thanks a lot for your timely help. But Concord still does not work properly with a Unicode-encoded Chinese file. It seems that the Chinese text must be segmented with spaces before Concord, Keyword and Wordlist can be used.

xiaoz
2005-10-05, 08:05 PM
The best thing to do with Chinese text is, of course, to tokenise the data. Yet you still need to convert the data into Unicode. Running text without segmentation will work with Concord, but not with Wordlist or Keyword. (I only tested running text with Concord.)

The problem you encountered probably has to do with settings. You will need to select language and font properly before loading the texts (in Adjust settings).

lngzlz
2005-10-05, 11:38 PM
Thank you very much, Dr. XiaoZ. But look at my screenshot:

http://forum.corpus4u.org/upload/forum/2005100523420041.jpg


[This post was edited by the author on 2005-10-05 at 23:42:12]

xiaoz
2005-10-06, 01:11 AM
The best practice is to avoid using Chinese characters in filenames.

hancunxin
2005-10-06, 08:24 AM
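Following everyone's posts, I ran an experiment with WordSmith 3. I found that Concord works fine on segmented Chinese text, but the other functions do not. I first segmented the text with ICTCLAS and then used Notepad to save the Chinese as a "Unicode document". My operating system is Windows Me, so I do not know whether the converted Unicode is UTF-8 or UTF-16, or neither. When I ran the Wordlist function of WordSmith 3.0 on Chinese text that had been converted to Unicode and segmented, it did not work.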

xiaoz
2005-10-06, 08:55 AM
The discussions in this thread relate to WordSmith 4, not version 3.

With version 3, as long as your Chinese data are tokenised, there is no need to convert to Unicode. By the way, if you use "Save as" and select "Unicode document" in Notepad, the result is Unicode (UTF-16). But you do need a Simplified Chinese version of Windows or a language pack. Even then only Concord will work, not Wordlist or Keyword.
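To check which flavour of Unicode a saved file actually is, one option is to look at its byte order mark (BOM). The sketch below is a minimal Python illustration under the assumption that the file starts with a BOM, as Notepad's "Unicode" saves normally do; a file without a BOM returns `None`, which does not prove it is not Unicode. The function name `sniff_bom` is hypothetical.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    with open("sample.txt", "rb") as handle:
        print(sniff_bom(handle.read(4)))
```

Only the first four bytes are needed, so this is cheap to run over a whole folder before loading texts.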

In Wordsmith 4, you must convert Chinese data into Unicode (you can use the Text Converter in Utilities of WS4). In this case, even texts not tokenised will work with Concord. But the data must be tokenised if you want to make a wordlist or extract keywords.

lngzlz
2005-10-06, 09:45 AM
Dear Dr. Xiao, I changed the file name into English. It still pops up a window which says "no concordance entries found". Could you upload a few screenshots so that I can follow your steps? Thanks a lot in advance.

patricx
2005-10-06, 09:55 AM
Have you tokenized your txt files? Then you have to be sure your texts are in Unicode.

lngzlz
2005-10-06, 10:12 AM
Quoting xiaoz's post of 2005-10-6 8:55:05:
even texts not tokenised will work with Concord.

xiaoz
2005-10-06, 10:36 AM
The problem you have encountered, I suspect, is most likely that your data have not been converted into Unicode properly. You have to use the Text Converter in WST4 to do the conversion (see my earlier replies). Or you can use MLCT to convert GB2312 (GBK) data into UTF-8 (not UTF-16; Mike cannot explain why the UTF-16 data converted using MLCT cannot be processed by WordSmith 4), and then click on the icon for "test Unicode" when choosing texts.

Here are a few screen dumps that show how WST4 works well with untokenised Unicode Chinese data.

http://forum.corpus4u.org/upload/forum/2005100610352494.jpg

http://forum.corpus4u.org/upload/forum/2005100610355893.jpg

http://forum.corpus4u.org/upload/forum/2005100610401766.jpg

http://forum.corpus4u.org/upload/forum/2005100610362342.jpg


[This post was edited by the author on 2005-10-06 at 10:40:28]

lngzlz
2005-10-06, 12:48 PM
Thanks for Dr. Xiao's timely help. If possible, could you upload a small part of your untokenised.xml for me to test? I believe I have done my data conversion correctly, but maybe not.

patricx
2005-10-06, 12:51 PM
You have to tokenize your data; that's the problem. Why not tokenize your data first and then try again?

lindaxiao
2006-08-07, 02:54 PM
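Could anyone advise: is it true that WordSmith 4.0 without a registration code can only display 25 concordance lines? When I search with Concord, line 26 shows "past demo limit". How can this be solved? Thanks a lot in advance!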

laohong
2006-08-08, 02:11 AM
The trial version of WordSmith 4.0 limits the results to the first 25 concordance lines. The way to solve this is to pay for a registration code.

armstrong
2006-08-15, 12:48 PM
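You are right: from line 26 onwards it shows "past demo limit". If you want to see the later lines, you have to delete earlier ones; either way, only 25 lines can be kept.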

rainbow
2007-01-14, 11:30 AM
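Re: [Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese

--------------------------------------------------------------------------------

If you only use Concord, you should not need to add spaces, but Wordlist and Keyword do require them. First of all, though, the data must be converted to Unicode.
Why is it that when I use Concord without adding spaces, or after segmenting with ICTCLAS, the results are not the search word I typed in? Only after putting spaces around every word do I get the search word I entered. What is the reason?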

xujiajin
2007-01-14, 07:03 PM
Then just add the spaces and be done with it.

xiaoz
2007-01-14, 07:10 PM
This is related to double-byte "character encoding". I think you will find an answer in this article:
http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter4.htm
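One way a double-byte encoding can produce hits that are not the search word is a match that straddles two characters. The Python sketch below illustrates that general pitfall with GBK bytes; it is only an illustration, not a diagnosis of the case reported above, and nothing here is known about how WordSmith searches internally. The function name `byte_level_hits` is hypothetical.

```python
def byte_level_hits(text: str, query: str, encoding: str = "gbk") -> list:
    """Return byte offsets where the encoded query occurs in the encoded text.

    Searching raw bytes ignores character boundaries, so an odd offset in
    two-byte text means the hit starts in the middle of a character.
    """
    raw, needle = text.encode(encoding), query.encode(encoding)
    hits, start = [], raw.find(needle)
    while start != -1:
        hits.append(start)
        start = raw.find(needle, start + 1)
    return hits


if __name__ == "__main__":
    text = "中文"
    raw = text.encode("gbk")  # four bytes: D6 D0 CE C4
    # The trail byte of the first character plus the lead byte of the
    # second also form a valid GBK character that is not in the text.
    straddler = raw[1:3].decode("gbk")
    print(straddler in text)                   # character-level search
    print(byte_level_hits(text, straddler))    # byte-level search
```

Once the text is decoded to Unicode strings and searched character by character, this kind of false hit cannot occur, which is one reason the conversion step matters.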


xujiajin
2007-01-17, 09:49 PM
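Re: [Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese

--------------------------------------------------------------------------------

If you only use Concord, you should not need to add spaces, but Wordlist and Keyword do require them. First of all, though, the data must be converted to Unicode.
Why is it that when I use Concord without adding spaces, or after segmenting with ICTCLAS, the results are not the search word I typed in? Only after putting spaces around every word do I get the search word I entered. What is the reason?
----

Could you upload a sample of the text that you say was segmented with ICTCLAS? From what you describe, something seems a little odd.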

rainbow
2007-01-18, 11:45 AM
Thanks for your kindness, Dr. Xu. I tried again and solved the problem. Perhaps WordSmith was too tired and needed a restart. Just joking.

c56780
2008-09-21, 12:43 PM
http://www.corpus4u.org/picture.php?albumid=4&pictureid=11