
[Weird crap] What would you say about such a toy?


dzhigner
2005-11-14, 03:53 AM
This thing, which I call a "GugleExtractor", was born out of my craze for digging things up on the Internet. The GugleExtractor, as the name suggests, is a tool that extracts text-only snippets from Google results, which I assume are not totally insufficient. I've also worked up a weird algorithm to denoise the extracted lines by simply discarding the unclean ones. To each extracted snippet I attached a button. On click, a browser window pops up with a text-only version of the page from which the snippet is quoted. This function comes in handy when an apparently useful result doesn't have adequate co-text. As for speed, I'd dare say it gets its job done in a matter of seconds, even when all results are exhausted.

About two years ago, Google began to come in handy for me when I realized that my English sucks and that regular corpora don't give me enough clues when I write in this language. The other day, tired of staring at those jumbled Google snippets, I came up with the idea of GugleExtractor, and now I've got one. I'm boasting about it here as a trial run, and it turns out it was worth staying up until this ridiculously late hour.

So what would you say about such a toy?
http://forum.corpus4u.org/upload/forum/2005111403534162.jpg


[This post was edited by the author on 2005-11-14 at 03:59:15]

oscar3
2005-11-14, 09:23 AM
Interesting.

laohong
2005-11-14, 09:25 AM
It looks great, but do you have a trial version for us to play with?

刘语料
2005-11-14, 10:03 AM
agree with laohong.

tiger
2005-11-14, 11:08 AM
very useful in extracting texts from the internet to build a web-as-corpus archive.
can you kindly provide a trial version?

armstrong
2005-11-14, 06:17 PM
could you kindly provide a trial version? Mr.Ding.

lngzlz
2005-11-14, 07:59 PM
Terrific!

xusun575
2005-11-14, 08:48 PM
Hi, dzhigner, you're really amazing!

Haiyang
2005-11-14, 09:01 PM
To quote Linus Torvalds, to make your source code open is actually a good idea.

laohong
2005-11-14, 10:50 PM
Wouldn't it be better to have the right window display the output as concordances?

[This post was edited by the author on 2005-11-15 at 01:17:49]

ineedgerf
2005-11-14, 11:31 PM
It's not bad if we come and post and share all those wonderful ideas!

dzhigner
2005-11-15, 05:14 AM
It's being debugged and will be finished within a couple of days.
I should make it clear that this tool can only extract Google snippets. It can be used to collect a sample, but only if the sample still works with a small context.
http://forum.corpus4u.org/upload/forum/2005111504504256.gif
This tool cuts a snippet apart at "...". Take the following snippet as an example.
http://forum.corpus4u.org/upload/forum/2005111505020224.gif
Such a snippet is cut into two lines because its two parts are not continuous: they are taken from different parts of a file or page, so the length of the co-text/context is by no means guaranteed. This is why most Web-as-corpus tools should download the pages.
A tip: by adding one or two "*", the snippet can be extended.
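
As an illustration only (this is not the tool's actual VB.net code), the cutting step described above could be sketched in Python roughly like this. The function name `split_snippet` and the choice to treat the single ellipsis character as a cut point are assumptions made for the example.

```python
import re


def split_snippet(snippet: str) -> list[str]:
    """Cut a result snippet into separate lines at each "..." marker.

    Hypothetical helper: the real GugleExtractor logic is not shown in this thread.
    """
    # Treat three or more dots, or the single ellipsis character, as a cut point.
    parts = re.split(r"\.{3,}|\u2026", snippet)
    # Collapse runs of whitespace inside each piece.
    cleaned = [" ".join(part.split()) for part in parts]
    # Drop the empty pieces left over from leading or trailing markers.
    return [line for line in cleaned if line]


if __name__ == "__main__":
    example = "comes in handy when you write ... a tool that comes in handy ..."
    for line in split_snippet(example):
        print(line)
```

Each returned line comes from one continuous stretch of the page, which is why the amount of context per line cannot be guaranteed.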

I am kind of thinking about making a concordancer out of this tool.

laohong
2005-11-15, 09:42 AM
Thanks, we are eager to have a look at it.

dzhigner
2005-11-21, 03:29 AM
GugleExtractor V.1 crafted by Ding Zheng
http://forum.corpus4u.org/upload/forum/2005112103282754.rar
This tool is still very primitive. It is programmed in VB.net. To run it, .NET Framework 1.1 or higher is required.
1. Use the Google form on the left to search.
2. When the first result page has loaded, the menu item "Extract" will be enabled.
3. Use the menu item "Extract" to extract the title, snippet and/or URL of each result returned by Google. Lines which don't contain any keyword won't be displayed.
4. Specify the maximum number of results and the minimum length of each text line; this must be done before searching and extracting.
5. Choose the components of a result item, including title, snippet, URL and "icons", which are actually add-ons that open the original page or a text-only page produced by a CGI at HTTP://www.WebCorp.org.uk. This CGI only converts HTML. It won't convert PDF or anything else, for that matter.
6. The "Browser" menu only works with the browser on the left; the result browser can't go back or forward.
7. An extraction can only be started after a search has been done with the search form this tool provides. So a new extraction can be prepared in two ways: use "Reset", or use "Browser>Back" to return to the search form and run a search; the menu item "Extract" will then be enabled.
8. Although this tool is meant only to extract English text, it works with other languages, but the simple filter function won't work.
I truly need suggestions, and I can't actually guarantee that it won't go wrong. If an error occurs, the tool shuts down and won't cause any trouble. By the way, if you don't like the picture of Diogenes, just delete it.
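
For readers who want to see the filtering in steps 3 and 4 spelled out, here is a rough Python sketch. It is only an illustration under stated assumptions, not the tool's VB.net source: the `Result` class, the `extract` function, the default values, and the case-insensitive substring match are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One search result (hypothetical structure for this example)."""
    title: str
    snippet_lines: list[str]
    url: str


def extract(results, keywords, max_results=100, min_length=20):
    """Keep snippet lines that are long enough and contain a keyword.

    Returns (line, url) pairs. The defaults are placeholders, not the
    tool's real settings.
    """
    lowered = [k.lower() for k in keywords]
    kept = []
    # Only look at the first max_results results.
    for result in results[:max_results]:
        for line in result.snippet_lines:
            # Skip lines shorter than the minimum length.
            if len(line) < min_length:
                continue
            # Skip lines that contain none of the keywords
            # (assumed here: case-insensitive substring match).
            if not any(k in line.lower() for k in lowered):
                continue
            kept.append((line, result.url))
    return kept


if __name__ == "__main__":
    demo = [Result("Demo", ["this comes in handy for tests", "short"],
                   "http://example.com")]
    print(extract(demo, ["handy"], max_results=10, min_length=10))
```

Setting both limits before the loop runs mirrors step 4, where the values have to be chosen before search and extraction.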

dzhigner
2005-11-21, 03:42 AM
You can download and install the Microsoft® .NET Framework Version 1.1 Redistributable Package here:
http://www.microsoft.com/downloads/details.aspx?displaylang=zh-cn&FamilyID=262D25E3-F589-4842-8157-034D1E7CF3A3

laohong
2005-11-21, 10:43 AM
Thanks, let's make use of it.

清风出袖
2005-11-21, 12:21 PM
thanks a lot, dzhigner! you are a veritable handy man!

armstrong
2005-11-21, 06:23 PM
thanks, dzhigner!

dzhigner
2005-11-21, 10:46 PM
The text lines are, more often than not, still kind of noisy, and they look even worse when shown as centered KWIC. So I didn't make it a KWIC concordancer. I just left the lines the way they are and, of course, kind of denoised. But the denoising part only works well (and actually not very well) with English. Anyway, it comes in handy when some hypothetical expressions or collocations need to be tested.

This tool is the first trial. Whenever I come up with a new idea I will update it.


[This post was edited by the author on 2005-11-22 at 03:46:57]
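
For anyone unfamiliar with the term, a centered KWIC (keyword in context) display pads each line so the keyword sits in the same column. The following Python sketch shows the general idea only; it is not part of GugleExtractor, and the function name `kwic` and the `width` parameter are assumptions for the example.

```python
def kwic(line: str, keyword: str, width: int = 30) -> list[str]:
    """Return one centered KWIC row per occurrence of keyword in line.

    Hypothetical illustration; matching is case-insensitive and aimed
    at plain English text.
    """
    if not keyword:
        return []
    rows = []
    lower_line, lower_key = line.lower(), keyword.lower()
    start = lower_line.find(lower_key)
    while start != -1:
        end = start + len(keyword)
        # Up to `width` characters of context on each side of the keyword.
        left = line[max(0, start - width):start]
        right = line[end:end + width]
        # Right-align the left context and left-align the right context
        # so the keyword always starts at the same column.
        rows.append(f"{left:>{width}} {line[start:end]} {right:<{width}}")
        start = lower_line.find(lower_key, end)
    return rows


if __name__ == "__main__":
    for row in kwic("it comes in handy when testing collocations", "handy", 15):
        print(row)
```

With noisy snippet lines, the left or right context is often cut short, so rows like these line up on the keyword but show ragged, incomplete context around it.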

刘语料
2005-11-21, 11:03 PM
thanks a lot, dzhigner .

tiger
2005-11-23, 11:36 PM
every time i pushed the "reset" button, the attached error message appeared.
why?

http://forum.corpus4u.org/upload/forum/2005112323360255.rar



[This post was edited by the author on 2005-11-23 at 23:39:34]

[This post was edited by the author on 2005-11-23 at 23:42:19]

tiger
2005-11-24, 05:15 PM
every time i pushed the "reset" button, the attached error message appeared.
why?

http://forum.corpus4u.org/upload/forum/2005112417150794.gif

tiger
2005-11-25, 11:21 PM
has anyone met with the above problem? thanks in advance.

dzhigner
2005-11-26, 12:41 AM
MAYBE MICROSOFT.MSHTML.DLL IS NOT PROPERLY REGISTERED...

tiger
2005-11-26, 06:09 PM
what can i do then?

dzhigner
2005-12-03, 01:25 AM
Search for MICROSOFT.MSHTML.DLL.
Then, in "Start > Run", type regsvr32 followed by the path to microsoft.mshtml.dll.