How to remove the tagged content? Tag removal : remove tags
oscar3
2005-09-11, 10:55 AM
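Could anyone tell me how to remove the <> tags in the text below, together with whatever sits between each pair of tags, so that only the clean text is left? For example, I want to remove the whole of <source>Target 9:1. 1997</>. The content between <source> and </source> varies, of course. Thanks. http://forum.corpus4u.org/upload/forum/2005091110545342.jpg
I added English keywords to the title to make future retrieval easier.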
Jiajin
[This post was edited by xujiajin on 2005-09-11 at 19:12:55]
Haiyang
2005-09-11, 10:58 AM
Try replacing the parts you want to remove with a space.
oscar3
2005-09-11, 11:09 AM
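But the content between the opening tag <> and the closing tag </> varies, so finding it is the first problem. For example, #<title>你好</>,<title>谢谢</title>,<title>再见</title>......# and the like are mixed into the same text file. How can I quickly remove everything between the # marks? The method Ocean suggested does not seem to work.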
http://forum.corpus4u.org/upload/forum/2005091111315291.jpg
xiaoz
2005-09-11, 11:48 AM
Do you have many files to process? Do they have a common feature in filenames (e.g. all end in .txt, or any other common character string)? A very simple script will do the job at one go reliably. Just give more details of your filenames and the part(s) you want to remove from each file.
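For reference, the kind of script described here might look like the following Python sketch. It is only an illustration under assumptions that the thread does not confirm: the files end in .txt and sit in one folder, they are UTF-8 encoded (pass another `encoding`, for example "gb18030", if they are not), and each opening tag is closed on the same line by something like </> or </source>. The names `strip_tagged_spans` and `clean_folder` are made up for this example.

```python
import re
from pathlib import Path

# An opening tag, the text after it, and the nearest closing tag (</> or </name>).
# ".*?" is non-greedy, so several tagged spans on one line are removed one by one.
# Assumption: the opening and closing tag of a span are on the same line.
TAGGED_SPAN = re.compile(r"<[^<>/][^<>]*>.*?</[^<>]*>")


def strip_tagged_spans(text: str) -> str:
    """Remove every <tag>...</tag> span, including the text inside it."""
    return TAGGED_SPAN.sub("", text)


def clean_folder(folder: Path, pattern: str = "*.txt", encoding: str = "utf-8") -> int:
    """Rewrite every matching file in place; return how many files changed."""
    changed = 0
    for path in sorted(folder.glob(pattern)):
        original = path.read_text(encoding=encoding)
        cleaned = strip_tagged_spans(original)
        if cleaned != original:
            path.write_text(cleaned, encoding=encoding)
            changed += 1
    return changed
```

Calling `clean_folder(Path("corpus"))` would process every .txt file in a (hypothetical) folder named corpus. The files are overwritten in place, so it is safer to try this on a copy first.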
oscar3
2005-09-11, 12:08 PM
Quoting 动态语法's post of 2005-9-11 11:31:55:
http://forum.corpus4u.org/upload/forum/2005091111315291.jpg
This may strip off all the tags; however, it will leave the text between the tags.
Quoting oscar3's post of 2005-9-11 12:08:38:
Quoting 动态语法's post of 2005-9-11 11:31:55:
This may strip off all the tags; however, it will leave the text between the tags.
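OK. That's even easier. A regular expression will do.
If you are used to MS Word, the following regular expression will do: \<*\>*\<\/*\>
Here \/ is the two characters \ and / written together. (I am not sure whether it will display correctly on the forum.)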
http://forum.corpus4u.org/upload/forum/2005091113154276.jpg
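Outside MS Word, the difference between stripping only the tags and removing the tags together with their content can be sketched in Python. The two patterns below are an illustrative translation, not something posted in this thread, and the sample line is the one oscar3 gave earlier; the names `TAG_ONLY` and `TAG_WITH_CONTENT` are made up for the example.

```python
import re

# Any single tag, opening or closing: <title>, </title>, </>
TAG_ONLY = re.compile(r"</?[^<>]*>")

# An opening tag, the text after it, and the nearest closing tag
TAG_WITH_CONTENT = re.compile(r"<[^<>/][^<>]*>.*?</[^<>]*>")

sample = "#<title>你好</>,<title>谢谢</title>,<title>再见</title>......#"

# Strips the tags but leaves the text between them
tags_stripped = TAG_ONLY.sub("", sample)

# Removes each tag pair together with the text it encloses
spans_removed = TAG_WITH_CONTENT.sub("", sample)

print(tags_stripped)
print(spans_removed)
```

The first substitution leaves the tagged words in place, which is the behaviour oscar3 objected to; the second removes them along with the tags.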
This is more powerful, which may be what you wanted (I forgot about
the numbers in your sample):
In MS Word, do Search and Replace with:
\<*\>(?)*([0-9])*\<\/*\>
http://forum.corpus4u.org/upload/forum/2005091113374381.jpg
To make things easier, just copy the string
\<*\>(?)*([0-9])*\<\/*\>
from here and paste it to your MS Word.
[This post was edited by the author on 2005-09-11 at 13:54:50]
oscar3
2005-09-11, 03:42 PM
Quoting 动态语法's post of 2005-9-11 13:38:10:
This is more powerful, which may be what you wanted (I forgot about
the numbers in your sample):
In MS Word, do Search and Replace with:
\<*\>(?)*([0-9])*\<\/*\>
http://forum.corpus4u.org/upload/forum/2005091113374381.jpg
To make things easier, just copy the string
\<*\>(?)*([0-9])*\<\/*\>
from here and paste it to your MS Word.
[This post was edited by the author on 2005-09-11 at 13:54:50]
动态语法, thank you very much for your help. It works perfectly, though I don't quite understand why.
[This post was edited by the author on 2005-09-11 at 15:44:19]
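On the "why": in Word's wildcard search, * stands for any run of characters and \< and \> are literal angle brackets, so the pattern reads roughly as "an opening tag, some text, a closing tag". What matters is that the match stops at the nearest closing tag rather than the last one. The Python sketch below illustrates that point only; the two patterns are an assumed analogue of the Word search, not an exact translation, and the sample line is invented.

```python
import re

line = "Keep A <source>Target 9:1. 1997</> keep B <title>x</title> keep C"

# Non-greedy ".*?": each match ends at the nearest closing tag,
# so the text between the two tagged spans survives.
lazy = re.sub(r"<[^<>]*>.*?</[^<>]*>", "", line)

# Greedy ".*": one match runs from the first opening tag to the last
# closing tag on the line, taking " keep B " with it.
greedy = re.sub(r"<[^<>]*>.*</[^<>]*>", "", line)

print(lazy)
print(greedy)
```

With the non-greedy form only the two tagged spans disappear; the greedy form also deletes the untagged text between them, which is usually not what is wanted.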
xujiajin
2005-09-11, 07:02 PM
See also an earlier post:
How to remove tags at one go?
http://www.corpus4u.org/showthread.php?t=784
xujiajin
2005-09-12, 02:58 PM
Thanks to 动态语法 for the help.
Any ideas on how to remove the Chinese text and keep the English text?
I have looked into regular expressions but have not come up with a good solution.
xujiajin
2005-09-12, 03:00 PM
Sorry, I posted the above message in the wrong thread.