Text conversion
From Corpus4u wiki
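Corpus tools generally require the corpus to be in plain-text format. Word format is poorly suited to searching, because the file contains Word's own tags (Laohong). Some text-conversion software is listed below: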
- CutePDF: creates PDF files (freeware)
- CutePDF is strongly recommended for converting files to PDF, and it is free of charge. It can handle English, Chinese, or any other language, as well as images. The resulting PDF carries no ugly watermark. (Recommended by Richard Xiao)
- PDF2HTML: free software under the GPL
- pdftohtml is a utility which converts PDF files into HTML and XML formats.
- After you convert the PDF files to MS Word files, use the MS Word Batch Conversion Wizard template (转换向导) to convert them to text files. It can process many files in one go. (动态语法)
- PDF2TXT is a PDF-to-text conversion tool for Adobe Acrobat PDF files. It offers: 1) conversion of PDF to plain text; 2) batch conversion, so several files can be queued at once; 3) command-line support, so the program can be run in command-line mode.
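
Since corpus tools want plain text, the HTML produced by pdftohtml still has to be stripped of its markup. The following is a minimal Python sketch of that step, not a description of how any tool listed above works internally. It assumes that a pdftohtml build supporting the -stdout, -noframes and -i options is installed and on the PATH; the function names html_to_text, pdf_to_text and convert_folder are hypothetical helpers introduced here for illustration.

```python
import subprocess
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("br", "p", "div"):
            # Treat block-level tags and line breaks as newlines.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the plain text contained in an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def pdf_to_text(pdf_path):
    """Run pdftohtml on one PDF and return its text.

    Assumption: the installed pdftohtml accepts -stdout (write HTML to
    standard output), -noframes (single HTML document) and -i (ignore images).
    """
    result = subprocess.run(
        ["pdftohtml", "-stdout", "-noframes", "-i", str(pdf_path)],
        capture_output=True,
        check=True,
    )
    return html_to_text(result.stdout.decode("utf-8", errors="replace"))


def convert_folder(folder):
    """Batch step: write a .txt file next to every .pdf in the folder."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        pdf.with_suffix(".txt").write_text(pdf_to_text(pdf), encoding="utf-8")
```

Calling convert_folder on a directory of PDF files would leave a UTF-8 text file beside each PDF. The tag stripping is deliberately simple, so multi-column layouts or tables may need manual checking before the text goes into a corpus.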
