[Corpus Release] Buddhist Sacred-Texts Corpus

Posted by ArthurW on 2016-12-09 in the "Corpus Collection" forum

  1. BUDDHIST SACRED TEXTS CORPUS

    Source: http://www.sacred-texts.com/bud
    Compiler: Jiayue Wang
    Time: 8 December 2016

    The texts were extracted from web pages downloaded from the website. Each line that begins with a hash sign (#) identifies the source web page by its relative path on the website.

    The corpus was created in a Linux environment and is encoded in UTF-8 with Unix-style line endings (LF).
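
    The marker convention described above can be used to split the corpus back into per-page texts. The following is a minimal sketch, not the compiler's own tooling. It assumes that a marker line starts with "#" and that the rest of the line is the relative path; the function names and the sample paths in the usage note are hypothetical.

```python
def split_corpus(lines):
    """Group text lines under the most recent '#' marker line.

    Assumption: a marker line starts with '#', and the remainder of
    the line (stripped of whitespace) is the page's relative path.
    """
    pages = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            # A new page begins; remember its path as the current key.
            current = line[1:].strip()
            pages.setdefault(current, [])
        elif current is not None:
            # Ordinary text line: attach it to the current page.
            pages[current].append(line)
    return pages


def load_corpus(path):
    """Read a UTF-8, LF-terminated corpus file and split it by page."""
    with open(path, encoding="utf-8", newline="\n") as fh:
        return split_corpus(fh)
```

    For example, split_corpus(["# ami/ami01.htm\n", "first line\n"]) yields a dict with the single key "ami/ami01.htm". Because a Python dict keeps insertion order, the pages come back in the order they appear in the file, which may need re-sorting where the extraction order is wrong.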

    Notes:
    1. A small portion of the text was extracted from "index" and other web pages that contain website commentary and similar material rather than Buddhist texts.
    2. Although text extraction followed filename order, e.g. ami01.txt > ami02.txt > ami03.txt, texts may occasionally appear out of order.
    3. Use of the corpus data is restricted to non-commercial purposes.
    4. The corpus can be freely re-distributed, provided the readme file is kept in the package.

    ----
    Jiayue Wang arthur0421[AT]163.com
    College of Foreign Studies
    Guangxi University for Nationalities
    Nanning 530006
    China
     

    Attachments: