A 9-billion-word USENET corpus for free download

laohong

Administrator
Staff member
A USENET corpus (2005-2007) [BETA VERSION]

This corpus is a collection of public USENET postings. It was collected between Oct 2005 and Jan 2007, and covers 47,860 English-language, non-binary-file newsgroups. Despite our best efforts, this corpus includes a very small number of non-English words, non-words, and spelling errors. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.

Processing: All NNTP headers were discarded. All message bodies that had the same 128-bit SHA-1 hash as other message bodies were discarded (reducing duplication of documents from cross-posts). To reduce the amount of garbage data and non-English text in the corpus, the following pre-processing steps were taken:

All documents that were shorter than 500 words or longer than 500,000 words were omitted.

Documents that contained less than 90% English words were omitted. (English words were defined as words contained in a 100,000-word dictionary of English.)
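
For anyone who wants to apply similar filtering to their own newsgroup data, here is a minimal Python sketch of the de-duplication and filtering steps described above. It is not the authors' code. The function names, the whitespace tokenization, the punctuation stripping, and the choice to keep the first copy of each duplicated body are all assumptions made for illustration. Note that Python's hashlib SHA-1 digest is 160 bits, so this sketch compares full digests.

```python
import hashlib
import string


def body_digest(body):
    """Return the hex SHA-1 digest of a message body (160 bits, 40 hex chars)."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def english_ratio(words, dictionary):
    """Fraction of words found in the dictionary (a set of lowercase words)."""
    if not words:
        return 0.0
    # Assumption: strip surrounding punctuation and lowercase before lookup.
    hits = sum(1 for w in words if w.strip(string.punctuation).lower() in dictionary)
    return hits / len(words)


def filter_documents(bodies, dictionary, min_words=500, max_words=500_000,
                     min_ratio=0.90):
    """Yield bodies that survive de-duplication, length, and English filters."""
    seen = set()
    for body in bodies:
        digest = body_digest(body)
        if digest in seen:
            continue  # same body already seen, e.g. a cross-post
        seen.add(digest)  # assumption: the first copy is kept
        words = body.split()  # assumption: whitespace tokenization
        if len(words) < min_words or len(words) > max_words:
            continue
        if english_ratio(words, dictionary) < min_ratio:
            continue
        yield body
```

The dictionary argument stands in for the 100,000-word English list mentioned above; any set of lowercase words will do for experimenting.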

To anonymize the text, we also did the following:
Replaced all of the obvious e-mail addresses with the token <EMAILADDRESS>.
Replaced all of the obvious HTTP URLs with the token <URL>, and news URLs with <NEWSURL>.
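
The replacements above could be approximated with a few regular expressions. The sketch below is only a guess at what "obvious" means here: the patterns and the anonymize function name are hypothetical, not taken from the corpus authors, and they will miss unusual addresses and URLs.

```python
import re

# Assumed patterns: deliberately simple, aimed at "obvious" cases only.
# URLs may not end in common sentence punctuation, so "…/page." keeps its period.
NEWS_URL_RE = re.compile(r'\bnews:[^\s<>"]*[^\s<>".,;:!?)]', re.IGNORECASE)
HTTP_URL_RE = re.compile(r'\bhttps?://[^\s<>"]*[^\s<>".,;:!?)]', re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def anonymize(text):
    """Replace news URLs, HTTP URLs, and e-mail addresses with tokens."""
    # URLs go first so that an "@" inside a URL is not treated as an address.
    text = NEWS_URL_RE.sub("<NEWSURL>", text)
    text = HTTP_URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAILADDRESS>", text)
    return text
```

Running the URL patterns before the e-mail pattern is a design choice in this sketch; the original announcement does not say in which order the replacements were made.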

Corpus size: over 9 billion words
Data size: over 12 GB, compressed
Last Update: March 16th, 2007

Download the corpus at:
http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

Citation: Shaoul, C. & Westbury, C. (2007). A USENET corpus (2005-2007). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)

Acknowledgments: This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

If you have any questions about this corpus, please contact Cyrus Shaoul.
 
Re: A 9-billion-word USENET corpus for free download

Before filling in the form, I hesitated:

How is this corpus useful to me? I can't figure out any potential use in my case.

Do you have any ideas?
 