Search results

  1. W

    Collocation statistics MI, t, z...

    Re: Collocation statistics MI, t, z... That accounts for it. Thanks.
  2. W

    Collocation statistics MI, t, z...

    Re: Collocation statistics MI, t, z... Great discussion. Thanks, Dr. Xiao. At a seminar at the University of Birmingham, Pernilla once demonstrated a comparison of two collocate lists: one by t-score, the other by MI. Once sorted, it is strange to see that the two lists are almost utterly...
  3. W

    [Help] N-Gram, ngram

    Re: [Help] N-Gram Thanks a lot for this meticulous discussion. But I have to clarify that lemmatization and POS tagging are two different processes, self-evident as that is. But lemmatization is still based on preconceived notions about which word forms should be lemmatized to what. One...
  4. W

    [Help] How to ignore all the tags in CLEC?

    Re: [Help] How to ignore all the tags in CLEC? In that case, you'll have to use WordSmith and make a tag file for yourself. Please consult the help file for detailed instructions. And there are other ways... 1. Work with Word and replace the tags with a new and unique markup, delete the unwanted...
  5. W

    [Help] "90 Sentences and headings" in WS3

    Simply ignore them. Double-click a concordance line or a type in the wordlist and you will see the text in which the line is located. The text can be displayed in both sentence mode and paragraph mode, as WordSmith claims, but it always throws out confusing things. The counting of headings is...
  6. W

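    [Help] How to ignore all the tags in CLEC?

    Re: [Help] How to ignore all the tags in CLEC? Edit -> Replace -> (select 'Use wildcards') -> in 'Find what' type: \<*\> then click 'Replace All', then type: \[*\] and click 'Replace All'. Whether to use the tags in CLEC depends on your own research needs. But clean text with no inserted codes is always useful: researchers can either tag it themselves or run other analyses on it....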
  7. W

    [Help] How to ignore all the tags in CLEC?

    Re: [Help] How to ignore all the tags in CLEC? Back up the text files and delete all the tags using Word or PowerGREP, and you will get a clean text copy.
  8. W

    [Help] N-Gram, ngram

    Re: [Help] N-Gram More about 'form' Yes, you can say a form has something to do with lemma, in that different 'forms' can be lemmatized as one base form. But I feel there's more to it. In corpus investigation, a unique form is not simply morphologically or grammatically distinctive, but carries...
  9. W

    [Help] N-Gram, ngram

    Re: [Help] N-Gram Not exactly. A token in a corpus is actually a running word, and the term contrasts with 'type': the occurrences of a type are tokens. We don't often say 'character' for the English language. The word 'word' is a highly ambiguous one, so sometimes one tends to use 'form' or 'word...
  10. W

    Parallel image text corpus of Chinglish 开心译站

    Great idea! It might be helpful to categorize such sources, say in terms of media or locations, for later processing and retrieval. Visit also http://www.silverladder.com/literature/chinglish/ And a discussion on Chinglish: http://www.cjvlang.com/Spicks/fengqing.html
  11. W

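    Building a computer-assisted tagging tool with Word: no programming required

    Re: Building a computer-assisted tagging tool with Word: no programming required Thanks to xujiajin for the additional explanation. In fact, Word's AutoText, macros and automatic macro recording, and VBA can all be used for text-entry or tagging work that involves a great deal of repetition. The macro approach is a very good idea: after a text fragment is selected, it can insert both the start and end codes at once. Those interested can try the tool PowerGREP, which works quite well for counting and querying tagged text and has very powerful text-manipulation features. Its website is: http://www.powergrep.com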
  12. W

    [Discussion] How much weight should corpora carry in research?

    Re: [Discussion] How much weight should corpora carry in research? Yes, corpus-informed sounds more neutral. I love the term. A corpus is, as a matter of fact, one of the information sources.
  13. W

    [Discussion] How much weight should corpora carry in research?

    Re: [Discussion] How much weight should corpora carry in research? There are different users of corpora: corpus-referenced or corpus-supported: the corpora are used to support or illustrate some prefabricated ideas. The use of the corpus evidence is quite opportunistic as the process is quite selective: the user only incurs the instances that...
  14. W

    Building a computer-assisted tagging tool with Word: no programming required

    Such a tool (my word-tagger) is useful for manual tagging, where only the human being knows where to insert what. And POS tagging and parsing can be done automatically using the software. Suppose you need to concentrate on certain specific features and like to dig them out from the text, you...
  15. W

    Is there a concordancer that can search Chinese?

    MLCT java concordancer; Concordance by R. J. C. Watt; WS 4.0 by Mike Scott; PowerGREP 3.0; ParaConc; RegexBuddy; ConcApp, and more.... but all have limited functionality and are good for demonstration only.
  16. W

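    Building a computer-assisted tagging tool with Word: no programming required

    At Dr Xu Jiajin's suggestion, I am posting the following to share. I hope it helps a little. Building a computer-assisted tagging tool with MS Word, by Li Wenzhong. 1. Tagging scheme and tagset. Suppose you have already prepared an annotation scheme (it may be based on your own theoretical framework, or come from a pilot study of the corpus), including the names and their corresponding codes, e.g. [fm1] for a spelling error. The codes can be very complex or very simple, depending mainly on your own needs. The finished annotation scheme is called a "tagset"; it is best to carry out a pilot analysis before designing it and to do trial tagging (trial...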
  17. W

    How to write a thesis based on spoken data?

    We have just developed the "College Learner Spoken English Corpus" (COLSEC), a sister corpus of the COLEC (College Learner English Corpus), the CD-ROM of which will be published together with our book by Shanghai Foreign Languages Education Press in October this year. Hope it helps. That...
  18. W

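    [Corrections] Some problems with transcription and annotation in CLEC

    Re: [Original] Some formatting oversights in the CLEC st 3 and st 4 subcorpora The sampling of the CLEC subcorpus, the College Learner English Corpus, is described in An Introduction to Corpus Linguistics (语料库语言学导论). When the exam-essay portion was sampled, the papers were drawn, with help from the CET-4/6 examination committee, directly from the storeroom of old exam papers, which held papers from every province. The method was to draw one booklet out of every ten and filter out essays scoring below 6; more than 2,000 essays were drawn in total. A trial sampling was done before the formal sampling. The free-composition portion is somewhat more concentrated: some from several universities in Zhengzhou, some from Henan Normal University, and the rest from several universities in Shanghai and Guangzhou. Some free-writing data from Tsinghua University was added later. The collection of the free-writing data as a whole cannot be called random sampling, mainly because manpower and resources fell short. When conditions are ripe, sampling could be organized on a large scale, which would probably make the data more representative....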
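
The wildcard replacement described in result 6 (find \<*\> and replace all, then find \[*\] and replace all) can also be done outside Word. The following is only a minimal Python sketch of the same idea, not the poster's method: the function name `strip_tags` and the sample sentence (including the header-style tag `<ST 3>`) are made up for illustration, and only the [fm1] code comes from the posts above. As result 7 advises, back up the text files first.

```python
import re

# These two patterns play the role of the Word wildcard searches \<*\> and \[*\]:
# an opening bracket, anything that is not a bracket, then the closing bracket.
ANGLE_TAG = re.compile(r"<[^<>]*>")
SQUARE_TAG = re.compile(r"\[[^\[\]]*\]")


def strip_tags(text: str) -> str:
    """Remove <...> and [...] tags, then collapse the spaces they leave behind."""
    text = ANGLE_TAG.sub("", text)
    text = SQUARE_TAG.sub("", text)
    # Deleting a tag between two words leaves a double space; squeeze it to one.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


# Hypothetical sample line; real CLEC files may use different header tags.
sample = "<ST 3> I recieve [fm1] your letter yesterday."
print(strip_tags(sample))
```

Unlike a bare "match anything" pattern, the character classes here stop at the next bracket, so two tags on one line are removed separately and the text between them is kept.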