1. hittle2008

    Question: how to calculate the log-likelihood test value and the mutual information (MI) value

    There is a detailed explanation on http://ucrel.lancs.ac.uk/llwizard.html
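For readers who want to check the numbers by hand, here is a minimal sketch of the two-corpus log-likelihood calculation as I understand it from that page, plus a plain pointwise MI score. The function names and argument names are my own, and the MI version assumes no span adjustment, so concordancers that adjust for span will give different values.

```python
import math

def log_likelihood(a, b, c, d):
    """a, b: frequency of the word in corpus 1 and corpus 2.
    c, d: total number of words in corpus 1 and corpus 2."""
    # Expected frequencies if the word were spread evenly over both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero frequency contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise MI (log base 2) of a word pair.
    Assumption: no span adjustment; n is the corpus size in tokens."""
    return math.log2(f_xy * n / (f_x * f_y))

print(log_likelihood(20, 10, 100, 100))
print(mutual_information(10, 10, 10, 100))
```

A word that is equally frequent in two corpora of the same size gets a log-likelihood of 0; the value grows as the two relative frequencies diverge.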
  2. hittle2008


    Nice job!
  3. hittle2008


    POS-tag the corpus first, then filter out the auxiliary words by their tag so that they are left out of your statistics.
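A rough sketch of the filtering step, assuming your tagger gives you (word, tag) pairs. The sample tokens and the tag "u" for auxiliary words are assumptions for illustration only; check the tagset of the tagger you actually use.

```python
from collections import Counter

# Hypothetical tagger output as (word, tag) pairs.
tagged = [("我", "r"), ("的", "u"), ("书", "n"), ("了", "u"), ("书", "n")]

# Assumption: "u" is the tag your tagset uses for auxiliary words.
EXCLUDED_TAGS = {"u"}

def content_counts(tagged_tokens, excluded=EXCLUDED_TAGS):
    """Count words, skipping every token whose tag is in `excluded`."""
    return Counter(word for word, tag in tagged_tokens if tag not in excluded)

print(content_counts(tagged))
```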
  4. hittle2008


    WST (WordSmith Tools) gives you TTR (type/token ratio) statistics once you have produced a word list from the corpus. That would be the easiest way for you, though there are other ways.
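One of the other ways is a few lines of script. This is only a sketch: the tokenizer is a simple regular expression of my own for lowercase English words, so the figures will probably not match WST exactly, since the result depends on how tokens are defined.

```python
import re

def type_token_ratio(text):
    """Return types / tokens for a piece of English text."""
    # Assumption: a token is a run of letters, optionally with one apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("The cat and the dog"))
```

Keep in mind that raw TTR drops as texts get longer, so only compare texts of similar length.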
  5. hittle2008


    Use regular expressions to strip the tags from the XML files and save them as plain text, or keep only the POS tags if you need them. BTW, WST does have a text converter that works with BNC XML files and converts them to plain text.
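A minimal sketch of the regular-expression route. It assumes BNC-XML-style markup, i.e. a `<teiHeader>` block and `<w ... pos="...">` word elements; the sample string is made up for illustration, so adjust the patterns to what your files actually contain.

```python
import re
from html import unescape

# Made-up sample in the assumed BNC XML style.
SAMPLE = ('<s n="1"><w c5="AT0" hw="the" pos="ART">The </w>'
          '<w c5="NN1" hw="cat" pos="SUBST">cat </w><c c5="PUN">.</c></s>')

def detag(xml_text):
    """Drop the header and all tags, leaving plain text."""
    body = re.sub(r"<teiHeader\b.*?</teiHeader>", "", xml_text, flags=re.DOTALL)
    no_tags = re.sub(r"<[^>]+>", "", body)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()

def words_with_pos(xml_text):
    """Keep only (word, pos) pairs from the <w> elements."""
    pattern = r'<w\b[^>]*\bpos="([^"]+)"[^>]*>([^<]*)</w>'
    return [(m.group(2).strip(), m.group(1)) for m in re.finditer(pattern, xml_text)]

print(detag(SAMPLE))
print(words_with_pos(SAMPLE))
```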
  6. hittle2008


  7. hittle2008

    How to open and use VerbNet

    It's not designed to be human-readable, so you need to use it through its Java API, which you can find here (https://verbs.colorado.edu/verb-index/inspector/). After compiling it, you can run a command by following the examples. I used 'java vn.Inspector ../new_vn -i -Va -Oown-100.xml ' to...
  8. hittle2008


    If you have downloaded the XML version of the BNC, you can try to extract the genre or text category information from the header with an XML parser such as BeautifulSoup or lxml in Python, or save yourself some trouble and go to http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php; they have the...
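A sketch of the header route. To keep it dependency-free it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup or lxml. It assumes the genre label sits in a `<classCode scheme="DLEE">` element inside the header; the sample document is made up, so check one of your own files before relying on those names.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily shortened sample in the assumed BNC XML layout.
SAMPLE = """<bncDoc>
  <teiHeader>
    <profileDesc>
      <textClass>
        <classCode scheme="DLEE">W fict prose</classCode>
      </textClass>
    </profileDesc>
  </teiHeader>
  <wtext type="FICTION"></wtext>
</bncDoc>"""

def genre_of(xml_source):
    """Return the genre label from the header, or None if it is missing."""
    root = ET.fromstring(xml_source)
    for code in root.iter("classCode"):
        if code.get("scheme") == "DLEE":  # assumption: genre scheme name
            return (code.text or "").strip()
    return None

print(genre_of(SAMPLE))
```

For real files you would read each file's contents and pass them to `genre_of`, collecting the results in a dictionary keyed by file name.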
  9. hittle2008


    The easiest way is to use a POS tagger or segmenter that lets you supply your own user dictionary (usually one term per line), such as Jieba or ZPar, but you need to have access to a Linux machine.
  10. hittle2008


    If you don't know how to write an XML file with a scripting or programming language, you can save your annotated file as a CSV file and convert it with either an online or an offline converter.
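If you do want to try the scripting route, the conversion is short. This is a sketch using only the Python standard library; the element names `annotations` and `item` are placeholders of mine, and it assumes the first CSV row holds column names that are valid XML element names (no spaces, not starting with a digit).

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text, root_tag="annotations", row_tag="item"):
    """Turn each CSV row into one element, with one child per column."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = ET.SubElement(root, row_tag)
        for column, value in row.items():
            ET.SubElement(item, column).text = value
    return ET.tostring(root, encoding="unicode")

print(csv_to_xml("word,tag\ncat,NN1\n"))
```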
  11. hittle2008


  12. hittle2008

    NLPIR (ICTCLAS2014): calling it from Python 3.3 to segment multiple files

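    1. Prerequisites: 1) You need Python 3.0 or later installed; 2) You need to download the python-NLPIR wrapper package and modify the relevant file parameters; see (http://www.52ml.net/14450.html) for details; 3) Rename my modified CALLAPI.txt from the attachment to CALLAPI.py and put it in the root directory of the wrapper package you have modified; 4) Use a text editor to change my file path in CALLAPI.py (the path="' statement) to the path where your corpus is stored; 5) Run the file; it will generate result files in the corpus directory, named in the format original filename + result.txt. 2. Notes: 1) Does not work on Mac...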
  13. hittle2008


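    Re: Question about how to calculate POS tagging accuracy. For POS tagging, this is how I understand the calculation: precision = number of correctly tagged tokens / number of all tokens given that tag; recall = number of correctly tagged tokens / (number of correctly tagged tokens + number of tokens that belong to that tag but were not given it). One addition: these counts must of course come from a manually checked sample. There should be similar articles on the Chinese journal network; download one and have a look, and if I've got it wrong, please let me know.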
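The precision and recall definitions above can be sketched for a single tag like this. The function name and the sample tag sequences are made up for illustration; `gold` stands for the manually checked tags and `predicted` for the tagger's output on the same tokens.

```python
def precision_recall(gold, predicted, tag):
    """Precision and recall for one tag over two aligned tag sequences."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == tag and p == tag)
    assigned = sum(1 for p in predicted if p == tag)  # all tokens given this tag
    actual = sum(1 for g in gold if g == tag)         # all tokens that should have it
    precision = correct / assigned if assigned else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall

gold = ["n", "v", "n", "u", "n", "n"]
predicted = ["n", "n", "n", "u", "v", "v"]
print(precision_recall(gold, predicted, "n"))
```

Here the tagger assigns "n" three times and two of those are right, while four tokens should have been "n", so precision is 2/3 and recall is 2/4.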
  14. hittle2008


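    Re: Question about how to calculate POS tagging accuracy. What I mean is: 1. The tagsets used by different taggers differ in size and content, so they are hard to compare. 2. Your understanding is right: the larger and more balanced the tagger's training corpus, the more representative the reported tagging accuracy. Even so, results will still vary when the tagger is applied to different corpora, sometimes by a lot. Why are you looking into this now?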
  15. hittle2008


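    Re: Question about how to calculate POS tagging accuracy. Every tagger's tagset is somewhat different, so how do you compare them? First you have to establish whether the different tagsets correspond to each other completely; how do you classify the cases where categories overlap or one includes another? The recall or precision figures are probably affected quite strongly by how representative the training and test datasets are.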