How do I extract a specific word set from the BNC and build a word list of its most frequent nouns?

Posted by xiaokan007 on 2016-03-01 in the "Corpus Retrieval" forum

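  1. For my thesis, I plan to build a translation corpus of English texts translated from Chinese and to study how "verb + noun" collocations are used in it, with the BNC as a native-speaker reference corpus for comparison. The plan is as follows: first, extract the highest-frequency nouns from the BNC to form a word list; next, retrieve the "verb + noun" collocations of these high-frequency nouns; then, using the same word list as the baseline, extract the corresponding "verb + noun" collocations from the translation corpus; finally, analyze how the collocations in the native-speaker corpus differ from those in the translation corpus.
    The problem is that there are two key technical steps I have not yet worked out:

    One: how do I extract a frequency wordlist of nouns from the BNC?

    Two: once the nouns are extracted, how do I retrieve their "verb + noun" collocations?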
     
  2. If you have downloaded the BNC XML edition, you can extract the genre or text category information from each file's header with an XML parser such as BeautifulSoup or lxml in Python. Alternatively, save yourself some trouble and use BNCweb at http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php, which provides annotated register information for your query results.
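
As a rough illustration of the header lookup suggested in this reply, here is a minimal sketch. It uses the standard-library `xml.etree.ElementTree` rather than BeautifulSoup or lxml so that it runs without extra installs. It assumes the header stores the genre in a `<classCode>` element and the category codes in a `<catRef targets="...">` attribute; the sample document and the function name `text_category` are made up for illustration, so check them against your own BNC files.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for one BNC XML file (hypothetical content; the element
# names follow what the BNC XML header is assumed to look like).
SAMPLE = """<bncDoc xml:id="X00">
  <teiHeader>
    <profileDesc>
      <textClass>
        <catRef targets="WRI ALLTIM3"/>
        <classCode scheme="DLEE">W ac:medicine</classCode>
      </textClass>
    </profileDesc>
  </teiHeader>
  <wtext type="ACPROSE"><div><p><s n="1"><w c5="NN1" hw="test" pos="SUBST">Test</w></s></p></div></wtext>
</bncDoc>"""


def text_category(root):
    """Return the genre label and category codes found in the header."""
    code = root.find(".//classCode")  # assumed location of the genre label
    cat = root.find(".//catRef")      # assumed location of the category codes
    return {
        "class_code": code.text.strip() if code is not None and code.text else None,
        "cat_refs": cat.get("targets", "").split() if cat is not None else [],
    }


info = text_category(ET.fromstring(SAMPLE))
print(info["class_code"], info["cat_refs"])
```

For a file on disk, `ET.parse(path).getroot()` gives the same kind of root element; looping over all files and grouping by `class_code` would let you keep only the registers you care about.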
     
  3. I found that someone has already provided the frequency list. Thanks a lot! By the way, do you have any advice on making concordances with the BNC XML Edition? Although Xaira is recommended for this, I failed to build the index with it. You can check the procedure I followed on my blog: https://i4language.wordpress.com/2016/04/01/how-to-use-xaira-to-deal-with-bnc_xml_edition/
     
  4. Use regular expressions to strip the tags from the XML files and get plain text, or keep only the POS tags if you need them. BTW, WST (WordSmith Tools) does have a text converter that works with BNC XML files and converts them to plain text.
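
The regex approach from the last reply could look roughly like the sketch below, which also shows one way to get the noun frequency list asked about in the first post. It assumes each token is marked up as `<w c5="..." hw="..." pos="...">form </w>`, with the CLAWS5 tag in `c5` and the headword in `hw`; the sample sentences and the function names (`detag`, `tagged`, `noun_frequencies`) are hypothetical, not part of any tool mentioned in the thread.

```python
import html
import re
from collections import Counter

# Two made-up sentences in the assumed BNC XML token markup.
SAMPLE = """<s n="1"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN2" hw="student" pos="SUBST">students </w><w c5="VVD" hw="make" pos="VERB">made </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="decision" pos="SUBST">decision</w><c c5="PUN">.</c></s>
<s n="2"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="student" pos="SUBST">student </w><w c5="VVD" hw="leave" pos="VERB">left</w><c c5="PUN">.</c></s>"""

TAG = re.compile(r"<[^>]+>")                    # any XML tag
W_ELEM = re.compile(r"<w\b([^>]*)>([^<]*)</w>")  # one word element
ATTR = re.compile(r'(\w+)="([^"]*)"')            # name="value" pairs


def detag(xml):
    """Strip all tags, decode entities, and normalize whitespace."""
    text = html.unescape(TAG.sub("", xml))
    return re.sub(r"\s+", " ", text).strip()


def tagged(xml):
    """Keep only the POS tags, as word_TAG tokens."""
    tokens = []
    for attrs, form in W_ELEM.findall(xml):
        c5 = dict(ATTR.findall(attrs)).get("c5", "UNK")
        tokens.append(f"{html.unescape(form).strip()}_{c5}")
    return " ".join(tokens)


def noun_frequencies(xml):
    """Count headwords whose c5 tag starts with NN (common nouns)."""
    counts = Counter()
    for attrs, _form in W_ELEM.findall(xml):
        a = dict(ATTR.findall(attrs))
        # Ambiguity tags such as NN1-VVB also start with NN and are counted.
        if a.get("c5", "").startswith("NN"):
            counts[a.get("hw", "")] += 1
    return counts


print(detag(SAMPLE))
print(tagged(SAMPLE))
print(noun_frequencies(SAMPLE).most_common(10))
```

Counting on `hw` groups inflected forms under one headword; counting on the surface form instead would give a word-form list. Summing the counters over all files yields the corpus-wide list, and the `word_TAG` output can be searched for a verb tag followed within a few tokens by one of the listed nouns as a first pass at "verb + noun" candidates.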