[CODE SNIPPET] Building a word frequency list in VB.NET

dzhigner

Moderator
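This simple snippet counts English word frequencies; with minor changes it can also handle Chinese text that has already been word-segmented.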
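As a rough illustration of the "minor changes" for word-segmented Chinese, here is a minimal sketch. It assumes the segmenter separates words with ASCII or full-width spaces, and it drops the Latin-letter regex filter entirely; the module name, the helper CountTokens and the sample sentence are made up for the example and are not part of the original snippet.

```vbnet
Imports System
Imports System.Collections
Imports Microsoft.VisualBasic

Module SegmentedCountDemo
    ' Adds the tokens of one segmented line to the shared count table.
    ' Assumption: words are separated by ASCII or full-width spaces.
    Sub CountTokens(ByVal line As String, ByVal counts As Hashtable)
        Dim delims As Char() = New Char() {" "c, ChrW(&H3000)}
        For Each token As String In line.Split(delims)
            ' Split yields empty strings between adjacent delimiters; skip them
            If token.Length > 0 Then
                If counts.Contains(token) Then
                    counts(token) = CType(counts(token), Integer) + 1
                Else
                    counts.Add(token, 1)
                End If
            End If
        Next
    End Sub

    Sub Main()
        Dim counts As New Hashtable
        ' Hypothetical pre-segmented sentence
        CountTokens("我 爱 北京 我 爱 语料库", counts)
        ' A Hashtable does not keep insertion order, so the output order may vary
        For Each entry As DictionaryEntry In counts
            Console.WriteLine(CType(entry.Key, String) & vbTab & CType(entry.Value, Integer).ToString())
        Next
    End Sub
End Module
```

A real corpus would probably still need a filter to throw out punctuation tokens; which pattern to use depends on the segmenter's output.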
 
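Private Sub BUILDWLIST(ByVal FLNM As String)
    Dim WDUPCS, LINE As String
    Dim WORDS As String()
    Dim READER As System.IO.StreamReader
    Dim WDREGEX As System.Text.RegularExpressions.Regex
    Dim HASHTBL As New Hashtable
    Dim STRDELIMS As String
    ' Characters that separate tokens: punctuation, the double quote and the space
    STRDELIMS = ".,;:!#$^&()<>+=/\|`~ " & Chr(34) & Chr(32)
    Dim DELIMS As Char() = STRDELIMS.ToCharArray
    ' StreamReader takes the path and the encoding; it has no file-mode argument
    READER = New System.IO.StreamReader(FLNM, System.Text.Encoding.GetEncoding("gb2312"))
    ' A token must start with a letter and continue with letters, hyphens or
    ' apostrophes (at least two characters in all).
    ' 9 is a sum of RegexOptions constants; the bare number is accepted only with Option Strict Off
    WDREGEX = New System.Text.RegularExpressions.Regex("^\b[A-Za-z][A-Za-z\-']+\b$", 9)
    ' ReadLine returns Nothing at end of file, so test the raw line before trimming it
    LINE = READER.ReadLine
    Do While Not LINE Is Nothing
        LINE = Trim(LINE)
        If Not LINE = "" Then
            WORDS = LINE.Split(DELIMS)
            Dim en As IEnumerator = WORDS.GetEnumerator()
            While en.MoveNext
                WDUPCS = CType(en.Current, String).ToUpper
                If WDREGEX.IsMatch(CType(en.Current, String)) Then
                    ' Count under the upper-cased form so that case variants are merged
                    If Not HASHTBL.Contains(WDUPCS) Then
                        HASHTBL.Add(WDUPCS, 1)
                    Else
                        HASHTBL.Item(WDUPCS) = CType(HASHTBL.Item(WDUPCS), Integer) + 1
                    End If
                End If
            End While
        End If
        LINE = READER.ReadLine
    Loop
    READER.Close()
    ' Number of distinct words, then one "word<TAB>count" line per word
    Console.WriteLine(HASHTBL.Count)
    Dim ENDIC As IDictionaryEnumerator = HASHTBL.GetEnumerator
    Do While ENDIC.MoveNext
        Console.WriteLine(CType(ENDIC.Key, String) & vbTab & CType(ENDIC.Value, String))
    Loop
    HASHTBL = Nothing
End Sub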
Is this right? Please take a look.
 
Re: [CODE SNIPPET] Building a word frequency list in VB.NET

Quoting xujiajin's post of 2005-8-5 14:15:59:
(the BUILDWLIST listing and the question from the post above)

[This post was edited by the author on 2005-08-05 at 14:17:22]

I made a big mistake: I forgot to post the code as text. You have done me a great favor, and I can't thank you enough.
 
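Counts word frequencies
Filters tokens with a regular expression
Controls the case of the output
Maps word forms to lemmas through a lemma list
Supports multiple languages
Can read multiple files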
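The "multiple files" item can be sketched roughly as follows: one count table is shared and every file adds to it. This is only an illustration under stated assumptions, not the tool's actual code; the helper CountFile, the delimiter set and the two temporary files are invented for the example, and the regex filter is left out to keep it short.

```vbnet
Imports System
Imports System.Collections
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module MultiFileCountDemo
    ' Adds the tokens of one file to the shared table (hypothetical helper)
    Sub CountFile(ByVal fileName As String, ByVal enc As Encoding, ByVal delims As Char(), ByVal counts As Hashtable)
        Dim reader As New StreamReader(fileName, enc)
        Try
            ' ReadLine returns Nothing at end of file
            Dim line As String = reader.ReadLine()
            Do While Not line Is Nothing
                For Each token As String In line.Split(delims)
                    If token.Length > 0 Then
                        Dim key As String = token.ToUpper()
                        If counts.Contains(key) Then
                            counts(key) = CType(counts(key), Integer) + 1
                        Else
                            counts.Add(key, 1)
                        End If
                    End If
                Next
                line = reader.ReadLine()
            Loop
        Finally
            ' Close the file even if reading fails
            reader.Close()
        End Try
    End Sub

    Sub Main()
        ' Two small temporary files stand in for real corpus files
        Dim f1 As String = Path.GetTempFileName()
        Dim f2 As String = Path.GetTempFileName()
        File.WriteAllText(f1, "the cat sat", Encoding.UTF8)
        File.WriteAllText(f2, "The dog sat", Encoding.UTF8)

        Dim counts As New Hashtable
        Dim delims As Char() = " .,;:!?".ToCharArray()
        For Each f As String In New String() {f1, f2}
            CountFile(f, Encoding.UTF8, delims, counts)
        Next

        ' THE and SAT each occur once per file, so both totals are 2
        Console.WriteLine("THE" & vbTab & CType(counts("THE"), Integer).ToString())
        Console.WriteLine("SAT" & vbTab & CType(counts("SAT"), Integer).ToString())

        File.Delete(f1)
        File.Delete(f2)
    End Sub
End Module
```

Keeping the table outside the per-file routine is the whole point: the same routine can then be called for any number of files.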

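Does anyone have further suggestions for this word-counting tool? If you think it is of some use, I will turn it into a finished program; for now I only run it inside the VB environment.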
 
dzhigner, I'm a layman here; could you still check whether the script above has any problems? Thanks.
 
Re: [CODE SNIPPET] Building a word frequency list in VB.NET

Quoting xujiajin's post of 2005-8-6 0:22:58:
dzhigner, I'm a layman here; could you still check whether the script above has any problems? Thanks.

Here is a class module as an example:
Imports System.Collections
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Public Class BUILD_WORDLIST_DEMO1
    ' Running total of tokens accepted by the word regex
    Dim TK As Integer

    ' Counts the words in file FLNM and returns the list as "word<TAB>count" lines.
    ' HASH2 maps upper-cased word forms to lemmas; it is consulted only when OUTPUT_UPCS is True.
    Private Function BUILDWLIST(ByVal FLNM As String, ByVal WDRegex As Regex, ByVal STRDELIMS As String, ByVal OUTPUT_UPCS As Boolean, ByVal ENCODING As Encoding, ByVal HASH2 As Hashtable) As StringBuilder
        Dim WDUPCS, LINE As String
        Dim WORDS As String()
        Dim READER As StreamReader = Nothing
        Dim HASHTBL As New Hashtable
        Dim en As IEnumerator
        Dim DELIMS As Char() = STRDELIMS.ToCharArray
        Dim BOO As New StringBuilder
        Try
            READER = New StreamReader(FLNM, ENCODING)
            Do While Not READER.Peek < 0
                LINE = Trim(READER.ReadLine)
                If Not LINE = "" Then
                    WORDS = LINE.Split(DELIMS)
                    en = WORDS.GetEnumerator()
                    Do While en.MoveNext
                        If OUTPUT_UPCS Then
                            WDUPCS = CType(en.Current, String).ToUpper
                            ' Replace a word form by its lemma when the list has one
                            If HASH2.Contains(WDUPCS) Then
                                WDUPCS = CType(HASH2.Item(WDUPCS), String)
                            End If
                        Else
                            WDUPCS = CType(en.Current, String)
                        End If
                        ' The regex is tested against the raw token, not the lemma
                        If WDRegex.IsMatch(CType(en.Current, String)) Then
                            TK = TK + 1
                            If Not HASHTBL.Contains(WDUPCS) Then
                                HASHTBL.Add(WDUPCS, 1)
                            Else
                                HASHTBL.Item(WDUPCS) = CType(HASHTBL.Item(WDUPCS), Integer) + 1
                            End If
                        End If
                    Loop
                End If
            Loop
            READER.Close()
            Console.WriteLine(HASHTBL.Count)
            Dim ENDIC As IDictionaryEnumerator = HASHTBL.GetEnumerator
            Do While ENDIC.MoveNext
                BOO.Append(CType(ENDIC.Key, String) & vbTab & CType(ENDIC.Value, String) & vbCrLf)
            Loop
            HASHTBL = Nothing
            Return BOO
        Catch ex As Exception
            MsgBox(ex.ToString)
            ' READER is still Nothing if the file could not be opened
            If Not READER Is Nothing Then READER.Close()
            HASHTBL = Nothing
            ' BOO is returned as it stands (possibly empty)
            Return BOO
        End Try
    End Function


    Public Sub MAIN()
        TK = 0
        Dim HASH As New Hashtable
        Dim LINE As String
        Dim WORDS As String()
        Dim DELI As Char()
        ReDim DELI(0)
        DELI(0) = "="c ' the lemma list used in this example has entries of the form abolished=abolish
        ' 9 is a sum of RegexOptions constants; the bare number is accepted only with Option Strict Off
        Dim REGX As New Regex("^\b[A-Za-z\-]+\b", 9)
        Dim EC As Encoding = Encoding.UTF8
        Dim STRDELI As String = ".,;:!#$^&()<>+=/\'?|`~ " & Chr(34) & Chr(32)
        Dim FN As String = "D:\ENGLISH_CORPORA\RAW\BROWN_SENTENCE.TXT"
        Dim SB As StringBuilder
        ' Load the lemma list into HASH: word form -> lemma, both upper-cased
        Dim LEMREADER As StreamReader = New StreamReader("D:\ENGLISH_CORPORA\LEMMALIST.TXT", EC)
        Do While Not LEMREADER.Peek < 0
            LINE = Trim(LEMREADER.ReadLine)
            WORDS = LINE.Split(DELI)
            If WORDS.GetUpperBound(0) >= 1 Then
                If Not HASH.Contains(WORDS(0).ToUpper) Then
                    HASH.Add(WORDS(0).ToUpper, WORDS(1).ToUpper)
                End If
            End If
        Loop
        LEMREADER.Close()
        SB = BUILDWLIST(FN, REGX, STRDELI, True, EC, HASH)
        If Not SB Is Nothing Then
            Console.WriteLine("TOTAL:" & TK)
            Console.Write(SB.ToString)
        End If
    End Sub
End Class



'**************************************************************
An example of calling the class above:
Private Sub MenuItem5_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MenuItem5.Click
    Dim k As New BUILD_WORDLIST_DEMO1
    k.MAIN()
End Sub
'**************************************************************
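To make the lemma list step easier to follow, here is a small sketch that works on in-memory data instead of files. It assumes the abolished=abolish format mentioned in the code comment; the sample entries, the sample sentence and the module name are invented for illustration.

```vbnet
Imports System
Imports System.Collections
Imports Microsoft.VisualBasic

Module LemmaLookupDemo
    Sub Main()
        ' Hypothetical lemma list lines in the form=lemma format
        Dim lemmaLines As String() = {"abolished=abolish", "abolishes=abolish", "went=go"}

        ' Build the lookup table: upper-cased word form -> upper-cased lemma
        Dim lemmas As New Hashtable
        For Each entry As String In lemmaLines
            Dim parts As String() = entry.Split("="c)
            If parts.Length >= 2 AndAlso Not lemmas.Contains(parts(0).ToUpper()) Then
                lemmas.Add(parts(0).ToUpper(), parts(1).ToUpper())
            End If
        Next

        ' Count tokens under their lemma when one is known, else under the token itself
        Dim counts As New Hashtable
        For Each token As String In "He abolished what she abolishes".Split(" "c)
            Dim key As String = token.ToUpper()
            If lemmas.Contains(key) Then key = CType(lemmas(key), String)
            If counts.Contains(key) Then
                counts(key) = CType(counts(key), Integer) + 1
            Else
                counts.Add(key, 1)
            End If
        Next

        ' Both inflected forms are counted under ABOLISH, so the total is 2
        Console.WriteLine("ABOLISH" & vbTab & CType(counts("ABOLISH"), Integer).ToString())
    End Sub
End Module
```

Because the lookup happens before counting, the output list contains ABOLISH once with the combined frequency, not ABOLISHED and ABOLISHES as separate entries.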



[This post was edited by the author on 2005-08-06 at 04:14:20]
 
Re: [CODE SNIPPET] Building a word frequency list in VB.NET

("^\b[A-Za-z][A-Za-z\-']+\b$", 9)
I'm reading old threads to learn. What does the ", 9" in the expression above mean? Thanks!
 
Re: [CODE SNIPPET] Building a word frequency list in VB.NET

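("^\b[A-Za-z][A-Za-z\-']+\b$", 9)
I'm reading old threads to learn. What does the ", 9" in the expression above mean? Thanks!
Those are regular expression options, case sensitivity and the like. I haven't touched this stuff for several months, so I can't recall all of it. But this trick is one you must know: there is only one slot for options, a so-called flag, so what do you do when you need several options, for example both single-line mode and a case setting? Simple: add up the constants that correspond to the options. 9 is the sum of two such option constants, and because each constant is a distinct power of two, that sum can never equal the sum of any other combination of options. It really is a wonderful technique.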
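For readers who want to check which options 9 stands for, the following sketch may help. In the .NET RegexOptions enumeration, IgnoreCase is 1 and Compiled is 8, so 9 is IgnoreCase plus Compiled; Singleline is 16 and is therefore not part of 9. The module and variable names are made up for the example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexFlagsDemo
    Sub Main()
        ' Convert the bare number into the enumeration type
        Dim opts As RegexOptions = CType(9, RegexOptions)

        ' ToString on a flags enumeration lists the members that are set
        Console.WriteLine(opts.ToString())

        ' The named way to write the same value: 1 Or 8 = 9
        Dim named As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Compiled
        Console.WriteLine(opts = named)

        ' Test for a single flag with a bitwise And; Singleline (16) is not in 9
        Console.WriteLine((opts And RegexOptions.Singleline) = RegexOptions.Singleline)

        ' The options value is passed as the second constructor argument
        Dim rx As New Regex("^\b[A-Za-z\-]+\b", opts)
        Console.WriteLine(rx.IsMatch("word"))
    End Sub
End Module
```

Writing RegexOptions.IgnoreCase Or RegexOptions.Compiled in place of 9 gives the same behavior, documents itself, and also compiles with Option Strict On.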
 