Thanks to ArthurW for sharing. For MSTTR and MATTR, see pages 66 and 67 of
https://cran.r-project.org/web/packages/koRpus/koRpus.pdf
MSTTR (Mean Segmental Type-Token Ratio):
(1) The text is split into fixed-length segments (e.g., 100 words per segment).
(2) For each segment, the TTR is calculated as the ratio of unique words (types) to the total number of words (tokens) in that segment.
(3) The TTR values from all the segments are averaged to produce the MSTTR.
This is essentially how WordSmith computes STTR (Standardized TTR).
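The three MSTTR steps above can be sketched in Python roughly as follows. This is only an illustration, not the koRpus or WordSmith code: the function name `msttr`, the whitespace tokenization in the usage lines, and the choice to drop leftover words that do not fill a whole segment are all my assumptions.

```python
def msttr(tokens, segment_size=100):
    """Mean of the TTRs of consecutive, non-overlapping segments.

    Hypothetical helper for illustration. Assumption: trailing tokens
    that do not fill a complete segment are ignored.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    n_segments = len(tokens) // segment_size
    if n_segments == 0:
        raise ValueError("text is shorter than one segment")

    ttrs = []
    for i in range(n_segments):
        # Step 1: take the i-th fixed-length segment.
        segment = tokens[i * segment_size:(i + 1) * segment_size]
        # Step 2: TTR = unique words (types) / total words (tokens).
        ttrs.append(len(set(segment)) / segment_size)

    # Step 3: average the per-segment TTRs.
    return sum(ttrs) / n_segments


# Usage with a toy text and a tiny segment size (assumed: lowercase,
# whitespace-split tokens; a real study would use a proper tokenizer).
toy_tokens = "a b c d a a a a x".split()
toy_msttr = msttr(toy_tokens, segment_size=4)
```

In the toy text the first segment has TTR 4/4 and the second 1/4, and the final "x" is left out because it does not fill a segment.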
MATTR (Moving-Average Type-Token Ratio): the "Moving-Average" here is also called "Moving-Window".
(1) A window size (e.g., 100 words) is chosen.
(2) The TTR is calculated for the first window of words (e.g., the first 100 words).
(3) The window is then shifted by one word, and the TTR is recalculated for the new window.
(4) The final MATTR value is the average TTR across all the windows.
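The four MATTR steps above can likewise be sketched in Python. Again this is just a minimal illustration under my own assumptions (the function name `mattr` is hypothetical, and the tokens are assumed to be already split and normalized); it is deliberately the naive version that recomputes each window from scratch, which is easy to read but slow on long texts.

```python
def mattr(tokens, window_size=100):
    """Average TTR over every window of `window_size` consecutive tokens.

    Hypothetical helper for illustration; the window moves forward
    one token at a time.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    # Step 1: the window size fixes how many windows fit in the text.
    n_windows = len(tokens) - window_size + 1
    if n_windows < 1:
        raise ValueError("text is shorter than one window")

    ttrs = []
    for start in range(n_windows):
        # Steps 2 and 3: TTR of the current window, then shift by one word.
        window = tokens[start:start + window_size]
        ttrs.append(len(set(window)) / window_size)

    # Step 4: average the TTRs of all windows.
    return sum(ttrs) / n_windows


# Usage with a toy text and a tiny window (two windows of four tokens).
toy_tokens = "a a b b c".split()
toy_mattr = mattr(toy_tokens, window_size=4)
```

Unlike MSTTR, the windows overlap, so every token contributes and nothing at the end of the text is discarded.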
Both methods aim to overcome the problem that TTR is strongly affected by text length or corpus size.
Cf. Limitations of TTR:
https://www.sketchengine.eu/glossary/type-token-ratio-ttr/