[Help] Lexical density tools needed


valeriazuo
2005-09-18, 09:20 PM
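Dear senior members of the corpus community, I work on functional grammar and am writing my master's thesis. I need software that can compute lexical density. Could you please tell me which software can do this and where I can download it? Thanks.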

[This post was edited by xujiajin on 2005-09-18 at 21:46:11]

xujiajin
2005-09-18, 09:48 PM
Can you describe the expected output of your lexical density tools?
By functional grammar, do you mean systemic-functional grammar or something else?

xiaoz
2005-09-18, 10:13 PM
ACWT should be able to do that.

xusun575
2005-09-18, 10:25 PM
pls try this site: http://textalyser.net/

xujiajin
2005-09-18, 10:42 PM
Quoting xiaoz, posted 2005-9-18 22:13:40:
ACWT should be able to do that.


Richard has a good memory.

http://forum.corpus4u.org/upload/forum/2005091822404312.jpg

xujiajin
2005-09-18, 10:47 PM
Quoting xusun575, posted 2005-9-18 22:25:59:
pls try this site: http://textalyser.net/


Yes. LD analysis can be found at the site, where it is reported as "Complexity factor (Lexical Density)".

frankliang
2005-09-18, 11:29 PM
There are several vocabulary tools that can be downloaded here:

http://www.swan.ac.uk/cals/calsres/lognostics.htm
They may measure lexical depth, lexical richness, etc.

frankliang
2005-09-18, 11:46 PM
And here is Paul Nation's wonderful vocabulary tool:

http://www.vuw.ac.nz/lals/staff/paul-nation/RANGE32.zip
http://www.corpus4u.org/upload/forum/2005091823480834.gif

动态语法
2005-09-19, 04:37 AM
Beware of the different ways of calculating lexical density. ACWT
offers just two of the commonly used methods. So when you use
an online LD analyzer, you need to know how its author defines LD
(or any other concept, for that matter).

Whatever method you choose to use, there should be a theoretical
justification for it.

xiaoz
2005-09-19, 07:48 AM
A conventional way to measure lexical density is to use the TTR (type-token ratio). When comparing texts of different sizes, the standardised TTR is recommended. Both are computed automatically when a word list is created with WordSmith's WordList.
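For readers without WordSmith, the plain type-token ratio can be sketched in a few lines of Python. This is only an illustration: the function names `tokenize` and `type_token_ratio` are made up for this example, and the crude regular-expression tokenizer is an assumption, so the figures will not necessarily match what WordSmith reports for the same text.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Return (distinct word forms / running words) * 100, or 0.0 for an empty text."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)

sample = "the cat sat on the mat and the dog sat on the rug"
# 8 distinct forms over 13 running words
print(round(type_token_ratio(tokenize(sample)), 1))  # 61.5
```

Because every extra repetition of a common word lowers the ratio, longer texts tend to score lower, which is why a length-standardised variant is suggested for comparisons.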

valeriazuo
2005-09-19, 10:53 AM
Thanks for all of your suggestions and detailed links. Dear Mr. Xu, I intend to investigate the distinctions between spoken and written language. I am required to write my thesis within the framework of FG because my supervisor's interests are FG and discourse analysis. I want to adopt Ure's or Halliday's way of calculating lexical density. The output may show that students' oral language is similar to written language rather than to real spoken language.

frankliang
2005-09-19, 12:25 PM
TTR has been criticised a lot for its sensitivity to text length, and Malvern and Richards have proposed a different measure, which is much less sensitive to text length. Their tool is available at http://www.swan.ac.uk/cals/calsres/lognostics.htm
Related literature:
Durán, P., Malvern, D., Richards, B., & Chipere, N. (2004). Developmental trends in lexical diversity. Applied Linguistics, 25 (2), 220-242.
Malvern, D., & Richards, B. (2002). Investigating accommodation in language proficiency interviews using a new measure of lexical diversity. Language Testing, 19 (1), 85-104.

清风出袖
2005-09-19, 04:04 PM
Thanks a lot, Dr. Xiao and Mr. 动态语法! Do you know of any books available in China that cover the justification of lexical density? Thanks also for reminding me of the necessity of checking which way of measuring lexical density is being used!

valeriazuo
2005-09-19, 04:28 PM
Mr. Xiao, thanks for your advice, but where can I get ACWT?

xiaoz
2005-09-19, 05:02 PM
TTR is sensitive to text length. That's why I said standardised TTR is to be used when comparing texts of different lengths...


Quoting frankliang, posted 2005-9-19 12:25:16:
TTR has been criticised a lot for its sensitivity to text length, and Malvern and Richards have proposed a different measure, which is much less sensitive to text length. Their tool is available at http://www.swan.ac.uk/cals/calsres/lognostics.htm
Related literature:
Durán, P., Malvern, D., Richards, B., & Chipere, N. (2004). Developmental trends in lexical diversity. Applied Linguistics, 25 (2), 220-242.
Malvern, D., & Richards, B. (2002). Investigating accommodation in language proficiency interviews using a new measure of lexical diversity. Language Testing, 19 (1), 85-104.

xiaoz
2005-09-19, 05:03 PM
Sigh...
http://www.corpus4u.org/showthread.php?t=798

Quoting valeriazuo, posted 2005-9-19 16:28:45:
Mr. Xiao, thanks for your advice, but where can I get ACWT?

frankliang
2005-09-19, 07:08 PM
Dr. Xiaoz, while I agree that standardized TTR is better than TTR, I am more inclined to accept the view that standardized TTR is not really a good measure either, as it only takes into account part (say, the first 1000 words) of each text. As a result, a good deal of the data goes unused. Therefore, using standardized TTR will lead to a waste of data.
Do you agree?

valeriazuo
2005-09-19, 09:17 PM
Mr. Xiao, thanks for your link. What a pity! I couldn't get it to work. I opened a text file and applied the tool "Calculate LD (a la Ure/Stubbs)" to it, but did not get the expected result. A dialog box popped up asking me to enter the number of content words and the corpus size. In fact, I want the tool to count the content words and the corpus size by itself. How can I make it do that? Thanks a lot for your kind advice.

xiaoz
2005-09-19, 10:14 PM
Not exactly just the first thousand words of each text. Here is what Mike says about STTR:

"The standardised type/token ratio (STTR) is computed every n words as Wordlist goes through each text file. By default, n = 1,000. In other words the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with less than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)"
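The procedure in the quoted description can be sketched roughly as follows. This is a reading of the quoted text, not WordSmith's actual code: the name `standardised_ttr` is made up here, and dropping a trailing chunk shorter than n is an assumption, since the quote does not say how leftovers are handled.

```python
def standardised_ttr(tokens, n=1000):
    """Average the TTR (as a percentage) over consecutive n-token chunks.

    Assumption: a trailing chunk shorter than n is ignored.
    As in the quoted description, texts shorter than n get 0.0.
    """
    ratios = []
    for start in range(0, len(tokens) - n + 1, n):
        chunk = tokens[start:start + n]
        ratios.append(100.0 * len(set(chunk)) / n)
    if not ratios:
        return 0.0
    return sum(ratios) / len(ratios)

# Toy example with n=4: the first chunk has 4 types, the second only 1.
tokens = "a b c d a a a a".split()
print(standardised_ttr(tokens, n=4))  # 62.5
```

Every full chunk contributes to the average, so the whole text is used apart from the final partial chunk. For short learner texts, a smaller n keeps the result from collapsing to 0.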


Quoting frankliang, posted 2005-9-19 19:08:47:
Dr. Xiaoz, while I agree that standardized TTR is better than TTR, I am more inclined to accept the view that standardized TTR is not really a good measure either, as it only takes into account part (say, the first 1000 words) of each text. As a result, a good deal of the data goes unused. Therefore, using standardized TTR will lead to a waste of data.
Do you agree?

xiaoz
2005-09-19, 10:18 PM
The author of this tool should be able to answer this question of yours and help in this respect...

Quoting valeriazuo, posted 2005-9-19 21:17:10:
Mr. Xiao, thanks for your link. What a pity! I couldn't get it to work. I opened a text file and applied the tool "Calculate LD (a la Ure/Stubbs)" to it, but did not get the expected result. A dialog box popped up asking me to enter the number of content words and the corpus size. In fact, I want the tool to count the content words and the corpus size by itself. How can I make it do that? Thanks a lot for your kind advice.

动态语法
2005-09-19, 11:07 PM
Quoting valeriazuo, posted 2005-9-19 21:17:10:
Mr. Xiao, thanks for your link. What a pity! I couldn't get it to work. I opened a text file and applied the tool "Calculate LD (a la Ure/Stubbs)" to it, but did not get the expected result. A dialog box popped up asking me to enter the number of content words and the corpus size. In fact, I want the tool to count the content words and the corpus size by itself. How can I make it do that? Thanks a lot for your kind advice.


You used ACWT properly but overestimated its capabilities. It doesn't
automatically count the numbers of function words and content words.
(I am not aware of any lexical tools that do this automatically, and what counts
as function words/content words has to be decided by the researcher.)

That being said, it doesn't seem terribly hard to work out the number of
what you believe to be content/function words in your corpus.
Here are some suggestions:

1) Use an English/Chinese POS tagger to tag your corpus first;
2) Use a program to search/calculate the frequencies of the tags (not words)
of the function words in your definition;
3) Use the Ure/Stubbs method in ACWT to calculate the LD value.

The reason for searching function word tags in step 2) is that function words
tend to be a more limited set than content word classes. But you could do either
content or function classes and use the total corpus size to figure out the size
of the other class.
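The three steps above could be approximated in code along these lines. This is a minimal sketch under several assumptions: the corpus has already been POS-tagged into (word, tag) pairs, the set `FUNCTION_TAGS` is a purely illustrative definition that a researcher would replace with their own, and `lexical_density` is a made-up name, not ACWT's routine. The density is taken as content words divided by total words, times 100.

```python
from collections import Counter

# Assumption: the researcher's own list of tags that mark function words.
FUNCTION_TAGS = {"AT", "ATI", "CC", "IN"}

def lexical_density(tagged_tokens, function_tags=FUNCTION_TAGS):
    """Return content words as a percentage of all words.

    Function-word tags are counted first; content words are then
    derived as corpus size minus function words.
    """
    tag_counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(tag_counts.values())
    if total == 0:
        return 0.0
    function_total = sum(tag_counts[tag] for tag in function_tags)
    return 100.0 * (total - function_total) / total

tagged = [("the", "ATI"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "ATI"), ("mat", "NN")]
print(lexical_density(tagged))  # 50.0
```

Counting the smaller, closed set of function tags and subtracting from the corpus size keeps the tag list short, but the result is only as good as the tag set chosen.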

frankliang
2005-09-19, 11:45 PM
Quoting Xiaoz, posted 2005-9-19 19:08:47:
Not exactly just the first thousand words of each text. Here is what Mike says about STTR:

"The standardised type/token ratio (STTR) is computed every n words as Wordlist goes through each text file. By default, n = 1,000. In other words the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with less than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)"
===========

I see. In that case, I will probably base my calculations on every 100 or 200 words rather than every 1000 words, as most learner texts are not that long. Thank you!

[This post was edited by the author on 2005-09-19 at 23:48:50]

xiaoz
2005-09-20, 12:36 AM
Function words (in any language) are expected to sit on top of a word frequency list. It should not be very difficult to have a rough estimate of their total frequencies using a POS tagged corpus.

动态语法
2005-09-20, 02:41 AM
Quoting xiaoz, posted 2005-9-20 0:36:41:
Function words (in any language) are expected to sit on top of a word frequency list. It should not be very difficult to have a rough estimate of their total frequencies using a POS tagged corpus.



If the user knows the tagset (see a sample of the CLAWS1 tagset below) and
has access to such tools as WordSmith Tools, s/he can use the 'file-based
concordance' feature of WS Tools to search just those tags that mark
'function' words. It shouldn't be too hard a thing to do, I would think, even
though I haven't tried it myself.

Similar results can conceivably be obtained by using RegExp tools (e.g. PowerGrep).

In the worst case scenario the user just has to search each category and get a sum
out of the individual searches.

-----Sample Tags for Function Words-----
AT
singular article (a, an, every)
ATI
article (the, ze, no)
CC
co-ordinating conjunction (and, or, but, so, then, yet, only, for)
DTX
determiner/double conjunction (either, neither)
EX
existential THERE

....
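The regular-expression route mentioned in this post might look like the sketch below. It assumes the tagged corpus is stored as plain text in a `word_TAG` layout separated by whitespace, which is only one of several common formats; the function name `count_tags` and the sample sentence are invented for illustration, and the tag set passed in is just the five sample tags listed above.

```python
import re
from collections import Counter

def count_tags(tagged_text, tags):
    """Count tokens in 'word_TAG' text whose tag is one of `tags`."""
    counts = Counter()
    # The tag is whatever follows the last underscore of each token.
    for match in re.finditer(r"\S+_([A-Z0-9$*]+)(?=\s|$)", tagged_text):
        tag = match.group(1)
        if tag in tags:
            counts[tag] += 1
    return counts

sample_tags = {"AT", "ATI", "CC", "DTX", "EX"}
tagged_text = "There_EX is_BEZ a_AT cat_NN and_CC the_ATI dog_NN"
per_tag = count_tags(tagged_text, sample_tags)
print(sum(per_tag.values()))  # 4
```

The per-tag counts correspond to the individual searches described above, and their sum gives the function-word total to set against the corpus size.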

valeriazuo
2005-09-21, 07:01 PM
My Godness. It's a really hard job for me to follow all of your instructions but I'll have a try anyway. Thanks for all of your considerate advice. Thanks a lot.

xiaoz
2005-09-21, 07:42 PM
"My Godness", you appear to know little about POS tagging and about the CLAWS tagset. You will find more info in the section of Corpus Tagging and Annotation on this site.

动态语法
2005-10-24, 01:36 AM
Following up on this old thread...

Lexical density (LD) is an attractive concept, but can anyone point me
to some really nice studies that take advantage of (some version of)
LD? I know there is a large literature in language development using
LD, but in corpus linguistics, apart from Stubbs, very few people seem
to have used this concept.

jjm
2007-06-21, 10:00 AM
Looking for these articles: 1. Developmental trends in lexical diversity
2. Investigating accommodation in language proficiency interviews using a new measure of lexical diversity
3.A New Measure of Lexical Diversity

laohong
2007-06-21, 10:08 AM
Looking for these articles: 1. Developmental trends in lexical diversity
2. Investigating accommodation in language proficiency interviews using a new measure of lexical diversity
3.A New Measure of Lexical Diversity


Author? Journal/Book? Publisher? Date?

jjm
2007-06-22, 08:47 AM
1. Durán, P., Malvern, D., Richards, B., & Chipere, N. (2004). Developmental trends in lexical diversity. Applied Linguistics, 25 (2), 220-242.
2. The Lexical Profile of Second Language Writing: Does It Change Over Time?
B Laufer - RELC Journal, 1994

jjm
2007-06-22, 08:56 AM
Not exactly just the first thousand words of each text. Here is what Mike says about STTR:

"The standardised type/token ratio (STTR) is computed every n words as Wordlist goes through each text file. By default, n = 1,000. In other words the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with less than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)"

How to Trace the Growth in Learners' Active Vocabulary? A Corpus-based Study
Author: Agnieszka Leńko-Szymańska
Source: Language and Computers, Teaching and Learning by Doing Corpus Analysis. Proceedings of the Fourth International Conference on Teaching and Language Corpora, Graz 19-24 July, 2000. KETTEMANN, Bernhard and Georg MARKO (Eds.), pp. 217-230(14)
Publisher: Rodopi
In this article, the author mentions the Type/Token Ratio, the Standardised Type/Token Ratio and the Mean Type/Token Ratio. I wonder what the difference is between the Standardised Type/Token Ratio and the Mean Type/Token Ratio,
and how to obtain the Mean Type/Token Ratio.

jjm
2007-07-23, 05:00 PM
pls try this site: http://textalyser.net/

I have tried this tool, but it only handles texts of up to 1,000 words.

jjm
2007-07-23, 06:53 PM
Richard has a good memory.

http://forum.corpus4u.org/upload/forum/2005091822404312.jpg

If we use ACWT to calculate lexical density, we first need to know the number of content words. How do we get it?