New suite of freely available natural language processing (NLP) tools


We are happy to announce the release of six freely available NLP tools that measure lexical sophistication, text cohesion, syntactic complexity, sentiment, social positioning, and cognition.

The tools are easy to use through a graphical user interface, are housed on the user's hard drive (no internet connection required), work on most operating systems (Windows, Mac, Linux), and allow for batch processing of text files.

All of the tools described below can be downloaded at www.kristopherkyle.com/tools.

TAALES. TAALES (Kyle & Crossley, 2015) incorporates about 200 indices related to basic lexical information (i.e., the number of words and n-grams, the number of word and n-gram types), lexical frequency (i.e., how many times an item occurs in a reference corpus), lexical range (i.e., how many documents in a reference corpus an item occurs in), psycholinguistic word information (e.g., concreteness, familiarity, meaningfulness), academic language (i.e., items that occur more frequently in an academic corpus than in a general-use corpus) for both single words and multi-word units (n-grams such as bigrams and trigrams), strength of association, contextual distinctiveness, word neighbor information, lexical decision times, age of exposure, and semantic lexical relations (hypernymy and polysemy).
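For readers curious how frequency- and range-style indices work in principle, the sketch below computes toy versions against a small reference corpus. It is only an illustration of the general idea: the token counts, reference corpus, and lack of normalization here are simplifying assumptions, not TAALES's actual reference corpora or index calculations.

```python
from collections import Counter

def lexical_indices(text, reference_docs):
    """Toy frequency and range indices in the spirit of TAALES
    (not its actual reference corpora or normalization)."""
    tokens = text.lower().split()
    types = set(tokens)
    # Frequency: total occurrences of each token across the reference corpus
    ref_counts = Counter(t for doc in reference_docs for t in doc.lower().split())
    # Range: number of reference documents in which a token appears at least once
    ref_range = Counter(t for doc in reference_docs for t in set(doc.lower().split()))
    return {
        "n_tokens": len(tokens),
        "n_types": len(types),
        "mean_frequency": sum(ref_counts[t] for t in tokens) / len(tokens),
        "mean_range": sum(ref_range[t] for t in tokens) / len(tokens),
    }
```

Higher mean frequency and range values indicate a text built from more common, widely distributed words, i.e., lower lexical sophistication.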

TAASSC. TAASSC measures large- and fine-grained clausal and phrasal indices of syntactic complexity as well as usage-based frequency/contingency indices of syntactic sophistication. TAASSC includes 14 indices measured by Lu's (2010, 2011) Syntactic Complexity Analyzer (SCA), 31 fine-grained indices of clausal complexity, 132 fine-grained indices of phrasal complexity, and 190 usage-based indices of syntactic sophistication. The SCA measures are classic measures of syntax based on t-unit analyses (Wolfe-Quintero et al., 1998; Ortega, 2003). The fine-grained clausal indices calculate the average number of particular structures per clause and the average number of dependents per clause. The fine-grained phrasal indices measure seven noun phrase types and ten phrasal dependent types. The syntactic sophistication indices are grounded in usage-based theories of language acquisition (Ellis, 2002a; Goldberg, 1995; Langacker, 1987) and measure the frequency, type-token ratio, attested items, and association strengths for verb-argument constructions (VACs) in a text.
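The usage-based sophistication indices can be illustrated with the sketch below. TAASSC derives VACs from dependency parses; here, for simplicity, we assume the VACs have already been extracted as strings, and the reference frequency table is a hypothetical stand-in for TAASSC's actual reference data.

```python
def vac_sophistication(text_vacs, reference_counts):
    """Toy usage-based indices over verb-argument constructions (VACs).
    text_vacs: VAC labels already extracted from a text (an assumption;
    TAASSC derives these from dependency parses).
    reference_counts: hypothetical VAC frequency table from a reference corpus."""
    freqs = [reference_counts.get(v, 0) for v in text_vacs]
    return {
        # Mean reference-corpus frequency of the text's VACs
        "mean_vac_frequency": sum(freqs) / len(text_vacs),
        # Type-token ratio over VACs: variety of constructions used
        "vac_ttr": len(set(text_vacs)) / len(text_vacs),
        # Proportion of VACs attested (frequency > 0) in the reference corpus
        "prop_attested": sum(f > 0 for f in freqs) / len(text_vacs),
    }
```

A writer who reuses a few very common constructions would score high on frequency and attestedness but low on the type-token ratio.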

SEANCE. SEANCE is a sentiment analysis tool that relies on a number of pre-existing sentiment, social positioning, and cognition dictionaries. SEANCE contains a number of pre-developed word vectors designed to measure sentiment, cognition, and social order. These vectors are taken from freely available source databases such as SenticNet (Cambria, Speer, Havasi, & Hussain, 2010; Cambria, Havasi, & Hussain, 2012) and EmoLex (Mohammad & Turney, 2010, 2013). In some cases, the vectors are populated by a small number of words and should be used only on larger texts that provide greater linguistic coverage in order to avoid non-normal distributions of data (e.g., the Lasswell dictionary lists [Lasswell & Namenwirth, 1969] and the Geneva Affect Label Coder [GALC; Scherer, 2005] lists). For many of these vectors, SEANCE also provides a negation feature (i.e., a contextual valence shifter; Polanyi & Zaenen, 2006) that ignores positive terms that are negated. The negation feature, which is based on Hutto and Gilbert (2014), checks for negation words in the three words preceding a target word. SEANCE also includes the Stanford part-of-speech (POS) tagger (Toutanova, Klein, Manning, & Singer, 2003) as implemented in Stanford CoreNLP (Manning et al., 2014). The POS tagger allows for POS-specific indices for nouns, verbs, and adjectives.
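The negation check described above (look for a negator within the three preceding words) can be sketched in a few lines. The negator list below is illustrative only, not SEANCE's actual lexicon.

```python
# Illustrative negator list; SEANCE's actual lexicon differs
NEGATORS = {"not", "no", "never", "n't", "cannot"}

def positive_hits(tokens, positive_words, window=3):
    """Count hits from a positive-sentiment dictionary, skipping any hit
    with a negator in the preceding `window` tokens (after Hutto &
    Gilbert, 2014, as described for SEANCE's negation feature)."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok in positive_words:
            preceding = tokens[max(0, i - window):i]
            if not any(w in NEGATORS for w in preceding):
                hits += 1
    return hits
```

So "not a good movie" contributes no positive hit for "good", while "a good movie" does.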

CLA. CLA is a simple but powerful text analysis tool that allows users to analyze texts using very large custom dictionaries. In addition to words, custom dictionaries can include n-grams and wildcards.
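The sketch below shows the general idea of dictionary lookup with n-gram and wildcard entries, using shell-style `*` wildcards as an assumed wildcard syntax; CLA's actual matching rules and dictionary format may differ.

```python
import fnmatch

def dictionary_counts(text, dictionary):
    """Toy CLA-style lookup: count matches of custom dictionary entries
    (single words, n-grams, and '*' wildcards) in a text.
    Wildcard syntax here is an assumption (shell-style via fnmatch)."""
    tokens = text.lower().split()
    counts = {}
    for entry in dictionary:
        n = len(entry.split())  # n-gram length of this dictionary entry
        # All n-grams of the same length as the entry
        ngrams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        counts[entry] = sum(1 for g in ngrams if fnmatch.fnmatchcase(g, entry))
    return counts
```

Here the entry "run*" matches "run", "runs", and "running", and the bigram entry "and run*" matches two-word sequences.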

SiNLP. SiNLP is a simple tool that allows users to analyze texts with regard to the number of words, number of types, TTR, letters per word, number of paragraphs, number of sentences, and number of words per sentence for each text. In addition, users can analyze texts with regard to their own custom dictionaries.
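These descriptive indices are straightforward to compute. The sketch below uses naive tokenization and sentence splitting (on terminal punctuation) as simplifying assumptions; SiNLP's actual segmentation rules may differ.

```python
import re

def basic_indices(text):
    """Toy versions of SiNLP-style descriptive indices.
    Sentence and word segmentation here are naive assumptions."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Naive sentence split on runs of terminal punctuation
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    types = set(tokens)
    return {
        "n_words": len(tokens),
        "n_types": len(types),
        "ttr": len(types) / len(tokens),
        "letters_per_word": sum(len(t) for t in tokens) / len(tokens),
        "n_paragraphs": len(paragraphs),
        "n_sentences": len(sentences),
        "words_per_sentence": len(tokens) / len(sentences),
    }
```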

Scott Crossley, Ph.D.
Department of Applied Linguistics/ESL
Georgia State University