Monday, 23 March 2009

From corpus to clusters: Gries and Divjak's suggested methodology

In Behavioral profiles: a corpus approach to cognitive semantic analysis (to appear), Gries and Divjak propose a methodology for approaching polysemy that is both empirical and grounded in the Cognitive Linguistics (CL) framework. The authors' methodology is of interest for my project because I adopt an empirical approach, I follow the CL framework, and the words I investigate (i.e. may, can and pouvoir) are all polysemous lexical items.


In their introduction, the authors:

i) Review the treatment of polysemy in CL

ii) Present existing issues behind the identification of the prototypical sense(s) of a word

iii) Claim that a more sophisticated quantitative approach to corpus investigation would provide cognitive-linguistically relevant results


Gries and Divjak describe their methodology as one that “is radically corpus-based because it relies on the correlation between distributional patterns and functional characteristics to a much larger extent than most previous cognitive-linguistic work” (p.60). The authors claim that their methodology “aims at providing the best of both worlds, i.e. a precise, quantitative corpus-based approach that yields cognitive-linguistically relevant results” (p.60).

Method:

A four-step method based on the concept of ID tags (cf. Atkins 1987) and the notion of the Behavioral Profile (cf. Hanks 1996).

The method assumes that “the words or sense investigated are part of a network of words/senses”:

“In this network, elements which are similar to each other are connected in such a way that the strength of the connection reflects the likelihood that the elements display similar behavior with respect to phonological, syntactic, semantic or other type of linguistic behaviour” (p.61)

The four stages:

Stages 1-3 are concerned with data processing.

Stage 4 is concerned with meaningful data evaluation.

  1. The retrieval of all instances of a word’s lemma from a corpus
  2. A manual analysis of many properties of the word form (i.e. the annotation of the ID tags)
  3. The generation of a co-occurrence table
  4. The evaluation of the table by means of exploratory and other statistical techniques

Data processing:

Stage 1: a concordance program is used to retrieve all hits of the word's lemma

Stage 2: all hits are annotated for ID tags

Results from stage 2 are displayed in a co-occurrence table where:

· each row contains one citation of the word in question

· each column contains an ID tag

· each cell contains the level of the ID tag for this citation

Stage 3: The co-occurrence table is turned into a frequency table: every row contains a level of an ID tag, while every column contains a sense of the polysemous word. Each cell in the table provides the frequency with which that ID tag level occurs with that word sense.

[NB: to compare senses that occur at different frequencies, absolute frequencies need to be turned into relative frequencies (i.e. within-ID tag percentages)]

Stage 3 results in the Behavioral Profile for a word sense: “each sense of a word (…) is characterized by one co-occurrence vector of within-ID tag relative frequencies” (p.63)
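To make stages 2 and 3 concrete for myself, here is a minimal sketch of how annotated citations could be turned into behavioral profiles. It is not the authors' own implementation: the senses, the ID tags (`subject`, `clause`) and their levels are invented for illustration, and the function name `behavioral_profiles` is my own.

```python
from collections import Counter, defaultdict

# Hypothetical stage 2 output: one dict per corpus hit (citation).
# The senses, ID tags and levels are invented for illustration only.
citations = [
    {"sense": "permission", "subject": "animate", "clause": "main"},
    {"sense": "permission", "subject": "animate", "clause": "subordinate"},
    {"sense": "possibility", "subject": "inanimate", "clause": "main"},
    {"sense": "possibility", "subject": "animate", "clause": "main"},
    {"sense": "possibility", "subject": "inanimate", "clause": "main"},
    {"sense": "possibility", "subject": "inanimate", "clause": "subordinate"},
]


def behavioral_profiles(rows, sense_key="sense"):
    """Return {sense: {(id_tag, level): within-ID-tag relative frequency}}."""
    counts = defaultdict(Counter)  # sense -> absolute counts of (tag, level)
    levels = set()                 # every (tag, level) pair seen in the data
    for row in rows:
        for tag, level in row.items():
            if tag != sense_key:
                counts[row[sense_key]][(tag, level)] += 1
                levels.add((tag, level))

    profiles = {}
    for sense, counter in counts.items():
        # Total number of annotations per ID tag for this sense.
        totals = Counter()
        for (tag, _level), n in counter.items():
            totals[tag] += n
        # Relative frequencies: the levels of one ID tag sum to 1 per sense.
        profiles[sense] = {
            (tag, level): (counter[(tag, level)] / totals[tag]) if totals[tag] else 0.0
            for (tag, level) in sorted(levels)
        }
    return profiles


profiles = behavioral_profiles(citations)
for sense, vector in profiles.items():
    print(sense, vector)
```

Each sense ends up with one vector in which the levels of a given ID tag sum to 1, which is what makes senses of different overall frequency comparable.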

Stage 4 of Gries and Divjak’s methodology evaluates the vector-based behavioural profiles identified in stage 3.

Data evaluation

The evaluation can be carried out using quantitative approaches (i.e. standardized statistical tests).

Gries and Divjak recognise two types of evaluation, monofactorial and multifactorial:

  • Monofactorial evaluation: looks at token frequency and type frequency. “A useful strategy to start with is identifying in one’s corpus the most frequent senses of the word(s) one is investigating” (p.64)

  • Multifactorial evaluation: The authors specifically focus on the exploratory technique of hierarchical agglomerative cluster analysis (HAC), a family of methods that aim at identifying and representing (dis)similarity relations between different items.

How to do a hierarchical agglomerative cluster analysis:

i) The relative co-occurrence frequency table needs to be turned into a similarity/dissimilarity matrix (this requires settling on a specific measure)

ii) Selection of an amalgamation strategy, i.e. an algorithm that defines how the elements to be clustered are joined together on the basis of the variables (the ID tags) they were inspected for; the most widely used amalgamation strategy is Ward’s rule

iii) Results appear in the form of a hierarchical tree diagram representing distinguishable clusters with high within-cluster similarity and low between-cluster similarity
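The three steps above can be sketched in a few lines. This is only an illustration under my own assumptions: the profile vectors are invented, squared Euclidean distance stands in for whichever (dis)similarity measure one settles on, and Ward's rule is implemented naively as "merge the pair of clusters whose fusion increases the within-cluster sum of squares the least". Instead of drawing a tree diagram, the sketch prints the merge history from which such a diagram would be built.

```python
from itertools import combinations

# Hypothetical behavioral profiles: one vector of within-ID-tag relative
# frequencies per sense. Labels and numbers are invented for illustration.
profiles = {
    "sense_A": [0.80, 0.20, 0.70, 0.30],
    "sense_B": [0.75, 0.25, 0.65, 0.35],
    "sense_C": [0.20, 0.80, 0.30, 0.70],
    "sense_D": [0.25, 0.75, 0.20, 0.80],
}


def centroid(members, data):
    """Mean vector of the senses in one cluster."""
    vectors = [data[m] for m in members]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def ward_cost(a, b, data):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a, data), centroid(b, data)
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance


def ward_clustering(data):
    """Agglomerate bottom-up; return the merge history as (a, b, cost)."""
    clusters = [(label,) for label in sorted(data)]  # start: one cluster per sense
    merges = []
    while len(clusters) > 1:
        # Pick the cheapest pair to merge according to Ward's criterion.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: ward_cost(pair[0], pair[1], data),
        )
        merges.append((a, b, ward_cost(a, b, data)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = ward_clustering(profiles)
for a, b, cost in merges:
    print(f"merge {a} + {b} (cost {cost:.4f})")
```

Low-cost merges early in the history correspond to tight clusters low in the tree; a large jump in cost at the final merge indicates well-separated clusters.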


Detailed analysis of the clustering solution

i) Assessment of the ‘cleanliness’ of the tree diagram

ii) Assessment of the clearest similarities emerging from the tree diagram

iii) Between-cluster differences can be assessed using t-values

NB: “the fact that a cluster analysis has grouped together particular sense/words does not necessarily imply that these senses or words are identical or even highly similar – it only shows that these sense/words are more similar to each other than they are to the rest of the senses/words investigated. By means of standardized z-scores, one can tease apart the difference between otherwise highly similar senses/words and shed light on what the internal structure of a cluster looks like” (p.67)
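My notes do not record exactly how the authors compute their z-scores, so the following is only a rough sketch of the general idea: each ID tag level is standardized across the senses of one cluster, so that the levels on which otherwise similar senses diverge stand out. The senses, feature labels and values are invented, and the use of the sample standard deviation is my own assumption.

```python
from statistics import mean, stdev

# Hypothetical profiles for three senses that ended up in the same cluster.
# Feature labels and values are invented for illustration only.
cluster_profiles = {
    "sense_A": {"subject:animate": 0.80, "clause:main": 0.70},
    "sense_B": {"subject:animate": 0.75, "clause:main": 0.65},
    "sense_C": {"subject:animate": 0.70, "clause:main": 0.15},
}


def z_scores(profiles):
    """Standardize every feature across senses: (value - mean) / sd."""
    senses = sorted(profiles)
    features = sorted(profiles[senses[0]])
    result = {sense: {} for sense in senses}
    for feature in features:
        values = [profiles[sense][feature] for sense in senses]
        m, sd = mean(values), stdev(values)  # sample sd: an assumption
        for sense in senses:
            # A feature with no variation cannot distinguish senses.
            result[sense][feature] = (profiles[sense][feature] - m) / sd if sd else 0.0
    return result


z = z_scores(cluster_profiles)
for sense, scores in z.items():
    print(sense, {feature: round(value, 2) for feature, value in scores.items()})
```

In this toy data, sense_C is set apart from its cluster mates mainly by its strongly negative score on `clause:main`.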

The authors' methodology and my project:

  • Can the authors' method lead to the identification of semantic clusters between the different senses of may, can and pouvoir?
  • If so, what semantic features characterise each cluster? Can between-cluster differences be identified?
  • How useful is the proposed methodology for the elaboration of a cross-linguistic semantic network of the senses of may, can and pouvoir?
  • How useful is the proposed methodology for identifying both cross-linguistic between-cluster differences and within-cluster characteristics?

Overall, exploring the authors' proposed methodology with my data should prove a useful exercise because it provides the opportunity to investigate the mental semantic organisation of word senses at a cross-linguistic level.
