Tuesday 16 June 2009

Statistical techniques for an optimal treatment of polysemy

In an earlier post, I introduced the work of Dylan Glynn, who is broadly concerned with developing methodology for corpus-data investigation. Glynn adheres to the Cognitive Linguistics/Semantics framework. Of interest here is a research project he carried out in collaboration with Dirk Geeraerts and Dirk Speelman, which assesses the efficacy of two kinds of statistical analysis, namely exploratory vs. confirmatory techniques. Glynn, Geeraerts and Speelman presented the results of their study at the 10th International Cognitive Linguistics Conference in Cracow in July 2007, in a paper entitled "Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics" [the abstract is accessible from page 11 of the link]. As I do not have access to the full paper, I can unfortunately only summarise its content on the basis of the abstract, and I am not in a position to critically assess the arguments proposed by Glynn, Geeraerts and Speelman.

According to the authors, the two main families of statistical techniques currently in active use among Cognitive Linguists for corpus-data investigation are (i) exploratory techniques (i.e. Cluster Analysis, used in Gries 2006; Correspondence Analysis, used in Glynn forthcoming) and (ii) confirmatory techniques (i.e. Linear Discriminant Analysis, used in Gries 2003 and Wulff 2004; Logistic Regression Analysis, used in Heylen 2005 and De Sutter et al. in press).

The authors define the aim of each technique as follows:

The goal of (...) exploratory statistics is to identify and visualize patterns in the data. These patterns are argued to represent patterns of usage (...). Exploratory statistics analysis does not permit inferences about the language, only the sample, or dataset, investigated. However, in confirmatory statistics, inference is made from the sample to the population. In other words, one claims that what is seen in the data is representative of the language generally. (abst.)


In the light of my own project, the authors' study is of particular relevance because it singles out polysemy as an object of investigation that requires specific methodological attention:
Current trends in the study of polysemy have focused on exploratory techniques.
However,
[t]he importance of these techniques notwithstanding, the cognitive framework needs to deepen its use of quantitative research especially through the use of confirmatory multivariate statistics.
Further,

Within Cognitive Linguistics, [Linear Discriminant Analysis technique and Logistic Regression Analysis technique] have been successfully used to capture the various conceptual, formal, and extralinguistics factors that lead to the use of one construction over another. However, the study of polysemy differs at this point. Instead of examining the variables that effect the use of one parasynonymous forms to another, we are examining the interaction of a range of formal variables (the lemma and its syntagmatic and inflectional variation), semantic variables, and extralinguistic variables, in the search of correlations across all of these. One possible multivariate technique for this type of data is Log-Linear Modelling. (abst.)
In the course of their study, the authors identified complex sets of correlations between formal and semantic variables through exploratory studies and then modelled these correlations using the Log-Linear Analysis technique.
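
To make the idea of Log-Linear Modelling a little more concrete for myself, here is a minimal sketch in Python. It is my own illustration and not the authors' procedure: the counts are invented, the row and column labels (senses of a lemma, syntactic patterns) are hypothetical, and I only fit the simplest log-linear model for a two-way table, the independence model, whose fitted counts are row total times column total divided by the grand total. The likelihood-ratio statistic G² then measures how badly that model fits, i.e. how strongly the formal and the semantic variable are associated.

```python
import math

# Hypothetical toy counts (invented for illustration, not taken from the paper):
# rows = two senses of a polysemous lemma, columns = two syntactic patterns.
observed = [
    [30, 10],
    [10, 30],
]

def independence_fit(table):
    """Fitted counts under the log-linear independence model
    log m_ij = lambda + lambda_i (row effect) + lambda_j (column effect)."""
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / total for c in col_totals] for r in row_totals]

def g_squared(table, fitted):
    """Likelihood-ratio statistic G^2 = 2 * sum(o * ln(o / e))."""
    return 2 * sum(
        o * math.log(o / e)
        for obs_row, fit_row in zip(table, fitted)
        for o, e in zip(obs_row, fit_row)
        if o > 0  # empty cells contribute nothing to the sum
    )

fitted = independence_fit(observed)
g2 = g_squared(observed, fitted)
# Degrees of freedom of the independence model for an r x c table.
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("fitted counts:", fitted)
print("G^2 =", round(g2, 2), "on", df, "degree(s) of freedom")
```

A G² that is large relative to its degrees of freedom suggests that the independence model does not describe the table well, in other words that sense and pattern are associated in the sample. A real analysis of the kind the abstract describes would involve more than two variables and would compare a series of nested models, which a dedicated statistics package handles far better than this hand-rolled sketch.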

At this point, Glynn, Geeraerts and Speelman's paper calls for a comparative study of specific polysemous lexical items, contextualised in different language varieties, that applies in turn both the exploratory Cluster Analysis technique and the confirmatory Log-Linear Modelling technique. Such a study would help to identify a possible optimal statistical technique for the investigation of corpus data.
