Tuesday 16 June 2009

Comparing exploratory statistical techniques for semantic descriptions

As Glynn, Geeraerts and Speelman state in "Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics" (paper presented at the 10th International Cognitive Linguistics Conference in Cracow in July 2007):

Current trends in the study of polysemy have focused on exploratory techniques such as Cluster Analysis and Correspondence Analysis. (abst.)
Broadly, exploratory techniques "identify and visualise patterns in the data". This kind of analysis "does not permit inferences about the language, only the sample, or dataset, investigated" (abst.).

At the Quantitative Investigations in Theoretical Linguistics 3 event in Helsinki on June 3rd 2008, Dylan Glynn presented a comparison of the Cluster Analysis and Correspondence Analysis statistical methods for the purpose of semantic description ("Clusters and Correspondences. A comparison of two exploratory statistical techniques for semantic description") [the PowerPoint presentation for this paper can be found here].

Over the past fifteen years, corpus-based research in Cognitive Linguistics has produced a number of studies that make wide use of both statistical techniques. In his paper, Glynn compares the two techniques on two grounds: the quality and accuracy of their graphic representation of the data, and the accuracy of the relative associations of variables that they reveal in the data. The assessment of the latter for each statistical method is based on a regression analysis, which takes into consideration "the relationship between the mean value of a random variable and the corresponding values of one or more variables" (OED).
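
To make the OED definition a little more concrete, here is a minimal sketch of a simple least-squares regression in Python. It is only an illustration: the data are invented and have nothing to do with Glynn's corpus, the function name `fit_line` is hypothetical, and ordinary linear regression is used purely as the simplest case, since the kind of regression model Glynn used is not specified here.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Co-variation of x and y, and variation of x, around their means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Invented toy data (an assumption for illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(xs, ys)
print(f"estimated mean of y given x: {intercept:.2f} + {slope:.2f} * x")
```

The fitted line is the "relationship between the mean value" of y and the values of x: for any x, it gives the estimated average of y rather than a prediction for a single observation.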

For the purpose of his investigation, Glynn carried out a case study examining the semantic structure of the lexeme annoy in comparison with hassle and bother in a large non-commercial corpus of English marked for the American vs. British English regional difference. For that case study, Glynn identified morpho-syntax and Frame Semantic argument structure as the working variables. Glynn points out that the Cluster Analysis and Multivariate Correspondence Analysis methods involve different types of graphic representation, each of which, in turn, presents a number of shortcomings:

One important difference between the two techniques is that Cluster Analysis is primarily designed to present its results in the form of dendograms where Correspondence Analysis relies on scatter plots. The dendograms of HCA offer clear representations of both the groupings of features and the relative degree of correlation of those features. (...) The principle shortcoming of this representation is that it gives the false impression that all the data falls into groups, where in fact this may not be the case. (...) The scatter plots of Correspondence Analysis, although at times difficult to interpret, offer a much more "analogue" representation of correlation. (...) [T]he representation of the plot is (...) much more approximative than the dendogram. (p.2)

Through his case study, Glynn confirms the usefulness of both statistical methods as exploratory techniques. He also points out that both methods may be unreliable when processing complex multivariate data, and cautions analysts against using them for the specific purpose of confirmatory analysis. However, in the context of exploratory analysis, "the contrast in the result of the complicated analysis across the three lexemes [annoy, hassle and bother] suggests that MCA [Multivariate Correspondence Analysis] is better suited to a truly multivariate exploratory research" (p.2).
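
The point about dendrograms suggesting groups that may not really be there can be illustrated with a small sketch in Python. Everything in it is an assumption made for illustration: the counts and feature labels are invented (they are not Glynn's data), and the function names `chi2_distances` and `average_linkage` are hypothetical. The sketch computes the chi-square distance between row profiles, which is the distance Correspondence Analysis is built on, and then runs a plain average-linkage agglomerative clustering on those distances. It does not perform a full Correspondence Analysis, which would also require a singular value decomposition to obtain plot coordinates.

```python
from itertools import combinations
from math import sqrt

# Invented counts (NOT Glynn's data): rows are lexemes, columns are
# hypothetical usage features.
features = ["transitive", "passive", "human_cause", "event_cause"]
table = {
    "annoy":  [40, 30, 20, 10],
    "bother": [35, 25, 25, 15],
    "hassle": [10, 5, 60, 25],
}

def chi2_distances(table):
    """Chi-square distance between every pair of row profiles."""
    total = sum(sum(row) for row in table.values())
    # Column masses: share of the grand total found in each column
    col_mass = [sum(col) / total for col in zip(*table.values())]
    # Row profiles: each row rescaled to relative frequencies
    profiles = {k: [v / sum(row) for v in row] for k, row in table.items()}
    return {
        frozenset((a, b)): sqrt(sum((pa - pb) ** 2 / m for pa, pb, m in
                                    zip(profiles[a], profiles[b], col_mass)))
        for a, b in combinations(table, 2)
    }

def average_linkage(dist, labels):
    """Agglomerative clustering; returns the merges in order with heights."""
    def height(pair):
        a, b = pair
        # Mean pairwise distance between members of the two clusters
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=height)
        merges.append((sorted(a), sorted(b), height((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

dist = chi2_distances(table)
merges = average_linkage(dist, list(table))
for left, right, h in merges:
    print(left, "+", right, f"at height {h:.3f}")
```

With these invented counts, annoy and bother merge first at a small distance, and hassle is joined to them only at a much greater height. The clustering nevertheless always ends with every item in one tree, which is the "false impression" of grouping the quotation warns about; the raw distances in `dist` are closer to the more "analogue" picture a scatter plot would give.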

With regard to my project, Glynn's paper raises a couple of points:

i) the need to decide on the statistical nature of my overall project analysis -- exploratory, confirmatory, or perhaps both, possibly following a comparative format (?)

ii) the urgency of clearly identifying the number and the nature of the variables through which I intend to investigate my data sets, as those will influence the choice of statistical method -- at the exploratory stage at least.

3 comments:

  1. Useful quote on the theme of choosing statistical methods in corpus linguistics:

    Quote retrieved from the abstract of "Balancing Acts: Empirical Pursuits in Cognitive Linguistics", a paper presented by John Newman at the 10th International Cognitive Linguistics Conference in July 2007 at the University of Cracow:

    "Within the field of corpus linguistics, in particular, where we are dealing with connected discourse, there are unresolved issues concerning the most appropriate statistical methods to use" (abst.)

  2. It occurs to me that the Qs you're identifying here would be good for a SMAGG meeting. You could assign the paper to us...

  3. Sounds good, thank you, I'll go ahead with the idea...
