Thursday 18 June 2009

Profile-based methodology for the comparison of language varieties

In this post, I would like to briefly point out the usefulness of 'profiling' methods for corpus-data investigation. For that purpose I refer specifically to a paper entitled Profile-based linguistic uniformity as a generic method for comparing language varieties (2003), authored by Dirk Speelman, Stefan Grondelaers and Dirk Geeraerts. The paper is inspired by studies of language varieties and by research methods currently used in dialectometry. For my own purposes, it is interesting to note that the authors make a case for the validity of a profile-based methodology for corpus-data investigation, since the annotation of my own data will include profiling occurrences of may, can and the lemma pouvoir.

In their paper, the authors present "the 'profile-based uniformity', a method designed to compare language varieties on the basis of a wide range of potentially heterogeneous linguistic variables" (abst.). The authors aim to show that profiling the investigated lexical items helps identify dissimilarities between language varieties on the basis of individual variables, which are then summarised into global dissimilarities. Such a process allows language varieties to be clustered or charted via various multivariate techniques.

Unlike standard methods of corpus investigation, namely frequency counts, the profile-based method is usage-based but adds another criterion: "the frequency of a word or a construction is not treated as an autonomous piece of information, but is always investigated in the context of a profile" (p.11).

The profile-based approach assumes that mere frequency differences in a corpus contribute to the identification of differences between language varieties. According to the authors, the profile-based approach presents two advantages: the avoidance of thematic bias and the avoidance of referential ambiguity.

For the purpose of my project, the authors' paper generally supports my methodological choice to semantically profile the occurrences of may, can and the lemma pouvoir as found in my data. However, in their case study (see the paper, p.18) the authors take an onomasiological perspective (i.e. they use a concept as a starting point, and then investigate which words are associated with that concept). My project, on the other hand, takes the opposite perspective, namely the semasiological approach, which first considers individual words and then looks at the semantic information that may be associated with those words. Inevitably, this difference in approaching the word/sense/concept interface leads to differing acceptations of the term 'profile', as the onomasiological and semasiological perspectives have different starting points. In that respect, the authors consider "[a] profile for a particular concept or linguistic function in a particular language variety [to be] the set of alternative linguistic means used to designate that concept or linguistic function in that language variety, together with their frequencies" (p.5).

For the purpose of my project, the term profile needs to be defined at word level and to incorporate both sense and morpho-syntactic information. In that regard, the Behavioural Profile methodology proposed by Gries and Divjak in Quantitative approaches in usage-based cognitive semantics: myths, erroneous assumptions, and a proposal (in press) is an appropriate methodology for my project. Broadly, the BP methodology involves the identification of both the semantic and the morpho-syntactic features characteristic of the investigated lexical item, as found in the data. These identified features are then used as linguistic variables and investigated statistically. In the BP model, the identified features are referred to and processed as ID tags, each of which contributes to the profiling of the lexical item under investigation.
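To make this concrete, below is a minimal sketch in Python of the general idea, using entirely invented annotations rather than my actual data: each occurrence of an item carries a set of ID tags (a sense tag and two hypothetical morpho-syntactic tags), and the annotations are converted into one frequency profile per item, ready for statistical treatment.

import pandas as pd

# Hypothetical annotated concordance lines (invented, not real corpus data).
occurrences = pd.DataFrame([
    {"item": "may", "sense": "epistemic",  "subject": "3sg", "negation": "no"},
    {"item": "may", "sense": "deontic",    "subject": "2sg", "negation": "no"},
    {"item": "may", "sense": "epistemic",  "subject": "3pl", "negation": "yes"},
    {"item": "can", "sense": "ability",    "subject": "1sg", "negation": "no"},
    {"item": "can", "sense": "permission", "subject": "2sg", "negation": "yes"},
])

# One behavioural profile per item: the relative frequency of each ID-tag level.
for tag in ["sense", "subject", "negation"]:
    profile = pd.crosstab(occurrences["item"], occurrences[tag], normalize="index")
    print(f"\nID tag: {tag}\n{profile.round(2)}")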

To sum up, Speelman, Grondelaers and Geeraerts' paper provides me not only with the opportunity to reflect on the notion of 'profiling' in the context of corpus-data investigation but also with the opportunity to consider that notion from the perspective of my own study.

Tuesday 16 June 2009

Comparing exploratory statistical techniques for semantic descriptions

As Glynn, Geeraerts and Speelman state in Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics (paper presented at the 10th International Cognitive Linguistics Conference in Cracow in July 2007):

Current trends in the study of polysemy have focused on exploratory techniques such as Cluster Analysis and Correspondence Analysis. (abst.)
Broadly, exploratory techniques "identify and visualize patterns in the data"; such analysis "does not permit inferences about the language, only the sample, or dataset, investigated" (abst.).

On the occasion of the Quantitative Investigations in Theoretical Linguistics 3 event in Helsinki on June 3rd 2008, Dylan Glynn presented a comparison of the Cluster Analysis and Correspondence Analysis statistical methods for the purpose of semantic description (Clusters and Correspondences. A comparison of two exploratory statistical techniques for semantic description) [the PowerPoint presentation for this paper can be found here].

Over the past fifteen years, corpus-based research in the field of Cognitive Linguistics has produced a number of studies demonstrating the wide use of both statistical techniques. In his paper, Glynn compares the two techniques on the grounds of the quality/accuracy of the graphic representation of the data and the accuracy of the relative associations of variables as revealed in the data. The assessment of the accuracy of the relative associations of variables for each statistical method is based on a regression analysis, which takes into consideration "the relationship between the mean value of a random variable and the corresponding values of one or more variables" (OED).

For the purpose of his investigation, Glynn carried out a case study examining the semantic structure of the lexeme annoy in comparison with hassle and bother in a large non-commercial corpus of English specified for the American vs. British English regional difference (for the purpose of that case study Glynn identified the working variables of morpho-syntax and Frame Semantic argument structure). Glynn points out that the Cluster Analysis and Multivariate Correspondence Analysis methods involve different types of graphic representation which, in turn, present a number of shortcomings:

One important difference between the two techniques is that Cluster Analysis is primarily designed to present its results in the form of dendrograms where Correspondence Analysis relies on scatter plots. The dendrograms of HCA offer clear representations of both the groupings of features and the relative degree of correlation of those features. (...) The principal shortcoming of this representation is that it gives the false impression that all the data falls into groups, where in fact this may not be the case. (...) The scatter plots of Correspondence Analysis, although at times difficult to interpret, offer a much more "analogue" representation of correlation. (...) [T]he representation of the plot is (...) much more approximative than the dendrogram. (p.2)
Through his case study, Glynn confirms the usefulness of both statistical methods as exploratory techniques. He also points out that both methods may prove unreliable in accurately processing complex multivariate data, and cautions analysts about using them for the specific purpose of confirmatory analysis. However, in the context of exploratory analysis, "the contrast in the result of the complicated analysis across the three lexemes [annoy, hassle and bother] suggests that MCA [Multivariate Correspondence Analysis] is better suited to a truly multivariate exploratory research" (p.2).
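To visualise the contrast Glynn describes, here is a small Python sketch that produces both graphic formats from an invented lexeme-by-feature contingency table (the feature counts are mine, purely for illustration): a Ward dendrogram of the row profiles, and a correspondence-analysis scatter plot computed via an SVD of the standardised residuals.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

lexemes = ["annoy", "hassle", "bother"]
# Invented frequencies of four made-up morpho-syntactic features per lexeme.
counts = np.array([[40, 12,  8, 30],
                   [25,  5, 15, 20],
                   [35, 10, 14, 18]], dtype=float)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Cluster Analysis: Ward clustering of the row profiles, shown as a dendrogram.
profiles = counts / counts.sum(axis=1, keepdims=True)
dendrogram(linkage(profiles, method="ward"), labels=lexemes, ax=ax1)
ax1.set_title("Cluster Analysis (dendrogram)")

# Correspondence Analysis: SVD of the standardised residuals of the table.
P = counts / counts.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, D, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * D) / np.sqrt(r)[:, None]   # row principal coordinates
ax2.scatter(row_coords[:, 0], row_coords[:, 1])
for name, (x, y) in zip(lexemes, row_coords[:, :2]):
    ax2.annotate(name, (x, y))
ax2.set_title("Correspondence Analysis (scatter plot)")

plt.tight_layout()
plt.show()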

With regard to my project, Glynn's paper raises a couple of points:

i) the need to decide on the statistical nature of my overall project analysis: exploratory, confirmatory, or perhaps both, possibly following a comparative format;

ii) the need to clearly identify the number and the nature of the variables through which I intend to investigate my data sets, as these will influence the choice of statistical method, at the exploratory stage at least.

Statistical techniques for an optimal treatment of polysemy

In this post, I introduced the work of Dylan Glynn, who is broadly concerned with developing methodology for corpus-data investigation. Glynn adheres to the Cognitive Linguistics/Semantics framework. Of interest here is a research project he contributed to in collaboration with Dirk Geeraerts and Dirk Speelman, concerned with assessing the efficacy of two types of statistical technique, namely exploratory vs. confirmatory techniques of statistical analysis. Glynn, Geeraerts and Speelman presented the results of their study at the 10th International Cognitive Linguistics Conference in Cracow in July 2007, in a paper entitled Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics [the abstract is accessible from page 11 of the link]. For the purpose of this post I can unfortunately only summarise the content of that paper based on its abstract. As I do not have access to the full paper, I am not in a position to critically assess the arguments proposed by Glynn, Geeraerts and Speelman.

According to the authors, the two main statistical techniques currently and actively used by Cognitive Linguists for corpus-data investigation are i) exploratory techniques (i.e. Cluster Analysis, used in Gries 2006, and Correspondence Analysis, used in Glynn forthcoming) and ii) confirmatory techniques (i.e. Linear Discriminant Analysis, used in Gries 2003 and Wulff 2004, and Logistic Regression Analysis, used in Heylen 2005 and De Sutter et al. in press).

The authors define the aim of each technique as follows:

The goal of (...) exploratory statistics is to identify and visualize patterns in the data. These patterns are argued to represent patterns of usage (...). Exploratory statistics analysis does not permit inferences about the language, only the sample, or dataset, investigated. However, in confirmatory statistics, inference is made from the sample to the population. In other words, one claims that what is seen in the data is representative of the language generally. (abst.)
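A minimal Python sketch may help fix the contrast, using an invented two-by-two table of construction choice by regional variety; the exploratory step merely describes the sample, while the confirmatory step (here a chi-square test of independence, my own choice of illustration rather than the authors') infers from the sample to the population.

import numpy as np
from scipy.stats import chi2_contingency

# Invented counts: rows are varieties, columns are competing constructions.
table = np.array([[120,  60],    # British English
                  [ 80, 110]])   # American English

# Exploratory: describe the pattern in this sample only.
print("row proportions:\n", table / table.sum(axis=1, keepdims=True))

# Confirmatory: test whether the association holds beyond the sample.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")   # a small p-value licenses inference
                                           # about the language, not just the sample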

In the light of my own project, the authors' study is of particular relevance because it identifies polysemy, as an object of investigation, as requiring specific methodological attention:
Current trends in the study of polysemy have focused on exploratory techniques.
However,
[t]he importance of these techniques notwithstanding, the cognitive framework needs to deepen its use of quantitative research especially through the use of confirmatory multivariate statistics.
Further,

Within Cognitive Linguistics, [Linear Discriminant Analysis technique and Logistic Regression Analysis technique] have been successfully used to capture the various conceptual, formal, and extralinguistics factors that lead to the use of one construction over another. However, the study of polysemy differs at this point. Instead of examining the variables that effect the use of one parasynonymous forms to another, we are examining the interaction of a range of formal variables (the lemma and its syntagmatic and inflectional variation), semantic variables, and extralinguistic variables, in the search of correlations across all of these. One possible multivariate technique for this type of data is Log-Linear Modelling. (abst.)
In the course of their study, the authors identified complex sets of correlations between formal and semantic variables through exploratory studies, and then modelled these correlations using the Log-Linear Analysis technique.
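Not having access to the full paper, I can only sketch what Log-Linear Modelling of such data might look like. The Python fragment below fits Poisson GLMs to an invented three-way table of sense by syntactic frame by variety, comparing a baseline independence model against one that adds a sense-by-variety association term.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Invented cell counts: sense x syntactic frame x language variety.
cells = pd.DataFrame(
    [(s, f, v) for s in ("epistemic", "deontic")
               for f in ("affirmative", "negated")
               for v in ("L1", "L2")],
    columns=["sense", "frame", "variety"])
cells["count"] = [30, 12, 10, 25, 22, 18, 15, 20]

# Baseline log-linear model: mutual independence of the three variables.
base = smf.glm("count ~ sense + frame + variety", data=cells,
               family=sm.families.Poisson()).fit()

# Adding a sense-by-variety interaction term.
inter = smf.glm("count ~ sense + frame + variety + sense:variety",
                data=cells, family=sm.families.Poisson()).fit()

# A substantial drop in deviance indicates that sense and variety are
# correlated, i.e. the two varieties prefer different senses.
print(f"deviance: {base.deviance:.1f} -> {inter.deviance:.1f}")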

At this point, Glynn, Geeraerts and Speelman's paper calls for a comparative study of specific polysemous lexical items contextualised in different language varieties, using, in turn, both the Cluster Analysis exploratory technique and the Log-Linear Modelling confirmatory technique. Such a study would contribute to the identification of a possible optimal statistical technique for the investigation of corpus data.

The place of Cognitive Linguistics on the French linguistics scene

As previously described here, part of my project involves the investigation of the lemma pouvoir in a native French subdata set. Analyses of the quantitative results of that investigation will be carried out within the Cognitive Linguistics (CL) framework. Carrying out a literature review covering the polysemy of pouvoir in relation to the CL framework has, so far, proved a little tricky. This post provides a little bit of background on the place of CL in France and in French linguistics generally. At the Congrès Mondial de Linguistique Française in Paris in July 2008, Dirk Geeraerts discussed the situation of CL in the context of French linguistics in a very informative paper entitled La Réception de la Linguistique Cognitive dans la Linguistique du Français. Bonne lecture!

Monday 15 June 2009

Dylan Glynn on the theme of data-driven methodology in Cognitive Linguistics and its usefulness for the treatment of polysemy

In this post, I would like to bring attention to the work of Dylan Glynn whose on-going research is concerned with bridging the empirical and the cognitive. Here is how Glynn describes his own work:

The focus of my work is the development of methodology within the theoretical framework of Cognitive Linguistics. This school of thought imposes the minimal theoretical assumptions upon its model of language. It is for this reason that it is best placed to properly capture the complexity of language in a holistic manner.

In methodological terms, I am most interested in finding ways to capture the multidimensional nature of language structure, from prosody and morphology through to semantics and culture. Specifically, I concentrate on the semantics of Grammatical Constructions, the polysemy and synonymy of lexis, iconicity in morphology, and the interaction of grammar, pragmatics, and metaphor-metonymy.(https://perswww.kuleuven.be/~u0049977/ling.html) [accessed 15/06/09]

As part of a talk given at the 10th International Cognitive Linguistics Conference in July 2007 at the University of Cracow, entitled Usage-Based Cognitive Semantics: A Quantitative Approach, Glynn makes a case for the quantitative treatment of lexical and constructional semantics and claims that "[c]orpus data respects the complexity of language and, if treated in sufficiently large quantities, enables generalisations about language structure that other methods cannot" (abst.). Further, "usage-based quantitative methodology (...) facilitates attempts to reveal the interaction between the different parameters of language simultaneously" (abst.) [my emphasis]

During his opening talk for the theme session Empirical Evidence. Converging approaches to constructional meaning at the Third International Conference of the German Cognitive Linguistics Association on September 25th-27th 2008, Glynn points out the fast-growing interest in empirical cognitive research, particularly in the field of Cognitive Semantics:
Cognitive Linguistics has recently witnessed a new and healthy concern for empirical methodology. Using such methods, important in-roads have been made in the study of near-synonymy, syntactic alternation, syntactic variation and lexical licensing.
Further,
Empirical methods, and methodology generally, are one of the most important concerns for any descriptive science and the recent blossoming of research in this respect in Cognitive Linguistics can be seen as a maturing of the field. A range of recent anthologies on the issue, including Gries & Stefanowitsch (2006), Stefanowitsch & Gries (2006), Gonzales-Marquez & al. (2007), Andor & Pelyvas (forth.), Newman & Rice (forth.), and Glynn & Fischer (in preparation), can be seen as testimony to the importance attached to this issue. Despite the advances in this regard, how the different methods and the results they produce inform each other remains largely ill-understood. Although this question of how elicited, experimental and found data relate has been addressed in the work of Schonefeld (1999,2001), Gries & al. (2005, in press), Goldberg (2006), Arppe & Jarvikivi (in press), Gilquin (in press), Divjak (forth.), and Wiechmann (subm.), it warrants further investigation.
The fast development of data-driven investigation methods within the field of Cognitive Linguistics is further pointed out by Glynn in his opening talk for the theme session Empirical Approaches to Polysemy and Synonymy at the Cognitive and Functional Perspectives on Dynamic Tendencies in Languages event, on May 29th-June 1st 2008. In that particular address, Glynn presents empirical cognitive approaches as a way to address existing issues in the cognitive treatment of polysemy:
Within the cognitive tradition, both the study of polysemy and synonymy have rich traditions. Brugman (1983) and Vandeloise (1984) began the study of sense variation in spatial prepositions that evolved into the radial network model applied to a wide range of linguistic forms, especially grammatical cases and spatial prepositions (Janda 1993, Cuyckens 1995). (...) Despite the success of this research, studies such as Sandra & Rice (1995) and Tyler and Evans (2001) identified serious shortcomings. In light of this, empirical cognitive approaches to semantic structure do not question the validity of the radial network model, but seek to develop methods for testing proposed semantic variation and relation. (abs.) [my emphasis]
In relation to my project (which includes a Cognitive Linguistics treatment of polysemous may, can and pouvoir via an investigation of corpus data), it is with much excitement that I begin to explore the work of Dylan Glynn.

Below is a selected bibliography of Glynn's work that will be of interest for my research (unfortunately, several references are still in press or in preparation!):

  • Glynn, D. In press (6pp). Multifactorial Polysemy. Form and meaning variation in the complex web of usage. R. Caballero (ed.). Lexicología y lexicografía. Proceedings of the XXVI AESLA Conference. Almería: University of Almería Press.
  • Glynn, D. 2008. Polysemy, Syntax, and Variation. A usage-based method for Cognitive Semantics. V. Evans & S. Pourcel (eds). New Directions in Cognitive Linguistics. Amsterdam: John Benjamins.
  • Glynn, D. 2006. Conceptual Metonymy - A study in cognitive models, reference-points, and domain boundaries. Poznan Studies in Contemporary Linguistics 42: 85-102.
  • Glynn, D. 2006. Cognitive Semantics and Lexical Variation. Why we need a quantitative approach to conceptual structure. O. Prokhorova (ed.). Edinstvo sistemnogo i functionalnogo analiza yazykov (Systemic and Functional Analysis of Language). 53-60. Belgorod: Belgorod University Press.
In preparation:

  • Glynn, D., Multidimensional Polysemy. A case study in usage-based cognitive semantics. Will be submitted to Cognitive Linguistics.
  • Glynn, D., Geeraerts, D., & Speelman, D. Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics. D. Glynn & K. Fischer (eds). Usage-Based Cognitive Semantics. Corpus-Driven methods for the study of meaning. Berlin: Mouton de Gruyter.
  • Glynn, D. & Fischer, K. (eds). Usage-Based Cognitive Semantics. Corpus-Driven methods for the study of meaning. Berlin: Mouton de Gruyter.
  • Glynn, D. Mapping Meaning. Toward a usage-based methodology in Cognitive Semantics. Will be submitted to Mouton de Gruyter.

Sunday 14 June 2009

Behavioral Profiling and polysemy

In their paper entitled In defense of corpus-based methods: A behavioral profile analysis of polysemous 'get' in English (presented at the 24th North West Linguistics Conference, 3rd-4th May 2008), Andrea L. Berez and Stefan Th. Gries make a general case for the use of corpus data. Their paper serves as a response to Raukko's (1999, 2003) proposal to disregard corpus-data investigations in favour of experimentally motivated studies. Berez and Gries conclude that:

[A] rejection of corpus-based investigations of polysemy is premature: our BP approach to get not only avoids the pitfalls Raukko mistakenly claims to be inherent in corpus research, it also provides results that are surprisingly similar to his own questionnaire-based results, and Divjak and Gries (to appear) show how predictions following from a BP study are strongly supported in two different psycholinguistic experiments. (p.165)
Before conducting a case study of polysemous get -- the results of which are compared, in the second part of the paper, to those presented in Raukko's An "intersubjective" method for cognitive semantic research on polysemy: the case of 'get' (1999) -- the authors briefly state the advantages of corpus data:

- (...) the richness of and diversity of naturally-occurring data often forces the researcher to take a broader range of facts into consideration;
- the corpus output from a particular search expression together constitute an objective database of a kind that made-up sentences or judgements often do not. More pointedly, made-up sentences or introspective judgements involve potentially non-objective (i) data gathering, (ii) classification, (iii) interpretive process on the part of the researcher. Corpus data, on the other hand, at least allow for an objective and replicable data-gathering process; given replicable retrieval operations, the nature, scope and the ideas underlying the classification of examples can be made very explicit (...) (p.159)

Methodologically, Berez and Gries attempt to make their case by targeting 'polysemy' as their domain of investigation and by applying the Behavioral Profiling method (described here):

Given the recency of this method, the number of studies that investigate highly polysemous items is still limited. We therefore apply this method to the verb to get to illustrate that not only does it not suffer from the problems of the intersubjective approach, but it also allows for a more bottom-up/data-driven analysis of the semantics of lexical elements to determine how many senses of a word to assume and what their similarities and differences are. (p.157)
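By way of illustration, here is what that bottom-up step might look like in Python, with invented BP vectors standing in for the real ones: each putative sense of get is represented as a vector of ID-tag proportions, and the senses are grouped by the similarity of their vectors.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

senses = ["obtain", "become", "move", "possess", "must"]
# Rows: putative senses; columns: hypothetical ID-tag proportions.
bp = np.array([[0.4, 0.1, 0.2, 0.3],
               [0.1, 0.5, 0.3, 0.1],
               [0.2, 0.4, 0.3, 0.1],
               [0.5, 0.1, 0.1, 0.3],
               [0.1, 0.2, 0.6, 0.1]])

# Cluster the senses by the (cosine) similarity of their BP vectors.
Z = linkage(pdist(bp, metric="cosine"), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")   # cut into two sense groups
for sense, g in sorted(zip(senses, groups), key=lambda x: x[1]):
    print(g, sense)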

Generally, the results of Berez and Gries' study and of Raukko's study are very similar. However, Berez and Gries' BP approach allows for a finer-grained investigation:
we show that some of our results are incredibly close to Raukko's, but also provide an illustration of how the BPs can combine syntactic and semantic information in a multifactorial way that is hard to come by using the kinds of production experiments Raukko discusses. (p.159)

With regard to my project, which is broadly concerned with a corpus-driven investigation of polysemous lexical items, Berez and Gries' paper provides a useful methodological illustration of how to exploit corpus data optimally for the retrieval of semantic information.

Tuesday 9 June 2009

Behavioral Profiles, snake plots and cross-linguistic comparisons

This post complements an earlier post, The corpus-based Behavioral Profile approach to cognitive semantics, as it revisits the Behavioral Profile (BP) methodology and reports how, according to Divjak and Gries, snake plot representations can graphically reveal the relative significance of ID tags, thus allowing for cross-linguistic ID tag-level comparisons. In this post I make reference to Divjak and Gries' recent paper Corpus-based cognitive semantics: a contrastive study of phasal verbs in English and Russian (to appear).

Overall, Divjak and Gries demonstrate that the BP methodology not only makes it possible to pick up dissimilarities between polysemous and near-synonymous items, but also to recognise and simultaneously process dissimilarities that are characteristically different:

"Because these dissimilarities are of an entirely different order, they can only be picked up if a methodology is used that adequately captures the multivariate nature of the phenomenon. The Behavioral Profiling approach we have developed and apply here does exactly that." (p.273, abst.).

For their investigation of polysemous and near synonymous lexical items the authors assume the existence of networks of words/senses. They also assume that the investigated lexical items in their study are included in such networks. Further, these networks demonstrate internal structure in the sense that "elements which are similar to each other are connected and the strength of the connection reflects the likelihood that the elements display similar syntactic and semantic behaviour" (p.281)

Divjak and Gries' paper achieves three goals:

1/ Presents the BP methodology as a means to provide a usage-based characterisation of the lemma under investigation by identifying individual syntactic and semantic characteristic features.

2/ Demonstrates that a snake plot representation of those syntactic and semantic features makes it possible to rank them in order of significance, and therefore contributes to the identification of clusters of senses "on the basis of distributional characteristics collected in BPs" (p.292). Consequently, snake plot representations allow for the recognition of prototypical features of the investigated lexical items (see the sketch after this list).
3/ Illustrates that, semantically, the BP approach allows for a more rigorous investigation of translational cross-linguistic equivalents.
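For illustration, here is a minimal Python sketch of a snake plot, with invented ID-tag proportions for an English item and a Russian counterpart: the per-tag differences are sorted and plotted, so the tags at either tail of the 'snake' are those that most sharply distinguish the two items.

import numpy as np
import matplotlib.pyplot as plt

tags = ["infinitive", "past", "imperative", "negated",
        "animate_subj", "aspect_marker", "1st_person", "passive"]
english = np.array([0.30, 0.20, 0.05, 0.10, 0.15, 0.02, 0.10, 0.08])
russian = np.array([0.18, 0.25, 0.12, 0.08, 0.10, 0.15, 0.07, 0.05])

# Sort the ID tags by the difference in relative frequency between the items.
diff = english - russian
order = np.argsort(diff)
plt.plot(diff[order], marker="o")
plt.xticks(range(len(tags)), [tags[i] for i in order], rotation=45, ha="right")
plt.axhline(0, linewidth=0.5)
plt.ylabel("difference in relative frequency")
plt.title("Snake plot of ID-tag differences (invented data)")
plt.tight_layout()
plt.show()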

Overall, the authors are testing the BP approach for a simultaneous treatment of both language-specific data and cross-linguistic data.

"The (...) purpose is to show that this approach can also be applied to the notoriously difficult area of cross-linguistic comparisons. (...) [T]he approach will be put to the test by attempting a simultaneous within-language description and across-languages comparison of polysemous and near-synonymous items belonging to different subfamilies of Indo-European, i.e., English and Russian" (p.277)

Generally, Divjak and Gries' paper encourages putting the BP methodology further to the test by applying it to interlanguage data, where the investigated lexical item in language x, carving out a specific conceptual space, is used by a native speaker of language y whose conceptual space for the translational equivalent of the investigated item in language x is potentially different. In other words, and with regard to the application of the BP methodology to my project, while the paper raises questions about the nature of conceptual spaces in interlanguage, it convincingly offers a methodology that would allow for the computation of my three-way data (comprising native English, native French and Fr-English interlanguage; details of the three sub-corpora can be found here). Simultaneous treatment of may, can and pouvoir can be carried out within language -- taking into account the native English data vs. the Fr-English interlanguage data -- and across languages -- taking into account the native French vs. native English vs. Fr-English interlanguage data. Finally, the BP approach also provides the opportunity to investigate the possibility of a correlation between the word class membership of may, can and pouvoir and their semantic BPs.