Friday 16 October 2009

"can" and "may" in present-day English, Yvan Lebrun, 1965

In that post, I referred to a corpus-based study of "can" and "may" by Yvan Lebrun, namely "can" and "may" in present-day English (1965). In this post, I briefly present -- or rather log -- the scope of Lebrun's study (I will present his general conclusions in a later post):

  • the study is corpus-based and includes data from both British and American English
  • a variety of genres are featured in the data: short stories, novels, plays, newspapers, scientific texts
  • all texts featuring in the data were published between 1955 and 1962
  • the study includes occurrences of might and could
  • numbers of occurrences:
  1. Total number of occurrences, including may, can, might, could: 4765
  2. Total number of occurrences of can: 2024
  3. Total number of occurrences of could: 1745
  4. Total number of occurrences of may: 491
  5. Total number of occurrences of might: 505
  • Methodologically, Lebrun scanned each instance of the modals to ascertain lexical meanings. The modals were considered to convey the same lexical meaning whenever their semantical contents proved identical once such significant oppositions as "present" vs. "past" or "indicative" vs. "conditional" had been discarded (p.11)
  • In order to decide on the semantical content of the modals, Lebrun relies on the context for each instance
  • Lebrun first carries out a recognition process of all the lexical senses and then attempts to define them
  • the process of defining the lexical senses was first motivated by Sommerfelt's recommendation (i.e. that 'the definition be able to replace the word in an ordinary sentence'). Such a 'replacement' process was abandoned on the grounds that:

"In none of the lexical meanings CAN, COULD, MIGHT, MAY can be equated with a substitutable word or phrase. In fact, each of their lexical senses is so wide that only a long series of 'synonyms' can cover it" (p.11)

Further,

"Instead of defining CAN, COULD,MIGHT, MAY by means of longish strings of juxtaposed partial equivalents and thus blurring out the internal unity of the lemma's meaning, I renounced Sommerfelt's principle and aimed at definitions that (a) embrace every facets of the sense they are meant to cover, (b) bring out the internal unity of each meaning, and (c) emphasize what the various significations of a lemma have in common." (p.11)

  • Overall methodological strategy:
  1. Based on the three recognised lexical meanings common to CAN, COULD, MIGHT and MAY, Lebrun calculated how often each of these meanings was expressed by MAY rather than by CAN, and by MIGHT rather than by COULD.
  2. Lebrun then examines cases where MAY and CAN are synonyms
  3. Based on the discovery that some collocations exclude the use of one of the two synonyms, Lebrun calculated the frequency of MAY relative to CAN in kinds of clauses where either word can be used idiomatically and tried to find out if this relative frequency is independent of the context [my emphasis: this part of Lebrun's methodology reinforces the idea of including, in my study, two separate variables (i.e. SENSES and CONTEXT) for a treatment of the meanings of MAY and CAN as featuring in my data. For more details on this, see this previous post. A minimal R sketch of this kind of relative-frequency calculation follows this list.]
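To make Lebrun's relative-frequency logic concrete, here is a minimal R sketch. The four totals are taken from the counts listed above, but the breakdown across clause types is entirely invented for illustration, so the test at the end only shows the kind of independence check Lebrun had in mind.

# Lebrun's overall totals (see the counts listed above)
totals <- c(can = 2024, could = 1745, may = 491, might = 505)

# Proportion of MAY relative to CAN, and of MIGHT relative to COULD
round(c(may_vs_can     = unname(totals["may"]   / (totals["may"]   + totals["can"])),
        might_vs_could = unname(totals["might"] / (totals["might"] + totals["could"]))), 3)

# Invented breakdown of the MAY/CAN totals across clause types, to illustrate
# testing whether the MAY/CAN ratio is independent of the kind of clause
clause_counts <- matrix(c(300, 1200,    # declarative:   may, can (invented)
                          120,  500,    # interrogative: may, can (invented)
                           71,  324),   # negative:      may, can (invented)
                        ncol = 2, byrow = TRUE,
                        dimnames = list(c("declarative", "interrogative", "negative"),
                                        c("may", "can")))
chisq.test(clause_counts)   # tests independence of modal choice and clause type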

Further reading (of early studies):

Lebrun, Y., Can and May, A Problem of Multiple Meaning, in Proceedings of the Ninth International Congress of Linguistics, 1962 (The Hague, Mouton, 1964)

Ten Bruggencate, K, The Use of Can and May, in Taalstudie 3 (1882), 94-106

Wood, F., May and Might in Modern English, Moderna Språk 49 (1955), 247-253

Senses vs. context in the coding of the semantics of MAY and CAN

This brief post continues the theme of the previous post, where I raised the issue of how to code the senses of "may" and "can" most effectively for the purpose of statistical analysis. It provides a short update on my current line of thinking regarding the design of an optimal coding system for the meanings of "may" and "can".

With regard to my project, I am now about to start annotating the senses of "may" and "can" as featuring in my data, and I am currently concerned with defining an appropriate degree of granularity for that stage of the coding. In other words, I need to establish how much contextual information should be included in the coding of the senses of the modals. In light of this, I am now considering the inclusion of an extra variable (in addition to a SENSES variable) for the investigation of the behaviour of "may" and "can", namely CONTEXT. The motivation behind including a CONTEXT variable would be to ultimately assess and quantify the contextual weight on the semantics of the modals. Including a CONTEXT variable in the study would also allow 'contextuality' to be excluded as a level of the SENSES variable, so that with SENSES I would approach each occurrence of the modals according to its generally recognised "core" meanings.

The advantages of dealing with the senses of the modals from the perspectives of both context and core meanings are that, firstly, the number of levels included for each variable will be smaller than if only one variable were considered, which would facilitate the recognition of possible patterns in the data. Secondly, the two variables CONTEXT and SENSES could then be tested for possible mutual interaction, which could ultimately be quantified statistically. Such a design of the data would also allow me to address a whole strand of the literature on the English modals that tries to assess what, semantically, belongs to the modals and what belongs to the context and the situation of utterance, and to what degree. To my knowledge, that line of work still remains to be experimentally challenged. Identifying and differentiating two meaning-related variables such as SENSES and CONTEXT could also facilitate the possible inclusion of an experimental task aimed at assessing potential statistical results. I am currently exploring the feasibility of that possibility.
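As a minimal R sketch of what such a two-variable design could look like: the sense and context labels below are purely hypothetical placeholders, and the rows are randomly generated rather than drawn from my data; the point is only the shape of the table and the kind of association test it affords.

# Hypothetical annotation table: one row per occurrence of "may" or "can";
# the labels and the rows are invented placeholders
set.seed(1)
ann <- data.frame(
  modal   = sample(c("may", "can"), 200, replace = TRUE),
  SENSES  = sample(c("ability", "permission", "possibility"), 200, replace = TRUE),
  CONTEXT = sample(c("academic", "conversation", "fiction"), 200, replace = TRUE)
)

tab <- with(ann, table(SENSES, CONTEXT))
tab                                    # cross-tabulation of the two variables
round(prop.table(tab, margin = 1), 2)  # distribution of contexts within each sense
chisq.test(tab)                        # crude first test of a SENSES x CONTEXT association

With real annotations, the same table could later feed a log-linear or cluster analysis, so nothing in the coding scheme is lost by keeping the two variables separate.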

Sunday 11 October 2009

Coding the English modals for senses: Leech & Coates (1980), Coates (1983) and Collins (1988)

Despite the overwhelming literature on the semantics of the English modals and the numerous attempts by many scholars to identify their core meanings and related senses, very few studies have in fact used a corpus-based approach for the purpose of their classification. My current record of such studies comprises the following publications, in chronological order of publication:

  • Joos, M. (1964) The English Verb: Form and Meaning. Madison and Milwaukee
  • Lebrun, Y. (1965) "CAN" and "MAY" in present-day English. Presses Universitaires de Bruxelles
  • Ehrman, M. E. (1966) The Meanings of the Modals in Present-Day American English. The Hague and Paris
  • Hermeren, L. (1978) On Modality in English: A Study of the Semantics of the Modals. Lund: CWK Gleerup
  • Leech, G. N. & Coates, J. (1980) Semantic indeterminacy and the modals. In Greenbaum, S. et al. (eds) Studies in English Linguistics. The Hague: Mouton.
  • Coates, J. (1983) The Semantics of the Modal Auxiliaries. London & Canberra: Croom Helm.
  • Collins, P. (1988) The semantics of some modals in contemporary Australian English. Australian Journal of Linguistics 8, 261-286
  • Collins, P. (2009) Modals and Quasi-Modals in English. Rodopi
The work of Peter Collins is of particular interest to me, as it is the most recent and therefore benefits from the latest developments in both the field of modality and corpus linguistics:

"Modals and Quasi-modals in English" reports the findings of a corpus-based study of the modals and a set of semantically-related 'quasi-modals' in English. The study is the largest and most comprehensive to date in this area, and is informed by recent developments in the study of modality, including grammaticalization and recent diachronic change. The selection of the parallel corpora used, representing British, American and Australian English, was designed to facilitate the exploration of both regional and stylistic variation." (11/10/09)


In his 1988 paper, Collins proposes to investigate possible differences in the distribution and the semantics of can, could, may and might in three varieties of English, namely Australian English, British English and American English. Below, I specifically refer to Collins' 1988 paper.

In terms of theoretical framework, Collins adopts a framework based on Leech and Coates (1980) and Coates (1983), two studies that count amongst the most influential corpus-based studies on the English modals. Collins' motivations behind borrowing an already existing framework are twofold:
  1. To facilitate comparisons between results from his study and those encountered in Coates (1983)
  2. According to Collins, the framework proposed in Leech and Coates (1980) and Coates (1983) "accounts more adequately than any other so far proposed for the complexity and indeterminacy of modal meaning, and is therefore particularly useful in handling the recalcitrant examples that one is forced to confront in a corpus-based study" (p.264)
Considering that Collins' methodological and theoretical approaches are anticipated to feature in my study at one stage or another, I report here his overall framework as well as his taxonomy of the senses of MAY/CAN.

Collins' (borrowed) taxonomy includes the notions of "core" meanings, "periphery" meanings and graded degrees of membership:

"A central concept is that of a fuzzy semantic set, whose members range from the "core" (representing the prototypical meaning) to the "periphery" of the set, with continually graded degrees of membership (the phenomenon of "gradience", as explored by Quirk 1965)" (p.264)

In the case of CAN, the core meaning of the modal is recognised to be that of ability and the periphery meaning that of possibility. More explicitly:

CAN in the sense of ability is paraphrasable as "be able to" or "be capable of". In prototypical, or "core" cases CAN refers to permanent accomplishment, and is more or less synonymous with "know how to".

Collins further notes that core ability cases are "characterised by the presence of animate, agentive subject, a dynamic main verb, and determination of the action by inherent properties of the subject referent". Generally, the more an occurrence lacks these properties, the less prototypical it becomes. In other words, the number of these characteristics present in a given occurrence determines how close the meaning of CAN lies to the core or to the periphery.

So to sum up, gradience has to do with the nature of class membership.

Collins' (borrowed) theoretical framework also includes two other cases, namely ambiguity and merger, which are two different sorts of indeterminacy. Ambiguity refers to cases where "it is not possible to decide from the context which of two (or more) categorically distinct meanings is the correct one" (p.265) and merger refers to cases "where there are two mutually compatible meanings which are neutralised in a certain context" (p.265)

Including both the notions of gradience and indeterminacy, the theoretical framework adopted in Collins (1988) is thus both categorical (i.e. it includes semantic categories such as ability, permission and possibility) on the grounds that:

  • "they co-occur with distinct syntactic and semantic features" (p.266) [see paper for a listing of which syntactic and semantic features typically occur in specific semantic uses of the modals]
  • "they involve distinct paraphrases" (p.266)
  • "ambiguous cases can occur" (p.266)

and fuzzy, in that the framework allows for gradience.

Semantic categories for CAN in Collins (1988)
  • Root meanings, including ability (possible paraphrase: 'able to', 'capable of'), permission (possible paraphrase: 'allowed', 'permitted'), possibility (possible paraphrase: 'possible for')
Collins notes that

"Root Possibility may be regarded (...) as an 'unmarked' meaning, where there is no clear indication either of an inherent property of the subject or of a restriction. The meaning is simply that the action is free to take place, that nothing in the state of the world stands in its way (...). Root Possibility is sometimes difficult to distinguish from ability because ability implies possibility. (...). Because ability CAN and permission CAN normally require a human or at least animate subject, Root Possibility is generally the only sense available when the subject is inanimate" (p.270)

Semantic categories for MAY in Collins (1988)

  • Epistemic Possibility (possible paraphrase: 'it is possible that ...')
  • Permission
  • Root Possibility

Collins notes that

Epistemic Possibility is to be distinguished from Root Possibility in terms of its commitment to the truth of the associated proposition. Whereas Epistemic Possibility expresses the likelihood of an event's occurrence, Root possibility leaves open the question of truth and falsehood, presenting the event as conceivable, as an idea (p.274)

At this point, it will be interesting to see whether the theoretical framework adopted in Collins (2009) has remained the same as the one chosen in Collins (1988) or whether any amendments were made. In the next few days I will investigate Collins' latest framework before starting to code the senses of MAY/CAN as featuring in my data.
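As a provisional starting point for that coding, here is a minimal R sketch of a sense inventory following Collins' (1988) categories. The 'ambiguity' and 'merger' tags and the example rows are my own placeholders, not part of Collins' scheme as such.

# Candidate sense labels for CAN and MAY, following Collins (1988);
# "ambiguity" and "merger" are added here to flag indeterminate cases
can_senses <- c("ability", "permission", "root_possibility", "ambiguity", "merger")
may_senses <- c("epistemic_possibility", "permission", "root_possibility", "ambiguity", "merger")

# Invented example annotations, just to show the intended data structure
occ <- data.frame(
  modal = c("can", "may", "can"),
  sense = factor(c("ability", "epistemic_possibility", "root_possibility"),
                 levels = union(can_senses, may_senses))
)
table(occ$modal, occ$sense)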

Saturday 26 September 2009

Polysemy, syntax, and variation -- a usage-based method for Cognitive Semantics (contribution by Dylan Glynn, 2009)

Hello again, after three months of quietude during which I have been exclusively concentrating on setting up my data for statistical analysis. I have also recently temporarily relocated to UCSB, Santa Barbara from where I will continue to work on my project as a visiting scholar as well as attend Stefan Gries' courses in statistics for linguists with R.

This brief post acknowledges Dylan Glynn's contribution to New Directions in Cognitive Linguistics (2009), entitled 'Polysemy, syntax, and variation -- a usage-based method for Cognitive Semantics'. The post mainly deals with the issue of polysemy in relation to the Quantitative Multifactorial method and does not cover Glynn's chosen statistical technique of Correspondence Analysis proper.

In the interest of time, this post does not engage in any discussion that could arise from Glynn's contribution but rather serves as a personal log of potentially useful quotations and points that I will investigate at a later stage.

Glynn's contribution provides a thorough overview of the treatment of polysemy in Cognitive Linguistics. Glynn's overall premise in relation to polysemy is:

to conserve the network model but to complement [it] with another method: a corpus-driven quantified and multifactorial method (p.76)
Further, Glynn points out that such a multifactorial method inevitably requires approaching polysemy in a non-theoretical fashion:

Such an approach employs a kind of componentional analysis that identifies clusters of features across large numbers of speech events. In other words, rather than analyse the possible meanings of a lexeme, a polysemic network should 'fall out' from an analysis that identifies clusters of the cognitive-functional features of a lexeme's usage. These features do not in any way resemble those of the Structuralist componentional analyses, since they are not based on a hypothetical semantic system, but describe instances of real language usage and are based upon encyclopaedic semantics of that language use in context (p.76)

In relation to the syntagmatic and paradigmatic dimensions of polysemy, Glynn recognises that the interaction between the schematic and/or morpho-syntactic semantics and lexical semantics is yet to be established. Within a dichotomous CL context where 'one position is that syntactic semantics override lexical semantics' and the other position is that 'there exists a complex interaction between all the various semantic structures in all degrees of schematicity', Glynn makes the working assumption that

syntactic variation affects a polysemy network and that its effect cannot be satisfactorily predicted by positing meaning structure associated with grammatical forms and classes a priori. We must therefore account for this variable as an integral part of semantic description. (...) It means that for a given lemma, or root lexeme, there will be semantic variation depending on its syntagmatic context, in other words, its collocation, grammatical, and even tense or case will necessarily affect the meaning of the item" (p.82)

In his approach to polysemy, Glynn treats each lexeme 'as a onomasiological field, or set of parasynonyms' (p.82).


Further reading:

Zelinsky-Wibbelt, C. (1986). An empirically based approach towards a system of semantic features. Proceedings of the 11th International Conference on Computational Linguistics 11:7-12

Thursday 18 June 2009

Profile-based methodology for the comparison of language varieties

In this post, I would like to briefly point out the usefulness of 'profiling' methods for corpus data investigation. For that purpose I specifically refer to a paper entitled Profile-based linguistic uniformity as a generic method for comparing language varieties (2003), authored by Dirk Speelman, Stefan Grondelaers and Dirk Geeraerts. The authors' paper is inspired by studies in language varieties and research methods currently used in dialectometry. For my own purposes, it is interesting to note that the authors make a case for the validity of profile-based linguistic methodology for corpus-data investigation, as the annotation process of my data will include profiling occurrences of may, can and the lemma pouvoir.

In their paper, the authors present "the 'profile-based uniformity', a method designed to compare language varieties on the basis of a wide range of potentially heterogeneous linguistic variables" (abst.) The aim of the authors is to show that profiling the investigated lexical items contributes to the identification of dissimilarities between language varieties on the basis of individual variables, which are ultimately summarised into global dissimilarities. Such a process allows language varieties to be clustered or charted via the use of various multivariate techniques.

Unlike standard methods of corpus investigation, namely frequency counts, the profile-based method "implies usage-based, but add another criterion. The additional criterion is that the frequency of a word or a construction is not treated as an autonomous piece of information, but is always investigated in the context of a profile."(p.11)

The profile-based approach assumes that mere frequency differences in a corpus contribute to the identification of differences between language varieties. According to the authors, the profile-based approach presents two advantages: the avoidance of thematic bias and the avoidance of referential ambiguity.

For the purpose of my project, the authors' paper generally supports my methodological choice to semantically profile the occurrences of may, can and the lemma pouvoir as found in my data. However, in their case study (see paper on p.18) the authors choose to take an onomasiological perspective (i.e. to use a concept as a starting point, and then investigate which words are associated with that concept). My project, on the other hand, takes the opposite perspective, namely the semasiological approach, which in the first instance considers individual words and looks at the semantic information that may be associated with those words. Inevitably, such a difference in approaching the word/sense/concept interface leads to differing acceptations of the term 'profile', as the onomasiological and semasiological perspectives have different starting points. In that respect, the authors consider "[a] profile for a particular concept or linguistic function in a particular language variety [to be] the set of alternative linguistic means used to designate that concept or linguistic function in that language variety, together with their frequencies" (p.5)

For the purpose of my project, the term profile necessarily needs to be defined at word level and needs to incorporate the elements of sense and morpho-syntactic information. In that regard, the Behavioural Profile methodology proposed by Gries and Divjak in Quantitative approaches in usage-based cognitive semantics: myths, erroneous assumptions, and a proposal (in press) is an appropriate methodology for my project. Broadly, the BP methodology involves the identification of both semantic and morpho-syntactic features characteristic of the investigated lexical item, as found in the data. Ultimately, these identified features are used as linguistic variables and are investigated statistically. In the BP model, the identified features are referred to and processed as ID tags, each one of which contributes to the profiling of the lexical item under investigation.
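To make the notion of a profile concrete in either reading, here is a minimal R sketch in which a profile is simply a vector of relative frequencies; all counts are invented. In the onomasiological reading the columns would be alternative forms expressing one concept, in a BP-style reading they would be ID tag levels for one lexical item.

# Invented counts: two varieties (or two lexical items) described over the
# same set of columns (alternative forms, or ID tag levels)
counts <- matrix(c(120, 60, 20,
                    80, 90, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("varietyA", "varietyB"),
                                 c("form1", "form2", "form3")))

profiles <- prop.table(counts, margin = 1)  # one profile (row of proportions) per variety
profiles

# A simple profile-based dissimilarity: city-block distance between the two profiles
sum(abs(profiles["varietyA", ] - profiles["varietyB", ]))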

To sum up, Speelman, Grondelaers and Geeraerts' paper provides me here not only with the opportunity to reflect on the notion of 'profiling' in the context of corpus-data investigation but also with the opportunity to consider the notion from the perspective of my own study.

Tuesday 16 June 2009

Comparing exploratory statistical techniques for semantic descriptions

As Glynn, Geeraerts and Speelman state in Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics (paper presented at the 10th International Cognitive Linguistics Conference in Cracow in July 2007):

Current trends in the study of polysemy have focused on exploratory techniques such as Cluster Analysis and Correspondence Analysis. (abst.)
Broadly, exploratory techniques "identify and visualise patterns in the data". Such an analysis "does not permit inferences about the language, only the sample, or dataset, investigated" (abst.)

On the occasion of the Quantitative Investigations in Theoretical Linguistics 3 event in Helsinki on June 3rd 2008, Dylan Glynn presented a comparison of both the Cluster and Correspondence Analysis statistical methods for the purpose of semantic description (Clusters and Correspondences. A comparison of two exploratory statistical techniques for semantic description) [the powerpoint presentation for this paper can be found here].

Over the past fifteen years, corpus-based research in the field of Cognitive Linguistics has produced a number of studies demonstrating the wide use of both statistical techniques. In his paper, Glynn compares the two techniques on the grounds of the quality/accuracy of the graphic representation of the data and the accuracy of the relative associations of variables as revealed in the data. The assessment of the accuracy of relative associations of variables for each statistical method is based on a regression analysis which takes into consideration "the relationship between the mean value of a random variable and the corresponding values of one or more variables" (OED).

For the purpose of his investigation, Glynn carried out a case study examining the semantic structure of the lexeme annoy in comparison with hassle and bother in a large non-commercial corpus of English specified for the American vs. British English regional difference (for the purpose of that case study Glynn identified the working variables of morpho-syntax and Frame Semantic argument structure). Glynn points out that the Cluster Analysis and Multivariate Correspondence Analysis methods involve different types of graphic representations which, in turn, present a number of shortcomings:

One important difference between the two techniques is that Cluster Analysis is primarily designed to present its results in the form of dendograms where Correspondence Analysis relies on scatter plots. The dendograms of HCA offer clear representations of both the groupings of features and the relative degree of correlation of those features. (...) The principle shortcoming of this representation is that it gives the false impression that all the data falls into groups, where in fact this may not be the case. (...) The scatter plots of Correspondence Analysis, although at times difficult to interpret, offer a much more "analogue" representation of correlation. (...) [T]he representation of the plot is (...) much more approximative than the dendogram. (p.2)
Through his case study, Glynn confirms the usefulness of both statistical methods as exploratory techniques. He also points out that both methods may prove unreliable when processing complex multivariate data and cautions analysts against using them for the specific purpose of confirmatory analysis. However, in the context of exploratory analysis, "the contrast in the result of the complicated analysis across the three lexemes [annoy, hassle and bother] suggests that MCA [Multivariate Correspondence Analysis] is better suited to a truly multivariate exploratory research" (p.2)
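For readers unfamiliar with the two kinds of output being compared, here is a toy R sketch on invented data; only the three lexeme names are borrowed from Glynn's case study. It uses base R's hclust for the dendrogram and MASS's simple correspondence analysis for the scatter plot, so it is an approximation of the techniques Glynn discusses rather than a reproduction of his analysis.

library(MASS)

# Invented co-occurrence counts: three lexemes (rows) by five annotated features
set.seed(2)
m <- matrix(rpois(15, lambda = 10), nrow = 3,
            dimnames = list(c("annoy", "hassle", "bother"), paste0("feature", 1:5)))

# Hierarchical cluster analysis: groupings and their relative similarity as a dendrogram
hc <- hclust(dist(prop.table(m, 1)), method = "ward.D2")
plot(hc, main = "HCA (dendrogram)")

# Correspondence analysis: row/column associations as a scatter plot
ca <- corresp(m, nf = 2)
biplot(ca, main = "Correspondence analysis (scatter plot)")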

With regard to my project, Glynn's paper raises a couple of points:

i) the need to decide on the statistical nature of my overall project analysis -- exploratory, confirmatory or perhaps both possibly following a comparative format (?)

ii) the urgency of clearly identifying the number and the nature of the variables through which I intend to investigate my data sets, as those will be influential in the choice of statistical method -- at the exploratory stage at least.

Statistical techniques for an optimal treatment of polysemy

In this post, I introduced the work of Dylan Glynn who is broadly concerned with developing methodology for corpus-data investigation. Glynn adheres to the Cognitive Linguistics/Semantics framework. Of interest here is a research project he contributed to in collaboration with Dirk Geeraerts and Dirk Speelman, concerned with assessing the efficacy of two types of statistical technique, namely exploratory vs. confirmatory techniques of statistical analysis. Glynn, Geeraerts and Speelman presented the results of their study at the 10th International Cognitive Linguistics Conference in Cracow in July 2007, in a paper entitled Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics [the abstract is accessible from page 11 of the link]. For the purpose of this post I can unfortunately only summarise the content of that paper based on its abstract. As I do not have access to the full paper, I am not in a position to critically assess the arguments proposed by Glynn, Geeraerts and Speelman.

According to the authors, the two main types of statistical technique for corpus-data investigation currently in active use by Cognitive Linguists are i) exploratory techniques (i.e. Cluster Analysis, used in Gries 2006, and Correspondence Analysis, used in Glynn forthcoming) and ii) confirmatory techniques (i.e. Linear Discriminant Analysis, used in Gries 2003 and Wulff 2004, and Logistic Regression Analysis, used in Heylen 2005 and De Sutter et al. in press)

The authors define the aim of each technique as follows:

The goal of (...) exploratory statistics is to identify and visualize patterns in the data. These patterns are argued to represent patterns of usage (...). Exploratory statistics analysis does not permit inferences about the language, only the sample, or dataset, investigated. However, in confirmatory statistics, inference is made from the sample to the population. In other words, one claims that what is seen in the data is representative of the language generally. (abst.)


In the light of my own project, the authors' study is of particular relevance because it identifies the case of polysemy, as an object of investigation, as requiring specific methodological attention:
Current trends in the study of polysemy have focused on exploratory techniques.
However,
[t]he importance of these techniques notwithstanding, the cognitive framework needs to deepen its use of quantitative research especially through the use of confirmatory multivariate statistics.
Further,

Within Cognitive Linguistics, [Linear Discriminant Analysis technique and Logistic Regression Analysis technique] have been successfully used to capture the various conceptual, formal, and extralinguistics factors that lead to the use of one construction over another. However, the study of polysemy differs at this point. Instead of examining the variables that effect the use of one parasynonymous forms to another, we are examining the interaction of a range of formal variables (the lemma and its syntagmatic and inflectional variation), semantic variables, and extralinguistic variables, in the search of correlations across all of these. One possible multivariate technique for this type of data is Log-Linear Modelling. (abst.)
In the course of their study, the authors identified complex sets of correlations between formal and semantic variables through exploratory studies and then modelled these correlations using the Log-Linear Analysis technique.
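As a minimal illustration of what log-linear modelling involves, here is an R sketch on an invented three-way table crossing a formal, a semantic and an extralinguistic variable. The variable names and counts are placeholders, and the model shown is not necessarily the authors' own specification.

library(MASS)

# Invented three-way table of counts
set.seed(3)
dat <- expand.grid(form   = c("may", "can"),
                   sense  = c("permission", "possibility"),
                   region = c("BrE", "AmE"))
dat$count <- rpois(nrow(dat), lambda = 40)
tab <- xtabs(count ~ form + sense + region, data = dat)

# Log-linear model with all two-way interactions but no three-way interaction;
# the printed goodness-of-fit test asks whether the three-way term is needed
m <- loglm(~ (form + sense + region)^2, data = tab)
m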

At this point, Glynn, Geeraerts and Speelman's paper calls for a comparative study of specific polysemous lexical items contextualised in different language varieties and using, in turn, both the Cluster Analysis exploratory technique and the Log-Linear Modelling confirmatory technique. Such a study would contribute to the identification of a possible optimal statistical technique for the investigation of corpus data.

The place of Cognitive Linguistics on the French linguistics scene

As previously described here, part of my project involves the investigation of the lemma pouvoir in a native French subdata set. Analyses of quantitative results of such an investigation will be carried out within the Cognitive Linguistics (CL) framework. Carrying out a literature review covering the polysemy of pouvoir in relation to the CL framework has, so far, proved a little tricky. This post provides a little bit of background on the place of CL in France and in French linguistics generally. At the Congrès Mondial de Linguistique Française in Paris in July 2008, Dirk Geeraerts discussed the situation of CL in the context of French linguistics in a very informative paper entitled La Réception de la Linguistique Cognitive dans la Linguistique du Français. Bonne lecture!

Monday 15 June 2009

Dylan Glynn on the theme of data-driven methodology in Cognitive Linguistics and its usefulness for the treatment of polysemy

In this post, I would like to bring attention to the work of Dylan Glynn whose on-going research is concerned with bridging the empirical and the cognitive. Here is how Glynn describes his own work:

The focus of my work is the development of methodology within the theoretical framework of Cognitive Linguistics. This school of thought imposes the minimal theoretical assumptions upon its model of language. It is for this reason that it is best placed to properly capture the complexity of language in a holistic manner.

In methodological terms, I am most interested in finding ways to capture the multidimensional nature of language structure, from prosody and morphology through to semantics and culture. Specifically, I concentrate on the semantics of Grammatical Constructions, the polysemy and synonymy of lexis, iconicity in morphology, and the interaction of grammar, pragmatics, and metaphor-metonymy.(https://perswww.kuleuven.be/~u0049977/ling.html) [accessed 15/06/09]



As part of a talk given at the 10th International Cognitive Linguistics Conference in July 2007 at the University of Cracow, entitled Usage-Based Cognitive Semantics: A Quantitative Approach, Glynn makes a case for the quantitative treatment of lexical and constructional semantics and claims that "[c]orpus data respects the complexity of language and, if treated in sufficiently large quantities, enables generalisations about language structure that other methods cannot" (abst.). Further, "usage-based quantitative methodology (...) facilitates attempts to reveal the interaction between the different parameters of language simultaneously" (abst.) [my emphasis]

During his opening talk of the theme session Empirical Evidence. Converging approaches to constructional meaning at the Third International Conference of the German Cognitive Linguistics Association on September 25th-27th 2008, Glynn points out the fast-growing interest in empirical cognitive research, particularly in the field of Cognitive Semantics:
Cognitive Linguistics has recently witnessed a new and healthy concern for empirical methodology. Using such methods, important in-roads have been made in the study of near-synonymy, syntactic alternation, syntactic variation and lexical licensing.
Further,
Empirical methods, and methodology generally, are one of the most important concerns for any descriptive science and the recent blossoming of research in this respect in Cognitive Linguistics can be seen as a maturing of the field. A range of recent anthologies on the issue, including Gries & Stefanowitsch (2006), Stefanowitsch & Gries (2006), Gonzales-Marquez & al. (2007), Andor & Pelyvas (forth.), Newman & Rice (forth.), and Glynn & Fischer (in preparation), can be seen as testimony to the importance attached to this issue. Despite the advances in this regard, how the different methods and the results they produce inform each other remains largely ill-understood. Although this question of how elicited, experimental and found data relate has been addressed in the work of Schonefeld (1999,2001), Gries & al. (2005, in press), Goldberg (2006), Arppe & Jarvikivi (in press), Gilquin (in press), Divjak (forth.), and Wiechmann (subm.), it warrants further investigation.
The fast development of data-driven investigation methods within the field of Cognitive Linguistics is further pointed out by Glynn in his opening talk to the theme session Empirical Approaches to Polysemy and Synonymy, at the Cognitive and Functional Perspectives on Dynamic Tendencies in Languages event, on May 29th-June 1 2008. In that particular address, Glynn presents empirical cognitive approaches as a way to address existing issues in the cognitive treatment of polysemy:
Within the cognitive tradition, both the study of polysemy and synonymy have rich traditions. Brugman (1983) and Vandeloise (1984) began the study of sense variation in spatial prepositions that evolved into the radial network model applied to a wide range of linguistic forms, especially grammatical cases and spatial prepositions (Janda 1993, Cuyckens 1995). (...) Despite the success of this research, studies such as Sandra & Rice (1995) and Tyler and Evans (2001) identified serious shortcomings. In light of this, empirical cognitive approaches to semantic structure do not question the validity of the radial network model, but seek to develop methods for testing proposed semantic variation and relation. (abs.) [my emphasis]
In relation to my project (which includes a Cognitive Linguistics treatment of polysemous may, can and pouvoir via an investigation of corpus data), it is with much excitement that I begin to explore the work of Dylan Glynn.

Below is a selected bibliography of Glynn's work that will be of interest for my research (unfortunately, several references are still in press or in preparation!):

  • Glynn, D. In press (6pp). Multifactorial Polysemy. Form and meaning variation in the complex web of usage. R. Caballero (ed.). Lexicología y lexicografía. Proceedings of the XXVI AESLA Conference. Almería: University of Almería Press.
  • Glynn, D. 2008. Polysemy, Syntax, and Variation. A usage-based method for Cognitive Semantics. V. Evans & S. Pourcel (eds). New Directions in Cognitive Linguistics. Amsterdam: John Benjamins.
  • Glynn, D. 2006. Conceptual Metonymy - A study in cognitive models, reference-points, and domain boundaries. Poznan Studies in Contemporary Linguistics 42: 85-102.
  • Glynn, D. 2006. Cognitive Semantics and Lexical Variation. Why we need a quantitative approach to conceptual structure. O. Prokhorova (ed.). Edinstvo sistemnogo i functionalnogo andliza yazykov (Systemic and Functional Analysis of Language). 53-60. Belgorod: Belgorod University Press.
In preparation:

  • Glynn, D., Multidimensional Polysemy. A case study in usage-based cognitive semantics. Will be submitted to Cognitive Linguistics.
  • Glynn, D., Geeraerts, D., & Speelman, D. Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics. D. Glynn & K. Fischer (ed.). Usage-Based Cognitive Semantics. Corpus-Driven methods for the study of meaning. Berlin: Mouton de Gruyter.
  • Glynn, D. & Fischer, K. (eds). Usage-Based Cognitive Semantics. Corpus-Driven methods for the study of meaning. Berlin: Mouton de Gruyter.
  • Glynn, D. Mapping Meaning. Toward a usage-based methodology in Cognitive Semantics. Will be submitted to Mouton de Gruyter.

Sunday 14 June 2009

Behavioral Profiling and polysemy

In their paper entitled In defense of corpus-based methods: A behavioral profile analysis of polysemous 'get' in English (presented at the 24th North West Linguistics Conference, 3-4th May 2008), Andrea L. Berez and Stefan Th. Gries make a general case for the use of corpus data. Their paper serves as a response to Raukko's (1999, 2003) proposal to disregard corpus data investigations in favour of experimentally motivated studies. Berez and Gries conclude that:

"[A] rejection of corpus-based investigations of polysemy is premature: our BP approach to get not only avoids the pitfalls Raukko mistakenly claims to be inherent in corpus research, it also provides results that are surprisingly similar to his own questionnaire-based results, and Divjak and Gries (to appear) show how predictions following from a BP study are strongly supported in two different psycholinguistic experiments." (p.165)
Before conducting a case study of polysemous get -- the results of which are compared, in the second part of the paper, to those presented in Raukko's An "intersubjective" method for cognitive semantic research on polysemy: the case of 'get' (1999) -- the authors briefly state the advantages of corpus data:

- (...) the richness of and diversity of naturally-occurring data often forces the researcher to take a broader range of facts into consideration;
- the corpus output from a particular search expression together constitute an objective database of a kind that made-up sentences or judgements often do not. More pointedly, made-up sentences or introspective judgements involve potentially non-objective (i) data gathering, (ii) classification, (iii) interpretive process on the part of the researcher. Corpus data, on the other hand, at least allow for an objective and replicable data-gathering process; given replicable retrieval operations, the nature, scope and the ideas underlying the classification of examples can be made very explicit (...) (p.159)

Methodologically, Berez and Gries attempt to make their case by targeting 'polysemy' as their domain of investigation and by applying the Behavioral profiling method (described here):

Given the recency of this method, the number of studies that investigate highly polysemous items is still limited. We therefore apply this method to the verb to get to illustrate that not only does it not suffer from the problems of the intersubjective approach, but it also allows for a more bottom-up/data-driven analysis of the semantics of lexical elements to determine how many senses of a word to assume and what their similarities and differences are. (p.157)

Generally, the results encountered in Berez and Gries' study and in Raukko's study are very similar. However, Berez and Gries' BP approach allows for a finer-grained investigation:
we show that some of our results are incredibly close to Raukko's, but also provide an illustration of how the BPs can combine syntactic and semantic information in a multifactorial way that is hard to come by using the kinds of production experiments Raukko discusses. (p.159)

With regard to my project, broadly concerned with a corpus-driven investigation of polysemous lexical items, Berez and Gries' paper provides, methodologically, a useful illustration of how to exploit corpus data optimally for the retrieval of semantic information.

Tuesday 9 June 2009

Behavioral Profiles, snake plots and cross-linguistic comparisons

This post complements this earlier post: The corpus-based Behavioral Profile approach to cognitive semantics, as it revisits the Behavioral Profile (BP) methodology and reports how, according to Divjak and Gries, snake plot representations can graphically reveal the relative significance of ID tags, thus allowing for cross-linguistic ID tag-level comparisons. In this post I make reference to Divjak and Gries' recent paper: Corpus-based cognitive semantics: a contrastive study of phasal verbs in English and Russian (to appear).

Overall, Divjak and Gries demonstrate that the BP methodology not only makes it possible to pick up dissimilarities between polysemous and near-synonymous items but also to recognise and simultaneously process dissimilarities that are characteristically different:

"Because these dissimilarities are of an entirely different order, they can only be picked up if a methodology is used that adequately captures the multivariate nature of the phenomenon. The Behavioral Profiling approach we have developed and apply here does exactly that." (p.273, abst.).


For their investigation of polysemous and near synonymous lexical items the authors assume the existence of networks of words/senses. They also assume that the investigated lexical items in their study are included in such networks. Further, these networks demonstrate internal structure in the sense that "elements which are similar to each other are connected and the strength of the connection reflects the likelihood that the elements display similar syntactic and semantic behaviour" (p.281)


Divjak and Gries' paper achieves three goals:


1/ Presents the BP methodology as a means to provide a usage-based characterisation of the lemma under investigation by identifying individual syntactic and semantic characteristic features.

2/ Demonstrates that a snake plot graphic representation of those syntactic and semantic characteristic features makes it possible to rank them in order of significance and therefore contributes to the identification of clusters of senses "on the basis of distributional characteristics collected in BPs" (p.292). Consequently, snake plot representations allow for the recognition of prototypical features of the investigated lexical items.

3/ Illustrates that, semantically, the BP approach allows for a more rigorous investigation of cross-linguistic translational equivalents.

Overall, the authors are testing the BP approach for a simultaneous treatment of both language-specific data and cross-linguistic data.

"The (...) purpose is to show that this approach can also be applied to the notoriously difficult area of cross-linguistic comparisons. (...) [T]he approach will be put to the test by attempting a simultaneous within-language description and across-languages comparison of polysemous and near-synonymous items belonging to different subfamilies of Indo-European, i.e., English and Russian" (p.277)

Generally, Divjak and Gries' paper encourages putting the BP methodology further to the test by applying it to interlanguage data, where the investigated lexical item in language x, carving out a specific conceptual space, is used by a native speaker of language y whose conceptual space for the translational equivalent of that item is potentially different. In other words, and with regard to the application of the BP methodology to my project, while the paper raises questions about the nature of conceptual spaces in interlanguage, it convincingly offers a methodology that would allow for the computation of my three-way data (including native English, native French and Fr-English interlanguage; details of the three sub-corpora can be found here). Simultaneous treatment of may, can and pouvoir can be carried out within language -- taking into account the native English data vs. the Fr-English interlanguage data -- and across languages -- taking into account the native French vs. native English vs. Fr-English interlanguage data. Finally, the BP approach also provides the opportunity to investigate the possibility of a correlation between the word class membership of may, can and pouvoir and their semantic BPs.
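A toy R sketch of a snake plot on invented data; note that, to keep it minimal, it plots simple differences between two relative-frequency profiles rather than the t-values Divjak and Gries work with.

# Invented behavioural profiles for the "same" item in two languages,
# described over the same fifteen ID tag levels
set.seed(4)
tags <- paste0("IDtag", 1:15)
prof_en <- prop.table(rpois(15, 10)); names(prof_en) <- tags
prof_fr <- prop.table(rpois(15, 10)); names(prof_fr) <- tags

# Snake plot: sorted differences; tags at the extremes distinguish the two items most
d <- sort(prof_en - prof_fr)
plot(d, type = "b", xaxt = "n", xlab = "",
     ylab = "difference in relative frequency", main = "Snake plot (toy data)")
axis(1, at = seq_along(d), labels = names(d), las = 2, cex.axis = 0.7)
abline(h = 0, lty = 2)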

Sunday 24 May 2009

R training at the University of Uppsala



Finally ... back after too long! ...

In previous posts I tried to point out the advantages of using R as a methodological tool for my research project (here and here). Since the publication of Gries's Quantitative Linguistics with R: A Practical Introduction at the end of March, I have started familiarising myself with the R language and working on possible scripts for the application of R to my data. The process has been taking longer than anticipated and is still at an initial stage -- hence the long absence from the blog!

On the 18-19 May 2009, the linguistics department at the University of Uppsala organised an R training workshop led by Stefan Gries (Statistics for linguistics with R: monofactorial tests and beyond), along with a research seminar on 20 May 2009, also given by Stefan Gries. I am extremely grateful to the linguistics department at the University of Uppsala, and particularly to Christer Geisler and Merja Kito, for welcoming me so warmly during the occasion and letting me attend Stefan Gries' workshop and research seminar.

The experience was extremely enriching and motivating; I am now planning to put my new skills to the test within the next few days ...

Friday 27 March 2009

Image-Schema transformations and cross-linguistic polysemy: a matter of terminology

In her 2004 paper (Transformation on image schemas and cross-linguistic polysemy), Lena Ekberg is generally concerned with diachronic semantic change across different languages and she argues that cross-linguistic semantic change is cognitively motivated. She recognises that "[m]odern research within the field of historical lexical semantics and grammaticalization in fact has provided arguments that meaning change is motivated by cognitive principles independent of specific languages" (p.42). Although Ekberg (2004) links with my project in the sense that it takes a cross-linguistic approach to investigate polysemous lexical items while trying to incorporate a Cognitive Semantics approach, it differs from my project in two major ways: i) it identifies specific semantic changes in specific languages and then compares those changes cross-linguistically; and ii) it considers semantic variance diachronically. My project, on the other hand, is concerned with cross-linguistic semantic change in terms of word senses in language x affecting the senses of corresponding words in language y. Further, my project is concerned with on-line cross-linguistic semantic interference and is not concerned with the development of word senses over time. Despite these differences, Ekberg (2004) is of interest to me because it raises a number of terminology-, methodology- and theoretical framework-related issues.

Ekberg's overall stance on semantic change is stated in Construal operations in semantic change: the case of abstract nouns:

"The prerequisites of meaning variation of a lexeme are intrinsic in the underlying schematic structure as well as in the construal operations that may apply to that structure. Thus every instance of semantic change and variation - either resulting in polysemy or contextual meaning variation, is motivated by the possibilities of varying a given schematized structure by means of general and cognitively motivated construal operations" (p.63)

Further,

"[T]he processes generating semantic variation and change operate on the schematized structure underlying the lexical representation of a linguistic expression" (p. ).

Ekberg investigates cross-linguistic semantic change by considering and trying to bring together two theoretical approaches with different theoretical assumptions: the lexical semantics approach and the cognitive semantics approach. In her investigation of "the potential polysemy of lexemes based on a common schema" (p.25), Ekberg (2004) attempts to deal simultaneously with lexical patterns, conceptual processes and cognitive mechanisms. Overall, the paper highlights the limitations of such an inclusive methodology that ultimately relies on loose use of terminology.

Ekberg's (2004) working assumption is that:
  • "semantic structures at a certain level of abstraction, as well as the principles of meaning change, are universal devices for generating new lexical meaning variants" (p.26)
Ekberg (2004) claims that:
  • polysemy results from a process of image-schema transformation which itself results from a mental construal process
  • polysemy refers to meaning variants of the same lexeme related by means of image-schema transformations and which are regarded as separate senses, i.e. instantiation of polysemy
  • lexical meaning extensions reflecting transformations of image-schematic structure are cognitively motivated and thus arise cross-linguistically
  • image-schema transformations are motivated by mental construal processes

Issues raised:
  • Ekberg recognises the image schema transformation as a central process in the emergence of new senses. However, in the paper, the term image schema lacks a reliable working definition. The term is first defined on page 28, in the sense of Johnson (1987) as " a recurring dynamic pattern [...] that gives coherence and structure to our experience". The term is then later referred to on page 36 as being "the most abstract basis of lexical meaning", and on page 43 as an "underlying abstract semantic structure". In other words, throughout the paper, it is unclear whether the term refers to schematic representations of word senses or whether it refers to schematic representations of physical experiences. In the first case, the approach to cross-linguistic semantic change and polysemy is lexically based. In the second case, the approach is experientially based and therefore conceptual in nature (i.e. pre-linguistic). Distinguishing between the two cases is important because they both ultimately refer to different stages/levels in the construction of meaning. The author's attempt to bridge lexical matters (i.e. linguistic in nature) and conceptual matters (i.e. pre-linguistic in nature) creates a degree of confusion about the level of abstraction targeted in the discussion.
  • Similarly, the term cognitively motivated ("lexical meaning extensions reflecting transformations of image-schematic structure are cognitively motivated and thus arise cross-linguistically") calls for clarification. Assuming that lexical meaning extensions do reflect transformations of image-schematic structure (as understood in the CL framework) then those meaning extensions are by definition cognitively motivated and the phrase quoted above is redundant and therefore not useful. Alternatively, the term (in the context of the example) could be referring to a speaker's specific cognitive ability which could be applied to the process of lexical meaning extensions. Under the term cognitive, it is unclear whether the author refers to a cognitive ability allowing speakers to extend lexical meanings in similar ways in different languages or whether the author refers to a conceptual process (i.e. image-schema, as understood in the CL framework). Without a solid working definition of the term image schema, it is difficult to recognise that polysemy results from a process of image schema transformation. It is also difficult to recognise what exactly is being transformed in the process of meaning extension: the schematic representation of lexical meanings or the image schema as an analog representation of a physical experience.
Ekberg (2004) raises questions about the possibility and feasibility of bridging the lexical and the conceptual via the cognitive process of the image schema. As far as my study is concerned, even though an overall CL approach to may/can in French-English IL will allow for an analysis of how the senses of may/can are represented in the French-English bilingual mind, the study may well be restricted to showing just that! Talmy, Sweetser and Johnson have investigated the English modals in terms of linguistic tools referring to the image schema of Force Dynamics. Although I cannot ignore such studies, the question now is: how can they be exploited empirically?

Monday 23 March 2009

From corpus to clusters: Gries and Divjak's suggested methodology

In Behavioral profiles: a corpus approach to cognitive semantic analysis (to appear), Gries and Divjak propose a methodology for approaching polysemy that is both empirical and grounded in the Cognitive Linguistics (CL) framework. The authors' methodology is of interest for my project because I adopt an empirical approach, I follow the CL framework, and my investigated words (i.e. may, can and pouvoir) are all polysemous lexical items.


In their introduction, the authors:

i) review the treatment of polysemy in CL

ii) present existing issues surrounding the identification of the prototypical sense(s) of a word

iii) claim that a more sophisticated quantitative approach to corpus investigation would provide cognitive-linguistically relevant results.


Gries and Divjak describe their methodology as one that “is radically corpus-based because it relies on the correlation between distributional patterns and functional characteristics to a much larger extent than most previous cognitive-linguistic work” (p.60). The authors claim that their methodology “aims at providing the best of both worlds, i.e. a precise, quantitative corpus-based approach that yields cognitive-linguistically relevant results” (p.60)

Method:

A four-step method based on the concept of ID tags (cf. Atkins 1987) and the notion of the Behavioral Profile (cf. Hanks 1996).

The method assumes that “the words or sense investigated are part of a network of words/senses”:

“In this network, elements which are similar to each other are connected in such a way that the strength of the connection reflects the likelihood that the elements display similar behavior with respect to phonological, syntactic, semantic or other type of linguistic behaviour” (p.61)

The four stages:

Stages 1-3 are concerned with data processing.

Stage 4 is concerned with meaningful data evaluation.

  1. The retrieval of all instances of a word’s lemma from a corpus
  2. A manual analysis of many properties of the word form (i.e. the annotation of the ID tags)
  3. The generation of a co-occurrence table
  4. The evaluation of the table by means of exploratory and other statistical techniques

Data processing:

Stage 1: use of a concordance program to retrieve all hits of the lemma of a word

Stage 2: all hits are annotated for ID tags

Results from stage 2 are displayed in a co-occurrence table in which:

· each row contains one citation of the word in question

· each column contains an ID tag

· each cell contains the level of the ID tag for that citation

Stage 3: The co-occurrence table is turned into a frequency table (every row contains a level of an ID tag, every column contains a sense of the polysemous word, and each cell provides the frequency of co-occurrence of that ID tag level with that word sense)

[NB: to compare senses that occur at different frequencies, absolute frequencies need to be turned into relative frequencies (i.e. within ID tag percentages)]

Stage 3 results in the Behavioral Profile for a word sense: “each sense of a word (…) is characterized by one co-occurrence vector of within-ID tag relative frequencies” (p.63)
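A minimal R sketch of stages 2-3: the annotations below are invented and only one ID tag (subject animacy) is tabulated, for brevity.

# Invented annotated citations (stage 2): each row is one citation,
# each column an ID tag; "sense" holds the manually assigned sense
ann <- data.frame(
  sense   = c("ability", "ability", "permission", "possibility", "possibility"),
  subject = c("animate", "animate", "animate",    "inanimate",   "inanimate"),
  negated = c("no",      "yes",     "no",         "no",          "yes")
)

# Stage 3: frequency of each ID tag level per sense ...
freq <- table(ann$subject, ann$sense)
freq

# ... converted to within-ID-tag relative frequencies, so that senses of
# different overall frequency can be compared on the same footing
prop.table(freq, margin = 2)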

Stage 4 of Gries and Divjak’s methodology evaluates the vector-based behavioural profiles identified in stage 3.

Data evaluation

The evaluation can be carried out using quantitative approaches (i.e. standardized statistical tests).

Gries and Divjak recognise two types of evaluations: monofactorial and multifactorial evaluations:

  • Monofactorial evaluation: looks at token frequency and type frequency. “A useful strategy to start with is identifying in one’s corpus the most frequent senses of the word(s) one is investigating” (p.64)

  • Multifactorial evaluation: the authors specifically focus on the exploratory technique of hierarchical agglomerative cluster analysis. Hierarchical agglomerative cluster analysis (HAC) is a family of methods that aims at identifying and representing (dis)similarity relations between different items.

How to do a hierarchical agglomerative cluster analysis (a minimal code sketch follows the steps below):

i) The relative co-occurrence frequency table needs to be turned into a similarity/dissimilarity matrix (this requires settling on a specific distance or similarity measure)

ii) Selection of an amalgamation strategy (i.e. the algorithm that defines how the elements to be clustered are joined together on the basis of the variables, or ID tags, they were inspected for; the most widely used amalgamation strategy is Ward’s rule)

iii) Results appear in the form of a hierarchical tree diagram representing distinguishable clusters with high within-cluster similarity and low between-cluster similarity
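As an illustration of steps i)-iii) (not the authors' own implementation), the sketch below clusters four invented behavioural-profile vectors with scipy, using Euclidean distance as the dissimilarity measure and Ward's rule as the amalgamation strategy; the sense labels and figures are hypothetical.

```python
# HAC on invented behavioural-profile vectors (one row per sense); the labels
# and figures are hypothetical and chosen purely for illustration.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

senses = ["may_possibility", "may_permission", "can_ability", "can_possibility"]
profiles = np.array([
    [0.70, 0.30, 0.10, 0.60, 0.30],
    [0.20, 0.80, 0.50, 0.30, 0.20],
    [0.10, 0.20, 0.80, 0.10, 0.70],
    [0.65, 0.25, 0.15, 0.55, 0.35],
])

# i) turn the vectors into a (dis)similarity matrix (Euclidean distance here)
distances = pdist(profiles, metric="euclidean")

# ii) amalgamation strategy: Ward's rule
tree = linkage(distances, method="ward")

# iii) hierarchical tree diagram (dendrogram)
dendrogram(tree, labels=senses)
plt.show()
```

Other distance measures and amalgamation strategies could be swapped in at steps i) and ii), depending on the kind of ID tags used.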


Detailed analysis of the clustering solution

i) Assessment of the ‘cleanliness’ of the tree diagram

ii) Assessment of the clearest similarities emerging from the tree diagram

iii) Between-cluster differences can be assessed using t-values

NB: “the fact that a cluster analysis has grouped together particular sense/words does not necessarily imply that these senses or words are identical or even highly similar – it only shows that these sense/words are more similar to each other than they are to the rest of the senses/words investigated. By means of standardized z-scores, one can tease apart the difference between otherwise highly similar senses/words and shed light on what the internal structure of a cluster looks like” (p.67)
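The following is my own rough sketch (not the authors' published procedure) of how such standardized z-scores could be read off a behavioural-profile table: each ID-tag level is standardized across senses, so that large positive or negative values flag the tag levels on which two clustered senses still diverge. The figures are invented.

```python
# Invented behavioural-profile table: rows are ID-tag levels, columns are senses.
import pandas as pd

profiles = pd.DataFrame(
    {"may_possibility": [0.70, 0.30, 0.60],
     "can_possibility": [0.65, 0.35, 0.55],
     "can_ability":     [0.10, 0.90, 0.15]},
    index=["negation:no", "negation:yes", "subject:3rd"],
)

# Standardize each ID-tag level across the senses; a sense with a large |z|
# on some level behaves atypically on that level compared with the others,
# which helps tease apart senses that the cluster analysis grouped together.
z_scores = profiles.sub(profiles.mean(axis=1), axis=0).div(profiles.std(axis=1), axis=0)
print(z_scores.round(2))
```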

The authors' methodology and my project:

  • Can the authors' method lead to the identification of semantic clusters between the different senses of may, can and pouvoir?
  • If so, what semantic features characterise each cluster? Can between-cluster differences be identified?
  • How useful is the proposed methodology for the elaboration of a cross-linguistic semantic network of the senses of may, can and pouvoir?
  • How useful is the proposed methodology for both the identification of cross-linguistic between-cluster differences and the identification of within-cluster characteristics?

Overall, the exploration of the authors' proposed methodology using my data should prove a useful exercise because it provides the opportunity to investigate the mental semantic organisation of word senses at a cross-linguistic level.

Friday 6 March 2009

Approaching the data statistically: what to test, how and why?

At this point in the project, the investigation of the data is broadly anticipated to include two separate stages, each bearing different methodological assumptions. The first stage is purely quantitative in nature and follows a traditional trend in corpus linguistics to assess "the distribution of a single variable such as word frequency" (Oakes: 1998). The literature refers to that type of approach as univariate. By adopting the traditional approach in the first stage of the investigation, my aim is to provide a preliminary overview of the behaviour of may, can and pouvoir in all three subcorpora. However, although that stage will provide general patterns of use of the modals in the different subcorpora, the weight of the results gathered from frequency tests will need to be handled cautiously because of variability within and between corpora. The second stage of the data investigation includes the computation of qualitative information such as word senses and contextual/pragmatic information. That stage is anticipated to consist mainly of cluster analyses. A description of that type of analysis and its implications for my study will be presented in a later post.

This post is only concerned with the first stage of investigation. I present an overview of the range of statistical tests that I judge suitable for word-frequency-motivated investigations, and I then show the relevance of those tests in the context of my data. The information presented below is drawn from Michael P. Oakes's Statistics for Corpus Linguistics.

As a first step into the quantitative stage, the central tendency of the data needs to be identified. A measure of central tendency represents the data of a group of items in a single score, the most typical score for the data set (p.2). There are three possible measures of the central tendency of a data set: the median (the central score of the distribution, with half of the scores falling above it and the other half below), the mode (the most frequently obtained score in the data set) and the mean (the average of all scores in the data set). The mode has the disadvantage of being easily affected by chance scores in smaller data sets. The disadvantage of the mean, on the other hand, is that it is affected by extreme values and might not be reliable in cases where the data is not normally distributed. In the context of my data, the mean is judged to be the most appropriate measure of central tendency (a preliminary investigation of the frequency of the occurrences of may, can, may not, cannot and can't did not reveal cases of extremely low/high numbers of uses; moreover, the parametric tests described below assume that the mean is an appropriate measure of central tendency). The mean is also necessary for the calculation of z-scores (a statistical measure of the closeness of an element to the mean value for all the elements in a group) and of the standard deviation (a measure which takes into account the distance of every data item from the mean).
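By way of illustration only (the counts below are invented rather than drawn from my corpora), the three measures and the standard deviation can be computed with Python's standard library:

```python
# Invented per-text counts of "may", for illustration only.
from statistics import mean, median, mode, stdev

counts = [4, 7, 5, 6, 5, 8, 5, 6]

print("mean:", mean(counts))      # average of all scores
print("median:", median(counts))  # central score of the distribution
print("mode:", mode(counts))      # most frequently obtained score
print("standard deviation:", round(stdev(counts), 2))
```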

Once the central tendency of individual data sets is identified, specific statistical tests will allow for the comparison of those data sets. Broadly, there are two types of tests: parametric tests and non-parametric tests. Parametric tests assume that: i) the data is normally distributed, ii) the mean and the standard deviation are appropriate measures of central tendency and dispersion, and iii) observations are independent, so that the score assigned to one case must not bias the score given to any other. Non-parametric tests work with frequencies and rank-ordered scales, and they do not depend on the population being normally distributed.

Generally, parametric tests are considered to be more powerful and are recommended to be the tests of choice if all the necessary assumptions apply.

Parametric tests:

t test: a statistical significance test based on the difference between observed and expected results. In other words, the t test allows for the comparison of the means of two different data sets: it assesses the difference between two groups of normally distributed interval data where the mean and standard deviation are appropriate measures of the central tendency and variability of the scores.

t tests are used rather than z-score tests whenever the analyst is dealing with a small sample (i.e. where either group has fewer than 30 items). Once the standard deviation is calculated, the z-score indicates how far from the mean a particular data item is located: a z-score of +1 indicates one standard deviation above the mean, and a z-score of -1.5 indicates 1.5 standard deviations below the mean.

In the context of my data, a t test would establish whether there is any statistically significant difference (i.e. some certainty that a result is unlikely to be due purely to chance) between the following (a code sketch follows the list):

-the uses of may and can in ICLE FR and LOCNESS.
-the uses of may not, cannot and can't in ICLE FR and LOCNESS
-the uses of may and can in ICLE FR and LOCNESS, in argumentative texts
-the uses of may and can in ICLE FR and LOCNESS, in literary texts
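A minimal sketch of such a t test, assuming hypothetical per-essay relative frequencies of may; the figures are invented, and the step of extracting them from ICLE FR and LOCNESS is omitted.

```python
# Invented per-essay relative frequencies (per 1,000 words) of "may";
# the figures do not come from ICLE FR or LOCNESS.
from scipy import stats

icle_fr = [1.2, 0.8, 1.5, 1.1, 0.9, 1.4, 1.0, 1.3]
locness = [0.7, 1.0, 0.6, 0.9, 0.8, 1.1, 0.7, 0.8]

# Independent-samples t test comparing the two group means.
t_statistic, p_value = stats.ttest_ind(icle_fr, locness)
print(f"t = {t_statistic:.2f}, p = {p_value:.3f}")

# A p-value below the chosen significance level (e.g. 0.05) would suggest that
# the difference between the two groups is unlikely to be due to chance alone.
```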

Based on the calculation of the mean, applying the standard deviation to the ICLE FR subcorpus would make it possible to identify what proportion of that data set deviates from the expected results and, consequently, which scores are typical of that data set. Further, a calculation of the z-scores in ICLE FR would make it possible to identify the uses of may/can that are most typical of native French speakers (represented by the z-scores closest to the mean) and the least typical uses (represented by the z-scores furthest away from the mean).
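A sketch of that step on invented per-writer counts: z-scores close to 0 point to writers whose use of may/can is typical of the group, while large positive or negative z-scores point to atypical writers (the 1-standard-deviation cut-off used below is an arbitrary choice for the example).

```python
# Invented per-writer counts of may/can in ICLE FR, for illustration only.
from statistics import mean, stdev

per_writer_counts = [12, 9, 15, 11, 10, 22, 8, 13]

m, sd = mean(per_writer_counts), stdev(per_writer_counts)
z_scores = [(c - m) / sd for c in per_writer_counts]

for writer, z in enumerate(z_scores, start=1):
    # The 1-standard-deviation cut-off is arbitrary, chosen only for the example.
    label = "typical" if abs(z) < 1 else "atypical"
    print(f"writer {writer}: z = {z:+.2f} ({label})")
```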

These calculations will also be useful because they will make it possible to establish whether there are statistically significant differences in the uses of may/can between individual native French speakers. Such information will ultimately be useful at the qualitative stage of the investigation, when examining the possible motivations for such differences at a cognitive level.

Non-parametric tests:

In the section above, I pointed out the usefulness of parametric tests for the purpose of my study. However, it is worth noting that, as a non-parametric test, the chi-square test assesses the relationship between the frequencies in a display table: it allows for an estimation of whether the frequencies in the table differ significantly from each other. Oakes (1998) notes that when working with frequency data, the chi-square test is a good technique for modelling a two-variable table. In my study, the chi-square test could perhaps be used as an additional test to confirm results obtained from the standard deviation test.
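For instance, a chi-square test of independence could be run on a two-by-two table of may and can counts in the two subcorpora; the counts below are invented for the sake of the example.

```python
# Invented raw counts of "may" and "can" in two subcorpora, for illustration only.
from scipy.stats import chi2_contingency

#                may   can
contingency = [[120, 340],   # ICLE FR (hypothetical counts)
               [ 95, 410]]   # LOCNESS (hypothetical counts)

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")

# A small p-value would indicate that the distribution of may vs can
# differs significantly between the two subcorpora.
```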

So what's next?
  • calculate the mean of the uses of may/can in ICLE FR
  • calculate the mean of the uses of may/can in LOCNESS
  • calculate the mean of the uses of may not/cannot/can't in ICLE FR
  • calculate the mean of the uses of may not/cannot/can't in LOCNESS
  • calculate the standard deviation in all of the above
  • carry out a t test in all of the above
  • calculate the z scores in all of the above