Friday, 16 October 2009

"can" and "may" in present-day English, Yvan Lebrun, 1965

In a previous post, I referred to a corpus-based study of "can" and "may" by Yvan Lebrun, namely "can" and "may" in present-day English (1965). In this post, I briefly present -- or rather log -- the scope of Lebrun's study (I will present his general conclusions in a later post):

  • the study is corpus-based and includes data from both British and American English
  • a variety of genres are featured in the data: short stories, novels, plays, newspapers, scientific texts
  • all texts featuring in the data were published between 1955 and 1962
  • the study includes occurrences of might and could
  • numbers of occurrences:
  1. Total number of occurrences, including may, can, might, could: 4765
  2. Total number of occurrences of can: 2024
  3. Total number of occurrences of could: 1745
  4. Total number of occurrences of may: 491
  5. Total number of occurrences of might: 505
  • Methodologically, Lebrun scanned each instance of the modals to ascertain lexical meanings. The modals were considered to convey the same lexical meaning whenever their semantical contents proved identical once such significant oppositions as "present" vs. "past" or "indicative" vs. "conditional" had been discarded (p.11)
  • In order to decide on the semantic content of the modals, Lebrun relies on the context of each instance
  • Lebrun first identifies all the lexical senses and then attempts to define them
  • the process of defining the lexical senses was initially guided by Sommerfelt's recommendation (i.e. that 'the definition be able to replace the word in an ordinary sentence'). This 'replacement' procedure was abandoned on the grounds that:

"In none of the lexical meanings CAN, COULD, MIGHT, MAY can be equated with a substitutable word or phrase. In fact, each of their lexical senses is so wide that only a long series of 'synonyms' can cover it" (p.11)

Further,

"Instead of defining CAN, COULD,MIGHT, MAY by means of longish strings of juxtaposed partial equivalents and thus blurring out the internal unity of the lemma's meaning, I renounced Sommerfelt's principle and aimed at definitions that (a) embrace every facets of the sense they are meant to cover, (b) bring out the internal unity of each meaning, and (c) emphasize what the various significations of a lemma have in common." (p.11)

  • Overall methodological strategy:
  1. Based on the three recognised lexical meanings that are common to CAN, COULD, MIGHT, MAY, Lebrun calculates how often each of these meanings is expressed by MAY rather than by CAN and by MIGHT rather than by COULD.
  2. Lebrun examines cases where MAY and CAN are synonyms
  3. Based on the discovery that some collocations exclude the use of one of the two synonyms, Lebrun calculates the frequency of MAY relative to CAN in kinds of clauses where either word can be used idiomatically and tries to find out whether this relative frequency is independent of the context (see the sketch below) [my emphasis; this part of Lebrun's methodology reinforces the idea of including, in my study, two separate variables (i.e. SENSES and CONTEXT) for a treatment of the meanings of MAY and CAN, as featuring in my data. For more details on this, see this previous post]
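
As a toy illustration of step 3, the R sketch below uses Lebrun's four totals from above but invents the breakdown by clause type -- the clause labels and their cell counts are hypothetical, purely to show how the independence question could be tested:

    # Lebrun's reported totals (real); relative frequencies of the four modals
    counts <- c(can = 2024, could = 1745, may = 491, might = 505)
    sum(counts)                   # 4765, matching Lebrun's overall total
    round(prop.table(counts), 3)

    # Hypothetical two-way breakdown of MAY vs CAN by kind of clause in which
    # either word can be used idiomatically (cell counts are invented)
    tab <- matrix(c(300, 191, 1200, 824), nrow = 2, byrow = TRUE,
                  dimnames = list(modal  = c("may", "can"),
                                  clause = c("declarative", "interrogative")))
    chisq.test(tab)  # is the may/can ratio independent of clause type?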

Further reading (of early studies):

Lebrun, Y., Can and May, A Problem of Multiple Meaning, in Proceedings of the Ninth International Congress of Linguists, 1962 (The Hague, Mouton, 1964)

Ten Bruggencate, K., The Use of Can and May, in Taalstudie 3 (1882), 94-106

Wood, F., May and Might in Modern English, in Moderna Språk 49 (1955), 247-253

Senses vs. context in the coding of the semantics of MAY and CAN

This brief post continues the theme of the previous post, where I raised the issue of coding the senses of "may" and "can" most effectively for the purpose of statistical analysis. It provides a short update on my current line of thinking regarding the design of an optimal coding system for the meanings of "may" and "can".

With regard to my project, I am now at a stage where I am about to start annotating the senses of "may" and "can" as featuring in my data, and I am currently concerned with defining an appropriate degree of granularity for that stage of the coding. In other words, I need to establish how much contextual information should be included in the coding of the senses of the modals. In that connection, I am now considering the inclusion of an extra variable (in addition to a SENSES variable) for the investigation of the behaviour of "may" and "can", namely that of CONTEXT. The motivation behind the inclusion of the CONTEXT variable would be to ultimately assess and quantify contextual weight on the semantics of the modals. Including a CONTEXT variable in the study would also allow for the exclusion of 'contextuality' as a level of the SENSES variable. I would therefore approach, with SENSES, each occurrence of the modals according to its generally recognised "core" meaning.

Dealing with the senses of the modals from the perspectives of both context and core meanings has two advantages. First, the number of levels included for each variable will be smaller than if only one variable were considered, which would facilitate the recognition of possible patterns in the data. Second, the two variables CONTEXT and SENSES could then be tested for possible mutual interaction, which could ultimately be quantified statistically (a possibility sketched in the code below).

Such a design of the data would also allow me to address a whole strand of literature on the English modals that tries to assess what, semantically, belongs to the modals and what belongs to the context and the situation of utterance, and to what extent. To my knowledge, that line of work still remains to be experimentally challenged. Identifying and differentiating two meaning-related variables such as SENSES and CONTEXT could facilitate the possible inclusion of an experimental task aimed at assessing potential statistical results. I am currently exploring the feasibility of that possibility.
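
As a very rough sketch of what this two-variable design could look like in R -- all variable names, levels and tokens below are invented for illustration -- the interaction between SENSES and CONTEXT could be probed with a simple test of association:

    # Mock annotation table: one row per occurrence of "may"/"can"
    set.seed(1)  # reproducible toy data
    tokens <- data.frame(
      modal   = sample(c("may", "can"), 200, replace = TRUE),
      sense   = sample(c("ability", "permission", "possibility"), 200, replace = TRUE),
      context = sample(c("neutral", "epistemic-cueing", "deontic-cueing"), 200, replace = TRUE)
    )
    tab <- xtabs(~ sense + context, data = tokens)
    chisq.test(tab)  # on real data: are SENSES and CONTEXT associated?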

Sunday, 11 October 2009

Coding the English modals for senses: Leech & Coates (1980), Coates (1983) and Collins (1988)

Despite the overwhelming literature on the semantics of the English modals and the numerous attempts by many scholars to identify their core meanings and related senses, very few studies have in fact used a corpus-based approach for the purpose of their classification. My current record of such studies comprises the following publications, in chronological order of publication:

  • Joos, M. (1964) The English Verb: Form and Meaning. Madison and Milwaukee
  • Lebrun, Y. (1965) "CAN" and "MAY" in present-day English. Presses Universitaires de Bruxelles
  • Ehrman, M. E. (1966) The Meanings of the Modals in Present-Day American English. The Hague and Paris
  • Hermerén, L. (1978) On Modality in English: A Study of the Semantics of the Modals. Lund: CWK Gleerup
  • Leech, G. N. & Coates, J. (1980) Semantic indeterminacy and the modals. In Greenbaum, S. et al. (eds) Studies in English Linguistics. The Hague: Mouton.
  • Coates, J. (1983) The Semantics of the Modal Auxiliaries. London & Canberra: Croom Helm.
  • Collins, P. (1988) The semantics of some modals in contemporary Australian English. Australian Journal of Linguistics 8, 261-286
  • Collins, P. (2009) Modals and Quasi-Modals in English. Rodopi

The work of Peter Collins is of particular interest to me, as it is the most recent and therefore benefits from the latest developments in both the field of modality and corpus linguistics:

"Modals and Quasi-modals in English" reports the findings of a corpus-based study of the modals and a set of semantically-related 'quasi-modals' in English. The study is the largest and most comprehensive to date in this area, and is informed by recent developments in the study of modality, including grammaticalization and recent diachronic change. The selection of the parallel corpora used, representing British, American and Australian English, was designed to facilitate the exploration of both regional and stylistic variation." (11/10/09)


In his 1988 paper, Collins sets out to investigate possible differences in the distribution and the semantics of can, could, may and might across three varieties of English, namely Australian English, British English and American English. The discussion below refers specifically to that paper.

In terms of theoretical framework, Collins adopts a framework based on Leech and Coates (1980) and Coates (1983), two studies that count amongst the most influential corpus-based studies on the English modals. Collins' motivations behind borrowing an already existing framework are twofold:
  1. To facilitate comparisons between results from his study and those encountered in Coates (1983)
  2. According to Collins, the framework proposed in Leech and Coates (1980) and Coates (1983) "accounts more adequately than any other so far proposed for the complexity and indeterminacy of modal meaning, and is therefore particularly useful in handling the recalcitrant examples that one is forced to confront in a corpus-based study" (p.264)
Considering that Collins' methodological and theoretical approaches are anticipated to feature in my study at one stage or another, I report here his overall framework as well as his taxonomy of the senses of MAY/CAN.

Collins' (borrowed) taxonomy includes the notions of "core" meanings, "periphery" meanings and graded degrees of membership:

"A central concept is that of a fuzzy semantic set, whose members range from the "core" (representing the prototypical meaning) to the "periphery" of the set, with continually graded degrees of membership (the phenomenon of "gradience", as explored by Quirk 1965)" (p.264)

In the case of CAN, the core meaning of the modal is recognised to be that of ability and the periphery meaning that of possibility. More explicitly:

CAN in the sense of ability is paraphrasable as "be able to" or "be capable of". In prototypical, or "core" cases CAN refers to permanent accomplishment, and is more or less synonymous with "know how to".

Collins further notes that core ability cases are "characterised by the presence of animate, agentive subject, a dynamic main verb, and determination of the action by inherent properties of the subject referent". Generally, the more of these properties an occurrence lacks, the less prototypical it becomes. In other words, the number of such characteristics present in a given occurrence determines how close the meaning of CAN sits to the core or to the periphery.

So to sum up, gradience has to do with the nature of class membership.
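
One crude way to operationalise this graded membership is to score each occurrence of CAN by how many of Collins' four core-ability properties it displays. A toy R sketch, in which the example tokens and their feature values are my own inventions:

    # Binary features per occurrence of CAN (1 = property present)
    features <- data.frame(
      animate_subject   = c(1, 0, 0),
      agentive_subject  = c(1, 0, 0),
      dynamic_main_verb = c(1, 1, 1),
      inherent_property = c(1, 1, 0),
      row.names = c("She can swim", "This car can do 200 km/h", "It can rain a lot here")
    )
    features$core_score <- rowSums(features)  # 4 = prototypical core; lower = towards periphery
    features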

Collins' (borrowed) theoretical framework also includes two other cases, namely ambiguity and merger, which are two different sorts of indeterminacy. Ambiguity refers to cases where "it is not possible to decide from the context which of two (or more) categorically distinct meanings is the correct one" (p.265), and merger refers to cases "where there are two mutually compatible meanings which are neutralised in a certain context" (p.265)

Including both the notions of gradience and indeterminacy, the theoretical framework adopted in Collins (1988) is thus both categorical (i.e. it includes semantic categories such as ability, permission, possibility), on the grounds that:

  • "they co-occur with distinct syntactic and semantic features" (p.266) [see paper for a listings of which syntactic and semantic features typically occur in specific semantic uses of the modals]
  • "they involve distinct paraphrases" (p.266)
  • "ambiguous cases can occur" (p.266)

and fuzzy, in that the framework allows for gradience.

Semantic categories for CAN in Collins (1988)
  • Root meanings, including ability (possible paraphrase: 'able to', 'capable of'), permission (possible paraphrase: 'allowed', 'permitted'), possibility (possible paraphrase: 'possible for')
Collins notes that

"Root Possibility may be regarded (...) as an 'unmarked' meaning, where there is no clear indication either of an inherent property of the subject or of a restriction. The meaning is simply that the action is free to take place, that nothing in the state of the world stands in its way (...). Root Possibility is sometimes difficult to distinguish from ability because ability implies possibility. (...) Because ability CAN and permission CAN normally require a human or at least animate subject, Root Possibility is generally the only sense available when the subject is inanimate" (p.270)
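
Collins' last observation suggests an obvious first-pass heuristic for the annotation stage: tag CAN as Root Possibility whenever its subject is inanimate, and set everything else aside for manual coding. A hypothetical R sketch (the mini-lexicon and the function are my own, not Collins'):

    # Invented mini-lexicon of animate subjects; real work would need proper animacy tagging
    animate_subjects <- c("she", "he", "they", "the dog", "the students")
    tag_can <- function(subject) {
      if (tolower(subject) %in% animate_subjects) "manual-coding" else "root-possibility"
    }
    tag_can("the weather")  # "root-possibility"
    tag_can("she")          # "manual-coding"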

Semantic categories for MAY in Collins (1988)

  • Epistemic Possibility (possible paraphrase: 'it is possible that ...')
  • Permission
  • Root Possibility

Collins notes that

Epistemic Possibility is to be distinguished from Root Possibility in terms of its commitment to the truth of the associated proposition. Whereas Epistemic Possibility expresses the likelihood of an event's occurrence, Root possibility leaves open the question of truth and falsehood, presenting the event as conceivable, as an idea (p.274)

At this point, it will be interesting to see whether the theoretical framework adopted in Collins (2009) has remained the same as the one chosen in Collins (1988) or whether any amendments were made. In the next few days I will investigate Collins' latest framework before starting to code the senses of MAY/CAN as featuring in my data.

Saturday, 26 September 2009

Polysemy, syntax, and variation -- a usage-based method for Cognitive Semantics (contribution by Dylan Glynn, 2009)

Hello again, after three months of quietude during which I have been exclusively concentrating on setting up my data for statistical analysis. I have also recently relocated, temporarily, to UCSB in Santa Barbara, from where I will continue to work on my project as a visiting scholar and attend Stefan Gries' courses in statistics for linguists with R.

This brief post acknowledges Dylan Glynn's contribution to New Directions in Cognitive Linguistics (2009), entitled 'Polysemy, syntax, and variation -- a usage-based method for Cognitive Semantics'. The post mainly deals with the issue of polysemy in relation to the Quantitative Multifactorial method and does not cover Glynn's chosen statistical technique of Correspondence Analysis proper.

In the interest of time, this post does not engage in any discussion that could arise from Glynn's contribution but rather serves as a personal log of potentially useful quotations and points that I will investigate at a later stage.

Glynn's contribution provides a thorough overview of the treatment of polysemy in Cognitive Linguistics. Glynn's overall premise in relation to polysemy is:

to conserve the network model but to complement [it] with another method: a corpus-driven quantified and multifactorial method (p.76)

Further, Glynn points out that such a multifactorial method inevitably requires approaching polysemy in a non-theoretical fashion:

Such an approach employs a kind of componential analysis that identifies clusters of features across large numbers of speech events. In other words, rather than analyse the possible meanings of a lexeme, a polysemic network should 'fall out' from an analysis that identifies clusters of the cognitive-functional features of a lexeme's usage. These features do not in any way resemble those of the Structuralist componential analyses, since they are not based on a hypothetical semantic system, but describe instances of real language usage and are based upon encyclopaedic semantics of that language use in context (p.76)
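
To make the 'clusters of features' idea concrete, here is a minimal R sketch of what such a usage-event annotation might look like; the features, values and events are entirely invented, and on real data the clusters of a polysemic network would be expected to 'fall out' of the dendrogram:

    # One row per speech event, annotated for cognitive-functional features
    events <- data.frame(
      subj_animacy = c("animate", "animate", "inanimate", "inanimate", "animate"),
      verb_type    = c("dynamic", "dynamic", "stative", "stative", "stative"),
      clause_type  = c("main", "main", "subordinate", "main", "subordinate")
    )
    mm <- model.matrix(~ . - 1, data = events)  # binarise the feature annotations
    plot(hclust(dist(mm, method = "binary")))   # candidate sense clusters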

In relation to the syntagmatic and paradigmatic dimensions of polysemy, Glynn recognises that the interaction between the schematic and/or morpho-syntactic semantics and lexical semantics is yet to be established. Within a dichotomous CL context where 'one position is that syntactic semantics override lexical semantics' and the other position is that 'there exists a complex interaction between all the various semantic structures in all degrees of schematicity', Glynn makes the working assumption that

syntactic variation affects a polysemy network and that its effect cannot be satisfactorily predicted by positing meaning structure associated with grammatical forms and classes a priori. We must therefore account for this variable as an integral part of semantic description. (...) It means that for a given lemma, or root lexeme, there will be semantic variation depending on its syntagmatic context; in other words, its collocational, grammatical, and even tense or case variation will necessarily affect the meaning of the item (p.82)

In his approach to polysemy, Glynn treats each lexeme 'as an onomasiological field, or set of parasynonyms' (p.82).


Further reading:

Zelinsky-Wibbelt, C. (1986). An empirically based approach towards a system of semantic features. Proceedings of the 11th International Conference on Computational Linguistics 11:7-12

Thursday, 18 June 2009

Profile-based methodology for the comparison of language varieties

In this post, I would like to briefly point out the usefulness of 'profiling' methods for corpus-data investigation. For that purpose I refer specifically to a paper entitled Profile-based linguistic uniformity as a generic method for comparing language varieties (2003), authored by Dirk Speelman, Stefan Grondelaers and Dirk Geeraerts. The paper is inspired by studies in language varieties and by research methods currently used in dialectometry. For my own purposes, it is interesting to note that the authors make a case for the validity of a profile-based methodology for corpus-data investigation, as the annotation process of my data will include profiling occurrences of may, can and the lemma pouvoir.

In their paper, the authors present "the 'profile-based uniformity', a method designed to compare language varieties on the basis of a wide range of potentially heterogeneous linguistic variables" (abst.). The authors' aim is to show that profiling the investigated lexical items makes it possible to measure dissimilarities between language varieties on individual variables, which are ultimately summarised into global dissimilarities. Such a process allows language varieties to be clustered or charted via various multivariate techniques.

Unlike standard methods of corpus investigation based on raw frequency counts, the profile-based method is usage-based but adds another criterion: "the frequency of a word or a construction is not treated as an autonomous piece of information, but is always investigated in the context of a profile" (p.11)

Whereas standard approaches assume that mere frequency differences in a corpus suffice to identify differences between language varieties, the profile-based approach, according to the authors, presents two advantages: the avoidance of thematic bias and the avoidance of referential ambiguity.

For the purpose of my project, the authors' paper generally supports my methodological choice to semantically profile the occurrences of may, can and the lemma pouvoir as found in my data. However, in their case study (see paper, p.18) the authors choose to take an onomasiological perspective (i.e. to use a concept as a starting point, and then investigate which words are associated with that concept). My project, on the other hand, takes the opposite perspective, namely the semasiological approach, which in the first instance considers individual words and looks at the semantic information that may be associated with those words. Inevitably, such a difference in approaching the word/sense/concept interface leads to differing acceptations of the term 'profile', as the onomasiological and semasiological perspectives have different starting points. In that respect, the authors consider "[a] profile for a particular concept or linguistic function in a particular language variety [to be] the set of alternative linguistic means used to designate that concept or linguistic function in that language variety, together with their frequencies" (p.5)
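
On that definition, a profile is essentially a vector of relative frequencies, and varieties can be compared profile by profile. A hypothetical R sketch -- the concept, its alternative expressions, the counts, and the choice of city-block distance are all mine and not necessarily the authors' exact measure:

    # Invented counts for the alternative expressions of one concept in two varieties
    variety_A <- c(mobile = 60, cellphone = 30, handy = 10)
    variety_B <- c(mobile = 20, cellphone = 75, handy = 5)
    profile_A <- variety_A / sum(variety_A)  # the profile: relative frequencies
    profile_B <- variety_B / sum(variety_B)
    sum(abs(profile_A - profile_B)) / 2      # 0 = identical profiles, 1 = fully disjoint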

For the purpose of my project, the term profile necessarily needs to be defined at word level and needs to incorporate both sense and morpho-syntactic information. In that regard, the Behavioural Profile methodology proposed by Gries and Divjak in Quantitative approaches in usage-based cognitive semantics: myths, erroneous assumptions, and a proposal (in press) is an appropriate methodology for my project. Broadly, the BP methodology involves the identification of both semantic and morpho-syntactic features characteristic of the investigated lexical item, as found in the data. Ultimately, these identified features are used as linguistic variables and are investigated statistically. In the BP model, the identified features are referred to and processed as ID tags, each one of which contributes to the profiling of the lexical item under investigation.
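
A minimal sketch of what a Behavioural-Profile-style summary could look like in R; the ID tags, senses and rows are invented, and real BP studies use many more tags per occurrence:

    # One row per occurrence; each column after "sense" is an ID tag
    d <- data.frame(
      sense = c("ability", "ability", "permission", "permission", "possibility", "possibility"),
      tense = c("present", "past", "present", "present", "past", "present")
    )
    # BP-style table: percentage of each tense value within each sense
    round(prop.table(table(d$tense, d$sense), margin = 2) * 100, 1)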

To sum up, Speelman, Grondelaers and Geeraerts' paper provides me here not only with the opportunity to reflect on the notion of 'profiling' in the context of corpus-data investigation but also with the opportunity to consider the notion from the perspective of my own study.

Tuesday, 16 June 2009

Comparing exploratory statistical techniques for semantic descriptions

As Glynn, Geeraerts and Speelman state in Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics (paper presented at the 10th International Cognitive Linguistics Conference in Cracow in July 2007):

Current trends in the study of polysemy have focused on exploratory techniques such as Cluster Analysis and Correspondence Analysis. (abst.)

Broadly, exploratory techniques "identify and visualise patterns in the data". Such analysis "does not permit inferences about the language, only the sample, or dataset, investigated" (abst.)

On the occasion of the Quantitative Investigations in Theoretical Linguistics 3 event in Helsinki on June 3rd 2008, Dylan Glynn presented a comparison of both the Cluster and Correspondence Analysis statistical methods for the purpose of semantic description (Clusters and Correspondences. A comparison of two exploratory statistical techniques for semantic description) [the powerpoint presentation for this paper can be found here].

Over the past fifteen years, corpus-based research in the field of Cognitive Linguistics has produced a number of studies demonstrating the wide use of both statistical techniques. In his paper, Glynn compares the two techniques on the grounds of the quality/accuracy of their graphic representations of the data and the accuracy of the relative associations of variables as revealed in the data. The assessment of the accuracy of relative associations of variables for each statistical method is based on a regression analysis, which takes into consideration "the relationship between the mean value of a random variable and the corresponding values of one or more variables" (OED).

For the purpose of his investigation, Glynn carried out a case study examining the semantic structure of the lexeme annoy in comparison with hassle and bother in a large non-commercial corpus of English specified for the American vs. British English regional difference (for the purpose of that case study Glynn identified the working variables of morpho-syntax and Frame Semantic argument structure). Glynn points out that the Cluster Analysis and Multivariate Correspondence Analysis methods involve different types of graphic representation which, in turn, present a number of shortcomings:

One important difference between the two techniques is that Cluster Analysis is primarily designed to present its results in the form of dendrograms where Correspondence Analysis relies on scatter plots. The dendrograms of HCA offer clear representations of both the groupings of features and the relative degree of correlation of those features. (...) The principal shortcoming of this representation is that it gives the false impression that all the data falls into groups, where in fact this may not be the case. (...) The scatter plots of Correspondence Analysis, although at times difficult to interpret, offer a much more "analogue" representation of correlation. (...) [T]he representation of the plot is (...) much more approximative than the dendrogram. (p.2)

Through his case study, Glynn confirms the usefulness of both statistical methods as exploratory techniques. He also points out that both methods may prove unreliable in accurately processing complex multivariate data, and cautions analysts against using them for the specific purpose of confirmatory analysis. However, in the context of exploratory analysis, "the contrast in the result of the complicated analysis across the three lexemes [annoy, hassle and bother] suggests that MCA [Multivariate Correspondence Analysis] is better suited to a truly multivariate exploratory research" (p.2)
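
For concreteness, the following R sketch runs both techniques on one and the same (invented) feature-by-lexeme table, yielding the dendrogram and the correspondence-analysis plot whose merits are compared above:

    # Invented contingency table of annotated features by lexeme
    tab <- matrix(c(40, 5, 12, 8, 30, 25, 3, 10, 35), nrow = 3,
                  dimnames = list(feature = c("agentive", "negated", "passive"),
                                  lexeme  = c("annoy", "hassle", "bother")))
    plot(hclust(dist(t(tab))))    # HCA dendrogram over the three lexemes
    library(MASS)
    biplot(corresp(tab, nf = 2))  # correspondence-analysis scatter plot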

With regard to my project, Glynn's paper raises a couple of points:

i) the need to decide on the statistical nature of my overall project analysis -- exploratory, confirmatory, or perhaps both, possibly following a comparative format (?)

ii) the urgency of clearly identifying the number and the nature of the variables through which I intend to investigate my data sets, as these will influence the choice of statistical method -- at the exploratory stage at least.

Statistical techniques for an optimal treatment of polysemy

In a previous post, I introduced the work of Dylan Glynn, who is broadly concerned with developing methodology for corpus-data investigation. Glynn adheres to the Cognitive Linguistics/Semantics framework. Of interest here is a research project he carried out in collaboration with Dirk Geeraerts and Dirk Speelman, concerned with assessing the efficacy of two families of statistical techniques, namely exploratory vs. confirmatory techniques of statistical analysis. Glynn, Geeraerts and Speelman presented the results of their study at the 10th International Cognitive Linguistics Conference in Cracow in July 2007, in a paper entitled Testing the hypothesis. Confirmatory statistical techniques for multifactorial data in Cognitive Semantics [the abstract is accessible from page 11 of the link]. For the purpose of this post I can unfortunately only summarise the content of that paper based on its abstract. As I do not have access to the full paper, I am not in a position to critically assess the arguments proposed by Glynn, Geeraerts and Speelman.

According to the authors, the two main statistical techniques for corpus-data investigation currently in active use by Cognitive Linguists are i) exploratory techniques (i.e. Cluster Analysis, used in Gries 2006; Correspondence Analysis, used in Glynn forthcoming) and ii) confirmatory techniques (i.e. Linear Discriminant Analysis, used in Gries 2003 and Wulff 2004; Logistic Regression Analysis, used in Heylen 2005 and De Sutter et al. in press)

The authors define the aim of each technique as follows:

The goal of (...) exploratory statistics is to identify and visualize patterns in the data. These patterns are argued to represent patterns of usage (...). Exploratory statistics analysis does not permit inferences about the language, only the sample, or dataset, investigated. However, in confirmatory statistics, inference is made from the sample to the population. In other words, one claims that what is seen in the data is representative of the language generally. (abst.)


In the light of my own project, the authors' study is of particular relevance because it identifies polysemy, as an object of investigation, as requiring specific methodological attention:
Current trends in the study of polysemy have focused on exploratory techniques.
However,
[t]he importance of these techniques notwithstanding, the cognitive framework needs to deepen its use of quantitative research especially through the use of confirmatory multivariate statistics.
Further,

Within Cognitive Linguistics, [the Linear Discriminant Analysis technique and the Logistic Regression Analysis technique] have been successfully used to capture the various conceptual, formal, and extralinguistic factors that lead to the use of one construction over another. However, the study of polysemy differs at this point. Instead of examining the variables that affect the use of one parasynonymous form over another, we are examining the interaction of a range of formal variables (the lemma and its syntagmatic and inflectional variation), semantic variables, and extralinguistic variables, in the search of correlations across all of these. One possible multivariate technique for this type of data is Log-Linear Modelling. (abst.)

In the course of their study, the authors identified complex sets of correlations between formal and semantic variables through exploratory analyses, and then modelled these correlations using the Log-Linear Analysis technique.
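
As a sketch of what log-linear modelling of such data involves, here is a hypothetical R example using base R's loglin(); the table's dimensions, labels and (random) counts are invented:

    # Three-way count table: sense x form x variety
    set.seed(1)
    counts <- array(rpois(12, 20), dim = c(2, 3, 2),
                    dimnames = list(sense   = c("epistemic", "root"),
                                    form    = c("may", "might", "can"),
                                    variety = c("BrE", "AmE")))
    # Fit the model with all two-way interactions but no three-way term
    fit <- loglin(counts, margin = list(c(1, 2), c(1, 3), c(2, 3)))
    1 - pchisq(fit$lrt, fit$df)  # does the sense-form association vary by variety?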

At this point, Glynn, Geeraerts and Speelman's paper calls for a comparative study of specific polysemous lexical items contextualised in different language varieties and using, in turn, both the Cluster Analysis exploratory technique and the Log-Linear Modelling confirmatory technique. Such a study would contribute to the identification of a possible optimal statistical technique for the investigation of corpus data.