Wednesday 25 February 2009

Bridging the conceptual and the contextual using evidence from corpus data

Zhuo and Gries (to appear) (Schematic meaning and pragmatic inference: The Mandarin adverbs 'hai', 'you', and 'zai') could contribute to the development of a method to investigate my data with a view to identifying the semantic and image-schematic profiling characteristics of Fr-Engl IL 'may' and 'can'.

In their paper, Zhuo and Gries are concerned with the relation between the abstract schematic meaning of specific lexical items and the variety of concrete contextual messages those lexical items give rise to. The authors use the case of the Mandarin adverbs 'hai', 'you' and 'zai', all loosely translating as 'again', to demonstrate that although the three lexical items belong to a common semantic system [the term is understood here as referring to a 'semantic notion'; the authors also use the term semantic substance to express the same idea], each individual lexical item refers to a specific facet of the semantic system it is a member of. Within specific systems, all members contrast semantically with one another:

"[W]e shall treat the three adverbs as signs in semantic opposition. We shall assign each word a schematic meaning as a salient component of a semantic system in which they contrast" (p.7)

Although schematic meanings can be contextually enriched (via idiosyncratic lexical input, encyclopaedic knowledge of a particular word or the human factor), they are considered by the authors as semantic values and are to be dissociated from contextual inference. Further, Zhuo and Gries recognise that "the human ability to utilize all kinds of knowledge including knowledge of language as well as world and cultural knowledge and the ability to pick up contextual cues in discourse" (p.8) is part of the contextual enrichment process of the schematic meaning of a particular lexical item.

One of the core issues addressed in the paper is that of semantic compatibility between a particular lexical item and its discourse environment. On the basis of discourse coherence and semantic compatibility, the authors predict and confirm that, due to their individual schematic meanings, the three lexical items 'hai', 'you' and 'zai' show different collocation preferences in discourse. This provides evidence that different lexical items from a common semantic system do profile, semantically, different facets of that system.

Methodologically, the authors investigated a small corpus made up of two subcorpora. Their data are multifactorial, based on 4 variables: CORPUS: narrative vs. non-narrative, TEMP_REF: non-past vs. past, ADVERB: 'hai' vs. 'you' vs. 'zai'.
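To picture what such a multifactorial data set looks like once annotated, here is a minimal sketch that cross-tabulates ADVERB against TEMP_REF. The observations are invented for illustration and are not the authors' data; `cross_tabulate` is a hypothetical helper name, and the paper itself may well use a different procedure.

```python
from collections import Counter

# Hypothetical annotated concordance lines: (CORPUS, TEMP_REF, ADVERB).
# These values are invented for illustration only.
observations = [
    ("narrative", "past", "you"),
    ("narrative", "past", "you"),
    ("narrative", "non-past", "hai"),
    ("non-narrative", "non-past", "zai"),
    ("non-narrative", "non-past", "zai"),
    ("non-narrative", "past", "hai"),
]

def cross_tabulate(rows, row_index, col_index):
    """Count how often each pair of factor levels co-occurs."""
    return Counter((row[row_index], row[col_index]) for row in rows)

# ADVERB is field 2, TEMP_REF is field 1.
adverb_by_temp_ref = cross_tabulate(observations, 2, 1)
for (adverb, temp_ref), count in sorted(adverb_by_temp_ref.items()):
    print(f"{adverb:>4} x {temp_ref:<8} {count}")
```

The same helper can be pointed at CORPUS (field 0) to see whether an adverb's distribution differs between narrative and non-narrative texts.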

In principle, Zhuo and Gries's study reminds me of a paper by Clausner and Croft (1999) that I briefly mentioned in this post. Although Clausner and Croft are concerned with image-schemas and Zhuo and Gries are concerned with schematic, yet linguistic, meaning, both studies share the idea of a general category including various contrasting members. Clausner and Croft (1999) make a case for image-schematic domains and they argue that image-schemas are a subtype of domain. They also argue that image-schematic domains show internal structure and that the image-schemas included within a specific image-schematic domain stand in various relationships and profile different aspects of the image-schematic domain they belong to. This parallel between the two studies raises the question of whether Zhuo and Gries's methodology (i.e. investigating discourse collocations as a way to differentiate members of a semantic system) could be applied at image-schema level.

Should Zhuo and Gries's methodology be applied to my project, one may speculate that:

Preferred collocation sets for 'can' and 'may' would generally allow for the identification of individual image-schemas. In the case of 'may' and 'can' as produced in native English, the preferred collocation sets would be expected to confirm Talmy's (1998) finding. In the case of 'pouvoir', the preferred collocation set would be expected to be in line with Achard's (1996) finding. Ultimately, the investigation of the preferred collocation sets for native English 'may'/'can' and native French 'pouvoir' would contribute to the identification of the image-schematic representation of 'may' and 'can' in Fr-Eng IL. In other words, this method could be useful in the identification of the profiling characteristics of IL 'may'/'can' at conceptual level [by 'conceptual level' I mean pre-linguistic level], in contrast with native English 'may'/'can' and native French 'pouvoir'.

So in sum, an analysis of the cross-linguistic collocation patterns of 'may', 'can' and 'pouvoir' in native and second language English and native French corpora could provide a way of bridging the contextual, the linguistic and the conceptual in bilingual mental meaning representation.
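As a rough idea of what a collocation search of this kind involves, here is a minimal sketch that counts the words occurring within a fixed window around a node word. The function name `collocates`, the window size and the sample sentence are all assumptions made for illustration; a real study would rely on a concordancer and on association measures rather than on raw counts.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count the words found within `window` tokens of each occurrence of `node`."""
    # Letters only; this deliberately simple tokeniser ignores digits and punctuation.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

# Invented sample sentence, not taken from any of the corpora.
sample = "One can see that it may be true and one can see why."
can_collocates = collocates(sample, "can")
print(can_collocates.most_common(3))
```

Running the same function over native English, learner English and native French texts (with 'pouvoir' as the node) would give three collocate lists to compare.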

Tuesday 24 February 2009

The point of variability assessment within and between corpora

Gries (2006) (Exploring variability within and between corpora: some methodological considerations) generally shows that the reliability of corpus findings depends on assessing corpus variability/homogeneity both within and between corpora. Variability/homogeneity assessments can be carried out via a range of statistical tests. These require the analyst to identify the parameters whose variability will be measured and to choose the level of granularity at which the corpus will be investigated. Ultimately, Gries's method aims to improve descriptive accuracy in corpus-based studies.

On the basis that "no corpora are alike, and that sometimes different results are reported for very similar corpora (or even the same corpus)" (abstract), Gries addresses three core issues:

i) "how to identify and quantify the degree of variation coming with one's results" (abst.)
ii) "how to investigate the source of the observed variation in corpora" (abst.)
iii) "how homogeneous one's corpus is with respect to a particular phenomenon" (abst.)

Gries recognises that many quantitative studies limit themselves to reporting word frequencies, but he considers that such a methodology is not the most useful way to approach phenomenon X in corpus Y. He points out that those approaches are not sufficient and argues that statistical testing (more on the specific statistical tests relevant to my study in a later post) and subsequent interpretation of the summarised data are necessary to reach reliable corpus findings:

"This [methodological] choice seriously limits the range of applicability of these approaches [word frequency approaches]. First, an approach to corpus homogeneity based on word frequency is much more likely to produce biased results when applied to corpora containing text samples focusing on a particular topic." [A preliminary investigation of ICLE FR and LOCNESS has verified the point that, depending on the nature of the topics discussed in the corpus, 'may' and 'can' are used more or less frequently. Note: although the topics discussed in ICLE FR and in LOCNESS are similar, they are not systematically identical.]



Within and between corpora variability:

- Gries provides a brief literature review of the studies addressing both types of corpora variability.

- Gries provides statistical evidence that different corpus-based studies on the overall frequency of the present perfect in English yield different results. Such evidence suggests that i) a word frequency approach is not reliable and ii) alternative and more reliable ways to approach the corpus are needed.

- case studies are presented as detailed exemplifications of the above


Assessment of variability: limitations for ICLE FR, LOCNESS and CODIF:

Although Gries convincingly argues in favour of corpora variability assessment, in practice and in the context of learner corpora his methodology is not straightforward to apply and can only be applied partially. This makes it necessary to integrate the word frequency approach and the corpora variability approach into a single corpus investigation. By nature, corpus investigation in learner language involves the comparison of (at least) two sub data sets: one set made up of language produced by non-native speakers of language x in language y (the investigated corpus), and one set made up of language produced in language y by native speakers of language y (used as the control corpus). In terms of background information on the participants, the amount of information provided alongside the investigated and control data sets is uneven. ICLE FR, for instance, my investigated corpus, provides background information that would, in the context of a corpora variability assessment, allow me to identify a range of parameters such as: female/male writers, literary/argumentative texts, writing conditions (e.g. exam conditions, timed/untimed conditions, reference tools allowed/not allowed). LOCNESS and CODIF, on the other hand, as my control data sets, provide very limited background information on the participants. In the case of LOCNESS, the only identifiable working parameters are genre (i.e. literary/argumentative texts), individual essays/files and negation (although genre is not a very reliable parameter as literary texts are generally under-represented in the corpus). As for CODIF, although I have the information that the data set is directly comparable with ICLE FR and LOCNESS, I have no detailed information allowing me to identify workable parameters for a corpora variability assessment. So in sum, there are limitations to the application of Gries's methodology to the corpora I am specifically using.
Some degree of assessment can, however, be carried out:

- corpora variability between ICLE FR and LOCNESS:
  • possible parameters: individual essays, negation, genre
- corpora variability between the ICLE corpus and its French subsection ICLE FR:
  • possible parameters: male/female writers, genre, individual essays/files, writing conditions, negation
- corpora variability between ICLE FR and CODIF:
  • possible parameters: individual essays/files, negation
- corpora variability between LOCNESS and CODIF:
  • possible parameters: individual essays/files, negation
- corpora variability between ICLE FR, LOCNESS, CODIF:
  • possible parameter: individual essays/files
The above shows that in the case of ICLE FR, LOCNESS and CODIF corpora, the within corpus variability assessment would prove much more thorough and useful than the between corpora variability assessment (incl. ICLE FR vs LOCNESS, ICLE FR vs CODIF, LOCNESS vs CODIF). Ultimately and following Gries, we may speculate that results of the overall investigation could be affected by the limited applicability of his methodology to the project.
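Since "individual essays/files" is the one parameter available everywhere, a within-corpus assessment could start from per-essay relative frequencies and a measure of their spread. The sketch below is a deliberately simple stand-in for this idea, not Gries's own procedure; the essay names and figures are invented.

```python
from statistics import mean, stdev

# Hypothetical per-essay figures: (occurrences of 'can', essay length in words).
essays = {
    "essay_01": (4, 520),
    "essay_02": (9, 610),
    "essay_03": (1, 480),
    "essay_04": (6, 700),
}

def per_thousand(hits, words):
    """Relative frequency per 1,000 words, so essays of different lengths compare."""
    return 1000 * hits / words

rates = [per_thousand(hits, words) for hits, words in essays.values()]
average = mean(rates)
spread = stdev(rates)  # sample standard deviation across essays
coefficient_of_variation = spread / average

print(f"mean rate per 1,000 words: {average:.2f}")
print(f"standard deviation:        {spread:.2f}")
print(f"coefficient of variation:  {coefficient_of_variation:.2f}")
```

A high coefficient of variation would signal that an aggregated frequency for the whole corpus hides considerable differences between individual essays.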


More useful quote:

"One of the most important concepts within corpus linguistics is variability. Variability is a key issue on several levels, simultaneously. First, variability is always of prime importance when reporting one's results: without an indication of the variability found in one's data, the interpretation of, say, aggregated frequencies/percentages or measures of the central tendency of a single study is usually quite difficult, and the comparison of results between different studies is seriously impaired" (p.110)


Monday 23 February 2009

The corpus-based Behavioral Profile approach to cognitive semantics

Gries and Divjak (in press) (Quantitative approaches in usage-based cognitive semantics: myths, erroneous assumptions, and a proposal) generally argue in favour of quantitative corpus-linguistics methods in cognitive linguistics. At this stage of my project, Gries and Divjak's paper provides me with methodological tools to combine numbers (i.e. frequency of occurrences of 'may' and 'can') and word senses (i.e. frequency of occurrences of the various senses of 'may' and 'can'). Of particular interest is the attention that the authors pay to cases of polysemy and to cross-linguistic studies.

Gries and Divjak point out that "cognitive linguistics can only benefit from reducing the subjective element in its methods as much as is feasible" (p.4). For that purpose, the authors propose the Behavioral Profile approach (BP). Behavioral profiling of lexical items is based on distributional properties captured by percentages and "allows researchers to analyze the BP data using statistical techniques as well as to compare the results to data/results from other studies" (p.8).

The BP approach is based on two assumptions:

i) "corpus data provides (nothing but) distributional frequencies" (p.4)
ii) "distributional similarity reflects, or is indicative of, functional similarity" (p.4)

[functional similarity = any function of a particular expression, ranging from syntactic to discourse-pragmatic]

Methodological steps involved in the BP approach:

1) Retrieval of all instances of a word's lemma from a corpus in their context.

2) Semi-manual analysis of many properties of the use of the word forms (following Atkins (1987): morphological characteristics, syntactic characteristics, semantic characteristics). The identification of those features makes it possible to compile ID tags for the word forms.

3) Generation of a co-occurrence table that specifies how often, in percent, each ID tag level is attested with each sense of a polysemous word. The column containing the percentages for each sense is referred to as the sense's behavioral profile.
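Step 3 can be sketched as follows. The sense labels, ID tags and annotated lines are all invented for illustration, and `behavioral_profiles` is a hypothetical helper, not code from Gries and Divjak; the point is only to show how annotated concordance lines turn into one column of percentages per sense.

```python
from collections import Counter, defaultdict

# Hypothetical annotated concordance lines: one sense label plus two ID tags each.
annotated_lines = [
    {"sense": "permission", "subject": "animate", "clause": "main"},
    {"sense": "permission", "subject": "animate", "clause": "subordinate"},
    {"sense": "possibility", "subject": "inanimate", "clause": "main"},
    {"sense": "possibility", "subject": "animate", "clause": "main"},
    {"sense": "possibility", "subject": "inanimate", "clause": "main"},
    {"sense": "possibility", "subject": "inanimate", "clause": "subordinate"},
]

def behavioral_profiles(lines, id_tags):
    """Return, for each sense, the percentage of its lines showing each ID tag level."""
    counts = defaultdict(Counter)
    totals = Counter()
    for line in lines:
        sense = line["sense"]
        totals[sense] += 1
        for tag in id_tags:
            counts[sense][(tag, line[tag])] += 1
    return {
        sense: {level: 100 * n / totals[sense] for level, n in tag_counts.items()}
        for sense, tag_counts in counts.items()
    }

profiles = behavioral_profiles(annotated_lines, ["subject", "clause"])
for sense, profile in profiles.items():
    print(sense, profile)
```

Because every profile is expressed in percentages of the sense's own total, senses with very different raw frequencies can still be compared.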

Application of the BP approach to polysemy

Gries and Divjak show how the BP approach can assist in answering questions related to the phenomenon of POLYSEMY, such as the identification of prototypical senses of specific lexical items, the connection of a particular sense of a polysemous word to the network of already identified senses, and the usefulness of a cluster-analytic approach in the domain of POLYSEMY.


Application to cross-linguistic studies

This section is of particular interest to me because of the recent addition of the CODIF corpus to my data set. So a semantic study of French and English sub data sets will be carried out.

Gries and Divjak recognise that "[c]ross-linguistic semantic studies are notoriously difficult given that different languages carve up conceptual space(s) in different ways (cf. Janda, to appear for discussion); for that reason, linguistic dimensions are difficult to compare across languages" (p.7)

[what is meant here exactly by 'linguistic dimensions'?]

For Gries and Divjak, because the BP approach is based on operationalizable distributional properties, it can be applied to cross-linguistic studies: "concordance lines from different languages can be annotated for a number of common characteristics while at the same time doing justice to any individual languages characteristics and avoiding overly subjective intuitions regarding cross-linguistic semantic differences" (p.7)

It seems that the BP approach could provide a unified model to investigate the semantic domain of POSSIBILITY both cross-linguistically and via polysemous 'may', 'can' and 'pouvoir'.


References to check out:

Janda, Laura A. (to appear) What is the role of semantic maps in cognitive linguistics? In Piotr Stalmaszczyk and Wieslaw Oleksy (eds.). Festschrift for Barbara Lewandowska-Tomaszczyk.


More useful quotes:

"(...) the concordance lines of a particular search expression and the uses of the word and their frequencies constitute an objective database of the kind that made-up sentences do not since researchers cannot invent all uses of an expression in a corpus let alone their frequencies of occurrence" (p.3)

" (...) corpus-linguistics studies meaning in terms of use, which in turn is made tangible through distribution, and hence lends itself better to quantification." (p.4)

Friday 20 February 2009

ICLE and LOCNESS welcome CODIF -- the latest addition to the database

Finally coming back after a two-week immersion in the depth of ICLE and LOCNESS!

The quantitative investigation of the data started with a pilot study comparing the frequency of occurrences of 'may' and 'can' across ICLE and LOCNESS, including comparisons with the frequency of occurrences of the other central modals ('could', 'might', 'must', 'shall', 'should', 'will' and 'would') both in LOCNESS and in the other subsections of the ICLE corpus. In the latter case, the purpose of the investigation was to find out to what extent the use patterns of 'may' and 'can' in French-English IL reflect those observable in second language English in general. The results from the pilot study proved useful as it became clear that 'may' and 'can' play a role in the profiling of French-English interlanguage through different use patterns. The findings of the pilot study are now recorded in the form of a paper entitled Investigating the typicality of 'may' and 'can' in a corpus of learner English.

I am now at a stage where I am trying to zoom into my first general findings to see if there is anything striking there. In order to do that, I have had to laboriously count and record the occurrences of 'may' and 'can' in LOCNESS file by file, pretty much manually, by copying and pasting each file into Word and then finding each occurrence in the 324,304-word data set! An exercise that I only wanted to carry out once! So at that point, having made no decision about whether to consider 'may' and 'can' as individual modals or as lemmas -- which would then have included 'may not', 'cannot', 'can't' and (?)'can not' in the study -- all forms of the two modals were accounted for (the decision to include 'can not' as an acceptable spelling is still being debated). So far, these are the data sets that I am able to work from:


- LOCNESS: MAY and CAN (as featuring per essay)
- LOCNESS: MAY NOT, CANNOT, CAN'T (as featuring per essay)
- LOCNESS: MAY and CAN in argumentative and literary texts (as featuring per essay)

- ICLE FR: MAY and CAN (as featuring per essay)
- ICLE FR: MAY NOT, CANNOT, CAN'T (as featuring per essay)
- ICLE FR: MAY and CAN in argumentative texts (as featuring per essay)
- ICLE FR: MAY and CAN in literary texts (as featuring per essay)

- ICLE FR, ICLE (excl FR), LOCNESS: MAY, MAY NOT, CANNOT, CAN'T (as featuring generally across the three data sets -- this count does not include the distinction between individual files/essays)

- ICLE FR, ICLE (excl FR), LOCNESS: control variable AND

NB: Tables indicating occurrences of 'may' do not include cases of 'may not'. Cases of 'may not' are only included in relation to cases of 'cannot' and 'can't'. That makes it possible to treat negation as a variable and to investigate its interaction with modality.
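For the record, the file-by-file count could be automated along the following lines. This is only a sketch under my own assumptions: the pattern set and the name `count_forms` are hypothetical, matching is case-insensitive (so the month 'May' would still need a manual check), and 'can not' is counted separately since its status is undecided. In line with the NB above, plain 'may' and 'can' exclude their negated forms.

```python
import re

# One pattern per form. Plain 'may'/'can' exclude the negated variants.
PATTERNS = {
    "may": re.compile(r"\bmay\b(?!\s+not\b)", re.IGNORECASE),
    "can": re.compile(r"\bcan\b(?!['’]t)(?!\s+not\b)", re.IGNORECASE),
    "may not": re.compile(r"\bmay\s+not\b", re.IGNORECASE),
    "cannot": re.compile(r"\bcannot\b", re.IGNORECASE),
    "can't": re.compile(r"\bcan['’]t\b", re.IGNORECASE),
    "can not": re.compile(r"\bcan\s+not\b", re.IGNORECASE),
}

def count_forms(text):
    """Return the number of occurrences of each modal form in one essay."""
    return {form: len(pattern.findall(text)) for form, pattern in PATTERNS.items()}

# Invented sample text standing in for the content of one essay file.
sample_essay = (
    "You can go. You cannot stay. She can't come, and he can not either. "
    "It may rain but it may not snow."
)
counts = count_forms(sample_essay)
print(counts)
```

Applied to each essay file in turn, this would give the per-essay tables listed above in a single pass.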


Recently, the issue of the usefulness of a native French comparison data set was raised in discussion. Such a data set would be particularly helpful at the qualitative stage of the data analysis process and would create opportunities for cross-linguistic collocation searches. That way, I would be able to identify what contextual features are generally lexicalised via 'pouvoir' and assess whether those features are also lexicalised via 'may' and 'can' in French-English IL. In other words, it would allow me to establish whether 'may'/'can' in Fr-English IL carry over some semantic features of 'pouvoir' and if so, to what extent. In order to carry out those collocation searches I was recently granted access to the COrpus de DIssertations Francaises (CODIF) database, which is a corpus of native French essay writing (dissertations written by French undergraduates at the University of Louvain, Belgium). The CODIF database was compiled by the Centre for English Corpus Linguistics (CECL) at the Universite Catholique de Louvain, Belgium. The data set counts around 100 000 words.

From the perspective of the Cognitive Semantics framework, a three-way database (ICLE FR, LOCNESS and CODIF) allows for an investigation of the conceptual domains recruited by 'may', 'can' and 'pouvoir'. As members of the same semantic domain (i.e. POSSIBILITY), do the three modals recruit the same conceptual domains/frames? What is the nature of the relation between those domains? Does the nature of those relations vary cross-linguistically?

Sunday 8 February 2009

ICLE and LOCNESS: words and figures

A little bit about the data I am using for my project:

The data is drawn from two corpora: the International Corpus of Learner English (ICLE) and the Louvain Corpus of Native English Essays (LOCNESS).

ICLE is a corpus of written learner English including essays written by native speakers of Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish and Swedish. ICLE counts a total of 2,500,353 words distributed evenly across the eleven national subcorpora. My project focuses specifically on the French subcorpus (referred to here as ICLE FR), which counts a total of 228,081 words. The French subcorpus comprises a further two subcorpora: a subcorpus of argumentative texts -- counting 177,963 words -- and a subcorpus of literary texts -- counting 50,118 words. The French subcorpus comprises 347 essays averaging 500 words each. All participants in the ICLE corpus “are university undergraduates in English (usually in their third or fourth year)”, and “the proficiency level ranges from higher intermediate to advanced” (Granger et al., 2002).

LOCNESS is a corpus of native English essays comparable with ICLE (i.e. the participants are also university undergraduates, and the essays are of similar average length and deal with similar topics). LOCNESS counts a total of 324,304 words and comprises three subcorpora: a British pupils’ A level essays subcorpus of 60,209 words, a British university students’ essays subcorpus of 95,695 words and an American university essays subcorpus of 168,400 words. Like ICLE, LOCNESS also includes argumentative and literary texts.
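Because ICLE FR and LOCNESS differ in size, raw counts of 'may' and 'can' cannot be compared directly and need to be normalised. The sketch below uses the corpus sizes given above; the raw counts are invented placeholders, and the choice of a per-10,000-words base is simply an assumption on my part.

```python
# Corpus sizes as reported above.
CORPUS_SIZES = {"ICLE FR": 228_081, "LOCNESS": 324_304}

# Consistency check: the subcorpus figures should add up to the totals.
assert 177_963 + 50_118 == CORPUS_SIZES["ICLE FR"]
assert 60_209 + 95_695 + 168_400 == CORPUS_SIZES["LOCNESS"]

def per_10k(raw_count, corpus_size):
    """Normalised frequency: occurrences per 10,000 words."""
    return 10_000 * raw_count / corpus_size

# Hypothetical raw counts, for illustration only (not my actual results).
hypothetical_counts = {"ICLE FR": 1000, "LOCNESS": 1000}

normalised = {
    corpus: per_10k(hypothetical_counts[corpus], size)
    for corpus, size in CORPUS_SIZES.items()
}
for corpus, rate in normalised.items():
    print(f"{corpus}: {rate:.1f} per 10,000 words")
```

With identical raw counts, the smaller corpus ends up with the higher normalised rate, which is exactly the distortion that comparing raw counts would hide.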

What are the figures telling us so far?:

Early results of quantitative analyses of ICLE(FR) and LOCNESS have allowed me to establish that the patterns of uses of may and can in the French subsection of ICLE do play a role in the profiling of French-English IL. With the help of my kind friend B, statistics expert, I am now planning to continue to approach the data quantitatively and to dig deeper into it by running a number of variance tests that should i) consolidate nicely the results I have so far and ii) provide a much sharper picture of the uses of may/can in ICLE FR. Results from the variance tests should be ready to be analysed by the end of this week, early next week max.


Soon, the data will be looked into qualitatively -- counting up occurrences of specific meanings of may/can instead of occurrences of the actual words (I've already started to think about manual searches of image-schemas in ICLE FR and LOCNESS). However, before I start the process, it might be useful to check out whether I can pick up a few tips from Adam Kilgarriff (http://www.kilgarriff.co.uk/). Particular papers he wrote that could be of interest to me:

"I don't believe in word senses" (1997). Computers and the Humanities 31: 91-113.
Reprinted in Practical Lexicography: a Reader. Fontenelle, editor. Oxford University Press. 2008.
Reprinted in Polysemy: Flexible patterns of meaning in language and mind Nerlich, Todd, Herman and Clarke, editors. Walter de Gruyter. Pp 361-392.
To be reprinted in Readings in the Lexicon Pustejovsky and Wilks, editors. MIT Press.


Grammar is to meaning as the law is to good behaviour (2007) Corpus Linguistics and Linguistic Theory 3 (2): 195-198.

Comparing Corpora (2001) International Journal of Corpus Linguistics 6 (1): 1-37.
Reprinted in Corpus Linguistics: Critical Concepts in Linguistics. Teubert and Krishnamurthy, editors. Routledge. 2007.

How dominant is the commonest sense of a word? (2004) In: Text, Speech, Dialogue. Lecture Notes in Artificial Intelligence Vol. 3206. Sojka, Kopecek and Pala, Eds. Springer Verlag: 103-112.
Reprinted in Lexicology: Critical concepts in Linguistics Hanks, editor. Routledge, 2007

Busy week ahead!

Wednesday 4 February 2009

Getting started ...

A year and a half into my PhD project, this blog is long overdue! It will, I hope, help me keep track of my readings and ongoing thoughts, as well as help me remain focused and ultimately achieve a real sense of direction -- at last!

Here is a little bit of background for my research:

The specificity of my project lies in that it brings together the fields of Interlanguage and Cognitive semantics.

First, interlanguage:

Interlanguage, as defined by the OED, refers to ‘a linguistic system typically developed by a student before acquiring fluency in a foreign language, and containing elements of either his or her native tongue and of the target language’. So broadly, interlanguage could be considered as a sort of hybrid of two linguistic systems. Effectively, there are many types of interlanguage, depending on the native language of the speaker and his/her second language. My research focuses particularly on the French-English type of interlanguage where the speakers’ first language is French and their second language is English.

The case of Interlanguage is currently raising some interest in the fields of psycholinguistics and neuroscience as researchers are trying to identify the nature of the relations between L1 and L2 in the bilingual mind (e.g. Obler 1993, Snellings 2002, Finkbeiner, Almeida, Janssen and Caramazza 2006, Kovelman, Baker and Petitto 2008). Recent research in neurolinguistics (Kovelman, Baker and Petitto 2008) supports the existing view that “bilinguals have differentiated neural representations of their two languages” (p. 165). Further, another recent study concerned with lexical selection in bilingual speech production, Finkbeiner, Almeida, Janssen and Caramazza (2006), recognises the potentially complicated process of bilingual lexical access in which “concept selection serves to activate two lexical representations to an equal extent” (p. 1075). In other words, there is the possibility of interference between the bilingual’s two linguistic systems. This view is generally recognised in cross-linguistic investigations on interlanguage and second language (L2) knowledge organisation. However, the issue of cross-linguistic interference from a semantic perspective remains under-investigated.


My project sets out to investigate first language (L1) and L2 interferences using the cognitive semantics framework. So I am looking at possible interferences of L1 and L2 at conceptual level. Cognitive linguists, generally, are concerned with language use in relation to conceptual representation (conceptual structure (i.e. knowledge representation) and conceptualisation processes (i.e. meaning representation)), and they postulate that our bodily experiences contribute to the way we conceptualise the physical world.

On that basis, Image-Schemas have been recognised as one possible cognitive process that reinterprets sensory information as conceptual representation. So Image-Schemas are like analogue representations of perceptual states from which lexical meanings can derive, and they profile word meanings. Talmy (1981) argues that the meanings of the English modals (may, can, must, etc.) derive from the experiential domain of force dynamics which itself includes a number of Image-Schemas: compulsion, restraint, enablement, blockage, counterforce, attraction, resistance. The literature recognises MAY as referring to the Image-Schema of ‘removal of restraint’ and CAN as referring to that of ‘enablement’. The semantic domain of force dynamics is also applied to the French modal verb POUVOIR in Achard (1996). It is worth noting here that French doesn’t differentiate lexically between MAY and CAN. Both lexical forms are included under the umbrella of POUVOIR.


According to Lakoff (1987), Image-Schemas can be transformed, which means that shifts in the profiling of specific lexical items can take place and thus allow for semantic shifts to be observed. I here question whether those shifts (image-schematic and ultimately semantic shifts) can be equally observed in French-English interlanguage, on the basis of the cognitive economy principle. One way to start tackling the question is to carry out a quantitative analysis of the corpus to find out whether the schemas of ‘enablement’ and ‘restraint’ are activated with equal frequency by French learners of English and native English speakers.
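One simple way such a frequency comparison could be tested is a chi-square test of independence on a 2x2 table (group by schema). This is my own illustration rather than a method taken from the literature cited here; the counts are invented and no continuity correction is applied.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and schema were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = learners, native speakers;
# columns = 'enablement', 'restraint'.
hypothetical_table = [
    [60, 40],
    [45, 55],
]

statistic = chi_square(hypothetical_table)
CRITICAL_VALUE_DF1 = 3.841  # chi-square critical value for 1 degree of freedom, alpha = 0.05
print(f"chi-square = {statistic:.3f}")
print("difference significant at 0.05:", statistic > CRITICAL_VALUE_DF1)
```

A significant result would only say that the two groups distribute their uses differently across the two schemas; it would say nothing yet about why.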


Generally, within the Cognitive Linguistics framework it is assumed that the meanings of linguistic forms are understood relative to background/encyclopaedic knowledge. In other words, they are understood as part of a specific experiential domain. Clausner and Croft (1999) make a case for Image-Schematic domains and they argue that Image-Schemas are a subtype of domain. They also argue that Image-Schematic domains show internal structure and that the Image-Schemas included within a specific Image-Schematic domain stand in various relationships. Their argument leads to the speculation that i) POUVOIR, MAY and CAN are all included in the same experiential domain of force dynamics, ii) as separate lexical items, they profile word meanings in different ways and iii) as part of a common structured domain they stand in various relationships. This implies that, theoretically, image-schematic shifts could take place cross-linguistically, which in turn makes it possible to speculate that cross-linguistic semantic shifts take place at conceptual level.

At this point, a question would be: how do image-schematic shifts (i.e. transformations) take place? Clausner and Croft argue that Image-Schema transformations are the result of the mapping of one Image-Schema onto another (1999:23). A few months ago, as an experiment, I started exploring the idea of possible Image-Schema mappings between MAY, CAN and POUVOIR. Although the idea needs to be further investigated, early results seemed to point towards a possible metonymic relation between the Image-Schema profiled by POUVOIR and those profiled by MAY/CAN.


Now the corpus data really needs to be scrutinised quantitatively and qualitatively! More on that in the next post …