Friday, 27 March 2009

Image-Schema transformations and cross-linguistic polysemy: a matter of terminology

In her 2004 paper (Transformation on image schemas and cross-linguistic polysemy), Lena Ekberg is generally concerned with diachronic semantic change across different languages, and she argues that cross-linguistic semantic change is cognitively motivated. She recognises that "[m]odern research within the field of historical lexical semantics and grammaticalization in fact has provided arguments that meaning change is motivated by cognitive principles independent of specific languages" (p.42). Although Ekberg (2004) links with my project in that it takes a cross-linguistic approach to polysemous lexical items while trying to incorporate a Cognitive Semantics approach, it differs from my project in two major ways: i) it identifies specific semantic changes in specific languages and then compares those changes cross-linguistically; and ii) it considers semantic variance diachronically. My project, on the other hand, is concerned with cross-linguistic semantic change in terms of word senses in language x affecting the senses of corresponding words in language y. Further, my project is concerned with on-line cross-linguistic semantic interference and not with the development of word senses over time. Despite these differences, Ekberg (2004) is of interest to me because it raises a number of issues relating to terminology, methodology and theoretical framework.

Ekberg's overall stand on semantic change is stated in another paper (Construal operations in semantic change: the case of abstract nouns):

"The prerequisites of meaning variation of a lexeme are intrinsic in the underlying schematic structure as well as in the construal operations that may apply to that structure. Thus every instance of semantic change and variation - either resulting in polysemy or contextual meaning variation, is motivated by the possibilities of varying a given schematized structure by means of general and cognitively motivated construal operations" (p.63)

Further,

"[T]he processes generating semantic variation and change operate on the schematized structure underlying the lexical representation of a linguistic expression" (p. ).

Ekberg investigates cross-linguistic semantic change by considering and trying to bring together two theoretical approaches with different theoretical assumptions: the lexical semantics approach and the cognitive semantics approach. In her investigation of "the potential polysemy of lexemes based on a common schema" (p.25), Ekberg (2004) attempts to deal simultaneously with lexical patterns, conceptual processes and cognitive mechanisms. Overall, the paper highlights the limitations of such an inclusive methodology that ultimately relies on loose use of terminology.

Ekberg's (2004) working assumption is that:
  • "semantic structures at a certain level of abstraction, as well as the principles of meaning change, are universal devices for generating new lexical meaning variants" (p.26)
Ekberg (2004) claims that:
  • polysemy results from a process of image-schema transformation which itself results from a mental construal process
  • polysemy refers to meaning variants of the same lexeme which are related by means of image-schema transformations and which are regarded as separate senses, i.e. instantiations of polysemy
  • lexical meaning extensions reflecting transformations of image-schematic structure are cognitively motivated and thus arise cross-linguistically
  • image-schema transformations are motivated by mental construal processes

Issues raised:
  • Ekberg recognises the image schema transformation as a central process in the emergence of new senses. However, in the paper, the term image schema lacks a reliable working definition. The term is first defined on page 28, in the sense of Johnson (1987) as " a recurring dynamic pattern [...] that gives coherence and structure to our experience". The term is then later referred to on page 36 as being "the most abstract basis of lexical meaning", and on page 43 as an "underlying abstract semantic structure". In other words, throughout the paper, it is unclear whether the term refers to schematic representations of word senses or whether it refers to schematic representations of physical experiences. In the first case, the approach to cross-linguistic semantic change and polysemy is lexically based. In the second case, the approach is experientially based and therefore conceptual in nature (i.e. pre-linguistic). Distinguishing between the two cases is important because they both ultimately refer to different stages/levels in the construction of meaning. The author's attempt to bridge lexical matters (i.e. linguistic in nature) and conceptual matters (i.e. pre-linguistic in nature) creates a degree of confusion about the level of abstraction targeted in the discussion.
  • Similarly, the term cognitively motivated ("lexical meaning extensions reflecting transformations of image-schematic structure are cognitively motivated and thus arise cross-linguistically") calls for clarification. Assuming that lexical meaning extensions do reflect transformations of image-schematic structure (as understood in the CL framework), those meaning extensions are by definition cognitively motivated, and the phrase quoted above is redundant and therefore not useful. Alternatively, the term (in the context of the example) could be referring to a speaker's specific cognitive ability which could be applied to the process of lexical meaning extension. Under the term cognitive, it is unclear whether the author refers to a cognitive ability allowing speakers to extend lexical meanings in similar ways in different languages or to a conceptual process (i.e. image schema, as understood in the CL framework). Without a solid working definition of the term image schema, it is difficult to accept that polysemy results from a process of image-schema transformation. It is also difficult to recognise what exactly is being transformed in the process of meaning extension: the schematic representation of lexical meanings or the image schema as an analog representation of a physical experience.
Ekberg (2004) raises questions about the feasibility of bridging the lexical and the conceptual via the cognitive process of image schema. As far as my study is concerned, even though an overall CL approach to may/can in French-English IL will allow for an analysis of how the senses of may/can are represented in the French-English bilingual mind, the study may well be restricted to showing just that! Talmy, Sweetser and Johnson have investigated the English modals as linguistic tools referring to the image schema of Force Dynamics. Although I cannot ignore such studies, the question now is how they can be exploited empirically.

Monday, 23 March 2009

From corpus to clusters: Gries and Divjak's suggested methodology

In Behavioral profiles: a corpus approach to cognitive semantic analysis (to appear), Gries and Divjak propose a methodology that approaches polysemy empirically while following the Cognitive Linguistics (CL) framework. The authors' methodology is of interest for my project because I adopt an empirical approach, I follow the CL framework and the words I investigate (i.e. may, can and pouvoir) are all polysemous lexical items.


In their introduction, the authors:

i) review the treatment of polysemy in CL

ii) present existing issues behind the identification of the prototypical sense(s) of a word

iii) claim that a more sophisticated quantitative approach to corpus investigation would provide cognitive-linguistically relevant results.


Gries and Divjak describe their methodology as one that “is radically corpus-based because it relies on the correlation between distributional patterns and functional characteristics to a much larger extent than most previous cognitive-linguistic work” (p.60). The authors claim that their methodology “aims at providing the best of both worlds, i.e. a precise, quantitative corpus-based approach that yields cognitive-linguistically relevant results” (p.60).

Method:

Four-step method based on the concept of ID tags (cf. Atkins 1987) and the notion of Behavioral Profile (cf. Hanks 1996).

The method assumes that “the words or sense investigated are part of a network of words/senses”:

“In this network, elements which are similar to each other are connected in such a way that the strength of the connection reflects the likelihood that the elements display similar behavior with respect to phonological, syntactic, semantic or other type of linguistic behaviour” (p.61)

The four stages:

Stages 1-3 are concerned with data processing.

Stage 4 is concerned with meaningful data evaluation.

  1. The retrieval of all instances of a word’s lemma from a corpus
  2. A manual analysis of many properties of the word form (i.e. the annotation of the ID tags)
  3. The generation of a co-occurrence table
  4. The evaluation of the table by means of exploratory and other statistical techniques

Data processing:

Stage 1: use of a concordance program to retrieve all hits of a word's lemma

Stage 2: all hits are annotated for ID tags

The results of stage 2 are displayed in a co-occurrence table in which:

· each row contains one citation of the word in question

· each column contains an ID tag

· each cell contains the level of the ID tag for this citation

Stage 3: The co-occurrence table is turned into a frequency table: every row contains a level of an ID tag while every column contains a sense of the polysemous word. Each cell in the table provides the frequency with which that ID tag level occurs with that word sense.

[NB: to compare senses that occur at different frequencies, absolute frequencies need to be turned into relative frequencies (i.e. within ID tag percentages)]

Step 3 results in the Behavioral profile for a word sense: “each sense of a word (…) is characterized by one co-occurrence vector of within-ID tag relative frequencies” (p.63)
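
The move from annotated citations to behavioral profiles can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions, not Gries and Divjak's implementation: the citations, the ID tags (`clause_type`, `subject`), their levels and the sense labels are all invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical annotated citations (the output of stage 2): one dict per hit.
# The ID tags, their levels and the sense labels are invented.
citations = [
    {"sense": "permission", "clause_type": "main", "subject": "animate"},
    {"sense": "permission", "clause_type": "main", "subject": "animate"},
    {"sense": "permission", "clause_type": "subordinate", "subject": "animate"},
    {"sense": "possibility", "clause_type": "main", "subject": "inanimate"},
    {"sense": "possibility", "clause_type": "subordinate", "subject": "inanimate"},
    {"sense": "possibility", "clause_type": "subordinate", "subject": "animate"},
    {"sense": "possibility", "clause_type": "subordinate", "subject": "inanimate"},
]


def behavioral_profiles(rows, id_tags):
    """Return {sense: {(id_tag, level): within-ID-tag relative frequency}}."""
    counts = defaultdict(Counter)  # sense -> absolute frequency of each level
    for row in rows:
        for tag in id_tags:
            counts[row["sense"]][(tag, row[tag])] += 1

    # Every level seen anywhere, so that all vectors share the same dimensions.
    all_levels = sorted({key for counter in counts.values() for key in counter})

    profiles = {}
    for sense, counter in counts.items():
        totals = Counter()  # number of annotated hits per ID tag for this sense
        for (tag, _level), n in counter.items():
            totals[tag] += n
        profiles[sense] = {
            (tag, level): counter[(tag, level)] / totals[tag]
            for tag, level in all_levels
        }
    return profiles


profiles = behavioral_profiles(citations, ["clause_type", "subject"])
for sense, vector in profiles.items():
    print(sense, {f"{tag}={level}": round(v, 2) for (tag, level), v in vector.items()})
```

Because the percentages are computed within each ID tag, the levels of one ID tag sum to 1 for every sense, which is what makes a frequent sense and a rare sense comparable.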

Stage 4 of Gries and Divjak’s methodology evaluates the vector-based behavioral profiles identified in stage 3.

Data evaluation

The evaluation can be carried out using quantitative approaches (i.e. standardized statistical tests).

Gries and Divjak recognise two types of evaluations: monofactorial and multifactorial evaluations:

  • Monofactorial evaluation: looks at token frequency and type frequency. “A useful strategy to start with is identifying in one’s corpus the most frequent senses of the word(s) one is investigating” (p.64)

  • Multifactorial evaluation: The authors specifically focus on the exploratory technique of hierarchical agglomerative cluster analysis (HAC), a family of methods that aim at identifying and representing (dis)similarity relations between different items.

How to do a Hierarchical agglomerative cluster analysis:

i) Relative co-occurrence frequency table needs to be turned into a similarity/dissimilarity matrix (need to settle on a specific measure)

ii) Selection of an amalgamation strategy, i.e. the algorithm that defines how the elements to be clustered are joined together on the basis of the variables or ID tags they were inspected for (the most widely used amalgamation strategy is Ward’s rule)

iii) Results appear in the form of a hierarchical tree diagram representing distinguishable clusters with high within-cluster similarity and low between-cluster similarity
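
To make these three steps concrete, here is a rough sketch of agglomerative clustering with Ward's rule in plain Python. It rests on several assumptions of mine: the profile vectors are invented, squared Euclidean distance is chosen as the dissimilarity measure, and the naive search over all cluster pairs is only sensible for a handful of senses. In practice a dedicated routine (for instance `hclust` in R) would do this work and draw the tree diagram.

```python
from itertools import combinations

# Hypothetical behavioral-profile vectors (within-ID-tag relative frequencies).
sense_vectors = {
    "can_ability": [0.70, 0.30, 0.80, 0.20],
    "can_permission": [0.60, 0.40, 0.75, 0.25],
    "may_permission": [0.55, 0.45, 0.70, 0.30],
    "may_possibility": [0.20, 0.80, 0.30, 0.70],
    "pouvoir_possibility": [0.25, 0.75, 0.40, 0.60],
}


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    (size_a, centroid_a), (size_b, centroid_b) = a, b
    sq_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * sq_dist


def ward_clustering(vectors):
    """Return the merge history as (members_1, members_2, cost) tuples."""
    # Each cluster is keyed by its member labels and stores (size, centroid).
    clusters = {(label,): (1, list(vec)) for label, vec in vectors.items()}
    history = []
    while len(clusters) > 1:
        # Ward's rule: merge the pair whose fusion is cheapest.
        left, right = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        cost = ward_cost(clusters[left], clusters[right])
        (n1, c1), (n2, c2) = clusters.pop(left), clusters.pop(right)
        centroid = [(n1 * x + n2 * y) / (n1 + n2) for x, y in zip(c1, c2)]
        clusters[left + right] = (n1 + n2, centroid)
        history.append((left, right, round(cost, 4)))
    return history


history = ward_clustering(sense_vectors)
for left, right, cost in history:
    print(f"merge {left} + {right} at cost {cost}")
```

The merge history is the information a tree diagram displays: early, cheap merges are the tight clusters, and the cost of the final merges shows how far apart the top-level clusters are.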


Detailed analysis of the clustering solution

i) Assessment of the ‘cleanliness’ of the tree diagram

ii) Assessment of the clearest similarities emerging from the tree diagram

iii) Between-cluster differences can be assessed using t-values

NB: “the fact that a cluster analysis has grouped together particular sense/words does not necessarily imply that these senses or words are identical or even highly similar – it only shows that these sense/words are more similar to each other than they are to the rest of the senses/words investigated. By means of standardized z-scores, one can tease apart the difference between otherwise highly similar senses/words and shed light on what the internal structure of a cluster looks like” (p.67)

The authors' methodology and my project:

  • Can the authors' method lead to the identification of semantic clusters between the different senses of may, can and pouvoir?
  • If so, what semantic features characterise each cluster? Can between-cluster differences be identified?
  • How useful is the proposed methodology for the elaboration of a cross-linguistic semantic network of the senses of may, can and pouvoir?
  • How useful is the proposed methodology for both the identification of cross-linguistic between-cluster differences and the identification of within-cluster characteristics?
Overall, the exploration of the authors' proposed methodology using my data should prove a useful exercise because it provides the opportunity to investigate the mental semantic organisation of word senses at cross-linguistic level.

Friday, 6 March 2009

Approaching the data statistically: what to test, how and why?

At this point in the project, the investigation of the data is broadly anticipated to include two separate stages, each bearing different methodological assumptions. The first stage is purely quantitative in nature and follows a traditional trend in corpus linguistics to assess "the distribution of a single variable such as word frequency" (Oakes 1998). The literature refers to that type of approach as univariate. By adopting the traditional approach in the first stage of the investigation of the data, my aim is to provide a preliminary overview of the behaviour of may, can and pouvoir in all three subcorpora. However, although that stage will provide general patterns of use of the modals in the different subcorpora, the weight of the results gathered from frequency tests will need to be handled cautiously because of variability within and between corpora. The second stage of the data investigation process includes the computation of qualitative information such as word senses and contextual/pragmatic information. That stage is anticipated to consist mainly of cluster analyses. A description of that type of analysis and its implications for my study will be presented in a later post.

This post is only concerned with the first stage of investigation. I present an overview of the range of statistical tests available and that I judge suitable for word-frequency motivated investigations. I then show the relevance of those tests in the context of my data. The information presented below is drawn from Michael P. Oakes's Statistics for Corpus Linguistics.

As a first step into the quantitative stage, the central tendency of the data needs to be identified. The central tendency measure represents the data of a group of items as a single score, the most typical score for the data set (p.2). There are three possible measures of the central tendency of a data set: the median (the central score of the distribution, with half of the scores being above the median and the other half falling below), the mode (the most frequently obtained score in the data set) and the mean (the average of all scores in the data set). The mode is recognised to have the disadvantage of being easily affected by chance scores in smaller data sets. The disadvantage of the mean, on the other hand, is that it is affected by extreme values and might not be reliable in cases where the data is not normally distributed. In the context of my data, the mean is judged to be the most appropriate central tendency measure: a preliminary investigation of the frequency of the occurrences of may, can, may not, cannot and can't did not reveal cases of extremely low/high numbers of uses, and parametric tests (described below) assume that the mean is an appropriate measure of central tendency. The mean is also necessary for the calculation of z scores (a statistical measure of the closeness of an element to the mean value for all the elements in a group) and of the standard deviation (a measure which takes into account the distance of every data item from the mean).
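
As a quick illustration of these measures, here is a small sketch using Python's standard `statistics` module. The counts are invented (they stand for the number of occurrences of may in ten hypothetical essays), and the extra essay with 40 occurrences is added only to show how an extreme value pulls the mean but not the median.

```python
import statistics

# Invented counts of 'may' per essay in one hypothetical subcorpus.
may_per_essay = [2, 3, 3, 4, 5, 3, 6, 2, 4, 3]

mean = statistics.mean(may_per_essay)
median = statistics.median(may_per_essay)
mode = statistics.mode(may_per_essay)
sd = statistics.stdev(may_per_essay)  # sample standard deviation (n - 1)

print(f"mean={mean}, median={median}, mode={mode}, sd={sd:.2f}")

# One extreme essay: the mean shifts noticeably, the median does not.
with_outlier = may_per_essay + [40]
print(
    f"with outlier: mean={statistics.mean(with_outlier):.2f}, "
    f"median={statistics.median(with_outlier)}"
)
```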

Once the central tendency of individual data sets is identified, specific statistical tests will allow for the comparison of those data sets. Broadly, there are two types of tests: parametric tests and non-parametric tests. Parametric tests assume that: i) the data is normally distributed, ii) the mean and the standard deviation (described above) are appropriate measures of central tendency and dispersion, iii) observations are independent and scores assigned to one case must not bias the score given to any other. Non-parametric tests work with frequencies and rank-ordered scales and they do not depend on the population being normally distributed.

Generally, parametric tests are considered to be more powerful and are recommended to be the tests of choice if all the necessary assumptions apply.

Parametric tests:

t test: statistical significance test based on the difference between observed and expected results. In other words, the t test allows for the comparison of the means of two different data sets. In that way, the t test assesses the difference between two groups for normally distributed interval data where the mean and standard deviation are appropriate measures of central tendency and variability of the scores.

T tests are used rather than z score tests whenever the analyst is dealing with a small sample (i.e. where either group has fewer than 30 items). Once the standard deviation is calculated, the z score indicates how far from the mean a particular data item is located. A z score of +1 indicates one standard deviation above the mean. A z score of -1.5 indicates 1.5 standard deviations below the mean.
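
For two small samples, the t statistic can be worked out by hand. The sketch below assumes the classic pooled-variance (Student) form of the test for two independent samples, and the per-essay counts are invented; the resulting t value still has to be compared with the critical value for the relevant degrees of freedom in a t table.

```python
import math
import statistics


def pooled_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for two independent samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err
    return t, n_a + n_b - 2


# Invented counts of 'may' per essay in two hypothetical subcorpora.
icle_fr = [2, 3, 3, 4, 5, 3, 6, 2, 4, 3]
locness = [1, 2, 2, 3, 1, 2, 4, 2, 1, 2]

t, df = pooled_t(icle_fr, locness)
print(f"t = {t:.2f} with {df} degrees of freedom")
# For 18 degrees of freedom, the two-tailed critical value at the 0.05 level
# is 2.101, so a t of about 3 would count as a significant difference.
```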

In the context of my data, a t test would establish whether there is any significant statistical difference (i.e. certainty that a result is unlikely to be purely due to chance) between:

-the uses of may and can in ICLE FR and LOCNESS.
-the uses of may not, cannot and can't in ICLE FR and LOCNESS
-the uses of may and can in ICLE FR and LOCNESS, in argumentative texts
-the uses of may and can in ICLE FR and LOCNESS, in literary texts

Based on the calculation of the mean, calculating the standard deviation for the ICLE FR subcorpus would make it possible to identify how much of that data set departs from the mean and how much of it is typical of that data set. Further, a calculation of the z scores in ICLE FR would make it possible to identify the uses of may/can that are the most typical of native French speakers (those would be represented by the z scores closest to zero, i.e. by the scores closest to the mean) and the least typical uses (those would be represented by the z scores furthest from zero).
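
The z-score step can be sketched as follows. The writer identifiers and counts are invented, and ranking writers by the absolute value of their z score is my own reading of "most typical" and "least typical".

```python
import statistics


def z_scores(values):
    """Map each label to (value - mean) / sample standard deviation."""
    scores = list(values.values())
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return {label: (v - mean) / sd for label, v in values.items()}


# Invented counts of 'can' for six hypothetical ICLE FR writers.
can_per_writer = {"FR01": 2, "FR02": 3, "FR03": 5, "FR04": 4, "FR05": 9, "FR06": 1}

z = z_scores(can_per_writer)
# Sort from most typical (z closest to 0) to least typical.
ranked = sorted(z, key=lambda label: abs(z[label]))

for label in ranked:
    print(f"{label}: z = {z[label]:+.2f}")
```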

The calculations will also be useful because they will make it possible to establish whether there are statistically significant differences in the uses of may/can between individual native French speakers. Such information will ultimately be useful at the qualitative stage of the investigation, when examining the possible motivation for such differences at cognitive level.

Non-parametric tests:

In the above section, I pointed out the usefulness of parametric tests for the purpose of my study. However, it is worth noting that, as a non-parametric test, the Chi-Square test assesses the relationship between frequencies in a display table. That test allows for an estimation of whether the frequencies in a table differ significantly from each other. Oakes (1998) notes that when working with frequency data, the Chi-Square test is a good technique for modelling a two-variable table. In my study, the Chi-Square test could perhaps be used as an additional test to confirm the results obtained from the standard deviation calculations.
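
For a two-variable table such as modal by subcorpus, the statistic can be computed directly. The sketch below uses the plain Pearson form of the Chi-Square test without any continuity correction, which is my assumption, and the frequencies in the table are invented.

```python
# Invented frequency table: rows are modals, columns are subcorpora.
observed = {
    "may": {"ICLE FR": 30, "LOCNESS": 50},
    "can": {"ICLE FR": 70, "LOCNESS": 50},
}


def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    grand_total = sum(row_totals.values())

    chi2 = 0.0
    for r in rows:
        for c in cols:
            # Frequency expected if modal choice were independent of subcorpus.
            expected = row_totals[r] * col_totals[c] / grand_total
            chi2 += (table[r][c] - expected) ** 2 / expected
    return chi2, (len(rows) - 1) * (len(cols) - 1)


chi2, df = chi_square(observed)
print(f"chi-square = {chi2:.2f} with {df} degree(s) of freedom")
# With 1 degree of freedom, the critical value at the 0.05 level is 3.841.
```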

So what's next?:
  • calculate the mean of the uses of may/can in ICLE FR
  • calculate the mean of the uses of may/can in LOCNESS
  • calculate the mean of the uses of may not/cannot/can't in ICLE FR
  • calculate the mean of the uses of may not/cannot/can't in LOCNESS
  • calculate the standard deviation in all of the above
  • carry out a t test in all of the above
  • calculate the z scores in all of the above

Monday, 2 March 2009

The semantic map model raises an issue for the comparison of 'may', 'can' and 'pouvoir'

In this post, I briefly touch on the difficulty of carrying out cross-linguistic studies:

Gries and Divjak recognise that "[c]ross-linguistic semantic studies are notoriously difficult given that different languages carve up conceptual space(s) in different ways (cf. Janda, to appear for discussion); for that reason, linguistic dimensions are difficult to compare across languages" (p.7)

Here, I raise a methodological difficulty involved in the cross-linguistic comparison of 'may', 'can' and 'pouvoir' on the basis of a forthcoming paper by Laura Janda: What is the role of semantic maps in cognitive linguistics? (a PowerPoint version is also available).

In her paper, although Janda grants some degree of usefulness to the semantic map model (helps identify patterns across languages, helps visualise complex data), she, nevertheless, identifies the limitations of the model, particularly in the context of the cognitive linguistics analysis.

Broadly, semantic maps are designed to compare large numbers of languages. The semantic map model assumes that:

i) a single universal conceptual space exists
ii) the grammar of each language is the sum of the 'lines' drawn by that language across this single shared space
iii) all languages are based on the same parameters

The semantic map model implies a conceptual space, that is the "universal backdrop of possible distinctions that human beings can recognise (and might grammaticalise)" and a conceptual map, that is "a distribution of actual distinctions made by one or a number of languages across the parameters of conceptual space" (p.5)



Janda follows Langacker (2006) in her distinction between discrete and continuous linguistic models:

"I would like to frame this discussion of semantic maps in terms of Langacker’s (2006) concerns about continuity and discreteness in linguistic models. As Langacker points out, all models are metaphorical, and all metaphors are potentially misleading, particularly if one forgets that the metaphor may be suppressing some information, and/or if the metaphor is excessively discrete or continuous. Most phenomena, including linguistic phenomena, are complex enough to justify applying both discrete and continuous models in their interpretation (Langacker 2006:107). Imposing discreteness on a system means that grouping and reification facilitate the identification of units that would not be available in a continuous description, such as galaxies, archipelagos, villages, and discrete (yet related) languages. Continuity has the advantage of facilitating focus on the relationships among parts of a system, making it possible to identify fields of similarity that discreteness ignores, such as dialect continua and all manner of gradients. We have the option of choosing various models, some of which will be relatively discrete and some of which will be relatively continuous." (p.12)


In other words, semantic maps only show distances and are not semantically meaningful. Further, as the semantic map model focuses on the discrete points, it ignores the continuous zones and the relations between each point. Janda insists that in cross-linguistic studies, these characteristics are amplified. Further, according to Janda, semantic maps fail to capture in detail "differences in metaphor, construal and scalability, all of which are key to a cognitive analysis" (p.30)

Finally, Janda points out that the semantic map model fails to take into consideration the qualitative differences between languages. Indeed she notes that a concept can be expressed by a grammatical category in one language but be expressed lexically in another language (p. 21). That point is of particular relevance to my project because, in English, 'may' and 'can' are grammatical words and therefore belong to a closed word-class. French 'pouvoir', on the other hand, is a lexical verb which belongs to an open word-class and which takes on inflections. So 'may'/'can' and 'pouvoir' show different degrees of grammaticalisation. Such a difference in the lexicalisation of the semantic domain of POSSIBILITY raises the issue of a possible cross-linguistic lexico-grammatical continuum, which naturally contradicts the discrete quality of the semantic map model. Indeed, in her paper, Janda mainly uses the case of cross-linguistic polyfunctional grams to illustrate that there is no direct correlation between grams and concepts and that cross-linguistic studies will reveal overlaps between markers and what they express. Janda's cross-linguistic illustrations mainly include languages that share similar grams, and the discussion is centred around the various senses of those grams. In the case of 'may', 'can' and 'pouvoir', French and English are not comparable in that way. As mentioned above, in English the forms are fully grammaticalised whereas in French 'pouvoir' inflects. Janda's paper raises the issue of the comparability of the three modals and the necessity to identify clear comparison criteria.

So in sum, the semantic map model is not attractive for the purpose of my study because, being discrete by nature, it does not allow inferences about construal mechanisms. Further, it is mainly concerned with quantitative external differences and does not address qualitative properties.

Sunday, 1 March 2009

A case for using R for the statistical computation of my data

As a usage-based study, my project involves quantitative data analyses. This post makes a brief case for the use of R as the statistical computation program for the quantitative analyses of my data.

R is:
- a language and environment for statistical computing and graphics
- a program providing a variety of statistical and graphical techniques
- a free open-source program

The use of R is rapidly growing in the fields of statistics, engineering and science. This article from The New York Times (07/01/2009) provides an overview of the various uses of R by data analysts from differing professional backgrounds.

In corpus linguistics, the use of R is steadily spreading as it allows analysts to carry out multifactorial searches and approach data with fine degrees of granularity. Stefan Gries is actively contributing to the development of R and its application to the field of corpus linguistics and is the author of the recently published Quantitative Corpus Linguistics with R. As an open-source program, R is continually being improved and updated with new code. In that respect, Gries provides linguists using R with downloadable updated code on a regular basis.

Generally, the use of R has been praised in the literature concerned with the analysis of linguistic data. As Larson-Hall writes in her review of Baayen's (2008) Analyzing linguistic data: A practical introduction to statistics using R: "(...) the statistical program you use guides the way you think about statistical analysis, and I do think R is far superior to any menu-driven program in this way" (p.472).

In the field of cognitive semantics, Dagmar Divjak and Stefan Gries (2008) (Clusters in the mind? Converging evidence from near synonymy in Russian) (The Mental Lexicon 3.2:188-213) provide illustrations of the use of R. Further, in her CMLLP-2008 [Corpus Methods in Linguistics and Language Teaching] Masterclass material used at the University of Chicago, Dagmar Divjak provides a suggested procedure to approach semantic issues via the use of R. Divjak uses the semantics of be and have as a case study. The suggested methodology is as follows:

  1. Identify problem
  2. Come up with a list of variables
  3. Operationalize variables: ensure assigning unique value during manual annotation process
  4. Annotate corpus extractions
  5. Is there a hypothesis?
  • no > exploratory analysis
  • yes > confirmatory analysis
Considering all of the above, the use of R for the purpose of my project would methodologically place my investigation in line with other recognised current studies. However, it should be noted that the actual use of R is not recognised as straightforward. As Larson-Hall notes in her above-mentioned review:

"While I myself have become fairly familiar with R and think it is an excellent statistical program, I have to admit that there is something of a learning curve when it comes to using it for one's own data. (...) Although R is elegant and useful, I would not label it as an 'easy to learn' program (...)" (p. 472)