Tuesday 24 February 2009

The point of variability assessment within and between corpora

Gries (2006) (Exploring variability within and between corpora: some methodological considerations) argues that the reliability of corpus findings depends on assessing variability/homogeneity both within and between corpora. Such assessments can be carried out via a range of statistical tests. They require the analyst to identify the parameters whose variability will be measured and to choose the level of granularity at which the corpus will be investigated. Ultimately, Gries's method aims to improve descriptive accuracy in corpus-based studies.
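To make the ideas of "parameter" and "level of granularity" concrete, here is a minimal sketch in Python. It is my own illustration, not Gries's procedure: the file layout (dictionaries with hypothetical `name`, `genre` and `tokens` keys), the function names and the toy data are all assumptions. It computes the relative frequency of one word either file by file or pooled by a chosen parameter such as genre.

```python
from collections import defaultdict


def relative_frequency(tokens, target):
    """Occurrences of `target` per 1,000 tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return 1000 * sum(1 for t in tokens if t == target) / len(tokens)


def frequency_by_level(files, target, key=None):
    """Relative frequency of `target` at a chosen level of granularity.

    key=None treats every file as its own unit; otherwise files sharing
    the same value for `key` (e.g. "genre") are pooled before counting.
    """
    groups = defaultdict(list)
    for f in files:
        label = f["name"] if key is None else f[key]
        groups[label].extend(f["tokens"])
    return {label: relative_frequency(toks, target)
            for label, toks in groups.items()}


# Toy data, invented for illustration only (not taken from any real corpus).
toy_files = [
    {"name": "essay1", "genre": "argumentative",
     "tokens": ["may"] * 2 + ["x"] * 8},
    {"name": "essay2", "genre": "argumentative",
     "tokens": ["may"] * 1 + ["x"] * 9},
    {"name": "essay3", "genre": "literary",
     "tokens": ["x"] * 10},
]

per_file = frequency_by_level(toy_files, "may")
per_genre = frequency_by_level(toy_files, "may", key="genre")
print(per_file)
print(per_genre)
```

The same counts look quite different depending on the level chosen: pooling by genre hides the difference between the two argumentative essays, which is exactly the kind of information a single overall frequency would lose.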

On the basis that "no corpora are alike, and that sometimes different results are reported for very similar corpora (or even the same corpus)" (abstract), Gries addresses three core issues:

i) "how to identify and quantify the degree of variation coming with one's results" (abst.)
ii) "how to investigate the source of the observed variation in corpora" (abst.)
iii) "how homogeneous one's corpus is with respect to a particular phenomenon" (abst.)

Gries recognises that many quantitative studies limit themselves to reporting word frequencies, but he does not consider this the most useful way to approach phenomenon X in corpus Y. He points out that such approaches are not sufficient and argues that statistical testing (more on the specific statistical tests relevant to my study in a later post) and subsequent interpretation of the summarised data are necessary to reach reliable corpus findings:

"This [methodological] choice seriously limits the range of applicability of these approaches [word frequency approaches]. First, an approach to corpus homogeneity based on word frequency is much more likely to produce biased results when applied to corpora containing text samples focusing on a particular topic." [An initial investigation of ICLE FR and LOCNESS has confirmed that, depending on the nature of the topics discussed in the corpus, 'may' and 'can' are used more or less frequently. Note: although the topics discussed in ICLE FR and in LOCNESS are similar, they are not systematically identical.]



Within and between corpora variability:

- Gries provides a brief literature review of the studies addressing both types of corpora variability.

- He provides statistical evidence that different corpus-based studies of the overall frequency of the present perfect in English yield different results. Such evidence suggests that i) a word frequency approach is not reliable and ii) alternative and more reliable ways to approach the corpus are needed.

- Case studies are presented as detailed exemplifications of the above.


Assessment of variability: limitations for ICLE FR, LOCNESS and CODIF:

Although Gries convincingly argues in favour of corpora variability assessment, in practice, and in the context of learner corpora, his methodology is not straightforward to apply and can only be applied partially. This makes it necessary to integrate the word frequency approach and the corpora variability approach into a single corpus investigation.

Corpus investigation of learner language by nature involves the comparison of (at least) two data sets: one made up of language produced by non-native speakers of language x in language y (the investigated corpus), and one made up of language produced in language y by native speakers of language y (used as the control corpus). The amount of background information on the participants provided with the investigated and control data sets is uneven. ICLE FR, my investigated corpus, provides background information that would, in the context of a corpora variability assessment, allow me to identify a range of parameters such as: female/male writers, literary/argumentative texts, writing conditions (e.g. exam conditions, timed/untimed conditions, reference tools allowed/not allowed). LOCNESS and CODIF, my control data sets, provide very limited background information on the participants. In the case of LOCNESS, the only identifiable working parameters are genre (i.e. literary/argumentative texts), individual essays/files and negation (although genre is not a very reliable parameter, as literary texts are generally underrepresented in the corpus). As for CODIF, although I know that the data set is directly comparable with ICLE FR and LOCNESS, I have no detailed information allowing me to identify workable parameters for a corpora variability assessment.

In sum, there are limitations to the application of Gries's methodology to the corpora I am specifically using.
Some degree of assessment can, however, be carried out:

- corpora variability between ICLE FR and LOCNESS:
  • possible parameters: individual essays, negation, genre
- corpora variability between the ICLE corpus and its French subsection ICLE FR:
  • possible parameters: male/female writers, genre, individual essays/files, writing conditions, negation
- corpora variability between ICLE FR and CODIF:
  • possible parameters: individual essays/files, negation
- corpora variability between LOCNESS and CODIF:
  • possible parameters: individual essays/files, negation
- corpora variability between ICLE FR, LOCNESS, CODIF:
  • possible parameter: individual essays/files

The above shows that, in the case of the ICLE FR, LOCNESS and CODIF corpora, a within-corpus variability assessment would prove much more thorough and useful than a between-corpora variability assessment (incl. ICLE FR vs LOCNESS, ICLE FR vs CODIF, LOCNESS vs CODIF). Ultimately, and following Gries, we may speculate that the results of the overall investigation could be affected by the limited applicability of his methodology to the project.


Another useful quote:

"One of the most important concepts within corpus linguistics is variability. Variability is a key issue on several levels, simultaneously. First, variability is always of prime importance when reporting one's results: without an indication of the variability found in one's data, the interpretation of, say, aggregated frequencies/percentages or measures of the central tendency of a single study is usually quite difficult, and the comparison of results between different studies is seriously impaired" (p.110)
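As a rough illustration of what "an indication of the variability found in one's data" could look like next to an aggregated frequency, the sketch below reports the mean of per-file relative frequencies together with their standard deviation, coefficient of variation and range. This is only one simple option that I am assuming for illustration; it is not the set of measures Gries proposes, and the function name and input values are hypothetical.

```python
import statistics


def variability_summary(per_file_freqs):
    """Summarise per-file relative frequencies with simple spread measures.

    Expects at least two values, since the sample standard deviation
    is undefined for a single observation.
    """
    mean = statistics.mean(per_file_freqs)
    sd = statistics.stdev(per_file_freqs)  # sample standard deviation
    # Coefficient of variation: spread relative to the mean.
    cv = sd / mean if mean else float("nan")
    return {
        "mean": mean,
        "sd": sd,
        "cv": cv,
        "min": min(per_file_freqs),
        "max": max(per_file_freqs),
    }


# Invented per-file frequencies (occurrences per 1,000 words) for one word.
summary = variability_summary([2.0, 4.0, 6.0])
print(summary)
```

Two corpora could share the same mean and still differ widely in spread, so reporting the mean alone would make them look more alike than they are.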

