Friday 6 March 2009

Approaching the data statistically: what to test, how and why?

At this point in the project, the investigation of the data is expected to include two separate stages, each bearing different methodological assumptions. The first stage is purely quantitative in nature and follows a traditional trend in corpus linguistics to assess "the distribution of a single variable such as word frequency" (Oakes: 1998). The literature refers to that type of approach as univariate. By adopting the traditional approach in the first stage of the investigation, my aim is to provide a preliminary overview of the behaviour of may, can and pouvoir in all three subcorpora. However, although that stage will provide general patterns of use of the modals in the different subcorpora, the weight of the results gathered from frequency tests will need to be handled cautiously because of variability within and between corpora. The second stage of the investigation includes the computation of qualitative information such as word senses and contextual/pragmatic information. That stage is expected to consist mainly of cluster analyses. A description of that type of analysis and its implications for my study will be presented in a later post.

This post is only concerned with the first stage of investigation. I present an overview of the available statistical tests that I judge suitable for word-frequency motivated investigations. I then show the relevance of those tests in the context of my data. The information presented below is drawn from Michael P. Oakes's Statistics for Corpus Linguistics.

As a first step into the quantitative stage, the central tendency of the data needs to be identified. A measure of central tendency represents the data of a group of items in a single score, taken to be the most typical score for the data set (p.2). There are three possible measures of the central tendency of a data set: the median (the central score of the distribution, with half of the scores above it and the other half below), the mode (the most frequently obtained score in the data set) and the mean (the average of all scores in the data set). The mode has the recognised disadvantage of being easily affected by chance scores in smaller data sets. The disadvantage of the mean, on the other hand, is that it is affected by extreme values and might not be reliable in cases where the data is not normally distributed. In the context of my data, the mean is judged to be the most appropriate measure of central tendency, for two reasons: a preliminary investigation of the frequency of occurrence of may, can, may not, cannot and can't did not reveal cases of extremely low or high numbers of uses, and parametric tests (described below) assume that the mean is an appropriate measure of central tendency. The mean is also necessary for the calculation of z scores (a statistical measure of how close an element is to the mean value of all the elements in a group) and of the standard deviation (a measure which takes into account the distance of every data item from the mean).
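As a rough illustration of the measures described above, here is a minimal Python sketch using only the standard library. The per-essay counts are invented for the example and are not taken from ICLE FR or LOCNESS; the variable names are my own.

```python
import statistics

# Hypothetical number of uses of "may" in eight essays (illustrative only).
counts = [3, 5, 4, 6, 5, 2, 5, 4]

mean_count = statistics.mean(counts)      # average of all scores
median_count = statistics.median(counts)  # central score of the sorted data
mode_count = statistics.mode(counts)      # most frequently obtained score
sd_count = statistics.stdev(counts)       # sample standard deviation (n - 1)

print(mean_count, median_count, mode_count, round(sd_count, 2))
```

With a data set this small, the mode depends on just a handful of repeated values, which is the weakness noted above.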

Once the central tendency of individual data sets is identified, specific statistical tests will allow for the comparison of those data sets. Broadly, there are two types of tests: parametric tests and non-parametric tests. Parametric tests assume that: i) the data is normally distributed, ii) the mean and the standard deviation (described below) are appropriate measures of central tendency and dispersion, iii) observations are independent, so that the score assigned to one case does not bias the score given to any other. Non-parametric tests work with frequencies and rank-ordered scales and do not depend on the population being normally distributed.

Generally, parametric tests are considered to be more powerful and are the recommended choice if all the necessary assumptions hold.

Parametric tests:

t test: a statistical significance test based on the difference between observed and expected results. In other words, the t test allows for the comparison of the means of two different data sets. In that way, the t test assesses the difference between two groups for normally distributed interval data where the mean and standard deviation are appropriate measures of central tendency and variability of the scores.

T tests are used rather than z score tests whenever the analyst is dealing with a small sample (i.e. where either group has fewer than 30 items). Once the standard deviation is calculated, the z score indicates how far from the mean a particular data item is located. A z score of +1 indicates one standard deviation above the mean. A z score of -1.5 indicates 1.5 standard deviations below the mean.
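To make the z score concrete, here is a small Python sketch. The counts are invented so that the arithmetic comes out cleanly, and I use the population standard deviation (statistics.pstdev) purely for that reason; for a sample, statistics.stdev would be the more usual choice.

```python
import statistics

# Hypothetical per-essay counts of a modal (illustrative only).
counts = [2, 4, 4, 4, 5, 5, 7, 9]

mean_count = statistics.mean(counts)   # 5
sd_count = statistics.pstdev(counts)   # 2.0 (population SD, an assumption here)

# z score: distance of each item from the mean, in standard deviations.
z_scores = [(c - mean_count) / sd_count for c in counts]

for count, z in zip(counts, z_scores):
    print(count, z)
```

In this toy data set the essay with 2 uses has a z score of -1.5 (1.5 standard deviations below the mean), and the essay with 9 uses has a z score of +2.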

In the context of my data, a t test would establish whether there is any statistically significant difference (i.e. a difference that is unlikely to be purely due to chance) between:

-the uses of may and can in ICLE FR and LOCNESS
-the uses of may not, cannot and can't in ICLE FR and LOCNESS
-the uses of may and can in ICLE FR and LOCNESS, in argumentative texts
-the uses of may and can in ICLE FR and LOCNESS, in literary texts
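Any one of the comparisons listed above could be sketched along the following lines. This is only an illustration under stated assumptions: the per-essay counts are invented, the function name welch_t is my own, and I use Welch's version of the t test (which does not assume equal variances in the two groups) rather than the pooled-variance version. The sketch returns the t statistic and the degrees of freedom; the significance level would still have to be read from a t table or obtained from a statistics package.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error of each mean: sample variance divided by n.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical uses of "may" per essay in each subcorpus (illustrative only).
icle_fr = [1, 2, 3, 4, 5]
locness = [2, 3, 4, 5, 6]

t_stat, dof = welch_t(icle_fr, locness)
print(t_stat, dof)
```

Both groups here have far fewer than 30 items, which is the situation in which a t test rather than a z score test is appropriate.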

Based on the calculation of the mean, calculating the standard deviation for the ICLE FR subcorpus would make it possible to identify how much of that data set departs from the expected results and how much of it is typical of that data set. Further, a calculation of the z scores in ICLE FR would make it possible to identify the uses of may/can that are the most typical of native French speakers (those with z scores closest to the mean) and the least typical uses (those with z scores furthest from the mean).

The calculations will also be useful because they will make it possible to establish whether there are statistically significant differences in the uses of may/can between individual native French speakers. Such information will ultimately be useful at the qualitative stage of the investigation, when examining the possible motivation for such differences at the cognitive level.

Non-parametric tests:

In the above section, I pointed out the usefulness of parametric tests for the purpose of my study. However, it is worth noting that, as a non-parametric test, the Chi-Square test assesses the relationship between frequencies in a display table. That test allows for an estimation of whether the frequencies in a table differ significantly from each other. Oakes (1998) notes that when working with frequency data, the Chi-Square test is a good technique for modelling a two-variable table. In my study, the Chi-Square test could perhaps be used as an additional test to confirm the results obtained from the parametric tests.
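As a hedged illustration of how the Chi-Square test works on a two-variable table, here is a short Python sketch. The frequencies are invented, the function name chi_square is my own, and no continuity correction is applied. Expected frequencies are computed from the row and column totals, and the statistic sums the squared differences between observed and expected values, each divided by the expected value.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical frequencies (illustrative only).
# Rows: ICLE FR, LOCNESS. Columns: may, can.
observed = [[10, 20],
            [20, 10]]

chi2_stat, dof = chi_square(observed)
print(round(chi2_stat, 2), dof)
```

For a table with one degree of freedom, the critical value at the 0.05 level is 3.84, so a statistic above that value would suggest that the frequencies differ significantly.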

So what's next?
  • calculate the mean of the uses of may/can in ICLE FR
  • calculate the mean of the uses of may/can in LOCNESS
  • calculate the mean of the uses of may not/cannot/can't in ICLE FR
  • calculate the mean of the uses of may not/cannot/can't in LOCNESS
  • calculate the standard deviation in all of the above
  • carry out a t test in all of the above
  • calculate the z scores in all of the above

2 comments:

  1. This is fascinating stuff - I never really understood the point of the t test but you explain it so clearly :) that I totally get it now.

    I think we have a lot of general crossover in our research here, in that we're both trying to capture (and perhaps quantify) quite a subjective notion or theory - would you agree?

    Very useful post!
  2. I do agree, Anna, increasingly so! In fact, I've had a 'blend' type idea which I'd like to suggest to you! Could be a really exciting project. I'll be in touch! Thanks for the comment!