Sunday 1 March 2009

A case for using R for the statistical computation of my data

As a usage-based study, my project involves quantitative data analysis. This post makes a brief case for choosing R as the statistical computing program for that analysis.

R is:
- a language and environment for statistical computing and graphics
- a program providing a variety of statistical and graphical techniques
- a free open-source program

The use of R is growing rapidly in statistics, engineering and science. An article in The New York Times (07/01/2009) gives an overview of how data analysts from different professional backgrounds use R.

In corpus linguistics, the use of R is steadily spreading, as it allows analysts to carry out multifactorial searches and approach data with fine degrees of granularity. Stefan Gries is actively contributing to the development of R and its application to corpus linguistics, and is the author of the recently published Quantitative Corpus Linguistics with R. As an open-source program, R is continually being improved and updated with new code. In that respect, Gries regularly provides linguists using R with updated code to download.

Generally, the use of R has been praised in the literature on the analysis of linguistic data. As Larson-Hall writes in her review of Baayen's (2008) Analyzing linguistic data: A practical introduction to statistics using R: "(...) the statistical program you use guides the way you think about statistical analysis, and I do think R is far superior to any menu-driven program in this way" (p. 472).

In the field of cognitive semantics, Dagmar Divjak and Stefan Gries (2008) ("Clusters in the mind? Converging evidence from near synonymy in Russian", The Mental Lexicon 3.2: 188-213) provide illustrations of the use of R. Further, in her CMLLP-2008 [Corpus Methods in Linguistics and Language Teaching] Masterclass material used at the University of Chicago, Dagmar Divjak suggests a procedure for approaching semantic issues with R, taking the semantics of be and have as a case study. The suggested methodology is as follows:

  1. Identify problem
  2. Come up with a list of variables
  3. Operationalize variables: ensure that each variable is assigned a unique value during the manual annotation process
  4. Annotate corpus extractions
  5. Is there a hypothesis?
     • no > exploratory analysis
     • yes > confirmatory analysis
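
To make steps 3 to 5 more concrete, here is a minimal sketch of how annotated extractions might be tabulated and then sent down either the exploratory or the confirmatory route. It is written in Python rather than R, purely as an illustration. The data, the variable names ("verb", "subject") and the choice of a chi-square statistic for the confirmatory branch are all my own assumptions, not part of Divjak's material.

```python
from collections import Counter

# Toy annotated extractions, invented for illustration only.
# Each extraction carries exactly one value per variable (step 3).
extractions = (
    [{"verb": "be", "subject": "animate"}] * 3
    + [{"verb": "be", "subject": "inanimate"}] * 1
    + [{"verb": "have", "subject": "animate"}] * 1
    + [{"verb": "have", "subject": "inanimate"}] * 3
)


def cross_tabulate(rows, row_var, col_var):
    """Count how often each pair of annotated values co-occurs."""
    return Counter((r[row_var], r[col_var]) for r in rows)


def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = Counter()
    col_totals = Counter()
    for (row, col), n in table.items():
        row_totals[row] += n
        col_totals[col] += n
    total = sum(table.values())
    stat = 0.0
    for row in row_totals:
        for col in col_totals:
            expected = row_totals[row] * col_totals[col] / total
            observed = table.get((row, col), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def analyse(rows, row_var, col_var, have_hypothesis):
    """Step 5: no hypothesis > explore the table; hypothesis > test it."""
    table = cross_tabulate(rows, row_var, col_var)
    if have_hypothesis:
        return "confirmatory", chi_square(table)
    return "exploratory", table
```

Calling analyse(extractions, "verb", "subject", have_hypothesis=False) simply returns the table of counts to inspect, while have_hypothesis=True returns a test statistic that would still need to be compared against a chi-square distribution. A real study would involve many more variables and a proper statistical package.
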
Considering all of the above, using R for my project would place my investigation methodologically in line with other recognised current studies. However, it should be noted that R is not considered straightforward to use. As Larson-Hall notes in her above-mentioned review:

"While I myself have become fairly familiar with R and think it is an excellent statistical program, I have to admit that there is something of a learning curve when it comes to using it for one's own data. (...) Although R is elegant and useful, I would not label it as an 'easy to learn' program (...)" (p. 472)

6 comments:

  1. You're going to be doing programming? In R? That's a brave move! A friend of mine writes a blog on programming for scientists. He wrote a brief introduction to R and a few pointers on where things may go awry for anyone new to the language:

    http://www.programming4scientists.com/2008/12/the-basics-ofr/

    Hope it's helpful!

  2. Let me just say again that this blog is an excellent idea. Every doctoral student should have one!

    Do you know anyone at Sussex who is using R?

  3. Thanks for the comments. I've contacted a couple of people in the area to find out if there are any R users close to Sussex.
    Leon, thanks for the link. It's nice to read more praise for the program.
    So what's next? My intention was to start learning the program via Gries's publication 'Quantitative Corpus Linguistics with R' but unfortunately, only the hardback copy is available in the UK at the moment! The paperback version won't be released before March 31st and then there will be more waiting time for the shipping. So it will be about mid-April before I can actually start learning R! Frustrating delay!

  4. Perhaps it would be wise to consult the library for any other guides to R. If you can get a feel for the language before applying it to research you can save yourself time by making mistakes (read: learning) now, rather than let them crop up when you come to applying R directly to your research.

  5. Following on from the last post, you can also ask the university library to buy texts. I worked in the same office as library acquisitions and the turnaround time from ordering to receiving was, from memory, very fast. I don't know the procedure well, since my department has a separate budget for purchasing journals/books, which is rarely used. At any rate, ask at the library help desk and they should be able to help.

  6. Spoke to my supervisor today; he had heard of R and was thinking of looking into it. After our conversation a week or so ago, I'm going to try and use it as well - will let you know how I get on!
