Sunday, 8 February 2009

ICLE and LOCNESS: words and figures

A little bit about the data I am using for my project:

The data is drawn from two corpora: the International Corpus of Learner English (ICLE) and the Louvain Corpus of Native English Essays (LOCNESS).

ICLE is a corpus of written learner English made up of essays by native speakers of Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish and Swedish. ICLE totals 2,500,353 words, distributed roughly evenly across the eleven national subcorpora. My project focuses specifically on the French subcorpus (referred to here as ICLE FR), which totals 228,081 words. ICLE FR is itself divided into two subcorpora: argumentative texts (177,963 words) and literary texts (50,118 words). It comprises 347 essays averaging 500 words each. All participants in the ICLE corpus “are university undergraduates in English (usually in their third or fourth year)”, and “the proficiency level ranges from higher intermediate to advanced” (Granger et al., 2002).

LOCNESS is a corpus of native English essays comparable with ICLE (i.e. the participants are also university undergraduates, and the essays are of similar average length and deal with similar topics). LOCNESS totals 324,304 words and comprises three subcorpora: British pupils’ A-level essays (60,209 words), British university students’ essays (95,695 words) and American university students’ essays (168,400 words). Like ICLE, LOCNESS includes both argumentative and literary texts.
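As a quick sanity check on the figures above, here is a minimal Python sketch that adds up the subcorpus word counts and works out each subcorpus's share of its corpus. The word counts are the ones quoted in this post; the dictionary and function names are just my own labels, not part of ICLE or LOCNESS.

```python
# Word counts as quoted in the post; names are illustrative labels only.
ICLE_FR = {"argumentative": 177_963, "literary": 50_118}
LOCNESS = {
    "British A-level": 60_209,
    "British university": 95_695,
    "American university": 168_400,
}


def total(subcorpora):
    """Total word count of a corpus given its subcorpus sizes."""
    return sum(subcorpora.values())


def shares(subcorpora):
    """Percentage of the corpus contributed by each subcorpus (1 decimal)."""
    corpus_total = total(subcorpora)
    return {name: round(100 * n / corpus_total, 1) for name, n in subcorpora.items()}


icle_fr_total = total(ICLE_FR)
locness_total = total(LOCNESS)

# How much bigger LOCNESS is than ICLE FR, as a simple ratio.
size_ratio = locness_total / icle_fr_total

print("ICLE FR total:", icle_fr_total, shares(ICLE_FR))
print("LOCNESS total:", locness_total, shares(LOCNESS))
print("LOCNESS / ICLE FR size ratio:", round(size_ratio, 2))
```

The subcorpus counts do add up to the stated totals (228,081 and 324,304 words), and LOCNESS comes out roughly 1.4 times the size of ICLE FR, which matters as soon as raw counts from the two corpora are compared.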

What are the figures telling us so far?

Early results of quantitative analyses of ICLE FR and LOCNESS have allowed me to establish that the patterns of use of may and can in the French subsection of ICLE do play a role in the profiling of French-English IL. With the help of my kind friend B, a statistics expert, I am now planning to continue approaching the data quantitatively and to dig deeper into it by running a number of variance tests that should i) consolidate the results I have so far and ii) provide a much sharper picture of the uses of may/can in ICLE FR. Results from the variance tests should be ready to be analysed by the end of this week, or early next week at the latest.
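Since ICLE FR and LOCNESS differ in size, raw counts of may/can are not directly comparable, and a common first step is to convert them to a rate per fixed number of words. The sketch below shows that step only; it is not the variance testing mentioned above. The function name `per_10k` and the count of 100 are hypothetical placeholders, not results from my data.

```python
ICLE_FR_WORDS = 228_081   # corpus size quoted in this post
LOCNESS_WORDS = 324_304   # corpus size quoted in this post


def per_10k(count, corpus_size):
    """Normalised frequency: occurrences per 10,000 words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 10_000 / corpus_size


# PLACEHOLDER count, purely illustrative: the same raw count in both corpora.
placeholder_count = 100

rate_icle_fr = per_10k(placeholder_count, ICLE_FR_WORDS)
rate_locness = per_10k(placeholder_count, LOCNESS_WORDS)

print(f"ICLE FR: {rate_icle_fr:.2f} per 10,000 words")
print(f"LOCNESS: {rate_locness:.2f} per 10,000 words")
```

The same raw count gives a higher rate in the smaller corpus, which is why any comparison of may/can between the two corpora should use normalised figures rather than raw ones.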


Soon I will look at the data qualitatively, counting occurrences of specific meanings of may/can rather than occurrences of the words themselves (I've already started to think about manual searches of image-schemas in ICLE FR and LOCNESS). However, before I start the process, it might be useful to check whether I can pick up a few tips from Adam Kilgarriff (http://www.kilgarriff.co.uk/). Papers of his that could be of interest to me:

"I don't believe in word senses" (1997). Computers and the Humanities 31: 91-113.
Reprinted in Practical Lexicography: a Reader. Fontenelle, editor. Oxford University Press. 2008.
Reprinted in Polysemy: Flexible patterns of meaning in language and mind. Nerlich, Todd, Herman and Clarke, editors. Walter de Gruyter. Pp 361-392.
To be reprinted in Readings in the Lexicon. Pustejovsky and Wilks, editors. MIT Press.


Grammar is to meaning as the law is to good behaviour (2007) Corpus Linguistics and Linguistic Theory 3 (2): 195-198.

Comparing Corpora (2001) International Journal of Corpus Linguistics 6 (1): 1-37.
Reprinted in Corpus Linguistics: Critical Concepts in Linguistics. Teubert and Krishnamurthy, editors. Routledge. 2007.

How dominant is the commonest sense of a word? (2004) In: Text, Speech, Dialogue. Lecture Notes in Artificial Intelligence Vol. 3206. Sojka, Kopecek and Pala, editors. Springer Verlag: 103-112.
Reprinted in Lexicology: Critical concepts in Linguistics. Hanks, editor. Routledge. 2007.

Busy week ahead!

1 comment:

  1. How does the sampling work? How and what do you sample? Is the corpus considered the sample? Are they sampled in the same way?
