Introducing corpora

What’s a corpus?

In linguistics, a corpus (plural: corpora) is a collection of texts used to analyse language phenomena. Nowadays, corpora are usually stored as electronic databases and are easily searchable. They contain authentic, naturally occurring language rather than made-up examples, which is what makes them so useful for studying language. Most corpora available today are balanced or systematic, i.e. they cover a variety of genres, registers, or styles. They are also very large, often encompassing millions of words, which makes conclusions drawn from them fairly reliable. In short, a corpus is a systematic, computerised collection of authentic language that is used for linguistic analysis.

What’s corpus linguistics?

Following the above definition of a language corpus, corpus linguistics is the study of language with the help of the language samples that corpora contain, often using specialised software. Corpus linguistics should therefore be seen as a method for obtaining data and analysing it both qualitatively and quantitatively. It is neither a separate branch of linguistics (like sociolinguistics) nor a theory of language, but rather a tool for analysis. For example, historical corpora make it possible to track developments in language such as the emergence of the contracted forms gonna and wanna; corpora that contain sociodemographic information allow for an analysis of features such as the discourse marker like by gender or age; and corpora that contain data from several regions let their users compare how frequently words like boot and bonnet are used in British versus American English.
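To make the quantitative side more concrete, here is a minimal sketch in Python of how a word's frequency could be compared across two samples. It is only an illustration: the function name per_million and the two one-sentence samples are invented placeholders rather than real corpus data, and real corpus software tokenises text far more carefully. Frequencies are normalised per million words so that samples of different sizes can be compared.

```python
import re
from collections import Counter


def per_million(word, text):
    """Return how often `word` occurs per million tokens of `text`."""
    # Very rough tokenisation: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * 1_000_000


# Invented placeholder sentences standing in for regional corpus data.
british_sample = "He put the bags in the boot and checked under the bonnet."
american_sample = "She put the bags in the trunk and checked under the hood."

for word in ("boot", "bonnet", "trunk", "hood"):
    print(word,
          round(per_million(word, british_sample)),
          round(per_million(word, american_sample)))
```

With samples this small the numbers mean nothing, of course; the point is only that raw counts are divided by sample size before they are compared.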

Types of corpora

There are many types of corpora available, and the types are not mutually exclusive: a single corpus can fulfil multiple functions or belong to several categories. A corpus can be both synchronic and regional, for example.

General corpora contain a large number of words of both written and spoken language and data from different genres or text types. This data stems from a variety of people with different social backgrounds, regions, and ages. If this sociodemographic information is available, the corpora can also function as a sociolinguistic resource.
Examples: British National Corpus (BNC), Bank of English, Corpus of Contemporary American English (COCA)

Synchronic corpora offer language data from only one specific point in time.
Examples: Corpus of Contemporary American English (COCA), F-LOB and Frown

Historical/diachronic corpora usually span several decades and thus allow for an analysis of developments in language over time.
Examples: Corpus of Historical American English (COHA), ARCHER, Helsinki

Learner corpora collect data (for example written exams or essays) that has been produced by foreign language learners.
Examples: International Corpus of Learner English (ICLE), Cambridge Learner Corpus

Corpora of different varieties provide different regional varieties of one language so that dialectal variation can be studied.
Examples: Corpus of Global Web-Based English (GloWbE), International Corpus of English (ICE), Freiburg English Dialect Corpus (FRED)

Specialised corpora are also available and represent fairly clearly delimited subsets of language.
Examples: Michigan Corpus of Academic Spoken English (MICASE), British Academic Spoken English (BASE) and British Academic Written English (BAWE)

Refer to the Other corpora tab, Corpus-based Linguistics Links and the Corpus Resource Database (CoRD) for a longer list of corpora and a corpus finder. Since new corpora are constantly being compiled, however, no list can ever be exhaustive.

Choosing the right corpus for your research

A large corpus (1 million words or more) tends to give good results even if the feature you’re investigating is fairly rare. A small corpus (200,000 to 500,000 words), on the other hand, should preferably be used for frequent words or syntactic structures. Corpora of spoken language are usually comparatively small since their data must first be painstakingly transcribed.
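The link between corpus size and feature frequency can be sketched with some simple arithmetic. The Python snippet below is only an illustration: the function name expected_hits and the rate of 5 occurrences per million words are assumptions chosen for the example, not figures from any particular corpus.

```python
def expected_hits(rate_per_million, corpus_size):
    """Estimate how many occurrences of a feature a corpus should contain."""
    return rate_per_million * corpus_size / 1_000_000


# Hypothetical rare feature: about 5 occurrences per million words.
rare_feature_rate = 5

for size in (200_000, 500_000, 1_000_000, 100_000_000):
    print(f"{size:>11,} words: about {expected_hits(rare_feature_rate, size):g} hits")
```

Under these assumptions, a 200,000-word corpus would yield roughly one hit, which is far too little to draw conclusions from, whereas a 100-million-word corpus would yield around 500.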

Get to know the corpus you’re using: How large is it? Does it incorporate written or spoken language, or both? Which language variety/varieties does it encompass, and which genres or text types? Is it balanced?

Google is your friend! There might be a specialised corpus out there that’ll perfectly suit your needs.

Take care not to overgeneralise your results. For instance, if you worked with an American corpus and found that one feature is more commonly used than another, you can claim this only for the variety you researched, not for the English language as a whole. Similarly, if you’ve used a written corpus, you can’t assume that your results hold true for spoken language, and if you’ve researched a specialised register, such as academic language, it would be presumptuous to extend your findings to other registers.