Below, you’ll find some exercises to get you started. Scroll further down for suggested search strings and solutions.

Please note: These exercises were written before COCA was updated in December 2017. Thus, your results might differ slightly from mine.

Exercise 1: neologisms

Which of these neologisms (or new words) can be attested in COCA?

(a) adorbs as a shortened form of adorable – in which file(s) does it appear?

(b) amazeballs instead of amazing – in which file(s) does it appear?

(c) Find all forms of binge-watch or binge watch (for excessively watching a series). Is the variant with or without the hyphen more common?

(d) Which forms of mansplain (for men condescendingly explaining something very obvious to women) appear in COCA, and how many of them in quotation marks?

Exercise 2: nerd vs. geek

(a) Which corpus could you use in order to find the first instances of nerd and geek, respectively? Hint: remember to check the context your search term appears in, especially if the first occurrences seem suspiciously early…

(b) Compare the collocates of nerd and geek in COCA. How could you execute each search? Which subtle differences in meaning do the results hint at?

Exercise 3: American vs. British English – lighted or lit up, collective nouns

(a) In American English, is it more common to say that someone lighted up (a cigarette, for example) or that someone lit up? How does this compare to British English?

(b) How could you investigate possible differences between British and American English in terms of how they treat their collective nouns (“the team is winning” or “the team are winning”)?

Exercise 4: bottle

Using COCA: Which strings could you use in order to find bottle used as (a) a verb and (b) a noun? Which words does the list view give you, respectively?

(c) Which usage seems to be more common?

(d) How could you modify the search in order to find all examples of the idiom to bottle up one’s emotions? How often does this expression appear, and in which form?

(e) Change the search string so that examples of the same idiom, but with any other noun instead of emotions are listed. What do you find?

Exercise 5: get-passive

Which search string could you use to find all (or at least many) instances of the get-passive (e.g. they got married)?

(a) Execute the search in COCA.

(b) In which section does this construction appear the most and in which the least often?

(c) What are the 10 most commonly collocated words?

Exercise 6: emotions

(a) In BNC, search for synonyms of happy. In which section do they appear the most and the least frequently? Give the raw and normalized frequencies for both sections.

(b) In which subsection of the BNC (i.e. interview, drama, meeting, university essay, etc.) are synonyms of happy most commonly found by raw frequency, in which by normalized frequency?

(c) Save all the synonyms of happy in a word list and label it happySynonyms. Execute a search for all lemmas in that list plus any subordinating conjunction in the KWIC display, ordering first by the words in the list and then by the words to the right. What do you notice when you look at sentences that contain content?

Exercise 7: English world-wide

Use GloWbE to determine in which varieties of English the following words or expressions are the most common. Can you figure out their meaning from the context?

  1. lah
  2. vex
  3. dunny
  4. tuque or toque
  5. lekker
  6. dinkum
  7. good on (pronoun), e.g. good on you
  8. sleveen
  9. speed money
  10. tai tai
  11. habitual forms with “to do”, e.g. I do be tired (as in: I am often/habitually tired)

Exercise 8: -ite and -ish

(a) How productive is the ending -ite? Hint: check how many hapax legomena that end in -ite you can find in BNC.

(b) How institutionalised is the ending -ish as in green-ish? Can it also be used as a word on its own?

Exercise 9: tracing semantic change

(a) In COHA, how could you establish if the meaning of a word has changed over time, and in which way it might have changed?

(b) Apply this method to nice, comparing data from 1810 – 1839 to data from 1980 – 2009.

Exercise 10: going to vs. gonna

(a) Using CORE, find out in which sections the non-contracted (going to [verb]) and in which sections the contracted form (gonna [verb]) is more commonly used.

(b) Utilise the KWIC display to order first by the word to the left and then the word to the right for each search string. Which sentences contain probably before gonna/going to?


Exercise 1: neologisms

(a) At first glance, adorbs doesn’t appear, but with a wildcard search adorbs*, one hit can be found with a punctuation mark (Adorbs.), in 2011, spoken, NBC_Today.

(b) Similarly for amazeballs, there’s one hit with and one without a punctuation mark (2012, spoken, NBC again).

(c) With the hyphen, 5 variants and 22 total hits can be attested. Without it, there are only 4 variants but 28 hits.

(d) Mansplain only features as mansplaining, with 4 hits in total, of which 2 are in quotation marks.

Exercise 2: nerd vs. geek

(a) A historical corpus such as COHA would work well. Watch out for typos that surface as nerd although a different word was intended. The example from 1933, for instance, reads “all we nerd is the hundred dollars”, which presumably should read “need”. The 1982 example, however, seems genuine: “The chief executiveis [sic] what the kids call a nerd.” As for geek, the 1914 example seems more like a name, but for 1957, we find the sentence “all you could do was keep smiling and thank God he wasn’t a geek”.

(b) Use the compare display in order to analyse collocates associated with nerd and geek. Enter your two words (as lemma searches) and the part of speech in the third box. Here, it would make sense to look for compound nouns, so you could enter _nn* and set the search to include words one or two to the left.

Again, thoroughly check your results to avoid misinterpreting them (for instance, there’s a late 90s TV show called “Freaks and Geeks” that appears in the results). It looks like the term nerd is more commonly used in connection with younger people (high-school and kids are associated more strongly with nerd than with geek). Furthermore, although they both have technical connotations (computer, science, and tech are terms that appear frequently for both), geeks don’t necessarily seem to have that connection but can instead be interested in a wide range of things and activities like beer, drama, wine, fantasy, theatre, film, sci-fi…

Exercise 3: American vs. British English – lighted or lit up, collective nouns

(a) A useful search string could look something like this: lighted|lit up (the | stands for “or”)

In COCA, lit up is far more common with 2226 as compared to 55 hits for lighted up. A similar picture emerges in the BNC with 367 as compared to 4 hits.

Because the corpora differ in size, however, those results have to be normalised. COCA encompasses 520 million words and the BNC 100 million. With the hit counts above, lit up thus appears 4.28 times per 1 million words in COCA (2226 / 520), whereas the value for lighted up is 0.11 (so fairly rare). In the BNC, lit up can be attested 3.67 times per 1 million words, and lighted up is very rare at 0.04 occurrences per 1 million words.
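The normalisation step can be sketched in a few lines of Python. This is only an illustration of the arithmetic, using the hit counts and corpus sizes quoted above; the function name per_million is made up for this example.

```python
def per_million(hits: int, corpus_size_words: int) -> float:
    """Return how often an item occurs per one million words."""
    return hits / corpus_size_words * 1_000_000


# Corpus sizes and raw hit counts as quoted in the text above.
COCA_SIZE = 520_000_000
BNC_SIZE = 100_000_000

raw_hits = {
    ("COCA", "lit up"): (2226, COCA_SIZE),
    ("COCA", "lighted up"): (55, COCA_SIZE),
    ("BNC", "lit up"): (367, BNC_SIZE),
    ("BNC", "lighted up"): (4, BNC_SIZE),
}

for (corpus, phrase), (hits, size) in raw_hits.items():
    print(f"{corpus:5}{phrase:12}{per_million(hits, size):.2f} per million words")
```

Dividing by the corpus size before comparing is what makes the COCA and BNC figures comparable at all; the raw counts alone mostly reflect that COCA is roughly five times larger.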

If we want to analyse the results statistically, an online chi-square calculator gives the following result for the raw hit counts:

The chi-square statistic is 2.6068. The p-value is .106403. This result is not significant at p < .05.

This shows that the difference between COCA and the BNC in the proportion of lighted up to lit up is not statistically significant.

(b) To avoid having to calculate the sums by hand, it’s advisable to search for singular and plural forms separately, so, for instance: team|police|group|class is|was and team|police|group|class are|were.

In COCA, the singular forms yield 15486 hits altogether and the plural forms 7272. Similarly, the singular forms are more frequent than the plural in the BNC, but by a much smaller margin (2708 as compared to 2496). A chi-square test gives the following result: χ² (1, N = 27962) = 476.89, p < 0.001. Thus, the differences are significant in this case.
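Both chi-square results can be checked without an online calculator. The following is a minimal sketch for 2×2 tables using only the Python standard library; the function name chi_square_2x2 is invented for this example. Judging from the numbers alone, the lighted/lit up statistic matches a plain Pearson test, whereas the collective-noun statistic matches a test with Yates' continuity correction. That the calculators were set up this way is an inference, not something stated in the exercise.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). A 2x2 table has one degree of freedom,
    for which the p-value equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                # Yates' continuity correction: shrink each deviation by 0.5.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: COCA, BNC. Columns: lit up, lighted up (raw hits from part a).
lit_stat, lit_p = chi_square_2x2(2226, 55, 367, 4)

# Rows: COCA, BNC. Columns: singular, plural verb (raw hits from part b).
coll_stat, coll_p = chi_square_2x2(15486, 7272, 2708, 2496, yates=True)

print(f"lighted/lit up:   chi2 = {lit_stat:.4f}, p = {lit_p:.4f}")
print(f"collective nouns: chi2 = {coll_stat:.2f}, p = {coll_p:.3g}")
```

Note that the test is run on raw counts, not on the normalised per-million figures, because the chi-square statistic depends on the actual number of observations.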

Exercise 4: bottle

(a) old syntax: [bottle]_v*
new syntax: BOTTLE_v*
The list display gives you: bottled, bottle, bottling, bottles

(b) old syntax: [bottle]_nn*
new syntax: BOTTLE_nn*
The list display shows: bottle, bottles

(c) Bottle as a verb can be attested a total of 861 times whereas as a noun, it appears a staggering 28832 times. Thus, it is far more common as a noun.

(d) The search string might look something like this: _p* [bottle] up _app* [emotion]

This means the system is looking for any pronoun _p* (like I, she, you), any realisational variants of bottle (the verb tag shouldn’t be necessary here), up, possessive pronouns _app* (like my, her, your), and any variant of emotion.

This search string finds one hit each for “who bottle up their emotions” and “who bottled up their emotions”. You could also replace the _p* by _n* to look for nouns.

(e) Using an excluding search, simply change [emotion] to -[emotion].

The two hits are: “You bottle up your fury” and “They bottle up their expressiveness”.

Exercise 5: get-passive

(a) Possible search string: _p* [get] _v?n*
= pronoun + all variants of get + past participle

(b) Using the chart display, it looks like the construction is most common in spoken language (67.57 hits per million words) and appears only rarely in academic registers (4.86 per million).

(c) Using the collocates display, keep the search string and select up to 4 positions on the right (the more you select, the slower the search will be and the likelier an error message is).

Most common words: married, up, caught, rid, paid, started, hit, done, elected, fired.

Again, you could repeat the search replacing _p* with _n*.
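To see what a collocate search with "up to 4 positions on the right" does in principle, here is a small sketch. It is not how the corpus interface is implemented; the three example sentences, the set GET_FORMS, and the function name right_collocates are all invented for illustration, and no part-of-speech tags are used.

```python
from collections import Counter

# Word forms of the lemma get (a hand-made list for this toy example).
GET_FORMS = {"get", "gets", "got", "gotten", "getting"}


def right_collocates(sentences, node_forms, span=4):
    """Count the words found up to `span` positions to the right of a node word."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in node_forms:
                # Slice the window that follows the node, staying inside the sentence.
                counts.update(tokens[i + 1 : i + 1 + span])
    return counts


# Invented toy sentences, not corpus data.
toy_sentences = [
    "They got married last year",
    "He finally got caught",
    "She got married too",
]

collocates = right_collocates(toy_sentences, GET_FORMS)
print(collocates.most_common(3))
```

A wider span catches more distant collocates but also more noise, which mirrors the trade-off in the corpus interface described above.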

Exercise 6: emotions

(a) Search string: [=happy] using a synonym search in the chart view

By normalised frequency, synonyms of happy are most common in the fiction section (647.99 pmw) and least common in the non-academic section (199.94 pmw). The raw frequencies are 4067 and 3298, respectively.

(b) No need to re-do the search, simply click on “see all sections at once” and use the headers to order by normalized frequency (# per million) or raw frequency (# tokens).

By raw frequency: W_fict_prose (10179 hits)
By normalized frequency: W_let_pers (1832.56 pmw)

(c) When executing the synonym search =happy, make sure to enable saving the results as a word list in the options, then tick all the boxes in the list display and name the list happySynonyms.

For a lemma search, type @HAPPYSYNONYMS and _cs* for subordinating conjunctions (or simply pick the right entry from the drop-down list).

Content also appears as a noun in the list, e.g. “a higher sulphur content”, showing that the synonym search doesn’t always take parts of speech into account and thus should be taken with a grain of salt.

Exercise 7: English world-wide

  1. lah: a discourse marker used mainly in Malaysia and Singapore
  2. vex (annoyed): frequent in Jamaica and Nigeria
  3. dunny (toilet): Australian and New Zealand English
  4. tuque|toque (beanie): Canada (and apparently also a type of monkey in Sri Lanka)
  5. lekker (nice, great – not necessarily referring to food!): South Africa and Ghana
  6. dinkum (genuine): Australian English
  7. good on (pronoun): mainly Australian and also New Zealand English
  8. sleveen (sly fellow): only in Ireland, 4 hits
  9. speed money (bribe): mainly used in India
  10. tai tai (rich, ostentatious wives of leisure): frequent in Hong Kong, Malaysia, Singapore
  11. habitual forms (search string: _p* [do] be _v*): Ireland

Exercise 8: -ite and -ish

(a) A high number of hapax legomena (words that appear only once in a corpus) indicates that a morpheme is highly productive, i.e. new words can be formed easily with it.

Using the search string *ite_nn*, 411 hapax legomena can be found, and although far from all of them are genuine examples, it can still be argued that the -ite ending is very productive. Note that you’ll have to increase the number of hits that are displayed (in the options tab).
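The idea of counting hapax legomena with a given ending can be illustrated on a toy example. The sketch below assumes a plain list of tokens without part-of-speech tags (unlike the _nn* search in the BNC); the sample sentence and the function name hapaxes_with_suffix are invented for this illustration.

```python
from collections import Counter


def hapaxes_with_suffix(tokens, suffix):
    """Return the word forms that occur exactly once and end in `suffix`."""
    counts = Counter(token.lower() for token in tokens)
    return sorted(
        word for word, n in counts.items() if n == 1 and word.endswith(suffix)
    )


# Invented toy text, not taken from the BNC.
toy_tokens = (
    "The Thatcherite minister met a Blairite and another Blairite "
    "on the site , quite near the site"
).split()

print(hapaxes_with_suffix(toy_tokens, "ite"))
```

In this toy text, the result includes quite, which ends in -ite by accident rather than by word formation. This is the same kind of false positive that makes it necessary to check the corpus hits by hand.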

(b) With the string *-ish, 114 hits in total are found in the BNC. There are also 26 hits for ish on its own; most of these are due to misspellings, but at least one genuine example can be attested:

“You get back from work about (pause) tenish?
(pause) (SP:PS1R9) (sigh) (SP:PS1R8) Ish?
(SP:PS1R9) (whispering) (unclear) half past ten.”

Exercise 9: tracing semantic change

(a) In the sections menu, you can select which years or decades are displayed. Additionally, use the collocates display to compare results from the two time periods you’ve chosen.

(b) The results will vary depending on the settings, but don’t come as much of a surprise for the 1980s to 2000s: nice guy, really nice, nice people, nice day… However, some hits from the 1810s to 1830s (like discrimination, distinction, questions, observation, calculation) don’t make much sense with today’s meaning but fit better with the outdated and now rare meaning of meticulous, scrupulous, and exact.

Exercise 10: going to vs. gonna

(a) search string: gon na _v?i* (for infinitives; note that gonna is split into gon na) and, analogously, going to _v?i* for the non-contracted form

For the non-contracted form, a wide distribution across most genres, except maybe the informational texts, can be found, but it appears most frequently in the interview and spoken sections. The contracted form is mainly attested in the lyrical section and TV/movie scripts; its overall distribution is very narrow and largely confined to the oral genres.

(b) I’m probably gonna be a loser after he uploads this
It’s probably going to involve a massive government database
You’re probably going to have to file bankruptcy
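The KWIC ordering used in (b), first by the word to the left and then by the word to the right, can be imitated offline. The sketch below works on the three sentences above plus one invented sentence without probably; the function names kwic_rows and sort_key are made up for this example, and the regular expression is only a rough stand-in for the two corpus queries.

```python
import re

NODE = re.compile(r"\b(?:gonna|going to)\b", flags=re.IGNORECASE)


def kwic_rows(sentences):
    """Return (left_words, node, right_words) for every hit of the node pattern."""
    rows = []
    for sentence in sentences:
        for match in NODE.finditer(sentence):
            left = sentence[: match.start()].split()
            right = sentence[match.end() :].split()
            rows.append((left, match.group(0), right))
    return rows


def sort_key(row):
    """Sort by the first word to the left, then by the first word to the right."""
    left, _, right = row
    first_left = left[-1].lower() if left else ""
    first_right = right[0].lower() if right else ""
    return (first_left, first_right)


sentences = [
    "I’m probably gonna be a loser after he uploads this",
    "It’s probably going to involve a massive government database",
    "You’re probably going to have to file bankruptcy",
    "We are going to leave soon",  # invented sentence for contrast
]

sorted_rows = sorted(kwic_rows(sentences), key=sort_key)
for left, node, right in sorted_rows:
    print(f"{' '.join(left):>20} [{node}] {' '.join(right)}")

# Keep only the hits where probably directly precedes the node.
with_probably = [row for row in sorted_rows if row[0] and row[0][-1].lower() == "probably"]
```

Sorting by the left-hand word groups all probably hits together, which is why this ordering makes them easy to spot in the KWIC display.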