Descriptive statistics

Frequency counts

There are different ways to express how often your search term appears in a corpus (or several corpora).

A token is each individual occurrence of a word or construction.

The raw frequency is the number of tokens, i.e. how often a word or search string appears in total.

Often, it is necessary to normalise that frequency, especially when comparing data from two corpora that contain different numbers of words. The normalised frequency is the number of tokens per fixed amount of text (e.g. per 1 million words, abbreviated pmw).

Here’s how to calculate the normalised frequency (more specifically, per 1 million words):

  • raw frequency x 1,000,000 / number of words in corpus
  • e.g. a word appears 2534 times in a corpus that contains 53,870,201 words:
  • 2534 x 1,000,000 / 53,870,201 = 47.04 pmw

The same calculation works when normalising to a different amount of text, e.g. per 10,000 words or per 100,000 words:

  • raw frequency x amount of text (10,000; 100,000; etc.) / number of words in corpus

Sometimes, it’s easier to use this formula:

  • raw frequency / corpus size in millions of words
  • e.g. a word appears 184 times in a corpus that contains 6.9 million words:
  • 184 / 6.9 = 26.67 pmw
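The formulas above can be sketched in a few lines of Python. This is only an illustration: the function name normalised_frequency and its parameter names are made up for this example and do not come from any corpus tool.

```python
def normalised_frequency(raw_frequency, corpus_size, per=1_000_000):
    """Return the number of occurrences per `per` words of text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    return raw_frequency * per / corpus_size


# 2534 hits in a corpus of 53,870,201 words, per 1 million words
pmw = normalised_frequency(2534, 53_870_201)
print(f"{pmw:.2f} pmw")

# The same hits normalised per 10,000 words instead
per_10k = normalised_frequency(2534, 53_870_201, per=10_000)
print(f"{per_10k:.4f} per 10,000 words")

# Shortcut: divide by the corpus size expressed in millions of words
shortcut = 184 / 6.9
full = normalised_frequency(184, 6_900_000)
print(f"{shortcut:.2f} pmw, {full:.2f} pmw")
```

Both routes give the same value; the shortcut simply folds the factor of 1,000,000 into the corpus size.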

Type frequency refers to the number of different items that instantiate a construction (e.g. how many different adjectives can be found in “the Xer the better”).
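The difference between token and type frequency can be illustrated with a small sketch. The list of hits below is invented purely for illustration and is not taken from any corpus.

```python
from collections import Counter

# Hypothetical adjectives found in the X slot of "the Xer the better"
# (made-up data, not real corpus results)
slot_fillers = ["bigger", "sooner", "bigger", "faster", "sooner", "bigger"]

counts = Counter(slot_fillers)

token_frequency = sum(counts.values())  # every occurrence counts
type_frequency = len(counts)            # each distinct adjective counts once

print("tokens:", token_frequency)
print("types:", type_frequency)
```

Here the construction has many tokens but only a handful of types, because the same adjectives recur.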

Finally, relative frequency is the frequency of one variant relative to the other(s) (e.g. “not” is contracted to “n’t” X% of the time). To calculate it, first add up the raw frequencies of all variants; this total corresponds to 100%, so the total divided by 100 corresponds to 1%. Dividing each raw frequency by that 1% value gives its relative frequency in percent.

In the table below, for example, both variants (lighted up and lit up) together can be found 2513 times (100%) in COCA. Thus, 1% corresponds to 25.13.

Therefore: 56 / 25.13 = 2.23%, and: 2457 / 25.13 = 97.77%
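The same calculation can be written as a short Python sketch. The function name relative_frequencies is an assumption made for this example; it divides each count by the total directly, which is equivalent to dividing by the 1% value.

```python
def relative_frequencies(raw_counts):
    """Map each variant to its share (in percent) of all variants combined."""
    total = sum(raw_counts.values())
    if total == 0:
        raise ValueError("at least one variant must occur")
    return {variant: count / total * 100 for variant, count in raw_counts.items()}


# Raw frequencies of the two variants in COCA
coca = {"lighted up": 56, "lit up": 2457}
shares = relative_frequencies(coca)

for variant, share in shares.items():
    print(f"{variant}: {share:.2f}%")
```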

Often, multiple frequency counts are used alongside each other. In the following contingency table, for example, the frequencies of lighted up and lit up in COCA and the BNC are compared. For each corpus, the first column contains the raw frequency, the second column the normalised frequency (per 1 million words), and the third column the relative frequency.

             COCA                             BNC
             raw    normalised   relative     raw    normalised   relative
lighted up   56     0.1 pmw      2.23%        4      0.04 pmw     1.08%
lit up       2457   4.39 pmw     97.77%       367    3.67 pmw     98.92%
total        2513   4.49 pmw     100%         371    3.71 pmw     100%
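A table like this can be produced from the raw counts alone, as the following sketch suggests. Note that the corpus sizes are assumptions: they are not stated in the text and were chosen as round figures that are roughly consistent with the pmw values in the table. The names RAW, CORPUS_SIZE and summarise are likewise made up for this example.

```python
# Raw frequencies taken from the table
RAW = {
    "COCA": {"lighted up": 56, "lit up": 2457},
    "BNC": {"lighted up": 4, "lit up": 367},
}

# ASSUMPTION: approximate corpus sizes in words (not given in the text)
CORPUS_SIZE = {"COCA": 560_000_000, "BNC": 100_000_000}


def summarise(corpus):
    """Return {label: (raw, per-million, percent)} for one corpus."""
    counts = RAW[corpus]
    total = sum(counts.values())
    millions = CORPUS_SIZE[corpus] / 1_000_000
    rows = {}
    for variant, raw in counts.items():
        rows[variant] = (raw, raw / millions, raw / total * 100)
    rows["total"] = (total, total / millions, 100.0)
    return rows


for corpus in RAW:
    print(corpus)
    for label, (raw, pmw, share) in summarise(corpus).items():
        print(f"  {label:<11}{raw:>6}{pmw:>8.2f} pmw{share:>8.2f}%")
```

With exact corpus sizes in place of the rounded assumptions, the normalised column would match the published figures more closely.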

Continue reading here about inferential statistics.