frequency: normalised vs raw

When analysing corpus data, we distinguish between different types of frequencies: raw, normalised, type, and relative frequency.

Raw frequencies are simply token counts: how often does a feature appear in a corpus? However, if frequencies are compared across two corpora that contain a different number of words, they have to be normalised: how often does a feature appear per 1 million words?

The chart display of the BYU corpora already calculates this number and you can add it in the list display by going to “options” and “display”, but here’s how to do it manually:

(number of occurrences × 1,000,000) / size of the corpus = normalised frequency (pmw, i.e. per million words)
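As a minimal sketch, the calculation above can be written as a small Python function. The function name and the counts are made up for illustration; they do not come from any real corpus.

```python
def normalised_frequency(occurrences, corpus_size, per=1_000_000):
    """Return how often a feature occurs per `per` words (default: per million)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    return occurrences * per / corpus_size


# Hypothetical counts: the same feature in two corpora of different sizes.
freq_a = normalised_frequency(250, 5_000_000)    # 250 hits in a 5-million-word corpus
freq_b = normalised_frequency(900, 30_000_000)   # 900 hits in a 30-million-word corpus

print(freq_a)  # 50.0 per million words
print(freq_b)  # 30.0 per million words
```

In this invented example, corpus B has more raw hits (900 vs 250), but the feature is actually more frequent in corpus A once corpus size is taken into account.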

The type frequency refers to the number of different types that instantiate a pattern (e.g. how many different adjectives are attested in the construction “the Xer the better”).
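The difference between token and type frequency can be illustrated with a short Python sketch. The list of slot fillers below is invented; in practice it would come from your own search results for the construction.

```python
from collections import Counter

# Hypothetical fillers of the X slot in "the Xer the better", one per corpus hit.
slot_fillers = ["sooner", "bigger", "sooner", "faster", "bigger", "sooner"]

counts = Counter(slot_fillers)

token_frequency = sum(counts.values())  # every hit counts: 6
type_frequency = len(counts)            # only distinct fillers count: 3

print(token_frequency, type_frequency)
```

Here the construction occurs six times (token frequency) but with only three different adjectives (type frequency).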

Finally, relative frequency describes the frequency of one variant relative to the other(s) (e.g. “not” is contracted to “n’t” X% of the time).
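A relative frequency of this kind is simply each variant's share of all variants taken together. The following Python sketch shows one way to compute it; the function name and the counts are assumptions for illustration only.

```python
def relative_frequencies(variant_counts):
    """Return each variant's share (in percent) of the total across all variants."""
    total = sum(variant_counts.values())
    if total == 0:
        raise ValueError("at least one variant must occur")
    return {variant: 100 * count / total for variant, count in variant_counts.items()}


# Hypothetical counts of contracted vs. full negation in the same context.
shares = relative_frequencies({"n't": 750, "not": 250})

print(shares)  # {"n't": 75.0, 'not': 25.0}
```

Because the shares are computed against the total number of variants rather than against corpus size, they can be compared across corpora without further normalisation.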

Also refer to the data analysis and statistics tab.