Inferential statistics

This page explains the basic principles of inferential statistics and tests that are relevant for corpus linguistic data. No programming knowledge is required. For an excellent introduction to analysing corpus data using R, a programming language for statistics, refer to the Statistics page at the University of Mannheim’s Anglistik Toolbox IV.


When we test for statistical significance, we determine whether the differences we found (e.g. between two varieties of English) are genuine or whether they are just due to chance. Another way of thinking about inferential statistics is that it establishes whether the independent variable influences the dependent variable. The dependent variable is what we measure or count. In corpus linguistics, it is usually a frequency count, but in other types of research, dependent variables can for example be reaction times or ratings on Likert scales from questionnaires. The independent variable, on the other hand, is a factor that (might) influence the dependent variable (for example region, register/situation, time frame, gender, age…). Thus, we say that the value of the dependent variable ‘depends’ on the value of the independent variable. Variables can have two or more variants. The variable “going-to future”, for instance, has the realisational variants “going to”, “gonna”, and even “ona”.

Significance tests calculate how likely results like ours would be if only chance were at work, and present this likelihood in the form of a p-value. The smaller this value, the less plausible it is that the differences came about randomly. In other words, those tests evaluate the effect of an independent variable on the dependent variable. They test the null hypothesis (there is no effect/no relationship between X and Y) against the alternative hypothesis (there is a relationship between X and Y). Strictly speaking, the p-value is the probability of obtaining results at least as extreme as the observed ones if the null hypothesis were true; it is not the probability that the null hypothesis itself is true.

The result is expressed as a p-value, a number between 0 and 1 that corresponds to 0%-100%. A p-value close to 1 means that the observed data are entirely in line with what the null hypothesis predicts. In that case, we have no evidence that the independent variable influences the dependent variable.

Therefore, the smaller the p-value, the better: the smaller the p-value, the lower the likelihood that the differences only came about randomly. If the p-value is small enough, we can reject the null hypothesis and conclude that the independent variable has an effect on the dependent variable.

The conventional cut-off points for significance are:

  • p < 0.05 (this is the most lenient threshold: with a p-value of 0.05 or higher, your results can’t be called statistically significant).
  • p < 0.01
  • p < 0.001

…although it is often recommended that you simply give the exact p-value, unless it is smaller than 0.001 (then just write p < 0.001). Sometimes, an asterisk rating system (*, **, ***) is used as an abbreviation for the levels of significance.
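For readers who do want to script their reporting, the conventions above can be sketched in a few lines of Python. This is only an illustration: the function names and the "n.s." label for non-significant results are assumptions made here, and the asterisk mapping follows the common convention of one star per cut-off.

```python
def format_p(p: float) -> str:
    """Report a p-value exactly, or as 'p < 0.001' when it is very small."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "p < 0.001"
    return f"p = {p:.3f}"


def stars(p: float) -> str:
    """Map a p-value to the asterisk rating (*, **, ***)."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "n.s."  # assumed label for 'not significant'


# Hypothetical p-values, chosen only to show each branch.
for p in (0.2, 0.032, 0.005, 0.0004):
    print(format_p(p), stars(p))
```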

Chi-squared test

To use the chi-squared test (χ2 test), you need the raw frequencies, not the normalised numbers (refer to the descriptive statistics page for an explanation of the different frequency counts). Also, the table needs to have at least 2 x 2 cells.

For each cell of the contingency table, an expected value is calculated. Those are the values that one would find if the null hypothesis were true. The test then calculates how much the actual values differ from the expected values and expresses the difference in the chi-squared value. The higher this value, the more the actual (or observed) frequency differs from the expected frequency. The p-value is derived from the chi-squared value.
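The calculation described above can be sketched in Python as follows. The expected value of a cell is its row total multiplied by its column total, divided by the grand total; the chi-squared value is the sum of (observed – expected)² / expected over all cells. The function names and the counts are invented for illustration, and no continuity correction is applied in this sketch.

```python
def expected_frequencies(table):
    """Expected cell values under the null hypothesis."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical raw frequencies: two variants (columns) in two corpora (rows).
observed = [[30, 10],
            [20, 40]]

print(expected_frequencies(observed))
print(chi_squared(observed))
```

The larger the gap between the observed and the expected table, the larger the value returned by chi_squared.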

The chi-squared test is an ‘omnibus test’: it returns one result for the whole table, not for individual rows or columns. However, each cell’s contribution to χ2 can be established by looking at the individual chi values, i.e. a cell’s deviation from its expected value; the cells with higher numbers contribute more to the overall chi-squared value.

You need one additional number here: the degrees of freedom (df). They are calculated like this: df = (number of columns – 1) x (number of rows – 1)
Thus, for a 2 x 2 table, df = 1. The more degrees of freedom a table has, the harder a significant result is to interpret, because it becomes less clear which cells are responsible for it (e.g. a table with 18 degrees of freedom would not be very useful).
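As a small sketch, the formula for the degrees of freedom can be written as a Python function. The function name is hypothetical, and the tables below contain placeholder zeros because only their shape matters here.

```python
def degrees_of_freedom(table):
    """df = (number of columns - 1) x (number of rows - 1)."""
    n_rows = len(table)
    n_cols = len(table[0])
    if n_rows < 2 or n_cols < 2:
        raise ValueError("the table needs at least 2 x 2 cells")
    return (n_cols - 1) * (n_rows - 1)


two_by_two = [[0, 0], [0, 0]]            # placeholder cells, shape only
four_by_seven = [[0] * 7 for _ in range(4)]

print(degrees_of_freedom(two_by_two))
print(degrees_of_freedom(four_by_seven))
```

A table with 4 rows and 7 columns is one way of arriving at the 18 degrees of freedom mentioned above.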

The results of the chi-squared test are often expressed like this: χ2 (df, N = sum) = chi-squared value, p-value. The sum stands for the grand total of the observed frequencies (all frequency counts added up).

The chi-squared test is sensitive to the size of the data: it becomes unreliable with very low numbers (<5) and overestimates effects with very high numbers (‘alpha error’, i.e. it might find an effect where no effect is actually present). For low numbers, Fisher’s exact test should be used instead, i.e. whenever one of the observed values is smaller than 5.
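For a 2 x 2 table, Fisher’s exact test can be sketched with Python’s standard library alone. The sketch below assumes one common definition of the two-sided p-value: with the row and column totals held fixed, it adds up the probabilities of all possible tables that are no more probable than the observed one. The function name and the counts are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test for a 2 x 2 table of raw counts."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)   # smallest possible top-left value
    highest = min(row1, col1)      # largest possible top-left value
    p_observed = prob(a)
    # The small relative tolerance guards against floating-point ties.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Hypothetical low counts, where the chi-squared test would be unreliable.
small_table = [[3, 1],
               [1, 3]]
print(fisher_exact_two_sided(small_table))
```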

Online calculators


For instance, the frequency of truck and lorry in American as compared to British English is an often-cited lexical difference between the two varieties, so let’s see if this claim stands up to scrutiny in the corpora.

                          truck    lorry
American English (COCA)   41075      300
British English (BNC)      1696     1948

Please note if replicating the search: a lemma search in combination with a noun PoS tag was used in each case.

From the raw frequencies, it’s obvious that truck is overwhelmingly preferred over lorry in COCA. In BNC, the picture is flipped on its head, although the differences between the two words are not as pronounced as they are in American English.

A chi-square test for this data gives the following result: χ2 (1, N = 45019) = 19619.14, p < 0.001

Here, 19619.14 is the chi-squared value. In the calculator’s output table, the numbers in round brackets are the expected cell totals (the numbers one would find if the null hypothesis were true) and the numbers in square brackets are the chi-square statistic for each cell. The chi-square statistic shows how much the actual number differs from the expected one, i.e. how much each cell contributes to the overall chi-square value. In this case, the cell that contributes most is lorry in British English.
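The truck/lorry figures can be checked with a short Python sketch. One assumption is needed: the source does not say how the calculator arrived at its value, but the reported 19619.14 is reproduced when Yates’ continuity correction (subtracting 0.5 from each absolute difference) is applied, so the sketch applies it by default. Without the correction the value comes out slightly higher, at roughly 19630. The function name is hypothetical.

```python
def chi_squared_2x2(table, yates=True):
    """Return the chi-squared value and each cell's contribution to it."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    contributions = []
    for i, row in enumerate(table):
        contributions.append([])
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Assumed continuity correction (see the note above).
                diff = max(0.0, diff - 0.5)
            contributions[i].append(diff ** 2 / expected)
    return sum(map(sum, contributions)), contributions


rows = ["American English (COCA)", "British English (BNC)"]
cols = ["truck", "lorry"]
counts = [[41075, 300],
          [1696, 1948]]

chi2, cells = chi_squared_2x2(counts)
print(round(chi2, 2))

# Find the cell that contributes most to the overall value.
largest = max(
    (value, rows[i], cols[j])
    for i, row in enumerate(cells)
    for j, value in enumerate(row)
)
print(largest[1], largest[2])
```

The largest contribution comes from the cell for lorry in British English, because its expected value is by far the smallest of the four.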

Log likelihood

This test allows you to compare the frequency of a search term across two corpora. Again, the raw frequency as well as the sizes of the two corpora need to be entered into a calculator such as the UCREL log-likelihood wizard. This test works similarly to a chi-square test but is commonly regarded as more reliable because it does not assume that the data is normally distributed. Furthermore, it’s not possible to use the chi-square test if you’re comparing the frequency of just one word or grammatical construction in two corpora (instead of two words/constructions in two corpora).
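What such a calculator computes can be sketched in Python. The sketch assumes the two-term formula commonly given for this kind of comparison: the expected frequency for each corpus is its share of the combined frequency, in proportion to corpus size, and LL = 2 x the sum of observed x ln(observed / expected). The function name and the counts are invented for illustration; they are not the GloWbE figures.

```python
from math import log


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood score for one search term in two corpora."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    score = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a term with zero hits adds nothing to the sum
            score += observed * log(observed / expected)
    return 2 * score


def direction(freq1, size1, freq2, size2):
    """'+' if the term is relatively more frequent in corpus 1, else '-'."""
    return "+" if freq1 / size1 > freq2 / size2 else "-"


# Hypothetical counts: 120 hits in corpus 1, 80 in corpus 2, 1m words each.
ll = log_likelihood(120, 1_000_000, 80, 1_000_000)
print(round(ll, 2), direction(120, 1_000_000, 80, 1_000_000))
```

Note that the corpus sizes matter as much as the raw frequencies: the same counts in corpora of very different sizes would give a different score.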


For instance, if we’d like to find out whether the word “bloody” (often used as a swearword) is more prominent in the British or the Irish portion of the GloWbE corpus and whether the results are statistically significant, we’d enter the absolute (raw) frequency along with the size of the corpora, like this:

(Corpus 1 stands for the UK and 2 for Ireland here).

The results look like this:

O1 and O2 are simply the frequency counts we entered and %1 and %2 the percentages. The plus indicates that the search term (in this case, “bloody”) is more frequent (or overrepresented) in corpus 1 (or British English) than in corpus 2 (Irish English). The reverse case would be indicated by a minus.

The log-likelihood score, or LL, reveals whether the differences are significant or not. Unlike with the p-value, the higher the LL, the better. An LL of 3.84 or higher corresponds to a p-value of < 0.05, or what is usually labelled as statistically significant.

Thus, the finding that “bloody” can be attested more frequently in the British section when compared to the Irish section is significant here, with LL=22.98.

LL and corresponding p-values

  • 5% level; p < 0.05; LL = 3.84
  • 1% level; p < 0.01; LL = 6.63
  • 0.1% level; p < 0.001; LL = 10.83
  • 0.01% level; p < 0.0001; LL = 15.13
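The correspondence in this list can be checked with a short Python sketch. It assumes that the LL score is compared against a chi-squared distribution with one degree of freedom, which is what the cut-off values above correspond to; for one degree of freedom the p-value can be computed from the complementary error function in the standard library. The function name is hypothetical.

```python
from math import erfc, sqrt


def p_from_ll(ll):
    """p-value for an LL score, assuming 1 degree of freedom."""
    return erfc(sqrt(ll / 2))


# The cut-off values from the list above, plus the score for "bloody".
for ll in (3.84, 6.63, 10.83, 15.13, 22.98):
    print(ll, p_from_ll(ll))
```

Because the listed cut-offs are rounded to two decimals, the computed p-values land very close to, rather than exactly on, 0.05, 0.01, 0.001 and 0.0001.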

Mutual information score (MI)

The mutual information score is a measure of collocational strength: how strongly do two words occur together? Like the tests above, it compares the observed frequency of the collocates to the expected frequency in the span (e.g. 4 words). The higher the MI score, the stronger the link between two items. An MI score of 3.0 or higher is usually taken as evidence that two items are collocates. An MI score close to 0 means it’s likely that the two items co-occur by chance. A negative MI score, on the other hand, indicates that the two items tend to shun each other.
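A minimal sketch of the calculation in Python is given below. It assumes one widely used formulation, MI = log2(observed / expected), where the expected co-occurrence frequency is the product of the two word frequencies and the span, divided by the corpus size. Corpus tools differ in how exactly they take the span into account, so scores from a particular interface may not match this sketch. The function name and all counts are invented for illustration.

```python
from math import log2


def mutual_information(observed, freq_node, freq_collocate, corpus_size, span=4):
    """MI = log2(observed / expected) under the formulation assumed above."""
    if observed <= 0:
        raise ValueError("MI is undefined if the two words never co-occur")
    expected = freq_node * freq_collocate * span / corpus_size
    return log2(observed / expected)


# Hypothetical counts in a corpus of 1 million words.
strong = mutual_information(8, 100, 200, 1_000_000)    # far more often than expected
weak = mutual_information(1, 1000, 1000, 1_000_000)    # less often than expected

print(round(strong, 2), round(weak, 2))
```

In this invented example, the first pair co-occurs about 100 times more often than expected and clears the 3.0 threshold, while the second pair co-occurs less often than expected and receives a negative score.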