Data analysis and statistics

When writing a paper or preparing a presentation that uses corpus data, there are a couple of points you should adhere to when presenting your data and describing how you acquired it.

First, in your methodology section, make sure to mention which corpus or corpora you used and in which interface (or, if you’ve created your own corpus, with which programme(s)). Refer to the list of BYU corpora for advice on how to cite them correctly. If you’re using a different corpus, the site you downloaded it from usually provides similar information. Describe the specifications of each corpus, i.e. its size, its variety of English, its genre, etc.

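Second, state which words or phrases you were looking for, and how. Describe your search strategy and your reasons for choosing it. Be honest about the alternatives you experimented with and the possible disadvantages of your final search string, and explain why you still believe it maximised both precision (the share of hits that are relevant) and recall (the share of relevant instances that were found).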
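To make the two measures concrete, here is a minimal sketch in Python of how precision and recall could be calculated for a search string. The function name and all figures are hypothetical and purely illustrative. In practice, the total number of relevant instances in a corpus is rarely known and usually has to be estimated, for example from a manually checked sample.

```python
def precision_recall(relevant_hits, total_hits, total_relevant):
    """Return (precision, recall) for one search string.

    relevant_hits:  hits that really are instances of what you were looking for
    total_hits:     all hits the search string returned
    total_relevant: all relevant instances in the corpus (usually an estimate)
    """
    if total_hits <= 0 or total_relevant <= 0:
        raise ValueError("total_hits and total_relevant must be positive")
    precision = relevant_hits / total_hits
    recall = relevant_hits / total_relevant
    return precision, recall


# Hypothetical figures: 200 hits, 180 of them relevant,
# and an estimated 240 relevant instances in the whole corpus.
precision, recall = precision_recall(
    relevant_hits=180, total_hits=200, total_relevant=240
)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

With these made-up numbers, 90% of the hits are relevant, but a quarter of the relevant instances are missed, which is the kind of trade-off worth reporting in a methodology section.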

In the BYU corpora, it’s even possible to share links to the search strings you used, including the precise settings, in a paper or presentation. To do that, access your search history via the clock symbol on the top right. This provides a list of all your search queries, which can be repeated, hidden, or deleted. You can also add notes to them so that they can be sorted by topic or paper, for example. When using the NOW corpus (which is updated daily), it’s advisable to select the dates up to the day you’re doing the search so that it can be reproduced accurately. Otherwise, the same search will return different results later.


In the results section of your paper, you should use statistical analyses to interpret your data.

In general, there are two types of statistics: descriptive and inferential statistics.

Descriptive statistics includes anything that can be used to describe, show, or summarise data so that patterns emerge. That includes tables of frequency counts, graphs, and charts. However, descriptive statistics alone does not allow you to draw conclusions beyond your data or to reach conclusions regarding a hypothesis.
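One common descriptive step when comparing frequency counts from corpora of different sizes is to normalise them, for example to occurrences per million words. The following Python sketch illustrates the arithmetic. The corpus names, counts, and sizes are invented for illustration and do not come from any real corpus.

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical data: (raw hits, corpus size in words)
counts = {
    "Corpus A": (1200, 100_000_000),
    "Corpus B": (450, 20_000_000),
}

for name, (raw, size) in counts.items():
    print(f"{name}: {raw} hits, {per_million(raw, size):.1f} per million words")
```

In this made-up example, Corpus A has more hits in absolute terms, but the normalised frequency is higher in Corpus B (22.5 versus 12.0 per million words), which is why raw counts alone can be misleading.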

That’s what inferential statistics is for. Its tests provide an answer to the questions:
(a) Are the differences in the data statistically significant or not?
(b) If a new study were conducted under the same conditions, how sure can I be that the results would be the same (reproducibility)?
(c) Can I make generalisations beyond my data? For example, if I’ve found differences in a corpus of American English, such as COCA, could I argue that those differences are present in American English as a whole?
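As an illustration of question (a), the sketch below runs a Pearson chi-square test on a 2x2 table that compares how often a word occurs in two corpora. This is only one of several tests used for this purpose, and it is not necessarily the one recommended for your data. The function name and all counts are hypothetical. The p-value formula used here is valid only for one degree of freedom, which is the case for a 2x2 table.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table

                      corpus 1   corpus 2
        target word       a          b
        all other words   c          d

    Returns (chi2, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the word occurs 120 times in one corpus and
# 80 times in another, and each corpus contains 1,000,000 words.
chi2, p_value = chi_square_2x2(a=120, b=80, c=999_880, d=999_920)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests that a difference of this size would be unlikely if the word were equally frequent in both corpora. It does not by itself answer questions (b) and (c).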

Both types of statistics are important and should be used in research.

Continue reading about descriptive statistics here first, then inferential statistics.