List of BYU corpora

In a paper, cite the corpora you used as carefully as you would any other resource, such as books or articles. Write out the full name of a corpus the first time you mention it; afterwards, you can use its abbreviation for the sake of brevity.

Corpus of Contemporary American English (COCA)

  • 520 million words
  • American English
  • 1990 – 2015
  • only large and balanced corpus of American English
  • 20 million words per year
  • equally divided between the genres: spoken, fiction, popular magazines, newspapers, academic
  • How to cite: Davies, Mark. (2008-) The Corpus of Contemporary American English (COCA): 520 million words, 1990-present. Available online at http://corpus.byu.edu/coca/.
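
The COCA figures above are consistent with each other, which a quick check makes visible. The following is only a small sketch using the numbers from the list; the variable names are made up for illustration, and it assumes that "equally divided" means an exact five-way split, which the real corpus only approximates.

```python
# Figures taken from the COCA entry above; names are illustrative only.
WORDS_PER_YEAR = 20_000_000
FIRST_YEAR, LAST_YEAR = 1990, 2015
GENRES = ["spoken", "fiction", "popular magazines", "newspapers", "academic"]

# Both the first and the last year are covered, hence the +1.
years = LAST_YEAR - FIRST_YEAR + 1

# Total size implied by the yearly figure.
total_words = WORDS_PER_YEAR * years

# Assumption: an exact even split across the five genres.
words_per_genre_per_year = WORDS_PER_YEAR // len(GENRES)
words_per_genre_total = total_words // len(GENRES)

print(f"{years} years, {total_words:,} words in total")
print(f"{words_per_genre_per_year:,} words per genre per year")
print(f"{words_per_genre_total:,} words per genre overall")
```

Under these assumptions, 26 years at 20 million words per year give the 520 million words stated for the corpus, or roughly 4 million words per genre per year.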

Corpus of Historical American English (COHA)

  • 400 million words
  • American English
  • 1810-2009
  • largest structured corpus of historical English
  • balanced by genre for each decade
  • How to cite: Davies, Mark. (2010-) The Corpus of Historical American English (COHA): 400 million words, 1810-2009. Available online at http://corpus.byu.edu/coha/.

News on the Web (NOW)

  • more than 4 billion words
  • 20 countries, from web-based newspapers and magazines
  • 2010 – now (= still growing by about 5-6 million words each day)
  • How to cite: Davies, Mark. (2013) Corpus of News on the Web (NOW): 3+ billion words from 20 countries, updated every day. Available online at http://corpus.byu.edu/now/.

Global Web-Based English (GloWbE)

  • 1.9 billion words
  • 20 countries, web-based (allows comparisons between different varieties of English)
  • 2012 & 2013
  • How to cite: Davies, Mark. (2013) Corpus of Global Web-Based English: 1.9 billion words from speakers in 20 countries (GloWbE). Available online at http://corpus.byu.edu/glowbe/.

Wikipedia Corpus

  • 1.9 billion words (contains the full text of Wikipedia – more than 4.4 million articles)
  • English
  • up to 2014
  • How to cite: Davies, Mark. (2015) The Wikipedia Corpus: 4.6 million articles, 1.9 billion words. Adapted from Wikipedia. Available online at http://corpus.byu.edu/wiki/.

Hansard Corpus (British Parliament)

  • 1.6 billion words (contains nearly every speech given in British Parliament)
  • British English
  • 1803 – 2005
  • created as part of the SAMUELS project (2014-2016), which was funded by the UK Arts and Humanities Research Council

Corpus of US Supreme Court Opinions

  • 130 million words from 32,000 Supreme Court decisions
  • American English
  • 1790s – present

TIME Magazine Corpus

  • 100 million words from 275,000 articles
  • American English
  • 1923 – 2006
  • How to cite: Davies, Mark. (2007-) TIME Magazine Corpus: 100 million words, 1920s-2000s. Available online at http://corpus.byu.edu/time/.

Corpus of American Soap Operas

  • 100 million words from 22,000 transcripts
  • American English – very informal language
  • 2001-2012
  • How to cite: Davies, Mark. (2011-) Corpus of American Soap Operas: 100 million words. Available online at http://corpus.byu.edu/soap/.

British National Corpus (BYU-BNC)

  • originally created by Oxford University Press in the 1980s and early 1990s
  • 100 million words
  • British English
  • 1980s – 1993
  • genres: spoken, fiction, magazines, newspapers, academic
  • also available at Lancaster University: http://bncweb.lancs.ac.uk/
  • How to cite: Davies, Mark. (2004-) BYU-BNC. (Based on the British National Corpus from Oxford University Press). Available online at http://corpus.byu.edu/bnc/.

Strathy Corpus (Canada)

  • produced by the Strathy Language Unit at Queen’s University
  • 50 million words from more than 1,100 texts
  • Canadian English
  • 1970s – 2000s
  • genres: spoken, fiction, magazines, newspapers, academic

Corpus of Online Registers of English (CORE Corpus)

  • 50 million words, categorized into 33 different registers
  • web
  • up to 2014
  • How to cite: Davies, Mark. (2016-) Corpus of Online Registers of English (CORE). Available online at http://corpus.byu.edu/core/.

Google Books

  • 155 billion words for American and 34 billion for British English
  • based on data from Google Books, but not a Google product
  • 1500s – 2000s
  • How to cite: Davies, Mark. (2011-) Google Books Corpus. (Based on Google Books n-grams). Available online at http://googlebooks.byu.edu/. Based on:
    Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331 (2011) [Published online ahead of print 12/16/2010].