Virtual corpora at BYU

You can search a whole corpus (such as the entire COCA or BNC), but you can also create and search personalised virtual corpora. Using the Wikipedia corpus, for example, you can create sub-corpora for technical fields such as Biology, Engineering, Medicine, Chemistry or Physics, or virtual corpora relating to the world religions, to genres of music, film or literature, or to whatever else you’re interested in.

Creating virtual corpora

There are two ways of creating virtual corpora:

(1) Click on Texts/Virtual, then on Create Corpus.

This allows you to specify parameters and search by words in the title. For example, if you want to create a Biology corpus from Wikipedia articles, you could search for biolog*. This wildcard search returns articles whose titles contain terms such as biology, biological or biologist.

Permissible search strings:

  • wildcards at the ends of words (*)
  • lemmas
  • alternatives (or)
  • exact title in quotation marks
  • two or more words (and)
  • combinations: (relig* or psych*) and (journal* or research*)
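
To make the logic of a combined search string concrete, here is a minimal Python sketch of how a query like (relig* or psych*) and (journal* or research*) could be evaluated against article titles. It is only an illustration and not how the BYU interface is implemented: the function name title_matches, the nested-list representation of the query and the sample titles are all assumptions made for this example.

```python
from fnmatch import fnmatchcase

def title_matches(title, groups):
    """Return True if every group has at least one pattern matching a title word.

    `groups` is a list of lists: patterns inside a group are alternatives (or),
    and all groups must be satisfied (and). This layout is an assumption.
    """
    words = title.lower().split()
    return all(
        any(fnmatchcase(word, pattern) for word in words for pattern in group)
        for group in groups
    )

# (relig* or psych*) and (journal* or research*)
query = [["relig*", "psych*"], ["journal*", "research*"]]

# Invented titles, used only to show which ones the query would keep.
titles = [
    "Journal of Religion and Health",
    "Psychological Research",
    "Religion in Japan",
]

matching = [t for t in titles if title_matches(t, query)]
print(matching)
```

The third title is dropped because it satisfies only the first bracket of the query; both brackets have to match for a title to be included.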

The number of pages considered is set to 100 by default but can be changed. It’s also possible to exclude words from the title or page and to include words in the article (it’s often helpful to look through articles to get ideas for words to exclude). However, a word only needs to occur once for a page to be excluded, which might remove potentially relevant pages.
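
The effect of excluding on a single occurrence can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the site's actual filter: keep_page is a hypothetical helper, and the two page texts are invented.

```python
def keep_page(text, excluded_words):
    """Keep a page only if none of the excluded words occurs in it, even once."""
    words = set(text.lower().split())
    return not any(word in words for word in excluded_words)

# Invented page texts: the second page is about biology but mentions "film" once.
pages = {
    "Cell biology": "cell biology studies the structure and function of cells",
    "Marine biology": "marine biology studies ocean life and a film shares its name",
}

kept = [title for title, text in pages.items() if keep_page(text, ["film"])]
print(kept)
```

Here the page on marine biology is lost because of one passing mention of the excluded word, which is why it pays to check the list of pages before saving.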

At this point, you can leave out individual articles by unticking their boxes. You can then save your selection as a corpus; you will be prompted to enter a name for it. Click on any title to see the full article in Wikipedia.

(2) You can also search for tokens in the corpus text (as you usually would) and save the results as a virtual corpus or add them to an already existing one. Again, you can change the options to modify how many hits you want to include.

Choose the “Find articles” (or “Find texts”, depending on the corpus) option in the Texts/Virtual menu before executing your search. “Save list” allows you to save the results to a new corpus.

Editing virtual corpora

It is possible to edit virtual corpora, to rename them, delete them or to get rid of articles that don’t seem to be relevant. In the edit tab, a list of articles or pages is displayed. Clicking on an entry leads to the Keyword in Context display mode or, in the case of the Wikipedia corpus, to the corresponding article.

You can tick articles in order to delete them from your virtual corpus, move them to a different corpus (which deletes them from the current one) or add them to another corpus (which creates a copy instead of deleting the page from the original corpus). The entries are only added if they are not already in the corpus. It’s also possible to manually add articles or pages from the list of search results.
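
The difference between moving and adding can be modelled with sets, as in the following Python sketch. The functions add_to and move_to and the corpus contents are hypothetical and only mirror the behaviour described above; they say nothing about how the site stores corpora.

```python
def add_to(articles, source, target):
    """Copy the ticked articles into target; source keeps them.

    Because target is a set, entries already present are not duplicated.
    """
    target.update(articles & source)

def move_to(articles, source, target):
    """Copy the ticked articles into target, then remove them from source."""
    add_to(articles, source, target)
    source.difference_update(articles)

# Invented corpora, each represented as a set of article titles.
biology = {"Cell", "Enzyme", "Volleyball"}
sports = {"Volleyball", "Tennis"}
science = set()

add_to({"Cell"}, biology, science)        # "Cell" is now in both corpora
move_to({"Volleyball"}, biology, sports)  # removed from biology, not duplicated in sports

print(sorted(biology), sorted(sports), sorted(science))
```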

Furthermore, corpora can be deleted entirely (by clicking on the trash can), hidden (by clicking on the lock), and categorised. For instance, the biology corpus could be subsumed under science, or a volleyball corpus could be categorised as sports.

Keywords in virtual corpora

You can look at lists of keywords from your corpus by clicking on noun, verb, etc. Some multi-word expressions, like adjective + noun, are also possible. Examining the keywords is a convenient way of checking whether the corpus is well designed and contains little junk, since it helps you find articles that fall outside your scope. Click on words that stem from articles that don’t fit and delete their article (with the red x).

The results are usually sorted by frequency, but you can change that by clicking on specific, then on the part of speech or combination you would like to investigate. This will show words that are over-represented in your virtual corpus when compared to the corpus as a whole. The results are then sorted by how much more frequent the words are in your virtual corpus. Click on + or – to make the display more or less specific.
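
One simple way to think about "over-represented" is as a ratio of relative frequencies. The Python sketch below ranks words by that ratio. The measure the site really uses is not stated here, so the formula, the function name specificity and all counts are assumptions chosen for illustration.

```python
def specificity(count_virtual, size_virtual, count_whole, size_whole):
    """Relative frequency in the virtual corpus divided by that in the whole corpus.

    Values above 1 mean the word is over-represented in the virtual corpus.
    """
    return (count_virtual / size_virtual) / (count_whole / size_whole)

# Invented figures: word -> (count in virtual corpus, count in whole corpus)
counts = {"enzyme": (120, 400), "city": (30, 50_000)}
size_virtual, size_whole = 100_000, 10_000_000

scores = {
    word: specificity(in_virtual, size_virtual, in_whole, size_whole)
    for word, (in_virtual, in_whole) in counts.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these made-up numbers, "enzyme" ranks far above "city", even though both occur in the virtual corpus, because "city" is common everywhere.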

Searching in virtual corpora

In order to search within one virtual corpus, mark only that corpus in your Virtual Corpora list (you need to be logged in and might have to refresh the list). You can compare the frequency of words across your virtual corpora by choosing “My Corpora”. This gives you the number of tokens and the frequency per million words for each corpus. Click on the corpus name to see the word in context. Generally, if a search term doesn’t appear at all in one corpus, this corpus will not appear in the results list. Similarly, a hidden corpus will not be searched and will not appear in the list.
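
Frequencies per million words make corpora of different sizes comparable. The following Python sketch shows the arithmetic on invented figures and also leaves out corpora with no hits, as described above. The function per_million and the corpus names and sizes are assumptions for this example.

```python
def per_million(hits, corpus_tokens):
    """Normalise a raw hit count to occurrences per million words."""
    return hits * 1_000_000 / corpus_tokens

# Invented virtual corpora: name -> (hits for the search term, corpus size in words)
corpora = {
    "Biology": (45, 250_000),
    "Sports": (12, 400_000),
    "Music": (0, 300_000),
}

# A corpus without any hits is left out of the results.
results = {
    name: per_million(hits, size)
    for name, (hits, size) in corpora.items()
    if hits > 0
}
print(results)
```

Although the Sports corpus is larger, the term is six times as frequent in the Biology corpus once both counts are scaled to a million words.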

Possibilities in virtual corpora outside of the Wikipedia corpus

At the moment, it doesn’t seem to be possible to create virtual corpora with the Hansard corpus and Google Books. All other English BYU corpora allow you to create them and, depending on the information they provide, also offer different options to fine-tune your search.

In more detail, the corpora offer the following choices:

  • COCA: source, words in the title, years, genre/domain (e.g. spoken, fiction), and words in text
  • COHA: source, words in the title, author, years, genre (i.e. fiction, magazine, newspaper, non-fiction), and words in text
  • NOW: web domain, article title, country, dates, words in text
  • GloWbE: web domain, article title, country, genre, words in text
  • BYU-BNC: title/source, keywords, genre/domain, individual texts (abbreviated), words in text
  • TIME Magazine: article title, years, words in text
  • SOAP: years, show, words in text
  • Strathy: source, title, years, genre/domain, words in text
  • CORE: genre, web domain, title, country, words in text
  • SCOTUS: title, years, chief justice, issue, words in text (further improvements are planned or being executed at the moment)

Also refer to Mark Davies’ YouTube videos on virtual corpora using the Wikipedia corpus.