Make your own corpus

If you’re analysing language phenomena in a highly specialised context and no suitable corpus exists yet, you can create your own. If, for instance, you’re interested in modal verbs in Dickens’ novels, you could download the novels from Project Gutenberg or Literature Online and turn them into your very own corpus. Similarly, you could save the text of websites and analyse their language, or compile your own term papers into a corpus and work with that. Many available corpora that lack a search interface like that of the BYU corpora also have to be downloaded and loaded into a suitable program before you can use them properly.
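As a rough illustration of what "compiling" means in practice, the sketch below reads a folder of plain-text files into memory and splits each text into words. The function names `load_corpus` and `tokenise`, the UTF-8 encoding and the very simple tokenising rule are assumptions made for this example; they are not part of any particular tool.

```python
from pathlib import Path
import re


def load_corpus(folder):
    """Read every .txt file in `folder` into a dict: file name -> text."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the files are saved as UTF-8 plain text.
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus


def tokenise(text):
    """Very rough tokeniser: lowercased words, keeping internal
    apostrophes and hyphens (e.g. "can't", "well-known")."""
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())
```

Calling `load_corpus("my_corpus")` on a folder of that (hypothetical) name would return one entry per file, which keeps the file boundaries available for later checks. A real project would need a more careful tokeniser, for example for numbers or non-English characters.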

 

Two commonly used programs for searching and analysing a self-compiled corpus (so-called concordancers) are AntConc and WordSmith Tools. WordSmith Tools is commercial software, whereas AntConc is free to download and offers a variety of functions, including:

  • the concordance view, which shows results in a KWIC format
  • the concordance plot tool, which shows where in each text file the search results occur
  • the file view tool, which displays the text of individual files
  • clusters/N-Grams, which finds recurring sequences of n words
  • collocates, which lists the words that frequently occur near the search term
  • word list, which orders the most frequent words and presents them in a list
  • keyword list, which shows the words that are unusually (in)frequent in your corpus compared to a reference corpus
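To make some of these functions more concrete, here is a minimal Python sketch of a KWIC concordance, an n-gram count and a word list. It is not how AntConc is implemented; the function names, the context width and the simple tokeniser are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenise(text):
    """Very rough tokeniser (an assumption for this sketch)."""
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())


def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def ngrams(tokens, n):
    """Count all contiguous sequences of n tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


text = "It was the best of times, it was the worst of times."
tokens = tokenise(text)

# Print the hits with the search word aligned in the middle (KWIC layout).
for left, node, right in kwic(tokens, "was", width=3):
    print(f"{left:>20}  {node}  {right}")

print(word_list(tokens)[:3])          # three most frequent words
print(ngrams(tokens, 2).most_common(2))  # two most frequent bigrams
```

The point of the KWIC layout is that the search word lines up in one column, so patterns to its left and right become easy to spot.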

More information and various tutorials, both as PDFs and YouTube videos, are available on the AntConc site.

 

Additionally, you can run an automatic part-of-speech (PoS) tagger over your text so that you can search for specific parts of speech later on. CLAWS, for example, is a tagger that automatically assigns PoS tags to words and can be tried out through a free web interface. A parser (such as the Stanford Parser), in contrast, analyses the syntactic structure of whole sentences, usually in the form of sentence trees. AntConc offers options to hide the PoS tags so they don’t turn up in the concordance view. Be aware, though, that no automatic tagger or parser is 100% accurate.
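Tagged text is often stored with each tag attached to its word, for example as word_TAG. The sketch below assumes that format and shows how such text can be searched by tag and how the tags can be hidden again. The sample sentence is hand-written for illustration, not real tagger output, and the function names are hypothetical.

```python
def parse_tagged(text, sep="_"):
    """Split 'word_TAG' items into (word, tag) pairs.
    Items without a separator get the tag None."""
    pairs = []
    for item in text.split():
        word, found, tag = item.rpartition(sep)
        pairs.append((word, tag) if found else (item, None))
    return pairs


def find_by_tag(pairs, prefix):
    """Return all words whose tag starts with `prefix`."""
    return [word for word, tag in pairs if tag and tag.startswith(prefix)]


def hide_tags(pairs):
    """Rebuild the plain text without the tags."""
    return " ".join(word for word, _ in pairs)


# Hand-written sample; the tags only imitate a CLAWS-style tagset.
tagged = "You_PPY could_VM download_VVI the_AT novels_NN2 ._."
pairs = parse_tagged(tagged)

print(find_by_tag(pairs, "VM"))  # modal verbs in this sample
print(hide_tags(pairs))
```

Searching by tag prefix (here `VM`) is what makes a query like "all modal verbs" possible without listing every modal by hand.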

WebLicht, a CLARIN-D service, offers several tools such as tokenisers, taggers and parsers on one site. Its next update (in early 2018) will also allow users to download their annotated texts, which can then be loaded into AntConc.

 

When using self-compiled corpora, it’s even more important to check the distribution of a feature: just because it is common in your corpus as a whole doesn’t mean it is evenly distributed across all the files. If one speaker or author produced all or most of the tokens, the analysis becomes less reliable.
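One simple way to check the distribution is to count the feature separately for each file and compare the results. The sketch below assumes the corpus is already available as a dictionary of token lists, one per file; the function name and the choice of "per 1,000 words" as the normalisation are assumptions for this example.

```python
def distribution(corpus_tokens, target):
    """For each file, return (raw hits, hits per 1,000 words,
    share of all hits in the corpus) for the word `target`.

    corpus_tokens: dict mapping file name -> list of tokens.
    """
    hits = {name: toks.count(target) for name, toks in corpus_tokens.items()}
    total = sum(hits.values())
    report = {}
    for name, toks in corpus_tokens.items():
        # Normalise by file length so long and short files are comparable.
        per_thousand = 1000 * hits[name] / len(toks) if toks else 0.0
        share = hits[name] / total if total else 0.0
        report[name] = (hits[name], per_thousand, share)
    return report


# Toy data: two equally long "files" with very different frequencies.
corpus_tokens = {
    "a.txt": ["must"] * 3 + ["x"] * 7,
    "b.txt": ["must"] * 1 + ["x"] * 9,
}
for name, (raw, per_k, share) in distribution(corpus_tokens, "must").items():
    print(f"{name}: {raw} hits, {per_k:.1f} per 1,000 words, {share:.0%} of all hits")
```

If one file accounts for most of the hits, as `a.txt` does in this toy data, the overall frequency says more about that file than about the corpus as a whole.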

Don’t forget to reference any software you’ve used in your paper. Most tools provide a “how to cite” section on their website.