Here’s a recommended strategy for getting reliable results:
- pick the right corpus
- try a search string
- check your results for precision and recall
- if necessary: improve your search string
- if you get stuck, look up helpful search strategies, such as how to deal with contracted forms and punctuation marks
- …and try again
Some more advice:
Play around with the different display modes and functions of the BYU corpora. If it is diachronic variation (variation over time) that you’re after, try the chart view. If you are interested in the environment a word appears in, use the Keyword in Context (KWIC) display; if you want to see nearby words, look at the collocates; and if you want to compare the collocates of two words, use the compare function.
Experiment with your search queries. Often, your first (or second, or even third and fourth) try won’t yield quality results, but keep trying and you’ll get there! If you’re stuck, try the wiki, and if even that doesn’t help, you can post in the forum.
Keep the two quality benchmarks of corpus research in mind: precision and recall. High precision means that you get only relevant examples and no false, accidental hits, whereas high recall means that you get all the relevant examples of the feature you’re investigating.
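To make the two benchmarks concrete, here is a minimal Python sketch that computes them for a set of hits. It is only an illustration: the function name precision_recall and the hit labels are invented for this example and have nothing to do with the BYU interface itself.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of hits and a set of relevant examples."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant  # hits that are actually relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical numbers: the search returns 8 hits (hit0..hit7),
# while the corpus contains 10 relevant examples (hit2..hit11).
hits = {f"hit{i}" for i in range(8)}
relevant = {f"hit{i}" for i in range(2, 12)}

p, r = precision_recall(hits, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case, six of the eight hits are relevant (precision 0.75), but four relevant examples were never found (recall 0.60).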
In other words, always use the list display to check whether you are getting the results you are after. If you want to find all instances of, say, the verb “swim”, you should choose a search strategy that returns all inflectional variants (i.e. swim, swims, swam, swimming…). If you only type in “swim”, you’ll get high precision (few, if any, false hits), but you’ll miss a lot of other relevant examples, such as “swims”. Choosing “sw*” will give you high recall but very low precision, since all words starting with “sw” will come up, including swimmer, swine, swivel, swerve, sword… Using “swim*” would be a bit better because you would get swim, swims and swimming, but you would miss the irregular past tense form “swam” and would still pick up nouns such as “swimmer”. The lemma search “[swim]” should return all inflected forms, but it could still include the noun variants. In conclusion, a combination of the lemma search and a part-of-speech tag (“[swim]_v*” or, using the new BYU syntax, “SWIM_v*”) should work best, giving you both high recall and high precision. Often, though, there is a trade-off between the two, and you have to find a good balance.
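The trade-off between the different “swim” searches can be simulated on a tiny, hand-made word list. The following Python sketch is not the BYU search engine: the token list, its tags and the search helper are all invented for illustration, and the standard-library function fnmatchcase merely stands in for the wildcard search.

```python
from fnmatch import fnmatchcase

# Toy "corpus" of (word form, lemma, part of speech) triples, invented for illustration.
TOKENS = [
    ("swim", "swim", "v"),
    ("swims", "swim", "v"),
    ("swam", "swim", "v"),
    ("swum", "swim", "v"),
    ("swimming", "swim", "v"),
    ("swims", "swim", "n"),      # as in "her morning swims"
    ("swimmer", "swimmer", "n"),
    ("swine", "swine", "n"),
    ("swivel", "swivel", "v"),
    ("sword", "sword", "n"),
]


def search(form="*", lemma=None, pos=None):
    """Return the positions of tokens matching a wildcard form and, optionally, a lemma and a tag."""
    return {
        i
        for i, (word, lem, tag) in enumerate(TOKENS)
        if fnmatchcase(word, form)
        and (lemma is None or lem == lemma)
        and (pos is None or tag == pos)
    }


# What we actually want: every token of the verb "swim".
RELEVANT = search(lemma="swim", pos="v")


def evaluate(hits):
    """Return (precision, recall) of a set of hits against RELEVANT."""
    if not hits:
        return 0.0, 0.0
    true_hits = hits & RELEVANT
    return len(true_hits) / len(hits), len(true_hits) / len(RELEVANT)


QUERIES = {
    "swim": search(form="swim"),
    "sw*": search(form="sw*"),
    "swim*": search(form="swim*"),
    "lemma swim": search(lemma="swim"),
    "lemma swim + verb": search(lemma="swim", pos="v"),
}

for name, hits in QUERIES.items():
    precision, recall = evaluate(hits)
    print(f"{name:18} precision={precision:.2f} recall={recall:.2f}")
```

On this toy list, only the combination of lemma and verb tag reaches full precision and full recall at once. In a real corpus the tagging itself contains errors, so the figures will be less clean.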
Also keep in mind that, when you look for phrases such as “bake a cake”, the noun could be pre-modified, for example by an adjective (“bake a fluffy cake”) or a noun (“bake a chocolate cake”). To catch these cases, your search string might read “[bake] a * cake”, with a wildcard in place of the modifier.
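What such a wildcard slot does and does not catch can be tried out with ordinary regular expressions. The Python sketch below rests on two assumptions: that the wildcard fills exactly one word slot, and that a hand-written list of verb forms is an acceptable stand-in for the lemma search. The example sentences are invented.

```python
import re

# Hand-listed forms standing in for a lemma search on "bake" (an assumption of this sketch).
BAKE = r"(?:bake|bakes|baked|baking)"

# Exactly one word between "a" and "cake", mimicking a one-word wildcard slot.
one_slot = re.compile(rf"\b{BAKE} a \w+ cake\b")
# The modifier is optional: zero or one word between "a" and "cake".
optional_slot = re.compile(rf"\b{BAKE} a (?:\w+ )?cake\b")

SENTENCES = [
    "We bake a cake every Sunday.",
    "She baked a fluffy cake.",
    "He is baking a chocolate cake.",
    "They bake a very fluffy cake.",
]

one_slot_hits = [s for s in SENTENCES if one_slot.search(s)]
optional_hits = [s for s in SENTENCES if optional_slot.search(s)]

print("one slot:     ", one_slot_hits)
print("optional slot:", optional_hits)
```

Under these assumptions, the one-slot pattern finds the “fluffy” and “chocolate” sentences but not the plain “bake a cake”, and neither pattern finds the sentence with two modifiers. It is therefore worth running the unmodified search as well and checking whether longer modifier sequences matter for your question.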
In general, be aware that many words are ambiguous and can even function as different parts of speech, depending on the context. “Make up”, for instance, can mean reconciling after a fight if used as a phrasal verb, but can also refer to cosmetics such as mascara, concealer and powder if used as a noun. It is therefore highly recommended that you use the list view to check your results thoroughly, together with the context in which your search string appears.