Using an online corpus to study more efficiently

Screenshot of the online Corpus do Português

Those stubborn usage questions

Very often when we’re learning a language, we have questions that a dictionary or textbook just can’t answer. “Which of these two similar verbs is the one I want in this context?” “What preposition do I use with this verb?” “This definition is so vague. I wish I could see an example of how this word is used.” “Is this verb used reflexively?” “Is this expression used only in formal contexts?”

I call these types of questions “usage questions” because they’re not really about grammar but about how people use the language. You might have a range of options for how to say something, all of which might be grammatically correct, but one of which might be more common in speaking, another in writing. How do you answer these questions?

The best thing to do, of course, is to ask a native speaker, be they a tutor or a friend. But what if you’re writing an email and need an answer right now? What if you don’t know any native speakers? Another option would be to see if you can find the answer by doing a Google search of Portuguese pages. Sometimes this is helpful and sometimes it’s just not. A better idea might be to consult Linguee if you have a question about a particular Portuguese word. But another, more sophisticated solution is to use what’s called a corpus.

So what’s a corpus?

A corpus is a tool used in linguistics that is really nothing more than a huge database of written and spoken materials in a particular language. These could be newspaper articles, novels, television transcripts, nonfiction books, scientific papers, and transcripts of casual speech. Generally, the larger the database of materials, the better the corpus. After gathering as much material as they can and feeding it into the database, the linguists then use an automated program to examine each word in the database and tag it according to part of speech (noun, verb, preposition, etc.) and lemma.

A lemma is a keyword that groups together closely related words, like all the possible forms of a verb, or all possible inflections of a noun. For verbs, the lemma is always the infinitive, and for nouns and adjectives, the lemma is the masculine singular form. For example, gosto, gostam, gostamos, gostei, gostaram, gostado, gostando, gostarei, gostaremos, and gostarmos are all forms of the verb gostar, so they would be classified under the lemma “gostar”. Professor, professora, professores, professoras are all forms of the noun professor and they would all be grouped under the lemma “professor”. Got it? Grouping words by lemma makes it much easier to search the corpus for all forms of a particular word.
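To make the idea of lemma grouping concrete, here is a minimal sketch in Python. It is not how the Corpus do Português works internally; the tiny hand-tagged list, and the names TAGGED_TOKENS, group_by_lemma, and search_lemma, are all made up for illustration. It only shows why tagging each word with its lemma lets one search return every form of a word.

```python
from collections import defaultdict

# Hypothetical hand-tagged tokens: (word form, part of speech, lemma).
# A real corpus assigns these tags automatically; this list is invented.
TAGGED_TOKENS = [
    ("gosto", "verb", "gostar"),
    ("gostam", "verb", "gostar"),
    ("gostei", "verb", "gostar"),
    ("professora", "noun", "professor"),
    ("professores", "noun", "professor"),
    ("de", "preposition", "de"),
]


def group_by_lemma(tagged_tokens):
    """Map each lemma to the distinct word forms filed under it."""
    forms = defaultdict(list)
    for form, _pos, lemma in tagged_tokens:
        if form not in forms[lemma]:
            forms[lemma].append(form)
    return dict(forms)


def search_lemma(tagged_tokens, lemma):
    """Return every word form in the data that belongs to the given lemma."""
    return group_by_lemma(tagged_tokens).get(lemma, [])


print(search_lemma(TAGGED_TOKENS, "gostar"))  # ['gosto', 'gostam', 'gostei']
```

Searching for the lemma "gostar" finds gosto, gostam, and gostei in one step, without typing each form separately.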

Now let me introduce you to the online Corpus do Português. It contains 45 million words from written and audio sources from the 1300s through the 1900s (45 million is actually quite small, as corpi go!). If this is your first time using a corpus, I highly suggest you take their five-minute tour just to get a sense of the powerful things you can do with this tool. After you run a few dozen searches, they will require you to register to continue using the corpus, but registration is free.

Here are four ways we can use the Corpus to help us become more sophisticated in our grasp of Portuguese:

  1. Generating frequency lists of vocabulary words for study
  2. Using collocates to better understand the meaning of a word
  3. Finding out what preposition to use with a verb
  4. Comparing the usage of two similar words (in progress)
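As a rough illustration of the first three ideas, the sketch below builds a frequency list and counts the words that follow a verb, using four invented sentences. The sentences and the function names frequency_list and collocates are assumptions made for this example, not part of the Corpus do Português, which does the same kind of counting over millions of words.

```python
from collections import Counter

# Made-up sample sentences, already lowercased and without punctuation.
SENTENCES = [
    "eu gosto de café",
    "ela gosta de música",
    "nós gostamos de viajar",
    "eu preciso de café",
]


def frequency_list(sentences):
    """Count how often each word form appears, most frequent first."""
    counts = Counter(word for s in sentences for word in s.split())
    return counts.most_common()


def collocates(sentences, targets, window=1):
    """Count the words found within `window` positions after any target form."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        for i, word in enumerate(words):
            if word in targets:
                counts.update(words[i + 1 : i + 1 + window])
    return counts


# Forms of the verb "gostar" that occur in the sample sentences.
gostar_forms = {"gosto", "gosta", "gostamos"}

print(frequency_list(SENTENCES)[0])                        # ('de', 4)
print(collocates(SENTENCES, gostar_forms).most_common(1))  # [('de', 3)]
```

In this toy data the word that most often follows a form of gostar is "de", which is the kind of evidence a collocate search gives you when you want to know which preposition goes with a verb.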


3 Responses to Using an online corpus to study more efficiently

  1. Joe Masters says:

    Great idea to use a corpus to get more insight into the usage of particular words.

    I also really like using ReversoContext for getting a bit of an idea as to how words are actually used in Portuguese.

    Another slightly unusual tip I have, if you want to quickly get an impression of the meaning of a certain word, is to try a Google Images search!

  2. Stan says:

    Ola Joe

    Muito obrigado
    Wow! Reverso is really good.
    And thanks for the Google images idea

  3. Andrew says:

    Thanks for a great intro to what a corpus is and to the Corpus do Português!
    Just a word issue:
    Plural of corpus, as in “(45 million is actually quite small, as corpi go!)”
    Wait! Isn’t it ‘corpora’ direct from the Latin for ‘bodies’, or else ‘corpuses’ English style, like ‘campuses’ (instead of ‘campi’)?
    Example: Wikipedia “List of text corpora”.
    ‘Corpora vs corpuses’ is discussed in detail at:
