Using collocates to better understand the meaning of a word

One neat thing we can do with the Corpus do Português is to search for what are called collocates. A collocate is a word that appears near another word in the corpus more often than would be expected through chance alone. It can appear before or after the target word, usually within two or three words of it. Words that are strongly collocated appear together frequently, and that suggests that together they form a commonly used idea or phrase. For example, some of the collocates of the English word “planning” might be “urban”, “family”, “public”, “prior”, “poor”.

Collocates can give you great insight into the meaning of a word, even more so than the definition alone. For example, imagine that you are learning English and want to know what the word “sprawl” means. You go to a dictionary and read the following:


  1. to be stretched or spread out in an unnatural or ungraceful manner
  2. to sit or lie in a relaxed position with the limbs spread out carelessly or ungracefully
  3. to spread out, extend, or be distributed in a straggling or irregular manner, as vines, buildings, handwriting, etc.
  4. noun. the act or an instance of sprawling; a sprawling posture.

That’s an OK general definition, but it’s missing one of the main ways that “sprawl” is used today. Now, if we were to search an English corpus for the collocates of “sprawl” (and its associated forms “sprawling”, “sprawled”, etc.), we might get this list:

urban, suburban, city, metropolis, floor, bed, fight, prevent, prevention, out, anti, policy

There are a few words here that recall the dictionary definition: to sprawl out on the floor or on the bed. But other words suggest a different meaning – that sprawl is somehow associated with cities or suburbs, that sprawl is a bad thing and that people want to fight and prevent it. You can imagine how, if you were learning English, looking at these collocates would give you a much richer sense of what the word “sprawl” means and in what contexts you’re likely to encounter it. We as Portuguese learners can use a corpus in the same way when we’re confronted with a word whose dictionary definition is too vague or abstract.

Many verbs in Portuguese, like correr, cozinhar, and assistir, are relatively easy to remember because we can associate them with concrete actions for which we have obvious names in English – “to run”, “to cook”, “to watch”. But what about verbs that represent more vague, abstract ideas? Verbs like vincular, aproveitar, conseguir, reivindicar that maybe don’t have simple or easy translations? I find that I have a lot of trouble remembering what these verbs mean, because I can’t associate them with a simple action in my mind. Here, a corpus can help give us an idea of the range of meaning of these words in a way that a dictionary just can’t.

An example

Let’s take reivindicar. The dictionary definition is “to lay claim to; regain; assert”. Now, I think that’s a pretty nebulous definition that doesn’t really give me much insight into what the word means. So let’s take a look at its collocates. The screencast and the instructions below will show you how to do it.

I’ll type the lemma [reivindicar] into the WORD(S) field. Then I’ll click on COLLOCATES to display the Collocates text box, and I’ll click inside the box. I only want to find noun collocates, so under POS List, I’ll select nouns, and it will automatically put [nn*] in the Collocates field. I’ll tell it to show me words that come up to 1 word before and up to 3 words after reivindicar. Finally, I’ll tell it to sort by Relevance, so that the words that are most strongly associated with reivindicar will appear at the top of my list.
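To make the “1 word before, 3 words after” window concrete, here is a minimal Python sketch of that kind of count. It is only an illustration, not how the Corpus itself works: the function name `window_collocates` and the toy sentence are made up, the tokens are lemmatized by hand, and there is no part-of-speech filter, so it counts every nearby word rather than only nouns.

```python
from collections import Counter

def window_collocates(tokens, target, left=1, right=3):
    """Count words found up to `left` tokens before or `right` tokens
    after each occurrence of `target` in a list of tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start = max(0, i - left)                # clamp at the start of the text
        end = min(len(tokens), i + right + 1)   # clamp at the end of the text
        for j in range(start, end):
            if j != i:                          # skip the target word itself
                counts[tokens[j]] += 1
    return counts

# Toy, invented example: lemmas typed by hand, not real corpus data.
tokens = ("o grupo reivindicar a autoria de o atentado "
          "e o país reivindicar o território").split()

counts = window_collocates(tokens, "reivindicar", left=1, right=3)
for word, n in counts.most_common():
    print(word, n)
```

Notice that “atentado” falls outside the window in this toy sentence, so it is not counted. A wider window would catch it, at the cost of picking up more unrelated words.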

Now I click the Search button and see this list:

These are all the nouns found within 1 word to the left and 3 words to the right of reivindicar in the Corpus. The TOT column tells you how many times the word appears collocated with reivindicar in the Corpus. The MI (Mutual Information) column is quite interesting: it quantifies how strong the association between the two words is. In other words, how much more often do the two words appear together than we would expect by chance alone? As a rule of thumb, words that have an MI of about 3 or higher are considered “semantically bonded”. An MI below 3 suggests that the words may be occurring together mostly by chance, without a strong semantic relationship. When you ask to sort by Relevance, this is the field it uses, which is why the MI numbers go in descending order.
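If you are curious how a score like MI can be computed, here is a rough Python sketch. It uses a common textbook form of mutual information, adjusted for the size of the search window; I am assuming this form for illustration, and the Corpus’s own calculation may differ in its details. The function name and all the counts below are invented.

```python
import math

def mutual_information(pair_count, target_count, collocate_count,
                       corpus_size, span=1):
    """Log (base 2) of observed co-occurrences over the number expected
    by chance, given each word's frequency and the window size."""
    expected = target_count * collocate_count * span / corpus_size
    return math.log2(pair_count / expected)

# Invented numbers: a target seen 1,000 times, a collocate seen 2,000 times,
# 50 co-occurrences, a 20-million-word corpus, and a 4-word window (1 + 3).
mi = mutual_information(pair_count=50, target_count=1000,
                        collocate_count=2000, corpus_size=20_000_000, span=4)
print(round(mi, 2))

# When the pair occurs exactly as often as chance predicts, the score is 0.
chance = mutual_information(pair_count=4, target_count=1000,
                            collocate_count=1000, corpus_size=1_000_000, span=4)
print(chance)
```

The useful intuition is that each additional point of MI means the pair shows up twice as often relative to chance, so a score of 3 corresponds to roughly eight times the chance rate.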

In the list of results, a few words pop out at me. Here are the things it is possible to “lay claim to” using reivindicar:

atentados (terrorist attacks / shootings / bombings), autoria (authorship), território (territory), posse (possession), direitos (rights), dinheiro (money)

From this list, I get the sense that reivindicar is used mostly in three contexts. It can refer to a group taking responsibility for an attack, to a country laying claim to lands and territories, or to people asserting legal ownership of rights, property, or money.

Besides having gotten a better understanding of the meaning of the word, I now have access to a plethora of mnemonic images that I could use as a shorthand for remembering what reivindicar means the next time I encounter it. I could, for example, imagine Pedro Álvares Cabral planting a flag on a beach and claiming the new territory of Brazil for the Portuguese king (rei). I imagine how this act might feel “vindicating” to a rei (king), and there it is: “rei-vindicar”. Or I could imagine the Argentine Navy asserting its right to possession of the Falkland Islands during the war against the UK in 1982. (If there’s an obvious mnemonic that comes to mind based on the sound of the word, as in this case, I’ll go ahead and use it, but in general I find it tough to come up with good mnemonics. Instead I usually just think of an image that may not have anything to do with the sound of the word.)

At this point, I can also click on any of the collocates to view the source texts where it was found. Looking at examples of the word in context like this can be an excellent way to study difficult words.
