====== 4a. Frequency lists ======

//This section explains how to create frequency lists from a concordance. It presupposes that you have read [[cqp:…]].//

===== Simple word counts =====

“My love for you is immeasurable”, …

For example, if we have a concordance called ''Love'', we can count the word forms it contains by typing the following:

  count Love by word%c

We get a result like the following:

  22315 love    [#…]
   4313 loved   [#…]
   1341 loves   [#…]
    510 …
      1 …

The first column gives the frequency, the second column the word form, and the third column the line number(s) in the sorted concordance corresponding to the form given in the second column. We can display just those lines by adding them to the ''cat'' command, for example:

  cat Love 27970 28479

As just hinted at, applying the ''count'' command sorts the concordance by the counted expression, so that all instances of a given form end up next to each other; the line numbers in the third column refer to this sorted order.
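
In case you are wondering where a concordance like ''Love'' comes from in the first place: it is simply a query result saved under a name of our choosing. The following is a minimal sketch of how it might be created and of how the frequency list might be written to a text file for further processing; it assumes that the corpus stores lemmas in an attribute called ''hw'' (as in the queries used later in this section), that the output of ''count'' can be redirected with ''>'', and that the file name is just an example:

  Love = [hw="love"%c]
  count Love by word%c > "love_forms.txt"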

We can also count the concordance by part of speech by typing the following:

  count Love by pos

This gives us a result like this:

  11437 NN1      [#…]
   3788 VVB      [#…]
   3028 VVI      [#…]
   2879 VVD      [#…]
   2650 NN1-VVB  [#…]
   1367 VVB-NN1  [#…]
   1149 VVZ      [#…]
   1040 VVN      [#…]
    265 …
    252 …
    245 …
    113 …
     81 NN2      [#…]
     59 NN2-VVZ  [#…]
     52 VVZ-NN2  [#…]
     43 NP0      [#…]
     26 VVN-AJ0  [#…]
      3 …
      2 …
      1 …

The structure is the same as before, except that the second column now contains the pos tags instead of the word forms.
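
The attribute named after ''by'' is not limited to ''word'' and ''pos''; any positional attribute of the corpus can be counted. For example, assuming (as above) that the corpus stores lemmas in an attribute called ''hw'', the following sketch would count the concordance by lemma:

  count Love by hw%c

For the concordance ''Love'' itself this is not very informative, since (if it was created by a query on the lemma) all hits share the lemma //love//; it becomes useful for concordances whose hits involve more than one lemma.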

===== More complex word counts =====

As discussed in [[cqp:…]], queries may consist of more than one token, and the ''count'' command can be applied to the results of such queries, too. Consider the following query:

  [hw="drive"] [pos="PNP"] [pos="AJ0"]

This should find all instances of the lemma //drive//, followed by a personal pronoun (//me//, //you//, etc.), followed by an uninflected adjective. Try it, and you will see that it does indeed. We can now apply the ''count'' command to the result of this query (which CQP automatically stores under the name ''Last''):

  count Last by word%c

This will give us a result like this, …

  15 driving me mad    [#96-#110]
  14 driving me crazy  [#77-#90]
   8 drive you mad     [#28-#35]
   8 drove him mad     [#117-#124]
   7 drive me mad      [#11-#17]
   7 …
   6 drive you crazy   [#21-#26]
   5 drive me crazy    [#1-#5]
   5 …
   4 …

… which is not really what we want: we are interested in the adjectives and their frequencies, not in the frequencies of the entire word sequences. Fortunately, we can restrict the count to a single position of the match by adding the keyword ''on'' followed by an anchor: ''match'' refers to the first token of each hit, so ''match[2]'' is the third token, i.e. the adjective:

  count Last by word%c on match[2]

This will give us the following, which is exactly what we want:

  66 mad     [#66-#131]
  37 crazy   [#9-#45]
  13 insane  [#…]
   5 …
   3 …
   2 …
   1 …
   1 …
   1 …
   1 …
   1 …
   1 …
   1 …
   1 …
   1 …
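
The same mechanism works for any position in the match. For example, the following sketch would give us a frequency list of the personal pronouns in the middle slot of the pattern (the second token, i.e. ''match[1]''):

  count Last by word%c on match[1]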

===== Even more complex word counts =====

But what if we want to create a frequency list of an expression that is larger than a single word but smaller than the entire match? For example, we might notice that the expression //drive someone crazy// also has a variant //drive someone to distraction//, and we might wonder whether there are other variants with a preposition and a noun instead of the adjective. We could look for them with a query like the following:

  [hw="drive"] [pos="PNP"] [pos="PRP"] [pos="NN.*"]

Now, we would like a frequency list of the sequence of preposition and noun at the end of the match. Fortunately, the ''on'' clause also accepts a range of positions, specified as two anchors joined by two periods:

  count Last by word%c on match[2] .. match[3]

Try it. The first few lines of the frequency list should look like this:

  7 to distraction
  5 to hospital
  5 to suicide
  2 into opposition
  2 to madness
  2 to school
  2 to victory
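
If we were interested in the nouns alone, ignoring the prepositions, we would not even need a numbered anchor: the ''matchend'' anchor always refers to the last token of a match, whatever its length. A sketch:

  count Last by word%c on matchend

In this four-token pattern, ''matchend'' picks out the same token as ''match[3]''.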

In this case, the range contains two tokens that are next to each other, but the same notation works for larger ranges. For example, we may notice the expression //drive someone up the wall//, and wonder if there are other cases like this, with an article between the preposition and the noun. We could construct a query like the following to capture such cases:

  [hw="drive"] [pos="PNP"] [pos="PRP"] [pos="AT0"] [pos="NN.*"]

We can then produce a frequency list of the last three tokens of the match like this:

  count Last by word%c on match[2] .. match[4]

This will give us a list like the following:

  13 to the airport
  13 up the wall       [#132-#144]
  11 to the station
   4 into the arms     [#24-#27]
   2 into the street
   2 to the conclusion
   2 to the hospital
   2 to the meeting
   2 to the police

The phrase //up the wall// seems to be the only case of the expression we are looking for (but at least //drive someone into the arms (of …)// reminds us of Smokie's “Lay Back in the Arms of Someone”).
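
If you want to look at those four hits of //into the arms//, note that the numbers in the third column of a frequency list can again be passed to the ''cat'' command, as shown in the first section above:

  cat Last 24 27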

===== Summary and outlook =====

This section introduced frequency lists. You can now read the following sections in any order:

  * [[cqp:…]]
  * [[cqp:…]]