====== 4a. Frequency lists ======

//This section explains how to create frequency lists from a concordance. It presupposes that you have read [[cqp:…]].//

===== Simple word counts =====

“My love for you is immeasurable”, …

For example, if we have a concordance called ''Love'', we can count the word forms it contains by typing the following:

  count Love by word%c

We get a result like the following:

  22315 love    [#…]
   4313 loved   [#…]
   1341 loves   [#…]
    510 …
      1 …

The first column gives the frequency, the second column the word form, and the third column the line number(s) in the sorted concordance corresponding to the form given in the second column. We can display just those lines by adding them to the ''cat'' command, for example:

  cat Love 27970 28479

As just hinted at, applying the ''count'' command sorts the concordance by the counted expression, so that all instances of a given form end up next to each other; the line numbers in the third column refer to this sorted order.
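
In case you are wondering where a concordance like ''Love'' comes from in the first place: it is simply a query result saved under a name of our choosing. The following is a minimal sketch of how it might be created and of how the frequency list might be written to a text file for further processing; it assumes that the corpus stores lemmas in an attribute called ''hw'' (as in the queries used later in this section), that the output of ''count'' can be redirected with ''>'', and that the file name is just an example:

  Love = [hw="love"%c]
  count Love by word%c > "love_forms.txt"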

We can also count the concordance by part of speech by typing the following:

  count Love by pos

This gives us a result like this:

  11437 NN1      [#…]
   3788 VVB      [#…]
   3028 VVI      [#…]
   2879 VVD      [#…]
   2650 NN1-VVB  [#…]
   1367 VVB-NN1  [#…]
   1149 VVZ      [#…]
   1040 VVN      [#…]
    265 …
    252 …
    245 …
    113 …
     81 NN2      [#…]
     59 NN2-VVZ  [#…]
     52 VVZ-NN2  [#…]
     43 NP0      [#…]
     26 VVN-AJ0  [#…]
      3 …
      2 …
      1 …

The structure is the same as before, except that the second column now contains the pos tags instead of the word forms.
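
The attribute named after ''by'' is not limited to ''word'' and ''pos''; any positional attribute of the corpus can be counted. For example, assuming (as above) that the corpus stores lemmas in an attribute called ''hw'', the following sketch would count the concordance by lemma:

  count Love by hw%c

For the concordance ''Love'' itself this is not very informative, since (if it was created by a query on the lemma) all hits share the lemma //love//; it becomes useful for concordances whose hits involve more than one lemma.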

===== More complex word counts =====

As discussed in [[cqp:…]], queries may consist of more than one token, and the ''count'' command can be applied to the results of such queries, too. Consider the following query:

  [hw="drive"] [pos="PNP"] [pos="AJ0"]

This should find all instances of the lemma //drive//, followed by a personal pronoun (//me//, //you//, etc.), followed by an uninflected adjective. Try it, and you will see that it does indeed. We can now apply the ''count'' command to the result of this query (which CQP automatically stores under the name ''Last''):

  count Last by word%c

This will give us a result like this, …

  15 driving me mad    [#96-#110]
  14 driving me crazy  [#77-#90]
   8 drive you mad     [#28-#35]
   8 drove him mad     [#117-#124]
   7 drive me mad      [#11-#17]
   7 …
   6 drive you crazy   [#21-#26]
   5 drive me crazy    [#1-#5]
   5 …
   4 …

… which is not really what we want: we are interested in the adjectives and their frequencies, not in the frequencies of the entire word sequences. Fortunately, we can restrict the count to a single position of the match by adding the keyword ''on'' followed by an anchor: ''match'' refers to the first token of each hit, so ''match[2]'' is the third token, i.e. the adjective:

  count Last by word%c on match[2]

This will give us the following, which is exactly what we want:

  66 mad     [#66-#131]
  37 crazy   [#9-#45]
  13 insane  [#…]
   5 …
   3 …
   2 …
   1 …
   1 …
   1 …
   1 …
   1 …
   1 …
   1 …
   1 …
   1 …
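
The same mechanism works for any position in the match. For example, the following sketch would give us a frequency list of the personal pronouns in the middle slot of the pattern (the second token, i.e. ''match[1]''):

  count Last by word%c on match[1]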

===== Even more complex word counts =====

But what if we want to create a frequency list of an expression that is larger than a single word but smaller than the entire match? For example, we might notice that the expression //drive someone crazy// also has a variant //drive someone to distraction//, and we might wonder whether there are other variants with a preposition and a noun instead of the adjective. We could look for them with a query like the following:

  [hw="drive"] [pos="PNP"] [pos="PRP"] [pos="NN.*"]

Now, we would like a frequency list of the sequence of preposition and noun at the end of the match. Fortunately, the ''on'' clause also accepts a range of positions, specified as two anchors joined by two periods:

  count Last by word%c on match[2] .. match[3]

Try it. The first few lines of the frequency list should look like this:

  7 to distraction
  5 to hospital
  5 to suicide
  2 into opposition
  2 to madness
  2 to school
  2 to victory
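
If we were interested in the nouns alone, ignoring the prepositions, we would not even need a numbered anchor: the ''matchend'' anchor always refers to the last token of a match, whatever its length. A sketch:

  count Last by word%c on matchend

In this four-token pattern, ''matchend'' picks out the same token as ''match[3]''.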

In this case, the range contains two tokens that are next to each other, but the same notation works for larger ranges. For example, we may notice the expression //drive someone up the wall//, and wonder if there are other cases like this, with an article between the preposition and the noun. We could construct a query like the following to capture such cases:

  [hw="drive"] [pos="PNP"] [pos="PRP"] [pos="AT0"] [pos="NN.*"]

We can then produce a frequency list of the last three tokens of the match like this:

  count Last by word%c on match[2] .. match[4]

This will give us a list like the following:

  13 to the airport
  13 up the wall       [#132-#144]
  11 to the station
   4 into the arms     [#24-#27]
   2 into the street
   2 to the conclusion
   2 to the hospital
   2 to the meeting
   2 to the police

The phrase //up the wall// seems to be the only case of the expression we are looking for (but at least //drive someone into the arms (of …)// reminds us of Smokie's “Lay Back in the Arms of Someone”).
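
If you want to look at those four hits of //into the arms//, note that the numbers in the third column of a frequency list can again be passed to the ''cat'' command, as shown in the first section above:

  cat Last 24 27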

===== Summary and outlook =====

This section introduced frequency lists. You can now read the following sections in any order:

  * [[cqp:…]]
  * [[cqp:…]]