--- cqp:regular-expressions-basics [2023/11/08 09:53] – astefanowitsch
+++ cqp:regular-expressions-basics [2024/06/20 13:53] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
+**[ [[cqp:introduction|Collection: Introduction to CQP]] ]**
+====== 3e. Regular expressions (basics) ======
+//This section introduces regular expressions (sequences of symbols and characters that can describe classes of strings) and shows how they can be used inside the value of an attribute-value pair in CQP. It presupposes that you have read Sections [[cqp:corpus-structure|1]], [[cqp:simple-queries|2]], [[cqp:extending-queries-combinations|3a]], [[cqp:extending-queries-alternatives|3b]], and [[cqp:complex-queries|3c]].//
+So far, the values we provided for the different attributes in our queries corresponded exactly to one string: ''[word="love"]'' will match tokens where the wordform is exactly ''love'', ''[pos="NN1"]'' will match tokens where the pos tag is exactly ''NN1'', etc. This is often sufficient, but sometimes we want to look for more than one string in a single query. For example, we may want to search for the strings //love a little// and //live a little//. We could do this using two separate queries:
+	[word="love"%c] [word="a"%c] [word="little"%c]
+	[word="live"%c] [word="a"%c] [word="little"%c]
+But note that the queries only differ by a single character! Would it not be great if there were a way of searching for a string consisting of an ''l'' followed by an ''o'' //or// an ''i'' followed by the sequence ''ve''?
+This is where regular expressions come in: among other things, they allow us to specify a character //class// instead of a specific character.
+===== Character classes =====
+The largest character class is represented by the period: ''.'' -- this stands for “any character”. Using this character class, we can combine the two queries shown above:
+	[word="l.ve"%c] [word="a"%c] [word="little"%c]
+Try the query using the BNC, and you will see that it finds all instances of the two strings //love a little// and //live a little//. It would theoretically also find strings with other characters in the position of the period, such as //leve//, //lxve//, //l3ve// or //l;ve//, but these do not occur in the BNC. This character class is useful if you don't know what to expect in a particular position of a string. For example, you may be interested in what words differ from //love// in just the first character. The query ''[word=".ove"%c]'' will provide the answer (run it over the BNC).
+In many cases, you will want to be more specific than this. In these cases, you can define your own class containing exactly the characters you want it to contain, by listing them between square brackets. For example, if you want to find all cases of //love// and //dove//, but not //move//, //cove//, //hove//, etc., you can define a class containing just the characters ''l'' and ''d'': ''[ld]''. Try running the query ''[word="[ld]ove"%c]''. Using a self-defined character class ''[oi]'', you can make the query shown above more precise:
+	[word="l[oi]ve"%c] [word="a"%c] [word="little"%c]
+You can also define character classes negatively, by specifying what characters they should //not// contain. You do this by adding a caret (''^'') at the beginning of your list of characters: ''[^ld]ove'' will find all words beginning with any character //except// ''l'' or ''d'', followed by ''ove''.
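+The three kinds of character class introduced so far can also be tried out outside CQP. The following is only a minimal sketch in Python, not CQP itself: it assumes that Python's ''re'' module behaves like CQP's regular expressions for these simple patterns. The helper ''matches_token'' and the word list are made up for illustration. ''re.fullmatch'' mimics the fact that CQP compares the pattern with the whole token, and ''re.IGNORECASE'' stands in for the ''%c'' flag.

```python
import re

def matches_token(pattern, token):
    """Hypothetical helper: True if the pattern matches the WHOLE token,
    ignoring case (a stand-in for CQP's %c flag)."""
    return re.fullmatch(pattern, token, re.IGNORECASE) is not None

# Made-up word list, not taken from the BNC
tokens = ["love", "live", "dove", "move", "Love", "glove"]

# "." stands for any single character
any_char = [t for t in tokens if matches_token("l.ve", t)]

# "[ld]" stands for exactly one of the listed characters
own_class = [t for t in tokens if matches_token("[ld]ove", t)]

# "[^ld]" stands for any single character except the listed ones
negated = [t for t in tokens if matches_token("[^ld]ove", t)]

print(any_char)   # ['love', 'live', 'Love']
print(own_class)  # ['love', 'dove', 'Love']
print(negated)    # ['move']
```

+Note that //glove// is never returned: each character class fills exactly one position, so a four-character pattern cannot match a five-character token.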
+Many versions of regular expressions, including the one used in CQP, provide a range of predefined character classes that are more specific than ''.''. Some useful examples of such classes are:
+  * ''%%[%%[:alpha:]]'' -- all alphabetic characters (roughly, all letters)
+  * ''%%[%%[:digit:]]'' -- all numeric characters (roughly, all digits)
+  * ''%%[%%[:alnum:]]'' -- all alphanumeric characters (roughly, all letters and numbers)
+  * ''%%[%%[:upper:]]'' -- all upper-case alphabetic characters
+  * ''%%[%%[:lower:]]'' -- all lower-case alphabetic characters
+  * ''%%[%%[:punct:]]'' -- all punctuation marks
+  * ''%%[%%[:blank:]]'' -- spaces and tabstops
+  * ''%%[%%[:space:]]'' -- all whitespace characters (e.g. spaces, tabstops, line breaks)
+You are unlikely to need the last two, since CQP does not allow whitespace within tokens, so none of the tokens in any of our corpora contain spaces, tabstops, etc.
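+To get a feeling for what the predefined classes do, here is a rough sketch in Python. Python's ''re'' module does not understand the ''%%[[:name:]]%%'' notation, so the sketch swaps in ASCII-only approximations. This is an assumption made purely for illustration; in CQP, the classes may well cover non-ASCII letters too. The names ''POSIX_APPROX'', ''translate'' and ''matches_token'' are hypothetical.

```python
import re

# Assumed ASCII-only stand-ins for the predefined classes; Python's re
# module does not support the [[:name:]] notation itself.
POSIX_APPROX = {
    "[[:alpha:]]": "[A-Za-z]",
    "[[:digit:]]": "[0-9]",
    "[[:alnum:]]": "[A-Za-z0-9]",
    "[[:upper:]]": "[A-Z]",
    "[[:lower:]]": "[a-z]",
}

def translate(pattern):
    """Hypothetical helper: replace each predefined class by its approximation."""
    for posix, approx in POSIX_APPROX.items():
        pattern = pattern.replace(posix, approx)
    return pattern

def matches_token(pattern, token):
    """True if the translated pattern matches the whole token.
    Case-sensitive, i.e. like a CQP query WITHOUT the %c flag."""
    return re.fullmatch(translate(pattern), token) is not None

# Made-up tokens for illustration
tokens = ["Love", "love", "LOVE", "1984", "B52"]

capitalised = [t for t in tokens if matches_token("[[:upper:]]ove", t)]
four_digits = [t for t in tokens
               if matches_token("[[:digit:]][[:digit:]][[:digit:]][[:digit:]]", t)]
letter_or_digit_first = [t for t in tokens
                         if matches_token("[[:alnum:]][[:digit:]][[:digit:]]", t)]

print(capitalised)            # ['Love']
print(four_digits)            # ['1984']
print(letter_or_digit_first)  # ['B52']
```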
+===== Quantification =====
+Using character classes in our queries will often make things easier, but they are still constrained in one respect: each class only applies to a single position in a string, whereas we may want to specify a sequence of characters. For example, we may want to find all cases of the interjection //oh// (as in //Oh, my love//). The //o// in this interjection is often repeated a number of times to indicate the length or intensity of the interjection: in the BNC, we find //oh//, //ooh//, //oooh//, //ooooh//, //oooooh//, //ooooooh//, //ooooooooh//, and //ooooooooooooooooh//. But //how// do we find them?
+Regular expressions provide ways of specifying how often a character (or character class) should occur. There are three general quantifiers:
+  * ''?'' -- the preceding character (class) may occur zero times or once
+  * ''*'' -- the preceding character (class) may occur between zero and infinitely many times
+  * ''+'' -- the preceding character (class) may occur between one and infinitely many times
+For example, the query ''[word="o+h"%c]'' will return all of the variants listed above (it corresponds to “one or more occurrences of //o// followed by an //h//”). Note that the interjection is often spelled without a final //h// -- the query ''[word="o+h?"%c]'' would also return these cases (it corresponds to “one or more occurrences of //o// followed by zero or one occurrence(s) of //h//”) -- try it. In fact, the final //h// may also be repeated to indicate intensity, so an even better query would be ''[word="o+h*"%c]'' (which corresponds to “one or more occurrences of //o// followed by zero or more occurrence(s) of //h//”) -- again, try it.
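+The difference between the three quantifiers becomes clearer when the three patterns are run over the same list of tokens. The following is a small Python sketch, assuming that Python's ''re'' module treats ''?'', ''*'' and ''+'' the same way CQP does; the helper ''matches_token'' and the token list are invented for illustration and do not come from the BNC.

```python
import re

def matches_token(pattern, token):
    """Hypothetical helper: whole-token, case-insensitive match (like %c)."""
    return re.fullmatch(pattern, token, re.IGNORECASE) is not None

# Invented spellings of the interjection, plus a bare "h" as a control
tokens = ["oh", "ooh", "oooh", "o", "oo", "ohh", "oohhh", "h", "Oh"]

# one or more o, followed by exactly one h
plus_h = [t for t in tokens if matches_token("o+h", t)]

# one or more o, followed by zero or one h
optional_h = [t for t in tokens if matches_token("o+h?", t)]

# one or more o, followed by zero or more h
any_h = [t for t in tokens if matches_token("o+h*", t)]

print(plus_h)      # ['oh', 'ooh', 'oooh', 'Oh']
print(optional_h)  # ['oh', 'ooh', 'oooh', 'o', 'oo', 'Oh']
print(any_h)       # ['oh', 'ooh', 'oooh', 'o', 'oo', 'ohh', 'oohhh', 'Oh']
```

+The bare //h// is never matched, because ''o+'' requires at least one //o// in all three patterns.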
+These quantifiers can also be applied to character classes. For example, the following query would find all words that begin with an //l// and end with the sequence //ve// (for example, //love//, //live//, //leave// and //legislative//):
+	[word="l.+ve"%c]
+If we want to be more specific and just find words starting with //l// followed by one or more vowels followed by //ve//, we can combine the ''+'' with a self-defined character class:
+	[word="l[aeiou]+ve"%c]
+This will give us //love//, //live//, //leave// and a few others.
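+As a rough illustration of how the choice of character class changes the result, the two patterns can be compared in Python. This is a sketch under the assumption that Python's ''re'' module behaves like CQP for these patterns; ''matches_token'' and the word list are hypothetical.

```python
import re

def matches_token(pattern, token):
    """Hypothetical helper: whole-token, case-insensitive match (like %c)."""
    return re.fullmatch(pattern, token, re.IGNORECASE) is not None

# Made-up word list for illustration
tokens = ["love", "live", "leave", "legislative", "lve", "glove", "loved"]

# l, then one or more characters of any kind, then ve
any_middle = [t for t in tokens if matches_token("l.+ve", t)]

# l, then one or more vowels, then ve
vowel_middle = [t for t in tokens if matches_token("l[aeiou]+ve", t)]

print(any_middle)    # ['love', 'live', 'leave', 'legislative']
print(vowel_middle)  # ['love', 'live', 'leave']
```

+//legislative// drops out of the second result because the material between //l// and //ve// contains consonants, and //lve// is missing from both because ''+'' requires at least one character in that position.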
+Note that regular expressions are case-sensitive. The above queries will find upper- and lowercase versions of the specified strings because they use the ''%c'' flag, but without this flag, we would have to define a character class for every position that contains the upper- and lowercase letter:
+	[word="[Ll][AEIOUaeiou]+[Vv][Ee]"]
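+The effect of the ''%c'' flag can be imitated in Python with ''re.IGNORECASE''. This is an analogy for illustration only, not a description of how CQP implements the flag. The helper ''find'' and the token list are hypothetical.

```python
import re

# Made-up tokens in different capitalisations
tokens = ["love", "Love", "LOVE", "LeAvE", "live"]

def find(pattern, ignore_case):
    """Hypothetical helper: tokens that the pattern matches as a whole."""
    flags = re.IGNORECASE if ignore_case else 0
    return [t for t in tokens if re.fullmatch(pattern, t, flags) is not None]

# Analogue of [word="l[aeiou]+ve"%c]
with_flag = find("l[aeiou]+ve", ignore_case=True)

# Analogue of [word="l[aeiou]+ve"] (no flag): lowercase spellings only
without_flag = find("l[aeiou]+ve", ignore_case=False)

# Analogue of [word="[Ll][AEIOUaeiou]+[Vv][Ee]"] (no flag, explicit classes)
explicit = find("[Ll][AEIOUaeiou]+[Vv][Ee]", ignore_case=False)

print(with_flag)     # ['love', 'Love', 'LOVE', 'LeAvE', 'live']
print(without_flag)  # ['love', 'live']
print(explicit)      # ['love', 'Love', 'LOVE', 'LeAvE', 'live']
```

+The explicit character classes give the same result as the flag, but the pattern is much harder to read, which is why the flag is usually the better choice.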
+Instead of using the general-purpose quantifiers ''?'', ''*'' and ''+'', we can also specify exact numbers or ranges, using the notation with curly brackets that you have already seen in the preceding section.
+  * ''{n}'' -- exactly ''n'' occurrences (e.g. ''[word="o{3}h"%c]'' will return //oooh//)
+  * ''{min,}'' -- at least ''min'' occurrences (e.g. ''[word="o{3,}h"%c]'' will return //oooh// and all cases with more than three //o//'s)
+  * ''{,max}'' -- at most ''max'' occurrences (e.g. ''[word="o{,3}h"%c]'' will return //h//, //oh//, //ooh// and //oooh//)
+  * ''{min,max}'' -- between ''min'' and ''max'' occurrences (e.g. ''[word="o{3,6}h"%c]'' will return //oooh//, //ooooh//, //oooooh//, and //ooooooh//)
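+The four curly-bracket notations can be compared side by side in a short Python sketch. Python's ''re'' module happens to accept all four forms, but it is not CQP's regular expression engine, so treat this as an approximation. The helper ''matched_o_counts'' is hypothetical, and the tokens are generated rather than taken from the BNC.

```python
import re

# Generate tokens with 0, 1, 2, 3, 4, 5, 6 and 8 o's followed by an h
o_counts = [0, 1, 2, 3, 4, 5, 6, 8]
tokens = ["o" * n + "h" for n in o_counts]

def matched_o_counts(pattern):
    """Hypothetical helper: the numbers of o's in the tokens that the
    pattern matches as a whole, ignoring case (like %c)."""
    return [n for n, t in zip(o_counts, tokens)
            if re.fullmatch(pattern, t, re.IGNORECASE) is not None]

exactly_three = matched_o_counts("o{3}h")    # {n}
at_least_three = matched_o_counts("o{3,}h")  # {min,}
at_most_three = matched_o_counts("o{,3}h")   # {,max}
three_to_six = matched_o_counts("o{3,6}h")   # {min,max}

print(exactly_three)   # [3]
print(at_least_three)  # [3, 4, 5, 6, 8]
print(at_most_three)   # [0, 1, 2, 3]
print(three_to_six)    # [3, 4, 5, 6]
```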
+===== Grouping and alternatives =====
+In some cases, even a combination of character classes and quantifiers is not enough. For example, if we want to find all cases of the strings //love me// and //leave me//, we could use the following query:
+	[word="l[oea]{1,2}ve"%c] [word="me"%c]
+But what if we also wanted to find the string //fool me//? We could try to construct this query using character classes and quantifiers, like this:
+	[word="[lf][oea]{1,2}[vl]e?"%c] [word="me"%c]
+The query corresponds to “an //l// or //f//, followed by one or two occurrences of any of the characters //o//, //e// and //a//, followed by a //v// or an //l//, followed by zero or one occurrence(s) of the character //e//”. This will return the three strings we want, but it will also return the unwanted string //feel me//. In addition, the query is quite difficult to read.
+In such cases, it is easier to define a group of strings instead of classes of characters. This is done by enclosing the group of strings in parentheses and using the pipe symbol ''|'' to separate the strings from each other (it means “or”):
+	[word="(love|leave|fool)"%c] [word="me"%c]
+We can also use grouping inside a string. The following will find //love me// and //leave me//:
+	[word="l(o|ea)ve"%c] [word="me"%c]
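+To see why grouping is more precise than the character-class version, the first-token patterns from this section can be compared in Python. As before, this is only a sketch that assumes Python's ''re'' module behaves like CQP for these patterns; ''matches_token'' and the word list are made up for illustration.

```python
import re

def matches_token(pattern, token):
    """Hypothetical helper: whole-token, case-insensitive match (like %c)."""
    return re.fullmatch(pattern, token, re.IGNORECASE) is not None

# Made-up candidates for the first token of "... me"
tokens = ["love", "leave", "fool", "feel", "live"]

# Character classes and quantifiers: also lets "feel" through
with_classes = [t for t in tokens if matches_token("[lf][oea]{1,2}[vl]e?", t)]

# A group of whole strings separated by "|": exactly the three words we want
with_group = [t for t in tokens if matches_token("(love|leave|fool)", t)]

# A group inside a string: l, then "o" or "ea", then ve
inner_group = [t for t in tokens if matches_token("l(o|ea)ve", t)]

print(with_classes)  # ['love', 'leave', 'fool', 'feel']
print(with_group)    # ['love', 'leave', 'fool']
print(inner_group)   # ['love', 'leave']
```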
+===== Summary and outlook =====
+This section has shown you how to create complex regular expressions. As you read other sections, always think about how you might apply them. [[cqp:more-complex-queries|Section 5a]] will show you how to use regular expressions outside of individual tokens to create very complex queries.
+**[ Introduction to CQP: [[cqp:corpus-structure|Section 1]] -- [[cqp:simple-queries|Section 2]] -- [[cqp:advanced-querying|Section 3]] -- [[cqp:beyond-queries|Section 4]] -- [[cqp:expert-tricks|Section 5]] -- [[cqp:exercises|Section 6]] ]**