cqp:regular-expressions-basics
cqp:regular-expressions-basics [2023/11/08 09:53] – astefanowitsch | cqp:regular-expressions-basics [2024/06/20 13:53] (current) – external edit 127.0.0.1 | ||
+ | ====== 3e. Regular expressions (basics) ====== | ||
+ | |||
+ | //This section introduces regular expressions (sequences of symbols and characters that can describe classes of strings) and shows how they can be used inside the value of an attribute-value pair in CQP. It presupposes that you have read Sections [[cqp: | ||
+ | |||
+ | So far, the values we provided for the different attributes in our queries corresponded exactly to one string: '' | ||
+ | |||
+ | [word=" | ||
+ | [word=" | ||
+ | |||
+ | But note that the queries only differ by a single character! Would it not be great if there were a way of searching for a string consisting of an '' | ||
+ | |||
+ | This is where regular expressions come in: among other things, they allow us to specify a character //class// instead of a specific character. | ||
+ | |||
+ | ===== Character classes ===== | ||
+ | |||
+ | The largest character class is represented by the period: '' | ||
+ | |||
+ | [word=" | ||
+ | |||
+ | Try the query using the BNC, and you will see that it finds all instances of the two strings //love a little// and //live a little//. It would theoretically also find strings with other characters in the position of the period, such as //leve//, //lxve//, //l3ve// or //l;ve//, but these do not occur in the BNC. This character class is useful if you don't know what to expect in a particular position of a string. For example, you may be interested in what words differ from //love// in just the first character. The query '' | ||
+ | |||
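The behaviour of the period can also be tried out outside CQP. The following is a minimal sketch in Python, assuming that Python's ''re'' module treats this simple pattern the same way CQP does. ''re.fullmatch'' is used because a regular expression in a CQP attribute value has to match the entire token. The token list is invented for illustration and is not taken from the BNC.

```python
import re

# Hypothetical tokens, invented for illustration (not taken from the BNC)
tokens = ["love", "live", "leve", "l3ve", "l;ve", "leave", "lve"]

# The period stands for exactly one arbitrary character
pattern = re.compile(r"l.ve")

# fullmatch requires the whole token to match the pattern
matches = [t for t in tokens if pattern.fullmatch(t)]
print(matches)
```

Note that //leave// and //lve// are not matched: the period covers exactly one position, no more and no fewer.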
+ | In many cases, you will want to be more specific than this. In these cases, you can define your own class containing exactly the characters you want it to contain, by listing them between square brackets. For example, if you want to find all cases of //love// and //dove//, but not //move//, //cove//, //hove//, etc., you can define a class containing just the characters '' | ||
+ | |||
+ | [word=" | ||
+ | |||
+ | You can also define character classes negatively, by specifying what characters they should //not// contain. You do this by adding a caret ('' | ||
+ | |||
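The difference between a listed and a negated character class can be sketched in Python as follows. The patterns ''[ld]ove'' and ''[^ld]ove'' are written for this illustration based on the description above, and Python's ''re'' module is assumed to behave like CQP for these simple classes.

```python
import re

# Hypothetical tokens, chosen to mirror the words mentioned above
tokens = ["love", "dove", "move", "cove", "hove"]

# [ld] matches exactly one character that is either l or d
listed = [t for t in tokens if re.fullmatch(r"[ld]ove", t)]

# [^ld] matches exactly one character that is neither l nor d
excluded = [t for t in tokens if re.fullmatch(r"[^ld]ove", t)]

print(listed)
print(excluded)
```

The two result lists are complementary for this token list: every token ending in //ove// falls into exactly one of them.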
+ | Many versions of regular expressions, | ||
+ | |||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | |||
+ | You are unlikely to need the last two, since CQP does not allow whitespace within tokens, so none of the tokens in any of our corpora contain spaces, tabs, etc. | ||
+ | |||
+ | ===== Quantification ===== | ||
+ | |||
+ | Using character classes in your queries will often make things easier, but they are still constrained in one respect: they only apply to a single position in a string, whereas we may want to specify a sequence of characters. For example, we may want to find all cases of the interjection //oh// (as in //Oh, my love//). The //o// in this interjection is often repeated a number of times to indicate the length or intensity of the interjection; in the BNC, we find //oh//, //ooh//, //oooh//, //ooooh//, //oooooh//, // | ||
+ | |||
+ | Regular expressions provide ways of specifying how often a character (or character class) should occur. There are three general quantifiers: | ||
+ | |||
+ | * ''?'' | ||
+ | * '' | ||
+ | * '' | ||
+ | |||
+ | For example, the query '' | ||
+ | |||
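To see how the three standard quantifiers ''?'' (zero or one), ''*'' (zero or more) and ''+'' (one or more) differ, here is a small Python sketch applied to variants of //oh//. The helper ''matching'' and the token list are invented for this example, and Python's ''re'' module is assumed to handle these quantifiers like CQP does.

```python
import re

# Hypothetical tokens: the interjection with zero to three o's
tokens = ["h", "oh", "ooh", "oooh"]

def matching(pattern):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

zero_or_one = matching(r"o?h")   # the o is optional
zero_or_more = matching(r"o*h")  # any number of o's, including none
one_or_more = matching(r"o+h")   # at least one o

print(zero_or_one)
print(zero_or_more)
print(one_or_more)
```

Only ''o+h'' excludes the bare //h//, which is why the plus sign is usually the right choice when a character must be present at least once.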
+ | These quantifiers can also be applied to character classes. For example, the following query would find all words that begin with an //l// and end with the sequence //ve// (for example, //love//, //live//, //leave// and // | ||
+ | |||
+ | [word=" | ||
+ | |||
+ | If we want to be more specific and just find words starting with //l// followed by one or more vowels followed by //ve//, we can combine the '' | ||
+ | |||
+ | [word=" | ||
+ | |||
+ | This will give us //love//, //live//, //leave// and a few others. | ||
+ | |||
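The effect of restricting the quantified class to vowels can be sketched in Python. The two patterns below, ''l.*ve'' and ''l[aeiou]+ve'', are written from the descriptions above and may differ in detail from the queries used in this section; the token list is invented.

```python
import re

# Hypothetical tokens, invented for illustration
tokens = ["love", "live", "leave", "locative", "glove", "lover"]

# l, then any sequence of characters, then ve
any_middle = [t for t in tokens if re.fullmatch(r"l.*ve", t)]

# l, then one or more vowels, then ve
vowel_middle = [t for t in tokens if re.fullmatch(r"l[aeiou]+ve", t)]

print(any_middle)
print(vowel_middle)
```

//locative// is matched only by the first pattern, because the material between //l// and //ve// contains consonants. //glove// and //lover// are matched by neither, because the whole token has to fit the pattern.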
+ | Note that regular expressions are case-sensitive. The above queries will find upper- and lowercase versions of the specified strings because they use the '' | ||
+ | |||
+ | [word=" | ||
+ | |||
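Case sensitivity can be illustrated in Python as well. In this sketch, Python's ''re.IGNORECASE'' flag serves as a stand-in for case-insensitive matching; it is not CQP syntax, and the token list is invented.

```python
import re

# Hypothetical tokens differing only in capitalisation, plus one other word
tokens = ["love", "Love", "LOVE", "live"]

# Without a flag, the pattern matches only the exact lowercase string
sensitive = [t for t in tokens if re.fullmatch(r"love", t)]

# With IGNORECASE, upper- and lowercase variants are treated alike
insensitive = [t for t in tokens if re.fullmatch(r"love", t, flags=re.IGNORECASE)]

print(sensitive)
print(insensitive)
```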
+ | Instead of using the general-purpose quantifiers ''?'', | ||
+ | |||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | |||
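Counted quantifiers can be sketched in Python using the curly-brace syntax common to Perl-style regular expressions. The three forms shown here (''{n}'', ''{n,}'' and ''{n,m}'') are assumed to work in CQP as they do in Python's ''re'' module; the token list is invented.

```python
import re

# Hypothetical tokens: the interjection with one to four o's
tokens = ["oh", "ooh", "oooh", "ooooh"]

def matching(pattern):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

exactly_two = matching(r"o{2}h")     # exactly two o's
at_least_two = matching(r"o{2,}h")   # two or more o's
one_to_three = matching(r"o{1,3}h")  # between one and three o's

print(exactly_two)
print(at_least_two)
print(one_to_three)
```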
+ | ===== Grouping and alternatives ===== | ||
+ | |||
+ | In some cases, even a combination of character classes and quantifiers is not enough. For example, if we want to find all cases of the strings //love me// and //leave me//, we could use the following query: | ||
+ | |||
+ | [word=" | ||
+ | |||
+ | But what if we also wanted to find the string //fool me//? We could try to construct this query using character classes and quantifiers, | ||
+ | |||
+ | [word=" | ||
+ | |||
+ | The query corresponds to “an //l// or //f//, followed by one or two occurrences of any of the characters //o//, //e// and //a//, followed by a //v// or an //l//, followed by zero or one occurrence(s) of the character //e//”. This will return the three strings you want, but also the string //feel me//; also, it is quite difficult to read. | ||
+ | |||
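The overgeneration described above can be checked with a short Python sketch. The pattern ''[lf][oea]{1,2}[vl]e?'' is reconstructed here from the verbal description of the query, so it is an assumption about its exact form; the token list is invented.

```python
import re

# Pattern reconstructed from the verbal description (an assumption):
# l or f, then one or two of o/e/a, then v or l, then an optional e
pattern = re.compile(r"[lf][oea]{1,2}[vl]e?")

# Hypothetical tokens: the three intended words plus two others
tokens = ["love", "leave", "fool", "feel", "live"]

matches = [t for t in tokens if pattern.fullmatch(t)]
print(matches)
```

//feel// is matched although it was not intended, while //live// is not, because //i// is not part of the vowel class.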
+ | In such cases, it is easier to define a group of strings instead of classes of characters. This is done by enclosing the group of strings in parentheses and using the pipe symbol '' | ||
+ | |||
+ | [word=" | ||
+ | |||
+ | We can also use grouping inside a string. The following will find //love me// and //leave me//: | ||
+ | |||
+ | [word=" | ||
+ | |||
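Both uses of grouping can be tried out in Python. The patterns ''(love|leave|fool)'' and ''l(o|ea)ve'' are written for this sketch from the descriptions above, and Python's ''re'' module is assumed to treat parentheses and the pipe symbol like CQP does.

```python
import re

# Hypothetical tokens, invented for illustration
tokens = ["love", "leave", "fool", "feel", "live"]

# A group listing whole strings as alternatives
whole_words = [t for t in tokens if re.fullmatch(r"(love|leave|fool)", t)]

# A group inside a string: l, then either o or ea, then ve
inner_group = [t for t in tokens if re.fullmatch(r"l(o|ea)ve", t)]

print(whole_words)
print(inner_group)
```

Unlike the version built from character classes and quantifiers, the group of alternatives does not match //feel//, and it is much easier to read.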
+ | ===== Summary and outlook ===== | ||
+ | |||
+ | This section has shown you how to create complex regular expressions. As you read other sections, always think about how you might apply them. [[cqp: | ||
+ | |||