--- corpora:historical [2020/06/23 22:20] – [The Penn Corpora] kmiddeke
+++ corpora:historical [2024/06/20 13:53] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
+====== Using Corpora in Historical Linguistics ======
+===== Available corpora =====
+==== The Penn Corpora ====
+=== Resources ===
+  * {{ :corpora:penn-cheatsheet.pdf |}} (created by Alhadji Jallow, Jan Reimer and Georg Hartisch in 2017, used with permission)
+  * {{ :corpora:penn-tagset.pdf |}}
+  * {{ :corpora:exercises_penn-corpora.pdf |}}
+=== About ===
+The Penn corpora are:
+  * The **PPCME2** (Kroch, Anthony & Ann Taylor. 2000. //The Penn-Helsinki Parsed Corpus of **Middle English**//. Department of Linguistics, University of Pennsylvania.)
+  * The **PPCEME** (Kroch, Anthony, Beatrice Santorini, and Lauren Delfs. 2004. //The Penn-Helsinki Parsed Corpus of **Early Modern English**//. Department of Linguistics, University of Pennsylvania.)
+  * The **PPCEEC** (Nurmi, Arja, Ann Taylor, Anthony Warner, Susan Pintzuk and Terttu Nevalainen. 2006. //Parsed Corpus of **Early English Correspondence**//. York: University of York and Helsinki: University of Helsinki.)
+  * The **PPCMBE** (**Modern British English**)
+=== Notes ===
+A major advantage of the Penn corpora is that you can use exactly the same queries for all of them, which makes results directly comparable. There are a few things to watch out for, however:
+When working with historical corpora, it is especially useful to work with POS tags, since these corpora are **not lemmatized** and the texts follow **no standard orthography**. So, whenever possible, use POS tags, e.g. to find forms of the auxiliary //do//. If that is not possible, consult the OED to get an idea of the possible spelling variants of the words you are interested in.
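As a rough illustration of the spelling-variant approach, the Python sketch below combines a list of spellings into a single regular expression of the kind a word-form query could use. The function name `variants_to_regex` and the variant list are assumptions made for this example; they are not taken from the OED or from the corpora, and a real list would come from the OED entry for the word in question.

```python
import re


def variants_to_regex(variants):
    """Join spelling variants into one alternation, longest first."""
    # Drop duplicates, put longer forms first, and escape any
    # characters that have a special meaning in regular expressions.
    unique = sorted(set(variants), key=lambda v: (-len(v), v))
    return "(" + "|".join(re.escape(v) for v in unique) + ")"


# Hypothetical variant list for the verb "do"; check the OED for real ones.
do_variants = ["do", "doe", "doo", "don", "doth", "doeth"]
pattern = variants_to_regex(do_variants)
print(pattern)

# fullmatch() only accepts complete word forms, so "dog" is rejected.
print(bool(re.fullmatch(pattern, "doeth")))
print(bool(re.fullmatch(pattern, "dog")))
```

Whether the resulting pattern can be pasted into a query unchanged depends on the regular-expression dialect of the query interface, so treat it as a starting point rather than a finished query.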
+If you need to restrict your query to a specific sub-corpus, remember that the command is [yourquery]**::match.text_**[anything]="[anything]" for the PPCME2, the PPCEME and the PPCMBE, but [yourquery]**::match.letter_**[anything]="[anything]" for the PPCEEC. The available text attributes can be looked up in the cheatsheet (forthcoming). Sub-period labels (M4, E1, E2 etc.) are sometimes capitalized and sometimes not, so try both if you get no results. The PPCMBE can be restricted to centuries (18th/19th).
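The restriction pattern described above can be sketched as a small Python helper that assembles the query string and produces both capitalizations of a sub-period label. This is only an illustration: the function names, the attribute name `period` and the example query `[pos="DOD"]` are assumptions, so look up the real attribute names in the cheatsheet before using the output.

```python
# Corpora whose restriction attributes start with "letter_"; the other
# Penn corpora named in the text use "text_".
LETTER_CORPORA = {"PPCEEC"}


def restrict(query, corpus, attribute, value):
    """Append a ::match restriction to a query string."""
    prefix = "letter" if corpus.upper() in LETTER_CORPORA else "text"
    return f'{query}::match.{prefix}_{attribute}="{value}"'


def case_variants(query, corpus, attribute, value):
    """Return the restricted query with upper- and lower-case values."""
    # dict.fromkeys() removes the duplicate when the value has no case,
    # e.g. a purely numeric label.
    values = dict.fromkeys((value.upper(), value.lower()))
    return [restrict(query, corpus, attribute, v) for v in values]


# "period" is a hypothetical attribute name and [pos="DOD"] a
# hypothetical query; replace both with real ones.
for q in case_variants('[pos="DOD"]', "PPCEME", "period", "E1"):
    print(q)
```

Running both variants one after the other is a simple way to follow the advice about capitalization without having to remember which corpus uses which form.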
+=== Related corpora ===
+Other corpora with broadly the same tagset include:
+  * the **[[https://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm|YCOE]]** (Taylor, Ann, Anthony Warner, Susan Pintzuk and Frank Beths. 2003. //The York-Toronto-Helsinki Parsed Corpus of **Old English Prose**//. Department of Language and Linguistic Science, University of York.)
+  * the **[[https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2425|YCOE-P]]** (Pintzuk, Susan & Leendert Plug. 2001. //The York-Helsinki Parsed Corpus of **Old English Poetry**//. http://ota.ox.ac.uk/; http://www-users.york.ac.uk/~lang18/pcorpus.html.)
+  * the **[[https://pcmep.net/links.php|PCMEP]]** (Zimmermann, Richard. //The Parsed Corpus of **Middle English Poetry**//.)
+  * the **[[http://www.chlg.ac.uk/helipad/index.html|HeliPaD]]** (Walkden, George. 2015. //HeliPaD: the Heliand Parsed Database// [the Corpus of **Historical [i.e. Old] Low German**]. Version 0.9.)
+  * the **[[http://www.linguist.is/icelandic_treebank|IcePaHC]]** (Wallenberg, Joel C., Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. //**[Old Norse-]Icelandic** Parsed Historical Corpus//. Version 0.9.)
+Please contact your lecturer for information on how to access and use these corpora, should you be interested.