cqp:cleaning-output
cqp:cleaning-output [2020/06/25 10:48] – [FTP software] astefanowitsch | cqp:cleaning-output [2024/06/20 13:53] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== 5d. Tidying up the output ====== | ||
+ | The concordances and the output created by the '' | ||
+ | |||
+ | ===== Concordances ===== | ||
+ | |||
+ | Let us assume you have created a concordance of the lemma //love// in the BNC and [[cqp: | ||
+ | |||
+ | cat Love > " | tidycwb.pl > love.csv" | ||
+ | |||
+ | The script will create a CSV file with the corpus name in the first column, the corpus position in the second column, followed by any [[cqp: | ||
+ | |||
+ | 2233: <text_id A00>< | ||
+ | 8920: <text_id A01>< | ||
+ | 13733: <text_id A01>< | ||
+ | 15915: <text_id A01>< | ||
+ | 38084: <text_id A03>< | ||
+ | 42738: <text_id A04>< | ||
+ | 47797: <text_id A04>< | ||
+ | 60042: <text_id A04>< | ||
+ | 61421: <text_id A04>< | ||
+ | 69613: <text_id A04>< | ||
+ | |||
+ | In contrast, the tidied concordance now looks like this: | ||
+ | |||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | |||
+ | When imported into a spreadsheet program, this file is displayed as follows; you can now add further columns for your own annotation of the hits: | ||
+ | |||
+ | | BNC | 2233 | A00 | unknown | s have provided much | love | and care to many hu | | ||
+ | | BNC | 8920 | A01 | unknown | ‘ I think I 'm in | love | … ’ ‘ How do | | ||
+ | | BNC | 13733 | A01 | unknown | stress to those you | love | most . Not to have | | ||
+ | | BNC | 15915 | A01 | unknown | demonstration of the | love | of Jesus shown by y | | ||
+ | | BNC | 38084 | A03 | mixed | oes , the people all | love | the King so much it | | ||
+ | | BNC | 42738 | A04 | male | tion and imaginative | loves | , the return of the | | ||
+ | | BNC | 47797 | A04 | male | ything , has neither | love | nor hate , and volu | | ||
+ | | BNC | 60042 | A04 | male | ted on it because he | loved | it , and he thereby | | ||
+ | | BNC | 61421 | A04 | male | . The reader with a | love | of art is not alway | | ||
+ | | BNC | 69613 | A04 | male | < | ||
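If you prefer to prepare the annotation column in a script rather than in the spreadsheet program, the following is a minimal Python sketch. It assumes the tidied file is comma-separated with every field quoted; that layout is an assumption, not something confirmed by the output shown here. The helper name `add_annotation_column` is hypothetical, and the two sample rows are copied from the spreadsheet view above.

```python
import csv
import io


def add_annotation_column(rows, default=""):
    """Return a copy of rows with one extra (empty) cell appended to each."""
    return [list(row) + [default] for row in rows]


# Two sample rows modelled on the spreadsheet view above.
# ASSUMPTION: comma-separated, all fields quoted.
sample = io.StringIO(
    '"BNC","2233","A00","unknown","s have provided much","love","and care to many hu"\n'
    '"BNC","13733","A01","unknown","stress to those you","love","most . Not to have"\n'
)

rows = list(csv.reader(sample))
annotated = add_annotation_column(rows)

# Write the rows back out with the same quoting style.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL).writerows(annotated)
print(out.getvalue())
```

For a real file, replace the `io.StringIO` objects with `open("love.csv", newline="")` for reading and a second file opened for writing.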
+ | |||
+ | ===== Frequency lists ===== | ||
+ | |||
+ | Let us assume you have created a concordance of the lemma //love// in the BNC and you want to create and save a frequency list of the word forms. Again, instead of [[cqp: | ||
+ | |||
+ | count Love by word > " | tidycwb.pl > love.csv" | ||
+ | |||
+ | The script will create a frequency list with the word form in the first column and the frequency in the second column. If you had saved the output directly, it would have looked like this: | ||
+ | |||
+ | 20160 | ||
+ | 4253 loved [# | ||
+ | 1969 Love [# | ||
+ | 1295 loves [# | ||
+ | 463 | ||
+ | 186 | ||
+ | 51 Loved [# | ||
+ | 41 Loves [# | ||
+ | 40 Loving | ||
+ | 9 | ||
+ | 7 | ||
+ | 5 | ||
+ | 1 | ||
+ | |||
+ | In contrast, the tidied frequency list looks like this: | ||
+ | |||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | |||
+ | Or, imported into a spreadsheet: | ||
+ | |||
+ | | love | 20160 | | ||
+ | | loved | 4253 | | ||
+ | | Love | 1969 | | ||
+ | | loves | 1295 | | ||
+ | | loving | 463 | | ||
+ | | LOVE | 186 | | ||
+ | | Loved | 51 | | ||
+ | | Loves | 41 | | ||
+ | | Loving | 40 | | ||
+ | | LOVED | 9 | | ||
+ | | LOVING | 7 | | ||
+ | | LOVES | 5 | | ||
+ | | lovest | 1 | | ||
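The column swap described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of ''tidycwb.pl''. It assumes that each raw line consists of the frequency, whitespace, the word form, and an optional bracketed remainder that can be ignored; the function name `tidy_count_lines` is hypothetical, and the bracketed part of the sample lines is a placeholder.

```python
import re

# ASSUMPTION: raw lines look like "<frequency> <word form> [#...]".
RAW_LINE = re.compile(r"^\s*(\d+)\s+(\S+)")


def tidy_count_lines(lines):
    """Turn raw frequency lines into (word form, frequency) pairs."""
    rows = []
    for line in lines:
        match = RAW_LINE.match(line)
        if match:  # skip lines that do not fit the assumed layout
            rows.append((match.group(2), int(match.group(1))))
    return rows


raw = [
    "4253  loved  [#...]",
    "1969  Love   [#...]",
    "not a frequency line",
]

for form, freq in tidy_count_lines(raw):
    print(f'"{form}","{freq}"')
```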
+ | |||
+ | ===== Output of the '' | ||
+ | |||
+ | Let us assume you have created a concordance of the lemma //love// in the BNC and you want to group the part of speech (using the '' | ||
+ | |||
+ | group Love match class by match text_mode | ||
+ | |||
+ | If you had saved the output directly, it would have looked like this: | ||
+ | |||
+ | # | ||
+ | written | ||
+ | VERB 12364 | ||
+ | spoken | ||
+ | SUBST 1190 | ||
+ | --- | ||
+ | VERB 12 | ||
+ | spoken | ||
+ | written | ||
+ | |||
+ | Instead, the tidied output looks like this: | ||
+ | |||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | " | ||
+ | |||
+ | Or, imported into a spreadsheet: | ||
+ | |||
+ | | written | SUBST | 13041 | | ||
+ | | written | VERB | 12364 | | ||
+ | | spoken | VERB | 1831 | | ||
+ | | spoken | SUBST | 1190 | | ||
+ | | --- | SUBST | 39 | | ||
+ | | --- | VERB | 12 | | ||
+ | | spoken | UNC | 2 | | ||
+ | | written | ADJ | 1 | | ||
+ | |||
+ | Note that the default output does not repeat the value in the first column when it is the same as in the previous row, so you cannot sort it without detaching rows from their group. The tidied output repeats the first-column value in every row, so you can sort it without losing any information. | ||
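The effect of repeating the first column can be illustrated with a short "fill-down" sketch in Python. It is not taken from ''tidycwb.pl''; the function name `fill_down` is hypothetical, and the input rows assume that an omitted first-column value has already been read as an empty string. The values come from the table above.

```python
def fill_down(rows):
    """Copy the last non-empty first-column value into rows that omit it."""
    filled = []
    current = None
    for first, second, count in rows:
        if first:  # a new group starts here
            current = first
        filled.append((current, second, count))
    return filled


# ASSUMPTION: "" stands for a first column left blank in the default output.
raw_rows = [
    ("written", "SUBST", 13041),
    ("", "VERB", 12364),
    ("spoken", "VERB", 1831),
    ("", "SUBST", 1190),
]

tidy_rows = fill_down(raw_rows)

# Sorting by part of speech is now safe: every row still names its text mode.
for row in sorted(tidy_rows, key=lambda r: r[1]):
    print(row)
```

Sorting `raw_rows` in the same way would leave two rows with an empty first column and no indication of which text mode they belong to.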
+ | |||