cqp:cleaning-output

cqp:cleaning-output [2020/06/25 11:11] – removed astefanowitsch
cqp:cleaning-output [2024/06/20 13:53] (current) – external edit 127.0.0.1
 +====== 5d. Tidying up the output ======
  
 +The concordances and the output of the ''count'' and ''group'' commands in CQP can be saved to text files and viewed in a text editor, but you will often need a more structured format that can be imported into spreadsheet programs (such as LibreOffice Calc, MS Excel or Apple Numbers) or into statistics programs such as R. For this purpose, all three types of output can be converted to CSV files using ''tidycwb.pl'', a small program we provide as part of INLET. Whichever type of output you are dealing with, you can send it to this program before saving it to a file; the program recognizes the type of output and converts it to the matching CSV layout described in the sections below.
 +
 +===== Concordances =====
 +
 +Let us assume you have created a concordance of the lemma //love// in the BNC and [[cqp:concordances#storing_concordances_internally|saved it in a variable called ''Love'']]. Instead of [[cqp:concordances#exporting_a_concordance_to_an_external_file|saving it directly]], you can send it to the script ''tidycwb.pl'' using the ''|'' operator and then save the result:
 +
 + cat Love > " | tidycwb.pl > love.csv"
 +
 +The script will create a CSV file with the corpus name in the first column and the corpus position in the second column, followed by any [[cqp:metadata|metadata you have displayed using PrintStructures]], each in its own column, and finally the left context, the hit and the right context, again each in its own column. For example, if you have activated the PrintStructures ''text_id'' and ''text_author_sex'', the regular concordance would look like this:
 +
 + 2233: <text_id A00><text_author_sex unknown>: s have provided much <love> and care to many hu
 + 8920: <text_id A01><text_author_sex unknown>:  ‘ I think I 'm in <love> … ’ ‘ How do 
 + 13733: <text_id A01><text_author_sex unknown>:  stress to those you <love> most . Not to have 
 + 15915: <text_id A01><text_author_sex unknown>: demonstration of the <love> of Jesus shown by y
 + 38084: <text_id A03><text_author_sex mixed>: oes , the people all <love> the King so much it
 + 42738: <text_id A04><text_author_sex male>: tion and imaginative <loves> , the return of the
 + 47797: <text_id A04><text_author_sex male>: ything , has neither <love> nor hate , and volu
 + 60042: <text_id A04><text_author_sex male>: ted on it because he <loved> it , and he thereby
 + 61421: <text_id A04><text_author_sex male>:  . The reader with a <love> of art is not alway
 + 69613: <text_id A04><text_author_sex male>: ’ John Constable 's <love> for the work of van
 +
 +In contrast, the tidied concordance now looks like this:
 +
 + "BNC","2233","A00","unknown","s have provided much","love","and care to many hu"
 + "BNC","8920","A01","unknown","‘ I think I 'm in","love","… ’ ‘ How do "
 + "BNC","13733","A01","unknown","stress to those you","love","most . Not to have "
 + "BNC","15915","A01","unknown","demonstration of the","love","of Jesus shown by y"
 + "BNC","38084","A03","mixed","oes , the people all","love","the King so much it"
 + "BNC","42738","A04","male","tion and imaginative","loves",", the return of the"
 + "BNC","47797","A04","male","ything , has neither","love","nor hate , and volu"
 + "BNC","60042","A04","male","ted on it because he","loved","it , and he thereby"
 + "BNC","61421","A04","male",". The reader with a","love","of art is not alway"
 + "BNC","69613","A04","male","’ John Constable 's","love","for the work of van"
 +
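To make the column mapping described above concrete, here is a minimal Python sketch of the kind of conversion involved. It is not the actual implementation of ''tidycwb.pl''; the regular expression, the function names ''concordance_row'' and ''to_csv'', and the default corpus name are assumptions made for illustration. The sketch only handles lines laid out like the example above (corpus position, optional PrintStructures, left context, hit in angle brackets, right context).

```python
import csv
import io
import re

# Assumed layout of one raw line: "POS: <attr value>...: LEFT <HIT> RIGHT"
RAW_LINE = re.compile(
    r"^\s*(?P<pos>\d+):\s+"
    r"(?:(?P<structs>(?:<[^<>]+>)+):\s)?"
    r"(?P<left>.*?)\s*<(?P<hit>[^<>]+)>\s?(?P<right>.*)$"
)


def concordance_row(line, corpus="BNC"):
    """Split one raw concordance line into the columns of the tidied file."""
    m = RAW_LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized concordance line: {line!r}")
    # Each "<name value>" pair contributes only its value to the row.
    values = [
        pair.split(" ", 1)[1] if " " in pair else ""
        for pair in re.findall(r"<([^<>]+)>", m.group("structs") or "")
    ]
    return [corpus, m.group("pos"), *values,
            m.group("left").strip(), m.group("hit"), m.group("right")]


def to_csv(lines, corpus="BNC"):
    """Return the tidied rows as CSV text with every field quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        writer.writerow(concordance_row(line, corpus))
    return buffer.getvalue()


raw = [
    "2233: <text_id A00><text_author_sex unknown>: s have provided much <love> and care to many hu",
    "42738: <text_id A04><text_author_sex male>: tion and imaginative <loves> , the return of the",
]
print(to_csv(raw), end="")
```

Quoting every field, as the tidied file does, keeps commas inside the context columns (as in the second line) from being mistaken for column separators.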
 +When imported into a spreadsheet program, this file is displayed as follows. You can now add further columns for your own annotation of the hits:
 +
 +| BNC | 2233 | A00 | unknown | s have provided much | love | and care to many hu |
 +| BNC | 8920 | A01 | unknown | ‘ I think I 'm in | love | … ’ ‘ How do  |
 +| BNC | 13733 | A01 | unknown | stress to those you | love | most . Not to have  |
 +| BNC | 15915 | A01 | unknown | demonstration of the | love | of Jesus shown by y |
 +| BNC | 38084 | A03 | mixed | oes , the people all | love | the King so much it |
 +| BNC | 42738 | A04 | male | tion and imaginative | loves | , the return of the |
 +| BNC | 47797 | A04 | male | ything , has neither | love | nor hate , and volu |
 +| BNC | 60042 | A04 | male | ted on it because he | loved | it , and he thereby |
 +| BNC | 61421 | A04 | male | . The reader with a | love | of art is not alway |
 +| BNC | 69613 | A04 | male | ’ John Constable 's | love | for the work of van |
 +
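If you prefer to annotate the hits in a script rather than a spreadsheet, the tidied file can be read with any CSV library. The following Python sketch is one possible way to do this. The column names in ''COLUMNS'' are assumptions (the tidied file shown above has no header row), the ''inflected'' annotation is purely hypothetical, and an in-memory string stands in for the file ''love.csv''.

```python
import csv
import io

# Stand-in for: open("love.csv", newline="", encoding="utf-8")
tidied = io.StringIO(
    '"BNC","2233","A00","unknown","s have provided much","love","and care to many hu"\n'
    '"BNC","60042","A04","male","ted on it because he","loved","it , and he thereby"\n'
)

# Assumed column names; the tidied file itself carries no header row.
COLUMNS = ["corpus", "position", "text_id", "author_sex", "left", "hit", "right"]

rows = [dict(zip(COLUMNS, record)) for record in csv.reader(tidied)]

for row in rows:
    # Hypothetical annotation: flag hits whose form differs from the lemma.
    row["inflected"] = "yes" if row["hit"].lower() != "love" else "no"

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=COLUMNS + ["inflected"],
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue(), end="")
```

Writing the annotated rows back out with a header row makes the added column self-describing when the file is opened again later.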
 +===== Frequency lists =====
 +
 +Let us assume you have created a concordance of the lemma //love// in the BNC and want to create and save a frequency list of its word forms. Again, instead of [[cqp:concordances#exporting_a_concordance_to_an_external_file|saving it directly]], you can send it to the script ''tidycwb.pl'' using the ''|'' operator and then save the result:
 +
 + count Love by word > " | tidycwb.pl > love.csv"
 +
 +The script will create a CSV file with the word form in the first column and its frequency in the second column. Saved directly, the output would have looked like this:
 +
 + 20160   love  [#2308-#22467]
 + 4253    loved  [#22468-#26720]
 + 1969    Love  [#207-#2175]
 + 1295    loves  [#26721-#28015]
 + 463     loving  [#28017-#28479]
 + 186     LOVE  [#0-#185]
 + 51      Loved  [#2176-#2226]
 + 41      Loves  [#2227-#2267]
 + 40      Loving  [#2268-#2307]
 + 9       LOVED  [#186-#194]
 + 7       LOVING  [#200-#206]
 + 5       LOVES  [#195-#199]
 + 1       lovest  [#28016]
 +
 +In contrast, the tidied frequency list looks like this:
 +
 + "love",20160
 + "loved",4253
 + "Love",1969
 + "loves",1295
 + "loving",463
 + "LOVE",186
 + "Loved",51
 + "Loves",41
 + "Loving",40
 + "LOVED",9
 + "LOVING",7
 + "LOVES",5
 + "lovest",1
 +
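The difference between the two formats can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions, not the code of ''tidycwb.pl'': the pattern ''COUNT_LINE'' and the function ''frequency_rows'' are hypothetical names, and the sketch assumes that every raw line consists of a frequency, the counted item, and a bracketed match range as in the example above.

```python
import csv
import io
import re

# Assumed layout of one raw line: "FREQ   ITEM  [#FIRST-#LAST]" or "[#ONLY]"
COUNT_LINE = re.compile(
    r"^\s*(?P<freq>\d+)\s+(?P<item>.*?)\s+\[#\d+(?:-#\d+)?\]\s*$"
)


def frequency_rows(lines):
    """Yield (item, frequency) pairs, dropping the bracketed match ranges."""
    for line in lines:
        m = COUNT_LINE.match(line)
        if m:  # lines that do not look like count output are skipped
            yield m.group("item"), int(m.group("freq"))


raw = [
    "20160   love  [#2308-#22467]",
    "4253    loved  [#22468-#26720]",
    "1       lovest  [#28016]",
]

buffer = io.StringIO()
# QUOTE_NONNUMERIC quotes the word forms but leaves the integer counts bare.
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerows(frequency_rows(raw))
print(buffer.getvalue(), end="")
```

Leaving the frequencies unquoted, as in the tidied list above, lets most spreadsheet and statistics programs read them as numbers rather than text.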
 +Or, imported into a spreadsheet:
 +
 +| love | 20160 |
 +| loved | 4253 |
 +| Love | 1969 |
 +| loves | 1295 |
 +| loving | 463 |
 +| LOVE | 186 |
 +| Loved | 51 |
 +| Loves | 41 |
 +| Loving | 40 |
 +| LOVED | 9 |
 +| LOVING | 7 |
 +| LOVES | 5 |
 +| lovest | 1 |
 +
 +===== Output of the ''group'' command =====
 +
 +Let us assume you have created a concordance of the lemma //love// in the BNC and want to group the part of speech (using the ''class'' attribute) by the text mode. Again, instead of [[cqp:concordances#exporting_a_concordance_to_an_external_file|saving it directly]], you can send it to the script ''tidycwb.pl'' using the ''|'' operator and then save the result:
 +
 + group Love match class by match text_mode  > " | tidycwb.pl > love.csv"
 +
 +If you had saved the output directly, it would have looked like this:
 +
 + #---------------------------------------------------------------------
 + written                       SUBST                              13041
 +                               VERB                               12364
 + spoken                        VERB                                1831
 +                               SUBST                               1190
 + ---                           SUBST                                 39
 +                               VERB                                  12
 + spoken                        UNC                                    2
 + written                       ADJ                                    1
 +
 +Instead, the tidied output looks like this:
 +
 + "written","SUBST",13041
 + "written","VERB",12364
 + "spoken","VERB",1831
 + "spoken","SUBST",1190
 + "---","SUBST",39
 + "---","VERB",12
 + "spoken","UNC",2
 + "written","ADJ",1
 +
 +Or, imported into a spreadsheet, like this:
 +
 +| written | SUBST | 13041 |
 +| written | VERB | 12364 |
 +| spoken | VERB | 1831 |
 +| spoken | SUBST | 1190 |
 +| --- | SUBST | 39 |
 +| --- | VERB | 12 |
 +| spoken | UNC | 2 |
 +| written | ADJ | 1 |
 +
 +Note that the default output leaves the first column empty whenever its value is the same as in the preceding row. This means you cannot sort it, because sorting would separate those rows from the value they belong to. The tidied output repeats the first-column value in every row, so you can sort it without losing any information.
 +
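The "fill-down" step described in the note above can be sketched as follows in Python. This is a minimal illustration, not the implementation used by ''tidycwb.pl''; the function name ''group_rows'' is hypothetical, and the sketch assumes that neither column value contains whitespace and that separator lines start with ''#''.

```python
import csv
import io


def group_rows(lines):
    """Yield (first, second, frequency) with the first column filled down."""
    current = None  # most recent value seen in the first column
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and the "#-----" separator line
        fields = stripped.split()
        if len(fields) == 3:
            current, value, freq = fields
        elif len(fields) == 2 and current is not None:
            value, freq = fields  # first column was left empty: reuse it
        else:
            raise ValueError(f"unrecognized group line: {line!r}")
        yield current, value, int(freq)


raw = [
    "#---------------------------------------------------------------------",
    "written                       SUBST                              13041",
    "                              VERB                               12364",
    "spoken                        VERB                                1831",
    "                              SUBST                               1190",
]

rows = list(group_rows(raw))
# Sorting is safe now because every row carries its own first-column value.
rows_sorted = sorted(rows, key=lambda row: (row[0], -row[2]))

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerows(rows_sorted)
print(buffer.getvalue(), end="")
```

After sorting by text mode and descending frequency, each row still states which mode it belongs to, which would not be the case for the raw output.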
 +**[ Introduction to CQP: [[cqp:corpus-structure|Section 1]] -- [[cqp:simple-queries|Section 2]] -- [[cqp:advanced-querying|Section 3]] -- [[cqp:beyond-queries|Section 4]] -- [[cqp:expert-tricks|Section 5]] ]**
