In clause analysis, the grammatical structure of text is used to analyse ‘who did what to whom (according to whom)’, to adapt the classical quote from Harold Lasswell. From a user's point of view, clause analysis is called in AmCAT in the same way as the other analyses:

library(amcatr)
conn = amcat.connect("http://preview.amcat.nl")
sentence = "Mary told me that John loves her more than anything"
t = amcat.gettokens(conn, sentence=as.character(sentence), module="clauses_en")
t
##        word aid sentence coref  pos    lemma offset source_id source_role
## 1      Mary  NA        1     2  NNP     Mary      0         0      source
## 2      told  NA        1    NA  VBD     tell      5        NA            
## 3        me  NA        1    NA  PRP        I     10        NA            
## 4      that  NA        1    NA   IN     that     13         0       quote
## 5      John  NA        1    NA  NNP     John     18         0       quote
## 6     loves  NA        1    NA  VBZ     love     23         0       quote
## 7       her  NA        1     2 PRP$      she     29         0       quote
## 8      more  NA        1    NA  JJR     more     33         0       quote
## 9      than  NA        1    NA   IN     than     38         0       quote
## 10 anything  NA        1    NA   NN anything     43         0       quote
##    id pos1 clause_role clause_id
## 1   1    M                    NA
## 2   2    V                    NA
## 3   3    O                    NA
## 4   4    P   predicate         0
## 5   5    M     subject         0
## 6   6    V   predicate         0
## 7   7    O   predicate         0
## 8   8    A   predicate         0
## 9   9    P   predicate         0
## 10 10    N   predicate         0

As you can see in the result, this is essentially the output from the lemmatization with three extra sets of columns:

* source_id and source_role identify (quoted or paraphrased) sources. In this case, there is one quotation (source_id 0), with Mary being the source and ‘that … anything’ the quote.
* clause_id and clause_role perform a similar function: John is the subject of clause ‘0’, while ‘loving her more than anything’ is the predicate.
* Finally, coref indicates coreference: words with the same coreference id refer to the same person or entity. In this case, Mary and ‘her’ are correctly identified as co-referring.

Thus, the clause analysis breaks down the sentence into a nested structure, with the clause nested in the quotation. For clauses, the subject is the semantic agent or actor doing something, while the predicate is everything else, including the verb and the direct object, if applicable.
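To make this nesting concrete, here is a minimal base R sketch that reconstructs the quote and the clause subject by pasting together words that share a role. The token table below is hand-made to mirror the columns shown above; it is not actual AmCAT output.

```r
# Toy token table mimicking the columns shown above (hand-made, not AmCAT output)
toks = data.frame(
  word = c("Mary", "told", "me", "that", "John", "loves", "her"),
  source_role = c("source", NA, NA, "quote", "quote", "quote", "quote"),
  clause_role = c(NA, NA, NA, "predicate", "subject", "predicate", "predicate"),
  stringsAsFactors = FALSE
)
# reconstruct the quote: all words with source_role == "quote", in order
quote_text = paste(toks$word[!is.na(toks$source_role) & toks$source_role == "quote"],
                   collapse = " ")
quote_text
## [1] "that John loves her"
# the subject of the (single) clause
subject_text = paste(toks$word[!is.na(toks$clause_role) & toks$clause_role == "subject"],
                     collapse = " ")
subject_text
## [1] "John"
```

The same pattern (subsetting on a role column and pasting the words) works on the real token data frames used below.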

Since this data set is “just another” R data frame containing tokens, the techniques from the first part of the workshop are directly applicable. To show this, we can use the same amcat.gettokens command to get a data set containing the American coverage of the Gaza war:

# t3 = amcat.gettokens(conn, project=688, articleset=17667, module="clauses_en", page_size=100)

The above command will take quite a while to run, so I prepared the tokens in a file that you can download:

if (!file.exists("clauses.rda")) download.file("http://i.amcat.nl/clauses.rda", destfile="clauses.rda")
load("clauses.rda")

Let's have a look at (the beginning of) the second sentence of the first article:

head(t3[t3$sentence==2,], n=25)
##          word sentence pos     lemma offset      aid  id pos1 coref
## 89         ``        2  ``        ``    406 26074649  89    .    NA
## 90        The        2  DT       the    407 26074649  90    D     3
## 91    Israeli        2  JJ   israeli    411 26074649  91    A     3
## 92     attack        2  NN    attack    419 26074649  92    N     3
## 93         on        2  IN        on    426 26074649  93    P     3
## 94       Gaza        2 NNP      Gaza    429 26074649  94    M     3
## 95         is        2 VBZ        be    434 26074649  95    V    NA
## 96        far        2  RB       far    437 26074649  96    B    NA
## 97       from        2  IN      from    441 26074649  97    P    NA
## 98          a        2  DT         a    446 26074649  98    D    NA
## 99     simple        2  JJ    simple    448 26074649  99    A    NA
## 100 operation        2  NN operation    455 26074649 100    N    NA
## 101        to        2  TO        to    465 26074649 101    ?    NA
## 102      stop        2  VB      stop    468 26074649 102    V    NA
## 103  homemade        2  NN  homemade    473 26074649 103    N     2
## 104   rockets        2 NNS    rocket    482 26074649 104    N     2
## 105     being        2 VBG        be    490 26074649 105    V     2
## 106     fired        2 VBN      fire    496 26074649 106    V     2
## 107      into        2  IN      into    502 26074649 107    P     2
## 108    Israel        2 NNP    Israel    507 26074649 108    M     2
## 109         ,        2   ,         ,    513 26074649 109    .    NA
## 110        ''        2  ''        ''    514 26074649 110    .    NA
## 111    writes        2 VBZ     write    516 26074649 111    V    NA
## 112    Philip        2 NNP    Philip    523 26074649 112    M     5
## 113   Giraldi        2 NNP   Giraldi    530 26074649 113    M     5
##     clause_role clause_id source_id source_role freq israel palest
## 89                     NA        NA                1  FALSE  FALSE
## 90      subject         9         1       quote    1  FALSE  FALSE
## 91      subject         9         1       quote    1   TRUE  FALSE
## 92      subject         9         1       quote    1  FALSE  FALSE
## 93                     NA        NA                1  FALSE  FALSE
## 94      subject         9         1       quote    1  FALSE  FALSE
## 95    predicate         9         1       quote    1  FALSE  FALSE
## 96    predicate         9         1       quote    1  FALSE  FALSE
## 97                     NA        NA                1  FALSE  FALSE
## 98    predicate         9         1       quote    1  FALSE  FALSE
## 99    predicate         9         1       quote    1  FALSE  FALSE
## 100   predicate         9         1       quote    1  FALSE  FALSE
## 101   predicate         9         1       quote    1  FALSE  FALSE
## 102   predicate         9         1       quote    1  FALSE  FALSE
## 103   predicate         9         1       quote    1  FALSE  FALSE
## 104   predicate         9         1       quote    1  FALSE  FALSE
## 105   predicate         9         1       quote    1  FALSE  FALSE
## 106   predicate         9         1       quote    1  FALSE  FALSE
## 107                    NA        NA                1  FALSE  FALSE
## 108   predicate         9         1       quote    1   TRUE  FALSE
## 109                    NA        NA                1  FALSE  FALSE
## 110                    NA        NA                1  FALSE  FALSE
## 111                    NA        NA                1  FALSE  FALSE
## 112                    NA         1      source    1  FALSE  FALSE
## 113                    NA         1      source    1  FALSE  FALSE

As you can see, Philip Giraldi is correctly identified as a source, and his quote contains a single clause, with “the Israeli attack” as the subject and “is far from … into Israel” as the predicate. This illustrates some of the possibilities and limitations of the method. It correctly identifies the main argument of the sentence: according to Philip Giraldi, Israel is, among other things, trying to stop rockets fired into Israel. It does not, however, analyse ‘the Israeli attack on Gaza’ as a clause of its own (Israel attacking Gaza), since the mechanism depends on verb structure and that phrase does not contain a verb. Moreover, the problem of understanding complex or even subtle messages, such as the attack being “far from” only about stopping rockets, is no closer to a solution. That said, this analysis can solve a basic problem in conflict coverage: co-occurrence methods are difficult to apply because most documents talk about both sides, which requires analysing who does what to whom.

To showcase how this output can be analysed with the same techniques as discussed above, let's look at the predicates for which Israel and Hamas are the subject, respectively. First, we define a variable indicating whether a token is indicative of either actor using a simplistic pattern; then we select all clause ids that have Israel as their subject; and finally we select all predicates that match those clause ids. (This looks and sounds more complex than it is.)

# flag tokens whose lemma suggests Israel (simplistic pattern)
t3$israel = grepl("israel.*|idf", t3$lemma, ignore.case = T)
# clause ids that have at least one 'Israel' token as subject
clauses.israel = unique(t3$clause_id[t3$israel & !is.na(t3$clause_role) & t3$clause_role == "subject"])
# all predicate tokens belonging to those clauses
predicates.israel = t3[!is.na(t3$clause_role) & t3$clause_role == "predicate" & t3$clause_id %in% clauses.israel, ]
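The selection logic can be checked on a small hand-made example (toy values, not the Gaza data): a clause whose subject contains an ‘Israel’ token pulls in all of that clause's predicate tokens, while other clauses are left out.

```r
# Toy table (hypothetical values): clause 9 has an 'Israel' subject, clause 10 does not
d = data.frame(
  lemma = c("israeli", "attack", "be", "far", "hamas", "fire", "rocket"),
  clause_role = c("subject", "subject", "predicate", "predicate",
                  "subject", "predicate", "predicate"),
  clause_id = c(9, 9, 9, 9, 10, 10, 10),
  stringsAsFactors = FALSE
)
# same three steps as above, applied to the toy table
d$israel = grepl("israel.*|idf", d$lemma, ignore.case = TRUE)
clauses = unique(d$clause_id[d$israel & !is.na(d$clause_role) & d$clause_role == "subject"])
predicates = d[!is.na(d$clause_role) & d$clause_role == "predicate" & d$clause_id %in% clauses, ]
predicates$lemma
## [1] "be"  "far"
```

Only the predicate of clause 9 is selected; the Hamas clause (10) is excluded because its subject does not match the Israel pattern.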

Now, we can create a dtm (document-term matrix) containing only the verbs in those predicates, and make a word cloud of those verbs:

library(corpustools)
# keep only verbs, excluding frequent auxiliaries
tokens = predicates.israel[predicates.israel$pos1 == 'V' & !(predicates.israel$lemma %in% c("have", "be", "do", "will")),]
dtm.israel = dtm.create(tokens$aid, tokens$lemma)
dtm.wordcloud(dtm.israel)
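If you want a quick numeric check of what the word cloud shows (or if corpustools is not installed), a sorted frequency table of the verb lemmas carries the same information. A base R sketch, shown here with toy lemmas so the snippet is self-contained; on the real data you would use the tokens object created above:

```r
# Toy stand-in for tokens$lemma (illustration only)
tokens = data.frame(lemma = c("kill", "launch", "kill", "stop"), stringsAsFactors = FALSE)
# count each lemma and sort by frequency, most frequent first
freq = sort(table(tokens$lemma), decreasing = TRUE)
head(freq)
```

The most frequent verbs are exactly the ones drawn largest in the word cloud.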

Let’s see what Hamas does:

# same procedure, now with a pattern for Hamas
t3$hamas = grepl("hamas.*", t3$lemma, ignore.case = T)
clauses.hamas = unique(t3$clause_id[t3$hamas & !is.na(t3$clause_role) & t3$clause_role == "subject"])
predicates.hamas = t3[!is.na(t3$clause_role) & t3$clause_role == "predicate" & t3$clause_id %in% clauses.hamas, ]
# keep only verbs, excluding frequent auxiliaries
tokens = predicates.hamas[predicates.hamas$pos1 == 'V' & !(predicates.hamas$lemma %in% c("have", "be", "do", "will")),]
dtm.hamas = dtm.create(tokens$aid, tokens$lemma)
dtm.wordcloud(dtm.hamas)
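A natural follow-up is to compare the two actors' verbs directly, for instance by merging their frequency tables so each row shows how often a verb occurs with each actor as subject. A base R sketch with hypothetical counts (on the real data, the counts would come from tables of predicates.israel and predicates.hamas lemmas):

```r
# Hypothetical per-actor verb frequencies, for illustration only
f.israel = data.frame(lemma = c("kill", "strike", "stop"), n.israel = c(5, 3, 2))
f.hamas  = data.frame(lemma = c("kill", "fire", "launch"), n.hamas = c(4, 6, 2))
# outer join on lemma: keep verbs used by either actor; NA means 'not used'
cmp = merge(f.israel, f.hamas, by = "lemma", all = TRUE)
cmp
```

Verbs with a count for one actor and NA for the other are the most distinctive, which makes such a table a useful complement to the two separate word clouds.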