The goal of quanteda.proximity is to add proximity vectors to the tokens objects of quanteda.

Proximity is measured as the number of tokens between a token and the keyword. Given the tokenized sentence [“I”, “eat”, “this”, “apple”], suppose “eat” is the keyword. The proximity vector has the same length as the tokenized sentence and, with the default settings (which count from 1 at the keyword itself), its values are [2, 1, 2, 3].
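
The definition above can be sketched in a few lines of base R. This is only an illustration for a four-token sentence with the keyword in second position; `toy_proximity()` is a hypothetical helper, not a function of quanteda.proximity, and it assumes the keyword occurs exactly once.

```r
# Hypothetical helper, not part of quanteda.proximity: distance in tokens
# from each position to the keyword, counted from 1 at the keyword itself.
toy_proximity <- function(tokens, keyword) {
  hits <- which(tokens == keyword)
  stopifnot(length(hits) == 1)  # this sketch handles exactly one occurrence
  abs(seq_along(tokens) - hits) + 1
}

toy_tokens <- c("I", "eat", "this", "apple")
toy_prox <- toy_proximity(toy_tokens, "eat")
toy_prox
#> [1] 2 1 2 3
```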

Installation

You can install the development version of quanteda.proximity like so:

remotes::install_github("gesistsa/quanteda.proximity")

Example

suppressPackageStartupMessages(library(quanteda))
library(quanteda.proximity)

txt1 <-
c("Turkish President Tayyip Erdogan, in his strongest comments yet on the Gaza conflict, said on Wednesday the Palestinian militant group Hamas was not a terrorist organisation but a liberation group fighting to protect Palestinian lands.",
"EU policymakers proposed the new agency in 2021 to stop financial firms from aiding criminals and terrorists. Brussels has so far relied on national regulators with no EU authority to stop money laundering and terrorist financing running into billions of euros.")

tokens_proximity() generates the proximity vectors and stores them as a docvar (document variable).

tok1 <- txt1 %>% tokens() %>%
    tokens_proximity(pattern = "turkish")
tok1
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "turkish"   "president" "tayyip"    "erdogan"   ","         "in"       
#>  [7] "his"       "strongest" "comments"  "yet"       "on"        "the"      
#> [ ... and 26 more ]
#> 
#> text2 :
#>  [1] "eu"           "policymakers" "proposed"     "the"          "new"         
#>  [6] "agency"       "in"           "2021"         "to"           "stop"        
#> [11] "financial"    "firms"       
#> [ ... and 31 more ]
#> 
#> With proximity vector(s).
#> Pattern:  turkish

You can access the proximity vectors with docvars():

docvars(tok1, "proximity")
#> $text1
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38
#> 
#> $text2
#>  [1] 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44
#> [26] 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44

The tokens object with proximity vectors can be converted to a (weighted) dfm (Document-Feature Matrix). By default, each token is weighted by the inverse of its proximity (1 / proximity), and the weights of repeated tokens are summed.

dfm(tok1)
#> Document-feature matrix of: 2 documents, 64 features (45.31% sparse) and 0 docvars.
#>        features
#> docs    turkish president    tayyip erdogan         ,         in       his
#>   text1       1       0.5 0.3333333    0.25 0.2666667 0.16666667 0.1428571
#>   text2       0       0   0            0    0         0.02272727 0        
#>        features
#> docs    strongest  comments yet
#>   text1     0.125 0.1111111 0.1
#>   text2     0     0         0  
#> [ reached max_nfeat ... 54 more features ]

You can supply a different weight function. For example, use identity to keep the proximity values without inverting them:

dfm(tok1, weight_function = identity)
#> Document-feature matrix of: 2 documents, 64 features (45.31% sparse) and 0 docvars.
#>        features
#> docs    turkish president tayyip erdogan  , in his strongest comments yet
#>   text1       1         2      3       4 20  6   7         8        9  10
#>   text2       0         0      0       0  0 44   0         0        0   0
#> [ reached max_nfeat ... 54 more features ]

Or use any custom function:

dfm(tok1, weight_function = function(x) { 1 / x^2 })
#> Document-feature matrix of: 2 documents, 64 features (45.31% sparse) and 0 docvars.
#>        features
#> docs    turkish president    tayyip erdogan          ,           in        his
#>   text1       1      0.25 0.1111111  0.0625 0.04444444 0.0277777778 0.02040816
#>   text2       0      0    0          0      0          0.0005165289 0         
#>        features
#> docs    strongest   comments  yet
#>   text1  0.015625 0.01234568 0.01
#>   text2  0        0          0   
#> [ reached max_nfeat ... 54 more features ]
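
The arithmetic behind the three matrices above can be reproduced by hand in base R. The sketch below is an assumption-laden illustration, not the package's implementation: `toy_weighted_counts()` is a hypothetical helper, and the input is a hand-picked subset of text1 (the keyword plus the two commas, which sit at positions 5 and 15).

```r
# Hand-picked subset of text1 (an assumption for illustration): the keyword
# and the two commas, with proximity values 1, 5 and 15.
toy_tokens <- c("turkish", ",", ",")
toy_prox   <- c(1, 5, 15)

# Hypothetical helper, not the package's implementation: weight each
# proximity value, then sum the weights per token type.
toy_weighted_counts <- function(tokens, prox,
                                weight_function = function(x) 1 / x) {
  vapply(split(weight_function(prox), tokens), sum, numeric(1))
}

w_inverse  <- toy_weighted_counts(toy_tokens, toy_prox)
w_identity <- toy_weighted_counts(toy_tokens, toy_prox,
                                  weight_function = identity)
w_squared  <- toy_weighted_counts(toy_tokens, toy_prox,
                                  weight_function = function(x) 1 / x^2)

w_inverse[[","]]   # 1/5 + 1/15
w_identity[[","]]  # 5 + 15
w_squared[[","]]   # 1/25 + 1/225
```

These sums correspond to the "," column for text1 in the three outputs above (0.2666667, 20 and 0.04444444).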

Application

A simple example: calculating the total inverse-proximity-weighted frequency of “terror*” words.

dict1 <- dictionary(list(TERROR = c("terror*")))

dfm(tok1) %>% dfm_lookup(dict1) %>% rowSums()
#>      text1      text2 
#> 0.03703704 0.04545455
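
To see where the text2 value comes from, the lookup can be mimicked in base R. This is a rough sketch under stated assumptions: the token vector is a hand-picked excerpt of text2, every token is given the proximity 44 shown earlier for that document, and `utils::glob2rx()` stands in for the glob matching that the dictionary lookup performs.

```r
# Hand-picked excerpt of text2 (an assumption for illustration); "turkish"
# never occurs in text2, so every token has proximity 44.
toy_tokens <- c("criminals", "and", "terrorists", "terrorist", "financing")
toy_prox   <- rep(44, length(toy_tokens))

# Translate the glob "terror*" into a regular expression and flag matches
is_terror <- grepl(utils::glob2rx("terror*"), toy_tokens)

# Sum of inverse proximities over the matching tokens
terror_score <- sum(1 / toy_prox[is_terror])
terror_score  # 2 / 44
```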

How about changing the target to “Hamas”?

tok2 <- tok1 %>% tokens_proximity(pattern = "hamas")
tok2
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "turkish"   "president" "tayyip"    "erdogan"   ","         "in"       
#>  [7] "his"       "strongest" "comments"  "yet"       "on"        "the"      
#> [ ... and 26 more ]
#> 
#> text2 :
#>  [1] "eu"           "policymakers" "proposed"     "the"          "new"         
#>  [6] "agency"       "in"           "2021"         "to"           "stop"        
#> [11] "financial"    "firms"       
#> [ ... and 31 more ]
#> 
#> With proximity vector(s).
#> Pattern:  hamas

dfm(tok2) %>% dfm_lookup(dict1) %>% rowSums()
#>      text1      text2 
#> 0.20000000 0.04545455

Can we use two targets, e.g. “EU” and “Brussels”?

tok3 <- tok1 %>% tokens_proximity(pattern = c("eu", "brussels"))
tok3
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "turkish"   "president" "tayyip"    "erdogan"   ","         "in"       
#>  [7] "his"       "strongest" "comments"  "yet"       "on"        "the"      
#> [ ... and 26 more ]
#> 
#> text2 :
#>  [1] "eu"           "policymakers" "proposed"     "the"          "new"         
#>  [6] "agency"       "in"           "2021"         "to"           "stop"        
#> [11] "financial"    "firms"       
#> [ ... and 31 more ]
#> 
#> With proximity vector(s).
#> Pattern:  eu brussels

docvars(tok3, "proximity")
#> $text1
#>  [1] 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39
#> [26] 39 39 39 39 39 39 39 39 39 39 39 39 39
#> 
#> $text2
#>  [1]  1  2  3  4  5  6  7  8  9 10  9  8  7  6  5  4  3  2  1  2  3  4  5  6  5
#> [26]  4  3  2  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

dfm(tok3) %>% dfm_lookup(dict1) %>% rowSums()
#>      text1      text2 
#> 0.02564103 0.45833333
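
The output above suggests that, with several targets, each token receives its distance to the nearest match, and that a document without any match is filled with its token count plus one. A minimal base R sketch of that behaviour follows; `toy_proximity_multi()` is a hypothetical helper that uses exact matching instead of the package's pattern matching.

```r
# Hypothetical helper, not the package's implementation: proximity to the
# nearest of several keywords; documents without any match get length + 1,
# mirroring the values seen in the outputs above.
toy_proximity_multi <- function(tokens, keywords) {
  hits <- which(tokens %in% keywords)
  if (length(hits) == 0) {
    return(rep(length(tokens) + 1, length(tokens)))
  }
  # Distance matrix: one row per token position, one column per hit
  dist <- abs(outer(seq_along(tokens), hits, "-")) + 1
  apply(dist, 1, min)
}

toy_tokens <- c("eu", "a", "b", "c", "brussels", "d", "eu")
prox_multi <- toy_proximity_multi(toy_tokens, c("eu", "brussels"))
prox_multi
#> [1] 1 2 3 2 1 2 1

prox_none <- toy_proximity_multi(c("x", "y", "z"), c("eu", "brussels"))
prox_none
#> [1] 4 4 4
```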

Can we use a phrase?

tok4 <- tok1 %>% tokens_proximity(pattern = phrase("Tayyip Erdogan"))
tok4
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "turkish"   "president" "tayyip"    "erdogan"   ","         "in"       
#>  [7] "his"       "strongest" "comments"  "yet"       "on"        "the"      
#> [ ... and 26 more ]
#> 
#> text2 :
#>  [1] "eu"           "policymakers" "proposed"     "the"          "new"         
#>  [6] "agency"       "in"           "2021"         "to"           "stop"        
#> [11] "financial"    "firms"       
#> [ ... and 31 more ]
#> 
#> With proximity vector(s).
#> Pattern:  Tayyip Erdogan

docvars(tok4, "proximity")
#> $text1
#>  [1]  3  2  1  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
#> [26] 23 24 25 26 27 28 29 30 31 32 33 34 35
#> 
#> $text2
#>  [1] 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44
#> [26] 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44

dfm(tok4) %>% dfm_lookup(dict1) %>% rowSums()
#>      text1      text2 
#> 0.04166667 0.04545455
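
In the proximity vector above, both tokens of the matched phrase have proximity 1 and the remaining tokens are measured from the nearest of the two. The base R sketch below imitates this on the first six tokens of text1; `toy_phrase_proximity()` is a hypothetical helper with exact, already-lowercased matching, not the package's implementation.

```r
# Hypothetical helper: every position covered by a matched phrase is a hit,
# and proximity is the distance to the nearest hit, counted from 1.
toy_phrase_proximity <- function(tokens, phrase_tokens) {
  n <- length(phrase_tokens)
  stopifnot(n >= 1, length(tokens) >= n)
  starts <- seq_len(length(tokens) - n + 1)
  is_start <- vapply(
    starts,
    function(i) all(tokens[i:(i + n - 1)] == phrase_tokens),
    logical(1)
  )
  hits <- unlist(lapply(starts[is_start], function(i) i:(i + n - 1)))
  if (length(hits) == 0) {
    # No match: fill with token count plus one, as in the outputs above
    return(rep(length(tokens) + 1, length(tokens)))
  }
  dist <- abs(outer(seq_along(tokens), hits, "-")) + 1
  apply(dist, 1, min)
}

toy_tokens <- c("turkish", "president", "tayyip", "erdogan", ",", "in")
prox_phrase <- toy_phrase_proximity(toy_tokens, c("tayyip", "erdogan"))
prox_phrase
#> [1] 3 2 1 1 2 3
```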

Similar functions

quanteda: quanteda::tokens_select(window), quanteda::fcm(), quanteda::index()
qdap: qdap::word_proximity(), qdap::weight()