This function computes the distance of every token from the keywords matched by pattern in a quanteda::tokens() object.

Usage

tokens_proximity(
  x,
  pattern,
  get_min = TRUE,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  count_from = 1,
  tolower = TRUE,
  keep_acronyms = FALSE
)

Arguments

x

a tokens or tokens_with_proximity object.

pattern

pattern for selecting keywords, see quanteda::pattern for details.

get_min

logical, whether to return only the minimum distance (TRUE) or the raw distance information (FALSE); the distinction matters mostly when more than one keyword is matched. See Details.

valuetype

See quanteda::valuetype.

case_insensitive

logical, see quanteda::valuetype.

count_from

numeric, the value from which proximity is counted when get_min is TRUE; the keyword itself is assigned this proximity. Defaults to 1 (not 0) to prevent division by 0 under the default behaviour of dfm.tokens_with_proximity().

tolower

logical, convert all features to lowercase.

keep_acronyms

logical, if TRUE, do not lowercase any all-uppercase words. See quanteda::tokens_tolower().
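
The note on count_from can be made concrete with a small base-R sketch. It assumes that the default weighting applied by dfm.tokens_with_proximity() is the reciprocal of proximity; that is inferred from the division-by-0 remark, not confirmed here, and inverse_weight() is a hypothetical name used only for illustration.

```r
# Hypothetical weighting, ASSUMED to be 1 / proximity (not taken from the package).
inverse_weight <- function(proximity) 1 / proximity

# Proximity of each token in c("I", "eat", "this", "apple") to the keyword "eat".
prox_from_1 <- c(2, 1, 2, 3)    # counted with count_from = 1
prox_from_0 <- prox_from_1 - 1  # the same sentence counted with count_from = 0

inverse_weight(prox_from_1)     # every weight is finite; the keyword gets 1
inverse_weight(prox_from_0)     # the keyword gets Inf because of 1 / 0
```

Under this assumption, counting from 1 keeps every weight finite, with the keyword itself weighted highest.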

Value

a tokens_with_proximity object. It is similar to a quanteda::tokens() object, but only the dfm.tokens_with_proximity(), quanteda::convert(), quanteda::docvars(), and quanteda::meta() methods are available. A tokens_with_proximity object has a modified print() method and includes additional data slots:

  • a document variable proximity

  • metadata slots for all arguments used

Details

Proximity is measured by the number of tokens away from the keyword. Given the tokenized sentence ["I", "eat", "this", "apple"] with "eat" as the keyword, the vector of minimum proximity of each token to "eat" is [2, 1, 2, 3] when count_from is 1.

With more than one keyword, the minimum is taken across keywords. Given ["I", "wash", "and", "eat", "this", "apple"] with "wash" and "eat" as the keywords, the minimum distance vector is [2, 1, 2, 1, 2, 3]. If get_min is FALSE, the output is instead a list of two vectors, one per keyword, which in this example are counted from 0: for "wash", [1, 0, 1, 2, 3, 4]; for "eat", [3, 2, 1, 0, 1, 2].

Please conduct all text manipulation tasks with tokens_*() functions before calling this function. To convert the output back to a tokens object, use quanteda::as.tokens().
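
The worked example above can be reproduced with a minimal base-R sketch. This is an illustration of the arithmetic only, not the package's implementation: raw_distance() and min_proximity() are hypothetical helper names, and the sketch matches keywords by exact string equality instead of quanteda patterns.

```r
# Illustrative sketch only; these helpers are hypothetical, not package functions.

# Zero-based distance of every token to the nearest occurrence of one keyword.
raw_distance <- function(tokens, keyword) {
  hits <- which(tokens == keyword)
  if (length(hits) == 0L) {
    return(rep(NA_integer_, length(tokens)))  # keyword absent from this text
  }
  vapply(seq_along(tokens), function(i) min(abs(i - hits)), integer(1))
}

# Element-wise minimum across keywords, shifted so the keyword gets `count_from`.
min_proximity <- function(tokens, keywords, count_from = 1) {
  raw <- lapply(keywords, function(k) raw_distance(tokens, k))
  do.call(pmin, raw) + count_from
}

sentence <- c("I", "wash", "and", "eat", "this", "apple")

raw_distance(sentence, "wash")              # 1 0 1 2 3 4
raw_distance(sentence, "eat")               # 3 2 1 0 1 2
min_proximity(sentence, c("wash", "eat"))   # 2 1 2 1 2 3
min_proximity(c("I", "eat", "this", "apple"), "eat")  # 2 1 2 3
```

The two raw vectors correspond to the get_min = FALSE case; taking their element-wise minimum and adding count_from gives the get_min = TRUE result.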

Examples

library(quanteda)
library(quanteda.proximity)
tok1 <- data_char_ukimmig2010 %>%
    tokens(remove_punct = TRUE) %>%
    tokens_tolower() %>%
    tokens_proximity(c("eu", "euro*"))
tok1 %>%
    dfm() %>%
    dfm_select(c("immig*", "migr*")) %>%
    rowSums() %>%
    sort()
#>          SNP       LibDem           PC Conservative    Coalition       Greens 
#>   0.01600000   0.01834862   0.01960784   0.05601171   0.13107859   0.28359388 
#>       Labour         UKIP          BNP 
#>   0.45629607   0.60657094   0.61155912 
## compare with the raw counts from a plain tokens object
data_char_ukimmig2010 %>%
    tokens(remove_punct = TRUE) %>%
    tokens_tolower() %>%
    dfm() %>%
    dfm_select(c("immig*", "migr*")) %>%
    rowSums() %>%
    sort()
#>           PC          SNP Conservative    Coalition       LibDem         UKIP 
#>            2            2            5            8            8           11 
#>       Greens       Labour          BNP 
#>           16           20           35 
## rerun to select other keywords
tok1 %>% tokens_proximity("britain")
#> Tokens consisting of 9 documents.
#> BNP :
#>  [1] "immigration"  "an"           "unparalleled" "crisis"       "which"       
#>  [6] "only"         "the"          "bnp"          "can"          "solve"       
#> [11] "at"           "current"     
#> [ ... and 2,839 more ]
#> 
#> Coalition :
#>  [1] "immigration"  "the"          "government"   "believes"     "that"        
#>  [6] "immigration"  "has"          "enriched"     "our"          "culture"     
#> [11] "and"          "strengthened"
#> [ ... and 219 more ]
#> 
#> Conservative :
#>  [1] "attract"     "the"         "brightest"   "and"         "best"       
#>  [6] "to"          "our"         "country"     "immigration" "has"        
#> [11] "enriched"    "our"        
#> [ ... and 440 more ]
#> 
#> Greens :
#>  [1] "immigration" "migration"   "is"          "a"           "fact"       
#>  [6] "of"          "life"        "people"      "have"        "always"     
#> [11] "moved"       "from"       
#> [ ... and 598 more ]
#> 
#> Labour :
#>  [1] "crime"       "and"         "immigration" "the"         "challenge"  
#>  [6] "for"         "britain"     "we"          "will"        "control"    
#> [11] "immigration" "with"       
#> [ ... and 608 more ]
#> 
#> LibDem :
#>  [1] "firm"        "but"         "fair"        "immigration" "system"     
#>  [6] "britain"     "has"         "always"      "been"        "an"         
#> [11] "open"        "welcoming"  
#> [ ... and 423 more ]
#> 
#> [ reached max_ndoc ... 3 more documents ]
#> With proximity vector(s).
#> Pattern:  britain