This function extracts distance information from a quanteda::tokens()
object.
Usage
tokens_proximity(
  x,
  pattern,
  get_min = TRUE,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  count_from = 1,
  tolower = TRUE,
  keep_acronyms = FALSE
)
Arguments
- x: a tokens or tokens_with_proximity object.
- pattern: pattern for selecting keywords; see quanteda::pattern for details.
- get_min: logical, whether to return only the minimum distance or the raw distance information; this matters mainly when there is more than one keyword. See Details.
- valuetype: see quanteda::valuetype.
- case_insensitive: logical, see quanteda::valuetype.
- count_from: numeric, the value from which proximity is counted when get_min is TRUE; the keyword itself is assigned this proximity. Defaults to 1 (not 0) to prevent division by zero with the default behaviour of dfm.tokens_with_proximity() (see the sketch after this list).
- tolower: logical, convert all features to lowercase.
- keep_acronyms: logical, if TRUE, do not lowercase any all-uppercase words. See quanteda::tokens_tolower().
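A minimal sketch of why count_from defaults to 1 (hypothetical toy input; it assumes the proximity values are readable from the proximity document variable described under Value):
library(quanteda)
toy <- tokens("I eat this apple") %>%
  tokens_proximity("eat", count_from = 1)
docvars(toy)$proximity
# per Details, the values are [2, 1, 2, 3]: the keyword "eat" itself gets 1,
# so weighting by 1 / proximity in dfm.tokens_with_proximity() never divides by zero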
Value
a tokens_with_proximity object. It is similar to a quanteda::tokens() object, but only the dfm.tokens_with_proximity(), quanteda::convert(), quanteda::docvars(), and quanteda::meta() methods are available. A tokens_with_proximity object has a modified print() method. Additional data slots are also included:
- a document variable proximity
- metadata slots for all arguments used
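The available methods can be used to inspect those slots. A brief sketch (assuming the slots are populated as listed above):
library(quanteda)
tok <- tokens("I eat this apple") %>%
  tokens_proximity("eat")
docvars(tok)  # includes the proximity document variable
meta(tok)     # stores the arguments used, e.g. pattern and get_min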
Details
Proximity is measured by the number of tokens away from the keyword. Given the tokenized sentence ["I", "eat", "this", "apple"] with "eat" as the keyword, the vector of minimum proximity for each word from "eat" is [2, 1, 2, 3] when count_from is 1. In another case, ["I", "wash", "and", "eat", "this", "apple"] with ["wash", "eat"] as the keywords, the minimum proximity vector is [2, 1, 2, 1, 2, 3]. If get_min is FALSE, the output is a list of two vectors. For "wash", the distance vector is [1, 0, 1, 2, 3, 4]; for "eat", it is [3, 2, 1, 0, 1, 2].
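The toy cases above can be reproduced directly. A minimal sketch (hypothetical toy input; it assumes the proximity vectors are readable from the proximity document variable described under Value):
library(quanteda)
sent <- tokens("I wash and eat this apple")
# minimum proximity across both keywords, counted from 1
docvars(tokens_proximity(sent, c("wash", "eat"), get_min = TRUE))$proximity
# expected per the paragraph above: [2, 1, 2, 1, 2, 3]
# raw distance vectors, one per keyword
docvars(tokens_proximity(sent, c("wash", "eat"), get_min = FALSE))$proximity
# expected per the paragraph above: [1, 0, 1, 2, 3, 4] ("wash") and [3, 2, 1, 0, 1, 2] ("eat")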
Please conduct all text manipulation tasks with tokens_*() functions before calling this function. To convert the output back to a tokens object, use quanteda::as.tokens() (see the sketch below).
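A brief round-trip sketch (toy input):
library(quanteda)
tok <- tokens("I eat this apple") %>%
  tokens_proximity("eat")
as.tokens(tok)  # back to a regular tokens object, ready for further tokens_*() calls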
Examples
library(quanteda)
tok1 <- data_char_ukimmig2010 %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
tokens_proximity(c("eu", "euro*"))
tok1 %>%
dfm() %>%
dfm_select(c("immig*", "migr*")) %>%
rowSums() %>%
sort()
#> SNP LibDem PC Conservative Coalition Greens
#> 0.01600000 0.01834862 0.01960784 0.05601171 0.13107859 0.28359388
#> Labour UKIP BNP
#> 0.45629607 0.60657094 0.61155912
## compare with
data_char_ukimmig2010 %>%
tokens(remove_punct = TRUE) %>%
tokens_tolower() %>%
dfm() %>%
dfm_select(c("immig*", "migr*")) %>%
rowSums() %>%
sort()
#> PC SNP Conservative Coalition LibDem UKIP
#> 2 2 5 8 8 11
#> Greens Labour BNP
#> 16 20 35
## rerun to select other keywords
tok1 %>% tokens_proximity("britain")
#> Tokens consisting of 9 documents.
#> BNP :
#> [1] "immigration" "an" "unparalleled" "crisis" "which"
#> [6] "only" "the" "bnp" "can" "solve"
#> [11] "at" "current"
#> [ ... and 2,839 more ]
#>
#> Coalition :
#> [1] "immigration" "the" "government" "believes" "that"
#> [6] "immigration" "has" "enriched" "our" "culture"
#> [11] "and" "strengthened"
#> [ ... and 219 more ]
#>
#> Conservative :
#> [1] "attract" "the" "brightest" "and" "best"
#> [6] "to" "our" "country" "immigration" "has"
#> [11] "enriched" "our"
#> [ ... and 440 more ]
#>
#> Greens :
#> [1] "immigration" "migration" "is" "a" "fact"
#> [6] "of" "life" "people" "have" "always"
#> [11] "moved" "from"
#> [ ... and 598 more ]
#>
#> Labour :
#> [1] "crime" "and" "immigration" "the" "challenge"
#> [6] "for" "britain" "we" "will" "control"
#> [11] "immigration" "with"
#> [ ... and 608 more ]
#>
#> LibDem :
#> [1] "firm" "but" "fair" "immigration" "system"
#> [6] "britain" "has" "always" "been" "an"
#> [11] "open" "welcoming"
#> [ ... and 423 more ]
#>
#> [ reached max_ndoc ... 3 more documents ]
#> With proximity vector(s).
#> Pattern: britain