Construct a sparse document-feature matrix from the output of tokens_proximity().
Usage
# S3 method for tokens_with_proximity
dfm(
  x,
  tolower = TRUE,
  remove_padding = FALSE,
  verbose = quanteda::quanteda_options("verbose"),
  remove_docvars_proximity = TRUE,
  weight_function = function(x) {
    1/x
  },
  ...
)
Arguments
- x
output of tokens_proximity().
- tolower
convert all features to lowercase.
- remove_padding
logical; if TRUE, remove the "pads" left as empty tokens after calling quanteda::tokens() or quanteda::tokens_remove() with padding = TRUE.
- verbose
display messages if TRUE.
- remove_docvars_proximity
logical; if TRUE, remove the "proximity" document variable.
- weight_function
a weight function applied to the distances; defaults to the inverse distance, function(x) 1/x.
- ...
not used.
Value
A quanteda::dfm object.
Details
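By default, words closer to keywords are weighted higher, because the default weight_function is the inverse of the distance (1/x). Pass a different weight_function to change this; for example, identity weights words farther from the keywords higher.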
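To make the effect of weight_function concrete, the following base R sketch applies three candidate functions to made-up distances. The toy data frame, its column names, and the per-feature summation are assumptions for illustration only; they are not taken from the package's internals, and the sketch does not call quanteda at all.

```r
# Toy data, not package output: five tokens and a hypothetical distance
# from each token to the nearest keyword (smaller = closer).
toy <- data.frame(
  feature  = c("immigration", "policy", "immigration", "migrants", "policy"),
  distance = c(1, 2, 5, 10, 4)
)

# Candidate weight functions: each maps a vector of distances to weights.
weight_functions <- list(
  inverse        = function(x) 1 / x,    # same form as the documented default
  identity       = identity,             # farther tokens receive larger weights
  inverse_square = function(x) 1 / x^2   # steeper decay with distance
)

# Weight of each token under each function (one column per function).
token_weights <- sapply(weight_functions, function(f) f(toy$distance))

# Sum the token weights per feature. This assumes, for illustration, that a
# weighted cell is the sum of the weights of the tokens it counts.
feature_weights <- rowsum(token_weights, group = toy$feature)
print(feature_weights)
```

Under these assumptions, a feature that occurs far from every keyword contributes little under the inverse functions but dominates under identity, which is the same reversal the examples below show on real data.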
Examples
library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
library(quanteda.proximity) # assumed package name; provides tokens_proximity()
tok1 <- data_char_ukimmig2010 %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_proximity(c("eu", "europe", "european"))
tok1 %>%
  dfm() %>%
  dfm_select(c("immig*", "migr*")) %>%
  rowSums() %>%
  sort()
#>          SNP       LibDem           PC Conservative    Coalition       Greens
#>   0.01600000   0.01834862   0.01960784   0.05601171   0.13107859   0.28359388
#>       Labour         UKIP          BNP
#>   0.45629607   0.60440644   0.61155912
## Words further away from keywords are weighted higher
tok1 %>%
  dfm(weight_function = identity) %>%
  dfm_select(c("immig*", "migr*")) %>%
  rowSums() %>%
  sort()
#>           PC          SNP         UKIP Conservative    Coalition       Greens
#>          204          250          485          647          931         1916
#>       Labour       LibDem          BNP
#>         2979         3488         9217
## Inverse squared distance: weights fall off faster as distance grows
tok1 %>%
  dfm(weight_function = function(x) {
    1 / x^2
  }) %>%
  dfm_select(c("immig*", "migr*")) %>%
  rowSums() %>%
  sort()
#>       LibDem          SNP           PC Conservative    Coalition       Greens
#> 0.0000420840 0.0001280000 0.0001922338 0.0009022326 0.0056756462 0.0118752157
#>       Labour         UKIP          BNP
#> 0.0505539157 0.0593924399 0.0926906182