
Construct a sparse document-feature matrix from the output of tokens_proximity().

Usage

# S3 method for tokens_with_proximity
dfm(
  x,
  tolower = TRUE,
  remove_padding = FALSE,
  verbose = quanteda::quanteda_options("verbose"),
  remove_docvars_proximity = TRUE,
  weight_function = function(x) {
    1 / x
  },
  ...
)

Arguments

x

output of tokens_proximity().

tolower

logical; if TRUE, convert all features to lowercase.

remove_padding

logical; if TRUE, remove the "pads" left as empty tokens after calling quanteda::tokens() or quanteda::tokens_remove() with padding = TRUE.

verbose

display messages if TRUE.

remove_docvars_proximity

logical; if TRUE, remove the "proximity" document variable.

weight_function

a function that maps distances to weights; defaults to the inverse distance, function(x) 1 / x.

...

not used.

Value

a quanteda::dfm object

Details

By default, words closer to keywords are weighted higher, because the weight is the inverse of the distance. Supply a different weight_function to change this behavior.
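
To make the effect of different weight functions concrete, the following sketch applies three candidates to a plain numeric vector in base R. The distance values and the helper names inverse and inverse_square are made up for illustration; only function(x) 1/x and identity come from this page.

```r
# Hypothetical distances between a token and a keyword (illustrative values only).
distance <- c(1, 2, 5, 10)

inverse <- function(x) 1 / x            # same form as the documented default
inverse_square <- function(x) 1 / x^2   # decays faster than the default

# Side-by-side comparison of the weight each function assigns per distance.
weights <- data.frame(
  distance = distance,
  inverse = inverse(distance),
  identity = identity(distance),
  inverse_square = inverse_square(distance)
)
print(weights)
```

The inverse and inverse-square columns shrink as the distance grows, so nearby words dominate; the identity column grows with the distance, so distant words dominate instead.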

Examples

library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
tok1 <- data_char_ukimmig2010 %>%
    tokens(remove_punct = TRUE) %>%
    tokens_tolower() %>%
    tokens_proximity(c("eu", "europe", "european"))
tok1 %>%
    dfm() %>%
    dfm_select(c("immig*", "migr*")) %>%
    rowSums() %>%
    sort()
#>          SNP       LibDem           PC Conservative    Coalition       Greens 
#>   0.01600000   0.01834862   0.01960784   0.05601171   0.13107859   0.28359388 
#>       Labour         UKIP          BNP 
#>   0.45629607   0.60440644   0.61155912 
## Words further away from keywords are weighted higher
tok1 %>%
    dfm(weight_function = identity) %>%
    dfm_select(c("immig*", "migr*")) %>%
    rowSums() %>%
    sort()
#>           PC          SNP         UKIP Conservative    Coalition       Greens 
#>          204          250          485          647          931         1916 
#>       Labour       LibDem          BNP 
#>         2979         3488         9217 
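## Inverse-square weighting: weights fall off faster with distance
tok1 %>%
    dfm(weight_function = function(x) {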
        1 / x^2
    }) %>%
    dfm_select(c("immig*", "migr*")) %>%
    rowSums() %>%
    sort()
#>       LibDem          SNP           PC Conservative    Coalition       Greens 
#> 0.0000420840 0.0001280000 0.0001922338 0.0009022326 0.0056756462 0.0118752157 
#>       Labour         UKIP          BNP 
#> 0.0505539157 0.0593924399 0.0926906182
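
The three results above are consistent with each cell being the sum of weight_function(distance) over the matching tokens in a document. This is an inference from the printed output, not something taken from the package source. As a sanity check in base R, assume the SNP document contains two matching tokens, each at distance 125 (the vector snp_distance is a hypothetical reconstruction, not data read from the corpus):

```r
# Assumed distances for the two matching tokens in the SNP document.
snp_distance <- c(125, 125)

snp_inverse <- sum(1 / snp_distance)           # compare with 0.01600000 above
snp_identity <- sum(identity(snp_distance))    # compare with 250 above
snp_inverse_square <- sum(1 / snp_distance^2)  # compare with 0.0001280000 above

print(c(snp_inverse, snp_identity, snp_inverse_square))
```

Under that assumption, the three sums match the SNP entries reported for the default, identity, and inverse-square weight functions.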