A Fast WHATWG Compliant URL Parser • adaR

adaR is a wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++.

It also implements several auxiliary functions for working with URLs:

public suffix extraction (top level domain excluding private domains) like psl
fast C++ implementation of utils::URLdecode (~40x speedup)
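As a quick illustration of the decoding helper, the sketch below assumes it is exported as url_decode2() and takes a character vector of percent-encoded strings. Check the package reference for the exact name and arguments before relying on it.

```r
library(adaR)

# A percent-encoded string: %20 is a space, %21 is an exclamation mark.
encoded <- "Hello%20World%21"

# Assumption: url_decode2() is adaR's replacement for utils::URLdecode().
decoded_ada <- url_decode2(encoded)

# Base R reference implementation for comparison.
decoded_base <- utils::URLdecode(encoded)

print(decoded_ada)
print(decoded_base)
```

For plain ASCII input such as this, both calls should return the same string. The speedup quoted above matters mainly when decoding long vectors of URLs.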

More general information on URL parsing can be found in the introductory vignette via vignette("adaR").

adaR is part of a series of R packages to analyse webtracking data:

webtrackR: preprocess raw webtracking data
domainator: classify domains
adaR: parse urls

Installation

You can install the development version of adaR from GitHub with:

# install.packages("devtools")
devtools::install_github("gesistsa/adaR")

The version on CRAN can be installed with:

install.packages("adaR")

Example

This is a basic example which shows all the returned components of a URL.

library(adaR)
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
#>                                                      href
#> 1 https://user_1:password_1@example.org:8080/api?q=1#frag
#>   protocol username   password             host
#> 1   https:   user_1 password_1 example.org:8080
#>      hostname port pathname search  hash
#> 1 example.org 8080     /api   ?q=1 #frag

  /*
   * https://user:pass@example.com:1234/foo/bar?baz#quux
   *       |     |    |          | ^^^^|       |   |
   *       |     |    |          | |   |       |   `----- hash_start
   *       |     |    |          | |   |       `--------- search_start
   *       |     |    |          | |   `----------------- pathname_start
   *       |     |    |          | `--------------------- port
   *       |     |    |          `----------------------- host_end
   *       |     |    `---------------------------------- host_start
   *       |     `--------------------------------------- username_end
   *       `--------------------------------------------- protocol_end
   */

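It solves some problems urltools has with more complex URLs, such as the Google Maps link below, where urltools mistakes the coordinates after the @ for the domain.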

urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.
   7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#>   scheme                            domain port
#> 1  https 40.7519848,-74.0015045,14.\n   7z <NA>
#>                                                                                 path
#> 1 data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   parameter fragment
#> 1      <NA>     <NA>

ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m
   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#>                                                                                                                                                                         href
#> 1 https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   protocol username password           host       hostname
#> 1   https:                   www.google.com www.google.com
#>   port
#> 1     
#>                                                                                                                                               pathname
#> 1 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   search hash
#> 1

A “raw” URL parse using ada is extremely fast (see ada-url.com), but carrying that speed over to R is tricky. The performance is still comparable with urltools::url_parse, with the noted advantage in accuracy in some practical circumstances.

bench::mark(
  ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
  urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 ada          2.21µs   2.46µs   384709.        0B     38.5
#> 2 urltools   101.93µs 106.44µs     9258.        0B     61.9

For further benchmark results, see benchmark.md in data_raw.

There are four more groups of functions available for working with parsed URLs:

ada_get_*() get a specific component
ada_has_*() check if a specific component is present
ada_set_*() set a specific component of URLs
ada_clear_*() remove a specific component from URLs
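The four families above can be sketched on the URL from the basic example. The specific function names used here (ada_get_hostname(), ada_get_port(), ada_has_port(), ada_has_hash(), ada_set_port(), ada_clear_hash()) and the assumption that the setter takes the new value as a string in its second argument follow the naming pattern in the list, so treat them as assumptions and confirm them against the package reference.

```r
library(adaR)

url <- "https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"

# ada_get_*(): extract one component without building the full data frame.
hostname <- ada_get_hostname(url)
port <- ada_get_port(url)

# ada_has_*(): logical checks for the presence of a component.
has_port <- ada_has_port(url)
has_hash <- ada_has_hash(url)

# ada_set_*(): return the URL with one component replaced
# (assumption: the new value is passed as a character string).
with_new_port <- ada_set_port(url, "9090")

# ada_clear_*(): return the URL with one component removed.
without_hash <- ada_clear_hash(url)

print(hostname)
print(has_port)
print(without_hash)
```

The getters should agree with the corresponding columns of ada_url_parse(), so they are mostly a convenience when only one component is needed.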

Public Suffix extraction

public_suffix() extracts the top-level domain of each URL according to the public suffix list, excluding private domains.

urls <- c(
  "https://subsub.sub.domain.co.uk",
  "https://domain.api.gov.uk",
  "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
#> [1] "co.uk"                           
#> [2] "gov.uk"                          
#> [3] "butthisispartoftheps.kawasaki.jp"

If you are wondering about the last URL: the list also contains wildcard suffixes such as *.kawasaki.jp, which need to be matched.
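To make the wildcard matching concrete, here is a minimal base R sketch of longest-suffix matching against a toy rule set. It is not adaR's implementation: toy_rules and toy_public_suffix() are hypothetical names, the rule set is a tiny hand-picked subset, and exception rules (those starting with "!") are ignored.

```r
# Toy rule set (assumption: a tiny subset, not the real public suffix list).
toy_rules <- c("uk", "co.uk", "gov.uk", "jp", "*.kawasaki.jp")

# Return the longest suffix of `host` that matches a rule, either
# literally or with its leftmost label replaced by the wildcard "*".
toy_public_suffix <- function(host, rules = toy_rules) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  for (start in seq_len(n)) {
    # Candidate suffix, tried from longest to shortest.
    candidate <- paste(labels[start:n], collapse = ".")
    # The same candidate with its first label swapped for "*".
    wildcard <- paste(c("*", labels[-seq_len(start)]), collapse = ".")
    if (candidate %in% rules || wildcard %in% rules) {
      return(candidate)
    }
  }
  NA_character_
}

hosts <- c(
  "subsub.sub.domain.co.uk",
  "domain.api.gov.uk",
  "thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
suffixes <- vapply(hosts, toy_public_suffix, character(1), USE.NAMES = FALSE)
print(suffixes)
```

For the third host, the candidate butthisispartoftheps.kawasaki.jp matches only through the wildcard rule *.kawasaki.jp, which is why the whole candidate, not just kawasaki.jp, is returned as the suffix.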

Acknowledgement

The logo is created from this portrait of Ada Lovelace, a very early pioneer in Computer Science.