This vignette gives an overview of adaR and URL parsing in general.
A primer on URLs
A URL (Uniform Resource Locator) serves as a reference to a web resource and has specific components that give information about how the resource can be fetched. The table below gives an overview of the components of a valid URL.
Name | Description | Example |
---|---|---|
Protocol | Indicates the protocol to access the resource. | http:// |
Username & Password | Contains authentication info. Separated by a colon and followed by an @. | username:password@ |
Hostname | Refers to the domain name or IP of the server where the resource resides. | example.com or 192.168.1.1 |
Port | Specifies the technical gate used to access the resources on the server. | :8080 |
Pathname | Provides info about the location of the resource on the server, often like a filesystem path. | /directory/file.html or /images/pic.jpg |
Query | Provides additional parameters, often for search queries or data retrieval. | ?key1=value1&key2=value2 |
Fragment | Refers to a specific part of a web resource or document, like an anchor. | #section2 |
A full URL might look something like this:
https://username:password@example.com:8080/directory/file.html?key1=value1&key2=value2#section2
However, URLs can be as simple as just a scheme and host (e.g., http://example.com). The presence and specific combination of these components can vary based on the exact nature and purpose of the URL.
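To make the component names concrete, here is a minimal base R sketch that splits the full example URL with the generic regular expression from RFC 3986, Appendix B. This is only an illustration: it is not the WHATWG algorithm and not how adaR works internally, and the names `uri_pattern` and `split_uri` are made up for this example.

```r
# Generic URI-splitting regular expression from RFC 3986, Appendix B.
# This is NOT the WHATWG parsing algorithm; it only illustrates the components.
uri_pattern <- "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?"

# Hypothetical helper: returns the five generic components of a URI.
split_uri <- function(url) {
  # m[1] is the full match, m[2] to m[10] are the nine capture groups
  m <- regmatches(url, regexec(uri_pattern, url))[[1]]
  c(scheme = m[3], authority = m[5], path = m[6], query = m[8], fragment = m[10])
}

parts <- split_uri(
  "https://username:password@example.com:8080/directory/file.html?key1=value1&key2=value2#section2"
)
print(parts)
```

Note that this sketch returns the scheme, query and fragment without their delimiters, and it lumps username, password, hostname and port together as the authority.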
The terms are not necessarily unambiguous and there are further (sub)terms that need explanation. The protocol can also be called scheme. hostname + port is called host in adaR. Additionally, the query is referred to as search and the fragment as hash in adaR.
Some more relevant subcomponents are given in the following table.
Term | Description | Example |
---|---|---|
Domain | A name that represents an IP address of the server which hosts the website. It’s a human-readable form of an address where web resources can be accessed. | example.com |
Subdomain | A subset or a smaller part of the main domain. It’s used to organize and navigate to different sections or services of a website. | blog.example.com |
Top-Level Domain (TLD) | The last segment of the domain name. It follows the last dot in the domain name. Indicates the purpose or origin of a domain. | .com, .net, .org |
Public Suffix | A domain under which Internet users can directly register their own domain names. Public suffixes include TLDs as well as certain subdomains under which domains can be registered. | co.uk, com.au, github.io |
But wait, there is more. The table below gives the definition of several terms that are of relevance when dealing with URLs and the adaR package.
Term | Description | Example |
---|---|---|
Authority | Combines user info, hostname, and port. Identifies the party responsible for the resource. | userinfo@host:port |
Relative URL | A URL without the scheme and host, often starting with a path. Relative to a base URL. | /path/to/file.html |
Absolute URL | A full URL specifying scheme and host. | https://example.com/path/to/file.html |
Base URL | The URL to which relative URLs are resolved. | <base href="https://example.com/page/"> |
Percent Encoding | Encodes special characters within a URI using % followed by two hexadecimal digits. | Hello%20World (represents “Hello World”) |
Punycode | Represents Unicode characters in domain names using ASCII. | xn--80akhbyknj4f (represents испытание) |
URL Canonicalization | Converts a URL into a standardized or normalized format. | From https://example.com:443/../a.html to https://example.com/a.html |
URL Shortening | Converts a long URL into a significantly shorter version that redirects to the original URL. | Shortening https://example.com/some-long-path might give https://exmpl.co/abc123 |
URL Slug | Part of a URL derived from the title of a webpage, usually human-readable and used for SEO. | For a post titled “How to Bake”, slug might be how-to-bake |
URI vs URL | URI is a broader category including URLs (locator) and URNs (name). All URLs are URIs, but not all URIs are URLs. | URI: mailto:john.doe@example.com, URL: https://example.com |
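Percent encoding can be tried out without any extra packages. The following short sketch uses `URLencode()` and `URLdecode()` from base R's utils package, not adaR, to go back and forth between the two forms from the table.

```r
# Encode a string so that it is safe to use inside a URL
encoded <- utils::URLencode("Hello World")
print(encoded)

# Decode the percent-encoded form again
decoded <- utils::URLdecode("Hello%20World")
print(decoded)

# With reserved = TRUE, characters such as "&" and "=" are encoded as well,
# which is useful for values that go into a query string
encoded_value <- utils::URLencode("a&b=c", reserved = TRUE)
print(encoded_value)
```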
“WHATWG compliant”
The underlying C++ library of adaR, ada-url, is “WHATWG compliant”.
Who/What is the WHATWG?
The Web Hypertext Application Technology Working Group (WHATWG) is a community of people interested in evolving the web through standards and tests.
It was founded by individuals of Apple, the Mozilla Foundation, and Opera Software in 2004, after a W3C workshop. Apple, Mozilla and Opera were becoming increasingly concerned about the W3C’s direction with XHTML, lack of interest in HTML, and apparent disregard for the needs of real-world web developers. So, in response, these organisations set out with a mission to address these concerns and the Web Hypertext Application Technology Working Group was born.
What is the WHATWG working on?
The WHATWG’s focus is on standards implementable in web browsers, and their associated tests. Their existing work is listed on the WHATWG website.
The standard of relevance for this package is the URL standard. Being “WHATWG compliant” means that ada-url follows this URL standard.
Parsing URLs
The function ada_url_parse() decomposes a URL into the components shown in the first table.
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
#> href protocol username
#> 1 https://user_1:password_1@example.org:8080/api?q=1#frag https: user_1
#> password host hostname port pathname search hash
#> 1 password_1 example.org:8080 example.org 8080 /api ?q=1 #frag
The function handles punycode and percent encoding, and it generally copes well with all kinds of edge cases.
corner_cases <- c(
"https://example.com:8080", "http://user:password@example.com",
"http://[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:8080", "https://example.com/path/to/resource?query=value&another=thing#fragment",
"http://sub.sub.example.com", "ftp://files.example.com:2121/download/file.txt",
"http://example.com/path with spaces/and&special=characters?",
"https://user:pa%40ssword@example.com/path", "http://example.com/..//a/b/../c/./d.html",
"https://example.com:8080/over/under?query=param#and-a-fragment",
"http://192.168.0.1/path/to/resource", "http://3com.com/path/to/resource",
"http://example.com/%7Eusername/", "https://example.com/a?query=value&query=value2",
"https://example.com/a/b/c/..", "ws://websocket.example.com:9000/chat",
"https://example.com:65535/edge-case-port", "file:///home/user/file.txt",
"http://example.com/a/b/c/%2F%2F", "http://example.com/a/../a/../a/../a/",
"https://example.com/./././a/", "http://example.com:8080/a;b?c=d#e",
"http://@example.com", "http://example.com/@test", "http://example.com/@@@/a/b",
"https://example.com:0/", "http://example.com/%25path%20with%20encoded%20chars",
"https://example.com/path?query=%26%3D%3F%23", "http://example.com:8080/?query=value#fragment#fragment2",
"https://example.xn--80akhbyknj4f/path/to/resource", "https://example.co.uk/path/to/resource",
"http://username:pass%23word@example.net", "ftp://downloads.example.edu:3030/files/archive.zip",
"https://example.com:8080/this/is/a/deeply/nested/path/to/a/resource",
"http://another-example.com/..//test/./demo.html", "https://sub2.sub1.example.org:5000/login?user=test#section2",
"ws://chat.example.biz:5050/livechat", "http://192.168.1.100/a/b/c/d",
"https://secure.example.shop/cart?item=123&quantity=5", "http://example.travel/%60%21%40%23%24%25%5E%26*()",
"https://example.museum/path/to/artifact?search=ancient", "ftp://secure-files.example.co:4040/files/document.docx",
"https://test.example.aero/booking?flight=abc123", "http://example.asia/%E2%82%AC%E2%82%AC/path",
"http://subdomain.example.tel/contact?name=john", "ws://game-server.example.jobs:2020/match?id=xyz",
"http://example.mobi/path/with/mobile/content", "https://example.name/family/tree?name=smith",
"http://192.168.2.2/path?query1=value1&query2=value2", "http://example.pro/professional/services",
"https://example.info/information/page", "http://example.int/internal/systems/login",
"https://example.post/postal/services", "http://example.xxx/age/verification",
"https://example.xxx/another/edge/case/path?with=query#and-fragment"
)
df <- ada_url_parse(corner_cases)
df[, -1]
#> protocol username password host
#> 1 https: example.com:8080
#> 2 http: user password example.com
#> 3 http: [2001:db8:85a3::8a2e:370:7334]:8080
#> 4 https: example.com
#> 5 http: sub.sub.example.com
#> 6 ftp: files.example.com:2121
#> 7 http: example.com
#> 8 https: user pa@ssword example.com
#> 9 http: example.com
#> 10 https: example.com:8080
#> 11 http: 192.168.0.1
#> 12 http: 3com.com
#> 13 http: example.com
#> 14 https: example.com
#> 15 https: example.com
#> 16 ws: websocket.example.com:9000
#> 17 https: example.com:65535
#> 18 file:
#> 19 http: example.com
#> 20 http: example.com
#> 21 https: example.com
#> 22 http: example.com:8080
#> 23 http: example.com
#> 24 http: example.com
#> 25 http: example.com
#> 26 https: example.com:0
#> 27 http: example.com
#> 28 https: example.com
#> 29 http: example.com:8080
#> 30 https: example.испытание
#> 31 https: example.co.uk
#> 32 http: username pass#word example.net
#> 33 ftp: downloads.example.edu:3030
#> 34 https: example.com:8080
#> 35 http: another-example.com
#> 36 https: sub2.sub1.example.org:5000
#> 37 ws: chat.example.biz:5050
#> 38 http: 192.168.1.100
#> 39 https: secure.example.shop
#> 40 http: example.travel
#> 41 https: example.museum
#> 42 ftp: secure-files.example.co:4040
#> 43 https: test.example.aero
#> 44 http: example.asia
#> 45 http: subdomain.example.tel
#> 46 ws: game-server.example.jobs:2020
#> 47 http: example.mobi
#> 48 https: example.name
#> 49 http: 192.168.2.2
#> 50 http: example.pro
#> 51 https: example.info
#> 52 http: example.int
#> 53 https: example.post
#> 54 http: example.xxx
#> 55 https: example.xxx
#> hostname port
#> 1 example.com 8080
#> 2 example.com
#> 3 [2001:db8:85a3::8a2e:370:7334] 8080
#> 4 example.com
#> 5 sub.sub.example.com
#> 6 files.example.com 2121
#> 7 example.com
#> 8 example.com
#> 9 example.com
#> 10 example.com 8080
#> 11 192.168.0.1
#> 12 3com.com
#> 13 example.com
#> 14 example.com
#> 15 example.com
#> 16 websocket.example.com 9000
#> 17 example.com 65535
#> 18
#> 19 example.com
#> 20 example.com
#> 21 example.com
#> 22 example.com 8080
#> 23 example.com
#> 24 example.com
#> 25 example.com
#> 26 example.com 0
#> 27 example.com
#> 28 example.com
#> 29 example.com 8080
#> 30 example.испытание
#> 31 example.co.uk
#> 32 example.net
#> 33 downloads.example.edu 3030
#> 34 example.com 8080
#> 35 another-example.com
#> 36 sub2.sub1.example.org 5000
#> 37 chat.example.biz 5050
#> 38 192.168.1.100
#> 39 secure.example.shop
#> 40 example.travel
#> 41 example.museum
#> 42 secure-files.example.co 4040
#> 43 test.example.aero
#> 44 example.asia
#> 45 subdomain.example.tel
#> 46 game-server.example.jobs 2020
#> 47 example.mobi
#> 48 example.name
#> 49 192.168.2.2
#> 50 example.pro
#> 51 example.info
#> 52 example.int
#> 53 example.post
#> 54 example.xxx
#> 55 example.xxx
#> pathname search
#> 1 /
#> 2 /
#> 3 /
#> 4 /path/to/resource ?query=value&another=thing
#> 5 /
#> 6 /download/file.txt
#> 7 /path with spaces/and&special=characters
#> 8 /path
#> 9 //a/c/d.html
#> 10 /over/under ?query=param
#> 11 /path/to/resource
#> 12 /path/to/resource
#> 13 /~username/
#> 14 /a ?query=value&query=value2
#> 15 /a/b/
#> 16 /chat
#> 17 /edge-case-port
#> 18 /home/user/file.txt
#> 19 /a/b/c///
#> 20 /a/
#> 21 /a/
#> 22 /a;b ?c=d
#> 23 /
#> 24 /@test
#> 25 /@@@/a/b
#> 26 /
#> 27 /%path with encoded chars
#> 28 /path ?query=&=?#
#> 29 / ?query=value
#> 30 /path/to/resource
#> 31 /path/to/resource
#> 32 /
#> 33 /files/archive.zip
#> 34 /this/is/a/deeply/nested/path/to/a/resource
#> 35 //test/demo.html
#> 36 /login ?user=test
#> 37 /livechat
#> 38 /a/b/c/d
#> 39 /cart ?item=123&quantity=5
#> 40 /`!@#$%^&*()
#> 41 /path/to/artifact ?search=ancient
#> 42 /files/document.docx
#> 43 /booking ?flight=abc123
#> 44 /€€/path
#> 45 /contact ?name=john
#> 46 /match ?id=xyz
#> 47 /path/with/mobile/content
#> 48 /family/tree ?name=smith
#> 49 /path ?query1=value1&query2=value2
#> 50 /professional/services
#> 51 /information/page
#> 52 /internal/systems/login
#> 53 /postal/services
#> 54 /age/verification
#> 55 /another/edge/case/path ?with=query
#> hash
#> 1
#> 2
#> 3
#> 4 #fragment
#> 5
#> 6
#> 7
#> 8
#> 9
#> 10 #and-a-fragment
#> 11
#> 12
#> 13
#> 14
#> 15
#> 16
#> 17
#> 18
#> 19
#> 20
#> 21
#> 22 #e
#> 23
#> 24
#> 25
#> 26
#> 27
#> 28
#> 29 #fragment#fragment2
#> 30
#> 31
#> 32
#> 33
#> 34
#> 35
#> 36 #section2
#> 37
#> 38
#> 39
#> 40
#> 41
#> 42
#> 43
#> 44
#> 45
#> 46
#> 47
#> 48
#> 49
#> 50
#> 51
#> 52
#> 53
#> 54
#> 55 #and-fragment
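Several rows in the pathname column above (for example rows 9, 15 and 20) show that `.` and `..` segments are resolved during parsing. As a rough illustration of that step, here is a simplified base R sketch. It is not the code used by ada-url, the function name `remove_dot_segments` is hypothetical, and it ignores details of the URL standard such as percent-encoded dots (`%2e`).

```r
# Simplified sketch: resolve "." and ".." segments in an absolute path.
remove_dot_segments <- function(path) {
  # Append "/" so that strsplit() keeps a trailing empty segment,
  # then drop the empty string in front of the leading "/"
  segments <- strsplit(paste0(path, "/"), "/", fixed = TRUE)[[1]][-1]
  out <- character(0)
  n <- length(segments)
  for (i in seq_len(n)) {
    seg <- segments[i]
    is_last <- i == n
    if (seg == "..") {
      # Go up one level if possible; never go above the root
      if (length(out) > 0) out <- out[-length(out)]
      if (is_last) out <- c(out, "")
    } else if (seg == ".") {
      # Stay at the current level
      if (is_last) out <- c(out, "")
    } else {
      # Ordinary segment (empty segments are kept as they are)
      out <- c(out, seg)
    }
  }
  paste0("/", paste(out, collapse = "/"))
}

print(remove_dot_segments("/dir/../api"))
print(remove_dot_segments("/..//a/b/../c/./d.html"))
print(remove_dot_segments("/a/b/c/.."))
```

For these three inputs the sketch gives the same pathnames as the parser output shown in this vignette: `/api`, `//a/c/d.html` and `/a/b/`.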
ada_url_parse() is the workhorse of adaR and always returns all components of a URL. Specific components can be extracted with the ada_get_*() set of functions.
ada_get_hostname(corner_cases)
#> [1] "example.com" "example.com"
#> [3] "[2001:db8:85a3::8a2e:370:7334]" "example.com"
#> [5] "sub.sub.example.com" "files.example.com"
#> [7] "example.com" "example.com"
#> [9] "example.com" "example.com"
#> [11] "192.168.0.1" "3com.com"
#> [13] "example.com" "example.com"
#> [15] "example.com" "websocket.example.com"
#> [17] "example.com" ""
#> [19] "example.com" "example.com"
#> [21] "example.com" "example.com"
#> [23] "example.com" "example.com"
#> [25] "example.com" "example.com"
#> [27] "example.com" "example.com"
#> [29] "example.com" "example.испытание"
#> [31] "example.co.uk" "example.net"
#> [33] "downloads.example.edu" "example.com"
#> [35] "another-example.com" "sub2.sub1.example.org"
#> [37] "chat.example.biz" "192.168.1.100"
#> [39] "secure.example.shop" "example.travel"
#> [41] "example.museum" "secure-files.example.co"
#> [43] "test.example.aero" "example.asia"
#> [45] "subdomain.example.tel" "game-server.example.jobs"
#> [47] "example.mobi" "example.name"
#> [49] "192.168.2.2" "example.pro"
#> [51] "example.info" "example.int"
#> [53] "example.post" "example.xxx"
#> [55] "example.xxx"
ada_has_*() can be used to check if certain components are present or not.
ada_has_search(corner_cases)
#> [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
#> [13] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#> [25] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#> [37] FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
#> [49] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
ada_set_*() can be used to set specific components of a URL.
ada_set_hostname("https://example.de/test", "example.com")
#> [1] "https://example.com/test"
ada_clear_*() can be used to remove certain components.
url <- "https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"
ada_clear_port(url)
#> [1] "https://user_1:password_1@example.org/api?q=1#frag"
ada_clear_hash(url)
#> [1] "https://user_1:password_1@example.org:8080/api?q=1"
ada_clear_search(url)
#> [1] "https://user_1:password_1@example.org:8080/api#frag"
Public suffix extraction
The package also implements a public suffix extractor public_suffix(), based on a lookup of the Public Suffix List. Note that from this list, we only include registry suffixes (e.g., com, co.uk), which are those controlled by a domain name registry and governed by ICANN. We do not include “private” suffixes (e.g., blogspot.com) that allow people to register subdomains. Hence, we use the term domain in the sense of “top domain under a registry suffix”. See https://github.com/google/guava/wiki/InternetDomainNameExplained for more details.
urls <- c(
"https://subsub.sub.domain.co.uk",
"https://domain.api.gov.uk",
"https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
#> [1] "co.uk" "gov.uk"
#> [3] "butthisispartoftheps.kawasaki.jp"
If you are wondering about the last URL: the list also contains wildcard suffixes such as *.kawasaki.jp, which need to be matched as well.
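To illustrate how a wildcard rule can be matched, here is a small base R sketch. It is not adaR's implementation: the rule set `rules` is a tiny hand-picked excerpt, the function `match_suffix` is hypothetical, it works on hostnames rather than full URLs, and it ignores the exception rules (prefixed with `!`) that the real Public Suffix List also contains.

```r
# Hypothetical, tiny excerpt of rules; the real list is far longer
rules <- c("uk", "co.uk", "gov.uk", "jp", "*.kawasaki.jp")

# Return the longest suffix of `host` that matches a rule,
# either literally or via a wildcard in the leftmost label.
match_suffix <- function(host, rules) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Try candidates from longest to shortest so that the longest match wins
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    # Wildcard form: the leftmost label of the candidate is replaced by "*"
    rest <- if (i < n) labels[(i + 1):n] else character(0)
    wildcard <- paste(c("*", rest), collapse = ".")
    if (candidate %in% rules || wildcard %in% rules) {
      return(candidate)
    }
  }
  NA_character_
}

hosts <- c(
  "subsub.sub.domain.co.uk",
  "domain.api.gov.uk",
  "thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
suffixes <- vapply(hosts, match_suffix, character(1), rules = rules, USE.NAMES = FALSE)
print(suffixes)
```

For the last host, the candidate butthisispartoftheps.kawasaki.jp matches the rule *.kawasaki.jp because the wildcard stands for exactly one label, which is why that label is part of the returned suffix.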