The wbdataset package is an extension of the dataset package, an R package that helps to exchange, publish and combine datasets more easily by improving their semantics. The wbdataset package extends the usability of dataset by connecting the Wikibase API with the R statistical environment.
Let us initialize a dataset from Wikidata.
You do not need the login functions get_csrf and get_csrf_token to work with Wikidata. You do need them to provide login credentials to a privately managed Wikibase instance. In this example, we retrieve data from Wikidata, which is not password protected. The same scripts would retrieve data from a private Wikibase if you pass the optional CSRF token to the functions.
Retrieving items
Items will serve as your observations or data subjects. We will collect their attributes (usually constants, like the area of a country) and variables (like their population).
We will start with four countries. Of course, you could read in a longer list from a file:
library(wbdataset)
library(magrittr) # provides the %>% pipe
# Select the following country profiles from Wikidata:
wikidata_countries <- c(
"http://www.wikidata.org/entity/Q756617", "http://www.wikidata.org/entity/Q347",
"http://www.wikidata.org/entity/Q3908", "http://www.wikidata.org/entity/Q1246"
)
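A longer list could be read from a plain-text file. The following is a minimal sketch in base R, assuming a hypothetical file with one Wikidata entity URI per line; the file name and the variable wikidata_countries_from_file are illustrative and not part of wbdataset.

```r
# Hypothetical input file: one Wikidata entity URI per line.
# A temporary file is written here only to keep the sketch self-contained.
qid_file <- tempfile(fileext = ".txt")
writeLines(
  c(
    "http://www.wikidata.org/entity/Q756617",
    "http://www.wikidata.org/entity/Q347",
    "",
    "http://www.wikidata.org/entity/Q347"
  ),
  qid_file
)

# Read the lines, trim whitespace, then drop duplicates and blank lines.
wikidata_countries_from_file <- unique(trimws(readLines(qid_file)))
wikidata_countries_from_file <-
  wikidata_countries_from_file[nzchar(wikidata_countries_from_file)]
```

The resulting character vector has the same shape as wikidata_countries and could be used in its place.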
Then download the main identifiers of these data subjects, i.e., the countries:
- QID
- label
- description
# Retrieve their labels into a dataset called 'European countries':
wikidata_countries_df <- get_item(
qid = wikidata_countries,
language = "en",
title = "European countries",
creator = person("Daniel", "Antal")
)
The resulting dataset:
print(wikidata_countries_df)
#> Antal D (2024). "European countries."
#> qid label description language
#> <hvn_lbl_> <hvn_lbl_> <hvn_lbl_> <hvn_lb>
#> 1 Q756617 Kingdom of Denmark Kingdom of Denmark and its autonomous … en
#> 2 Q347 Liechtenstein country in Central Europe en
#> 3 Q3908 Galicia autonomous community of Spain en
#> 4 Q1246 Kosovo country in southeastern Europe en
The provenance and the definition of the key qid column are well described in the attributes.
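The values in the qid column correspond to the last path segment of the entity URIs passed in. The relationship can be sketched in base R as follows; this only illustrates the mapping and is not necessarily how get_item derives the QID internally.

```r
# The four entity URIs used in this example.
wikidata_countries <- c(
  "http://www.wikidata.org/entity/Q756617", "http://www.wikidata.org/entity/Q347",
  "http://www.wikidata.org/entity/Q3908", "http://www.wikidata.org/entity/Q1246"
)

# Strip everything up to and including the last slash to keep the bare QID.
qids <- sub("^.*/", "", wikidata_countries)
# qids now holds: Q756617, Q347, Q3908, Q1246
```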
Adding attributes and variables
Now let us add further columns, making sure that we include the precise definition of each of the variables.
We add properties of these countries (their attributes or variables) by reading the Wikidata statements about them. The left_join_column function tries to retrieve an attribute, for example the ISO 3166-1 alpha-2 code (P297), for every country identified by a QID, starting with Q756617 as defined above.
ds <- wikidata_countries_df %>%
left_join_column(
label = "ISO 3166-1 alpha-2 code",
property = "P297"
) %>%
left_join_column(
property = "P1566",
label = "Geonames ID",
namespace = "https://www.geonames.org/"
) %>%
left_join_column(
label = "different from Wikipedia item",
property = "P1889"
)
#> Left join claims: 1/4: Q756617 P297
#> Left join claims: 2/4: Q347 P297
#> Left join claims: 3/4: Q3908 P297
#> Left join claims: 4/4: Q1246 P297
#> Left join claims: 1/4: Q756617 P1566
#> Left join claims: 2/4: Q347 P1566
#> Left join claims: 3/4: Q3908 P1566
#> Left join claims: 4/4: Q1246 P1566
#> Left join claims: 1/4: Q756617 P1889
#> Left join claims: 2/4: Q347 P1889
#> Left join claims: 3/4: Q3908 P1889
#> Left join claims: 4/4: Q1246 P1889
You can pass silent = TRUE to suppress the progress report above, i.e., the Left join claims: m/n: Qxxx Pyyy messages.
ds <- wikidata_countries_df %>%
left_join_column(
label = "ISO 3166-1 alpha-2 code",
property = "P297", silent = TRUE
) %>%
left_join_column(
property = "P1566",
label = "Geonames ID",
namespace = "https://www.geonames.org/", silent = TRUE
) %>%
left_join_column(
label = "different from Wikipedia item",
property = "P1889", silent = TRUE
)
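Each lookup pairs one item (QID) with one property (PID). As an illustration of what such a lookup looks like at the API level, the sketch below builds a request URL for the wbgetclaims module of the Wikibase API. The helper claims_url is hypothetical and not part of wbdataset, and this is an assumption about the kind of request involved, not a description of how left_join_column is implemented.

```r
# Hypothetical helper: build a wbgetclaims request URL for one item/property pair.
claims_url <- function(qid, property,
                       api = "https://www.wikidata.org/w/api.php") {
  query <- c(
    action = "wbgetclaims",
    entity = qid,
    property = property,
    format = "json"
  )
  # Percent-encode each value, then join the key=value pairs with "&".
  encoded <- vapply(query, utils::URLencode, character(1), reserved = TRUE)
  paste0(api, "?", paste0(names(query), "=", encoded, collapse = "&"))
}

# Request URL for the ISO 3166-1 alpha-2 code (P297) of Liechtenstein (Q347).
liechtenstein_iso_url <- claims_url("Q347", "P297")
```

For a private Wikibase, the api argument would point to that instance's api.php endpoint instead.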
Let us see the results:
print(ds)
#> Antal D (2024). "European countries."
#> qid label description language rowid P297 P1566 P1889
#> <hvn_lbl_> <hvn_lbl_> <hvn_lbl_> <hvn_lb> <hvn> <hvn> <hvn> <hvn>
#> 1 Q756617 Kingdom of Denmark Kingdom of Den… en eg:1 DK NA Q35
#> 2 Q347 Liechtenstein country in Cen… en eg:2 LI 3042… NA
#> 3 Q3908 Galicia autonomous com… en eg:3 NA 3336… Q180…
#> 4 Q1246 Kosovo country in sou… en eg:4 XK 8310… Q1231
The metadata added to these columns (attributes):
attributes(ds$P1566)
#> $label
#> [1] "Geonames ID"
#>
#> $namespace
#> [1] "https://www.geonames.org/"
#>
#> $class
#> [1] "haven_labelled_defined" "haven_labelled" "vctrs_vctr"
#> [4] "character"
Together with the namespace, the third cell in the P1566 column (the Geonames ID of Galicia) resolves to https://www.geonames.org/3336902. Galicia is not a sovereign state; therefore, it has no P297 value, i.e., it has no ISO country code.
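The resolution amounts to concatenating the namespace attribute with the cell value. A minimal base R sketch follows, using a stand-in vector instead of the real ds$P1566 column; the helper resolve_uri is hypothetical and not part of dataset or wbdataset.

```r
# Stand-in for one cell of the P1566 column, carrying the same
# label and namespace attributes as shown above.
geonames_id <- structure(
  "3336902",
  label = "Geonames ID",
  namespace = "https://www.geonames.org/"
)

# Hypothetical helper: prefix each value with the namespace of its column.
resolve_uri <- function(x) {
  paste0(attr(x, "namespace"), as.character(x))
}

galicia_uri <- resolve_uri(geonames_id)
# galicia_uri is "https://www.geonames.org/3336902"
```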
print(dataset::get_bibentry(ds), style = "bibtex")
#> @Misc{,
#> title = {European countries},
#> author = {Daniel Antal},
#> publisher = {:unas},
#> year = {2024},
#> resourcetype = {Dataset},
#> version = {0.1.0},
#> description = {:unas},
#> language = {:unas},
#> format = {application/r-rds},
#> rights = {:unas},
#> }
The recorded provenance can be queried with provenance(); in this run the call returns NULL:
dataset::provenance(ds)
#> NULL