You can read from the Wikidata API (GET requests) without a password. To get API access to a privately managed Wikibase instance, however, you need a “bot” username and password, and you must authenticate your scripted session.
Security first
Cross-Site Request Forgery (CSRF) is a way for a malicious website to exploit your logged-in session on another website and perform actions as you without your consent. The MediaWiki API (the API of Wikibase instances) uses CSRF tokens (often called “edit tokens” in the MediaWiki context) as a crucial defense against such attacks.
To receive such a CSRF token (tied to your current editing session on a Wikibase instance), you first have to run get_csrf() to establish your session with your credentials. It returns a CSRF object that contains, among other data, your CSRF token. The get_csrf_token() function then unwraps the token itself from the returned CSRF object; the token is often, but not necessarily, a character string of length 42.
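A quick sanity check on the unwrapped token can reveal a login that silently failed. The sketch below is not part of wbdataset; looks_logged_in() is a hypothetical helper. It assumes the token arrives as a plain character string in the usual MediaWiki form, where tokens end in `+\` and an anonymous (not logged-in) session receives the bare string `+\`.

```r
# Hypothetical helper, not part of wbdataset: check whether a token looks
# like it belongs to a logged-in session rather than an anonymous one.
looks_logged_in <- function(token) {
  is.character(token) && length(token) == 1 &&
    !identical(token, "+\\") &&  # the bare suffix is the anonymous token
    endsWith(token, "+\\")       # assumption: session tokens end in the same suffix
}
```

If the check returns FALSE, it is usually worth re-running get_csrf() before sending any write request.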
The MediaWiki API’s CSRF tokens are not designed to be long-lived. Their lifespan is intentionally kept short, typically tied to the user’s current editing session or a reasonable timeframe for a single action. This is a crucial security measure, and therefore you may need to log in several times while you are working with wbdataset.
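Because tokens expire, a long-running script may want to log in again automatically. The following is a minimal sketch, not a wbdataset function: with_fresh_csrf(), `action` and `login` are hypothetical names. It assumes that `action` returns the parsed API response as a list, and it relies on the MediaWiki convention of reporting a rejected token with the error code "badtoken".

```r
# Hypothetical helper, not part of wbdataset: run an action and retry it
# once with a fresh CSRF object if the API rejects the token.
#   action: function(csrf) that sends one request and returns the parsed
#           response as a list (assumption)
#   login:  function() that returns a new CSRF object, for example a
#           wrapper around get_csrf()
with_fresh_csrf <- function(action, login, csrf = login()) {
  result <- action(csrf)
  # MediaWiki signals an invalid or expired CSRF token with "badtoken".
  if (identical(result$error$code, "badtoken")) {
    csrf <- login()         # log in again to obtain a new token
    result <- action(csrf)  # retry exactly once
  }
  result
}
```

Retrying only once keeps a genuinely wrong password from turning into an endless login loop.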
A convenient and elegant solution for storing your login credentials in R is the keyring package.
require(keyring)
#> Loading required package: keyring
#> Warning: package 'keyring' was built under R version 4.4.2
If keyring is not installed on your computer, you can install it like this:
install.packages("keyring")
To get started, read the keyring package website. The first time you use it, set up your credentials with key_set_with_value().
Use wbdataset on a Wikibase instance
You can of course use other secure ways to authenticate yourself; what matters is that you will regularly have to give your username and password to get_csrf(). Do this in a way that puts you at absolutely no risk of revealing your bot password on the internet. With such a password, a malicious actor can destroy an entire open knowledge graph.
library(keyring)
# https://reprexbase.eu/example/api.php does not exist in reality
key_set_with_value(
  service = "https://reprexbase.eu/example/api.php",
  username = "Demo@adminbot",
  password = "8******************************b"
)
Never store username-password pairs in your code: they could accidentally end up on GitHub, where they would be permanently revealed to everybody. (There is no delete button on GitHub!)
# Retrieve the password stored above (same service and username)
key_get(
  service = "https://reprexbase.eu/example/api.php",
  username = "Demo@adminbot"
)
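As one example of a different secure way to supply the password, it can be read from an environment variable instead of the keyring. This is only a sketch: get_bot_password() and the variable name WIKIBASE_BOT_PASSWORD are hypothetical and not part of wbdataset or keyring.

```r
# Hypothetical helper: read the bot password from an environment variable
# and stop with an error if it is missing, instead of using a default.
get_bot_password <- function(var = "WIKIBASE_BOT_PASSWORD") {
  password <- Sys.getenv(var, unset = NA_character_)
  if (is.na(password) || !nzchar(password)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  password
}
```

The variable could be defined in a user-level .Renviron file that is kept outside any version-controlled project folder, so that the password never appears in a script.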
To gain access to a MediaWiki API, you need to get a CSRF token, which is a secure random token (e.g., synchronizer token or challenge token) that is used to prevent CSRF attacks.
my_csrf <- get_csrf(
  username = "Demo@adminbot",
  password = key_get(
    service = "https://reprexbase.eu/example/api.php",
    username = "Demo@adminbot"
  ),
  wikibase_api_url = "https://reprexbase.eu/example/api.php"
)
If all goes well, you receive similar messages on your terminal (the API instance as well as the login token are fictional in this example):
Received a `handle`: https://reprexbase.eu/example/api.php
response: Establish session with https://reprexbase.eu/example/api.php: 200
Session: OK(200)
response: login to https://reprexbase.eu/example/api.php: 200
Login: OK(200)
Login token: 7c7af4d52x***********
Post login data to https://reprexbase.eu/example/api.php
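Since you may have to log in several times in one working session, it can help to wrap the keyring lookup and the login into a single call. The sketch below is a hypothetical convenience wrapper, not a wbdataset function. It assumes that the credential was stored with the API URL as the keyring service name, as in the example above; the `get_password` and `login` arguments exist only so that the wrapper can be tried out without a live instance.

```r
# Hypothetical wrapper: look up the bot password in the keyring and pass it
# directly to get_csrf(), so the password never appears in a script.
# The defaults are evaluated lazily, so keyring and wbdataset are only
# needed when no replacement functions are supplied.
login_wikibase <- function(username,
                           wikibase_api_url,
                           get_password = keyring::key_get,
                           login = wbdataset::get_csrf) {
  login(
    username = username,
    password = get_password(
      service = wikibase_api_url,  # assumption: service name is the API URL
      username = username
    ),
    wikibase_api_url = wikibase_api_url
  )
}
```

With the fictional instance used here, the call would read login_wikibase("Demo@adminbot", "https://reprexbase.eu/example/api.php").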
To interact with the Wikibase instance, you have to provide the API address of the instance and this CSRF object. Under the hood, the get_csrf_token(csrf) function unwraps your token from the CSRF object, which holds the data of your session (not only the token itself).
Now you are ready to send API requests to the MediaWiki API.
get_claims(
  qid = "Q528626",
  property = "P625",
  # Use the API URL of your instance, not a placeholder string
  wikibase_api_url = "https://reprexbase.eu/example/api.php",
  csrf = my_csrf
)