Wikidata and SPARQL
From time to time I come back to playing with Wikidata. I will start entering some data and, invariably, tinker with the SPARQL service it provides. Every single time I need to relearn this, because I do not use it often enough. So this time I figured I would keep some notes I could more easily reference next time.
Wiki What? SPARQL Who?
Wikidata is a big open database of RDF triples (also called semantic triples). In essence, RDF triples come down to statements like: subject predicate object. An example would be: Ward is a human.
SPARQL (SPARQL Protocol and RDF Query Language) is a language for querying that information. Think SQL, but for RDF triple databases. Wikidata provides a service where you can enter SPARQL and query Wikidata.
What Makes a Query
Basic Search
The basic search has a SELECT and a WHERE. After the SELECT you list the variables (starting with a ?) that you want returned in the results. In the WHERE you can throw a bunch of statements about RDF triples. You can also group and order and such after the WHERE. I will go into those later.
SELECT ?item
WHERE {
# Various statements that limit the number of matches
# For example property P31 "instance of" Q146 "house cat"
?item wdt:P31 wd:Q146.
}
Prefixes
These will pop up a bit everywhere. They tell the SPARQL service how to interpret the Qnnn and Pnnn identifiers that follow the :. Do not worry if that does not make sense, it will once you see some code.
wd: Wikidata entity
wdt: Wikidata property
p: Access the statement itself (see Qualifiers below)
ps: Access the main data from a statement (see Qualifiers below)
pq: Access a qualifier (see Qualifiers below)
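For reference, a sketch of what these prefixes look like when written out in full. The Wikidata service predefines them, so declaring them yourself should only be needed on endpoints that do not. The IRIs below are the standard Wikidata namespaces as I understand them, not something specific to these notes.

```sparql
# Explicit declarations of the prefixes used in these notes.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>

SELECT ?item
WHERE {
  # wdt:P31 is shorthand for <http://www.wikidata.org/prop/direct/P31>
  # wd:Q146 is shorthand for <http://www.wikidata.org/entity/Q146>
  ?item wdt:P31 wd:Q146.
}
LIMIT 10
```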
Statements
A regular statement inside a WHERE. You can turn each part of “subject property object” into a variable. Note the Qnnn are entities in Wikidata, where nnn is the entity's ID in the database. Similarly, the Pnnn are properties in the Wikidata database.
# ?cat <property:instance of> <entity:house cat>
?cat wdt:P31 wd:Q146.
Use ; for multiple statements about one subject. End a statement with ; instead of . and the next statement reuses the same subject. You can keep chaining statements this way.
# ?cat <instance of> <house cat>
?cat wdt:P31 wd:Q146;
# <position held> <Chief Mouser to the Cabinet Office>
wdt:P39 wd:Q198641.
Property Combinations
Use the forward slash / to chain properties.
# <subjects> <position held>/<part of> <objects>
?cat wdt:P39/wdt:P361 ?organisation.
is short for
# <subjects> <position held> <objects>
?cat wdt:P39 ?position.
# <previousobjects now subjects> <part of> <objects>
?position wdt:P361 ?organisation.
[ ... ] can achieve a similar effect. The internet tells me it is more flexible than /, but potentially slower. It creates a “blank node” that you can use in place of the object. The SPARQL grammar also allows it in place of the subject, though not the property. The following seem to behave the same, but I cannot make guarantees.
?entity wdt:P39 ?position.
?position wdt:P263 wd:Q169101.
?entity wdt:P39/wdt:P263 wd:Q169101.
# entities <position held> [ <official residence> <10 Downing Street> ]
?entity wdt:P39 [ wdt:P263 wd:Q169101 ].
^ reverses the “subject property object” to “object property subject”. Note it only reverses the property it is in front of, not any others you might chain with /.
# <house cat> <reversed instance of> <the actual cats>
wd:Q146 ^wdt:P31 ?cat.
The multipliers * and + make a property match zero or more times and one or more times, respectively. In this example, that lets us get all the subclasses of house cat, however many steps away. The difference between * and + is that, in the example, the * will also return the house cat entity itself, while the + omits it.
# <cat types> <subclass of>* <house cat>
?cattypes wdt:P279* wd:Q146.
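For comparison, a minimal sketch of the + variant as a complete query, assuming the Wikidata service with its predefined prefixes. Barring a cycle in the subclass data, the house cat entity itself should not show up in the results.

```sparql
SELECT ?cattypes
WHERE {
  # <cat types> <subclass of>+ <house cat>
  # At least one "subclass of" step is required, so wd:Q146 itself is not matched.
  ?cattypes wdt:P279+ wd:Q146.
}
```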
Note that you can combine all the above into one funky matcher. Try to balance brevity and readability.
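As an illustration of combining path operators, here is a sketch that chains / with *. It assumes the Wikidata service and should return entities that are an instance of house cat or of any of its (transitive) subclasses. The LIMIT is only there to keep the result small.

```sparql
SELECT ?cat
WHERE {
  # <cats> <instance of>/<subclass of>* <house cat>
  # The * applies only to wdt:P279, not to the whole chain.
  ?cat wdt:P31/wdt:P279* wd:Q146.
}
LIMIT 100
```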
Qualifiers
You can use p: to get a reference to the entire statement instead of to the object. From such a statement, you can then get the object by using ps:. So, for example, rather than getting all entities holding the position of Chief Mouser, we can get the statement that says “entity holds position something”, then specify that the position held is Chief Mouser. Note we use the same property (P39) for each.
# Following are same as
# # entities <position held> <Chief Mouser to the Cabinet Office>
# ?cat wdt:P39 wd:Q198641.
?cat p:P39 ?positionheldstatement.
?positionheldstatement ps:P39 wd:Q198641.
“What is the point?”, I hear you ask. A statement can come with qualifiers. There have been several Chief Mousers and for each, the position held statement comes with a qualifier such as its start time, end time, series ordinal. Now that we have grabbed hold of a statement, we can reach those qualifiers.
?cat p:P39 ?positionheldstatement.
?positionheldstatement ps:P39 wd:Q198641.
# that statement <start time> capture in variable
?positionheldstatement pq:P580 ?started.
# that statement <end time> capture in variable
?positionheldstatement pq:P582 ?ended.
Note that the current Chief Mouser (at the time of writing: Larry) does not have an end date, so he would not actually appear in the results here. To make him reappear, you could wrap that final statement in an OPTIONAL { }.
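A sketch of that query with the end time made optional, assuming the Wikidata service. The ORDER BY is an addition for readability and not required.

```sparql
SELECT ?cat ?started ?ended
WHERE {
  ?cat p:P39 ?positionheldstatement.
  ?positionheldstatement ps:P39 wd:Q198641.
  # that statement <start time> capture in variable
  ?positionheldstatement pq:P580 ?started.
  # The end time may be missing (current holder), so do not require it.
  OPTIONAL { ?positionheldstatement pq:P582 ?ended. }
}
ORDER BY ?started
```

Rows without an end time simply leave ?ended empty.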
More Keywords
Make a statement OPTIONAL. Without this, items that do not match a certain statement might get filtered out. Every statement not wrapped in an OPTIONAL must have a match.
OPTIONAL { statement }
Use UNION between groups of statements to combine their results.
{statement}
UNION
{statement}
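A concrete sketch, assuming the Wikidata service and reusing the identifiers from earlier: everything that is a house cat, plus everything that has held the Chief Mouser position. An item matching both branches would come back twice, hence the DISTINCT.

```sparql
SELECT DISTINCT ?item
WHERE {
  # item <instance of> <house cat>
  { ?item wdt:P31 wd:Q146. }
  UNION
  # item <position held> <Chief Mouser to the Cabinet Office>
  { ?item wdt:P39 wd:Q198641. }
}
```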
DISTINCT removes duplicate rows from the results.
SELECT DISTINCT ...
WHERE {
}
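A filled-in sketch, assuming the Wikidata service: list each position held by a cat only once, no matter how many cats have held it.

```sparql
SELECT DISTINCT ?position
WHERE {
  # cat <instance of> <house cat>
  ?cat wdt:P31 wd:Q146.
  # cat <position held> some position
  ?cat wdt:P39 ?position.
}
```

Without the DISTINCT, a position would appear once per cat that held it.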
COUNT to get the number of matches. More efficient than returning all matches and counting them on your end. Note that you have to give a name to the result. Also note that you have to wrap the entire count-and-rename into parentheses or you will get a syntax error.
SELECT (COUNT(?cat) AS ?numberofcats)
WHERE {
# cat <instance of> <house cat>
?cat wdt:P31 wd:Q146.
}
GROUP BY to combine results by means of variable values. Note that everything that appears in the SELECT part has to be a variable that is grouped by or something that gets aggregated by means of an expression such as COUNT.
SELECT ?position (COUNT(?cat) AS ?numberofcats)
WHERE {
?cat wdt:P31 wd:Q146.
?cat wdt:P39 ?position.
}
GROUP BY ?position
The requirement that every selected variable appears in the GROUP BY also applies to labels!
SELECT ?position ?positionLabel (COUNT(?cat) AS ?numberofcats)
WHERE {
?cat wdt:P31 wd:Q146.
?cat wdt:P39 ?position.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
GROUP BY ?position ?positionLabel # MUST add it here too
Combine GROUP BY with HAVING if you want to put limits on the groups you get. You can only use HAVING in combination with a grouping. For example, if you only want positions held by at least two cats, do this. Note the parentheses around the condition.
SELECT ?position (COUNT(?cat) AS ?numberofcats)
WHERE {
?cat wdt:P31 wd:Q146.
?cat wdt:P39 ?position.
}
GROUP BY ?position
HAVING (?numberofcats > 1)
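A variant sketch that repeats the aggregate inside HAVING instead of using the ?numberofcats alias. As far as I understand the SPARQL specification, this is the more portable form, so it may help if some other engine does not accept the alias there. That is an assumption on my part, not something tested across engines.

```sparql
SELECT ?position (COUNT(?cat) AS ?numberofcats)
WHERE {
  ?cat wdt:P31 wd:Q146.
  ?cat wdt:P39 ?position.
}
GROUP BY ?position
# Same condition as before, written with the aggregate itself
HAVING (COUNT(?cat) > 1)
```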
Use ORDER BY to sort your results. To choose ascending or descending, wrap the variable in ASC or DESC. The default is ascending.
SELECT ?cat ?dob
WHERE {
# cat <instance of> <house cat>
?cat wdt:P31 wd:Q146.
# subject <date of birth> object
?cat wdt:P569 ?dob.
}
ORDER BY DESC(?dob)
Sometimes you get too many results. Throw in a LIMIT at the end to decide the maximum number of results.
SELECT ?cat ?dob
WHERE {
# cat <instance of> <house cat>
?cat wdt:P31 wd:Q146.
# subject <date of birth> object
?cat wdt:P569 ?dob.
}
ORDER BY DESC(?dob)
LIMIT 20
Other keywords that might come in handy: FILTER(), YEAR().
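A small sketch combining the two, assuming the Wikidata service: cats with a date of birth in the year 2000 or later. The cut-off year is arbitrary and only for illustration.

```sparql
SELECT ?cat ?dob
WHERE {
  # cat <instance of> <house cat>
  ?cat wdt:P31 wd:Q146.
  # cat <date of birth> object
  ?cat wdt:P569 ?dob.
  # Keep only rows where the year part of the date is 2000 or later
  FILTER (YEAR(?dob) >= 2000)
}
ORDER BY ?dob
LIMIT 20
```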
Label
Adding a label the Wikidata way. The variable name is implicit: the service appends Label to the existing variable name. You can also use ?catDescription to get the description text.
SELECT ?cat ?catLabel
WHERE
{
?cat wdt:P31 wd:Q146.
# Label in your language, if not, then default for all languages, then en language
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
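The same query extended with the description variable, as a sketch assuming the Wikidata service. The LIMIT is only there to keep the output short.

```sparql
SELECT ?cat ?catLabel ?catDescription
WHERE
{
  ?cat wdt:P31 wd:Q146.
  # The label service fills in ?catLabel and ?catDescription based on ?cat
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
LIMIT 20
```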
Adding a label slightly more generically (and less automatically) lets you pick the variable name. Note that this only selects the label in a single language. Without FILTERing on the label's language, the row is returned once for every language that has a label.
SELECT ?cat ?name
WHERE
{
?cat wdt:P31 wd:Q146.
?cat rdfs:label ?name.
FILTER (LANG(?name) = "en")
}
QLever
I was going to go into QLever as well, which is a supposedly faster query engine with a public instance that, besides Wikidata, also offers OpenStreetMap, IMDb, and a bunch of other data as triples. I think I will keep that for a possible future post. QLever’s Wikidata service should behave more or less the same as described here, though it might require explicitly defining the Wikidata prefixes beforehand. As far as I can tell, it automatically adds them for you though. Also note that Wikidata’s service usually updates within a few minutes. In QLever it seems to take maybe a week.
Further Reading
- Wikidata:SPARQL query service
- Wikidata Query Service Tutorial. Beware, this site is terribly slow.