Tutorial

Querying the Graph using SPARQL

Knowledge Graphs and SPARQL

The biomedical literature is teeming with entities of various kinds, from cells to drugs to diseases. Researchers read articles and books to develop mental models of biomedical domains, and then create new hypotheses and answer questions. With the growth of the biomedical literature and the explosion of computational tools, it became increasingly clear that these mental models needed to be transported out of the mind and into computers.

Biomedical knowledge graphs are computational representations of what we, as biomedical scientists, consider knowledge. Ontologies, a logically rigorous kind of knowledge graph, have flourished in hundreds of domains since the early 2000s. Ontologists catalog and assign identifiers to the entities in their domains of interest: the Cell Ontology catalogs cell types, the Protein Ontology catalogs protein types, the UBERON Anatomy Ontology catalogs anatomical structures, and so on. Moreover, these ontologists carefully curate relations between the entities and catalog them in semantically consistent databases.

Figure: Biomedical entities and their relations in the Orpheus graph

The entities (cells, drugs, diseases, etc.) form the nodes in the graph, while the relations form the graph's edges. The structure containing two nodes and the edge between them is called a triple. In a toy model, imagine a file containing a single statement, a single piece of knowledge:

COVID-19 → has cause → SARS-CoV-2
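As a minimal sketch of the same idea in code, a triple can be held as a three-field record and a graph as a list of such records. The names Triple and graph are illustrative choices, not part of any library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement in the graph: two nodes joined by a labeled edge."""
    subject: str    # the node the statement is about
    predicate: str  # the relation (edge) between the two nodes
    obj: str        # the node the relation points to


# The toy graph from the text: a single piece of knowledge.
graph = [Triple("COVID-19", "has cause", "SARS-CoV-2")]

for t in graph:
    print(f"{t.subject} -> {t.predicate} -> {t.obj}")
```

A real knowledge graph is conceptually the same list, only with millions of entries.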

As words are ambiguous, semantic databases rely on identifiers. One clever trick used by such databases is to use URLs as identifiers. In that way, you do not need to explain a concept: users can load the URL into a web browser and see the description for themselves. The databases use prefixes to make it simple to write the bits of knowledge without too much repetition. For example, the knowledge bit above could be represented as

PREFIX db: <http://example.com/>
db:covid_19 db:has_cause db:SARS-CoV-2

In the above example, PREFIX maps a prefix to a URL, and below that, there is a triple that represents the knowledge. This structure is also used by SPARQL, a query language that processes huge sets of triples. Understanding the structure of the database is halfway to writing SPARQL queries.

One additional challenge for some databases is that the identifiers come from a number of different projects, mixing databases and prefixes. The bit of knowledge above could be represented in the wild as:

PREFIX mondo: <http://purl.obolibrary.org/obo/MONDO_>
PREFIX ro: <http://purl.obolibrary.org/obo/RO_>
PREFIX ncbi: <http://purl.obolibrary.org/obo/NCBITaxon_>

mondo:0100096 ro:0004023 ncbi:2697049
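To make the role of prefixes concrete, the sketch below expands each prefixed name in the triple above into its full URL. The helper name expand and the PREFIXES dictionary are illustrative assumptions; the prefix-to-URL mappings are the ones declared above.

```python
# Prefix declarations copied from the snippet above.
PREFIXES = {
    "mondo": "http://purl.obolibrary.org/obo/MONDO_",
    "ro": "http://purl.obolibrary.org/obo/RO_",
    "ncbi": "http://purl.obolibrary.org/obo/NCBITaxon_",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn 'prefix:local' into a full URL using the declared prefixes."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


triple = ("mondo:0100096", "ro:0004023", "ncbi:2697049")
full_triple = tuple(expand(term) for term in triple)

for url in full_triple:
    print(url)
```

The prefixed form and the expanded form name exactly the same things; the prefix is only a shorthand.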

It quickly becomes hard to track which prefix to use for each biological entity of interest. Ideally, we would have a centralized database, where we could get identifiers for every entity. From that idea came Wikidata, a general database linked to Wikipedia that covers (ideally) all domains of knowledge.

Wikidata and Wisecube

The basic structure of Wikidata follows the W3C standards and uses entities and relationships. For example, the following are the URLs (identifiers) for the concepts “COVID-19” and “SARS-CoV-2” and the relation “has cause”:

COVID-19: http://www.wikidata.org/entity/Q84263196
has cause: http://www.wikidata.org/prop/direct/P828
SARS-CoV-2: http://www.wikidata.org/entity/Q82069695

The piece of information stating that COVID-19 is caused by SARS-CoV-2 would be present in the following format:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

wd:Q84263196  wdt:P828  wd:Q82069695

In the snippet above, wdt: is a prefix for relations and wd: is a prefix for entities. They are the bread and butter of the Wikidata knowledge base, and of SPARQL queries on Wisecube.

The basic structure of SPARQL is inspired by the SQL query language:

SELECT * WHERE {}

In the structure above, the wildcard (*) tells the query service to return every variable used in the pattern. The pattern to match is specified inside the braces ({}). For example, to retrieve all causes of COVID-19 on Wisecube's Graph, one would write:

SELECT * WHERE { wd:Q84263196  wdt:P828  ?disease_cause .  }

Note that PREFIX statements are not needed anymore, as they are already implicitly considered by Wisecube's Query Service. In the snippet above:

?disease_cause is a placeholder that tells the system to return all entities that match this structure,
wd:Q84263196 is the identifier for COVID-19,
wdt:P828 is the identifier for the relation “has cause”.
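The way a placeholder is filled in can be sketched with a toy matcher over a one-triple graph. This is only an illustration of the matching idea, not how a real SPARQL engine is implemented; the function name match and the in-memory TRIPLES list are assumptions made for the example.

```python
# Toy graph holding the single fact used in the text, written with prefixed names.
TRIPLES = [
    ("wd:Q84263196", "wdt:P828", "wd:Q82069695"),  # COVID-19 has cause SARS-CoV-2
]


def match(pattern, triples):
    """Return one {variable: value} dict per triple that fits the pattern.

    Terms starting with "?" are variables; every other term must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated within one pattern must keep the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term disagreed with the triple, so this is a solution.
            results.append(binding)
    return results


print(match(("wd:Q84263196", "wdt:P828", "?disease_cause"), TRIPLES))
```

The fixed terms narrow down which triples qualify, and the variable collects whatever sits in its position in each qualifying triple.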

To test Wisecube's SPARQL query service, you can use the Orpheus API to run the SPARQL query.
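The Orpheus API itself is not described here, so the sketch below does not call it. It assumes a generic endpoint that follows the W3C SPARQL 1.1 Protocol (query sent as a URL-encoded query parameter) and returns the standard SPARQL JSON results format. The endpoint URL is a hypothetical placeholder, and the sample response is hand-written for illustration rather than captured from a real service.

```python
import json
from urllib.parse import urlencode

# Hypothetical placeholder; replace with the endpoint you actually use.
ENDPOINT = "https://example.org/sparql"


def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def rows(response_text):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    data = json.loads(response_text)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


query = "SELECT * WHERE { wd:Q84263196 wdt:P828 ?disease_cause . }"
url = build_request_url(query)
print(url)

# Hand-written sample in the standard SPARQL JSON results layout.
sample = json.dumps({
    "head": {"vars": ["disease_cause"]},
    "results": {"bindings": [
        {"disease_cause": {"type": "uri",
                           "value": "http://www.wikidata.org/entity/Q82069695"}}
    ]},
})
print(rows(sample))
```

Sending the request (for example with urllib.request) is left out on purpose, since authentication and the real endpoint address depend on your Wisecube setup.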

If you ran it, you saw that it provided only a link for the ID of “SARS-CoV-2” (Q82069695), but did not show the label. For SPARQL to get it, you will have to tell it explicitly, as in the query below:

SELECT * WHERE 

{ 
   wd:Q84263196  wdt:P828  ?disease_cause .  
   ?disease_cause rdfs:label ?disease_cause_label . 
   FILTER ( LANG(?disease_cause_label) = "en")
}

The lines above look for the labels associated with the identifier “?disease_cause”. As the knowledge graph is multilingual, we have to add a new line and say that we want only the labels in the English language.
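The effect of the language filter can be sketched with a small lookup table. The labels below are placeholders invented for the example (they are not copied from the knowledge graph), and labels_in is a hypothetical helper name.

```python
# Hypothetical multilingual labels for one entity, as (text, language tag) pairs.
LABELS = {
    "wd:Q82069695": [
        ("SARS-CoV-2", "en"),
        ("<label in Portuguese>", "pt"),
        ("<label in French>", "fr"),
    ],
}


def labels_in(entity, lang, labels=LABELS):
    """Keep only the labels of `entity` whose language tag equals `lang`."""
    return [text for text, tag in labels.get(entity, []) if tag == lang]


# Without the filter, every language comes back.
print([text for text, _ in LABELS["wd:Q82069695"]])

# With the filter, only the English label remains.
print(labels_in("wd:Q82069695", "en"))
```

Dropping the FILTER line from the query has the same effect as skipping the language check here: one result row per available language.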

In the Wisecube knowledge graph, we are implementing 3 different ways to navigate the network of knowledge: direct SPARQL queries, a visual query builder and natural language to SPARQL conversions.

One advantage of using Wisecube, beyond the tailored support for your needs, is the enrichment of the graph. For example, let’s say you want to find the diseases associated with the gene MTND6. For that we will need:

The Q-ID for MTND6: wd:Q18029567
The P-ID for “genetic association” : wdt:P2293

From which we get the following query:

SELECT * WHERE
{
   wd:Q18029567 wdt:P2293  ?disease  . 
   ?disease rdfs:label ?disease_label .
   FILTER ( LANG(?disease_label) = "en")
}

Wikidata (as of 12 April 2022) lists 2 diseases for that query: Leber hereditary optic neuropathy and MELAS syndrome. By running the exact same query on Wisecube’s knowledge graph you get these 2 hits, plus the general concept of “mitochondrial disease” and “Leigh Syndrome”, which is also associated with MTND6 mutations in the literature (Bakare et al, 2021).

Chaining information: going beyond Google

The power of SPARQL queries lies not in these simple questions, but in chaining and combining information.

Let’s say, for example, that you want to know the diseases associated with genes in the Y chromosome. As you can see below, Google (and its own Knowledge Graph) fails to provide an answer.

To write the same query in SPARQL you need 2 relations and 1 entity:

wdt:P2293 (genetic association) which links genes and diseases.
wdt:P1057 (present in chromosome) which links a gene to a chromosome.
wd:Q2966734, the entry for “human Y chromosome” on Wikidata.

SELECT * WHERE 
{ 
   ?disease wdt:P2293  ?gene .
   ?gene wdt:P1057 wd:Q2966734 . 
  
   ?gene rdfs:label ?gene_label . 
   FILTER ( LANG(?gene_label) = "en")
  
   ?disease rdfs:label ?disease_label . 
   FILTER ( LANG(?disease_label) = "en")
}

Run the query above on Wisecube and you'll get the answer to your question, relying on the connectedness of Wikidata. With such queries, you start getting insights that are way beyond what Google can provide. In the Wisecube platform, you can see hundreds of examples of biomedical questions and their equivalent in SPARQL.
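Chaining works because a variable shared by two patterns must take the same value in both. The sketch below illustrates this with a toy in-memory graph; the ex: identifiers are invented for the example, and match and query are hypothetical helper names, not part of any SPARQL library.

```python
# Hypothetical toy graph; the ex: identifiers are invented for illustration.
TRIPLES = [
    ("ex:disease_A", "wdt:P2293", "ex:gene_1"),
    ("ex:disease_B", "wdt:P2293", "ex:gene_2"),
    ("ex:gene_1", "wdt:P1057", "wd:Q2966734"),      # gene_1 is on the Y chromosome
    ("ex:gene_2", "wdt:P1057", "ex:chromosome_7"),  # gene_2 is somewhere else
]


def match(pattern, triples, binding):
    """Yield extensions of `binding` for each triple that fits `pattern`."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # An already-bound variable must agree with this triple.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def query(patterns, triples):
    """Apply the patterns in order, carrying bindings from one to the next."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for b in match(pattern, triples, s)]
    return solutions


answers = query(
    [
        ("?disease", "wdt:P2293", "?gene"),
        ("?gene", "wdt:P1057", "wd:Q2966734"),
    ],
    TRIPLES,
)
print(answers)
```

Only ex:disease_A survives, because its gene is the one linked to the Y chromosome; ?gene acts as the join between the two patterns.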

The Wikidata community provides a number of tutorials that might be worth exploring if you are looking to improve your SPARQL skills. They are not tailored to the biomedical domain, but the skills they teach carry over directly to biomedical queries:

Wikidata:SPARQL tutorial (basic/intermediate)
https://wdqs-tutorial.toolforge.org/ (basic/intermediate interactive tutorial by Wikimedia Israel)

Conclusion

This short tutorial introduces SPARQL and the querying of biomedical knowledge graphs. Although learning SPARQL is challenging, it will provide you with the tools to interact with modern knowledge graphs. We hope it inspires you to create hypotheses and questions that will bring value to the biomedical network.

If you are interested in learning more about some aspects of SPARQL in biomedical research, do not hesitate to contact us. Feel free also to make query requests and see how we transform the research questions into powerful SPARQL queries.