
Welcome to the fifth issue of the MaRDI Newsletter on mathematical research data. In the first four issues, we focused on the FAIR principles. Now we move to a topic that makes use of FAIR data and also implements the FAIR principles in data infrastructures. So without further ado, let me introduce you to the ultimate use case of FAIR data: Knowledge Graphs.
by Ariel Cotton, licensed under CC BY-SA 4.0.
Knowledge graphs are very natural and represent information similarly to how we humans think. They come in handy when you want to avoid redundancy in storing data (which happens quite often with tabular methods), and also for complex dataset queries.
This newsletter issue offers some insight into the structure of knowledge, examples of knowledge graphs, including some specific to MaRDI, an interview with a knowledge graph expert, as well as news and announcements related to research data.
In the last issue, we asked how long it would take you to find and understand your own research data. These are the results:
Now we ask you for specific challenges when searching for mathematical data. You may choose from the multiple-choice options or enter something else you faced.
Click to enter your challenges!
You will be taken to the results page automatically after submitting your answer. Additionally, the current results can be accessed here.


The knowledge ladder
We are not sure exactly how humans store knowledge in their brains, but we certainly pack concepts into units and then relate those conceptual units together. For example, if asked to list animals, nobody recites an alphabetical list (unless you have explicitly trained yourself to remember one). Instead, you start with something familiar, like a dog; then you recall that a dog is a pet, and you list other pets, like a cat or a canary. Then you recall that a canary is a bird, and you list other birds, like an eagle, a falcon, an owl… When you run out of birds, you recall that birds fly in the air, which is one environmental medium. Another medium is water, and this prompts you to start listing fish and sea animals. This suggests that we can represent human knowledge in the form of a mathematical graph: concepts are nodes, and relationships are edges.

This structure is also ingrained in language, which is how humans communicate and store knowledge. All languages in the world, across all cultures, have nouns, verbs, and adjectives, and establish relationships through sentences. Almost every language organizes sentences around a subject, a verb, and an object (in one of the possible orders: SVO, SOV, VSO, etc.). The subject and the object are typically nouns or pronouns, while the verb often expresses a relationship. A sentence like “my mother is a teacher” encodes the following knowledge: the person “my mother” is node 1, “teacher” is node 2, and “has as a job” is a relational edge from node 1 to node 2. There is also a node 3, the person “me”, and a relationship “is the mother of” from node 1 to node 3 (which implies a reciprocal relationship “is a child of” from node 3 to node 1).
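As a minimal sketch of this encoding, the sentence could be stored as three triples, written here in the SPARQL syntax we will meet later in this issue. The ex: namespace and all the names are invented purely for illustration:

```sparql
# Hypothetical sketch: "my mother is a teacher" as graph triples.
# The ex: namespace and every name here are invented for illustration.
PREFIX ex: <http://example.org/>
INSERT DATA {
  ex:Mary ex:hasJob     ex:Teacher .  # node 1 --"has as a job"--> node 2
  ex:Mary ex:isMotherOf ex:Me .       # node 1 --"is the mother of"--> node 3
  ex:Me   ex:isChildOf  ex:Mary .     # reciprocal edge from node 3 to node 1
}
```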

David Somerville / Hugh McLeod
informationversusknowledge-blog.tumblr.com/
From this construction, we can hope to build an abstract representation of human knowledge that we can store, retrieve, and search with a computer. But not all data automatically yields knowledge, and raw knowledge is not all you may need to solve a problem. This distinction is sometimes referred to as the “knowledge ladder”, although the terminology is not universally agreed upon. On this ladder, data is the lowest level: raw input values that we have collected with our senses or with a sensor device. Information is data tagged with meaning: I am a person, that person is called Mary, teaching is a job, this thing I see is a dog, this list of numbers are daily temperatures in Honolulu. Knowledge is achieved when we find relationships between pieces of information: Mary is my mother, Mary’s job is teacher, these animals live together and compete for food; pressure, temperature, and volume in a gas are related by the gas law PV=nRT. Insight is discernment: singling out the information that is useful for your purpose, and finding seemingly unrelated concepts that behave alike. Finally, wisdom is understanding the connections between concepts; it is the ability to explain step by step how concept A relates to concept B. This ladder is illustrated in the image above. From this point of view, “research” means knowing and understanding the portion of human knowledge that falls within or close to your domain, and then enlarging the graph with more nodes and edges, for which you need both insight and wisdom.
The advent of knowledge graphs
Knowledge graphs (KGs) as a theoretical construction have been discussed in information theory, linguistics, and philosophy for at least five decades, but it is only in this century that computers have allowed us to implement algorithms and data retrieval at a practical and massive scale. Google introduced its own knowledge graph in 2012; you may be familiar with it. When you look up a person, a place, etc. on Google, a small box to the right displays some key information: birth date and achievements for a person, opening times for a shop, and so on. This information is not a snippet from a website; it is information collected from many sources and packed into a node of a graph. Those nodes are then linked together by affinity relationships. For instance, if you look up “Agatha Christie”, you will see an “infobox” with her dates of birth and death, a short description extracted from Wikipedia, a photograph… and also a “People also search for” list that will bring you to her relatives, such as Archibald Christie, or to other British authors, such as Virginia Woolf.
But probably the biggest effort to bring all human knowledge into structured data is Wikidata, a sister project of Wikipedia. Wikipedia aims to gather all human knowledge in the form of encyclopedic articles, that is, as non-structured, human-readable data. Wikidata, by contrast, is a knowledge graph. It is a directed labeled graph, made of triples of the form subject (node) - predicate (edge) - object (node). The nodes and edges are labeled; in fact, each carries a whole list of attributes.
The Wikidata graph is not designed to be used directly by humans. It is designed for automatic information retrieval, to be a “base of truth” that can be relied on. For instance, it can be used to check automatically that all language editions of Wikipedia state basic facts correctly (birthplace, list of authored books…), and it can be used by external services (such as Google, other search engines, or voice assistants) to offer correct and verifiable answers to queries.
In practice, nodes are pages, for instance, this one for Agatha Christie. Each page lists “statements”, which are the labeled edges to other nodes. For example, Agatha Christie is an instance of a human, her native language is English, and her fields of work include crime novel and detective literature, among others. If we compare that page with the Agatha Christie entry in the English Wikipedia, the latter clearly contains more information, and the Wikidata page is less convenient for a human to read. Potentially, all the ideas described with English sentences in Wikipedia could be represented by relationships in the Wikidata graph, but this task is tedious and difficult for a human, and AI systems are not yet sufficiently developed to make this conversion automatically.
On the backend, Wikidata is stored in relational SQL databases (running the same Mediawiki software as Wikipedia), but the graph model is that of subject-predicate-object triples as defined in the web standard RDF (Resource Description Framework). This graph structure can be explored and queried with the SPARQL language (SPARQL Protocol And RDF Query Language). Note that we usually use the verb “query”, as opposed to “search”, when we want to retrieve information from a graph, database, or other structured source of information.
Thus, one can access the Wikidata information in several ways. First, one can use the web interface to access single nodes. The web interface has a search function that allows one to look up pages (nodes) containing a certain search string. However, it is much more insightful to get information that takes advantage of the graph structure, that is, to query for nodes that are connected to some topic by a particular predicate (statement), or that have a particular property. For Wikidata, we have two main tools: direct SPARQL queries and the Scholia plug-in tool.
The web interface and API at query.wikidata.org allow one to send queries in the SPARQL language. This is the most powerful kind of search; you can browse the examples on that site. The output can be a list, a map, a graph, etc. There is a query-builder help function, but essentially it requires some familiarity with SPARQL. Scholia, on the other hand, is a plug-in tool that helps with querying and visualizing the Wikidata graph. For instance, searching for “covid-19” via Scholia offers a graph of related topics, a list of authors and recent publications on the topic, organizations, etc., in different visual forms.
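To give a concrete taste, here is a short query that can be pasted into query.wikidata.org: it lists works whose author (Wikidata property P50) is Agatha Christie (item Q35064; it is worth double-checking the identifier on her entry page). The wdt:, wd:, wikibase:, and bd: prefixes are predefined by the query service.

```sparql
# Works whose author (P50) is Agatha Christie (Q35064).
# The query service predefines the wd:/wdt: and label-service prefixes.
SELECT ?work ?workLabel WHERE {
  ?work wdt:P50 wd:Q35064 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 25
```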
Knowledge graphs, artificial intelligence, and mathematics
Knowledge graphs are a hot research area in connection with artificial intelligence. On the one hand, there is the challenge of creating a KG from natural language text (for instance, in English). While detecting grammar and syntax rules (subject, verb, object) is relatively doable, creating a knowledge graph requires encoding the semantics, that is, the meaning of the sentence. In the example from a few paragraphs above, “my mother is a teacher”, extracting the semantics requires the context of who “me” is (who utters the sentence); we need to check whether we already know the person “my mother” (her name, some kind of identifier), etc. The node for that person may live in a small KG with family or contextual information, while “teacher” can be part of a more general KG of common concepts.
In the case of mathematics, extracting a KG from natural language is a tremendous challenge, unfeasible with today’s techniques. Take a theorem statement: it contains definitions, hypotheses, and conclusions, and each one has a different context of validity (the conclusion is only valid under the hypotheses, but that is what you need to prove). Then imagine that you start your proof by contradiction (reductio ad absurdum), so you have several sentences that are valid under the assumption that the hypotheses of the theorem hold but the conclusion does not. At some point, you want to find a contradiction with your previous knowledge, thus proving the theorem. The current knowledge graph paradigm is simply not suitable for following this type of argument. The closest thing to structured data for theorems and proofs are formal languages in logic, and there are practical implementations such as the LEAN Theorem Prover. LEAN is a programming language that can encode symbolic manipulation rules for expressions. A proof by algebraic manipulation of a mathematical expression can therefore be described as a list of manipulations of an original expression (move a term to the other side of the equals sign, raise the second index in this tensor using a metric…). Writing proofs in LEAN can be tedious, but it has the benefit of being automatically verifiable by a machine: there is no need for a human referee. Of course, we are still far from an AI that can check the validity of an informal, natural-language proof without human intervention, let alone figure out proofs of conjectures on its own. On the other hand, a dependency graph of theorems, derived in a logical chain from some axioms, is something that a knowledge graph like the MaRDI KG would be suitable to encode.
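To give a flavor of what a machine-checkable proof looks like, here is a minimal Lean 4 sketch (our illustration, not MaRDI material): a proof that addition of natural numbers commutes, which the Lean kernel verifies with no human referee involved.

```lean
-- Minimal Lean 4 sketch: addition on natural numbers commutes.
-- `Nat.add_comm` is the library lemma; `rw` rewrites the goal with it,
-- and the kernel then checks the resulting proof term automatically.
example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]
```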
In any case, structured knowledge (in the form of KGs or other forms, such as databases) is a fundamental building block for providing AI systems with a source of truth. Recent advances in generative AI include the famous conversation bots, ChatGPT and other Large Language Models (LLMs), which are impressive in that they can generate grammatically correct text with meaningful sentences while keeping track of a conversation. However, these systems are famous for not being able to distinguish truth from falsehood (to be precise, the AI is trained on text data that is assumed to be mostly true, but it cannot make logical deductions). If we ask an AI for the biography of a nonexistent person, it may simply invent one in trying to fulfill the task. If we flatly contradict the AI, it will probably just accept our input, despite its previous answer. Currently, conversational AI systems are not capable of rebutting false claims by providing evidence. In the foreseeable future, however, a conversational AI with access to a knowledge base (a KG, a database, or other) will be able not only to process queries and generate answers in natural language, but also to check facts and present relevant information extracted from the knowledge base. An example in this direction is the Wolfram Alpha plug-in for ChatGPT. With enhanced algorithms to traverse and explore a knowledge graph, we may witness AI systems stepping up from Knowledge to Insight, or further up the ladder.


One of the mottos of MaRDI is “Your Math is Data”. Indeed, from an information theory perspective, all mathematical results (theorems, proofs, formulas, examples, classifications) are data, and some mathematicians also use experimental or computational data (statistical datasets, algorithms, computer code…). MaRDI intends to create the tools, the infrastructure, and the cultural shift to manage and use all research data efficiently. In order to climb up the “knowledge ladder” from Data to Information and Knowledge, the Data needs to be structured, and knowledge graphs are one excellent tool for that goal.
AlgoData
Several initiatives within MaRDI are based on knowledge graphs. A first example is AlgoData (requires MaRDI / ORCID credentials), a knowledge graph of numerical algorithms. In this KG, the main entities (nodes) are algorithms that solve particular problems (such as solving linear systems of equations or integrating differential equations). Other entities in the graph are supporting information for the algorithms, such as articles, software (code), or benchmarks. For example, we want to encode that algorithm 1 solves problem X, is described in article Y, is implemented in software Z, and scores p points on benchmark W. A use case would be querying for algorithms that solve a particular type of problem, comparing the candidates using certain benchmarks, and retrieving the code to be used (ideally, interoperable with your system setup).
AlgoData has a well-defined ontology. An ontology (from the Greek, loosely, “study or discourse on the things that exist”) is the set of concepts relevant to a domain. For instance, on an e-commerce site, “article”, “client”, “shopping cart”, or “payment method” are concepts that need to be defined and included in the implementation of the platform. For a knowledge graph, the list includes all types of nodes and all labels for the edges and other properties. In general-purpose knowledge graphs such as Wikidata, the ontology is huge, and for practical purposes the user (human or machine) relies on search/suggestion algorithms to identify the property that best fits their intention. By contrast, for special-purpose knowledge graphs such as AlgoData, a reduced and well-defined ontology is possible, which simplifies the overall structure and the search mechanisms.
The ontology of AlgoData (as of June 2023, under development) is the following:
Classes:
Algorithm, Benchmark, Identifiable, Problem, Publication, Realization, Software.
Object Properties:
analyzes, applies, documents, has component, has subclass, implements, instantiates, invents, is analyzed in, is applied in, is component of, is documented in, is implemented by, is instance of, is invented in, is related to, is solved by, is studied in, is subclass of, is surveyed in, is tested by, is used in, solves, specializedBy, specializes, studies, surveys, tests, uses.
Data Properties:
has category, has identifier.
We can display this ontology as a graph:
Currently, AlgoData implements two search functions: a “simple search” that matches words in the content, and a “graph search” that queries for nodes in the graph satisfying certain conditions on their connections. The main AlgoData page gives a sneak preview of the system (these links are password protected, but MaRDI team members and any researcher with a valid ORCID identifier can access them).
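To give a flavor of what such a graph search could look like under the hood, here is a hedged SPARQL sketch built from the ontology terms above. The algodata: namespace and the exact spelling of the IRIs are our assumptions, purely for illustration; only the terms “solves” and “is implemented by” come from the ontology listed above.

```sparql
# Hypothetical sketch only: find algorithms that solve a given problem
# and the software implementing them. The algodata: prefix and the IRI
# spellings are assumed; the ontology terms come from the list above.
PREFIX algodata: <https://example.org/algodata#>
SELECT ?algorithm ?software WHERE {
  ?algorithm algodata:solves          algodata:LinearSystemOfEquations .
  ?algorithm algodata:isImplementedBy ?software .
}
```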
A project closely related to AlgoData is the Model Order Reduction Benchmark (MORB) and its Ontology (MORBO). This sub-project focuses on creating benchmarks for model order reduction algorithms (model reduction is a standard technique in mathematical modeling to reduce the simulation time of large-scale systems) and has its own knowledge graph and ontology tailored to this problem. More information can be found on the MOR Wiki and the MaRDI TA2 page.
The MaRDI portal and knowledge graph
The main output from the MaRDI project will also be based on a knowledge graph. The MaRDI Portal will be the entry point to all services and resources provided by MaRDI. The portal will be backed by the MaRDI knowledge graph, a big knowledge graph scoped to all mathematical research data. You can already have a sneak peek to see the work in progress.
The architecture of the MaRDI knowledge graph follows that of Wikidata and is compatible with it. In fact, many entries of Wikidata have been imported into the MaRDI KG and vice versa. The MaRDI knowledge graph will also integrate many other open-knowledge resources, thus leveraging many existing projects. A non-exhaustive list includes:
- The MaRDI AlgoData knowledge graph described above.
- Other MaRDI knowledge graphs, such as the MORWiki or the graph of Workflows with other disciplines.
- The zbMATH Open repository of reviews of mathematical publications.
- The swMATH Open database of mathematical software.
- The NIST Digital Library of Mathematical Functions (DLMF).
- The CRAN repository of R packages.
- Mathematical publications in arXiv.
- Mathematical publications in Zenodo.
- The OpenML platform of Machine Learning projects.
- Mathematical entries from Wikidata.
- Entries added manually by users.
The MaRDI Portal does not intend to replace any of those projects, but to link all those openly available resources together in a big knowledge graph of greater scope. As of June 2023, the MaRDI KG has about 10 million triples (subject-predicate-object, as in the RDF format). As with Wikidata, the ontology is too big to be listed, and it is described within the graph itself (e.g., the property P2 is the identifier for functions from the DLMF database).
Let us see some examples of entries in the MaRDI KG. A typical entry node in the MaRDI KG (in this example, the program ggplot2) is very similar to a Wikidata entry. This page is a human-friendly interface, but we can also get the same information in machine-readable formats such as RDF or JSON.
For the end user, it is probably more useful to query the graph for connections. As with Wikidata, we can query the MaRDI knowledge graph directly in SPARQL. Work is in progress to enable the Scholia plug-in to work with the MaRDI KG; currently, the beta MaRDI-Scholia queries against Wikidata.
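As an illustration, here is a small sketch of a query one could send to such an endpoint: it lists items carrying a DLMF identifier, using the property P2 mentioned above. We assume the MaRDI endpoint predefines Wikibase-style prefixes (wd:, wdt:, wikibase:, bd:) analogous to Wikidata's; check the endpoint documentation to confirm.

```sparql
# Sketch: items in the MaRDI KG that carry a DLMF identifier (property P2,
# as mentioned above). Assumes the endpoint predefines the Wikibase-style
# prefixes (wdt:, wikibase:, bd:) that Wikidata's query service uses.
SELECT ?item ?itemLabel ?dlmfId WHERE {
  ?item wdt:P2 ?dlmfId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
```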
Some queries that are available in the MaRDI KG but not on Wikidata are, for instance, queries for formulas in the DLMF: here, formulas that use the gamma function, or formulas that contain sine and tangent functions (the corpus of the database is still small, but it illustrates the possibilities). Wikidata can nevertheless also query for symbols in formulas.
The MaRDI KG is still in an early stage of development, and not ready for public use (all the examples cited are illustrative only). Once the KG begins to grow, mostly from open knowledge sources, the MaRDI team will improve it with some “knowledge building” techniques.
One such technique is the automated retrieval of structured information. For instance, the bibliographic references in an article are structured information, since they follow one of a few formats, and there are standards (BibTeX, Zb/MR numbers, …).
Another technique is link inference. This addresses the problem of low connectivity in graphs made by importing sub-graphs from multiple third-party sources, which may result in very few links between the sub-graphs. For instance, an article citing some references and a GitHub repository citing the same references are likely talking about the same topic. These inferences can then be reviewed by a human if necessary.
Another enhancement would be improved natural-language search, so that more complex queries can be made in plain English without the need for SPARQL.
The latest developments of the MaRDI Portal and its knowledge graph will be presented at a mini-symposium at the forthcoming DMV annual meeting in Ilmenau in September 2023.


- Knowledge ladder: Steps on which information can be classified, from the rawest to the most structured and useful. Depending on the author, these steps can be enumerated as Data, Information, Knowledge, Insight, Wisdom.
- Data: raw values collected from measurements.
- Information: Data tagged with its meaning.
- Knowledge: Pieces of information connected together with causal or other relationships.
- Knowledge base: A set of resources (databases, dictionaries…) that represent Knowledge (as in the previous definition).
- Knowledge graph: A knowledge base organized in the form of a mathematical graph.
- Insight: Ability to identify relevant information from a knowledge base.
- Wisdom: Ability to find (or create) connections between information points, using existing or new knowledge relationships.
- Ontology: Set of all the terms and relationships relevant to describe your domain of study. In a knowledge graph, the types of nodes and edges that exist, with all their possible labels.
- RDF (Resource Description Framework): A web standard to describe graphs as triples (subject - predicate - object).
- SPARQL (SPARQL Protocol And RDF Query Language): A language to send queries (information retrieval/manipulation requests) to graphs in RDF format.
- Wikipedia: a multi-language online encyclopedia based on articles (non-structured human-readable text).
- Wikidata: an all-purpose knowledge graph intended to host data relevant to multiple Wikipedias. As a byproduct, it has become a tool to develop the semantic web, and it acts as a glue between many diverse knowledge graphs.
- Semantic web: a proposed extension of the web in which the content of a website (its meaning, not just the text strings) is machine-readable, to improve search engines and data discovery.
- Mediawiki: the free and open-source software that runs Wikipedia, Wikidata, and also the MaRDI portal and knowledge graph.
- Scholia: A plug-in software for Mediawiki to enhance the visualization of data queries to a knowledge graph.
- AlgoData: a knowledge graph for numerical algorithms, part of the MaRDI project.

The video is available under the CC BY 4.0 license. You are free to share and adapt it, provided you mention the author (MaRDI).
In Conversation with Daniel Mietchen
In this episode of Data Dates, Daniel and Tabea talk about knowledge graphs, touching on the general concept, how a knowledge graph can help you find the proverbial needle, and the specific challenges that mathematical structures bring. In addition, we also hear about the MaRDI knowledge graph and what it brings to mathematicians.

Leibniz MMS Days
The 6th Leibniz MMS Days, organized by the Leibniz Network "Mathematical Modeling and Simulation (MMS)", took place this year from April 17 to 19 in Potsdam at the Leibniz Institute for Agricultural Engineering and Bioeconomy. A small MaRDI delegation, consisting of Thomas Koprucki, Burkhard Schmidt, Anieza Maltsi, and Marco Reidelbach, made their way to Potsdam to participate.
This year's MMS Days placed special emphasis on "Digital Twins and Data-Driven Simulation," "Computational and Geophysical Fluid Dynamics," and "Computational Material Science," each covered in an individual workshop. There was also a separate session on research data and its reproducibility, in which Thomas introduced the MaRDI consortium with its goals and vision and promoted two important future MaRDI services: AlgoData and ModelDB, two knowledge graphs for documenting algorithms and mathematical models. Marco concluded the session by providing insight into the MaRDMO plugin, which links established research-data-management software with the different MaRDI services, thus enabling FAIR documentation of interdisciplinary workflows. The presentation of ModelDB was met with great interest among the participants and was the subject of lively discussions afterwards and in the following days. Some aspects of these discussions have already been incorporated into the further design of ModelDB.
In addition to the various presentations, staff members gave a brief insight into the institute's different fields of activity, such as the optimal design of packaging and the use of drones in the field, during a guided tour. The highlight of the tour was a visit to the 18-meter wind tunnel, which is used to study flows in and around agricultural facilities. So MaRDI actually got to know its first cowshed, albeit in miniature.


MaRDI RDM Barcamp
MaRDI, supported by the Bielefeld Center for Data Science (BiCDaS) and the Competence Center for Research Data at Bielefeld University, will host a Barcamp on research-data management in mathematics on July 4th, 2023, at the Center for Interdisciplinary Research (ZiF) in Bielefeld.
More information:
- in English

Working group on Knowledge Graphs
The NFDI working group aims to promote the use of knowledge graphs in all NFDI consortia, to facilitate cross-domain data interlinking and federation following the FAIR principles, and to contribute to the joint development of tools and technologies that enable the transformation of structured and unstructured data into semantically reusable knowledge across different domains. You can sign up for the working group's mailing list here.
Knowledge graphs in other NFDI consortia can be found for instance at the NFDI4Culture KG (for cultural heritage items) or at the BERD@NFDI KG (for business, economic, and related data items).
More information:
- in English

NFDI-MatWerk Conference
The 1st NFDI-MatWerk Conference, aiming to develop a common vision of digital transformation in materials science and engineering, will take place from 27 to 29 June 2023 as a hybrid conference. You can still book your ticket for either on-site or online participation (online tickets are even free of charge).
More information:
- in English

Open Science Barcamp
The Barcamp is organized by the Leibniz Strategy Forum Open Science and Wikimedia Deutschland. It is scheduled for 21 September 2023 in Berlin and is open to everybody interested in discussing, learning more about, and sharing experiences on practices in Open Science.
More information:
- in English


- The department of computer science at Stanford University offers this graduate-level research seminar, which includes lectures on knowledge graph topics (e.g., data models, creation, inference, access) and invited lectures from prominent researchers and industry practitioners. It is available as a 73-page PDF document, divided into chapters:
https://web.stanford.edu/~vinayc/kg/notes/KG_Notes_v1.pdf
and additionally as a video playlist:
https://www.youtube.com/playlist?list=PLDhh0lALedc7LC_5wpi5gDnPRnu1GSyRG
- Video lecture on knowledge graphs by Prof. Dr. Harald Sack. It covers basic graph theory, centrality measures, and the importance of a node.
https://www.youtube.com/watch?v=TFT6siFBJkQ
- The Working Group (WG) Research Ethics of the German Data Forum (RatSWD) has set up the internet portal "Best Practice for Research Ethics". It bundles information on the topic of research ethics and makes it accessible.
https://www.konsortswd.de/en/ratswd/best-practices-research-ethics/