# Math & Data Quarterly

## News and insights into the realm of mathematical research data

**6th issue - Mathematical Research Data**

Welcome back to the newsletter on mathematical research data. This time, we are discussing a topic that is very much at the core of our interest and that of our previous articles: what is mathematical research data? And what makes it special?

Our very first newsletter delved into a brief definition and a few examples of mathematical research data. To quickly recap, research data are all the digital and analog objects you handle when doing research: this includes articles and books, as well as code, models, and pictures. This time, we zoom into these objects, highlight their properties, needs, and challenges (check out the article "Is there Math Data out there?" in the next section of this newsletter), and explore what sets them apart from research data in other scientific disciplines. We also report from workshops and lectures where we discussed similar questions, present an interview with Günter Ziegler, and invite you to events to learn more.

by Ariel Cotton, licensed under CC BY-SA 4.0.

We start off with a fun survey. Once again, it is just one multiple-choice question. This time, we created a decision tree that will guide you to answer the question:

**What type of mathematician are you?**

You will be taken to the results page automatically, after submitting your answer. Additionally, the current results can be accessed here.

The decision tree is available as a poster for download, licensed under CC BY 4.0.

### Is there Math Data out there?

“Mathematics is the queen and servant of sciences”, according to a quote by Carl F. Gauss. This opinion of Gauss can be a source of philosophical discussions. Is Mathematics even a science? Why does it play a special role? Connecting these questions to our concerns, what is the relationship between research data and these philosophical questions? We cannot arrive at a conclusion in this short article, but it is a good starting point to discuss the mindset (the philosophy, if you wish) that should be adopted regarding research data in mathematical science.

It is widely agreed that a science is any form of study that follows the scientific method: observation, formulation of hypotheses, experimental verification, extraction of conclusions, and back to observation. In most sciences (natural sciences and, to a great extent, also social sciences), observation requires gathering data from nature in the form of empirical records. In contrast, in pure mathematics, observations can be made simply by reflection on known theory and logic. In natural sciences, nature is the ultimate judge of the validity or invalidity of a theory. This experimental verification also requires gathering research data in the form of empirical records that support or refute a hypothesis. In mathematics, by contrast, experimental verification is replaced by formal proofs. Such characteristics have prompted some philosophers to claim that mathematics is not really a science but a meta-science, because it does not rely on empirical data. More pragmatically, it can tempt some researchers and mathematicians to say that (at least pure) mathematics does not use research data. But as you will guess, in the Mathematical Research Data Initiative (MaRDI), we advocate for quite the opposite view.

Firstly, some parts of mathematics do use experimental data extensively. Statistics (and probability) is the branch of mathematics for analyzing large collections of empirical records. Numerical methods are practical tools to perform computations on experimental data. This is the case for pure mathematics as well, where we can build lists of records (prime numbers, polytopes, groups…) that are somewhat experimental.

Secondly, research data are not only empirical records. Data are any raw pieces of information upon which we can build knowledge (we discussed the difference between data, information, and knowledge in the previous newsletter). When we talk about research data, we mean any piece of information that researchers can use to build new knowledge in the scientific domain in question, mathematics in this case. As such, articles and books are pieces of data. More precisely, theorems, proofs, formulas, and explanations are individual pieces of data. They have traditionally been bundled into articles and books and stored on paper, but nowadays they are largely available in digital form and accessible through computerized means.

### Types of data

In modern mathematical research, we can find many types of data:

**Documents** (articles, books) and their constituent parts (theorems, proofs, formulas…) are data. Treating mathematical texts as data (and not only as mere containers where one deposits ideas in written form) recognizes that mathematical texts deserve the same treatment as other forms of structured data. In particular, FAIR principles and data management plans also apply to texts.

**Literature references** are data. Although bibliographic references are part of mathematical documents, we mention them separately because references are structured data. There is a defined set of fields (such as author, title, publisher…), there are standard formats (e.g. BibTeX), and there are databases of mathematical references (e.g. zbMATH, MathSciNet…). This makes bibliographic references one of the most curated types of research data (especially in mathematics).

**Formalized mathematics** is data. Formalizations written in languages that implement formal logic, such as Coq, HOL, Isabelle, Lean, Mizar, etc., are a structured version of the (unstructured) mathematical texts that we just mentioned. They contain proofs verifiable by software and are playing an increasingly vital role in mathematics. Data curation is essential to keep those formalizations useful and bound to their human-readable counterparts.

**Software** is data. This ranges from small scripts that help with a particular problem to large libraries that integrate into larger frameworks (Sage, Mathematica, MATLAB…). Notebooks (Jupyter…) are a form of research data that mix text explanations and interactive prompts, so they need to be handled as both documents and software.

**Collections of objects** are data. Classifications play a major role in mathematics. Whether gathered by hand or produced algorithmically, the result can be a pivotal point from which many other works derive. Although the output of a classification can have more applications than the process used to arrive at it, it is essential that both the input algorithm (or manual process) and the output classification are clearly documented, so that the classification can be verified and reproduced independently, as well as reused in further projects.

**Visualizations and examples** are data. Examples and visual realizations of mathematical objects (including images, animations, and other types of graphics) can be very intricate and have an enormous value for understanding and developing a theory. Although examples and visualizations can be omitted in more spartan literature, if provided, they deserve the same full curation as other research data essential to logical proofs.

**Empirical records** are data. Of course, raw collections of information from nature, intended to be processed to extract knowledge about the data itself or about the statistical method, are data that need special tools to handle. This applies to statistical databases, but also to machine learning models that require vast amounts of training data.

**Simulations** are data. Simulations are lists of records not measured from the outside world, but generated by a program. Such a record is usually a representation of a state of a system, possibly including some discretizations and simplifications of reality made in the modeling process. As with collections, the output simulation data is as necessary as the input source code that generates it. Simulation data is what allows us to extract conclusions, whereas verifying reproducibility requires that the input-to-output processing can be performed by a third party. This makes it possible to recognize flaws or errors in either the input or the output, or to rerun the simulation with different parameters.

**Workflow documentation** is data. More general than simulations, workflows involve several steps of data acquisition, data processing, data analysis, and extraction of conclusions in many scientific research projects. An overview of the process is in itself a valuable piece of data, as it gives insights into the interplay of the different parts. A numerical algorithm can be individually robust and performant, but it may not be the best fit for the task at hand. We can only spot such issues when we have a good overview of the entire process.
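The point made above about simulations, that the generating code and its parameters matter as much as the output, can be illustrated with a minimal Python sketch. The random walk, the function name `simulate_random_walk`, and the parameter values are hypothetical choices made for illustration only; the idea is simply that recording the seed and parameters alongside the output lets someone else repeat the run (here assuming the same Python version, since random number generation details can differ between versions).

```python
import random

def simulate_random_walk(steps, seed):
    """Simulate a 1D random walk; the fixed seed makes the run repeatable."""
    rng = random.Random(seed)  # private generator, independent of global state
    position = 0
    path = [position]
    for _ in range(steps):
        position += rng.choice((-1, 1))
        path.append(position)
    return path

# Hypothetical parameters, to be stored with the output as metadata.
params = {"steps": 100, "seed": 2023}

original_run = simulate_random_walk(**params)
independent_rerun = simulate_random_walk(**params)

# A third party who has the code and the parameters gets the same records.
print(original_run == independent_rerun)
```

Rerunning with a different `seed` or `steps` value corresponds to the "rerun with different parameters" case mentioned above.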

### The building of mathematics

One key difference between mathematics and other sciences is the existence of proofs. Once a result is proven, it is true forever, as it cannot be overruled by new evidence. The Pythagorean theorem, for instance, is as valid and useful today as it was in the times of Pythagoras, or even in the earlier times of the ancient Babylonians and Egyptians, who knew and used it. The Greeks, however, invented the concept of proof, turning mathematics from a practice into a science. The Elements by Euclid, written circa 300 BC and one of the most relevant books in the history of mathematics and mankind, perfectly represents the idea that mathematics is a building, or a network, in which each block is built on top of others, in a chain starting with some predetermined axioms. The image shows the dependency graph of propositions in Book I of the Elements.

Imagine now that we extend the above graph to include all propositions and theorems from all mathematical literature up to the current state of research. That huge graph would have millions of theorems and dependency connections, and it would be futile to draw it on paper. This graph does not yet exist physically or virtually, except as an abstract concept. Parts of this all-mathematics graph are stored in the brains of some mathematicians, or in the literature as texts, formulas, and diagrams. The breakthrough of our times is that it is conceivable to materialize this graph with today’s technology, in the form of a knowledge graph similar to those being developed at MaRDI or Wikidata. The benefits of having such a graph in a computer system are many: we will be able to find any known theorem that applies to our problems, access the fundamental blocks of literature where those results were established, find and verify logical connections in complex proofs, and gain a panoramic view of mathematics and its different areas.
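To make the idea of a dependency graph concrete, here is a small Python sketch. The statements ("Axiom A", "Lemma 1", and so on) are hypothetical placeholders, not actual propositions from the Elements, and the helper `all_prerequisites` is a name invented for this illustration. It shows two of the operations described above: collecting everything a result ultimately rests on, and producing an order in which every statement comes after the ones it uses.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy data: each statement maps to the statements its proof uses.
depends_on = {
    "Axiom A": set(),
    "Axiom B": set(),
    "Lemma 1": {"Axiom A"},
    "Lemma 2": {"Axiom A", "Axiom B"},
    "Theorem 3": {"Lemma 1", "Lemma 2"},
}

def all_prerequisites(statement, graph):
    """Return every statement that `statement` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph[statement])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen

# A valid reading order: every statement appears after its prerequisites.
reading_order = list(TopologicalSorter(depends_on).static_order())

print(sorted(all_prerequisites("Theorem 3", depends_on)))
print(reading_order)
```

A real knowledge graph would attach far more metadata to each node (source, authors, formalization status), but the underlying structure is the same directed graph.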

The crucial point is that, to succeed in such an endeavor, we must realize that mathematical knowledge is composed of pieces of data that require FAIR and complex data management and a particular infrastructure to handle data at this scale. Although it is not completely out of MaRDI’s scope, MaRDI itself does not aim to create a knowledge graph of all mathematical theorems but instead focuses on the research data management required by today’s researchers. The most advanced project aiming to realize this all-mathematics graph is probably within the Lean community (see also our interview with Johan Commelin).

### Mathematics as a tool

The “special role” of mathematics amongst sciences comes from the role of *tool* that it plays in any other science, to the point that a science is not considered mature enough until it has a mathematical formalization. The fact that mathematics can be used as the tool for doing science is the so-called “unreasonable effectiveness of mathematics in the natural sciences”. But once this role of mathematics as a tool is accepted, we must admit that, in theory, it is a very reliable tool. It is so, foremost, because of the logical building process that we described above. A proven theorem will not fail unexpectedly; the rules of logic will not cease to exist tomorrow. But in practice, relying on tools that someone else developed requires, first, that one can trust the tool to execute its intended goal; and second, that one can learn how to use the tool effectively. This entails responsibility from mathematics as a science and from mathematicians as a community with respect to other sciences and researchers.

As happens with physical tools, a craftsman must know their tools well in order to use them efficiently. But also any modern toolmaker must state clearly the technical characteristics of the tool, the intended use, the safety precautions, its quality standards and regulations, etc. In our analogy, mathematicians must take care of impeccable preparation of the results they produce, especially when talking about algorithms and methods that will probably be applied by researchers in other fields of science.

Think of the calculus used in quantitative finance, statistical hypothesis tests to analyze data in medicine, or computers tracking the exact location of spaceships. If mathematicians did not get their derivatives and integrals right, these methods would not provide reliable results, leading to wrong conclusions and often even putting people’s lives in danger. It is of utmost importance to be able to fully trust at least the theoretical basis, especially since applied science has to deal with rounding errors, components of nature that were not integrated into the original model, and the possibility of human failure. This requires that the results be verifiable.

As for learning to use a tool, mathematical output must take into account its future reuse as a tool by other scientists. This means appropriate documentation, using appropriate standards for interoperability with existing tools, using legal licenses that allow unencumbered reuse, and in general following some form of agreed good practices of the community that can serve as guidelines for research practice.

Modern science in the age of information and computation depends entirely on research data, but different fields have adapted their methods and practices with uneven success. Mathematics is not especially well placed in terms of managing research data and software in comparison to other fields.

Software development, especially in the open source community, has been facing data management problems for decades, meaning that some of the solutions are currently standard practices in the industry. For instance, version control (with git as a de facto standard tool) is a basic practice to track changes and improvements to source code (or to any document or data). If we couple version control with a public repository (GitHub, GitLab…), we get a reliable method of publishing software and working collaboratively. Once a project has many contributors, it will face merging problems when different teams develop in different directions. A solution is a continuous integration scheme with automated tests, which check that your modifications will not break other parts of the project if adopted. The amount of security and verification in the industry for any new development in big software projects (think, for instance, of new Linux kernel releases) is certainly unparalleled in most software projects in the scientific research community (with notable exceptions such as xSDK). This is often excused on the grounds that research is experimental by nature (in the sense of untested and unfinished), but academic and theoretical research should not have lower standards than industry research.
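As a rough illustration of what such automated tests can look like for mathematical code, here is a minimal Python sketch. The function `binomial` and the two test functions are hypothetical examples written for this newsletter, not taken from any particular project. A continuous integration service would run tests like these on every proposed change and reject changes that make them fail.

```python
import math

def binomial(n, k):
    """Number of k-element subsets of an n-element set."""
    if k < 0 or k > n:
        return 0
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

def test_pascal_rule():
    # Pascal's rule is a structural property any correct implementation must keep.
    for n in range(1, 20):
        for k in range(1, n):
            assert binomial(n, k) == binomial(n - 1, k - 1) + binomial(n - 1, k)

def test_known_values():
    # A few hand-checked values guard against accidental regressions.
    assert binomial(5, 2) == 10
    assert binomial(4, 7) == 0

if __name__ == "__main__":
    test_pascal_rule()
    test_known_values()
    print("all tests passed")
```

Testing a mathematical identity over a range of inputs, in addition to a few known values, is one way to let the theory itself act as a check on the software.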

### In Conversation with Günter Ziegler

"There's nothing more successful than success," Günter Ziegler says in our latest data date: best practices will be embraced by the community. We talk about his combinatorial view of research data, the need for classifications, and the difference between everlasting mathematical results and theories in physics.

### Mathematics Meets Data: Highlights from MaRDI's Barcamp

What better way to get researchers to find out that research-data management is their topic than with a Barcamp? That way, every participant can explore their own experiences, questions, and approaches.

On July 4th, MaRDI hosted its first Barcamp on Research-Data Management in Mathematics at Bielefeld University's Center for Interdisciplinary Research. It was a joint effort involving the Bielefeld mathematics faculty, MaRDI, BiCDaS, and the Bielefeld Competence Center for Research Data.

The day began with a casual breakfast, where attendees mingled, discussed expectations, and chatted about questions. A poster showcasing research data types served as a useful conversation starter (find the download link for the poster in the welcome section of this newsletter issue).

Before the session pitches commenced, Lars Kastner and Pedro Costa-Klein delivered brief talks on code reproducibility and best practices for using Docker in the Collaborative Research Center 1456 (Mathematics of the Experiment) in Göttingen, respectively.

The session pitches revealed that the Barcamp had appealed to many young researchers unfamiliar with the topic. To address this, an introductory session on "What is research data?" kicked off the discussions. Meanwhile, those more experienced with research data management discussed ways to engage the mathematical community with the topic.

One of the defining features of a Barcamp is its participant-driven agenda. Attendees had the unique opportunity to shape the discussions and focus on the topics most pertinent to their research and data management needs. This resulted in a diverse set of topics. One session on research data management plans matched experts from the Competence Center and mathematicians to exchange perspectives and requirements. A smaller group's discussions centered on Binderhub, whereas another tackled research data repositories and their adherence to FAIR principles. Additional sessions explored the peculiarities of mathematical research data, the importance of good documentation, and a hands-on session on an online database that collects and discusses ideas on FAIR data.

This Barcamp offered the mathematics community an exceptional platform to exchange insights and inquiries regarding research-data management within their discipline.

### Teaching research-data management

A survey conducted in the summer of 2021 in German mathematics departments revealed that teaching mathematicians estimate the awareness and knowledge of their students regarding good scientific practice, authorship attributions, the FAIR principles, and research software as too low. Unfortunately, these are classical research-data management (rdm) topics. Motivated by that need and by successful, cross-disciplinary rdm courses at Bielefeld and Leipzig universities, six lectures in research-data management for mathematicians took place in Leipzig in the summer term 2023. To the teacher's knowledge, this was the first course of its kind. The large group of attendees came from a variety of career levels, including six undergraduate students, two PhD students, two postdocs, and five MaRDIans. This contributed to lively discussions centered around properties and common problems of mathematical research data, metadata standards for papers and the difficulties in deciding appropriate metadata for mathematical results, the scientific method, good scientific practice, and how to write, cite, and document mathematics. Feedback for the course was very good, with students appreciating the interactive atmosphere, the time allocated for questions, and the informal nature of the classes. A one-day course on maths rdm in Magdeburg in October will build on these first successful sessions and discuss questions of reproducibility and repositories, in addition to introductory topics. Lecture notes for both are now in the making. They will be made publicly available for a second installment next summer term, for free use and reuse by any mathematician interested in the topic of rdm.

### MaRDMO Workshop at the NFDI-MatWerk Conference

The "1st Conference on Digital Transformation in Materials Science and Engineering - NFDI-MatWerk Conference" took place in Siegburg from 26 to 29 June 2023. With 30 talks, 17 posters, 10 workshops, and 160 participants (on-site and online), the conference provided an ideal setting for the urgently needed transformation in materials science. In addition to status updates from each NFDI-MatWerk task area and various interdisciplinary use cases, the conference initiated collaborations between different NFDI consortia and new community participants, emphasizing their role in shaping the future of NFDI-MatWerk. Several NFDI consortia, namely NFDI4Chem, NFDI4energy, DAPHNE4NFDI, and FAIRmat, also gave keynote presentations, highlighting the need for collaboration.

Marco Reidelbach from TA4 attended the conference on behalf of the MaRDI consortium to present MaRDMO, a plugin for the Research Data Management Organiser (RDMO) for documenting, publishing, and searching interdisciplinary workflows. Though participation was low at the 100-minute demonstration, discussions vital for the further development of MaRDMO ensued. The central point of the discussion was the automation of the documentation process to minimize additional work for researchers, thereby increasing the acceptance of MaRDMO. We also discussed the use of RDMO, which on paper appears to be an ideal interface to all research disciplines, but was completely unknown to the workshop participants. Here, the NFDI in particular is also called upon to take a clear stand. A good two-thirds of the consortia have declared their support for RDMO, while the remaining consortia want to rely on alternatives or are still undecided.

Overall, the NFDI-MatWerk consortium conference showed that the defining infrastructural issues, far from the concrete content, differ little or not at all from the issues in the MaRDI consortium and the other consortia at the conference. The construction of knowledge graphs and the harmonization of ontologies are central problems that require a joint effort and make it necessary to leave one's own comfort zone.

### MaRDI at CoRDI

MaRDI was present at the first Conference on Research Data Infrastructure (CoRDI), held in Karlsruhe from 12 to 14 September 2023. This interdisciplinary event brought all the NFDI consortia together to present their projects in general and detailed discussions. The conference was a unique opportunity to exchange experiences and ideas among a wide range of communities that have different needs but share common challenges and solutions regarding research data.

MaRDI presented three talks and two posters. The general conference proceedings are linked in the recommended further reading section at the end of this newsletter issue. We provide links to individual sections here:

Talks:

MaRDI. Building Research Data Infrastructures for Mathematics and the Mathematical Sciences. Renita Danabalan, Michael Hintermüller, Thomas Koprucki, Karsten Tabelow.

MaRDIFlow: A Workflow Framework for Documentation and Integration of FAIR Computational Experiments. Pavan L. Veluvali, Jan Heiland, Peter Benner.

Building Ontologies and Knowledge Graphs for Mathematics and its Applications. Björn Schembera, Frank Wübbeling, Thomas Koprucki, Christine Biedinger, Marco Reidelbach, Burkhard Schmidt, Dominik Göddeke, Jochen Fiedler.

Posters:

MaRDMO Plugin. Document and Retrieve Workflows Using the MaRDI Portal. Marco Reidelbach, Eloi Ferrer, Marcus Weber.

Spreading the Love for Mathematical Research Data. Tabea Bacher, Christiane Görgen, Tabea Krause, Andreas Matt, Daniel Ramos, Bianca Violet.

### Math Meets Information Specialists, October 09 - 11, 2023, MPI MiS, Leipzig

MaRDI invites information specialists, librarians, data stewards, and mathematicians to discuss mathematical research data, present their own ideas and services, and make new connections in a three-day noon-to-noon workshop with talks, hands-on sessions, and a barcamp. The workshop will be held in German.

**More information:**

- in German

### Data-Driven Materials Informatics, March 4 - May 24, 2024

The aim of this long program at IMSI is to bring together a diverse scientific audience, both between scientific fields (physical sciences, materials sciences, biophysics, etc.) and within mathematics (mathematical modeling, numerical analysis, statistics, data analysis, etc.), to make progress on key questions of materials informatics.

**More information:**

- in English

### RDM with LinkAhead, September 29, 2023, online

At the NFDI4Chem Stammtisch, the research data management software LinkAhead will be introduced. This agile, open-source software toolbox enables professional data management in research where other approaches are too rigid and inflexible. It will make your data findable and reusable.


### NFDI Code of Conduct

The Consortial assembly, comprising the speakers of each consortium, voted on 27 June 2023 to adopt the code of conduct for the NFDI. This Code of Conduct is intended to provide a binding framework for effective collaboration within the NFDI association.

**More information:**

- in German

### Recommended further reading

- A generic JSON-based file format that is suitable for computations in computer algebra is described in the paper A FAIR File Format for Mathematical Software by Antony Della Vecchia, Michael Joswig, and Benjamin Lorenz. This file format is implemented in the computer algebra system OSCAR, but the paper also indicates how it can be used in a different context.
- To understand our world, we classify things. A famous example is the periodic table of elements, which describes the properties of all known chemical elements and classifies the building blocks we use in physics, chemistry, and biology. In mathematics, and algebraic geometry in particular, there are many instances of similar periodic tables, describing fundamental classification results. In his article, The Periodic Tables of Algebraic Geometry, Pieter Belmans invites you on a tour of some of these results. It appeared within the series 'Snapshots of modern mathematics from Oberwolfach'.
- Play with the educational tool Classified graphs. With this open-source web app you can draw any graph, or select one from a collection, and then compute a few invariants, such as the adjacency determinant. In the Identify mode, you are challenged to find out which of the graphs in the collection is shown as a target. The tool is part of Pieter Belmans's project Classified maths.
- In each episode of the podcast "Mathematical Objects", Katie Steckles and Peter Rowlett chat about some aspect of mathematics using a mathematical object as inspiration. The podcast is also available on YouTube.
- Proceedings of the Conference on Research Data Infrastructure (CoRDI): https://www.tib-op.org/ojs/index.php/CoRDI/issue/view/12

**5th issue - Knowledge Graphs**

Welcome to the fifth issue of the MaRDI Newsletter on mathematical research data. In the first four issues, we focused on the FAIR principles. Now we move to a topic that makes use of FAIR data and also implements the FAIR principles in data infrastructures. So without further ado, let me introduce you to the ultimate use case of FAIR data: Knowledge Graphs.

by Ariel Cotton, licensed under CC BY-SA 4.0.

Knowledge graphs are very natural and represent information similarly to how we humans think. They come in handy when you want to avoid redundancy in storing data (as may happen quite often with tabular methods), and also for complex dataset queries.

This newsletter issue offers some insight into the structure of knowledge, examples of knowledge graphs, including some specific to MaRDI, an interview with a knowledge graph expert, as well as news and announcements related to research data.

In the last issue, we asked how long it would take you to find and understand your own research data. These are the results:

Now we ask you for specific challenges when searching for mathematical data. You may choose from the multiple-choice options or enter something else you faced.

**Click to enter your challenges!**

You will be taken to the results page automatically, after submitting your answer. Additionally, the current results can be accessed here.

### The knowledge ladder

We are not sure exactly how humans store knowledge in their brains, but we certainly pack concepts into units, and then relate those conceptual units together. For example, if asked to list animals, nobody remembers an alphabetical list (unless you explicitly train yourself to remember such a list). Instead, you start the list with something familiar, like a dog, then you recall that a dog is a pet animal, and then you list other pet animals like cat or canary. Then you recall that a canary is a bird, and then you list other birds, like eagle, falcon, owl… When you run out of birds, you recall that birds fly in the air, which is one environment. Another environment is water, and this prompts you to start listing fish and sea animals.

This suggests that we can represent human knowledge in the form of a mathematical graph: concepts are nodes, and relationships are edges. This structure is also ingrained in language, which is the way humans communicate and store knowledge. All languages in the world, across all cultures, have nouns, verbs, or adjectives, and establish relationships through sentences. Almost every language organizes sentences in a subject-verb-object pattern (or a permutation of it: SVO, SOV, VSO, etc.). The subject and the object are typically nouns or pronouns, and the verb often expresses a relationship. A sentence like “my mother is a teacher” encodes the following knowledge: the person “my mother” is a node 1, “teacher” is a node 2, and “has as a job” is a relational edge from node 1 to node 2. Also, there is a node 3, the person “me”, and a relationship “is the mother of” from node 1 to node 3 (which implies a reciprocal relationship “is a child of” from node 3 to node 1).
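The “my mother is a teacher” example can be written down directly as data. The following Python sketch is only an illustration: the tuple layout, the `reciprocal` rule table, and the helper `with_reciprocals` are assumptions made here, not a standard representation.

```python
# Each fact is a (subject, predicate, object) triple, i.e. a labeled edge.
triples = [
    ("my mother", "has as a job", "teacher"),   # node 1 -> node 2
    ("my mother", "is the mother of", "me"),    # node 1 -> node 3
]

# Hypothetical rule table: a predicate and the reciprocal predicate it implies.
reciprocal = {"is the mother of": "is a child of"}

def with_reciprocals(facts, rules):
    """Add the reverse edge for every predicate that has a known reciprocal."""
    result = list(facts)
    for subject, predicate, obj in facts:
        if predicate in rules:
            result.append((obj, rules[predicate], subject))
    return result

graph = with_reciprocals(triples, reciprocal)

# The nodes are simply everything that appears as a subject or an object.
nodes = {s for s, _, _ in graph} | {o for _, _, o in graph}

for triple in graph:
    print(triple)
```

Storing the reciprocal rule once, instead of storing both directions of every family relation by hand, is a small example of how a graph representation can avoid redundancy.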

With this construction, we can expect to have an abstract representation of human knowledge that we can store, retrieve, and search with a computer. But not all data automatically gives knowledge, and raw knowledge is not all you may need to solve a problem. This distinction is sometimes referred to as the “knowledge ladder”, although the terminology has not been universally agreed upon. In this ladder, *data* is the lowest level of information: data are raw input values that we have collected with our senses or with a sensor device. *Information* is data tagged with meaning: I am a person, that person is called Mary, teaching is a job, this thing I see is a dog, this list of numbers contains daily temperatures in Honolulu. *Knowledge* is achieved when we find relationships between bits of information: Mary is my mother; Mary’s job is teacher; these animals live together and compete for food; pressure, temperature, and volume in a gas are related by the gas law PV=nRT. *Insight* is discernment. It is singling out the information that is useful for your purpose from the rest, and finding seemingly unrelated concepts that behave alike. Finally, *wisdom* is understanding the connections between concepts. It is the ability to explain step by step how concept A relates to concept B. This ladder is illustrated in the image above. From this point of view, “research” means knowing and understanding all portions of human knowledge that fall into or close to your domain, and then enlarging the graph with more nodes and edges, for which you need both insight and wisdom.

### The advent of knowledge graphs

Knowledge graphs (KG) as a theoretical construction have been discussed in information theory, linguistics and philosophy for at least five decades, but it is only in this century that computers have allowed us to implement algorithms and data retrieval at a practical and massive scale. Google introduced its own knowledge graph in 2012; you may be familiar with it. When you look up a person, a place, etc. in Google, there is a small box to the right that displays some key information, such as birthdate and achievements for a person or opening times for a shop. This information is not a snippet from a website; it is information collected from many sources and packed into a node of a graph. Those nodes are then linked together by some affinity relationship. For instance, if you look up “Agatha Christie”, you will see an “infobox” with her birthdate, date of death, a short description extracted from Wikipedia, a photograph… and also a list of “People also search for” that will bring you to relatives such as Archibald Christie, or to other British authors, such as Virginia Woolf.

But probably the biggest effort to bring all human knowledge into structured data is Wikidata. Wikidata is a sister project of Wikipedia. Wikipedia aims to gather all human knowledge in the form of encyclopedic articles, that is, as non-structured human-readable data. Wikidata, by contrast, is a knowledge graph. It is a directed labeled graph, made of triples of the form subject (node) - predicate (edge) - object (node). The nodes and edges are labeled; in fact, each of them carries a whole list of attributes.

The Wikidata graph is not designed to be used directly by humans. It is designed to retrieve information automatically, to be a “base of truth” that can be relied on. For instance, it can check automatically that all languages of Wikipedia state basic facts correctly (birthplace, list of authored books…), and can be used by external services (such as Google and other search engines or voice assistants) to offer correct and verifiable answers to queries.

In practice, nodes are pages, for instance, this one for Agatha Christie. Inside the page, it lists some “statements”, which are the labeled edges to other nodes. For example, Agatha Christie is an instance of a human, her native language is English, and her field of work is crime novel, detective literature, and others. If we compare that page with the Agatha Christie entry in the English Wikipedia, clearly the latter contains more information, and the Wikidata page is less convenient for a human to read. Potentially, all the ideas described with English sentences in Wikipedia could be represented by relationships in the Wikidata graph, but this task is tedious and difficult for a human, and AI systems are not yet sufficiently developed to make this conversion automatically.
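As a minimal sketch of the triple model, the statements mentioned above can be stored as (subject, predicate, object) tuples and grouped by edge label. The strings below are plain human-readable labels chosen for illustration, not actual Wikidata identifiers, and the helper `outgoing_edges` is a hypothetical name.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# Labels are illustrative strings, not real Wikidata identifiers.
triples = [
    ("Agatha Christie", "instance of", "human"),
    ("Agatha Christie", "native language", "English"),
    ("Agatha Christie", "field of work", "crime novel"),
    ("Agatha Christie", "field of work", "detective literature"),
]


def outgoing_edges(triples, subject):
    """Group the objects reachable from `subject` by predicate (edge label)."""
    edges = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            edges[p].append(o)
    return dict(edges)


christie = outgoing_edges(triples, "Agatha Christie")
print(christie["field of work"])
```

A real Wikidata page attaches further attributes (qualifiers, references) to each statement, which this flat list of tuples leaves out.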

In the backend, Wikidata is stored in relational SQL databases (using the same MediaWiki software as Wikipedia), but the graph model is that of subject-predicate-object triples as defined in the web standard RDF (Resource Description Framework). This graph structure can be explored and queried with the language SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language). Note that we usually use the verb “query” as opposed to “search” when we want to retrieve information from a graph, database, or other structured source of information.

Thus, one can access the Wikidata information in several ways. First, one can use the web interface to access single nodes. The web interface has a search function that allows one to look up pages (nodes) that contain a certain search string. However, it is much more insightful to get information that takes advantage of the graph structure, that is, querying for nodes that are connected to some topic by a particular predicate (statement), or that have a particular property. For Wikidata, we have two main tools: direct SPARQL queries, and the Scholia plug-in tool.

The web interface and API at query.wikidata.org allow you to send queries in the SPARQL language. This is the most powerful way to search, and you can browse the examples on that site. The output can be a list, a map, a graph, etc. There is a query builder help function, but it essentially requires some familiarity with SPARQL. On the other hand, Scholia is a plug-in tool that helps with querying and visualizing the Wikidata graph. For instance, when searching for “covid-19” via Scholia, it will offer a graph of related topics, a list of authors and recent publications on the topic, organizations, etc., in different visual forms.
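To give a flavor of what such a query looks like, here is a hedged sketch that asks for works whose author is a human labeled “Agatha Christie”, and builds the request URL without sending it. It assumes the Wikidata identifiers P31 (“instance of”), Q5 (“human”), and P50 (“author”), the public endpoint `https://query.wikidata.org/sparql`, and its `format=json` parameter. Matching the author by English label is a simplification for illustration, and `build_request_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Public SPARQL endpoint behind query.wikidata.org (assumed here).
ENDPOINT = "https://query.wikidata.org/sparql"

# P31 = "instance of", Q5 = "human", P50 = "author".
# Looking the author up by label is a simplification for illustration.
QUERY = """
SELECT ?work ?workLabel WHERE {
  ?author rdfs:label "Agatha Christie"@en ;
          wdt:P31 wd:Q5 .
  ?work wdt:P50 ?author .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL asking the endpoint for JSON results (no request is sent)."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(QUERY)
# Decode the URL again to confirm that the query survives the encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["format"])
```

The query text can also be pasted directly into the web interface, which is usually the easier way to experiment.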

### Knowledge graphs, artificial intelligence, and mathematics

Knowledge graphs are a hot research area in connection with Artificial Intelligence. On the one hand, there is the challenge of creating a KG from a natural language text (for instance, in English). While detecting grammar and syntax rules (subject, verb, object) is relatively doable, creating a knowledge graph requires encoding the semantics, that is, the meaning of the sentence. In the example of a few paragraphs above, “my mother is a teacher”, to extract the semantics we need the context of who is “me” (who is saying the sentence), we need to check if we already know the person “my mother” (her name, some kind of identifier), etc. The node for that person can be on a small KG with family or contextual information, while “teacher” can be part of a more general KG of common concepts.

In the case of mathematics, extracting a KG from natural language is a tremendous challenge, unfeasible with today’s techniques. Take a theorem statement: it contains definitions, hypotheses, and conclusions, and each one has a different context of validity (the conclusion is only valid under the hypotheses, and that is precisely what you need to prove). Then imagine that you start a proof by contradiction (reductio ad absurdum), so you have several sentences that are valid under the assumption that the hypotheses of the theorem hold, but not the conclusion. At some point, you want to reach a contradiction with your previous knowledge, thus proving the theorem. The current knowledge graph paradigm is simply not suited to following this line of argument. The closest thing to structured data for theorems and proofs is formal languages in logic, and there are practical implementations such as the Lean theorem prover. Lean is a programming language that can encode symbolic manipulation rules for expressions. A proof by algebraic manipulation of a mathematical expression can therefore be described as a list of manipulations starting from an original expression (move a term to the other side of the equal sign, raise the second index in this tensor using a metric…). Writing proofs in Lean can be tedious, but it has the benefit that they can be verified automatically by a machine, with no need for a human referee. Of course, we are still far from an AI checking the validity of a proof written in natural language without human intervention, or even figuring out proofs of conjectures on its own. On the other hand, a dependency graph of theorems, derived in a logical chain from some axioms, is something that a knowledge graph like the MaRDI KG would be well suited to encode.
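To make the idea of a machine-checkable argument concrete, here is a tiny sketch in Lean 4 syntax (an assumption, since the text does not name a version). It states two hypotheses and a conclusion, and proves the conclusion by assuming the opposite and reaching a contradiction. The theorem name is our own choice.

```lean
-- Hypotheses: `hpq : p → q` and `hnq : ¬q`. Conclusion: `¬p`.
theorem not_of_imp_of_not (p q : Prop) (hpq : p → q) (hnq : ¬q) : ¬p := by
  intro hp            -- assume, for contradiction, that p holds
  exact hnq (hpq hp)  -- then q follows, contradicting ¬q
```

Note how the assumption `hp` is only valid inside the proof: this scoped context is exactly what a plain subject-predicate-object triple has no place for.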

In any case, structured knowledge (in the form of KGs or other forms, such as databases) is a fundamental piece in providing AI systems with a source of truth. Recent advances in the field of generative AI include the famous conversational bot ChatGPT and other Large Language Models (LLMs), which are impressive in the sense that they can generate grammatically correct text with meaningful sentences while keeping track of a conversation. However, these systems are known for not being able to distinguish truth from falsehood (to be precise, the AI is trained on text data that is assumed to be mostly true, but it cannot make any logical deductions). If we ask an AI for the biography of a nonexistent person, it may simply invent one in an attempt to fulfill the task. If we contradict the AI on a plain fact, it will probably just accept our input despite its previous answer. Currently, conversational AI systems are not capable of rebutting false claims by providing evidence. However, in the likely future, a conversational AI with access to a knowledge base (KG, database, or other) will be able not only to process queries and generate answers in natural language, but also to check them against verified facts and to present relevant information extracted from the knowledge base. An example in this direction is the Wolfram Alpha plug-in for ChatGPT. With some enhanced algorithms to traverse and explore a knowledge graph, we may witness AI systems stepping up from Knowledge to Insight, or further up the ladder.

One of the mottos of MaRDI is “Your Math is Data”. Indeed, from an information theory perspective, all mathematical results (theorems, proofs, formulas, examples, classifications) are data, and some mathematicians also use experimental or computational data (statistical datasets, algorithms, computer code…). MaRDI intends to create the tools, the infrastructure, and the cultural shift to manage and use all research data efficiently. In order to climb up the “knowledge ladder” from Data to Information and Knowledge, the Data needs to be structured, and knowledge graphs are one excellent tool for that goal.

### AlgoData

Several initiatives within MaRDI are based on knowledge graphs. A first example is AlgoData (requires MaRDI / ORCID credentials), a knowledge graph of numerical algorithms. In this KG, the main entities (nodes) are algorithms that solve particular problems (such as solving linear systems of equations or integrating differential equations). Other entities in the graph are supporting information for the algorithms, such as articles, software (code), or benchmarks. For example, we want to encode that algorithm 1 solves problem X, is described in article Y, is implemented in software Z, and scores p points in benchmark W. A use case would be querying for algorithms that solve a particular type of problem, comparing the candidates using certain benchmarks, and retrieving the code to be used (ideally, code that is interoperable with your system setup).
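The use case above can be sketched in a few lines of Python. This is not AlgoData’s actual interface. All node names and scores below are invented, and keeping the benchmark scores in a separate dictionary is a simplifying assumption.

```python
# Hypothetical AlgoData-style triples; all names are invented.
triples = [
    ("algorithm_1", "solves", "problem_X"),
    ("algorithm_1", "is documented in", "article_Y"),
    ("algorithm_1", "is implemented by", "software_Z"),
    ("algorithm_2", "solves", "problem_X"),
    ("algorithm_2", "is implemented by", "software_V"),
    ("algorithm_3", "solves", "problem_Q"),
]

# Invented benchmark scores: (algorithm, benchmark) -> points.
scores = {
    ("algorithm_1", "benchmark_W"): 7.5,
    ("algorithm_2", "benchmark_W"): 9.0,
}


def objects(subject, predicate):
    """All objects linked to `subject` through `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def rank_algorithms(problem, benchmark):
    """Algorithms that solve `problem`, best benchmark score first, with their software."""
    candidates = [s for s, p, o in triples if p == "solves" and o == problem]
    scored = [(a, scores[(a, benchmark)]) for a in candidates if (a, benchmark) in scores]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(a, score, objects(a, "is implemented by")) for a, score in scored]


ranking = rank_algorithms("problem_X", "benchmark_W")
for name, score, software in ranking:
    print(name, score, software)
```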

AlgoData has a well-defined ontology. An ontology (from the Greek, loosely, “study or discourse of the things that exist”) is the set of concepts relevant to your domain. For instance, in an e-commerce site, “article”, “client”, “shopping cart”, or “payment method” are concepts that need to be defined and included in the implementation of the e-commerce platform. For knowledge graphs, the list would include all types of nodes, and all labels for the edges and other properties. In general-purpose knowledge graphs, such as Wikidata, the ontology is huge, and for practical purposes the user (human or machine) relies on search/suggestion algorithms to identify the property that best fits their intention. In contrast, for specific-purpose knowledge graphs, such as AlgoData, a reduced and well-defined ontology is possible, which simplifies the overall structure and the search mechanisms.

The ontology of AlgoData (as of June 2023, under development) is the following:

Classes:

Algorithm, Benchmark, Identifiable, Problem, Publication, Realization, Software.

Object Properties:

analyzes, applies, documents, has component, has subclass, implements, instantiates, invents, is analyzed in, is applied in, is component of, is documented in, is implemented by, is instance of, is invented in, is related to, is solved by, is studied in, is subclass of, is surveyed in, is tested by, is used in, solves, specializedBy, specializes, studies, surveys, tests, uses.

Data Properties:

has category, has identifier.
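One practical benefit of a small ontology is that it can be enforced mechanically. The sketch below copies the classes and a few object properties from the lists above. Pairing properties such as “solves” and “is solved by” as inverses is our reading of their names, not something the ontology listing states, and `add_statement` is a hypothetical helper.

```python
# Classes copied from the ontology listing.
CLASSES = {"Algorithm", "Benchmark", "Identifiable", "Problem",
           "Publication", "Realization", "Software"}

# A subset of object properties, paired as inverses (an assumption based on their names).
INVERSES = {
    "solves": "is solved by",
    "implements": "is implemented by",
    "documents": "is documented in",
    "tests": "is tested by",
}
INVERSES.update({inverse: name for name, inverse in list(INVERSES.items())})


def add_statement(graph, node_classes, subject, predicate, obj):
    """Add a triple and its inverse, rejecting anything outside the ontology."""
    for node in (subject, obj):
        if node_classes.get(node) not in CLASSES:
            raise ValueError(f"node {node!r} has no known class")
    if predicate not in INVERSES:
        raise ValueError(f"unknown predicate: {predicate!r}")
    graph.add((subject, predicate, obj))
    graph.add((obj, INVERSES[predicate], subject))


node_classes = {"algorithm_1": "Algorithm", "problem_X": "Problem"}
graph = set()
add_statement(graph, node_classes, "algorithm_1", "solves", "problem_X")
print(sorted(graph))
```

Storing both directions keeps queries simple, at the cost of having to keep the two triples in sync.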

We can also display this ontology as a graph.

Currently, AlgoData implements two search functions: “Simple search”, which matches words in the content, and “Graph search”, where we query for nodes in the graph whose connections satisfy certain conditions. The main AlgoData page gives a sneak preview of the system (these links are password protected, but MaRDI team members and any researcher with a valid ORCID identifier can access them).

A project closely related to AlgoData is the *Model Order Reduction Benchmark* (MORB) and its *Ontology* (MORBO). This sub-project focuses on the creation of benchmarks for algorithms that perform Model Order Reduction (a standard technique in mathematical modeling to reduce the simulation time for large-scale systems) and has its own knowledge graph and ontology, tailored to this problem. More information can be found on the MOR Wiki and the MaRDI TA2 page.

### The MaRDI portal and knowledge graph

The main output from the MaRDI project will also be based on a knowledge graph. The MaRDI Portal will be the entry point to all services and resources provided by MaRDI. The portal will be backed by the MaRDI knowledge graph, a big knowledge graph scoped to all mathematical research data. You can already have a sneak peek to see the work in progress.

The architecture of the MaRDI knowledge graph follows that of Wikidata, and it is compatible with it. In fact, many entries of Wikidata have been imported into the MaRDI KG and vice versa. The MaRDI knowledge graph will also integrate many other open knowledge resources, thus building on many existing projects. A non-exhaustive list would include:

- The MaRDI AlgoData knowledge graph described above.
- Other MaRDI knowledge graphs, such as the MORWiki or the graph of Workflows with other disciplines.
- The zbMATH Open repository of reviews of mathematical publications.
- The swMATH Open database of mathematical software.
- The NIST Digital Library of Mathematical Functions (DLMF).
- The CRAN repository of R packages.
- Mathematical publications in arXiv.
- Mathematical publications in Zenodo.
- The OpenML platform of Machine Learning projects.
- Mathematical entries from Wikidata.
- Entries added manually by users.

The MaRDI Portal does not intend to replace any of those projects, but to link all those openly available resources together in a big knowledge graph of greater scope. As of June 2023, the MaRDI KG has about 10 million triples (subject-predicate-object as in the RDF format). As with Wikidata, the ontology is too big to be listed, and it is described within the graph itself (e.g. the property P2 is the identifier for functions from the DLMF database).

Let us see some examples of entries in the MaRDI KG. A typical entry node in the MaRDI KG (in this example, the program ggplot2) is very similar to a Wikidata entry. This page is a human-friendly interface, but we can also get the same information in machine-readable formats such as RDF or JSON.

For the end user, probably it is more useful to query the graph for connections. As with Wikidata, we can query the MaRDI knowledge graph directly in SPARQL. It is a work-in-progress to enable the Scholia plug-in to work with the MaRDI KG. Currently, the beta MaRDI-Scholia queries against Wikidata.

Some queries that are available in the MaRDI KG but not on Wikidata are for instance queries to formulas in DLMF: here formulas that use the gamma function, or formulas that contain sine and tangent functions (the corpus of the database is still small, but it illustrates the possibilities). Wikidata can nevertheless query for symbols in formulas too.

The MaRDI KG is still in an early stage of development, and not ready for public use (all the examples cited are illustrative only). Once the KG begins to grow, mostly from open knowledge sources, the MaRDI team will improve it with some “knowledge building” techniques.

One such technique is the automated retrieval of structured information. For instance, the bibliographic references in an article are structured information, since they follow one of a few formats, and there are standards (bibTeX, Zb/MR number, …).
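As an illustration of why a standard format makes retrieval easy, the following sketch extracts the fields of a single BibTeX entry with two regular expressions. The entry is made up, the function name is hypothetical, and the parser deliberately ignores nested braces and quoted values, which a production tool would have to handle.

```python
import re

# A made-up BibTeX entry, used only to exercise the parser.
ENTRY = """@article{doe2020example,
  author  = {Doe, Jane and Roe, Richard},
  title   = {An Example Title},
  journal = {Journal of Examples},
  year    = {2020}
}"""

HEADER = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
FIELD = re.compile(r"(\w+)\s*=\s*\{([^{}]*)\}")


def parse_bibtex_entry(text):
    """Extract entry type, citation key, and flat `name = {value}` fields.

    Nested braces and quoted values are not handled in this sketch.
    """
    header = HEADER.search(text)
    if header is None:
        raise ValueError("not a BibTeX entry")
    record = {"type": header.group(1).lower(), "key": header.group(2)}
    for name, value in FIELD.findall(text):
        record[name.lower()] = value.strip()
    return record


record = parse_bibtex_entry(ENTRY)
authors = [a.strip() for a in record["author"].split(" and ")]
print(record["key"], authors, record["year"])
```

Each extracted field maps naturally onto a triple, for example (entry, “author”, person), which is what makes bibliographies a convenient source for a knowledge graph.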

Another technique is link inference. This addresses the problem of low connectivity in graphs made by importing sub-graphs from multiple third-party sources, which may result in very few links between the sub-graphs. For instance, an article citing some references and a GitHub repository citing the same references are likely talking about the same topic. These inferences can then be reviewed by a human if necessary.
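The shared-references heuristic can be sketched as follows. The use of the Jaccard index and the threshold of 0.5 are our own illustrative choices, not a description of the method MaRDI applies, and the node names and references are invented.

```python
from itertools import combinations

# Hypothetical nodes imported from different sources, with the references they cite.
references = {
    "article:A": {"ref1", "ref2", "ref3", "ref4"},
    "github:repo-B": {"ref2", "ref3", "ref4", "ref5"},
    "dataset:C": {"ref9"},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def infer_links(references, threshold=0.5):
    """Propose 'is related to' links for pairs of nodes with enough shared references."""
    proposals = []
    for n1, n2 in combinations(sorted(references), 2):
        score = jaccard(references[n1], references[n2])
        if score >= threshold:
            proposals.append((n1, "is related to", n2, round(score, 2)))
    return proposals


proposals = infer_links(references)
print(proposals)
```

Keeping the score alongside each proposed link makes it easy to send only the borderline cases to a human reviewer.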

Another enhancement would be to improve search in natural language so that more complex queries can be made in plain English without the need to use SPARQL language.

The latest developments of the MaRDI Portal and its knowledge graph will be presented at a mini-symposium at the forthcoming DMV annual meeting in Ilmenau in September 2023.

- **Knowledge ladder**: Steps into which information can be classified, from the rawest to the most structured and useful. Depending on the authors, these steps can be enumerated as Data, Information, Knowledge, Insight, Wisdom.
- **Data**: Raw values collected from measurements.
- **Information**: Data tagged with its meaning.
- **Knowledge**: Pieces of information connected together with causal or other relationships.
- **Knowledge base**: A set of resources (databases, dictionaries…) that represent Knowledge (as in the previous definition).
- **Knowledge graph**: A knowledge base organized in the form of a mathematical graph.
- **Insight**: Ability to identify relevant information from a knowledge base.
- **Wisdom**: Ability to find (or create) connections between information points, using existing or new knowledge relationships.
- **Ontology**: Set of all the terms and relationships relevant to describe your domain of study. In a knowledge graph, the types of nodes and edges that exist, with all their possible labels.
- **RDF (Resource Description Framework)**: A web standard to describe graphs as triples (subject - predicate - object).
- **SPARQL (SPARQL Protocol and RDF Query Language)**: A language to send queries (information retrieval/manipulation requests) to graphs in RDF format.
- **Wikipedia**: A multi-language online encyclopedia based on articles (non-structured human-readable text).
- **Wikidata**: An all-purpose knowledge graph intended to host data relevant to multiple Wikipedias. As a byproduct, it has become a tool to develop the semantic web, and it acts as a glue between many diverse knowledge graphs.
- **Semantic web**: A proposed extension of the web in which the content of a website (its meaning, not just the text strings) is machine-readable, to improve search engines and data discovery.
- **MediaWiki**: The free and open-source software that runs Wikipedia, Wikidata, and also the MaRDI portal and knowledge graph.
- **Scholia**: A plug-in software for MediaWiki that enhances the visualization of data queries to a knowledge graph.
- **AlgoData**: A knowledge graph for numerical algorithms, part of the MaRDI project.

### In Conversation with Daniel Mietchen

In this episode of Data Dates, Daniel and Tabea talk about knowledge graphs. They touch on the general concept, how it can help you find the proverbial needle, and the specific challenges posed by mathematical structures. In addition, we also hear about the MaRDI knowledge graph and what it offers mathematicians.

### Leibniz MMS Days

The 6th Leibniz MMS Days, organized by the Leibniz Network "Mathematical Modeling and Simulation (MMS)", took place this year from April 17 to 19 in Potsdam at the Leibniz Institute for Agricultural Engineering and Bioeconomics. A small MaRDI delegation, consisting of Thomas Koprucki, Burkhard Schmidt, Anieza Maltsi, and Marco Reidelbach, made their way to Potsdam to participate.

This year's MMS Days placed a special emphasis on "Digital Twins and Data-Driven Simulation," "Computational and Geophysical Fluid Dynamics," and "Computational Material Science," which were covered in individual workshops. There was also a separate session on research data and its reproducibility in which Thomas introduced the MaRDI consortium with its goals and vision, and promoted two important MaRDI services of the future, AlgoData and ModelDB; two knowledge graphs for documenting algorithms and mathematical models. Marco concluded the session by providing insight into the MaRDMO plugin, which links established software in research data management with the different MaRDI services, thus enabling FAIR documentations of interdisciplinary workflows. The presentation of the ModelDB was met with great interest among the participants and was the subject of lively discussions afterwards and in the following days. Some aspects from these discussions have already been considered in the further design of the ModelDB.

In addition to the various presentations, staff members of the institute gave a brief insight into the different fields of activity of the institute, such as the optimal design of packaging and the use of drones in the field, during a guided tour. The highlight of the tour was a visit to the 18-meter wind tunnel, which is used to study flows in and around agricultural facilities. So MaRDI actually got to know its first cowshed, albeit in miniature.

### MaRDI RDM Barcamp

MaRDI, supported by the Bielefeld Center for Data Science (BiCDaS) and the Competence Center for Research Data at Bielefeld University, will host a Barcamp on research-data management in mathematics on July 4th, 2023, at the Center for Interdisciplinary Research (ZiF) in Bielefeld.

**More information:**

- in English

### Working group on Knowledge Graphs

The NFDI working group aims to promote the use of knowledge graphs in all NFDI consortia, to facilitate cross-domain data interlinking and federation following the FAIR principles, and to contribute to the joint development of tools and technologies that enable the transformation of structured and unstructured data into semantically reusable knowledge across different domains. You can sign up to the mailing list of the working group here.

Knowledge graphs in other NFDI consortia can be found for instance at the NFDI4Culture KG (for cultural heritage items) or at the BERD@NFDI KG (for business, economic, and related data items).

**More information:**

- in English

### NFDI-MatWerk Conference

The 1st NFDI-MatWerk Conference to develop a common vision of digital transformation in materials science and engineering will take place from 27 to 29 June 2023 as a hybrid conference. You can still book your ticket for either on-site or online participation (online tickets are even free of charge).

**More information:**

- in English

### Open Science Barcamp

The Barcamp is organized by the Leibniz Strategy Forum Open Science and Wikimedia Deutschland. It is scheduled for 21 September 2023 in Berlin and is open to everybody interested in discussing, learning more about, and sharing experiences on practices in Open Science.

**More information:**

- in English

- The Department of Computer Science at Stanford University offers this graduate-level research seminar, which includes lectures on knowledge graph topics (e.g., data models, creation, inference, access) and invited lectures from prominent researchers and industry practitioners.

  It is available as a 73-page PDF document, divided into chapters:

  https://web.stanford.edu/~vinayc/kg/notes/KG_Notes_v1.pdf

  and additionally as a video playlist:

  https://www.youtube.com/playlist?list=PLDhh0lALedc7LC_5wpi5gDnPRnu1GSyRG

- Video lecture on knowledge graphs by Prof. Dr. Harald Sack. It covers the topics of basic graph theory, centrality measures, and the importance of a node.

  https://www.youtube.com/watch?v=TFT6siFBJkQ

- The Working Group (WG) Research Ethics of the German Data Forum (RatSWD) has set up the internet portal “Best Practice for Research Ethics”. It bundles information on the topic of research ethics and makes it accessible.

  https://www.konsortswd.de/en/ratswd/best-practices-research-ethics/

**4th issue - Reusability**

Welcome to the fourth MaRDI Newsletter! This time we will investigate the fourth and final FAIR principle: Reusability. We consider the R in FAIR to capture the ultimate aim of sustainable and efficient handling of research data, that is to make your digital maths objects reusable for others and to reuse their results in order to advance science. In the words of the scientific computing community, we want mathematics to stand on the shoulders of giants rather than to be building on quicksand.

licensed under CC BY-NC-SA 4.0.

To achieve this, we need to make sure every tiny piece in a chain of results is where it should be, links seamlessly to its predecessors and subsequent results, is true, and is allowed to be embedded in the puzzle we are trying to solve. This last point is crucial, so we dedicate the main article in this issue of the newsletter to the topic of documentation, verifiability, licenses, and community standards for mathematical research data. We also feature some nice pure-maths examples we made for Love Data Week, report on the first MaRDI workshop for researchers in theoretical fields who are new to FAIR research data management, and entertain you with surveys and news from the world of research data.

To get into the mood of the topic, here is a question for you:

If you need to (re)use research data you created some time ago, how much time would you need to find and understand it? Would you have the data at your fingertips, or would you have to search for it for several days?

You will be taken to the results page automatically after submitting your answer, where you can find out how long other researchers would take. Additionally, the current results can be accessed here.

### On the shoulders of giants

The famous quote from Newton, “If I have seen further, it is by standing on the shoulders of giants”, usually refers to how science is built on top of previous knowledge, with researchers basing their results on the works of scientists who came before them. One could reframe it by saying that scientific knowledge is reusable. This is a fundamental principle in the scientific community: once a result is published, anyone can read it, learn how it was achieved, and then use it as a basis for further research. Reusing knowledge is also ingrained in the practice of scientific research as the basis of verifiability. In the natural sciences, the scientific method demands that experimental data back your claims. In mathematical research, logical rigor demands a mathematical proof of your claims. This means that, for good scientific practice, your results must be verifiable by other researchers, and this verification requires reusing not only the mental processes but also the data and tools used in the research.

Research data must be as reusable as the results and publications they support. From the perspective of modern, intensively data-driven science, this demand poses some challenges. Some barriers to reusability are technical, because of incompatibilities of standards or systems, and this problem is largely covered in the Interoperability principle of FAIR. But other problems such as poor documentation or legal barriers can be even bigger obstacles than technical inconveniences.

Reuse of research data is the ultimate goal of FAIR principles. The first three principles (Findable, Accessible, Interoperable) are necessary conditions for effective reuse of data. What we list here as “Reusability” requirements are all the remaining conditions, often more subjective or harder to evaluate, that appeal to the final goal of having a piece of research data embedded in a new chain of results.

To be precise, the Reusability principle requires data and metadata to be richly characterised with descriptors and attributes. Anyone potentially interested in reusing the data should easily find out if that data is useful for their purposes, how it can be used, how it was obtained, and any other practical concerns for reusing it. In particular, data and metadata should be:

- associated with detailed provenance.
- released with a clear and accessible data usage license.
- broadly aligned with agreed community standards of its discipline.

#### Documentation

It is essential for researchers to acknowledge that the research data they generate is a first-class output of their scientific research and not only a private sandbox that helps them produce some public results. Hence, research data needs to be curated with reusability in mind, documenting all details (even some that might seem irrelevant or trivial to its authors) related to its source, scope, or use. In data management, we use the term “provenance” to describe the story and rationale behind that data. Why does it exist, what problem was it addressing, how it was gathered, transformed, stored, used… all this information might be relevant for a third party that first encounters the data and has to judge if it is relevant for themselves or not.

In experimental data, it is important to document exactly what the purpose of the experiment was, which protocol was followed to gather the data, who did the fieldwork (in case contact information is needed), which variables were recorded, how the data is organized, which software was used, which version of the dataset it is, etc. As an antithesis of the ideal situation, imagine that you, as a researcher, find an article that uses some statistical data that you think you could reuse or that you want to look at as a referee. The data is easily available, and it is in a format that you can read. The data, however, is confusing. The fields in the tables have cryptic names such as “rgt5” and “avgB” that are not defined anywhere, leaving you to guess their meaning. Units of measurement are missing. Some records are marked as “invalid” without any explanation of the reason and without making clear whether those records were used in the calculations. Derived data is calculated from a formula, but the implementation in the spreadsheet is slightly but significantly different from the formula in the article. If you re-run the code, the results are thus a bit different from those stated in the article. At some point, you try to contact the authors, but the contact data is outdated, or it is unclear which of the several authors can help with the data (you can picture such a scene in this animated short video). Note that in this scenario, the research data might have been perfectly Findable, Accessible, and technically Interoperable, but without attention to the Reusability requirements, the whole purpose of FAIR data is defeated.
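Part of this documentation effort can be checked automatically. As a minimal sketch, assuming the dataset ships with a “data dictionary” that maps each column to a description and a unit, the following function reports columns whose metadata is missing or empty. The column names, descriptions, and the `undocumented` helper are all hypothetical.

```python
# Hypothetical column names and metadata, for illustration only.
columns = ["temperature", "pressure", "avgB"]

data_dictionary = {
    "temperature": {"description": "Daily mean air temperature", "unit": "degC"},
    "pressure": {"description": "Gas pressure in the vessel", "unit": ""},
}


def undocumented(columns, dictionary, required=("description", "unit")):
    """Map each column to the required metadata fields that are missing or empty."""
    problems = {}
    for column in columns:
        entry = dictionary.get(column, {})
        missing = [field for field in required if not entry.get(field)]
        if missing:
            problems[column] = missing
    return problems


problems = undocumented(columns, data_dictionary)
print(problems)
```

A check like this cannot judge whether a description is actually helpful, but it does catch the fields that nobody described at all.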

In computer-code data, documentation and good community development practices are non-trivial issues the industry has been addressing for a long time. Communities of programmers concerned by these problems have developed tools and protocols that solve, mitigate, or help manage these issues. Ideally, scientists working on scientific computing should learn and follow those good practices for code management. For instance, package managers for standard libraries, version control systems, continuous integration schemes, automated testing, etc., are standard techniques in the computer industry. While not using any of these techniques and just releasing source code in zip files might not break F-A-I principles, it will make reuse and community development much more difficult.

Documenting algorithms is especially important. Algorithms frequently use tricks, constants that get hard-coded, code patterns that come from standard recipes, parts that handle exceptional cases… Most often, even a very well-commented code is not enough to understand the algorithm, and a scientific paper is published to explain how the algorithm works. The risk is having a mismatch between the article that explains the algorithm, and the released production-ready code that implements it. If the code implements something similar but not exactly what is described in the article, there is a gap where mistakes can enter. Having a close integration between the paper and the code is crucial to prevent the newcomer from having to rework how the described algorithm translates into code.

#### Verifiability

As we introduced above, independent verification is a pillar of scientific research, and verification cannot happen without reusability of all necessary research data. MaRDI puts a special effort into enabling verification of data-driven mathematical results, by building FAIR tools and exchange platforms for the fields of computer algebra, numerical analysis, and statistics and machine learning.

An interesting example arises in computer algebra research. In that field, output results are often as valuable as the program that produced them. For instance, classifications and lists are valuable by themselves (see for example the LMFDB or MathDB sites for some classification projects). Once such a list is found, it can be stored and reused for other purposes without any need to revisit the algorithm that produced it. Hence, the focus is normally on the reusability of the output, while the reusability of the sources is forgotten. As a result, the provenance of the data goes undocumented: how it was created and which techniques were used to find it. This entails serious risks. Firstly, it is essential to verify that the list is correct (since a lot of work will be carried out assuming it is). Secondly, it is often the case that later research needs a slight variation of the list offered in the first place, so researchers need to modify parameters or characteristics of the algorithm to create a modified list.

In the case of numerical analysis, the output algorithms are usually focused on user reusability, often in the form of computing packages or libraries. However, several different algorithms may compete for accuracy, speed, hardware requirements, etc., so the “verification” process gets replaced by a series of benchmarks that can rate an algorithm in different categories and verify its performance. We have described, in the previous newsletter, how MaRDI would like to make numerical algorithms easier to reuse and benchmark them in different environments.

As for statistical data, our Interoperability issue of the newsletter describes how MaRDI curates datasets with “ground truths,” facts that are known for certain independently of the data and that allow new statistical tools applied to the data to be validated. In this case, reusing these new statistical tools in new studies increases the corpus of cases where the tool has been successfully used, making each reuse a part of the validation process.

#### Licenses

We also discussed licenses in our Accessibility issue. Let’s recall that FAIR principles do not prescribe free / open licenses, although those licenses are the best way to allow unrestricted reusability. However, FAIR principles do require a clear statement of the license that applies, be it restrictive or permissive.

Even within free/open licenses, the choice is wide and tricky. In software, *open source* licenses (e.g. MIT, Apache licenses) refer to the fact that the source code must be provided to the user. Those are amongst the most permissive because with the code one can study, run, or modify it. In contrast, *free* software licenses (e.g. GPL) carry some restrictions and an ethical/ideological load. For instance, many free licenses include *copyleft*, which means that any derived work must keep the same license, effectively preventing a company from bundling this software in a proprietary package that is not free software.

In creative works (texts, images…), the Creative Commons licenses are the standard legal tool to explicitly allow redistribution of works. There are several variants, ranging from almost no restrictions (CC0 / Public domain), to including clauses for attribution (CC-BY, attribution), sharing with the same license (CC-SA, share alike), or restricting commercial use (CC-NC, non-commercial) or derivative works (CC-ND, no derivatives), and any compatible combination. For databases, the Open Database License (ODbL) is a widely used open license, along with CC.

The following diagram shows how you can determine which CC license would be appropriate for you to use:

Note that CC-ND is not an open license, and CC-NC is subject to interpretation of the term “non-commercial,” which can pose problems. While CC licenses have been defended in court in many jurisdictions, there are always legal details that can pose issues. For instance, the CC0 license intends to waive all rights over a work, but in some jurisdictions, there are rights (such as authorship recognition) that cannot be waived. Other details concern the license versions. The latest CC version is 4.0, and it intends to be valid internationally without need to “port” or adapt to each jurisdiction, but each CC version has its own legal text and thus provides slightly different legal protection. Please note that this survey article does not provide legal advice; you can find all the legal text and human-readable text on the CC website.
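The way the CC clauses combine can be sketched in a few lines of Python. This is only an illustration of the combination rules mentioned in the text (attribution is part of every CC license except CC0, and share-alike cannot be combined with no-derivatives); the function name `cc_license` and its parameters are made up for this sketch, and it is not a substitute for the license chooser on the CC website.

```python
def cc_license(attribution=True, share_alike=False,
               non_commercial=False, no_derivatives=False):
    """Return the name of the CC license matching the requested clauses.

    Also reports whether the result counts as an open license in the
    sense used in the text (ND is not open, NC is open to interpretation
    and is treated here as not open).
    """
    if not attribution:
        if share_alike or non_commercial or no_derivatives:
            raise ValueError("every CC license except CC0 includes attribution")
        return {"name": "CC0 1.0", "open": True}
    if share_alike and no_derivatives:
        raise ValueError("share-alike and no-derivatives cannot be combined")

    parts = ["BY"]
    if non_commercial:
        parts.append("NC")
    if no_derivatives:
        parts.append("ND")
    if share_alike:
        parts.append("SA")
    return {
        "name": "CC " + "-".join(parts) + " 4.0",
        "open": not (non_commercial or no_derivatives),
    }


print(cc_license(share_alike=True)["name"])
print(cc_license(non_commercial=True, share_alike=True)["name"])
```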

In general, the best policy for open science is to use the least restrictive license that suits your needs and, with very few exceptions, not to add or remove clauses to modify a license. Reusing and combining content implies that newly generated content needs a license compatible with those of the parts that were used. This can become complicated or impossible the more restrictions they have (for instance, with interpretations of commercial interest or copyleft demands). Also, licenses and user agreements can conflict with other policies, such as data privacy; see an example in the Data Date interview in this newsletter.

#### Community

Perhaps the most synthetic form of the Reusability principle would be “do as the community does or needs” since it is a goal-focused principle: if the community is re-using and exchanging data successfully, keep those policies; if the community struggles with a certain point, act so that reuse can happen.

MaRDI takes a practical approach to this, studying the interaction between and within the mathematics community and other research communities and the industry. We described this “collaboration with other disciplines” in the last newsletter, and we highlighted the concept of “workflow” as the object of study, that is, the theoretical frameworks, the experimental procedures, the software tools, the mathematical techniques, etc. used by a particular research community. By studying the workflows in concrete focus communities, we expect to significantly increase and improve their reuse of mathematical tools, while also setting methods that will apply to other research communities as well.

MaRDI’s most visible output will be the MaRDI Portal, which will give access to a myriad of ‘FAIR’ resources via federated repositories, organized cohesively in Knowledge Graphs. MaRDI services will not only facilitate reusability of research data to mathematicians and researchers in other fields alike but also be a vivid example of best-practices research life. This portal will be a gigantic endeavor to organize FAIR research data, a giant on whose shoulders tomorrow’s scientists can stand. We strive for MaRDI to establish a new data culture in the mathematical research community and in all disciplines it relates to.

### In Conversation with Elisabeth Bergherr

In this episode of Data Dates, Elisabeth and Christiane talk about reusability and the use of licenses in interdisciplinary statistical research, student theses, and teaching.

### Love Data Week

Love Data Week is an international week of actions to raise awareness for research data and research data management. As part of this initiative, MaRDI created an interactive website that allows you to play around with various mathematical objects and learn interesting facts about their file formats.

### Research data in discrete math

In mid-March, the MaRDI outreach task area hosted the first research-data workshop for rather theoretical mathematicians in discrete math, geometry, combinatorics, computational algebra, and algebraic geometry. These communities are not covered by MaRDI's topic-specific task areas but form an important part of the German mathematical landscape, in particular with the initiative for a DFG priority program whose applicants co-organized the event. A big crowd of over sixty participants spent two days in Leipzig discussing automated recognition of Ramanujan identities with Peter Paule, machine-learned Hodge numbers with Yang He, and Gröbner bases for locating photographs of dragons with Kathlén Kohn. Michael Joswig led a panel focusing on the future of computers in discrete mathematics research and the importance of human intuition. Antony Della Vecchia presented file formats for mathematical databases, and Tobias Boege encouraged the audience to reproduce published results in a hands-on session, with participants finding pitfalls even in the simplest exercise. In the final hour, young researchers took the stage to present their areas of expertise, the research data they handle, and their take-away messages from this workshop: follow your interests, keep communicating with your peers and scientists from other disciplines, and make sure your research outputs are FAIR for yourself and others. This program made for a very lively atmosphere in the lecture hall and was complemented by engaging discussions on mathematicians as pattern-recognition machines, how mathematics might be a bit late to the party in terms of software, whether humans will be obsolete soon, and the hierarchy of difficulty in mathematical problems.

### Conference on Research Data Infrastructure

The Conference will take place September 12th – 14th, 2023, in Karlsruhe (Germany). There will be disciplinary tracks and cross-disciplinary tracks.

Abstract submissions deadline: April 21, 2023

**More information:**

- in English

### IceCube - Neutrinos in Deep Ice

This code competition aims to identify which direction neutrinos detected by the IceCube neutrino observatory came from. PUNCH4NFDI is focused on particle, astro-, astroparticle, hadron, and nuclear physics, and is supporting this ML challenge.

Deadline: April 23, 2023

**More Information:**

- in English

### Open Science Radio

Get an overview of all NFDI consortia funded to date, and gain an insight into the development of the NFDI, its organizational structure, and goals in the 2-hour Open Science Radio episode interviewing Prof. Dr. York Sure-Vetter, the current director of the NFDI.

**Listen:**

- in English

The DMV, in cooperation with the KIT library, maintains a free self-study course on good scientific practice in mathematics, including notes on the FAIR principles. (Register here to subscribe to the free course.)

Edmund Weitz of the University of Hamburg recorded an entertaining chat about mathematics with ChatGPT (in German).

Remember our interview about accessibility with Johan Commelin in the second MaRDI Newsletter? The Xena Project is "an attempt to show young mathematicians that essentially all of the questions which show up in their undergraduate courses in pure mathematics can be turned into levels of a computer game called Lean". It has published a blog post highlighting very advanced maths, which can now be understood using the interactive theorem prover Lean Johan told us about.

On March 14, the International Day of Mathematics was celebrated worldwide. You can relive the celebration through the live blog, which also includes two video sessions with short talks for a general audience—one with guest mathematicians and one with the 2022 Fields Medal laureates. This year, the community was asked to create Comics. Explore the featured gallery and a map with all of the mathematical comic submissions worldwide.

**3rd issue - Interoperability**

Welcome to the third issue of the MaRDI Newsletter on mathematical research data, and happy holidays! We give you a brief snapshot of the world of interoperability. This is the third and maybe one of the most challenging of the FAIR principles: it is very topic dependent and much more technical than, say, findability. Its key question is: how do you seamlessly hand a digital object from one researcher to another?

licensed under CC BY-NC-SA 4.0.

We discuss the meaning and implications of interoperability in a number of mathematical disciplines, interview an expert on scientific software, report on workshops that have happened in the mathematical research-data universe, and much more.

We encounter different systems almost everywhere in our lives, both professionally and in everyday situations. Not all of them seem to be interoperable. For example, a navigation app will not be able to interpret equations, and it might not be trivial to ask Mathematica to compile your Julia computations. Think of any two systems: what would a marriage of the two look like? (We understand marriage here to be establishing the base for communication and exchange.)

If you could choose two systems you would like to get married, which ones would you choose?

Did you choose a perfect match in the survey above? You can add more anytime...

### Interoperability: Let's play together

In our previous newsletters, we have covered the *Findability* and *Accessibility* principles in FAIR research data. Those are the basic principles that give researchers awareness and access to the existence of research data. In contrast, the remaining two principles, *Interoperability* and *Reusability*, are related to what can be done with that data or rather to the quality of it. They have more profound implications for the interactions of the research community as a whole.

Research is almost never conducted in isolation. Researchers build on top of other researchers’ findings, combine different sources with their own insights, and use plenty of tools and methods developed by others. Here we will focus on some technical (and less technical) requirements to make this research community possible: Interoperability.

Interoperability is the capacity to combine pieces from different sources to work together. Standards in science and industry, such as measuring units or the shape of plug connectors, are designed for interoperability. In research, a simple example is language. Most scientific research is nowadays written and published in English. While there may be valid reasons to use other languages (in specific disciplines, in outreach, to foster exchanges in a particular cultural group…), the reality is that using a single *lingua franca* for scientific research enables comprehension and use of any scientific publication to all researchers. This creates a necessity for researchers to learn and use the English language as part of their research (and life) skills. When it comes to computers, plenty of standards respond to the need for interoperable data, such as file formats or computer languages (pdf, LaTeX, …), some having more success than others.

For research data, interoperability is crucial to enable a research community to collaborate and interact. Interoperability means using a standard set of vocabulary and data models that give a good and agreed representation of the type of research data in question. This effectively sets a standard for data communication. Then each researcher can adapt their tools and methods to process data within those standards.

To be precise, FAIR principles provide a framework for interoperable research data:

- Data and metadata must use a knowledge representation (ontologies, data models) that is shared, broadly applicable, and accessible.
- Such knowledge representation must be itself FAIR.
- When data and metadata reference other data and metadata, their relationship must be qualified (e.g. data X uses algorithm Y in such a way, data Z is derived from dataset W by applying such filtering).

In information science, an ontology is the set of all relevant concepts and relationships for a particular domain. This can be an enumeration or represented by a knowledge graph where nodes are concepts (think of nouns), and edges are qualifiers (think of verbs). This theoretical reflection of the nature of your research data is fundamental to developing useful standards that enable practical interoperability.
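To make the idea of nodes, qualified edges, and qualified references concrete, here is a minimal Python sketch that stores a tiny knowledge graph as subject-predicate-object triples and follows one kind of relationship. All identifiers (`dataset_Z`, `derived_from`, and so on) are hypothetical examples echoing the bullet points above; real ontologies use shared, published vocabularies rather than ad hoc strings.

```python
# Each triple reads (subject, predicate, object): nodes are concepts,
# the predicate qualifies the relationship between them.
triples = [
    ("dataset_Z", "derived_from", "dataset_W"),
    ("dataset_Z", "produced_by", "algorithm_Y"),
    ("algorithm_Y", "described_in", "article_A"),
    ("dataset_W", "has_license", "CC-BY-4.0"),
]


def objects(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def follow(graph, start, predicate):
    """Follow one predicate repeatedly, e.g. to trace a derivation chain."""
    chain = [start]
    while True:
        successors = objects(graph, chain[-1], predicate)
        if not successors or successors[0] in chain:
            return chain
        chain.append(successors[0])


print(objects(triples, "dataset_Z", "produced_by"))
print(follow(triples, "dataset_Z", "derived_from"))
```

Because the edges carry a name, a query can distinguish "derived from" from "produced by", which an unqualified link between two records could not express.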

The MaRDI project actually devotes a significant part of its efforts to improving the interoperability (and reusability) of research data. Here we provide a brief summary of these interoperability efforts.

#### Computer Algebra

Computer Algebra concerns calculations on abstract mathematical objects, such as groups, rings, polynomials, manifolds, polytopes, etc. Computations are generally exact (no numerical approximations). Typical use cases of computer algebra are enumeration problems, for instance, finding a list of all graphs with certain properties. For such abstract objects, data representation alone is already non-trivial, so researchers often build on top of specific frameworks called Computer Algebra Systems (CAS) that implement these data types and methods. Such CASes can be of broad scope, like Mathematica, Maple, Magma, SageMath, OSCAR, etc., or they can focus on a specific domain, like GAP (group theory), Singular (algebraic geometry), Polymake (polytopes and other combinatorial objects), etc. A desirable goal would be to have a common data format that allows interoperability between different software systems without loss of CAS information, enabling one system to parse files and call functions from another. This is obviously not an easy task. On the one hand, some of those CASes (e.g. Mathematica, Maple…) are proprietary, and their focus is not purely on math research, as they also provide tools used in other fields such as engineering or education. Interoperability approaches that use anything other than their provided APIs will therefore likely fail. On the other hand, the special-purpose CASes such as GAP, Singular, or Polymake (incidentally, all three originated at and are maintained by German universities and researchers close to MaRDI) are open-source, and can be used stand-alone but are also integrated into broader CASes such as SageMath (Python based) or OSCAR (Julia based). Turning these specific systems into broad-purpose CASes while also retaining state-of-the-art algorithms from the latest research is already a great success story.

The goals for MaRDI in Computer Algebra are to document and establish workflows, data formats, and guidelines on how to set up databases. By ‘workflow’ we mean the process of generating/retrieving data, setting up an experiment, and obtaining conclusions. Documenting a workflow implies recording the exact versions of the software (and possibly hardware) used as well as the tech stack (from the operating system to the languages, interpreters, and libraries used). This will have benefits such as enabling verification of the results and making further reuse easier. It also provides clear guidelines on which software can be used together, replaced, or mixed, and therefore makes it possible to evaluate its interoperability.
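Part of such workflow documentation can be captured automatically. The following Python sketch, using only the standard library, collects the operating system, interpreter, and package versions into a dictionary that could be stored next to the results. The function name `describe_environment` and the choice of fields are assumptions made for this example; they do not describe a MaRDI tool.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect version information for the current tech stack."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "operating_system": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_implementation": platform.python_implementation(),
        "python_version": platform.python_version(),
        "packages": versions,
    }


# The package names below are placeholders; list the ones your workflow uses.
env = describe_environment(["pip", "a-package-that-does-not-exist"])
print(json.dumps(env, indent=2))
```

The output depends on the machine it runs on, which is exactly the point: saving it with each computation records which stack produced the result.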

Documenting and establishing data formats means going a step further in interoperability: not only describing which software or data format the current work adheres to, but actually making a system-agnostic description of the data. For instance, if we are using a particular ring of polynomials in several variables with coefficients in a particular field, the data description should make clear how we store and operate on the elements of such a ring. Typically, this will follow a data format from a particular CAS, but having an independent description will enable other CASes to implement a compatibility layer to reuse the data. This will become even more relevant when implementing new abstract structures. Eventually, the goal is that any CAS wishing to support a particular data format can implement a compatibility layer based on the data description. This is called data serialization: internal data structures are translated into a text description, which can be passed to another system and de-serialized, that is, turned into a data structure of the new system with the same semantic information but possibly a different implementation. The MaRDI team is implementing this data serialization in OSCAR, but the goal is to have a system-agnostic specification.
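The round trip of serializing and de-serializing can be illustrated with a toy example. The Python sketch below writes a polynomial with rational coefficients to a JSON text that states the ring explicitly and then reads it back. The JSON layout (`"ring"`, `"terms"`, `"coefficient_field": "QQ"`) is invented for this illustration and is not the format used by OSCAR or any other CAS.

```python
import json
from fractions import Fraction


def serialize(poly, variables):
    """Turn {exponent tuple: coefficient} into a self-describing JSON text."""
    terms = [
        {"exponents": list(exps), "coefficient": str(coeff)}
        for exps, coeff in sorted(poly.items())
        if coeff != 0
    ]
    return json.dumps({
        "type": "polynomial",
        "ring": {"coefficient_field": "QQ", "variables": list(variables)},
        "terms": terms,
    })


def deserialize(text):
    """Rebuild the polynomial; a different system could build its own type here."""
    doc = json.loads(text)
    if doc["type"] != "polynomial" or doc["ring"]["coefficient_field"] != "QQ":
        raise ValueError("unsupported object description")
    variables = tuple(doc["ring"]["variables"])
    poly = {}
    for term in doc["terms"]:
        exps = tuple(term["exponents"])
        if len(exps) != len(variables):
            raise ValueError("exponent vector does not match the variables")
        poly[exps] = poly.get(exps, Fraction(0)) + Fraction(term["coefficient"])
    return poly, variables


# 3/4 * x^2 * y - 2 * y^3 in QQ[x, y]
p = {(2, 1): Fraction(3, 4), (0, 3): Fraction(-2)}
text = serialize(p, ("x", "y"))
q, ring_variables = deserialize(text)
print(text)
print(q == p)
```

Coefficients are written as strings such as `"3/4"` so that no precision is lost in transit; the receiving side decides how to represent them internally.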

Finally, documenting computer algebra databases will, among other benefits in findability, enable a comprehensive picture of the different systems and the compatibility layers needed to have interoperability amongst them.

#### Scientific computing

Numerical algorithms are central to scientific computing. Their approximations to exact mathematical quantities come with inherent inexactness and error propagation, due to the finite precision of the data structures used. This contrasts with the abstract and exact objects used in Computer Algebra. Typical examples are linear solvers (Ax=b) for different types of matrices (big, small, huge, sparse, dense, stochastic…), or numerical integration methods for ODEs or PDEs. Numerical algorithms are closely associated with applied mathematics, and performance or scalability are relevant factors for choosing one method over another. We already described in the Findability article that MaRDI is building a knowledge graph for those numerical algorithms, together with benchmarks, supporting articles for theoretical background, and other features. But the goal goes beyond creating such a graph just to find algorithms: the ambition is also to develop an infrastructure that makes all these algorithms interoperable.
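The contrast between finite-precision and exact arithmetic shows up even in a one-line computation. The following Python snippet, a small illustration using only the standard library, adds one tenth ten times, once with binary floating-point numbers and once with exact rational numbers.

```python
from fractions import Fraction

# Ten copies of 0.1 as binary floating-point numbers.
float_sum = sum([0.1] * 10)

# The same computation with exact rational arithmetic.
exact_sum = sum([Fraction(1, 10)] * 10)

print(float_sum == 1.0)       # False: rounding errors have accumulated
print(exact_sum == 1)         # True: no approximation is involved
print(abs(float_sum - 1.0))   # the size of the accumulated error
```

The error here is tiny, but in long computations such errors propagate, which is why accuracy is one of the criteria on which numerical algorithms are compared.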

Researchers implement their algorithms in programming languages such as MATLAB (which is proprietary), or C/C++, Julia, Python, etc., possibly with extension libraries. To implement interoperability between different numerical methods, MaRDI proposes a three-component architecture (driver - connector - implementor). For a particular algorithm, the implementor is the piece of software that contains the actual existing algorithm in whatever language or framework the author used. The driver is a high-level calling function that contains the semantics of the data, but not the implementation of the algorithm. The same data model can then be used by drivers of different numerical algorithms, even if their implementations use completely different technologies, thus enabling an interoperable ecosystem. The prototypes of those drivers are being proposed and defined by the MaRDI team. The missing critical piece is the connector, which communicates between the driver and the implementor. It needs to be developed for each algorithm, likely in collaboration with the original author. The MaRDI team is implementing some examples, but the goal is that in the future, any researcher who is developing numerical algorithms can use their preferred technology stack and then easily implement a connector to standard driver functions.
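The division of roles can be sketched in Python as follows. This is only an illustration of the driver, connector, and implementor idea under simplifying assumptions (everything lives in one process and one language); the names `LinearProblem`, `solve_linear_system`, and the two toy solvers are invented for the example and do not reflect MaRDI's actual prototypes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinearProblem:
    """Shared data model: find x with matrix * x = rhs."""
    matrix: List[List[float]]
    rhs: List[float]


# Implementor 1: Gaussian elimination, expects an augmented matrix [A | b].
def gauss_solve(augmented):
    n = len(augmented)
    rows = [row[:] for row in augmented]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (rows[i][n] - known) / rows[i][i]
    return x


# Implementor 2: Jacobi iteration, expects a flat row-major matrix.
def jacobi_solve(flat, b, n, iterations=200):
    x = [0.0] * n
    for _ in range(iterations):
        x = [
            (b[i] - sum(flat[i * n + j] * x[j] for j in range(n) if j != i))
            / flat[i * n + i]
            for i in range(n)
        ]
    return x


# Connectors: translate the shared data model into each implementor's input.
def gauss_connector(problem):
    augmented = [row + [b] for row, b in zip(problem.matrix, problem.rhs)]
    return gauss_solve(augmented)


def jacobi_connector(problem):
    n = len(problem.rhs)
    flat = [entry for row in problem.matrix for entry in row]
    return jacobi_solve(flat, problem.rhs, n)


# Driver: knows the semantics of the data, not how the algorithm works.
def solve_linear_system(problem, connector):
    if len(problem.matrix) != len(problem.rhs):
        raise ValueError("matrix and right-hand side sizes differ")
    return connector(problem)


problem = LinearProblem(matrix=[[4.0, 1.0], [2.0, 5.0]], rhs=[9.0, 12.0])
for connector in (gauss_connector, jacobi_connector):
    print(connector.__name__, solve_linear_system(problem, connector))
```

The calling code at the bottom is identical for both methods; only the connector changes. That is what allows the same test problem to be run against different implementations.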

The benchmark comparison between algorithms (planned for the knowledge graph) actually requires this interoperability architecture so that the same test can be executed by different algorithms in equal conditions without a need to adapt the data to fit a particular tech framework.

#### Statistics and Machine Learning

Typical research data in statistics or machine learning include big experimental datasets, frequently coming from other domains. Good examples of this are genetic data or financial data. These datasets contain valuable information that researchers try to extract using statistics or AI techniques. In statistics, for instance, a typical goal is to create a model, meaning to describe a joint probability distribution of all the variables depending on the individual probability distributions of each variable. This means understanding the dependencies between the variables.

A problem often found by statisticians who develop new theoretical methods to extract information from experimental data is that there is only a very limited collection of suitable datasets where they can test new methods. It is difficult to obtain curated data from interdisciplinary teams before the statistical tools are proven useful and robust, which leaves researchers with limited choices to run tests. The most valuable information in curated data includes “ground truths”, that is, relationships between variables that are known externally to the experimental data, via expert knowledge from another field. For instance, in a macroeconomic study, some variables can be related or independent, or their relationship may depend on the presence of a third variable indicator, or even more complex interactions. We may know some of these interactions by knowing government policies or strategies which are not reflected directly in the data. For the statistician, such a "ground truth" is very useful to validate the algorithm used to fit the model. A goal for MaRDI is to collect a broader, curated list of datasets that can be used by statisticians to test and validate modeling techniques. Those datasets need to be cleaned and ready to be used by standard statistical packages (that is, to be interoperable), and to have useful annotated “ground truths” attached to the data for use on interdisciplinary teams. Besides this data collection, MaRDI aims to be a leading example of quality curated data so that experimentalists can adhere to those quality standards.
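The role of a ground truth in validating a method can be illustrated with synthetic data, where the dependencies are known by construction. In the Python sketch below, `y` is generated from `x` while `z` is independent of both, and a deliberately simple detection method (thresholding the Pearson correlation) is checked against that known structure. The variable names, the threshold of 0.2, and the detection method itself are assumptions chosen for the example; real validation would use curated data and the statistical tool under study.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


def make_dataset(n=2000, seed=0):
    """Synthetic data whose dependency structure is known in advance."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [2 * xi + rng.gauss(0, 1) for xi in x]  # y depends on x
    z = [rng.gauss(0, 1) for _ in range(n)]     # z is independent
    data = {"x": x, "y": y, "z": z}
    ground_truth = {frozenset(("x", "y"))}
    return data, ground_truth


def detect_dependencies(data, threshold=0.2):
    """Toy method: report pairs whose absolute correlation exceeds a threshold."""
    names = sorted(data)
    found = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(data[a], data[b])) > threshold:
                found.add(frozenset((a, b)))
    return found


data, ground_truth = make_dataset()
detected = detect_dependencies(data)
print("method agrees with ground truth:", detected == ground_truth)
```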

Another goal concerns machine learning (ML) algorithms. The community around ML is much broader than mathematicians (software developers, data scientists, ML engineers…), and therefore the frameworks used are very diverse. TensorFlow and Torch are two popular tools in the industry, but there are many others. The language R is suitable for statistics and data science, and also for machine learning. An initiative to bring cohesion and interoperability in this software ecosystem is mlr3 (machine learning for the R language), which MaRDI is using and extending. The mlr3 project brings different R packages together (often based on or operating on other frameworks), providing unified naming conventions, and a full suite of tools (learners, benchmarks, analyzers, importers/exporters, …), making R and mlr3 a competitive integrated framework for ML.

We can see a couple of examples of how MaRDI is bridging interoperability gaps in this field. A first example: in machine learning (as in the statistics case we saw earlier), there is a great need for more quality datasets (training, evaluation…). OpenML is a web service that allows sharing of datasets and ML tasks within the ML community. MaRDI is helping to build mlr3oml, an interoperability interface between mlr3 and OpenML. MaRDI also builds and stores “curated quality datasets” in OpenML that can be used for testing and benchmarking, and also as a model of good practices.

A second example: Many learning algorithms in ML are treated as black boxes; they come from different ML techniques and have different implementations. However, a significant part of these algorithms comes from neural network techniques that share some common characteristics: architecture, loss function, optimizer… The package mlr3torch, being developed with MaRDI, aims to “open” some of those black boxes, giving greater control over those parameters.

#### Cooperation with other disciplines

MaRDI strives to bring together mathematical methods and the people who use them. Today this collaboration requires much more than having a common spoken language and publishing in international journals: data languages are now crucial. MaRDI aims to understand and document how researchers in disciplines other than mathematics use (or would like to use) mathematical research data. Hence, the “interoperability” between mathematics and other fields is key. For the past year, MaRDI has collected a series of case studies from other NFDI (the German National Research Data Infrastructure program) consortia, other research groups, and industry, to document through a series of templates how they work and use research data. The key concept is the “workflow”, meaning the documentation of the whole process of setting a theoretical framework, hypothesis to scrutinize, experiment model, data acquisition, technical equipment, metadata association, data processing, software used, data analysis techniques, extraction of results, publications… everything that is directly related to data management, but also its research context. Several examples of workflows can be found on the MaRDI portal TA4 page. Currently, the collected information is textual, highlighting the data acquisition process (and its metadata), and the mathematical model used. In the future, both the (meta)data and the model will be formalized by means of ontologies and model pathway diagrams (graphs) to enable further uses of the research data, such as reproducibility, replacing methods and techniques by newer or more performant ones, or enabling reusability by other researchers.

By looking at the case studies, one can observe that most researchers implement “island solutions” adapted to their specific needs, even if those solutions may be very professional and optimized. There is a great potential to increase interoperability and exchange. MaRDI aims to leverage a change in mathematical data management and analysis to support researchers, in the belief that such a shift will be broadly welcome within the research community.

#### MaRDI portal

The MaRDI portal will be the single entry point to all the MaRDI services and resources collected by the different task areas. The portal team is currently building a knowledge graph of mathematical research data by retrieving information from other sources (for instance, WikiData, swMATH for documenting mathematical software, package repositories to improve the information granularity of some mathematical software, zbMATH Open to retrieve publications, etc.). This requires a lot of interoperability efforts using the respective APIs since the volume of data is not manageable by hand. Some automation and AI techniques are being considered to foster this process. In due time, all the different MaRDI teams will start producing their output goals, and the portal team will manage the integration within the portal. For instance, the knowledge graph of numerical algorithms will be integrated into the knowledge graph of the MaRDI portal. The statistical datasets collections will also be described as entities in the MaRDI knowledge graph, and so on. In a sense, the portal needs to create interoperability layers between the internal task areas of MaRDI.

All in all, the interoperability principle is an enabling condition for building and strengthening a community. That is the driving goal of all the efforts from MaRDI that we described here. This enabling condition turns into an actual collaboration when the data is reused across different projects and researchers, which will be the topic of our fourth article in this series, about Reusability.

### In Conversation with Ulrike Meier Yang

In the third episode of the interview series Data Date, Ulrike and Christiane talk about mathematical research data in the xSDK project, the importance of guidelines, three levels of interoperability, and automated testing.

### MaRDI annual workshop 2022

Mid November, the whole MaRDI team met at WIAS in Berlin for their second annual workshop. The kickoff in Leipzig one year before had provided an enthusiastic start for the consortium and for building infrastructure for mathematical research data in Germany. The slogan at the time was to spend the coming twelve months doing two things: listening (zuhören) and simply getting started (einfach anfangen)!

Now the team looked back, recapped, and planned for the second year and further into the future. Over the course of three days, approximately forty people met in person, with some participating online, to first present each task area's updates and vision, then discuss current issues in interactive small-group BarCamps, and finally decide on the upcoming route. The event was kicked off with a keynote talk by Martin Grötschel, who stressed the importance of following a bottom-up process and pointed out potential project pitfalls drawn from his own experience. This was followed by NFDI's Cord Wiljes describing potential benefits of cross-consortial collaborations. There was plenty of lively discussion centered around possible career paths of women in maths and data and how MaRDI could live up to the central expectations of the Portal, link knowledge graphs, best deal with the very diverse mathematical research data in management plans, and build a community. BarCamps developed ideas and new work packages, like the setting up of an editorial team for the Portal. Throughout, many participants filled in self-designed bingo sheets to collect #MaRDI_buzzwords. The long and pleasant days were accompanied by a visit to the computer-games museum and a conference dinner. At the end of the workshop, the MaRDI team concluded that the coming year would be best spent building on the previous "listening and getting started" and focusing on two new tasks: networking with the community (vernetzen) and cross collaboration (zusammenarbeiten) within the consortium. This will link MaRDI's expertise across different institutions and will ensure that resulting services reach and engage with potential users early on, making them truly useful for the working mathematician.

### MaRDI Movies

The first in this series of short, entertaining, and informative videos is called 'Mardy, the happy math rabbit'. Follow Mardy through the pitfalls of reproducing software results: An introduction to software review in mathematics by Jeroen Hanselmann.

### MOM workshop on MaRDI, OSCAR, and MATHREPO

In November, MaRDI's task area for Computer Algebra invited their community to ZIB and TU Berlin for the "MOM workshop on MaRDI, OSCAR and MATHREPO". Over the course of two days, some twenty people met in person to discuss how to deal with databases, polytopes, triangulations, graded rings, polynomials, Gröbner bases, finite point configurations and the like. Particularly important were questions on how to save an object, where to store it long-term, how to seamlessly interact with databases, and how to reproduce a computation.

The MaRDI organisers presented serialisation and workflow efforts and led an exercise in reproducibility where the participants were asked to rerun published research outputs. Some could be redone quite well, others were not so easy to reproduce. A number of examples came from the mathematical research-data repository MathRepo, co-maintained by MaRDI's Tabea Bacher. The awarding of the FAIRest MathRepo page of 2022 was part of the workshop. A jury of interested workshop participants took a closer look at the contributions previously nominated by the audience and judged them according to the FAIR principles. The highly deserved winner was Tobias Boege from Aalto University for his entry on Selfadhesivity in Gaussian conditional independence structures. In addition to providing very good documentation, he found a way to make huge amounts of his research data FAIRly available by compressing files and using the MPDL Repository Keeper as a long-term storage solution, which was an unusually difficult problem.

Alheydis Geiger from the Max Planck Institute for Mathematics in the Sciences, Leipzig, presented a user story of OSCAR. In her paper, she and her collaborators combined different computer algebra systems, such as OSCAR, Macaulay2, Magma, Julia, Polymake, Singular and more, to investigate self-dual matroids from canonical curves. The Graded Ring Database was introduced in a talk by Alexander M. Kasprzyk from the University of Nottingham, who focused on the mathematical meaning of the research data in the database as well as on technical and accessibility matters.

In a final session, researchers split up into two smaller groups to discuss. The first group collected both computer algebra and general software systems used by the participants and discussed which system was best suited for what research questions. In the other group technical peer reviewing was discussed: how it can be done and why it would be necessary (for more on technical peer reviewing watch the MaRDI Movie Mardy, the happy math rabbit).

### MaRDI Workshop on scientific computing—A platform to discuss the “HOW”

From October 26 to 28, 2022, the first MaRDI Workshop on Scientific Computing took place at WWU, Münster. About 40 people from the scientific computing community and from MaRDI came together to learn and talk about research data in three densely packed days of exchange.

The introductory talk by Thomas Koprucki on MaRDI was followed by blocks of talks on topics such as workflows and reproducibility, ontologies and knowledge graphs, and benchmarks. Ten invited speakers presented their projects: for example, Ulrike Meier Yang (see video interview above) introduced the extreme-scale scientific software development kit xSDK, Benjamin Uekermann presented preCICE, a general-purpose simulation coupling interface, Andrea Walther talked about 40 years of developing ADOL-C, a package for automatic differentiation of algorithms, and Tyrone Rees presented FitBenchmarking, an open-source tool for comparing data analysis software.

As one of the main goals of the organizers was to bring together researchers from the scientific computing community and related disciplines to learn from different projects and related expertise, speakers were encouraged to present work in progress, open problems or report on personal experiences; not only to talk about the "WHAT" but also to share the "HOW". This concept worked out well. It was noticeable both in the coffee breaks, which were characterized by lively conversations, and in the afternoon of October 27, which was devoted entirely to discussions. There were several discussion groups focused on a variety of topics, such as workflows and reproducibility, knowledge graphs, research software, benchmarks, and training and awareness. The training and awareness group discussed how to deal with software that is not associated with a paper (there are some journals that might publish on such topics, but it is difficult to get the recognition deserved) and which career level is best approached for research data management topics. After the discussion in groups, the results were presented to everyone. One of the ideas that was discussed a lot when the groups reconvened was the possibility of providing better job security for software engineers by making them permanent employees of universities and having the projects they work on pay the university for their services.

Mario Ohlberger, co-spokesperson at MaRDI and co-organizer of the workshop, said there was great feedback for the event. The workshop created a new platform for exchange and generated many new impulses for MaRDI. Many participants had never been to such a workshop before; they were happy to find others who are passionate about the same topics and willing to exchange ideas.

### Digital Humanities meet Mathematics (DiHMa.Lab)

The first session of DiHMa.Lab took place in September with a workshop organized jointly by the Ada Lovelace Center for Digital Humanities and MaRDI’s interdisciplinary task area, TA4. Over the course of two days, about thirty people from archeology, philology, literary sciences, history, cultural studies, research-data management and of course mathematics came together in this hybrid event to identify and discuss various interconnections, exchange experiences and come up with ideas on how to improve cooperation and the understanding of each other's research. The main focus of the workshop was to engage with both NFDI consortia (NFDI4Memory, NFDI4Objects, Text+, NFDI4Culture, KonsortSWD, MaRDI) and institutes involved in social sciences and humanities research, and to familiarize everyone with the methods, problems, questions, and research data of the represented fields.

To that end, researchers presented examples of (mathematical) research data and their handling in various projects from digital humanities. For instance, Nataša Djurdjevac Conrad (ZIB) talked about a project where the spreading of wool-bearing sheep in ancient times was analyzed by using agent-based models. Christoph von Tycowicz (ZIB) presented instances of geometric morphometrics used to determine installation sites of ancient sundials or changing facial expressions during the aging process. Tom Hanika (Uni Kassel) and Robert Jäschke (IBI - HU Berlin) spoke about formal concept analysis and order theory and how they can be applied to yield interesting results when analyzing literary works or art.

What these projects have in common is that they avoid black box situations, where a method is applied without really knowing how it works, so that interpreting the results in a fitting manner becomes a matter of chance. In order to obtain reliable results, mathematicians need to understand the complex questions and data arising in digital humanities, and researchers from digital humanities need to be careful in applying mathematical methods and understand them first, so as to be able to choose "the right" method and to interpret the results correctly. Achieving that enables successful collaborations and contributes to entirely new mathematical questions. This in turn opens up rich sources for novel questions in digital humanities.

All in all, it was a very successful workshop, resulting in the idea of DiHMa.Lab establishing a "marketplace for methods" where digital humanities questions could be posted and liked by mathematicians, preferably with a proposed method as well. Moreover, the participants were very open, accommodating, and interested in the topics and concerns from the different fields, eager to learn new methods, to see what is possible if "we" join forces, and what new questions arise.

### New consortia and an initiative for basic services

On November 4, the Joint Science Conference (GWK) decided to fund seven additional consortia as well as an initiative for the realization of cross-consortia basic services Base4NFDI within the framework of the National Research Data Infrastructure (NFDI). As in the two previous years, the decision by the GWK follows the recommendations of the NFDI expert panel appointed by the German Research Foundation (DFG).

**More information:**

- in German

### International Love Data Week 2023

Love Data Week is an international celebration of data, hosted by the Inter-university Consortium for Political and Social Research (ICPSR), that takes place every year during the week of Valentine's day (in 2023: February 13 - 17). Universities, nonprofit organizations, government agencies, corporations, and individuals around the world are encouraged to host and participate in data-related events and activities held either online or in-person locally. The theme this year is Data: Agent of Change.

**More information:**

- in English

In October, the Netherlands hosted the "1st international conference on FAIR digital objects", with over 150 professionals signing the Leiden Declaration on FAIR Digital Objects. This is deemed to be "an opportunity for all of us working in research, technology, policy and beyond to support an unprecedented effort to further develop FAIR digital objects, open standards and protocols, and increased reliability and trustworthiness of data".

A group of MaRDI team members together with external experts have written a new article highlighting the status quo, the needs and challenges of research-data management plans for mathematics: a preprint is already available here.

The ICPSR published a guide to data preparation and archiving in 2020. Even though addressed to social scientists, the presented guidelines can be applied to any field.

The "Making MaRDI" Twitter series we announced in the previous Newsletter has been launched and integrated into the website. There are currently four profiles presenting the work that Karsten Tabelow, Tabea Bacher, Christian Himpe, and Ilka Agricola carry out in the consortium.

**2nd issue - Accessibility**

Welcome to the second issue of the MaRDI Newsletter. In each newsletter, we talk about various research-data themes that might be of interest to the mathematical community, in particular finding data that is relevant to advance your research, ensuring other people can access your files, solving the difficult problem of managing files between coauthors, and preserving your results such that your peers can build their research on those.

The FAIR principles for sustainable research-data management are important to us, so we present them individually in a series of articles. This issue of the Newsletter is dedicated to the A in FAIR: accessibility and what this means for mathematics.

licensed under CC BY-NC-SA 4.0.

In each newsletter, we also publish an episode of our interview series "Data Dates", tell you about an event that happened in the MaRDI universe, and offer some reading recommendations on FAIR topics.

In our last newsletter issue, we asked you to enter 3 methods you commonly use to search for/find mathematical research data. Here are the results to that survey:

Share your accessibility nightmare (or a success story)!

We will feature a selection of your stories in an upcoming newsletter (anonymously).

### FAIR access to research data

Access to research information is the most fundamental principle for spreading science across the scientific community and society. Publishing and making research results available is a cornerstone of research. This, however, is not exempt from issues. On the one hand, some research is either private, restricted within the industry, or protected by intellectual property. On the other hand, other barriers exist while accessing data in the form of technical incompatibilities, paywalls, bad metadata, or just incomplete data.

The Accessibility principle of FAIR data is the idea that all the relevant data connected to a research result should be properly available. This concerns which data is available, to whom it is accessible, how it is technically stored and retrieved, and how it is classified and managed. This principle is rooted in the scientific foundation of reproducibility and verifiability: other researchers should be able to repeat and independently verify the published results. While this is especially important in the experimental sciences, it also applies to the domain of mathematics.

The FAIR principles state that research data is Accessible when it respects the following recommendations:

- The data is accessible over the internet, possibly after authentication and authorization. The means of access (protocols) must be open, free, and universal, and those protocols must include authentication and authorization whenever necessary.
- The metadata must be available together with the data, and it must persist even after the data is no longer available.

It is important to note the phrase "possibly after authentication and authorization". It is a common misunderstanding that FAIR accessibility implies that data is free of cost or under open licenses. That is not the case. Free-of-cost publication and open licenses fall into the domain of the Open Access principles. While FAIR and Open Access have points in common, we will see examples where non-open access databases can be FAIR, and open access articles and research data that are not FAIR because metadata or appropriate protocols are missing.

Standards and protocols are a fundamental element in FAIR accessible data. Many tasks, especially those that are repeated in the same way, are performed much more efficiently by machines than by humans. That is why computers are very important when dealing with research data, too. In terms of accessibility, any storage location would ideally provide interfaces where machines can automatically access research data, also referred to as Application Programming Interfaces or APIs.
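The two recommendations above can be sketched as a toy, in-memory repository. This is only an illustration under stated assumptions: the class names, field names, identifiers, and the token mechanism are all hypothetical and do not describe any existing service. The sketch shows that metadata is always served, that data may sit behind authorization, and that metadata persists after the data has been withdrawn.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """One research-data object: metadata plus an optional payload."""
    identifier: str
    metadata: dict
    payload: Optional[bytes] = None  # None once the data has been withdrawn
    restricted: bool = False         # True if access requires authorization


@dataclass
class Repository:
    """Toy in-memory repository (hypothetical, for illustration only)."""
    records: dict = field(default_factory=dict)
    authorized_tokens: set = field(default_factory=set)

    def deposit(self, record: Record) -> None:
        self.records[record.identifier] = record

    def get_metadata(self, identifier: str) -> dict:
        # Metadata is always served, even for restricted or withdrawn data.
        record = self.records[identifier]
        return {**record.metadata, "data_available": record.payload is not None}

    def get_data(self, identifier: str, token: Optional[str] = None) -> bytes:
        record = self.records[identifier]
        if record.payload is None:
            raise LookupError(f"{identifier}: data withdrawn, metadata still available")
        if record.restricted and token not in self.authorized_tokens:
            raise PermissionError(f"{identifier}: authorization required")
        return record.payload

    def withdraw(self, identifier: str) -> None:
        # Remove the payload but keep the metadata as a "tombstone".
        self.records[identifier].payload = None


# Made-up identifiers and token, for demonstration only.
repo = Repository(authorized_tokens={"token-of-registered-user"})
repo.deposit(Record("example:open", {"title": "Open table"}, b"1,2,3"))
repo.deposit(Record("example:closed", {"title": "Sensitive table"}, b"4,5,6", restricted=True))

open_data = repo.get_data("example:open")
closed_data = repo.get_data("example:closed", token="token-of-registered-user")
repo.withdraw("example:open")
tombstone = repo.get_metadata("example:open")
```

Note that the restricted record is still FAIR-accessible in this sketch: the way to obtain it is well defined, and its metadata can be read by anyone.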

**The research data behind the articles**

Let us see three stories of fictional mathematicians that use some research data as a fundamental part of their research. They handle different types of data (databases, classifications, source code, articles...), which can also have different origins (produced by themselves or from a third party). They face different challenges to keep their research data FAIR.

Alice is a mathematician working in computational algebra. She makes intensive use of software, but in her published articles, she often uses sentences such as "using software XX, we can see that...". Her scripts in the form of source code, software packages, toolchains, and her computed results are research data that, if omitted from the published results, are not FAIR data, making her results difficult to validate or replicate. She is aware of that problem and wants to solve it, so she decided to set up a server in her math department with some files containing her source code, and she mentions on her personal website that those files exist; maybe she even puts the URL to the code in her articles. However, she has changed universities several times, thus changing her servers and websites, and many files and projects related to older articles are now lost. In order to be fully FAIR compliant, she needs to ensure that the data is bound to a metadata reference and to the research article, that it is accessible through standard internet protocols, and that there is a plan for a long-term archive that does not disappear when she changes her job position. Ideally, she would assign a DOI to the source code and host it in some long-term archive (e.g., Zenodo, GitHub, MathRepo, or others). Furthermore, she needs to make the code Interoperable and Reusable, which we will discuss in forthcoming issues. The MaRDI project aims to help mathematicians in this situation improve their FAIR data management.

Alice also participates in a collaborative project to classify all instances of her favorite algebraic objects. She and other colleagues have set up an online catalog listing all the known examples, the invariants they use to classify, and bibliographical information. At the moment, this catalog contains a few hundred items; Alice and her team will need to provide download options, filters, and means to retrieve information from the database beyond the graphical web interface. They will need to provide the results in formats that can be further processed with standard tools. That is, they will need an API and standardized formats to allow other researchers to use that database effectively in their own research projects.
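As a minimal sketch of what "filters and standardized formats" could look like for such a catalog, the following example filters a small list of entries and exports the result as JSON or CSV. The entries, field names, and function names are invented for illustration; a real catalog would define its own schema and serve these functions through a web API.

```python
import csv
import io
import json

# Hypothetical catalog entries; all field names and values are made up.
CATALOG = [
    {"id": "obj-001", "dimension": 2, "degree": 3, "reference": "placeholder citation 1"},
    {"id": "obj-002", "dimension": 3, "degree": 4, "reference": "placeholder citation 2"},
    {"id": "obj-003", "dimension": 2, "degree": 5, "reference": "placeholder citation 3"},
]


def query(catalog, **criteria):
    """Return all entries whose fields equal every given criterion."""
    return [
        entry for entry in catalog
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


def to_json(entries):
    """Serialize entries as JSON, a format most standard tools can read."""
    return json.dumps(entries, indent=2, sort_keys=True)


def to_csv(entries):
    """Serialize entries as CSV with a header row taken from the first entry."""
    if not entries:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(entries[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


two_dimensional = query(CATALOG, dimension=2)
print(to_csv(two_dimensional))
```

Offering the same query results in more than one plain, documented format lets other researchers load them directly into their own tools instead of copying values from a web page.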

Bob and Charlie are mathematicians modelling biological processes. Bob models tumor growth in human cancer, and Charlie neurological activity in animals. They handle three types of data: experimental specimen data in the form of databases that they receive from a partner or third party, model data in the form of source code that they develop, and result data in the form of articles they publish.

For Bob, primary data comes mainly from patients in hospitals. For obvious privacy reasons, Bob cannot directly access that primary data. Instead, he relies on organizations that offer anonymized databases publicly available for research (for example, the National Cancer Institute). Parts of these databases are totally anonymous and can be given open access. Other records contain detailed genetic information that, by its nature, could be used to identify the patient. Those databases have authenticated access, and researchers can only access them after being identified and committing to respect standard good practices in handling medical data. Thus, even if the access is restricted to identified and authenticated people, the data can be FAIR.

For Charlie, keeping his research data FAIR is tricky. He partners with some laboratories that have the appropriate resources to collect data from animals. Since obtaining this experimental data is expensive, the laboratory keeps some rights of use, and Charlie has to sign a "Data Use Agreement" contract. This allows him to use the data only for the declared purpose, and he is unable to redistribute it. In this case, the data would not be FAIR. However, the laboratory agrees to release the data for public use after two or three articles have been published from that source, as they consider that the data has already yielded enough results. From that moment, the data could be considered FAIR. Some websites collect already released databases (e.g., International Brain Lab) or collect data directly from laboratories for researchers' use (e.g., Human Connectome Project).

Bob and Charlie transform the databases they obtain, develop and apply models. They then write and publish articles. It is increasingly common that journals in the modelling field require the source code to be available. Bob and Charlie, like most researchers, use GitHub, but they have other options as we mentioned with Alice. Additionally, interdisciplinary fields with large communities often have collaborative and open-science platforms where many researchers collaborate in large distributed teams (e.g., COMOB, Allen Institute). In those projects, FAIR principles are a basic need. Concerning accessibility, all the data must be perfectly identified by its metadata. Accessibility has to be transparent to the researchers so the source code of their models can retrieve and process the data in a single step. All the platforms mentioned above have high standards of FAIR-ness and offer APIs based on open standards.

**Accessibility and Open Access**

It is important to distinguish between the "Accessibility" FAIR principle and the "Open Access" practice.

The open-access philosophy states that research data and especially research results (articles) should be available online, free of charge and free of other barriers. This is usually achieved using legal open licenses such as Creative Commons or similar ones.

The open access movement rose in the context of articles and scientific literature at the end of the 1990s and the beginning of the 2000s, at the dawn of the internet era. New technologies (online publishing, print-on-demand, easier distribution...) lowered the cost of publication dramatically, but at the same time, some publishing houses kept increasing their fees for access to scientific journals and started practices such as "bundling" to force libraries to buy subscriptions in bulk. In our academic system, researchers are pressured to publish in prestigious, high-impact journals since their academic evaluation highly depends on publication metrics. Most often, journals do not offer remuneration for authoring scientific articles. Furthermore, researchers often peer-review articles for free, with the incentive of gaining status in their research field. Under those circumstances, the role and the business model of the traditional publishing houses started to be questioned. For several years, discontent grew in the scientific community. Some researchers proposed a boycott (e.g., Tim Gowers against Elsevier), while others defended revolutionary tactics (e.g., Aaron Swartz's Guerrilla Open Access Manifesto) that brought shadow sites to the forefront. These sites offered free and unrestricted access to vast amounts of scientific literature (e.g., Sci-Hub, LibGen), but without authorization from the copyright holders and thus unlawfully in many jurisdictions. In parallel, pre-publication sites such as arXiv that make access to scientific articles free and open have gained much popularity. It is nowadays common to find on arXiv pre-release versions (after peer review and with the final layout) almost identical to the journal-published articles. Other authors avoid journals altogether and publish directly on arXiv (with the consequences this entails, such as loose or lacking review and a lack of certifiable merits).

More recently, the open access movement has brought new journals and editorial practices that guarantee access to research articles at no cost. For instance, the Public Library of Science (PLOS) is a non-profit publishing house that advocates for Open Access, releasing all its published articles with Creative Commons licenses. In turn, PLOS brought the practice of pay-to-publish, a scheme that moves the publication fees to the authors or their institutions. While this model is defended by many researchers and publishers, regrettably some deceptive journals exploit it by charging authors publication fees without performing any quality check or review of the submitted articles. The increasing tendency, however, is to have low-cost journals published only online that can have their small publication costs covered by universities and institutions.

The FAIR principles as described above do not, in essence, interfere with the open access practice, and they do not prescribe open licenses. FAIR is focused on all research data in general, not only articles, and it keeps its recommendations limited to technical aspects such as protocols and APIs and the presence of metadata.

However, the choice of a license for the data does impact the degree of FAIR-ness. While the Findability principle is quite independent of the chosen license, the Accessibility principle is heavily affected by it. Open licenses allow the data to be redistributed, which makes the access infrastructure more resilient, durable, and decentralized. They remove barriers and make the right to use the data more effective in practice. The choice of license has an even bigger effect on the Reusability principle, both in its legal requirements and in its technical and architectural ones.

FAIR data and open access are intertwined practices, and researchers need to consider both perspectives, especially in light of developing trends and policies. Recently, the U.S. government issued a memorandum (Ensuring Free, Immediate, and Equitable Access to Federally Funded Research) to all federal agencies establishing immediate access at no cost to all U.S.-funded research. This means that all research paid for with public money must be released in an open format, free of charge. This memorandum includes research data, such as research databases and other primary sources of information. Similar policies can be expected soon in the E.U. countries. Although not yet a binding policy, the European Commission already supports FAIR principles.

**MaRDI's proposal concerning Accessibility**

The efforts of MaRDI are, on the one hand, geared towards fulfilling the technical needs of a network of federated repositories: creating APIs and setting standard formats and protocols to access information through the MaRDI portal. On the other hand, MaRDI aims to spread the FAIR culture amongst researchers by providing training on the practices and tools that will improve their data management.

One of the main MaRDI outputs is our portal, which will help researchers to find and access mathematical research data. The portal itself does not create a new gigantic repository to collect all mathematical research data. Instead, it facilitates the creation of a network of federated domain-specific repositories, making the already existing projects more connected, interoperable, and accessible from a single entry point.

In order to enable standardized retrieval of mathematical research data and their metadata, i.e. to make mathematical research data accessible to machines, the MaRDI consortium has decided to set up an API during the five-year funding period (see p.37, 53 of the proposal). This API will be integrated into the MaRDI Portal, the envisioned one-stop contact point for mathematical research data for the scientific community, by FIZ Karlsruhe and Zuse Institute Berlin.

Take, as an example, the API of zbMATH Open, which has similarities to our portal. zbMATH Open is a reviewing service for articles in pure and applied mathematics, where you can find 4.4 million bibliographic entries with reviews or abstracts of scholarly literature in mathematics. It has developed an open API offering the bibliographic metadata of each contribution. You can use this in different ways: to provide references for Wikipedia or MathOverflow, for so-called data-driven decision making, or even for plagiarism detection (see, for instance, this article).
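To give a flavor of how a machine might harvest bibliographic metadata from an open API, here is a minimal sketch based on OAI-PMH, a widely used open protocol for metadata harvesting. It is an assumption made for illustration that the service speaks this protocol; the base URL is a placeholder, the sample response is hand-written, and nothing here should be read as a description of the actual zbMATH Open interface. The sketch builds a `ListRecords` request URL and parses Dublin Core titles and creators from a response, without making any network call.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://example.org/oai"  # placeholder, not a real service

# XML namespaces defined by the OAI-PMH and Dublin Core standards.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hand-written sample response with made-up content.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A made-up article title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, title, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS
            ),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text for c in record.findall(".//dc:creator", NS)],
        })
    return records


request_url = list_records_url(BASE_URL, from_date="2022-01-01")
harvested = parse_records(SAMPLE_RESPONSE)
```

In a real harvester, the response would be fetched from the request URL over HTTP and paged through; that part is omitted here to keep the sketch self-contained.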

### In Conversation with Johan Commelin

In the second episode of the interview series Data Date, Johan and Christiane talk about mathematical research data in the Lean project, the importance of GitHub, accessibility in this context, and connected knowledge graphs.

### Pizza and Data at StuKon22

Who would have thought that Pizza and Data go so well together? Very well, as we found out at the DMV Student Conference, which was held in early August at the MPI MiS in Leipzig.

The three days of StuKon saw presentations of Bachelor or Master theses from 13 of the participating students, as well as talks and workshops on possible career paths for mathematicians held by representatives of banks, academia, insurance, consulting, and cybersecurity firms.

The first evening was planned by MaRDI. StuKon participants were invited to enjoy their slices of delicious pizza while talking about their experiences with research data. Tabea Bacher gave a short presentation on MaRDI in a cozy relaxed atmosphere. She introduced the FAIR principles and the participants were challenged with the very broad concept of mathematical research data encompassing proofs, formulae, code, simulation data, collections of mathematical objects, graphs, visualizations, papers and any other digital object arising in research. Some of the common difficulties in (mathematical) research data were illustrated by an example from her own work.

Participants were then encouraged to talk to one another about their experiences and what they would want or need from a MaRDI service. Ideas, problems and questions were illustrated by designing postcards briefly presented after this very educational dinner. From this, three recurring concerns were identified.

The need for a formula finder ranked high on the list of concerns raised by the students; this was also mentioned in the last MaRDI Newsletter. The second problem that was brought up was research published in a language not mastered by the researcher who wants to build on it, so that it has to be translated first. One could argue that the translation can be done with available tools, or that there is no need to bother with translation at all. However, translated articles are usually not made publicly available and often remain on personal computers, so the next interested party has to repeat the process for themselves. Wouldn’t it be nice to have a service that collected translations of articles and excerpts and made them accessible, if only to determine whether a paper really holds the information you need? And last but not least, the students felt that theses that expand and explain a research paper or proof in detail should be linked to that paper or proof, respectively. These are often Bachelor or Master theses that are rarely published on the university servers, let alone somewhere else. They felt that if these were linked to a dense proof or paper, it would help readers understand the research better, or at least more easily, and give context to the problem.

While there were other issues raised, these were the main points discussed by the StuKon participants. As the organisers we feel that it is important to include the next generation in mathematics in the discussion on FAIRness of research data. It seems that everybody left with MaRDI stuck in their heads. Hopefully they will remember it as a place to consult and possibly contribute in future research careers.

image credit: Bernd Wannenmacher

### The Future of Digital Infrastructures for Mathematical Research

At the DMV Annual Meeting (2022-09-12 – 09-16), we hosted a MaRDI-Mini-Symposium: "The Future of Digital Infrastructures for Mathematical Research". As mathematics becomes increasingly digital and algorithms, proof assistants, and digital databases become more and more involved in mathematical research, questions arise on how to handle the mathematical research data that accumulates alongside a publication: storage, accessibility, reusability, and quality assurance. Speakers shared their experience with existing solutions and their visions and plans on how a well-developed integrated infrastructure can further facilitate mathematical research.

The slides of all talks can be accessed via the MaRDI-website.

### NFDI4Culture Music Award

This award, presented in two different categories, is given by the musicological community in NFDI4Culture and is intended to recognize music-related or musicological projects and undertakings. Applications may be submitted by September 30, 2022. The funds (up to 3000 EUR) associated with the award are earmarked for expenses that contribute to the goals of NFDI4Culture and must be used by the end of the year 2023.


### FAIR4Chem Award: The FAIRest dataset in chemistry!

This award is given for published chemistry research datasets that best meet the FAIR principles and thus make a significant contribution to increasing transparency in research and the reuse of scientific knowledge. NFDI4Chem will award the FAIRest dataset with prize money of 500 €, supported by the Fonds der Chemischen Industrie (FCI). Submission deadline is November 15, 2022.

**More information:**

On the first Monday of every month at 4 pm, the NFDI hosts a live InfraTalk on YouTube. Here, members of the individual consortia talk about important topics to a general audience, for instance Harald Sack on Knowledge Graphs (March 7, 2022).

https://www.youtube.com/playlist?list=PL08nwOdK76QlnmEB659qokiWN3AC-kqFS

- Danish librarians have set up "How to FAIR: a Danish website to guide researchers on making research data more FAIR" (https://doi.org/10.5281/zenodo.3712065). On accessibility, they say: "Conducting research is often a team effort. Even before collecting the data, it is important to consider who will get access to the data, under which conditions, and what permissions they will have." They also provide many use cases from across the sciences: https://www.howtofair.dk/how-to-fair/access-to-data/
- FDM Thüringen's Research Data Scarytales promises to "take you on an eerie journey and show you in short stories what scary consequences mistakes in data management can have". The multiplayer game comprises stories based on real events and is designed to help you avoid potential pitfalls and traps in your research data management plan.

**1st issue - Findability**

Welcome to the very first issue of the MaRDI (Mathematical Research Data Initiative) Newsletter. Research data in mathematics comes in many different flavors: papers, formulae, theorems, code, scripts, notebooks, software, models, simulated and experimental datasets, libraries of math objects with properties of interest... In short, the list is as long as mathematical research data is diverse.

Unfortunately, there is no straightforward or standard way to make these digital objects available for future generations of researchers. Availability, however, is not the only concern. In an ideal world, mathematical research data would be

**FAIR:** **F**indable, **A**ccessible, **I**nteroperable, and **R**eusable.

MaRDI is part of the German National Research Data Infrastructure (NFDI) and is dedicated to building infrastructures that make mathematical research data FAIR. Work on solutions for some of the major problems we face today started last year, from understanding the state-of-the-art technology of a field, through the whole research pipeline, to establishing standards for peer review. As part of this process, it is especially important for us to engage you, the mathematics community, early on, so have a look at the list of our upcoming workshops!

This issue of the Newsletter is dedicated to the F in FAIR: to findability and what this means for mathematics.

licensed under CC BY-NC-SA 4.0.

We explore two aspects of what Findable means. First, we will focus on how to find data created by other researchers and then we discuss how to make sure your own data is findable for the math community.

In each newsletter, we will also publish an episode of our interview series on math and data: "Data Dates", introduce you to the people behind the MaRDI project, and offer some reading recommendations on the topic.

### Have you ever…

- tried searching for a formula?
- seen a reference to a homepage that is long gone?
- put code on your personal webpage because you didn't know how and where else to publish it?
- browsed through the publications of your coauthor's coauthors looking for that one result that you almost remembered but not quite?
- not been able to find something you needed to keep going in the research direction you fancied?

**Then you are not alone!**

To find out where people search for math data, we ask you to answer our very short multiple-choice survey:

**Where do you look for mathematical research data?**

You will see the results here or right after submitting your answer.

### How to find research data?

On the near-infinite resource known as the World Wide Web, where do you find your research data? Where are the “hubs” that concentrate these resources? How does MaRDI propose to help with the Findability challenges?

**Data and FAIR principles**

Modern science, including mathematics, relies increasingly on research data. Research data is the factual material required to verify research findings and in mathematics, this can also be the knowledge written up in an article.

Types of research data include literature, such as books and articles, databases of experimental data, simulation-generated data, taxonomies (exhaustive listings of the examples of a given category of objects), workflows, and frameworks (for instance software stacks with all the programs used in a research project). Even a single formula could be considered research data. To set up good practices in the scientific community, Wilkinson et al. published the FAIR Guiding Principles for scientific data management and stewardship. These principles are Findability, Accessibility, Interoperability, and Reusability.

In this article, we will introduce the Findability principle, with a focus on mathematical sciences, in connection with the infrastructure that is being developed by MaRDI.

For more information about what research data is and how to manage it (especially for researchers in German-speaking countries), you can visit forschungsdaten.info (in German). For a comprehensive introduction to the FAIR principles, you can visit the GO FAIR portal.

**Findability**

Findability is the first of the FAIR principles. It is also the most basic one: if you can't find some data, you can't re-use it in any way, and it is as if it did not exist.

When we try to find (research) data, we may face two situations: either we know that something exists and we are looking for it specifically, or we don't know exactly what we want and we look for anything related to a search term. In the first case, rather than finding that data, our problem is *locating* it somewhere in the physical or virtual space. In the second, our problem is to *examine* all the data available (in a certain catalog) for a certain characteristic that we are interested in.

Both problems can be solved by using a few tools. Firstly, each piece of data needs to have a unique reference or identification, so that we can build lookup tables for the location of each dataset. Secondly, together with the ID, we need other *metadata* that describes the data with some useful information (type, subject, authors, etc). Thirdly, we need to build comprehensive catalogs that gather all the metadata of the datasets and build search engines, which are algorithms to retrieve things from the catalogs.

Thus, the Findability principle can be concretized to the following recommendations:

- (Meta)data is assigned a globally unique and persistent identifier.
- Data is described with rich metadata.
- Metadata clearly and explicitly includes the identifier of the data it describes.
- (Meta)data is registered or indexed in a searchable resource.
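The interplay of identifiers, metadata, and a searchable catalog can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions: the class names, the fields, and the `example:` identifiers are hypothetical and do not correspond to any real catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Metadata describing one dataset; the data itself lives elsewhere."""
    identifier: str      # globally unique, persistent ID (hypothetical scheme)
    title: str
    authors: tuple
    keywords: tuple = ()


class Catalog:
    """A searchable resource in which metadata records are registered."""

    def __init__(self):
        self._by_id = {}

    def register(self, record):
        # Identifiers must be unique, otherwise lookups become ambiguous.
        if record.identifier in self._by_id:
            raise ValueError(f"duplicate identifier: {record.identifier}")
        self._by_id[record.identifier] = record

    def locate(self, identifier):
        """Case 1: we know the item exists and only need to locate it."""
        return self._by_id.get(identifier)

    def search(self, term):
        """Case 2: examine all metadata for a characteristic of interest."""
        term = term.lower()
        return [r for r in self._by_id.values()
                if term in r.title.lower()
                or any(term in k.lower() for k in r.keywords)]


catalog = Catalog()
catalog.register(Record("example:0001", "Icosahedral Earth grid",
                        ("A. Author",), ("geodesy", "3D printing")))
catalog.register(Record("example:0002", "Table of Bessel function zeros",
                        ("B. Author",), ("special functions",)))

print(catalog.locate("example:0002").title)
print([r.identifier for r in catalog.search("grid")])
```

The two methods mirror the two situations described above: `locate` is a plain lookup by identifier, while `search` has to scan the metadata, which is why rich metadata matters for the second case.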

The classical approach for searching and finding data has been dominated by the publication paradigm: you look for a specific publication, or for any publication related to a certain topic, that will contain the information you are interested in. In reality, however, you often want to find a theorem, a formula, or some other concrete piece of information rather than a publication, for instance a specific expression of a Bessel function, a particular representation of a given group, or the proof that certain differential equations have unique solutions. Supporting this kind of search requires re-thinking how we structure and manage research data. We next discuss the available places to find research data and then the MaRDI proposal for such a comprehensive approach.

**Where to look for research data**

For mathematical articles, books, and other classically published works, a reference includes title, author, year, etc. While this is easily usable and readable by a human, it is not always consistent in format and it does not provide a means to locate and access that information. The two de-facto standard catalogs that collect mathematical literature and also assign a unique identifier are:

- The Zentralblatt Mathematik (unique identifier: Zbl number), archived in zbMATH by the FIZ Karlsruhe - Leibniz Institute and
- The Mathematical Reviews (unique identifier: MR number), archived in MathSciNet by the American Mathematical Society.

While these unique identifiers are helpful in referencing a piece of mathematical literature and these platforms are useful in finding works in a specific math domain, their catalogs are much less comprehensive when it comes to other research data (databases, media, online resources, etc.). They also have the drawback that authors cannot control the existence or the metadata of an entry, and MathSciNet is a subscription-based service*.

Another notable resource is arXiv, the de-facto standard platform for preprints. Here the actual paper is publicly available, which makes it Accessible. Furthermore, any work on arXiv gets a unique ID and can be found via the catalog search. The focus here is also on literature, although there is limited support for datasets related to a paper. When it comes to non-literature research data, the landscape is much sparser. swMATH, a sister project to zbMATH, is a catalog of mathematical software packages (computer algebra, numerics, etc.) that also records the zbMATH articles citing each package. zbMATH also features a full-text search of formulas, which is being improved within the MaRDI framework.

There are also general-purpose identifiers and catalogs for data. One of the most standardized identifiers for online resources is the Digital Object Identifier (DOI), which can reference any digital object. Unlike a URL, the DOI is linked to a particular object and not to the server or website where it is hosted. The DOI website resolves the DOI to the most up-to-date URL for accessing the data, so the DOI also serves as a locator in addition to being a unique identifier. Usually, publishers assign a DOI to new publications, but authors can also obtain a DOI from other registration agencies. Some open repositories offer free DOI registration. For instance, Zenodo is a general-purpose repository for open data, which hosts quite a few mathematical research datasets. See our article "Publishing on open repositories", where we talk more about Zenodo.
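To make the identifier-versus-locator distinction concrete, here is a small Python sketch that turns the different ways a DOI is commonly written into a single resolver URL. The regular expression is a deliberately loose syntactic check that we assume for illustration, not an official definition of valid DOIs, and the function names are our own.

```python
import re

# Loose check: "10.", a numeric registrant code, "/", and a non-empty suffix.
# This is an illustrative assumption; real DOI suffixes can be more general.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways in which a DOI is prefixed in references.
PREFIXES = ("https://doi.org/", "http://doi.org/",
            "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(raw):
    """Strip resolver or 'doi:' prefixes and lower-case the DOI."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognizable DOI: {raw!r}")
    return doi.lower()  # DOIs are compared case-insensitively


def resolver_url(raw):
    """URL at which the DOI resolver redirects to the current location."""
    return "https://doi.org/" + normalize_doi(raw)


print(resolver_url("doi:10.5281/zenodo.3712065"))
print(resolver_url("https://doi.org/10.5281/ZENODO.3712065"))
```

Both calls print the same URL: the identifier stays fixed, while the page it redirects to can change whenever the hosting location changes.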

Currently, for pure research databases (experimental data, simulation data, etc.), there is no universally accepted repository in mathematics. There are a few curated collections of mathematical objects, such as the On-Line Encyclopedia of Integer Sequences (OEIS), the SuiteSparse Matrix Collection, and the NIST Digital Library of Mathematical Functions. In practice, many researchers rely on open repositories for access to data. Unfortunately, in contrast to biological repositories, where researchers can find standardized catalogs of proteins or genetic encodings, mathematical catalogs are neither general-purpose nor very interoperable.

**MaRDI's proposal concerning Findability**

Unfortunately, most data-based mathematical research is still published either without the datasets or with the datasets hosted on university servers, accessible only through the personal websites of the researchers involved.

MaRDI aims to, on the one hand, provide the necessary ground infrastructure to properly publish research data in federated repositories (using standards and practices according to the FAIR principles), and on the other, it plans to spread awareness within the math research community on the problems and proposed solutions that publishing research data entails.

Here we will name a few of the initiatives related to the Findability principle.

The Scientific Computing Task Area (TA2) is preparing a benchmark framework to compare existing and new algorithms and methods for solving specific problems. For instance, there are several dozen methods for solving a linear system Ax=b, with different performance and different technology stacks, depending on the size of the matrix A, whether it is sparse or dense, whether we look for exact or approximate solutions, etc. So far, there is no centralized catalog where a "user" (for instance a computational biologist) can go to choose the best method for their particular application. This catalog and benchmark will make finding symbolic and numerical algorithms much easier, and it aspires to be a major reference when looking for such algorithms.
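As a small illustration of why such a catalog is useful, the following Python sketch solves the same system Ax=b with two textbook methods: a direct one (Gaussian elimination with partial pivoting) and an iterative one (Jacobi iteration). The example matrix, tolerance, and function names are our own choices and are not taken from the TA2 benchmark.

```python
def solve_gauss(A, b):
    """Direct method: Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented copy of A | b
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def solve_jacobi(A, b, tol=1e-10, max_iter=10_000):
    """Iterative method: Jacobi iteration, returning an approximate solution.

    Convergence is guaranteed, for example, if A is strictly diagonally dominant.
    """
    n = len(A)
    x = [0.0] * n
    for _ in range(max_iter):
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")


# A small, strictly diagonally dominant system whose exact solution is (1, 2, 3).
A = [[4, 1, 0],
     [1, 4, 1],
     [0, 1, 4]]
b = [6, 12, 14]

print(solve_gauss(A, b))
print(solve_jacobi(A, b))
```

The direct method works for any nonsingular matrix but costs on the order of n³ operations, while the Jacobi iteration only touches the nonzero entries in each sweep but needs assumptions on A to converge. Trade-offs of this kind are what a benchmark catalog would help users navigate.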

The tool for this is a knowledge graph of numerical algorithms. A knowledge graph is an abstract representation of a set of concepts, objects, events, or anything else related to a domain of study as nodes, together with formal relations between them (edges), which can be read unambiguously by humans and computers alike. The biggest collective effort to build a knowledge graph is Wikidata. In this mathematical knowledge graph, the nodes will be the algorithms themselves as concepts, but also papers related to them, software packages implementing them, benchmarks, and connections to other databases. It will then be possible to navigate the knowledge graph to find semantic information, such as which algorithms extend a given one, where implementations can be found, and how they perform in comparison.
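A knowledge graph of this kind can be pictured as a list of (subject, relation, object) triples. The Python sketch below is a toy model only: the package and paper names and the relation labels are hypothetical placeholders, not entries of the actual MaRDI knowledge graph.

```python
# Each fact is a (subject, relation, object) triple.
# Package and paper names and relation labels are hypothetical placeholders.
TRIPLES = [
    ("conjugate_gradient", "instance_of", "iterative_solver"),
    ("preconditioned_cg", "extends", "conjugate_gradient"),
    ("package_A", "implements", "conjugate_gradient"),
    ("package_B", "implements", "preconditioned_cg"),
    ("paper_X", "describes", "conjugate_gradient"),
]


def subjects(relation, obj):
    """All nodes s such that the edge (s, relation, obj) is in the graph."""
    return sorted(s for s, r, o in TRIPLES if r == relation and o == obj)


def implementations_including_extensions(algorithm):
    """Software implementing the algorithm or anything that extends it."""
    seen, todo = set(), [algorithm]
    while todo:  # follow "extends" edges transitively
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        todo.extend(subjects("extends", current))
    return sorted(s for alg in seen for s in subjects("implements", alg))


print(subjects("extends", "conjugate_gradient"))
print(implementations_including_extensions("conjugate_gradient"))
```

Questions such as "which algorithms extend a given one" become simple queries over the edges, and combining edges of different kinds answers questions that no single record contains.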

Another MaRDI effort aimed at Findability is Mathematical Entity Linking (MathEL), a way to extract and compare conceptual information from mathematical formulas. A particular equation (for instance the Klein-Gordon equation or the Einstein field equations of general relativity) can be expressed in many different forms: variables can be named differently, notations for derivatives or tensors may differ, and groupings and substitutions can occur. The MathEL sub-project aims to retrieve the conceptual information of formulas, to propose annotation standards for introducing semantic information into formulas (for instance by referencing a Wikidata node or another knowledge graph node), to mine large corpora of research data (for instance the zbMATH catalog or the arXiv repository), and to create user interfaces for retrieving concept and source information, such as question-answering engines.

To illustrate this, here is a sneak peek into the MaRDI portal, currently under development, which will integrate the MathWebSearch search engine as a MediaWiki component. The formula search finds wiki pages on the MaRDI portal based on formula expressions written in LaTeX. This test wiki page contains a couple of math formulas, and the search portal should find them when they are entered in the search box. With the TeX and BaseX configuration, you can try an input like " V=4/3 \pi r^3 " or " V=\frac{4}{3} \pi r^3 " and it will find the wiki page with the test formula. With " V = 4/3 \pi ?s^3 ", you can also find variable substitutions. Other common rewritings, such as " V = \frac{4\pi}{3} r^3 ", are not yet recognized, but the core search engine is under active development. The same engine is used in the zbMATH formula search. Plans for MaRDI include making entities in a Wikibase knowledge graph findable through formula search.
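The behaviour described above can be imitated with a toy matcher in Python. To be clear, this is not how MathWebSearch works internally. It is a minimal sketch, under our own assumptions, of two ideas: normalizing simple fractions, and treating `?name` in a query as a placeholder that matches any single token.

```python
import re

# Tokens: LaTeX commands, ?placeholders, numbers, single letters, other symbols.
TOKEN = re.compile(r"\\[A-Za-z]+|\?[A-Za-z]+|\d+|[A-Za-z]|\S")

# Matches a fraction command whose two arguments contain no nested braces.
SIMPLE_FRACTION = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")


def tokens(latex):
    """Rewrite simple fractions as a/b, then split the formula into tokens."""
    rewritten = SIMPLE_FRACTION.sub(r"\1/\2", latex)
    return TOKEN.findall(rewritten)


def matches(query, formula):
    """True if the query equals the formula token by token.

    A placeholder such as ?s matches any single token, but it must match
    the same token everywhere it occurs in the query.
    """
    q, f = tokens(query), tokens(formula)
    if len(q) != len(f):
        return False
    binding = {}
    for qt, ft in zip(q, f):
        if qt.startswith("?"):
            if binding.setdefault(qt, ft) != ft:
                return False
        elif qt != ft:
            return False
    return True


STORED = r"V=\frac{4}{3} \pi r^3"  # formula on the hypothetical wiki page

print(matches(r"V=4/3 \pi r^3", STORED))
print(matches(r"V = 4/3 \pi ?s^3", STORED))
print(matches(r"V = \frac{4\pi}{3} r^3", STORED))
```

The first two queries match, while the third does not, because a purely token-based comparison cannot see that the two expressions are algebraically equal. Recognizing such rewritings requires working with the structure of the formula rather than its spelling.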

In subsequent articles, we will present other tasks being carried out within MaRDI** that exemplify the other FAIR principles (for instance open interfaces, or descriptions of workflows).

\* *MR Lookup offers limited services to non-subscribers. As of 2021, zbMATH became zbMATH Open and requires no subscription.*

\*\* *The funded MaRDI proposal can be accessed here.*

Taking some data from a project, we try to prepare it according to the FAIR principles. Follow us in our attempt to make it FAIR on the first try.

**Publishing research data in open repositories**

We are IMAGINARY, a math communication association and part of the MaRDI consortium. Our main activity is developing and organizing math exhibitions. Using data about Earth grids that we collected for one of our recent projects on climate change, we will take you through how we, almost painlessly, set up data in a public repository.

Our latest exhibition is the "10-minute museum on the climate crisis mathematics", where we describe mathematical modeling and places where maths is used in climate science. We all know that the latitude and longitude grid is the most common way of creating a reference system on the Earth. Did you know there are other ways to divide the Earth into small regions that can be particularly useful in numerical models?

Quite excited by this, we contacted a couple of climate researchers who were able to prepare for us the sets of geographic nodes and edges that make those grids. Then another one of our collaborators took that data and converted it into a 3D-printable model by adding thickness to the edges and checking the structural integrity of the ensemble so that it could be a physical object. Finally, a 3D printing company made the objects that we used in our exhibition.

As this dataset was not used in a way that contributed to existing knowledge, it was not suitable for publication in a journal. However, it occurred to us that the data we had gathered and processed was niche and specific enough to be a basis for others to re-use and build on.

Being a company committed to Free and Open Source licenses, we wanted to not only make the data available but FAIR as well.

**Git (GitHub, GitLab)**

Since we were dealing with software files, the most convenient platform for developing and publishing was GitHub. Git is an efficient version control system, and any organization of code should start there. GitHub and GitLab are probably the most popular platforms for hosting projects. As a publishing tool, however, such a platform is closer to a personal website (in fact, you can host and serve a Git repository on your own server), and it is a live, working tool. This means that the published data can change at any time. GitHub does not offer, by default, a guarantee of stability (although there are archive options), a standardized identifier, or a good way to search for and find your data. It also keeps a record of all previous versions, so all the untidy intermediate work is in public view.

Our GitHub page was our collaboration tool within the team. It was not intended as a publication method; it just happened that we left it to be publicly available. Having data available somewhere does not automatically make it FAIR. We wanted to have an identifier associated with it and we knew that some repositories offered that.

**Zenodo**

Zenodo is one such open-access general-purpose repository. It is hosted on CERN infrastructure and funded in part by the European Commission. Researchers in any scientific area use it to make a copy of their work findable and accessible to the public. These works can be articles or books, either as preprints or, in some cases, already published by traditional publishing houses, but also databases, data files, images, or any other digital asset that their research relies upon.

Zenodo offers a Digital Object Identifier (DOI) if the work does not already have one. In this case, the DOI contains a "zenodo" string in it. For instance, 10.5281/zenodo.6538815.

This was a perfect fit for our data and as a bonus, creating our entry on Zenodo was not difficult!

Firstly, we created an account. A valid email address is all you need. You can also link it to your ORCID iD to identify the author(s) uniquely.

Secondly, we made a new upload draft. You can choose the type of document (publication, poster, dataset, image, video, software, physical object, etc.) and fill in the form with the title, authors, publication date (can be in the past), description, and several other fields.

For the authors, we added the ORCID of those who had it. We also used "IMAGINARY" as an author, even though it was not a physical person but a company.
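As an aside on why such identifiers are robust against typos: the last character of an ORCID iD is a check character computed from the first 15 digits with the ISO 7064 MOD 11-2 algorithm. The following Python sketch validates an iD offline; the function names are our own, and it only checks the format, not whether the iD is actually registered.

```python
def orcid_check_character(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check length, digits, and check character of an iD like 0000-0002-1825-0097."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))  # last character altered
```

A single mistyped digit changes the expected check character, so a form can reject most typos before the metadata is ever published.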

We requested a new DOI since we did not have any. The DOI can be "reserved" during the draft process, so you know it in advance and can use it in the documents you prepare.

For the actual content, we used a zip file with the master branch of the GitHub repository. You can also link your Zenodo account to your GitHub account so that whenever you make a "release" in GitHub, a snapshot is automatically published in Zenodo.

Finally, we submitted the draft. Take note: once published, you can't add, delete or modify the files associated with a DOI, which is the main point of the DOI. You would have to make new versions with a new DOI. Thus, we recommend that you double- and triple-check before clicking submit. In case you make an erroneous submission, you can write an email to the Zenodo administrators for help.

**Wikipedia / Wikidata**

We now have an identifier that would make our data easy to find if you have it, or if you happen to search in Zenodo's search box. But now, we wanted to increase our Findability. We needed to include our data in places where people often look for information and Wikipedia / Wikidata are the perfect places for that.

Wikipedia is the universally known collaborative encyclopedia. With more than 6 million articles in English, it should be easy to find an article relating to your data. However, before advertising your data on Wikipedia by editing general-interest articles, you must be familiar with the core principles of Wikipedia content: Neutral point of view, Verifiability, and No original research. That is to say, only link to research and data published elsewhere and do not hijack articles for self-promotion.

In our case, we found an article on Discrete global grid. Since our work provides an example of such grids, it could be of general interest. Additionally, as there are no other examples of 3D-printable grids that we are aware of, we decided to add a link in the "External references" section.

We then had a look at Wikidata. Wikidata is the data backbone for Wikipedia. In contrast with Wikipedia, which is made of articles, Wikidata is made of entries; every entry can be an object, an abstract concept, a person, a feeling, a math research article..., essentially anything. Every entry lists some properties of the item in a structured form. It is human-readable but also designed to be machine-readable, meaning that one day some AI or search engine could obtain knowledge from this enormous database, which aspires to hold all human knowledge in structured form. As such, it is a suitable place to catalog research data. Many researchers index their articles (listing title, authors, DOI...), databases, models, etc. there. But many don't, so it is not yet a comprehensive research (or general) catalog. It is also less intuitive as a search tool than Wikipedia (there is no full text to read), and it can be challenging to retrieve useful information by hand.

In our case, searching for "Earth grid" produced nothing, while "Earth system grid" brought us to a US Department of Energy portal, and we learned that "Grid in Earth sciences" is the title of a specific published article. We finally found the Wikidata entry on "Discrete Global Grid" (linked in the Wikipedia article), which is about the concept but contains little information. We could have created a Wikidata entry and listed our data as an instance (example) of a Discrete Global Grid, but we found that our 3D data would have more context in the Wikipedia article. Therefore, we decided not to put our reference in Wikidata.

After asking some colleagues, we found that a more typical use case would be the following: A published research article uses a dataset. Then a Wikipedia page references the published article as a source. By creating a reference in Wikipedia, an entry in Wikidata is created. Then a (different) entry in Wikidata representing the dataset is linked to the entry representing the published article. This way, there is a path from Wikipedia to the research data referenced in Wikidata. Hopefully, eventually, the dataset is used in other publications (referenced in other Wikipedia pages) and Wikidata can keep track of all the works derived from that dataset.

**Assessing the FAIRness**

At this point, we were wondering: how can we tell if our data is really FAIR? How well did we do? Fortunately, there is a tool to assess that, too!

The Automated FAIR Data Assessment Tool from the FAIRsFAIR initiative accepts any working reference, a DOI for instance, and tries to determine its FAIRness from its metadata. It generates a summarised report with individual scores and a final global mark. Luckily for us, Zenodo handles that metadata quite well and makes it available via the HTML code on the Zenodo page itself.
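To give a flavour of what a metadata-based assessment looks like, here is a deliberately simplified Python sketch. It is not the FAIRsFAIR tool and does not reproduce its metrics or its scale: the checks, the field names, and the example record (including its identifier) are all hypothetical.

```python
# Each check inspects the metadata record and returns True or False.
# These five checks are illustrative assumptions, not the real tool's metrics.
CHECKS = {
    "has persistent identifier": lambda m: str(m.get("identifier", "")).startswith("10."),
    "has title": lambda m: bool(m.get("title")),
    "has creators": lambda m: bool(m.get("creators")),
    "has description": lambda m: bool(m.get("description")),
    "has license": lambda m: bool(m.get("license")),
}


def assess(metadata):
    """Return the individual check results and the fraction of checks passed."""
    results = {name: check(metadata) for name, check in CHECKS.items()}
    return results, sum(results.values()) / len(results)


# Hypothetical metadata record with a made-up identifier and no license.
record = {
    "identifier": "10.1234/example.5678",
    "title": "Example dataset of Earth grids",
    "creators": ["A. Author"],
    "description": "Nodes and edges of several discrete global grids.",
}

results, score = assess(record)
for name, passed in results.items():
    print(f"{name}: {'yes' if passed else 'no'}")
print(f"overall: {score:.0%}")
```

Even a crude checklist like this shows where a record can be improved, here by adding a license, which is also how we found the real report most useful.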

So how did we do? On a scale from 0 to 3, our grand score is: "moderate" or 2.

To improve that score, we could have edited the metadata and added more details; however, that is still a feature under development in Zenodo (e.g., supporting the citation file format), and it may be a bit cumbersome to edit that metadata on other platforms.

**Conclusion**

Overall, we were satisfied with this experiment of making our data FAIR. The GitHub workflow is a bit difficult to learn, but it is nowadays part of software development. An added benefit is that it can integrate into FAIR workflows. Zenodo was a success: it is easy to use, takes care of most of the metadata, and provides free DOIs. Wikipedia is not difficult, but you need to keep your desire for visibility from undermining the general interest of an encyclopedia. As for Wikidata, we concluded that it does not fit our use case (although it might fit other research data). Finally, the FAIR data assessment tool is great not only for evaluating your data but also for learning good practices and improving your FAIRness. There are probably still many tools and hints for us to discover, but so far the trip has not been a hard one.

We hope that reading about our experience encourages you to re-evaluate and improve the FAIRness of your data.

**In Conversation with Cédric Villani**

In the first episode of the interview series Data Date, Cédric Villani joins Christiane Görgen for a brief exchange of thoughts about Math & Data.

**OpenML hackathon at Dagstuhl castle**

Sebastian Fischer and Oleksandr Zadorozhnyi, of the MaRDI task area Statistics and Machine Learning, participated in an OpenML hackathon held in late March at the headquarters of the Leibniz Center for Informatics at Dagstuhl, Germany.

OpenML is an open-source platform for sharing datasets, algorithms, experiments, and results. The hackathon was initiated by Bernd Bischl, one of the key players behind OpenML and a Co-Spokesperson in MaRDI. Researchers from other parts of Germany, France, the Netherlands, Poland, and Slovenia were present to discuss topics such as data quality on OpenML, an extension of its established services to new data formats, and new computational tasks.

The review article "Datasheets for datasets" prompted fruitful exchanges on improving the quality of data and metadata. In particular, support for non-tabular data formats such as images was discussed and will now be implemented by transitioning from the Attribute-Relation File Format (ARFF) to Parquet. The eight task types available so far, including regression, classification, and clustering, will be extended with new tasks that are typical for graphical modeling. As this is one of the main use cases and an important topic for both Sebastian and Oleksandr, they discussed several points with Jan van Rijn: estimating the structure of a graphical model from a given dataset, embedding this problem into the set of tasks currently available on OpenML, adding evaluation measures or criteria for model selection, and storing graph-specified datasets within the OpenML framework. The evaluation measures and model-selection criteria allow estimated graphs to be compared to a given ground truth, a procedure that is not normally part of the ML workflow.
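Comparing an estimated graph to a ground truth can be made concrete with a few lines of Python. The report does not say which measures were discussed at the hackathon, so the ones below (edge precision, edge recall, and a Hamming distance on undirected edges) are merely common choices that we assume for illustration, and the example graphs are made up.

```python
def normalize_edges(edges):
    """Treat edges as undirected: (a, b) and (b, a) denote the same edge."""
    return {frozenset(edge) for edge in edges}


def compare_graphs(estimated, truth):
    """Compare the edge set of an estimated graph with a ground-truth graph."""
    est, tru = normalize_edges(estimated), normalize_edges(truth)
    true_positives = len(est & tru)
    return {
        # Fraction of estimated edges that are really present.
        "precision": true_positives / len(est) if est else 1.0,
        # Fraction of true edges that were recovered.
        "recall": true_positives / len(tru) if tru else 1.0,
        # Number of edges present in exactly one of the two graphs.
        "hamming_distance": len(est ^ tru),
    }


truth = [("A", "B"), ("B", "C"), ("C", "D")]
estimated = [("B", "A"), ("B", "C"), ("A", "D")]  # misses C-D, adds A-D

print(compare_graphs(estimated, truth))
```

Unlike accuracy on held-out data, these measures need the true graph, which is why storing graph-specified datasets together with their ground truth matters for benchmarking structure estimation.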

Sebastian also presented their collaborative work with Michael Lang on the mlr3oml R package. This package connects the OpenML platform to the open-source machine learning mlr3 package in R, another crucial aspect of the MaRDI task area.

The hackathon was rounded out with social activities like a walk through the forest. The good weather aside, special thanks go to Joaquin Vanschoren, the OpenML founder, whose supply of water to the whole group during the hike was the other reason why everyone made it back to the castle in good spirits!

All in all, the week in Wadern was a pleasant and fruitful one for all the participants.

We will also be introducing you to the people who shape MaRDI with their expertise and vision for mathematical research data. They will appear in a series of "Making MaRDI" interviews available via our Twitter account. Stay tuned!

**Call for seed funds 2023**

These funds support scientists from all fields of engineering research in developing and implementing innovative ideas in data management. The grant is equivalent to the funding of a full-time doctoral position for one year. If necessary, the funding can be split between project partners.

**More information:**

- To learn about the Nationale Forschungsdateninfrastruktur, the community of which MaRDI is just one small part, read the 2021 article by Nathalie Hartl, Elena Wössner, and York Sure-Vetter in Informatik Spektrum. See doi.org/10.1007/s00287-021-01392-6

- Christiane Görgen and Claudia Fevola explain in a short review article the role repositories can play in the MaRDI infrastructure. They use MathRepo as an example, a small math research-data repository hosted at the Max Planck Institute for Mathematics in the Sciences in Leipzig. See arxiv.org/abs/2202.04022

- The interim report of the European Commission Expert Group on FAIR data discusses how to turn FAIR into reality. See doi.org/10.2777/1524

- Thomas Koprucki and Karsten Tabelow have been two of the driving forces in the early stages of MaRDI. Together with Ilka Kleinod they discussed mathematical models as an important type of mathematical research data in a 2016 article for the Proceedings in Applied Mathematics and Mechanics: doi.org/10.1002/pamm.201610458

Our Newsletter "Math & Data Quarterly" is prepared by our partner IMAGINARY.