Math & Data Quarterly
News and insights into the realm of mathematical research data

Welcome to the very first issue of the MaRDI (Mathematical Research Data Initiative) Newsletter. Research data in mathematics comes in many different flavors: papers, formulae, theorems, code, scripts, notebooks, software, models, simulated and experimental datasets, libraries of math objects with properties of interest... In short, the list is as long as mathematical research data is diverse.
Unfortunately, there is no straightforward or standard way to make these digital objects available for future generations of researchers. Availability, however, is not the only concern. In an ideal world, mathematical research data would be
FAIR: Findable, Accessible, Interoperable, and Reusable.
MaRDI is part of the German National Research Data Infrastructure (NFDI) and is dedicated to building infrastructure that makes mathematical research data FAIR. Work on solutions for some of the major problems we face today started last year, ranging from understanding the state-of-the-art technology of a field all the way along the research pipeline to establishing standards for peer review. As part of this process, it is especially important for us to engage you, the mathematics community, early on, so have a look at the list of our upcoming workshops!
This issue of the Newsletter is dedicated to the F in FAIR: to findability and what this means for mathematics.
We explore two aspects of what Findable means. First, we focus on how to find data created by other researchers; then we discuss how to make sure your own data is findable by the math community.
In each newsletter, we will also publish an episode of our interview series on math and data: "Data Dates", introduce you to the people behind the MaRDI project, and offer some reading recommendations on the topic.


Have you ever…
- tried searching for a formula?
- seen a reference to a homepage that is long gone?
- put code on your personal webpage because you didn't know how and where else to publish it?
- browsed through the publications of your coauthor's coauthors looking for that one result that you almost remembered but not quite?
- not been able to find something you needed to keep going in the research direction you fancied?
Then you are not alone!
To find out where people search for math data, we ask you to answer our very short multiple-choice survey:
Where do you look for mathematical research data?
You can see the results here, or right after submitting your answer.

How to find research data?
In the near-infinite resource that is the World Wide Web, where do you find your research data? Where are the “hubs” that concentrate resources? And how does MaRDI propose to help with the Findability challenge?
Data and FAIR principles
Modern science, including mathematics, relies increasingly on research data. Research data is the factual material required to verify research findings; in mathematics, this can also be the knowledge written up in an article.
Types of research data include literature, such as books and articles, databases of experimental data, simulation-generated data, taxonomies (exhaustive listings of the examples of a given category of objects), workflows, and frameworks (for instance, software stacks with all the programs used in a research project). Even a single formula could be considered research data. To set up good practices in the scientific community, Wilkinson et al. published the FAIR Guiding Principles for scientific data management and stewardship. These principles are Findability, Accessibility, Interoperability, and Reusability.
In this article, we will introduce the Findability principle, with a focus on mathematical sciences, in connection with the infrastructure that is being developed by MaRDI.
For more information about what research data is and how to manage it (especially for researchers in German-speaking countries), you can visit forschungsdaten.info (in German). For a comprehensive introduction to the FAIR principles, you can visit the GO FAIR portal.
Findability
Findability is the first of the FAIR principles; it is also the most basic one, because if you can't find some data, you can't re-use it in any way: it is as if it did not exist.
When we try to find (research) data, we may face two situations: either we know that something exists and we are looking for it specifically, or we don't know exactly what we want and we look for anything related to a search term. In the first case, rather than finding that data, our problem is locating it somewhere in the physical or virtual space. In the second, our problem is to examine all the data available (in a certain catalog) for a certain characteristic that we are interested in.
Both problems can be solved with a few tools. Firstly, each piece of data needs a unique reference or identifier, so that we can build lookup tables for the location of each dataset. Secondly, together with the ID, we need other metadata that describes the data with some useful information (type, subject, authors, etc.). Thirdly, we need to build comprehensive catalogs that gather all the metadata of the datasets, and search engines, which are algorithms to retrieve things from the catalogs.
Thus, the Findability principle can be made concrete in the following recommendations (a minimal metadata sketch follows the list):
- (Meta)data is assigned a globally unique and persistent identifier.
- Data is described with rich metadata.
- Metadata clearly and explicitly includes the identifier of the data it describes.
- (Meta)data is registered or indexed in a searchable resource.
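To make these recommendations concrete, here is a minimal sketch of what a findable metadata record and a tiny catalog could look like. The field names are illustrative assumptions, loosely modeled on the kind of metadata registries such as DataCite collect, not a prescribed schema; the DOI is the Zenodo example discussed later in this issue.

```python
# A minimal, illustrative metadata record (hypothetical field names,
# loosely modeled on DataCite-style metadata; not a prescribed schema).
dataset_metadata = {
    "identifier": "10.5281/zenodo.6538815",  # globally unique and persistent (a DOI)
    "title": "3D-printable discrete global grids",
    "creators": ["IMAGINARY"],
    "publication_year": 2022,
    "resource_type": "dataset",
    "subjects": ["discrete global grid", "climate modeling", "3D printing"],
    "description": "Geographic nodes and edges of several Earth grids, "
                   "processed into 3D-printable models.",
}

# A catalog is, at its simplest, a lookup table from identifiers to
# metadata records that a search engine can index.
catalog = {dataset_metadata["identifier"]: dataset_metadata}

def find(term, records=catalog):
    """Naive search engine: return records mentioning the term in any field."""
    term = term.lower()
    return [rec for rec in records.values() if term in str(rec).lower()]

print(find("grid")[0]["identifier"])  # -> 10.5281/zenodo.6538815
```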
The classical approach to searching for and finding data has been dominated by the publication paradigm: you look for a specific publication, or for any publication related to a certain topic, that will contain the information you are interested in. However, in reality, you often want to find a theorem, a formula, or some other concrete piece of information rather than a publication: for instance, a specific expression of a Bessel function, a particular representation of a given group, or the proof that certain differential equations have unique solutions. This requires re-thinking how we structure and manage research data. We discuss next the available places to find research data, and then MaRDI's proposal for such a comprehensive approach.
Where to look for research data
For mathematical articles, books, and other classically published works, a reference includes title, author, year, etc. While this is easily usable and readable by a human, it is not always consistent in format and it does not provide a means to locate and access that information. The two de-facto standard catalogs that collect mathematical literature and also assign a unique identifier are:
- The Mathematical Reviews (unique identifier: MR number), archived in MathSciNet by the American Mathematical Society and
- The Zentralblatt MATH (unique identifier: Zbl number), archived in zbMATH by the FIZ Karlsruhe - Leibniz Institute.
While these unique identifiers are helpful in referencing a piece of mathematical literature, and these platforms are useful in finding works in a specific math domain, their catalogs are much less comprehensive when it comes to other research data (databases, media, online resources, etc.). These platforms also have the drawback that authors cannot control the existence or the metadata of an entry, and MathSciNet is a subscription-based service*.
Another notable mention is arXiv, a de-facto standard platform for pre-publications. Here the actual paper is offered publicly, thus making it Accessible. Furthermore, any work in arXiv gets a unique ID and can be found via the catalog search. The focus here is also on literature, although there is limited support for datasets related to a paper. When it comes to non-literature research data, the panorama is much coarser. swMATH, a sister project to zbMATH, is a catalog of mathematical software packages (computer algebra, numerics, etc.) and a cross-referenced record of the zbMATH articles citing them. zbMATH also features a full-text search for formulas, which is being improved within the MaRDI framework.
There are also general-purpose identifiers and catalogs for data. One of the most standardized identifiers for online resources is the Digital Object Identifier (DOI), which can reference any digital object. Unlike a URL, a DOI is linked to a particular digital object and not to the server or website where it is hosted. The DOI website resolves the DOI number to the most up-to-date URL to access the data, so the DOI also serves as a locator in addition to being a unique identifier. Usually, publishers assign a DOI to new publications, but authors can also obtain a DOI from other registration agencies. Some open repositories offer free DOI registration. For instance, Zenodo is a general-purpose repository for open data, which hosts quite a few mathematical research datasets. See our article "Publishing research data in open repositories", where we talk more about Zenodo.
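As a small illustration of how a DOI doubles as a locator, the sketch below (assuming Python with the requests library) asks the doi.org resolver where a DOI currently points. The resolver answers with an HTTP redirect to the current landing page, which can move over time even though the DOI stays fixed.

```python
import requests

# Ask the doi.org resolver where a DOI currently points.
# (Uses the Zenodo DOI discussed later in this issue.)
doi = "10.5281/zenodo.6538815"
resp = requests.get(f"https://doi.org/{doi}", allow_redirects=False)

# The Location header holds the current URL of the landing page.
print(resp.status_code)          # typically 302 (a redirect)
print(resp.headers["Location"])  # e.g. a zenodo.org record page
```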
Currently, for pure research databases (experimental data, simulation data, etc.), there is no universally accepted repository in mathematics. There are a few curated collections of mathematical objects, such as the Online Encyclopedia of Integer Sequences (OEIS), the SuiteSparse Matrix Collection, and the NIST Digital Library of Mathematical Functions. The reality is that many researchers rely on open repositories for access to data. Unfortunately, in contrast to biological repositories, where researchers can find standardized catalogs of proteins or genetic encodings, mathematical catalogs are neither general-purpose nor very interoperable.
MaRDI's proposal concerning Findability
Unfortunately, most data-based mathematical research is still published either without the datasets, or with the datasets hosted on university servers, accessible only through the personal websites of the researchers involved.
MaRDI aims, on the one hand, to provide the necessary ground infrastructure to properly publish research data in federated repositories (using standards and practices in line with the FAIR principles) and, on the other, to spread awareness within the math research community of the problems that publishing research data entails and of the solutions proposed for them.
Here we will name a few of the initiatives related to the Findability principle.
The Scientific Computing Task Area (TA2) is preparing a benchmark framework to compare existing and new algorithms and methods for solving specific problems. For instance, there are several dozen methods to solve a linear system Ax=b, with different performance and different technology stacks, depending on the size of the matrix A, whether it is sparse or dense, whether we look for exact or approximate solutions, etc. So far there is no centralized catalog where a "user" (for instance, a computational biologist) can go to choose the best method for their particular application. This catalog and benchmark will make finding symbolic and numerical algorithms much easier, and it aspires to become a major reference when looking for such algorithms.
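To see why such a catalog matters, consider how much the "right" method for Ax=b depends on the structure of A. The sketch below (assuming Python with NumPy and SciPy) solves the same system with a dense direct solver and a sparse direct solver; on a large, very sparse A, the two differ dramatically in time and memory even though they return the same solution.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# The same problem Ax = b, solved with two of the many available methods.
# (Illustrative sketch only: a real benchmark would vary sizes, sparsity,
# accuracy requirements, hardware, and more.)
n = 2000
A = sp.random(n, n, density=0.001, format="csc") + sp.eye(n, format="csc")
b = np.ones(n)

# Method 1: dense direct solver. Ignores sparsity entirely,
# roughly O(n^3) work and O(n^2) memory.
x_dense = np.linalg.solve(A.toarray(), b)

# Method 2: sparse direct solver. Exploits the structure of A.
x_sparse = spla.spsolve(A, b)

print(np.allclose(x_dense, x_sparse))  # same solution, very different cost
```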
The tool for building this catalog is a knowledge graph of numerical algorithms. A knowledge graph is an abstract representation of a set of concepts, objects, events, or anything related to a domain of study, as nodes, together with formal relations between them (edges), that can be read by humans or computers unambiguously. The biggest collective effort to build a knowledge graph is Wikidata. In this mathematical knowledge graph, nodes will be the algorithms themselves as concepts, but also papers related to them, software packages implementing them, benchmarks, and connections to other databases. It will then be possible to navigate the knowledge graph to find semantic information, such as which algorithms extend a given one, where implementations can be found, and how they perform in comparison.
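As a taste of how such a graph can be queried, here is a small sketch (Python, against Wikidata's public SPARQL endpoint) that retrieves a handful of items classified as algorithms. The identifiers P31 ("instance of") and Q8366 ("algorithm") come from Wikidata's vocabulary; the MaRDI knowledge graph will use the same Wikibase machinery but its own nodes and properties.

```python
import requests

# Retrieve a few items that are an "instance of" (P31) "algorithm" (Q8366)
# from Wikidata's public SPARQL endpoint, together with English labels.
query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8366 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "mardi-newsletter-example/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "->", row["item"]["value"])
```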
Another effort aimed at Findability in MaRDI is Mathematical Entity Linking (MathEL), a way to extract and compare conceptual information from mathematical formulas. The concept behind a particular equation (for instance, the Klein-Gordon equation or the equations of General Relativity) can be expressed in many different forms: variables can be named differently, notations for derivatives or tensors may differ, and groupings and substitutions can occur. The MathEL sub-project aims to retrieve the conceptual information of formulas, to propose annotation standards for introducing semantic information into formulas (for instance, referencing a Wikidata node or another knowledge graph node), to mine large corpora of research data (for instance, the zbMATH catalog or the arXiv repository), and to create user interfaces to retrieve concept and source information, such as question-answering engines.
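For a concrete example of the problem MathEL tackles, here is the Klein-Gordon equation written in two standard but superficially different ways; a physicist reads both as "the same" equation, while a purely syntactic search would not:

```latex
% Full form, with physical constants spelled out:
\frac{1}{c^2}\frac{\partial^2 \psi}{\partial t^2}
  - \nabla^2 \psi + \frac{m^2 c^2}{\hbar^2}\,\psi = 0
% The same equation in natural units (c = \hbar = 1),
% using the d'Alembert operator:
(\Box + m^2)\,\phi = 0
```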
As a live demonstration, here is a sneak peek into the MaRDI portal, under development, which will integrate the MathWebSearch engine as a MediaWiki component. The formula search can find wiki pages based on formula expressions written in LaTeX on the pages of the MaRDI portal. This test wiki page contains a couple of math formulas. The search portal should be able to find those formulas when queried in the search box. With the TeX and BaseX configuration, you can try an input like " V=4/3 \pi r^3 " or " V=\frac{4}{3} \pi r^3 " and it will find the wiki page with the test formula. Also, with " V = 4/3 \pi ?s^3 " you can find variable substitutions. Other common re-writings are not yet recognized, such as " V = \frac{4\pi}{3} r^3 ", but the core search engine is under active development. The same engine is used in the zbMATH formula search. Plans for MaRDI include making entities in a Wikibase knowledge graph findable through formula search.
In subsequent articles, we will present other tasks being carried out within MaRDI** that exemplify the other FAIR principles (for instance, open interfaces or descriptions of workflows).
* MR Lookup offers limited services to non-subscribers. As of 2021, zbMATH became zbMATH Open and requires no subscription.
**The funded MaRDI proposal can be accessed here.

Taking some data from a project, we try to prepare it according to the FAIR principles. Follow us in our attempt to make it FAIR on the first try.
Publishing research data in open repositories
We are IMAGINARY, a math communication association, part of the MaRDI consortium and we develop and organize math exhibitions as our main activity. Using data that we collected about Earth grids for one of our recent projects on climate change, we will take you through how we almost painlessly set up data in a public repository.
Our latest exhibition is the "10-minute museum on the climate crisis mathematics", where we describe mathematical modeling and places where maths is used in climate science. We all know that the latitude and longitude grid is the most common way of creating a reference system on the Earth. Did you know there are other ways to divide the Earth into small regions that can be particularly useful in numerical models?
Quite excited by this, we contacted a couple of climate researchers, who prepared for us the sets of geographic nodes and edges that make up those grids. Then another of our collaborators took that data and converted it into a 3D-printable model by adding thickness to the edges and checking the structural integrity of the ensemble, so that it could become a physical object. Finally, a 3D printing company produced the objects that we used in our exhibition.
As this dataset was not used in a way that contributed to existing knowledge, it was not suitable for publication in a journal. However, it occurred to us that the data that was gathered and processed was niche and specific enough to be the basis for others to re-use and build on.
Being a company committed to Free and Open Source licenses, we wanted to make the data not only available but FAIR as well.
Git (GitHub, GitLab)
Since we were dealing with software files, the most convenient platform for publishing and development was GitHub. Git is an efficient version-control software, and any organization of code should start there. GitHub and GitLab are probably the most popular platforms to host projects. However, as a publishing tool, a repository could be considered almost a kind of personal website (in fact, you can host and serve a git repository on your own server), and it is a live, working tool. This means that the published data can change at any time. GitHub does not offer, by default, a guarantee of stability (although there are archive options), a standardized identifier, or a good way to search and find your data. Also, it keeps a record of all previous versions, so all your dirty work is exposed to the public.
Our GitHub page was our collaboration tool within the team. It was not intended as a publication method; it just happened that we left it to be publicly available. Having data available somewhere does not automatically make it FAIR. We wanted to have an identifier associated with it and we knew that some repositories offered that.
Zenodo
Zenodo is one such open-access, general-purpose repository. It is hosted on the CERN infrastructure and funded in part by the European Commission. Researchers in any scientific area use it to make a copy of their work findable and accessible to the public. These works can be articles or books, in pre-print form or, in some cases, already published by traditional publishing houses, but also databases, data files, images, or any digital asset that their research relies upon.
Zenodo assigns a Digital Object Identifier (DOI) if the work does not already have one. In this case, the DOI contains the string "zenodo". For instance: 10.5281/zenodo.6538815.
This was a perfect fit for our data and as a bonus, creating our entry on Zenodo was not difficult!
Firstly, we created an account. A valid email is all you need. You can also link it to your ORCID to identify the author(s) uniquely.
Secondly, we made a new upload draft. You can choose the type of document (publication, poster, dataset, image, video, software, physical object, etc.) and fill in the form with the title, authors, publication date (can be in the past), description, and several other fields.
For the authors, we added the ORCID of those who had one. We also used "IMAGINARY" as an author, even though it is not a physical person but a company.
We requested a new DOI since we did not have any. The DOI can be "reserved" during the draft process, so you know it in advance and can use it in the documents you prepare.
For the actual content, we used a zip file with the master branch of the GitHub repository. You can also link your Zenodo account to your GitHub account so that whenever you make a "release" in GitHub, a snapshot is automatically published in Zenodo.
Finally, we submitted the draft. Take note: once published, you can't add, delete or modify the files associated with a DOI, which is the main point of the DOI. You would have to make new versions with a new DOI. Thus, we recommend that you double- and triple-check before clicking submit. In case you make an erroneous submission, you can write an email to the Zenodo administrators for help.
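For those who prefer scripts to web forms, Zenodo also exposes a REST API for deposits. The following is a minimal sketch, assuming Python with the requests library, a personal access token created in your Zenodo account settings, and the endpoint and field names from Zenodo's developer documentation at the time of writing; file name, title, and description are placeholders. The final publish step is deliberately left to the browser, in the spirit of double- and triple-checking.

```python
import requests

ZENODO_TOKEN = "..."  # personal access token from your account settings
params = {"access_token": ZENODO_TOKEN}

# 1. Create an empty draft deposit (this also reserves a DOI).
draft = requests.post("https://zenodo.org/api/deposit/depositions",
                      params=params, json={}).json()

# 2. Upload a file into the draft's file bucket.
with open("earth-grids.zip", "rb") as fp:
    requests.put(f"{draft['links']['bucket']}/earth-grids.zip",
                 params=params, data=fp)

# 3. Attach metadata to the draft.
metadata = {"metadata": {
    "title": "3D-printable discrete global grids",
    "upload_type": "dataset",
    "description": "Geographic nodes and edges of several Earth grids.",
    "creators": [{"name": "IMAGINARY"}],
}}
requests.put(draft["links"]["self"], params=params, json=metadata)

# Publishing is irreversible for the attached files, so we stop here
# and review the draft in the browser before clicking "Publish".
print("Reserved DOI:", draft["metadata"]["prereserve_doi"]["doi"])
```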
Wikipedia / Wikidata
We now had an identifier that makes our data easy to find if you have it, or if you happen to search in Zenodo's search box. But we wanted to increase our Findability further. We needed to include our data in places where people often look for information, and Wikipedia / Wikidata are the perfect places for that.
Wikipedia is the universally known collaborative encyclopedia. With more than 6 million articles in English, it is easy to find an article relating to your data. However, before advertising your data on Wikipedia by editing general-interest articles, you must be familiar with the core principles of Wikipedia content: Neutral point of view, Verifiability, and No original research. That is to say, only link to research and data published elsewhere, and do not hijack articles for self-promotion.
In our case, we found an article on Discrete global grid. Since our work provides an example of such grids, it could be of general interest. Additionally, as there are no other examples of 3D-printable grids that we are aware of, we decided to add a link in the "External links" section.
We then had a look at Wikidata. Wikidata is the data backbone of Wikipedia. In contrast with Wikipedia, which is made of articles, Wikidata is made of entries; an entry can be an object, an abstract concept, a person, a feeling, a math research article... essentially anything. Every entry lists some properties of the item in a structured form. It is human-readable but also designed to be machine-readable, meaning that one day an AI or a search engine could obtain knowledge from this enormous database, which aspires to structure all human knowledge. As such, it is a suitable place to catalog research data. Many researchers index their articles there (listing title, authors, DOI...), as well as databases, models, etc. But many don't, so it is not yet a comprehensive research (or general) catalog. It is also less intuitive as a search tool than Wikipedia (there is no full text to read), and it can be challenging to retrieve useful information by hand.
In our case, searching for "Earth grid" produced nothing, while "Earth system grid" brought us to the US Department of Energy portal, and we learned that "Grid in Earth sciences" is the title of a published article. We finally found the Wikidata entry on "Discrete Global Grid" (linked in the Wikipedia article), which is about the concept but does not contain much information. We could have created a Wikidata entry and had our data listed as an instance (example) of a Discrete Global Grid, but we found that our 3D data would have more context in the Wikipedia article. Therefore, we decided not to put our reference in Wikidata.
After asking some colleagues, we found that a more typical use case would be the following: A published research article uses a dataset. Then a Wikipedia page references the published article as a source. By creating a reference in Wikipedia, an entry in Wikidata is created. Then a (different) entry in Wikidata representing the dataset is linked to the entry representing the published article. This way, there is a path from Wikipedia to the research data referenced in Wikidata. Hopefully, eventually, the dataset is used in other publications (referenced in other Wikipedia pages) and Wikidata can keep track of all the works derived from that dataset.
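Once such links exist, they can be traversed mechanically. As an illustrative sketch, the query below would list every work that Wikidata records as citing a given dataset entry; "cites work" (P2860) is a real Wikidata property, while Q00000001 is a placeholder for the hypothetical dataset item.

```python
import requests

# Hypothetical traversal of the path described above: find all works
# recorded as citing a given dataset entry (Q00000001 is a placeholder).
query = """
SELECT ?work ?workLabel WHERE {
  ?work wdt:P2860 wd:Q00000001 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": query, "format": "json"},
                    headers={"User-Agent": "mardi-newsletter-example/0.1"})
print(resp.json()["results"]["bindings"])
```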
Assessing the FAIRness
At this point, we were wondering: how can we tell if our data is really FAIR? How well did we do? Fortunately, there is a tool to assess just that!
The Automated FAIR Data Assessment Tool from the FAIRsFAIR initiative accepts any working reference, a DOI for instance, and tries to determine its FAIRness from its metadata. It generates a summarised report with individual scores and a final global mark. Luckily for us, Zenodo handles that metadata quite well and makes it available via the HTML code on the Zenodo page itself.
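For the technically inclined: the software behind this service, F-UJI, can also be called programmatically. The sketch below assumes the REST endpoint and payload shape described in the F-UJI documentation at the time of writing; the credentials are placeholders, so check the current documentation before relying on it.

```python
import requests

# Ask the F-UJI service to assess the FAIRness of our Zenodo DOI.
# (Endpoint, payload, and auth as per the F-UJI docs at the time of
# writing; credentials below are placeholders.)
payload = {"object_identifier": "https://doi.org/10.5281/zenodo.6538815"}
resp = requests.post("https://www.f-uji.net/fuji/api/v1/evaluate",
                     json=payload, auth=("username", "password"))
report = resp.json()
# The report contains per-principle scores, the same information that
# the web interface summarises as a final global mark.
print(report)
```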
So how did we do? On a scale from 0 to 3, our overall score is "moderate", or 2.
To improve that score, we could have edited the metadata and added more details; however, that is still a feature under development in Zenodo (e.g., supporting the citation file format), and it may be a bit cumbersome to edit that metadata on other platforms.
Conclusion
Overall, we were satisfied with this experiment of making our data FAIR. The GitHub workflow is a bit difficult to learn, but it is nowadays part of software development, and an added benefit is that it can integrate into FAIR workflows. Zenodo was a success: easy to use, takes care of most of the metadata, and provides free DOIs. Wikipedia is not difficult, but you need to keep your wish for visibility from undermining the general interest of an encyclopedia. As for Wikidata, we concluded that it is not for our use case (although it might be for other research data). Finally, the FAIR data assessment tool is great not only for evaluating but also for teaching good practices and improving your FAIRness. There are probably many more tools and tips to discover, but so far the trip was not a hard one.
We hope that reading about our experience encourages you to re-evaluate, and to improve, the FAIRness of your data.

In Conversation with Cédric Villani
In the first episode of the interview series "Data Dates", Cédric Villani joins Christiane Görgen for a brief exchange of thoughts about Math & Data.
OpenML hackathon at Dagstuhl castle
Sebastian Fischer and Oleksandr Zadorozhnyi, of the MaRDI task area Statistics and Machine Learning, participated in an OpenML hackathon held in late March at the headquarters of the Leibniz Center for Informatics at Dagstuhl, Germany.
OpenML is an open-source platform for sharing datasets, algorithms, experiments, and results. The hackathon was initiated by Bernd Bischl, one of the key players behind OpenML and a Co-Spokesperson in MaRDI. Researchers from other parts of Germany, France, the Netherlands, Poland, and Slovenia were present to discuss topics such as data quality on OpenML, an extension of its established services to new data formats, and new computational tasks.
The review article "Datasheets for datasets" prompted fruitful exchanges on future improvements of data and metadata quality. In particular, support for non-tabular data formats such as images was discussed and will now be enabled by transitioning from the Attribute-Relation File Format (ARFF) to Parquet. The eight task types available so far, including regression, classification, and clustering, will be extended with new tasks typical for graphical modeling. As this is one of the main use cases and an important topic for both Sebastian and Oleksandr, they discussed with Jan van Rijn the problem of estimating graphical-model structure from a given dataset, its embedding into the current set of tasks available on OpenML, the addition of different evaluation measures and criteria for model selection, and the storage of graph-specified datasets within the OpenML framework. The evaluation measures and criteria for model selection allow estimated graphs to be compared to a given ground truth, a procedure that is not normally part of the ML workflow.
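For readers who want to see OpenML from the user side, here is a minimal sketch using the openml Python package; the IDs are just convenient public examples (dataset 61 is the classic "iris" table).

```python
import openml

# Fetch a dataset from OpenML by its public ID and load it as a table.
dataset = openml.datasets.get_dataset(61)  # "iris", used purely as an example
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)
print(dataset.name, X.shape)

# Tasks bundle a dataset with a problem type and an evaluation protocol;
# new task types, like the graphical-model estimation discussed above,
# would extend this same mechanism.
task = openml.tasks.get_task(59)  # a supervised classification task on iris
print(task.task_type)
```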
Sebastian also presented their collaborative work with Michel Lang on the mlr3oml R package. This package connects the OpenML platform to the open-source machine learning package mlr3 in R, another crucial aspect of the MaRDI task area.
The hackathon was rounded out with social activities like a walk through the forest. The good weather aside, special thanks must go to Joaquin Vanschoren, the OpenML founder, whose supply of water to the whole group during the hike was the other reason why everyone made it back to the castle in good spirits!
All in all the week in Wadern was a pleasant and fruitful one for all the participants.

We will also be introducing you to the people who shape MaRDI with their expertise and vision for mathematical research data. They will appear in a series of "Making MaRDI" interviews available via our Twitter account. Stay tuned!


Call for seed funds 2023
These funds support scientists from all fields of research within engineering, working on the development and implementation of innovative ideas in data management. The grant is equivalent to the funding of a full-time doctoral position for one year. If necessary, the funding can be split between project partners.
More information:


- To learn about the Nationale Forschungsdateninfrastruktur, the community of which MaRDI is just one small part, read the 2021 article by Nathalie Hartl, Elena Wössner, and York Sure-Vetter in Informatik Spektrum. See doi.org/10.1007/s00287-021-01392-6
- Christiane Görgen and Claudia Fevola explain in a short review article the role repositories can play in the MaRDI infrastructure. They use MathRepo as an example, a small math research-data repository hosted at the Max Planck Institute for Mathematics in the Sciences in Leipzig. See arxiv.org/abs/2202.04022
- The interim report of the European Commission Expert Group on FAIR data discusses how to turn FAIR into reality. See doi.org/10.2777/1524
- Thomas Koprucki and Karsten Tabelow have been two of the driving forces in the early stages of MaRDI. Together with Ilka Kleinod they discussed mathematical models as an important type of mathematical research data in a 2016 article for the Proceedings in Applied Mathematics and Mechanics: doi.org/10.1002/pamm.201610458
Our Newsletter "Math & Data Quarterly" is prepared by our partner IMAGINARY. You can unsubscribe easily at any time.