09-2023 6th issue content


Welcome

Welcome back to the Newsletter on mathematical research data—this time, we are discussing a topic that is very much at the core of our interest and that of our previous articles: what is mathematical research data? And what makes it special?

Our very first newsletter delved into a brief definition and a few examples of mathematical research data. To quickly recap, research data are all the digital and analog objects you handle when doing research: this includes articles and books, as well as code, models, and pictures. This time, we zoom into these objects, highlight their properties, needs, and challenges (check out the article "Is there Math Data out there?" in the next section of this newsletter), and explore what sets them apart from research data in other scientific disciplines. We also report from workshops and lectures where we discussed similar questions, present an interview with Günter Ziegler, and invite you to events to learn more.

*"Yesterday at the Math Bazaar exchanging recent research data..."*

Download the illustration

by Ariel Cotton, licensed under CC BY-SA 4.0.

We start off with a fun survey. It is again just one multiple choice question. This time, we created a decision tree, which will guide you to answer the question:

What type of mathematician are you?

You will be taken to the results page automatically after submitting your answer. The current results can also be accessed here.

The decision tree is available as a poster for download, licensed under CC BY 4.0.

Research Data in Math

Is there Math Data out there?

“Mathematics is the queen and servant of sciences”, according to a quote by Carl F. Gauss. This opinion of Gauss can be a source of philosophical discussions. Is Mathematics even a science? Why does it play a special role? Connecting these questions to our concerns, what is the relationship between research data and these philosophical questions? We cannot arrive at a conclusion in this short article, but it is a good starting point to discuss the mindset (the philosophy, if you wish) that should be adopted regarding research data in mathematical science.

A wide agreement is that a science is any form of study that follows the scientific method: observation, formulation of hypotheses, experimental verification, extraction of conclusions, and back to observation. In most sciences (natural sciences and, to a great extent, also social sciences), observation requires gathering data from nature in the form of empirical records. In contrast, in pure mathematics observations can be made simply by reflection on known theory and logic. In natural sciences, nature is the ultimate judge of the validity or invalidity of a theory. This experimental verification also requires gathering research data in the form of empirical records that support or refute a hypothesis. In contrast, in mathematics, experimental verification is substituted by formal proofs. Such characteristics have prompted some philosophers to claim that mathematics is not really a science, but a meta-science because it does not rely on empirical data. More pragmatically, it can tempt some researchers and mathematicians to say that (at least, pure) mathematics does not use research data. But as you will guess, in the Mathematical Research Data Initiative (MaRDI), we advocate for quite the opposite view.

Firstly, some parts of mathematics do use experimental data extensively. Statistics (and probability) is the branch of mathematics for analyzing large collections of empirical records. Numerical methods are practical tools to perform computations on experimental data. This is the case for pure mathematics as well, where we can build lists of records (prime numbers, polytopes, groups…) that are somewhat experimental.

Secondly, research data are not only empirical records. Data are any raw pieces of information upon which we can build knowledge (we discussed the difference between data, information, and knowledge in the previous newsletter). When we talk about research data, we mean any piece of information that researchers can use to build new knowledge in the scientific domain in question, mathematics in this case. As such, articles and books are pieces of data. More precisely, theorems, proofs, formulas, and explanations are individual pieces of data. They have traditionally been bundled into articles and books and stored on paper, but nowadays they are largely available in digital form and accessible through computerized means.

Types of data

In modern mathematical research, we can find many types of data:

Documents (articles, books) and their constituent parts (theorems, proofs, formulas…) are data. Treating mathematical texts as data (and not only as mere containers where one deposits ideas in written form) recognizes that mathematical texts deserve the same treatment as other forms of structured data. In particular, FAIR principles and data management plans also apply to texts.

Literature references are data. Although bibliographic references are part of mathematical documents, we mention them separately because references are structured data. There is a defined set of fields (such as author, title, publisher…), there are standard formats (e.g. BibTeX), and there are databases of mathematical references (e.g. zbMATH, MathSciNet…). This makes bibliographic references one of the most curated types of research data (especially in mathematics).

Formalized mathematics is data. Languages that implement formal logic, such as Coq, HOL, Isabelle, Lean, and Mizar, are a structured version of the (unstructured) mathematical texts that we just mentioned. They contain proofs verifiable by software and are playing an increasingly vital role in mathematics. Data curation is essential to keep those formalizations useful and bound to their human-readable counterparts.

Software is data. This ranges from small scripts that help with a particular problem to large libraries that integrate into bigger frameworks (Sage, Mathematica, MATLAB…). Notebooks (such as Jupyter) are a form of research data that mixes text explanations and interactive prompts, so they need to be handled as both documents and software.

Collections of objects are data. Classifications play a major role in mathematics. Whether gathered by hand or produced algorithmically, the result can be a pivotal point from which many other works derive. Although the output of a classification can have more applications than the process used to arrive at it, it is essential that both the input algorithm (or manual process) and the output classification are clearly documented, so that the classification can be verified and reproduced independently, as well as reused in further projects.

Visualizations and examples are data. Examples and visual realizations of mathematical objects (including images, animations, and other types of graphics) can be very intricate and have enormous value for understanding and developing a theory. Although examples and visualizations can be omitted in more spartan literature, if provided, they deserve the same full curation as any other research data essential to logical proofs.

Empirical records are data. Of course, raw collections of information gathered from nature, intended to be processed to extract knowledge either about the data itself or about the statistical method, are data that need special tools to handle. This applies to statistical databases, but also to machine learning models that require vast amounts of training data.

Simulations are data. Simulations are lists of records not measured from the outside world, but generated by a program. This is usually a representation of a state of a system, possibly including some discretizations and simplifications of reality in the modeling process. As with collections, the output simulation data is as necessary as the input source code that generates it. Simulation data is what allows us to extract conclusions, whereas verifying reproducibility requires that a third party can rerun the processing from input to output. This makes it possible to spot flaws or errors in either the input or the output, or to rerun the simulation with different parameters.

Workflow documentations are data. More general than simulations, workflows involve several steps of data acquisition, data processing, data analysis, and extraction of conclusions in many scientific research projects. An overview of the process is in itself a valuable piece of data, as it gives insights into the interplay of the different parts. A numerical algorithm can be individually robust and performant, but it may not be the best fit for the task at hand. We can only spot such issues when we have a good overview of the entire process.
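To make the point about simulations and their inputs concrete, here is a minimal Python sketch of a simulation record that keeps the parameters, the output, and a checksum of the output together. The toy random walk and all function names are hypothetical illustrations, not part of any MaRDI tool; a real project would likely also record software versions and the environment.

```python
import hashlib
import json
import random


def simulate_random_walk(steps, seed):
    """Toy simulation (hypothetical example): a 1D random walk with unit steps."""
    rng = random.Random(seed)  # a private, seeded generator makes the run repeatable
    position = 0
    trajectory = [position]
    for _ in range(steps):
        position += rng.choice((-1, 1))
        trajectory.append(position)
    return trajectory


def run_record(steps, seed):
    """Bundle the input parameters, the output, and an output checksum in one record."""
    trajectory = simulate_random_walk(steps, seed)
    payload = json.dumps(trajectory).encode("utf-8")
    return {
        "parameters": {"steps": steps, "seed": seed},
        "output": trajectory,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


if __name__ == "__main__":
    record = run_record(steps=10, seed=2023)
    print(json.dumps(record["parameters"]), record["sha256"])
```

A third party who reruns `run_record` with the stored parameters can compare checksums to see whether the published output was reproduced, or change the parameters to explore a different scenario.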

The building of mathematics

One key difference between mathematics and other sciences is the existence of proofs. Once a result is proven, it is true forever, as it cannot be overruled by new evidence. The Pythagorean theorem, for instance, is as valid and useful today as it was in the times of Pythagoras, or even in the earlier times of the ancient Babylonians and Egyptians, who knew and used it. It was the Greeks, however, who invented the concept of proof, turning mathematics from a practice into a science. Euclid’s Elements, written circa 300 BC and one of the most relevant books in the history of mathematics and mankind, perfectly represents the idea that mathematics is a building, or a network, in which each block is built on top of others, in a chain starting with some predetermined axioms. The image shows the dependency graph of propositions in Book I of the Elements.

Dependency graph of propositions in Book I of Euclid’s Elements (source). Proposition I.47 is the Pythagorean theorem.

Imagine now that we extend the above graph to include all propositions and theorems from all mathematical literature up to the current state of research. That huge graph would have millions of theorems and dependency connections, and it would be futile to draw it on paper. This graph does not yet exist physically or virtually, only as an abstract concept. Parts of this all-mathematics graph are stored in the brains of some mathematicians, or in the literature as texts, formulas, and diagrams. The breakthrough of our times is that it is conceivable to materialize this graph with today’s technology, in the form of a knowledge graph similar to those being developed at MaRDI or Wikidata. The benefits of having such a graph in a computer system are many: we would be able to find any known theorem that applies to our problems, access the fundamental blocks of literature where those results were established, find and verify logical connections in complex proofs, and gain a panoramic view of mathematics and its different areas.
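As a rough illustration of what such a graph allows once it lives in a computer, the following Python sketch stores a tiny dependency graph and answers two questions: what does a statement ultimately rely on, and in which order can the statements be established? The statement names are hypothetical placeholders and are not taken from Euclid’s Elements or from any MaRDI knowledge graph.

```python
from graphlib import TopologicalSorter

# Hypothetical toy graph: each statement maps to the statements its proof uses.
DEPENDS_ON = {
    "axiom_1": set(),
    "axiom_2": set(),
    "lemma_A": {"axiom_1"},
    "lemma_B": {"axiom_1", "axiom_2"},
    "theorem_T": {"lemma_A", "lemma_B"},
}


def all_dependencies(statement, graph):
    """Return every statement that `statement` relies on, directly or indirectly."""
    seen = set()
    stack = list(graph[statement])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def proof_order(graph):
    """Order the statements so that each one appears after everything it uses."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(sorted(all_dependencies("theorem_T", DEPENDS_ON)))
    print(proof_order(DEPENDS_ON))
```

The standard-library `graphlib` module (Python 3.9 and later) also raises an error if the graph contains a cycle, which in this setting would signal a circular argument.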

The crucial point is that to succeed in such an endeavor, we must realize that mathematical knowledge is composed of pieces of data that require FAIR and sophisticated data management, as well as a dedicated infrastructure to handle data at this scale. Although such a graph is not completely outside MaRDI’s scope, MaRDI itself does not aim to create a knowledge graph of all mathematical theorems but instead focuses on the research data management required by today’s researchers. The most advanced project aiming to realize this all-mathematics graph is probably found within the Lean community (see also our interview with Johan Commelin).
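For readers who have not seen formalized mathematics, here is a small sketch in Lean 4 of how one proven block can be reused by the next. The two theorem names and the `Sketch` namespace are made up for this illustration; the only library fact used is `Nat.zero_add` from Lean's core library.

```lean
namespace Sketch

-- First block: adding zero on the left changes nothing.
-- (This simply restates `Nat.zero_add` from Lean's core library.)
theorem zero_add_once (n : Nat) : 0 + n = n :=
  Nat.zero_add n

-- Second block: proved by chaining two uses of the first block,
-- so it depends on `zero_add_once` in the dependency graph.
theorem zero_add_twice (n : Nat) : 0 + (0 + n) = n :=
  (zero_add_once (0 + n)).trans (zero_add_once n)

end Sketch
```

Because the proof checker records that `zero_add_twice` refers to `zero_add_once`, the dependency edge between the two statements is itself machine-readable data.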

Mathematics as a tool

The “special role” of mathematics among the sciences comes from its role as a tool in every other science, to the point that a science is not considered mature until it has a mathematical formalization. The fact that mathematics can be used as the tool for doing science is the so-called “unreasonable effectiveness of mathematics in the natural sciences”. But once this role of mathematics as a tool is accepted, we must admit that, in theory, it is a very reliable tool. It is so, foremost, because of the logical building process that we described above. A proven theorem will not fail unexpectedly, and the rules of logic will not cease to exist tomorrow. But in practice, relying on tools that someone else developed requires, first, that one can trust the tool to execute its intended goal; and second, that one can learn how to use the tool effectively. This entails responsibility from mathematics as a science and from mathematicians as a community with respect to other sciences and researchers.

As happens with physical tools, a craftsman must know their tools well in order to use them efficiently. But also any modern toolmaker must state clearly the technical characteristics of the tool, the intended use, the safety precautions, its quality standards and regulations, etc. In our analogy, mathematicians must take care of impeccable preparation of the results they produce, especially when talking about algorithms and methods that will probably be applied by researchers in other fields of science.

Think of the calculus used in quantitative finance, statistical hypothesis tests to analyze data in medicine, or computers tracking the exact location of spaceships. If mathematicians did not get their derivatives and integrals right, these methods would not provide reliable results, leading to wrong conclusions and often even putting people’s lives in danger. It is of utmost importance to be able to fully trust at least the theoretical basis, especially since applied science has to deal with rounding errors, components of nature that were not integrated into the original model, and the possibility of human failure. This requires the results to be verifiable.

Concerning the mastery of a tool, mathematical production must take into account its future reusability as a tool for other scientists. This means providing appropriate documentation, using appropriate standards for interoperability with existing tools, using legal licenses that allow unencumbered reuse, and in general following agreed good practices of the community that can serve as guidelines for research practice.

Modern science in the age of information and computation depends entirely on research data, but different fields have adapted their methods and practices with uneven success. Mathematics is not especially well placed in terms of managing research data and software in comparison to other fields.

Software development, especially in the open source community, has been facing data management problems for decades, so some of its solutions are now standard practice in the industry. For instance, version control (with git as the de facto standard tool) is a basic practice to track changes and improvements to source code, and it works equally well for any document or data. If we couple version control with a public repository (GitHub, GitLab…), we get a reliable method of publishing software and working collaboratively. Once a project has many contributors, it will face merging problems when different teams develop in different directions. A solution is a continuous integration scheme with automated tests, which check that your modifications will not break other parts of the project if adopted. The amount of security and verification in the industry for any new development in big software projects (think, for instance, of new Linux kernel releases) is certainly unparalleled in most software projects in the scientific research community (with notable exceptions such as xSDK). This is often excused on the grounds that research is by nature experimental (in the sense of untested and unfinished), but academic and theoretical research should not have lower standards than industry research.
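To give a flavor of what an automated test for mathematical software can look like, here is a minimal Python sketch. The `trapezoid` function and the two test functions are hypothetical examples written for this article; they compare a numerical integral against values known exactly from calculus. A test runner such as pytest would pick up the `test_` functions, and a continuous integration service could run them on every proposed change.

```python
import math


def trapezoid(f, a, b, n=1000):
    """Approximate the integral of f over [a, b] with the composite trapezoidal rule."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def test_trapezoid_polynomial():
    # Exact value: the integral of x^2 over [0, 1] is 1/3.
    assert math.isclose(trapezoid(lambda x: x * x, 0.0, 1.0), 1 / 3, abs_tol=1e-6)


def test_trapezoid_sine():
    # Exact value: the integral of sin over [0, pi] is 2.
    assert math.isclose(trapezoid(math.sin, 0.0, math.pi), 2.0, abs_tol=1e-5)


if __name__ == "__main__":
    test_trapezoid_polynomial()
    test_trapezoid_sine()
    print("all checks passed")
```

The tolerances are chosen to sit above the known discretization error of the trapezoidal rule for these integrands at `n=1000`, so a failing test points to a real regression rather than to ordinary rounding.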

Data Dates

The video is available under the CC BY 4.0 license. You are free to share and adapt it, provided you credit the author (MaRDI).

In Conversation with Günter Ziegler

"There's nothing more successful than success", Günter Ziegler says in our latest data date: best practices will be embraced by the community. We talk about his combinatorial view of research data, the need for classifications, and the difference between everlasting mathematical results and theories in physics.

Mathematics Meets Data: Highlights from MaRDI's Barcamp

What better way to get researchers to find out that research-data management is their topic than with a Barcamp? That way, every participant can explore their own experiences, questions, and approaches.

On July 4th, MaRDI hosted its first Barcamp on Research-Data Management in Mathematics at Bielefeld University's Center for Interdisciplinary Research. It was a joint effort involving the Bielefeld mathematics faculty, MaRDI, BiCDaS, and the Bielefeld Competence Center for Research Data.

The day began with a casual breakfast, where attendees mingled, discussed expectations, and chatted about questions. A poster showcasing research data types served as a useful conversation starter (find the download link for the poster in the welcome section of this newsletter issue).

Before the session pitches commenced, Lars Kastner and Pedro Costa-Klein delivered brief talks on code reproducibility and best practices for using Docker in the Collaborative Research Center 1456 (Mathematics of the Experiment) in Göttingen, respectively.

The session pitches revealed that the Barcamp had appealed to many young researchers unfamiliar with the topic. To address this, an introductory session on "What is research data?" kicked off the discussions. Meanwhile, those more experienced with research data management discussed ways to engage the mathematical community with the topic.

One of the defining features of a Barcamp is its participant-driven agenda. Attendees had the unique opportunity to shape the discussions and focus on the topics most pertinent to their research and data management needs. This resulted in a diverse set of topics. One session on research data management plans matched experts from the Competence Center and mathematicians to exchange perspectives and requirements. A smaller group's discussions centered on BinderHub, whereas another tackled research data repositories and their adherence to FAIR principles. Additional sessions explored the peculiarities of mathematical research data and the importance of good documentation, and a hands-on session introduced an online database that collects and discusses ideas on FAIR data.

This Barcamp offered the mathematics community an exceptional platform to exchange insights and inquiries regarding research-data management within their discipline.

Teaching research-data management

A survey conducted in the summer of 2021 in German mathematics departments revealed that teaching mathematicians consider their students' awareness and knowledge of good scientific practice, authorship attribution, the FAIR principles, and research software to be too low. Unfortunately, these are classical research-data management (RDM) topics. Motivated by that need and by successful, cross-disciplinary RDM courses at Bielefeld and Leipzig universities, six lectures on research-data management for mathematicians took place in Leipzig in the summer term 2023. To the teacher's knowledge, this was the first course of its kind. The large group of attendees came from a variety of career levels, including six undergraduate students, two PhD students, two postdocs, and five MaRDIans. This contributed to lively discussions centered around properties and common problems of mathematical research data, metadata standards for papers and the difficulties in deciding on appropriate metadata for mathematical results, the scientific method, good scientific practice, and how to write, cite, and document mathematics. Feedback for the course was very good, with students appreciating the interactive atmosphere, the time allocated for questions, and the informal nature of the classes. A one-day course on maths RDM in Magdeburg in October will build on these first successful sessions and discuss questions of reproducibility and repositories, in addition to introductory topics. Lecture notes for both are now in the making. They will be made publicly available for a second installment next summer term, for free use and reuse by any mathematician interested in the topic of RDM.

MaRDMO Workshop at the NFDI-MatWerk Conference

The "1st Conference on Digital Transformation in Materials Science and Engineering - NFDI-MatWerk Conference" took place in Siegburg from 26 to 29 June 2023. With 30 talks, 17 posters, 10 workshops, and 160 participants (on-site and online), the conference provided an ideal setting for the urgently needed transformation in materials science. In addition to status updates from each NFDI-MatWerk task area and various interdisciplinary use cases, the conference initiated collaborations between different NFDI consortia and new community participants, emphasizing their role in shaping the future of NFDI-MatWerk. Several NFDI consortia, namely NFDI4Chem, NFDI4energy, DAPHNE4NFDI, and FAIRmat, also gave keynote presentations, highlighting the need for collaboration.

Marco Reidelbach from TA4 attended the conference on behalf of the MaRDI consortium to present MaRDMO, a plugin for the Research Data Management Organiser (RDMO) for documenting, publishing, and searching interdisciplinary workflows. Though participation was low at the 100-minute demonstration, discussions vital for the further development of MaRDMO ensued. The central point of the discussion was the automation of the documentation process to minimize additional work for researchers, thereby increasing the acceptance of MaRDMO. We also discussed the use of RDMO, which on paper appears to be an ideal interface to all research disciplines, but was completely unknown to the workshop participants. Here, the NFDI in particular is also called upon to take a clear stand. A good two-thirds of the consortia have declared their support for RDMO, while the remaining consortia want to rely on alternatives or are still undecided.

Overall, the NFDI-MatWerk consortium conference showed that the defining infrastructural issues, far from the concrete content, differ little or not at all from the issues in the MaRDI consortium and the other consortia at the conference. The construction of knowledge graphs and the harmonization of ontologies are central problems that require a joint effort and make it necessary to leave one's own comfort zone.

MaRDI at CoRDI

MaRDI was present at the first Conference on Research Data Infrastructure (CoRDI), held in Karlsruhe from 12 to 14 September 2023. This interdisciplinary event brought all the NFDI consortia together, and they presented their projects in general and detailed discussions. The conference was a unique opportunity to exchange experiences and ideas among a wide range of communities that have different needs but share common challenges and solutions regarding research data.

MaRDI presented three talks and two posters. The general conference proceedings are linked in the recommended further reading section at the end of this newsletter issue. We provide links to individual sections here:

Talks:

MaRDI. Building Research Data Infrastructures for Mathematics and the Mathematical Sciences. Renita Danabalan, Michael Hintermüller, Thomas Koprucki, Karsten Tabelow.
MaRDIFlow: A Workflow Framework for Documentation and Integration of FAIR Computational Experiments. Pavan L. Veluvali, Jan Heiland, Peter Benner.
Building Ontologies and Knowledge Graphs for Mathematics and its Applications. Björn Schembera, Frank Wübbeling, Thomas Koprucki, Christine Biedinger, Marco Reidelbach, Burkhard Schmidt, Dominik Göddeke, Jochen Fiedler.

Posters:

MaRDMO Plugin. Document and Retrieve Workflows Using the MaRDI Portal. Marco Reidelbach, Eloi Ferrer, Marcus Weber.
Spreading the Love for Mathematical Research Data. Tabea Bacher, Christiane Görgen, Tabea Krause, Andreas Matt, Daniel Ramos, Bianca Violet.

NFDI4friends

Math Meets Information Specialists, October 09 - 11, 2023, MPI MiS, Leipzig

MaRDI invites information specialists, librarians, data stewards, and mathematicians to discuss mathematical research data, present their own ideas and services, and make new connections in a three-day noon-to-noon workshop with talks, hands-on sessions, and a barcamp. The workshop will be held in German.

More information:

in German

Data-Driven Materials Informatics, March 4 - May 24, 2024

The aim of this long program at IMSI is to bring together a diverse scientific audience, both between scientific fields (physical sciences, materials sciences, biophysics, etc.) and within mathematics (mathematical modeling, numerical analysis, statistics, data analysis, etc.), to make progress on key questions of materials informatics.

More information:

in English

RDM with LinkAhead, September 29, 2023, online

At the NFDI4Chem Stammtisch, the research data management software LinkAhead will be introduced. This agile, open-source software toolbox enables professional data management in research where other approaches are too rigid and inflexible. It will make your data findable and reusable.

More information:

in English
in German

NFDI Code of Conduct

The Consortial assembly, comprising the speakers of each consortium, voted on 27 June 2023 to adopt the code of conduct for the NFDI. This Code of Conduct is intended to provide a binding framework for effective collaboration within the NFDI association.

More information:

in German