Math & Data Quarterly
News and insights into the realm of mathematical research data
Welcome back to a new issue of the MaRDI newsletter after a hopefully prosperous and colorful summer (or winter, depending on your location on Earth). Mathematical research data is usually independent of seasons, color, or location. However, our current topic does combine geographic maps and colors – the four-color theorem.
by Ariel Kahtan, licensed under CC BY-SA 4.0.
The main article of our newsletter takes you on a journey through time, technology, and philosophy, raising questions such as: What constitutes a proof? How much can we trust computers (and their programmers)? And how large is the social component of writing mathematical proofs? In the Data Date section, you will meet Yves Bertot, a computer scientist and maintainer of the official Coq four-color theorem repository on GitHub. In our one-click survey, we ask you to answer the question:
Do you trust computer-based proofs in mathematics?
In June, we asked you to select one "item" to symbolize the entirety of mathematics. Here are your results:
Some of you also suggested new items. Among these were the Hopf fibration, axioms, and a topological manifold.
A lost and found proof
The four-color theorem is a famous result in modern mathematical folklore. It states that every possible map (any division of the plane into connected regions, like a world map divided into countries, or a country divided into regions) can be colored with at most four colors in such a way that adjacent regions do not share the same color. Like many famous theorems in mathematics, it has a statement simple enough for anyone to understand and a fascinating story behind it. It was the first major theorem in pure mathematics whose proof absolutely required the assistance of a computer. But more interestingly for us, the story of the four-color theorem is a story of how Research Data in Mathematics became relevant and how mathematicians were led to debate questions such as “How do we integrate computers into pure mathematical research?” or, more fundamentally, “What constitutes a proof?”. The story goes like this:
A problem of too many cases
In 1852, a young Englishman by the name of Francis Guthrie was amusing himself by coloring the counties of England so that no two neighboring counties received the same color. He managed to do it with four colors, and he wondered whether any conceivable map would also be colorable with only four colors. The problem of coloring the regions of a map is equivalent to coloring the vertices of a planar graph. To build that graph, take one vertex per region and join two vertices if the corresponding regions are adjacent (see the illustration accompanying this article). If you start with a map, you obtain a planar graph (the edges can be drawn so they do not cross each other), and every planar graph can be converted into a map. This is therefore a problem in graph theory. Francis asked his brother, Frederick, a student of mathematics, who in turn asked his professor, Augustus de Morgan. De Morgan could not solve it easily and shared it with Arthur Cayley and William R. Hamilton, and after a while, it caught some interest in the mathematical community.
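To make the graph formulation concrete, here is a minimal sketch in Python (the toy map and the coloring are our own invented example, not data from any of the historical proofs). It records which regions border which and checks that a proposed assignment of four colors is proper, i.e., that no two bordering regions share a color:

```python
# Toy map: each region is listed with the regions it borders.
# The map and the coloring below are illustrative inventions.
borders = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

# A candidate coloring using at most four colors.
coloring = {"A": "red", "B": "green", "C": "blue", "D": "green"}

def is_proper(borders, coloring):
    """Return True if no two bordering regions share a color."""
    return all(
        coloring[region] != coloring[neighbor]
        for region, neighbors in borders.items()
        for neighbor in neighbors
    )

print(is_proper(borders, coloring))        # True
print(len(set(coloring.values())) <= 4)    # True: at most four colors are used
```

Checking one candidate coloring is, of course, the easy part; the difficulty of the theorem lies in proving that such a coloring exists for every conceivable map.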
It was more than 20 years later, in 1879, that Alfred Kempe, a barrister with some mathematical inclinations, published a proof with many elegant ideas [Kem79]. Kempe’s proof was accepted as valid for some years, but unfortunately, Percy Heawood found it erroneous in 1890. Kempe’s strategy, however, was right, and it was the basis for all subsequent attempts to prove the conjecture. It went like this: First, find a finite set of “configurations” (parts of a graph) such that every hypothetical counterexample must contain one of these configurations (we call it an unavoidable set of configurations). Second, for each of these configurations, prove that if it can be colored with five colors, then we can rearrange the colors so that only four are used (we say that it is a reducible configuration). The existence of such a set of unavoidable reducible configurations proves the theorem: if there were a minimal counterexample map needing five colors, it would contain one such configuration, but then it would be possible to re-color it with only four. Kempe’s proof used a set of only four configurations, which he checked one by one. Unfortunately, he failed to see one possible sub-case when checking one of them. Kempe’s original proof could not be fixed, but many mathematicians gradually became convinced that the solution would come from a (much) larger set of unavoidable reducible configurations. It was a problem of enumerating and checking a huge number of cases.
The problem remained open for eight more decades, with many teams of mathematicians racing for the solution in the 60s and 70s. In 1976, mathematicians Kenneth Appel and Wolfgang Haken finally produced a valid proof [AH77a], [AHK77]. Their method, however, was a bit unorthodox. Haken was already a renowned topologist, and Appel was one of the few people who knew how to program the new mainframe computer that their university (University of Illinois) had recently purchased. Together, they developed a theoretical argument that reduced the proof to checking 1,936 configurations that had to be analyzed independently. Then, they used the newly available computers to write a program that verified all cases, completing the proof. They published their result as a series of long articles [AH77a], [AHK77], supplements [AH77b], [AH77c], and revised monographs [AH89], filled with long tables and hand-drawn diagrams representing every case and its analysis, collected manually in a pre-database era.
Appel and Haken's proof was controversial at the time. Many mathematicians could understand the theoretical parts of the proof, but a computer scientist was needed to program the code; programming was quite a rare skill among mathematicians in 1976. Even if you understood both the math and the algorithm, you still needed to implement it. The program was originally written in the assembly language of the IBM 370 mainframe, which, again, was not available to everybody. And even if you had all the pieces, you had to trust the computer's processors, and even then there remained the uncertainty that there might be a typo in the code, a forgotten case to check, or something overlooked in the tedious visual inspections of the diagrams (indeed, several minor typos were found and fixed in the original articles). Nevertheless, the mathematical community mostly celebrated the achievement. Postal stamps were issued in commemoration, and the result made it into the scientific and even mainstream news [AH77d].
In 1995, almost twenty years later, mathematicians Neil Robertson, Daniel Sanders, Paul Seymour, and Robin Thomas tried to replicate Appel and Haken’s proof with a more modern computer. In the process, they found it so hard to verify (understanding and implementing the algorithms) that they eventually gave up and decided to produce their own proof [RSST96]. It was a substantial modification that streamlined the argument. In particular, they reduced the number of cases to be checked to 633: still too many to check by hand, but an improvement nonetheless. They also used a higher-level programming language, C, and they made the source code available through a website and an FTP server at their university.
The last stop in our story came in 2005, with computer scientist Georges Gonthier. Originally aiming for a proof of concept that computers can help in formal logic, Gonthier managed to translate Robertson et al.’s proof into a fully formalized, machine-readable proof of the four-color theorem, using the Coq system and language [Gon05], [Gon08]. To verify the truth of the theorem, one does not need to read the 60,000+ lines of code of the formalized proof; it is enough to verify that the axioms of the system are correct, that the statement of the theorem is correct, and that the Coq engine reports a successful check, meaning that every step in the proof code is a logical consequence of the previous ones. You still need the computer, but the things to trust (the axioms, the system engine, the Type Theory that provides the theoretical framework) are not specific to the theorem at hand, and arguably thousands of researchers have checked and verified the axioms and the engine for reliability. The four-color theorem was used as a benchmark, a test example to prove the capabilities of the Coq system. The Feit-Thompson theorem in group theory was another example of a long and tedious proof that was formalized in Coq, removing any doubt that some typo or hidden mistake could ruin the Feit-Thompson proof. But more importantly, it demonstrated that formalized mathematics was capable of describing and verifying proofs of arbitrarily complex and abstract mathematics.
The reader interested in the details of the four-color theorem’s history and the main ideas of the proof can consult the references in the “Recommended readings” section of this newsletter. Instead of digging further into history, we will now re-read this story through the lens of Research Data, focusing on how it changed our view of mathematics.
The innovations used in the proofs of the four-color theorem prompted two major changes in the philosophical conception of mathematics and proof: first, the use of computers to assist in repetitive verification tasks, and second, the use of computers to assist in logical deductions.
The computer-assisted proofs revolution
When Appel and Haken announced their proof, they sparked many unusual reactions inside and outside the mathematical community. Instead of the cheer and joy of solving a century-old problem, many mathematicians showed unease and dissatisfaction with the fact that a computer was necessary to find and check thousands of cases.
Most of the community accepted the result as valid, but not everybody was happy. Many believed that a shorter and easier proof could be found by cleaning up Appel and Haken’s argument, and they hoped that, although computers were used to find the list of cases to check, once that list was found the verification could be done without computers. In a despicable episode [Wil16], Appel and Haken were once refused permission to speak at a university by the head of its mathematics department, on the grounds that their proof was completely inappropriate and that, at the same time, no professional mathematician would work on the problem anymore to find a satisfactory proof. Thus, it was argued, they had done more harm than good to the mathematical community; they did not deserve publicity, nor should students be exposed to these ideas.
Intellectuals also had their confrontations in public. In 1979, philosopher Thomas Tymoczko published a paper, The Four-Color Problem and Its Philosophical Significance [Tym79], in which he defended the idea that the four-color theorem was not proven in the traditional meaning of the word “prove”, and that either it should be considered unproven until a human could read and verify the proof, or the meaning of “proof” should be redefined in a much weaker sense. For Tymoczko, a proof must be convincing, surveyable, and formalizable. While all experts in the four-color theorem were eventually convinced, and the computer provided evidence that there was a formal argument proving the theorem, Tymoczko argued that the proof was not surveyable, meaning that no human could follow all the calculations and details. The mathematician Edward R. Swart replied to Tymoczko with the article The Philosophical Implications of the Four-Color Problem [Swa80], in which he defended the validity of Appel and Haken’s method, envisioning a future where computing tools would be integrated into mathematicians' practice.
The philosophical debate in the 70s and 80s also highlighted some social and generational aspects of the mathematical community. Older mathematicians were highly concerned that the part of the proof that depended on the computer could contain mistakes, or that the computer could have had a “glitch”. A younger generation of mathematicians could not believe that the hundreds of pages of hand-made computations and drawings were reliable. But most importantly, it was the lack of an easy verification procedure that concerned the whole community.
Today, the use of computers is standard practice among mathematicians: not only as a help with office tasks (exchanging emails, typesetting in LaTeX, publishing in journals or repositories…), but also as a core tool for mathematical research, including numerical algorithms, computational algebra, classification tasks, statistical databases, etc. It took many decades for the community to accept that using computers does not necessarily mean doing applied mathematics or being engineering-oriented. In the process, however, we learned that using computers requires good practices for handling the associated data, not least the FAIR principles for research data and the importance of open access for verifiability.
The lost and found proofs. A cautionary tale of Research Data.
The four-color theorem is paradigmatic of why a research data perspective is necessary in modern mathematics. This theorem put on the table, for the first time, philosophical and practical questions about what constitutes a proof, how much we can trust computers (and their programmers), and how large the social component of writing mathematical proofs is.
In some sense, the four-color theorem has been “lost” or “almost lost” several times. The first time was when Kempe’s proof was found erroneous. This highlights the importance of verifying proofs and the fact that many errors and gaps can be subtle and difficult to spot.
The second time is more difficult to associate with a precise event. When Appel and Haken’s proof appeared, it was verified by them (with the help of a computer) and, to some extent, by a relatively small group of mathematicians in their community. That community was quite competitive; a true race had taken place to be the first to provide a complete proof of the four-color theorem. Every expert in the problem had the opportunity to check Appel and Haken’s proof, by comparing the published calculations with their own and through direct exchanges with the two authors. All these experts agreed unanimously that the proof was valid, so the mathematical community at large was convinced of its validity, though some dissonant critics argued against that recognition. Then some years passed, and nobody else was interested in re-checking Appel and Haken’s algorithm. When a new generation of mathematicians (Robertson et al.) tried to replicate it, they found it unfeasible. They had no code to test, the algorithm was not entirely clear, and they would have had to enter hundreds of cases manually. The proof was effectively lost*.
Let us note that in the late 70s, computers were highly advanced laboratory equipment. Most mathematicians had no access to these devices and did not know how to operate them. Operating systems were in their infancy, and most programs could only run on a specific hardware architecture, depending on the brand or even the model (like the IBM 370 mainframe). The most reliable way of sharing a program was to describe the algorithm verbosely or with flow diagrams and leave to the recipient the task of implementing it on their own machine architecture. Appel and Haken published their proof originally in two parts [AH77a], [AHK77], and then provided two supplements [AH77b], [AH77c], with more than 400 pages of calculations, as computer outputs (the computer's output interface was not a screen but a teletype, which printed characters on paper). Keep in mind that this was before TeX and LaTeX, so all the text was set with a typewriter, with hand-written formulas and hand-drawn diagrams. The supplements were not published directly; instead, they were produced as microfiches, a medium consisting of microphotographs on film that could be read with an appropriate device with a magnifying lens and a light source, like a film slide. Only a very limited number of copies of these microfiches were deposited in some university libraries.
Appel and Haken faced a new situation when publishing their result. From a practical point of view, they wanted to share as much information as possible to make the proof complete (to be convincing, as surveyable as possible, and to demonstrate that it was formally flawless). They just had no repositories, no internet, and no practical way to share scientific code in a form that others could reuse. In today’s terms, the part of their proof that used computers met none of the FAIR principles. Of course, this judgment would be anachronistic and not fair (pun intended). It is precisely because of their pioneering work that the mathematical community started realizing that better standards were necessary.
It took Robertson et al. more than a year to develop a new proof, which improved on Appel and Haken’s in several respects (fewer cases to check, streamlined arguments). This time, they had some interoperable tools (e.g., the C programming language, which would compile on any personal computer) and exchange methods suited to code (websites, FTP), so they made all the source code fully open and available. In modern terms, we would say that, with respect to Appel and Haken’s data, they improved findability and accessibility by setting up the web and FTP servers, interoperability by using the C language, and reusability, as demonstrated by Gonthier's later reuse.
Today, we have better FAIR standards (indexing the data, hosting it at recognized repositories, etc.), but again, it would be anachronistic to blame them. Had the original web and FTP server that hosted the code disappeared without a trace, it would have been the third time that the proof was lost, and for the very mundane reason that nobody kept a copy. Their servers did disappear eventually, but copies certainly existed, and today the original files can be found, for instance, on arXiv, uploaded by the authors themselves in 2014 [RSST95]. This is a much more robust strategy for findability and accessibility and aligns with today’s standards.
Interestingly, Robertson et al. deliberately used the computer also for the parts of the argument that Appel and Haken had managed to describe by hand. If one needs to use the computer anyway, then why not use it for all tedious calculations, not only those exceeding human capabilities? In the end, the computer is less prone to mistakes caused by fatigue in a repetitive task.
Gonthier took that idea to the extreme: What if the whole proof could be set up to be checked by the computer, not only the repetitive verification of cases? What if a computer were also less error-prone in following a logical argument? When Gonthier started his work on the four-color theorem, the proof was certainly not lost (Robertson et al.’s proof was the basis for Gonthier’s), but it was still missing some FAIR characteristics it had never had. Verifiability, the close cousin of Reusability, was still a weak point of the theorem.
In the new millennium, when computers had become personal and the internet was transforming the whole of society, the old philosophical debate was back in the spotlight: what constitutes a proof, and how verifiable is a proof that is not entirely surveyable but is formalizable? This time, however, the question was not whether computers should have a role in mathematical proofs; the pressing issue was: what are the good standards and practices we should adopt for the role that computers certainly play in mathematics and all sciences?
The formalized mathematics revolution
Security in computing is a field that most people associate with cryptography to send secret messages, identity verification to prove who you are online, or data integrity to be sure that data is not corrupted by error or tampering. However, there is a branch of computer security that tries to guarantee that your algorithms (theoretically) always provide the intended result, with no edge cases or special inputs producing unexpected outputs. A related field of research is Type Theory, which has been in development since the 1960s. In programming languages, a “type” is a kind of data or variable, like an integer, a floating-point number, a character, or a string. Many programming languages detect type mismatches: a function that takes an integer as its argument will throw an error if a character is passed. But usually, type checking is not more sophisticated than these basic types. If a function, say, only accepts prime numbers as input, the machine should be able to verify that the passed argument is a prime number and not just any integer. Similarly, the system should verify that a function that claims to produce prime numbers does not produce other integers as output. From this starting point, one can develop a theory and prove that verifying types is equivalent to verifying logical propositions, and thus equivalent to proving abstract theorems. A function takes a hypothesis statement as input and produces a thesis statement as output, so verifying that types match when chaining functions is equivalent to checking that you can apply a lemma in a step of a theorem's proof. In 1989, Coq** was released, a software system and language for proving theorems based on Type Theory. This was the system used (and developed further) by Gonthier to formalize the four-color theorem.
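To make the “propositions as types” idea concrete, here is a minimal sketch in the Lean proof assistant (a system of the same family as Coq, mentioned below; the example is our own illustration and not taken from Gonthier's development). A function can demand a proof as part of its input, and proving an implication amounts to writing a function from the hypothesis to the conclusion:

```lean
-- The second argument is a proof that n is even; the type checker rejects
-- any call for which no such proof can be supplied.
def halveEven (n : Nat) (_h : n % 2 = 0) : Nat := n / 2

#eval halveEven 10 (by decide)   -- 5; `by decide` constructs the proof of 10 % 2 = 0
-- #eval halveEven 7 (by decide) -- rejected: there is no proof of 7 % 2 = 0

-- Propositions as types: from a proof that P implies Q and a proof of P,
-- a proof of Q is obtained by simple function application.
example (P Q : Prop) (hPQ : P → Q) (hP : P) : Q := hPQ hP
```

In a development like Gonthier's, the hypotheses and conclusions are the statements of the lemmas leading to the four-color theorem, and the proof checker's work is exactly this kind of type checking, scaled up to tens of thousands of lines.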
How could one be sure that Robertson et al.'s proof (or Appel and Haken's) did not miss any case in the enumeration? Or that no cases were redundant? Or that all the logical implications produced mechanically in the case verification process were correct? Since the proof was not surveyable by hand, there would always be a shadow of doubt. With the new formalized proof, that doubt dissipates. In this new scenario, one can still cast doubt on the Type Theory on which it is based (a mathematical theory largely made of theorems proven in the most traditional sense of the word, although some set-theoretical issues still cause itches), on the software implementation, or on the hardware it runs on. But all of these are global concerns, not specific to the four-color theorem. If we accept the Coq system as reliable, then the four-color theorem is true, since the formal proof by Gonthier passes the compilation check in Coq.
Certainly, Gonthier was not particularly concerned about the possibility of an error that Appel and Haken or Robertson et al. had missed. Instead, Gonthier’s goal was to show that any mathematical proof, no matter how abstract, lengthy, or complex, could be translated into a formal proof verifiable by a computer. The four-color theorem was a useful benchmark. In fact, a significant part of Gonthier’s work provided the foundation to properly define the objects in question. For instance, to be rigorous, planar graphs are not purely combinatorial objects; they concern intersections of lines in the plane, and hence require continuity and the topology of the real numbers. Building all these fundamental blocks meant enormous preparation work even before looking at Robertson et al.'s proof. The next major proof Gonthier formalized was the Feit-Thompson theorem (any finite group of odd order is solvable), whose proof takes 255 pages and was famously convoluted even for the specialists at the time it appeared in 1963. It was another benchmark theorem, which in this case required defining complex algebraic structures (groups, rings, modules, morphisms…), but it also shared some structures with the four-color theorem, so it was in some sense a natural next step.
Gonthier’s goal was fulfilled: formalized mathematics is today considered a mature branch of mathematics. Other theorem provers, such as Lean, have appeared, following the path started by Coq. Recently, such systems have been used on new research results, not only on formalizing long-established theorems: examples include the condensed mathematics project by Peter Scholze in 2021 and the polynomial Freiman-Ruzsa conjecture by Tim Gowers, Terence Tao, and others in 2023, both formalized in Lean.
To avoid losing this formalized proof of the four-color theorem in Coq in the future, it is now part of the standard packages of the Coq installation, has its own GitHub repository, and has a small team of maintainers that ensures that the proof compiles successfully with new releases of the Coq engine (see the interview with Yves Bertot in this newsletter).
The proofs of the four-color theorem have had direct continuations in related works. In 2021, Doczkal [Doc21] used Coq’s formal proof and the graph-theory infrastructure it provides, namely the hypermap structures devised by Gonthier, to formalize a related result in graph theory, Wagner’s theorem characterizing planar graphs. In a different direction, Steinberger [Ste10] produced a new version of Robertson et al.’s proof in 2009, in which he trades volume for simplicity: he gives a larger set of configurations, but all of them are simpler to verify, reducing the complexity of the algorithm and the rules to apply. Readers who look up the “four-color theorem” on arXiv may be surprised by the number of recent preprints on the topic.
Finally, the four-color theorem already has a prominent place in the history of mathematics. But the ultimate test will come with the passage of time, and there is still room for improvement. Maybe there is a smaller (optimal?) set of configurations. Maybe there is, after all, a proof that can be verified by hand. Maybe the current proofs can be explored visually and in a structured way, so that even if computers are needed, no one can claim that a human cannot verify the computer calculations. Maybe the formalized proof will be more thoroughly reused and integrated into a broader graph theory formalization program. Maybe in a few years, proving the four-color theorem will be a standard assignment for undergraduate math students, as part of an education that includes some programming and formalization of mathematics. Time will tell.
* We have been unable to find any source code from the time of Appel and Haken. Robertson et al. state that they would have needed to input many cases by hand to replicate Appel and Haken’s proof, and that, to their knowledge, no mathematician has made a full revision of all of Appel and Haken’s work. Maybe some math-history archeologist will find the source code among Appel's or Haken's personal effects. If any reader has further information, we would be happy to amend this text.
** Coq will be renamed as Rocq Prover.
References
- [AH77a] Appel, Kenneth; Haken, Wolfgang (1977), Every Planar Map is Four Colorable. I. Discharging, Illinois Journal of Mathematics, 21 (3): 429–490, https://doi.org/10.1215/ijm/1256049011, MR 0543792
- [AHK77] Appel, Kenneth; Haken, Wolfgang; Koch, John (1977), Every Planar Map is Four Colorable. II. Reducibility, Illinois Journal of Mathematics, 21 (3): 491–567, https://doi.org/10.1215/ijm/1256049012, MR 0543793
- [AH77b] K. Appel, W. Haken. Microfiche supplement to Every planar map is four colorable. Part I and Part II. Illinois J. Math. 21(3): (September 1977). https://doi.org/10.1215/ijm/1256049023
- [AH77c] K. Appel, W. Haken. Microfiche supplement to Every planar map is four colorable. Illinois J. Math. 21(3): (September 1977). https://doi.org/10.1215/ijm/1256049024
- [AH77d] Appel, Kenneth; Haken, Wolfgang (October 1977), Solution of the Four Color Map Problem, Scientific American, vol. 237, no. 4, pp. 108–121, Bibcode:1977SciAm.237d.108A, https://doi.org/10.1038/scientificamerican1077-108
- [AH89] Appel, Kenneth; Haken, Wolfgang (1989), Every Planar Map is Four-Colorable, Contemporary Mathematics, vol. 98, With the collaboration of J. Koch., Providence, RI: American Mathematical Society, https://doi.org/10.1090/conm/098, ISBN 0-8218-5103-9, MR 1025335, S2CID 8735627
- [Doc21] Doczkal, Christian. A Variant of Wagner’s Theorem Based on Combinatorial Hypermaps. In 12th International Conference on Interactive Theorem Proving (ITP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 193, pp. 17:1-17:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.ITP.2021.17
- [Gon05] Gonthier, Georges. A computer-checked proof of the Four Color Theorem. Inria. 2023 (first version 2005). HAL Id: hal-04034866, https://inria.hal.science/hal-04034866
- [Gon08] Gonthier, Georges (2008), Formal Proof—The Four-Color Theorem, Notices of the American Mathematical Society, 55 (11): 1382–1393, MR 2463991. https://www.ams.org/notices/200811/tx081101382p.pdf
- [Kem79] Kempe, A. B. On the Geographical Problem of the Four Colours. American Journal of Mathematics, Vol. 2, No. 3 (Sep., 1879), pp. 193-200 (9 pages) https://doi.org/10.2307/2369235
- [McM20] McMullen, Chris. The Four-Color Theorem and Basic Graph Theory. Zishka Publishing, 2020. ISBN 1941691099, 9781941691090
- [Ric23] Richeson, David S. The Colorful Problem That Has Long Frustrated Mathematicians. Quanta Magazine, March 29, 2023.
- [RSST95] Robertson, Neil; Sanders, Daniel P.; Seymour, Paul; Thomas, Robin (1995, arXiv 2014), Discharging cartwheels. https://doi.org/10.48550/arXiv.1401.6485
- [RSST96] Robertson, Neil; Sanders, Daniel P.; Seymour, Paul; Thomas, Robin (1996), Efficiently four-coloring planar graphs, Proceedings of the 28th ACM Symposium on Theory of Computing (STOC 1996), pp. 571–575, https://doi.org/10.1145/237814.238005, MR 1427555, S2CID 14962541
- [Ste10] Steinberger, John. An unavoidable set of D-reducible configurations. Trans. Amer. Math. Soc. 362 (2010), 6633-6661. https://doi.org/10.48550/arXiv.0905.0043
- [Swa80] Swart, Edward R. The Philosophical Implications of the Four-Color Problem, The American Mathematical Monthly, 87:9, 697-707, (1980) https://doi.org/10.1080/00029890.1980.11995128
- [Tym79] Tymoczko, Thomas. The Four-Color Problem and Its Philosophical Significance. The Journal of Philosophy, Vol. 76, No. 2 (Feb., 1979), pp. 57-83. https://doi.org/10.2307/2025976
- [Wil14] Wilson, Robin. Four Colors Suffice: How the Map Problem Was Solved - Revised Color Edition, Princeton: Princeton University Press, 2014. https://doi.org/10.1515/9780691237565
- [Wil16] Wilson, Robin. Wolfgang Haken and the Four-Color Problem. Illinois Journal of Mathematics 60:1 2016, 149–178. https://doi.org/10.1215/ijm/1498032028. Available online also in Celebratio Mathematica https://celebratio.org/Appel_KI/article/796/
In Conversation with Yves Bertot
Daniel Ramos speaks with Yves Bertot from the Inria Center at Université Côte d’Azur about formal proofs, what it means to maintain a proof repository, and how the FAIR data principles apply to formal proofs. In a fictional scenario, they also discuss whether AI could create formal proofs automatically.
Second NFDI Berlin-Brandenburg Network Meeting:
Ontologies and Knowledge Graphs
It was a surprise for us to learn that within Berlin and Brandenburg alone, more than 120 scientific and other institutions are involved in at least 25 out of the 27 NFDI consortia. Bringing everyone together to set up a regional network for collaboration on overarching topics was the main reason MaRDI initiated the first NFDI in Berlin-Brandenburg (NFDI_BB) meeting held on October 12th, 2023. It was at this meeting that “Ontologies and Knowledge Graphs” was quickly identified as a common topic of interest; especially since it has gained in importance due to the recent launch of the basic services (Base4NFDI) and the sub-project KGI4NFDI. Naturally, this was chosen as the subject of the second NFDI_BB workshop, held on July 11th, 2024, and hosted at the Weierstrass Institute for Applied Analysis and Stochastics (WIAS) in Berlin. A total of 35 participants, coming from 16 different consortia belonging to engineering sciences, humanities and social sciences, life sciences, and natural sciences, attended the meeting.
We started off with an interactive online survey in which the participants were asked several questions; one of them built a word cloud of important terms related to ontologies and knowledge graphs. Wikidata, Wikibase, SPARQL, and metadata were the terms mentioned most frequently. During the main part of the morning session, the topic “Ontologies and Knowledge Graphs” was presented in connection with the activities of the NFDI in four talks by Tabea Tietz, Lozana Rossenova, Olaf Simons, and Daniel Mietchen.
The afternoon session began with five short presentations on specific ontology concepts of individual NFDI consortia (Sabine von Mering - NFDI4Biodiversity, Rolf Krahl - DAPHNE4NFDI, Sonja Schimmler - NFDI4DataScience, Frank von Hagel - NFDI4Objects, and Aurela Shehu - MaRDI). After these presentations, the audience was polled on their current use of ontologies (7 out of 24), on possible or planned development of ontologies (9 out of 24), and on the role of artificial intelligence (AI). Interestingly, 12 out of 25 anticipate that in 10 years AI will use knowledge graphs, and 5 out of 25 expect that it will even create them!
In the final discussion round, it became clear that this topic was very interesting to the participants and has the potential for cross-consortia collaboration. However, domain-specific ontologies will evolve at different speeds, and this needs to be taken into account. The benefit of such subject-specific regional meetings resonated with everyone, and interest was high in continuing the series of NFDI_BB meetings, next on the topic of Research Data Management in the curriculum.
If you want to be informed of the next meeting, subscribe to the NFDI_BB mailing list:
https://www.listserv.dfn.de/sympa/info/nfdi_bb.
Data❤️Quest at MEGA
MEGA is the acronym for Effective Methods in Algebraic Geometry. It is a series of biennial international conferences devoted to computational and application aspects of Algebraic Geometry. This year, it took place from July 29th to August 2nd at the Max Planck Institute for Mathematics in the Sciences and the University of Leipzig.
During the five days of the conference, almost 200 participants attended plenary and parallel sessions on various topics. All abstracts and slides can be found online. Part of the program was a poster session at the University of Leipzig, Paulinum. MaRDI offered the game Data❤️Quest for all participants.
NFDI4Objects Berlin-Brandenburg
The first NFDI4Objects meets Friends network meeting will take place on November 7, 2024, in Berlin. This event offers ideas and networking on research data management (RDM) focused on archaeological and object-based data. It will showcase projects, discuss helpdesks, quality assurance, and RDM training.
More information:
- in German
NFDI Science Slam 2024 in Berlin
In cooperation with Berlin Science Week, NFDI4DS will host its annual Science Slam this October at the Weizenbaum Institute in Berlin. Researchers present their work in a simple, humorous way, inviting the audience to laugh, cry, and engage with NFDI and its consortia. This year’s motto is “Crossing Boundaries.”
More information:
- in English
GHGA Symposium 2024 in Heidelberg
The GHGA (German Human Genome-Phenome Archive) will hold a public symposium on 15 October 2024 in Heidelberg, focusing on topics related to enabling data sharing for health research in Germany and Europe. The event will feature presentations from GHGA members as well as renowned experts from various European biomedical data initiatives and projects, including the European Genomic Data Infrastructure (GDI), the German National Cohort (NAKO), NFDI4Health and the genome sequencing model project in Germany.
More information:
- in English
Robin Wilson's book Four Colors Suffice: How the Map Problem Was Solved offers an authoritative account of the four-color theorem's history and the main mathematical ideas involved. The book is suitable for undergraduates. The 2014 edition includes colored images (the previous edition offered only black-and-white images).
A more casual (and much shorter) read by the same author is his article Wolfgang Haken and the Four-Color Problem, published in 2016.
Quanta Magazine published the public outreach article The Colorful Problem That Has Long Frustrated Mathematicians by David S. Richeson in 2023. Quanta also produced a short video on the topic featuring the author.
The Numberphile YouTube channel published the video The Four Color Map Theorem, featuring James Grime, explaining the topic in layman's terms.
Chris McMullen's book "The Four-Color Theorem and Basic Graph Theory", published in 2020 by Zishka Publishing (ISBN 1941691099, 9781941691090), offers high-school-level puzzles and exercises related to the four-color theorem. It also includes an elementary "proof" of the theorem (not actually a complete proof, but one that gives insight).
Mathigon (an online platform that combines an interactive textbook and a virtual personal tutor) offers an interactive course on graphs and networks, which includes a chapter on Map Coloring.
Welcome to the summer MaRDI Newsletter 2024! On the longest day of the year, we discuss “I have no data”, a statement many mathematicians would subscribe to. Our key article answers the most frequent questions and sketches a brighter future for everyone, one where, for instance, we believe you can search for and find theorems rather than articles. Is this data? We believe it is.
by Ariel Cotton, licensed under CC BY-SA 4.0.
If you could select just one "item" (such as an equation, a geometrical object, etc.) to symbolize the entirety of mathematics, what would you choose?
Choose your representative here!
Also, check out our Data Date Interview with Martin Grötschel, workshop reports, and the first exhibition of the MaRDI Station on the ship MS Wissenschaft.
“I’m a mathematician and I use no data. Change my mind.”
At MaRDI, we continuously communicate the project's goals and mission to a general audience of mathematicians. We describe the importance of data in modern mathematics and the FAIR principles, and we show examples of the services that MaRDI will provide to some key communities represented in MaRDI’s Task Areas: computational algebra, numerical analysis, statistics, and interdisciplinary mathematics.
However, our audience often consists of mathematicians working in other areas of mathematics (topology, number theory, harmonic analysis, logic…) who do not consider themselves heavy data users. In fact, “I have no data” is a statement that many mathematicians would subscribe to.
In this article, we transcribe fictional (but realistic) questions and answers between a “no-data” mathematician and a “research data apostle”.
I do mathematics in the “traditional” way. I read articles and books, discuss with collaborators, think about a problem, and eventually, write and publish papers. I use no data!
Maybe we need to clarify the terms. We call “Research Data” any information collected, observed, generated, or created to validate original research findings.
If you think of a large database of experimental records collected for statistical analysis, or if you think of the source code of a program, yes, these can be examples of research data. However, there are many other types of research data.
You probably use LaTeX to write your articles and BibTeX to manage your bibliography references. You probably use zbMATH or MathSciNet to search the literature and arXiv to discover new papers or to publish your preprints. Your LaTeX source files and your bibliography lists are examples of research data. Without a data management mindset, you wouldn’t have services like zbMATH or arXiv.
But there is more data than electronic manuscripts in your research. If you find a classification of some mathematical objects, that list is research data. If you make a visualization of such objects, that is research data. Every theorem you state and prove can be considered an independent piece of abstract research data. If you have your own workflow to collect, process, analyze, and report some scientific data, that workflow is in itself a valid piece of research data.
Many mathematical objects (functions, polytopes, groups) have properties that you invoke in your theorems, for instance, “since the integral of this function can be bounded by a constant C<1…”. Such properties are collected in data repositories (the DLMF, etc.) that provide consistent and unified references for these data.
You should think of research data as any piece of information that can be tagged, processed, and built upon to create knowledge in a research field. This perspective is useful for building and using new technologies and infrastructure that every mathematician can benefit from.
I think you say “everything is data” to give the impression that MaRDI and other Research Data projects are very important… but how does your “data definition” affect me?
It is not a mere definition for the sake of discussion. We believe there is a new research data culture in which mathematicians from all fields should participate. A research data culture is a way to think about how we organize and structure all the human knowledge about mathematics, how we store and retrieve that knowledge, the technical infrastructure we need for that, and ultimately, how we make research easier and more efficient.
Imagine you are looking for some information that you need in your research. When you look for a result, the “unit of data” is a theorem (probably together with its proof, a bibliographic reference, an authorship…), not an article or a book as a whole. So it is more useful to consider that your data is made up of theorems rather than articles.
Then, your theorems will fit into a greater theory in your field. Sure, you can explain this in your article and link to references in your bibliography, but you will probably not link to specific theorems, you will likely miss some relevant references, and you certainly cannot link to future works retroactively. By thinking of your results as data and allowing knowledge infrastructures to index and process them, your results will be put in a better context for others to find, access, and reuse. Your results will reference others, and others will reference yours. Furthermore, they will better withstand the evolution and advances of the field.
I thought MaRDI was about building infrastructure to manage big databases and code projects. Since I don’t use databases or program, why should I be interested in MaRDI?
MaRDI is much more than that. It is true that mathematicians working with these types of data (large databases, large source-code projects, etc.) need a reliable infrastructure to host and share data, standards to make data interoperable, and ways to work collaboratively on large projects. MaRDI addresses these needs by setting up task groups that develop the necessary infrastructure in each domain (for instance, in computer algebra or statistics).
But as we mentioned above, there are many other types of data: classifications of mathematical objects, literature (books and articles), visualizations, documentation of workflows, etc. MaRDI takes an integral approach to research data and addresses the needs of the mathematical community as a whole.
For instance, MaRDI grounds its philosophy in the FAIR principles. The acronym FAIR means that research data should be Findable, Accessible, Interoperable, and Reusable (read our articles on each of these principles applied to mathematical research). These principles are now widely accepted as the gold standard for research data across all scientific disciplines, and they are the foundation for all other NFDI consortia in Germany and for other international research data programs.
Following FAIR principles is relevant for all researchers. Your results (your data) should be findable by other researchers, which implies caring about digital identifiers and indexing services; delegating everything to third-party search engines and blindly trusting them is not a wise strategy. Your research should be accessible, meaning you should be concerned about publication models, the completeness of your data, and your metadata structure. Your data should be interoperable, meaning you should follow common practices in your community to exchange data; at the very least, this could mean following common notation and conventions for your results so they can be translated across the literature with minimal context adaptations. Finally, you should always keep in mind that the most important FAIR principle is reusability. Reusability is the basis of verifiability. Document your thought processes. Sharing insights is as important as sharing facts. Research that is not reused is barren.
MaRDI aims to spread this research data culture by raising awareness of these principles and encouraging discussions to devise best practices or address challenges in concrete, practical cases. Since these discussions affect all mathematicians, it is a good reason to be interested in MaRDI.
Furthermore, MaRDI strives to develop services that best help mathematicians. Aside from the specific services developed for the aforementioned task areas, MaRDI addresses all mathematicians with its central MaRDI Portal, a knowledge base for managing mathematical knowledge from a research data perspective. MaRDI also builds bridges to communities that can impact mathematics and the research data paradigm, like the formalized mathematics community, which is taking an increasing role in mathematical fields other than logic or theoretical computer science.
Why do you talk about political / philosophical / ethical questions? Shouldn’t MaRDI be just a technical project?
To build an infrastructure for the future of mathematical research data, planning must be accompanied by a serious reflection on the guiding principles. The FAIR principles we mentioned before are not a technical specification of concrete implementations but a set of philosophical rules that researchers should apply to their research data. The implementation and the guiding principles cannot be independent.
MaRDI encourages debate and calls on researchers to decide on challenging situations concerning research data. For instance, which are the best publication practices? Should researchers publish in traditional journals? In Open Access journals? Should they also publish a version (identical or preliminary to the final one) on preprint services such as arXiv? Should the pay-to-publish practice be accepted? How can we ensure publication quality in that case? These questions are one particular topic related to handling research data; thus, they fall within MaRDI's area of interest.
MaRDI will not dictate absolute answers to these questions, but it will try to stimulate and facilitate discussion about these delicate topics in the community. It will promote principles and common grounds that the entire community of mathematicians can agree on. Then, MaRDI will help build the necessary infrastructure to put these principles into practice.
MaRDI is neither a regulating agency nor a company offering products and solutions. MaRDI is a community of mathematicians. To be more precise, MaRDI is a set of different communities of mathematicians (computer algebra, numerical analysis, statistics and machine learning, interdisciplinary mathematics) that collaborate to create a common infrastructure and to promote a culture for mathematical research data. MaRDI is scoped to Germany, but it has a clear universal vocation; other communities of mathematicians from anywhere may complement MaRDI in the future. Thus, MaRDI is a technical project when its members, researchers who face a specific challenge, define technical specifications for the infrastructure to be built. But MaRDI is at all times a social and philosophical project, since its members endeavor to build the tools for the mathematical research of the future.
So, should I rewrite my papers with “data” in mind?
Research articles and books are and will probably always be the primary means of communicating results between researchers. You should write your papers thinking of your peer mathematicians who will read them. Your research paper is the first place where some theorem is proved. It gives you authorship credit, and as such, it establishes a new frontier of mathematical knowledge. But at the same time, your papers can contain several types of data that can be extracted, processed automatically, and potentially included in other knowledge bases.
Imagine your paper proves a classification result about all manifolds of dimension 6 that satisfy your favorite set of properties. What about other dimensions? What about slightly different properties? Your result fits in a broader picture to which many mathematicians contribute. At some point, it will make sense to collect all these results somewhere to have a more complete presentation. This can be a survey article/book, but sometimes it is better to have it in the form of a catalog. In this case, it would be a list of all manifolds classified by their invariants or by some characteristics. This catalog would serve as a general index, the place to look up what is known about your favorite manifolds, and from this catalog, you can get the references to the original articles.
We can go further and ask whether a catalog is the best information structure we can aim for. At MaRDI, we support knowledge graphs as a way to represent all mathematical knowledge. In a knowledge graph, every node is a piece of information (a manifold, a list of manifolds, an author, an article, an algorithm, a database, a theorem…), and every edge is a knowledge relationship (this list contains this manifold, this manifold is studied in this article, this article is written by this author…).
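As a toy illustration of this structure, here is a minimal sketch in Python (all entities and relations are invented for the example; the actual MaRDI Portal uses a richer, Wikibase-style data model). A knowledge graph can be stored as subject-relation-object triples and queried by simple traversal:

```python
# Each triple is (subject, relation, object). All entries are invented examples.
triples = [
    ("Theorem_X", "is_proved_in", "Article_A"),
    ("Article_A", "written_by", "Author_Y"),
    ("Manifold_M", "is_studied_in", "Article_A"),
    ("List_L", "contains", "Manifold_M"),
    ("Algorithm_Z", "is_described_in", "Article_A"),
]

def objects(subject, relation):
    """All objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

def subjects(relation, obj):
    """All subjects linked to `obj` by `relation`."""
    return [s for s, r, o in triples if r == relation and o == obj]

# "Which article studies Manifold_M, and who wrote it?"
for article in objects("Manifold_M", "is_studied_in"):
    print(article, "->", objects(article, "written_by"))   # Article_A -> ['Author_Y']

# "Which algorithms are described in Article_A?"
print(subjects("is_described_in", "Article_A"))            # ['Algorithm_Z']
```

Production systems answer such questions with query languages such as SPARQL over much larger graphs, but the underlying idea is the same: every answer is a path through the graph.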
You can help build this knowledge graph of all mathematics by thinking about and preparing your research data for inclusion in it.
I tried the MaRDI Portal to search for one of my research topics. It returned several article references that look very much like zbMATH Open. Why do we need another search engine?
First, keep in mind that the Portal is still under development. Second, it is no surprise that you obtained article references that look like zbMATH Open: that is exactly where they come from. MaRDI does not intend to substitute for zbMATH or any other catalog or database; instead, it aims to integrate them in a single place, with a richer structure.
zbMATH is a catalog; the MaRDI Portal is a knowledge graph. The MaRDI knowledge graph already includes (partially) the zbMATH catalog, the swMATH software catalog, the Digital Library of Mathematical Functions (DLMF), the Comprehensive R Archive Network (CRAN), and the polyDB database for discrete geometric objects. Eventually, it will also include other sources, like arXiv and more. The MaRDI knowledge graph imports the entries of these sources and gives them a structure in the knowledge graph. Some links in the graph are already provided by the sources; for example, an article record points to the other articles cited in its bibliography. A challenge for the MaRDI KG is to populate many more links between different parts of the graph, like “this R library uses an algorithm described in this article”.
Imagine this future: You learn about a new topic by reading a survey, attending a conference, or following a reference, and you think it could be useful for your research. With a few queries, you find everything published in that research direction. You can also find which researchers, universities, or research institutes have people working in that field, in case you want to get in contact. You have easy and instant access to all these publications. You query for some general information that is scattered across many publications (e.g., what is known about my favorite manifolds in any dimension) and get answers that span all the relevant literature. Refining your query, you get more accurate results pointing to the specific theorems that are relevant to you. Results found by automatic computation (theorems, but also examples, lists, visualizations…) come with code that you can run and verify easily in a virtual machine. Mathematical algorithms that can be used as pure tools to solve a concrete problem can be found and plugged into any software project. Databases and lists of mathematical objects are linked to publications, and all results are verifiable (maybe even with a formalized mathematics appendix). The knowledge graph gives you an accurate snapshot of the current landscape of mathematical knowledge, and rich connections arise between different fields. You can rely on the knowledge graph not only as a support tool to fetch references but also as your main tool to learn and to contribute to mathematical research. This future is not here yet, but it is a driving force for those who build MaRDI.
In Conversation with Martin Grötschel
Reflecting on his extensive career in applying mathematics, Martin Grötschel provides both a retrospective on how mathematics has shifted towards a more data-driven approach and a prospective on what the future of the field might hold.
MaRDI on board the German Science Boat
On May 14, 2024, the ship MS Wissenschaft embarked on its annual tour. During the summer months, the floating science center usually visits more than 30 cities in Germany and Austria. The theme of the exhibition is based on the respective science year in Germany. The theme of this year’s tour is freedom. Adolescents, families, and especially children are invited to visit the interactive exhibition. Admission is free of charge.
MaRDI showcases the cooperative multi-player game “Citizen Quest: Together for a freer world”. Players navigate a city environment, where citizens are confronted with dilemmas and issues related to mathematical research data and various other facets of freedom. The adventure begins aboard a mathematical vessel recently docked in the city’s harbor. Here, advocates of scientific freedom promote the sharing of research data in a FAIR (findable, accessible, interoperable, reusable) manner. The users meet different characters and help them with various quests: for example, helping a young programmer who wants to make his newly developed computer game about his dog accessible to his friends, debunking a fake news story, learning about ethical aspects of AI surveillance, and helping researchers who developed a tool to identify dinosaur footprints obtain reusable test data. Many small quests need to be solved!
Workshop "Bring Your Own (Mathematical) Model"
On Thursday, 02.05.2024, MaRDI team members hosted the "Introduction to the preparation of mathematical models for the integration in MaRDI's MathModDB (Database of Mathematical Models)" workshop within the framework of the Software & Data Seminar series at Weierstrass-Institute for Applied Analysis and Stochastics (WIAS), Berlin. The workshop focused on templates developed recently by MaRDI's "Cooperation with other disciplines" task area, the templates facilitating the process for researchers wishing to add mathematical models to the MathModDB. The templates are written as Markdown files and designed to have a low barrier to entry allowing users with little to no experience to get started easily while guaranteeing the important details of the mathematical model are gathered. The filled-in templates would then serve as a basis for a further semantically annotated standardized description of the model employing the MathModDB ontology. These semantically enriched models can then be added to the MathModDB knowledge graph (database of mathematical models) making them findable and accessible.
The workshop consisted of two parts: a short introductory talk followed by a hands-on exercise session. The talk gave the participants an introduction to the FAIR principles and to the difficulties of making mathematical models FAIR. Next, MaRDI's approach to that problem was explained which is based on ontologies and knowledge graphs. In preparation for the hands-on exercise session, the MarkDown template was presented and the audience was shown an example of how to use it. During the exercise session, five WIAS researchers with different scientific backgrounds filled in templates with their mathematical models. The topics of the mathematical models included Maxwell equations, thin-film equations, a poro-visco-elastic system, a statistical model about mortality rates, and a least-cost path analysis in archeology. The templates filled-in by the participants will be integrated into the MathModDB knowledge graph by MaRDI experts in the near future.
Overall, the workshop was successful in raising awareness of MaRDI and the FAIR principles while directly engaging the participants in providing information about the mathematical models they use. There were lively discussions not only about technical details of the templates and possible further improvements, but also about more general concepts and features that would, for example, improve the reusability of model descriptions. MaRDI's MathModDB team plans to develop the template further based on the valuable input received during the workshop and is looking forward to carrying out future field tests.
Workshop on Scientific Computing
The MaRDI "Scientific Computing" task area focuses on implementing the FAIR principles for research data and software in scientific computing. The second edition of their workshop (Oct 16 – 18, 2024, in Magdeburg) aims to unite researchers to discuss the FAIRness of their data, featuring presentations, keynote talks, and discussions on topics like knowledge graphs, research software, benchmarks, workflow descriptions, numerical experiment reproduction, and research data management.
More information:
- in English
EOSC Symposium 2024
The symposium will take place from 21 to 23 October 2024 in Berlin. It is a key networking and idea exchange event for policymakers, funders, and representatives of the EOSC ecosystem. The symposium will feature a comprehensive program, including sessions on the EOSC Tripartite Partnership, collaborations with the German National Data Infrastructure (NFDI), and a co-located invitation-only NFDI event.
More information:
- in English
From proof to library shelf
This workshop on Research Data Management for Mathematics will take place at MPI MiS in Leipzig from November 6 to 8, 2024. It will feature talks, hands-on sessions, and a Barcamp to discuss mathematical research data, present ideas and services, and network.
More information:
- in German
The review Making Mathematical Research Data FAIR: A Technology Overview by Tim Conrad, Eloi Ferrer, Daniel Mietchen, Larissa Pusch, Johannes Stegmuller, and Moritz Schubotz surveys existing data repositories and portals focusing on mathematical research data.
In the paper A FAIR File Format for Mathematical Software, Antony Della Vecchia, Michael Joswig, and Benjamin Lorenz introduce a JSON-based file format tailored for computer algebra computations, initially integrated into the OSCAR system. They explore broader applications beyond algebraic contexts, highlighting the format's adaptability across diverse computational domains.
The video FAIR data principles in NFDI by NFDI4Cat provides an in-depth exploration of research data management tools and FAIR data principles. The video also examines how these tools are applied in experimental setups. Whether you're a researcher, data enthusiast, or simply curious about the future of data management, this video will provide valuable insights and actionable takeaways.
The position paper A Vision for Data Management Plans in the NFDI by Katja Diederichs, Celia Krause, Marina Lemaire, Marco Reidelbach, and Jürgen Windeck envisions an expanded role for Data management plans within Germany's National Research Data Infrastructure (NFDI), proposing their integration into a service architecture to enhance research data management practices.
Welcome, dear reader, to this year's first MaRDI newsletter. We start off with a look at contributor roles. Or for the working mathematician: who does what (how, where, and when) in a collaborative project?
Our style of doing mathematics, verifying statements using logic, has remained largely unchanged, yet in the past fifty years it has acquired a variety of possible new inputs and tools. Compare our past newsletters, and especially the past key article on mathematical research data, to see just a few. Mathematicians nowadays attend conferences and discuss with colleagues, just as they did in the past, but they also meet online, use shared documents on platforms for LaTeX editing or git, and consult online databases like the Small Groups Library or the Online Encyclopedia of Integer Sequences. They program small scripts to solve problems or find counterexamples, ask the computer to do calculations they know are cumbersome but simple by hand, write project reports and funding proposals, and much more! Our main article will shed light on how doing mathematics has changed over the past decades, showcasing tools supporting our work these days and untangling roles in collaborative and interdisciplinary projects. For a sneak preview, have a look at the illustration below. Whom can you spot doing what?
by Ariel Cotton, licensed under CC BY-SA 4.0.
Which roles did you have when writing a research paper? In our survey, you can mark all roles you ever had. For more information about these roles see the Contributor Roles Taxonomy.
We plan to report on the results in a future newsletter.
In this issue of the newsletter, we also report from math software workshops, Love Data Week, and our friends around the NFDI.
A Modern Toolbox for Mathematics Researchers
What mathematicians need
It is often said that mathematicians only need pen and paper to do their job. Some elaborate on it as a joke by saying that a trash bin is also necessary, unlike the case of philosophers. Other quotes involve the need for coffee as fuel to feed the theorem-producing devices that are mathematicians. Jokes aside, it is true that mathematics research requires, in general, a relatively small infrastructure compared to other, more experimental research fields. However, two conditions are universally needed for research. Firstly, researchers need access to prior knowledge. This is addressed by creating publications and specialized literature and by collecting resources in libraries and repositories to offer access to that knowledge. Secondly, researchers need to interact with other researchers. This is why researchers gather in university departments and meet at conferences. Many mathematicians love chalk and blackboard. While this is another method of writing, it serves an exchange purpose: it allows two or more people, or a small audience, to think simultaneously about the same topic.
These needs (literature and exchange spaces) are universal, have been unchanged for centuries, and will remain so in the future. The basic toolset (pen and paper, books and articles, university departments, conferences, chalk and blackboard…) will likely stay with us for a long time. However, the work and practice of science researchers in general, and mathematics researchers in particular, evolves with society and technology. Today’s researchers require other specialized tools to address specific needs in contemporary research. They use digital means to have practical access to the literature and to communicate quickly and efficiently with colleagues; they use the computing power of machines to explore new fields of math; they use management tools to handle large amounts of data and to coordinate distributed teams. In this article, we will walk through some of the practical tools that changed the practice of mathematics research at some point in history. Some of these tools are technological changes that impacted all of society, such as the arrival of the web and the information era. Some are a consequence of changes in the way the mathematics field evolves, such as the increasingly data-driven research in mathematics that MaRDI aims to help support. We will also review some initiatives that try to change current common practices (the CRediT system for attributing authorship), and finally, we will speculate about some tools that may one day become a daily resource for mathematicians (the impact of formalized mathematics on mathematical practice).
From the big savants to an army of experts
We could start revisiting history with the Academy of Athens or the Library of Alexandria as “tools and infrastructure” for mathematicians in ancient times. Instead, we will jump directly to the 17th and 18th centuries, with some giant figures of mathematics history like Descartes, Newton, Leibniz, or Gauß. They were multidisciplinary scientists making breakthroughs in mathematics, physics, applied sciences, engineering, and even beyond, as in philosophy. Mathematics at the time was a cross-pollinating endeavor, in which physics or engineering problems motivated mathematics to advance, and math moved the understanding of applied fields forward. The community of these big savants was relatively small, and they mostly knew about each other. They mainly used Latin as a professional communication language, since it was the cultivated language learned in every country. They maintained correspondence by letter with one another and generally published their works as carefully curated volumes, since publication and distribution were costly processes. Interestingly, it was also in the 17th century that the first scientific journals appeared: the Philosophical Transactions of the Royal Society and the Journal des Sçavans of the French Academy of Sciences (both started around 1665). These were naturally not devoted only to mathematics or to a specific science, but to a very inclusive notion of science and culture. Still, the article format quickly became the primary tool for scientific communication.
Fast-forward a century, and in the 1800s the practice of scientists changed significantly. Mathematics consolidated as a separate branch, where most mathematicians researched only mathematics. Applications still inspired mathematics, and mathematics still helped to solve application problems, but by this century most scientists focused on contributing to one aspect or another. This was a process of specialization, in which the “savants” capable of contributing to many fields were replaced by experts with more profound knowledge of a concrete, narrower field. In this century we start finding exclusively mathematical journals, like the Journal für die reine und angewandte Mathematik (Crelle’s Journal, 1826). By the end of the century (and the beginning of the 20th century), the specialization process that branched science into mathematics, physics, chemistry, engineering, etc. also reached each science in particular, branching mathematics into fields such as geometry, analysis, algebra, applied math, etc. Figures like Poincaré and Hilbert are classically credited as some of the last universal mathematicians, capable of contributing significantly to almost all fields of mathematics.
During the 20th century, the multiplication of universities and researchers due to broader access to higher education brought much more specialized research communities. Scientific journals proliferated accordingly, and the publication rate of articles outpaced the classical book format. With so many articles came the need for new bibliographic tools like reviewing catalogs (Zentralblatt MATH and Mathematical Reviews date back to 1931 and 1940, respectively) and other bibliometric tools (impact factors started to be calculated in 1975). These reviewing catalogs have been, for decades, a primary means for many mathematicians to disseminate and discover what was new in the research community. Their (now coordinated) Mathematics Subject Classification also brought a much-needed taxonomy to the growing family tree of mathematics branches. The conference, in some cases with many international participants, also consolidated as a scientific format, a way to disseminate results, and a measure of the prestige of institutions.
The computing revolution
The next period we will consider starts with the arrival of computers, later accelerated by the addition of the Internet and the web. Computers have impacted the practice of mathematics on two levels. On the one hand, computers seen purely as computing machines have opened a new research field in itself, computer science. Many mathematicians, physicists, and engineers turned their attention to computer science in its early days. In particular, mathematicians started exploring algorithms and branches of mathematics not reachable before computing power was available, for instance numerical algorithms, chaos and dynamical systems, computer algebra, and statistical analysis. These new fields of mathematics have developed specific computer tools in the form of programming languages (Julia, R, …), libraries, computer algebra systems (OSCAR, Sage, Singular, Maple, Mathematica…), and many other frameworks that are now established as essential tools for the daily practice of these mathematicians.
On the other hand, computers have impacted mathematics as they have impacted every other information-handling job: as office automation. Computers help us to manage documents, create and edit texts, share documents, etc. One of the earliest computer tools that most profoundly impacted mathematicians' lives has been the TeX typesetting system. Famously, TeX was created by computer scientist Donald Knuth over more than a decade (1978 - 1989) to typeset his The Art of Computer Programming. However, many use the more popular version LaTeX (or “TeX with macros, ready to use”) released by Leslie Lamport in 1984. Before TeX/LaTeX, including formulas in a text meant either treating them as images (that someone had to draw manually into the final document or engrave into printing plates) or composing them semi-manually from templates of physical type. This was a tedious process that was only carried out for printing finished documents, not for drafts or early versions to share with colleagues. With TeX, mathematicians (and physicists, engineers…) could finally describe formulas as they are intended to be displayed and have them processed seamlessly with the rest of the text. This had an impact on the speed and accuracy of the publication (printing) process, but also on a new front that arrived just in time: online sharing.
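To give a flavor of what this looks like in practice, here is a minimal, self-contained LaTeX example (our own illustration, not taken from any particular paper): a few lines of plain-text markup that a TeX engine turns into a typeset formula.

    % minimal.tex -- a complete LaTeX document with one displayed formula
    \documentclass{article}
    \begin{document}
    The roots of $ax^2 + bx + c = 0$ are given by
    \[
      x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
    \]
    \end{document}

Running this file through a TeX engine (for example pdflatex) produces the typeset formula, and the same source can be shared, edited, and versioned like any other plain-text file.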
The early history of the internet includes the first protocols for connecting computers and the first military computer networks (ARPANET), but the real booster for civil research and for society in general was the invention of the World Wide Web by Tim Berners-Lee in 1989 at CERN. It was created as a pure researchers’ tool for exchanging scientific information, with the goal of becoming a “universal linked information system”. Almost simultaneous to the creation of the WWW was the emergence of another critical tool for today’s mathematicians: the arXiv repository of papers and preprints. Originally available as an FTP service (1991) and soon after on the WWW (1993), this repository managed by Cornell University has become a reference and a primary source for posting new works in mathematics and many other research fields. Many researchers post their preliminary articles there as soon as they are ready to be shared, before sending them to traditional journals for peer review and publication. ArXiv thus serves a double purpose: it is a repository to host and share results (and to prove precedence if necessary), and it is also a discovery tool for many researchers. ArXiv offers mailing lists and RSS feeds with daily news about what has been published (or is going to be published) in the specific field of your interest. ArXiv has largely replaced the discovery function previously offered by reviewing services (zbMATH, MR). These catalogs do not host the works; instead, they index and review peer-reviewed articles (zbMATH Open also indexes some categories of arXiv). As indexing tools, these services remain authoritative (complete, with curated reviews, and well maintained) and offer valuable bibliographical data and linked information, but their role as a discovery tool is no longer an undisputed feature.
The data revolution
At this point, we have probably covered the main tools of mathematicians up to the end of the 20th century, and we enter contemporary times. One of the challenges that 21st-century research is facing is data management. Most sciences have always been based on experimentation and data collection, but the scale of data collection has grown to unprecedented levels, often called “big data”. With this term, here we refer both to particular massive datasets bound to a specific project and also to the amount of projects and data (big and small) that flood the landscape of science.
Many data repositories have become essential tools for handling data types other than research papers. In the case of software, Git (2005) and Git repositories (GitHub, 2008) have emerged as the most popular source code management tools, and have mitigated or solved many problems with managing source code versions and the collaborative creation of software.
Digital Object Identifiers (DOIs, 2000) have become a standard for creating reliable, unique, persistent identifiers for files and digital objects on the ever-changing internet. Publishers assign these DOIs to the digital versions of publications, but actually, DOIs are essentially universal labels for any digital asset. Repositories such as Zenodo (2013) offer DOIs and hosting for general-purpose data and digital objects.
Mathematics, as has happened with other sciences, has dramatically increased its reliance on data, be it experimental data (statistics, machine learning…), extensive collections and classifications (groups, varieties, combinatorial objects…), source code for scientific computing, workflow documentation in interdisciplinary fields, etc. The scientific community, and the mathematics community in particular, has grown bigger than ever, and it is challenging not only to keep track of all the advances, but also to keep track of the methods and to replicate the results yourself. In response, the sciences are in the process of building research data infrastructures that help researchers in their daily lives. Here is where MaRDI (and the NFDI for other branches of science) enters as a project to help on that front.
Structuring research data implies, on the one hand, creating the necessary infrastructure (databases, search engines, repositories) and guiding principles that govern the advancement of science ethically and philosophically. The FAIR principles (research data should be Findable, Accessible, Interoperable, and Reusable), which we have discussed extensively in previous articles, provide a practical implementation of such principles, together with common grounds such as the verifiability of results, the neutrality of the researcher, or the process of the scientific method. On the other hand, the structuring of research data won’t be successful unless researchers embrace new practices that are not perceived as imposed duties but as reliable, streamlined tools that make their results better and their work easier.
MaRDI aims to become a daily tool that helps mathematicians and other researchers in their jobs. Some of the services that MaRDI will provide include access to numerical algorithms, richly described, benchmarked, and curated for interoperability; browsing object collections and providing standardized work environments for algebraic computations (software stacks for reproducibility); curating and annotating tools and databases for machine learning and statistical analysis; describing formal workflows in multi-disciplinary research teams; and more. All MaRDI services will be integrated into a MaRDI portal that will serve as a search engine (for literature, algorithms, people, formulas, databases, services, et cetera). We covered some of MaRDI's services in previous articles and will cover more in the future.
The cooperation challenge
Another challenge many sciences face, and increasingly mathematics as well, is the growth of research teams working on a single research project. In many experimental or modeling fields it is not uncommon to find long lists of 8, 10, or more authors signing an article, since it is the visible output of a research project involving that many people. Different people take different roles: the person who devised the project, the one who carried out experiments in the lab, the one who analyzed the data, the one who wrote some code or ran some simulations, the one who wrote the text of the paper, etc. Listing all of them as “authors” gives no hint about their roles, and ordering the names by relative importance is a very loose method that does not improve the situation much. This challenge requires a new consensus on good scientific practice that the community accepts and adopts. The most developed proposed solution is the CRediT system (Contributor Roles Taxonomy), a standard classification of 14 roles that intends to cover all possible ways a researcher can contribute to a research project. The system is proposed by the National Information Standards Organization (NISO), a United States non-profit standards organization for publishing, bibliographic, and library applications.
For reference, we list here the 14 roles and their descriptions:
- Conceptualization: Ideas; formulation or evolution of overarching research goals and aims.
- Data curation: Management activities to annotate (produce metadata), scrub data and maintain research data (including software code, where it is necessary for interpreting the data itself) for initial use and later re-use.
- Formal Analysis: Application of statistical, mathematical, computational, or other formal techniques to analyze or synthesize study data.
- Funding acquisition: Acquisition of the financial support for the project leading to this publication.
- Investigation: Conducting a research and investigation process, specifically performing the experiments, or data/evidence collection.
- Methodology: Development or design of methodology; creation of models.
- Project administration: Management and coordination responsibility for the research activity planning and execution.
- Resources: Provision of study materials, reagents, materials, patients, laboratory samples, animals, instrumentation, computing resources, or other analysis tools.
- Software: Programming, software development; designing computer programs; implementation of the computer code and supporting algorithms; testing of existing code components.
- Supervision: Oversight and leadership responsibility for the research activity planning and execution, including mentorship external to the core team.
- Validation: Verification, whether as a part of the activity or separate, of the overall replication/reproducibility of results/experiments and other research outputs.
- Visualization: Preparation, creation and/or presentation of the published work, specifically visualization/data presentation.
- Writing – original draft: Preparation, creation and/or presentation of the published work, specifically writing the initial draft (including substantive translation).
- Writing – review & editing: Preparation, creation and/or presentation of the published work by those from the original research group, specifically critical review, commentary or revision – including pre- or post-publication stages.
The recommendation for academics is to start applying these roles to each team member in their research projects, keeping in mind that one person can hold several roles, several people can share a role, and only applicable roles should be used. A degree of contribution can optionally be indicated (e.g. ‘lead’, ‘equal’, or ‘supporting’).
Using author contributions can be pretty straightforward. For example, imagine a team of four people working on a computer algebra project. Alice Arugula is a professor who had the idea for the project, discussed it with Bob Bean, a postdoc, and both developed the main ideas. Then Bob involved Charlie Cheeseman and Diana Dough, two PhD students who programmed the code, and all three investigated the problem and filled in the results. Bob and Diana wrote the paper, Charlie packed the code into a library and published it in a popular repository, and Alice reviewed everything. They published the paper, and all of them appeared as authors. Following CRediT and publisher guidelines, they included a paragraph at the end of the introduction that reads:
Author contributions
Conceptualization: Alice Arugula, Bob Bean; Formal analysis and investigation: Bob Bean (lead), Charlie Cheeseman, Diana Dough; Software: Charlie Cheeseman, Diana Dough; Data curation: Charlie Cheeseman; Writing - original draft: Bob Bean, Diana Dough; Supervision: Alice Arugula.
For publishers, the CRediT recommendation is to ask the authors to detail their contributions, to list all the authors with their roles, and to ensure that everyone in the contributing team assumes the share of responsibility assigned by their role. Technically, publishers are also asked to make the role description machine-readable using the existing XML tag descriptors.
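As a rough, hedged sketch of what such machine-readable markup can look like (the attribute names follow the JATS convention for CRediT roles; the exact schema, element nesting, and vocabulary identifiers depend on the publisher's production system), a single contributor entry might be encoded along these lines:

    <contrib contrib-type="author">
      <name><surname>Bean</surname><given-names>Bob</given-names></name>
      <role vocab="credit"
            vocab-identifier="https://credit.niso.org/"
            vocab-term="Writing – original draft"
            vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/">
        Writing – original draft
      </role>
    </contrib>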
Formalized mathematics
We’ll finish with a more speculative tool that mathematicians may use in the mid-term future. A branch of meta-mathematics that is already mature enough to spread across all other areas of mathematics is formalized mathematics. Once studied only theoretically as part of formal logic or the foundations of mathematics, it now allows computers and formal languages to transcribe mathematical definitions, statements, and proofs in a machine-readable and machine-processable way, so that a computer can verify a proof. Computer-assisted proofs are now generally accepted in the mainstream (at least concerning symbolic computations), a long time having passed since the “shock” of the four-color theorem and other early examples in the 70s and 80s of the necessary role of computing in mathematics. Beyond using a computer for a specific calculation that helps in the course of a proof, formalized mathematics brings the possibility of verifying the whole chain of arguments and logical steps that prove the statement of a theorem from its hypotheses. The system and language Coq was used by Georges Gonthier to formalize the aforementioned four-color theorem in 2005 and the Feit-Thompson theorem in 2012. More recently, the Lean system has proven helpful in backing up mathematical proofs such as the condensed mathematics project by Peter Scholze in 2021, or the polynomial Freiman-Ruzsa conjecture by Tim Gowers, Terence Tao, and others in 2023.
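For a flavor of what formalized mathematics looks like, here is a toy example in the Lean language (our own illustration, far simpler than the projects mentioned above). The statement and its proof live in the same file, and Lean only accepts the file if every step checks out.

    -- A statement and its machine-checked proof in the same file.
    -- Here we simply reuse a lemma from Lean's standard library.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

    -- Simple concrete facts can be discharged by a built-in decision procedure.
    theorem two_plus_two : 2 + 2 = 4 := by decide

A human still writes the statements and guides the proofs, but the verification of every logical step is carried out by the machine.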
Proponents of the “formalized mathematics revolution” dream of a future in which every research article is accompanied by a machine-readable counterpart that encodes the same statements and proofs as the human-readable part. At some point, AI systems could help translate from the human-readable to the machine-readable form. Verifying and peer-reviewing the validity of a result would then become a trivial run of the code in a proof system, leaving human intervention to matters of clarity and style. Some even speculate that once logical deduction techniques are machine-encoded, artificial intelligence systems could be trained to optimize, suggest, or generate new results in collaboration with human mathematicians.
Whether those formalized languages remain a niche tool or become a widespread practice for mathematicians at large, and whether mathematical research will one day be assisted by artificial systems, are open questions to be seen in the next few decades.
In Conversation with Bettina Eick
In this issue of the Data Date series, Bettina Eick, who has developed the Small Groups Library in GAP for decades, tells us how research has changed for her with the increasing availability of technical infrastructure.
Data ❤️ Quest
MaRDI developed the fun interactive online adventure Data ❤️ Quest. In this short game, your character finds themselves at a math conference, where you can interact with different mathematicians. Find out about their kinds of mathematical research data and complete their quests. Prompts will guide you through the storyline, and the game ends automatically after 5 minutes. Of course, you can play as many times as you like. Visit our MaRDI page for more instructions and a link to play.
The five-minute game was initially published for Love Data Week 2024. It is open source and serves as a trial for an expanded multi-player game currently in development.
15th polymake conference
This report is about the 15th polymake Conference, held on February 2nd, 2024, at the Technical University Berlin. It was embedded into the polymake developer meeting, making it a three-day event held by the polymake community.
Polymake is an open source software for research in polyhedral geometry. It deals with polytopes, polyhedra and fans as well as simplicial complexes, matroids, graphs, tropical hypersurfaces, and other objects.
The conference started with a talk by Prof. Volker Kaibel on 'Programming and Diameters of Polytopes'. His illustrative presentation set the tone for a day of engaging sessions and workshops. Attendees were able to select from a range of interactive workshops and tutorials, each allowing participants to follow along. In addition to a Polymake Basics tutorial, topics such as serialization, regular subdivisions, Johnson solids, and quantum groups were covered.
Coming from Leipzig, our motivation for attending was driven by our MaRDI project on phylogenetic trees. Specifically, we aim to improve the FAIRness of a website with small phylogenetic trees. Polymake developer Andrei Comăneci played a key role in implementing phylogenetic trees as data types in polymake and worked with our MaRDI fellow Antony Della Vecchia to make them available in OSCAR, which we use for our computations. They presented their work during the "use case" session on phylogenetic trees, and we learned how to make use of their software.
Despite our limited experience with polymake, we felt very welcome in the community and the collaborative spirit of experimentation and creation added a layer of enjoyment to the event. The conference provided a good balance between dedicated time for hands-on exploration and the option to seek guidance in discussions with the developing team.
6th NFDI symposium of the Leibniz Association
There is no doubt that the Leibniz Association (WGL) and its member institutions are very active within the NFDI, judging from the (now) 6th NFDI symposium of the WGL that took place on December 12, 2023, in Berlin. In fact, it was during a coffee break at the first edition of this symposium, in 2018, that the idea of an NFDI mathematical consortium, MaRDI, was born. Five years on, 26 NFDI consortia across disciplines, including MaRDI, and Base4NFDI are working towards the goals of the NFDI in building a FAIR infrastructure for research data. The first-round consortia have submitted their interim reports and are preparing their extension proposals for a possible second funding phase. So, the urgent question and main topic of the 2023 symposium was: How do we move forward? Discussion rounds covered the development of the NFDI and its consortia beyond 2025 as well as the connection to European and international research data initiatives. In particular, participants discussed the question of building "one NFDI" with the bottom-up approach, via disciplines, that has been followed so far. MaRDI was represented by Karsten Tabelow in the corresponding panel discussion. It is certain that these questions will remain the focus for the time to come. What the symposium already made obvious is that everybody is working towards the same goal and is constantly bringing in ideas for reaching it. Last but not least, Sabine Brünger-Weiland from FIZ Karlsruhe made her last appearance as its official representative at the symposium before her retirement. MaRDI owes her the conversation during the coffee break five years ago.
AI Video Series
NFDI4DS produced the video series "Conversations on AI Ethics". Each of the ten episodes features a specific aspect of AI, interviewing well-known experts in the field. All episodes are available on YouTube and the TIB AV Portal.
More information:
BERD Research Symposium
The event is scheduled for June 2024 and encompasses conference-style sessions and a young researchers' colloquium, fostering collaboration and exchange of information in research in business, economics, and social science. The event's primary focus is the collection, pre-processing, and analysis of unstructured data such as image, text, or video data. Registration deadline: May 1st, 2024.
More information:
- in English
Call for Proposals
You may apply for funding to process and secure research data to continuously expand the offerings of data and services provided by Text+ and to make them available to the research community in the long term. Multiple projects between EUR 35,000 and EUR 65,000 can be funded. Additionally, an overhead of 22% is granted on the project sum. The project duration is tied to the calendar year 2025 and is thus a maximum of 12 months. Application deadline: March 31, 2024.
More information:
In this short video "Treffen sich 27 Akronyme, oder: WTF ist NFDI?" Sandra Zänkert explains the idea behind NFDI, its current state and some challenges on the road to FAIR science.
The paper "Computational reproducibility of Jupyter notebooks from biomedical publications" by Sheeba Samuel and Daniel Mietchen examines large-scale reproducibility of Jupyter notebooks associated with published papers. The corpus here is from biomedicine, but much of the methodology also applies to other domains.
Freakonomics Radio is a popular English-language podcast that recently published an interesting two-part series on academic fraud.
The paper "The Field-Specificity of Open Data Practices", by Theresa Velden, Anastasiia Tcypina provides quantitative evidence of differences in data practices and the public sharing of research data at a granularity of field-specificity rarely reported in open data surveys.
When your open-source project starts getting contributors, it can feel great! But as a project grows, contributors can neglect to document everything. In this situation, the article "Building a community of open-source documentation contributors" by Jared Bhatti and Zachary Sarah Corleissen may help you.
Welcome to this year's final Newsletter, our seventh issue on math and data! We are happy to present you with troves of information on managing your mathematical research data. Over the past issues, we have discussed the implications of the FAIR acronym for mathematics, how to search for and structure math results using knowledge graphs, and the specialties of mathematical research data. We will close this cycle by raising these questions: What do funding bodies and NFDI members recommend in their research-data management guidelines? How can we practice this in maths? What are MaRDI's recommendations?
For answers, check out our interview with math research-data manager Christoph Lehrenfeld; the keynote article on what to write in a research-data management plan; various reports from meetings where these topics were discussed, especially with a community of librarians; and not to mention, the ever-intriguing list of recommended reading.
Enjoy and seasonal greetings!
As always, we start off with an illustration. This time, it depicts the different research data types in mathematics, as discussed in our previous issue.
Hot tip: Send the illustration to your colleagues and friends as a seasonal greeting.
by Ariel Cotton, licensed under CC BY-SA 4.0.
In the previous issue, we asked what type of mathematician you are. All types of mathematicians are represented significantly within our newsletter community. The largest fraction belongs to the Guardians of the Data Vault category. You can also check out this page for more information, including a free poster download.
Here are the results:
Now, back to the current topic - Research Data Management in Mathematics. Have you ever been asked about data handling in a funding proposal? The survey of this newsletter issue deals with:
Data Management: From Theory to Practice
In previous MaRDI newsletter articles, we discussed what mathematical research data is, the guiding principles that define proper, good-quality research data (the FAIR principles), and why you, as a researcher, should care about your data. It is now time to raise the question of how to properly curate your research data in practice.
Research Data Management (RDM) refers to all handling of data related to a research project. It includes a planning phase (written as a formal RDM plan and included in an application for funding agencies), an ongoing data curation and plan revision during the project, and an archival phase at the conclusion of the project.
In this article, we will survey the main points to consider for proper data management. There are, however, more comprehensive and detailed guides that you can use to create your own RDM plan. The MaRDI community has written a report on Research-data management planning in the German mathematical community, and a whitepaper (Research Data Management Planning in Mathematics) that will be helpful in the context of mathematics. You can also get useful resources from other NFDI consortia, such as the FAIRmat Guide to writing a Research Data Management Plan.
Writing an RDM plan
An RDM plan is a document that describes how you and your team will handle the research data associated with your research project. This document is a helpful reference for the researchers on how to fulfill data management requirements. Nowadays, it is a standard requirement of many funding agencies in their application regulations for projects.
There are several standard key points to consider in an RDM plan. These points were developed by Science Europe and have been adopted across agencies internationally. You can check the evaluation criteria for each point that evaluators are likely to use when reviewing your application. In Germany, the key requirements are given by the German Research Foundation, the DFG.
Data description
First and foremost, you need to know the type of data you will be handling. Start by describing the types of data involved in your project (experimental records, simulations, software code…). It is a good idea to separate data by its provenance: internal data is data generated within the project, whereas external data is data used for the project that is generated elsewhere. When recording internal data, specify the means of data generation (by measuring instruments in the laboratory, by software simulation, written by a researcher…). As for the documentation of external data, include details of any interface/compatibility layer used (for instance format conversions).
Workflows are themselves a type of data. If you process data in a complex way (combining data from different sources, involving several steps, using different tools and methods…), the process itself, the workflow, should be properly documented and treated as research data.
Plan in advance the file formats required for recording, the necessary toolchains, and other aspects that affect interoperability. Prioritize the use of open formats and standards (if you need to use proprietary formats, consider saving both the proprietary version and a copy exported to an open format). Finally, estimate the amount of data you will collect and anticipate any other practical needs of you or anyone else using the data. As you cannot foresee every data requirement in detail (for instance, you may not know the specific software tools necessary to solve your problem), your RDM plan should be updated at a later stage if the type, volume, or characteristics of your data change significantly.
In mathematics, it is likely that you will generate text and PDF files for your manuscripts, with graphics from different sources. If your bibliography grows above a hundred references, you may use separate BibTeX files (.bib) that you can re-use across publications; these constitute a usually overlooked piece of research data.
If your project involves computations, you will have scripts, notebooks, or code files that serve as input for your computation engine. Your system will require a toolchain to work, for instance a particular installation of Singular, OSCAR, MATLAB, or a C compiler, together with some installed libraries and dependencies. You may use an IDE or a particular text editor (while that may seem a personal choice not relevant for other users, it is in fact quite useful to know how some software was developed in practice). This toolchain is also a piece of research data that needs to be curated. You may have output files that require documentation, even if they can be recreated from the inputs. If your project involves third-party databases, these should be properly referenced and sourced.
Documentation and data quality
Data must be accompanied by rich metadata that describes it. Your documentation plan should state the metadata you need to collect and explain how it will stay attached to the data. Once you have a description of your data, you need to organize it. Create a structure that will accommodate all the generated data. The structure can include some hierarchy in your filesystem, conventions for naming files, or another systematic way to find and identify your data easily. Do not call your files “code.sing”, “paper.tex”, or “example3_revised 4 - FINAL2.txt”; instead, use meaningful names such as “find_eigenvalues.nb”, and start your document with comments explaining what the file is, the author, language, date, references to theory, how to run it, and any other useful information.
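As a hedged illustration (the file name, author, and reference below are invented, and the same idea applies equally to notebooks or computer algebra scripts), such a self-describing header might look as follows for a small Python script:

    # find_eigenvalues.py
    # Purpose : compute the eigenvalues of the adjacency matrices used in Section 3.
    # Author  : A. Arugula <arugula@example.org>
    # Date    : 2024-05-02
    # Language: Python 3.11, depends on numpy >= 1.26
    # Theory  : see [ArBe24], Proposition 3.2, for the matrices being analyzed
    # Usage   : python find_eigenvalues.py input_matrix.csv

    import sys
    import numpy as np

    def main(path: str) -> None:
        # Load a square matrix from a CSV file and print its eigenvalues.
        matrix = np.loadtxt(path, delimiter=",")
        print(np.linalg.eigvals(matrix))

    if __name__ == "__main__":
        main(sys.argv[1])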
Good documentation will be a crucial step in enabling the reusability of the data. If you are developing a software library, you need to document the functions, APIs, and other parts of the software, including references to the theoretical sources that your algorithms are based on. If you are curating a database or a classification of mathematical objects, you need to document the meaning of fields in your tables, the formulae for derived values, etc. If parts of the data can be re-generated (for example, as a result of a simulation), you should describe how to do so, and differentiate clearly between source data and automatically generated data.
Data quality refers to the FAIRness of the data, which needs to be checked and addressed during the implementation phase. At the planning stage, you can define metrics to evaluate data quality and provide quality control mechanisms, for instance checking the integrity of the data periodically and testing whether the whole toolchain can be installed and executed successfully. You can also plan a contingency in case some tools become obsolete or unavailable.
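One possible way to implement such a periodic integrity check, sketched here in Python under the assumption that the data lives in a local directory and that a manifest of known-good checksums has been prepared beforehand (both file names are hypothetical):

    # check_integrity.py -- compare current SHA-256 checksums against a stored manifest.
    # The manifest "checksums.txt" contains lines of the form "<hexdigest> <relative path>".
    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        # Stream the file in chunks so large datasets do not exhaust memory.
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def check(manifest: Path, root: Path) -> None:
        for line in manifest.read_text().splitlines():
            digest, name = line.split(maxsplit=1)
            current = sha256_of(root / name)
            status = "OK" if current == digest else "CORRUPTED OR CHANGED"
            print(f"{name}: {status}")

    if __name__ == "__main__":
        check(Path("checksums.txt"), Path("."))

Such a script can be run regularly, and its output can be archived as part of the project's quality documentation.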
Storage and archiving
Storing and archiving research data is not a trivial matter and should be planned carefully. On the one hand, it requires security against data loss (or data breaches in the case of sensitive data); on the other hand, the data needs to be accessible to all the researchers involved in a practical way.
Your storage strategy should take into account the amount of data (large volumes of data are more difficult to move and preserve), its persistence (experiments are usually recorded only once, whereas an article or computer code is rewritten again and again with improvements), the number of people needing write access, and the sensitivity of the data (for instance, data related to people can contain private information that needs to be anonymized and access-controlled, whereas non-sensitive data can be put in public repositories).
Backup plans should include keeping several copies of the data, in different physical locations and on different media. Important key files (e.g. indices or tables listing the contents of other files) should be specially protected and backed up. Synchronizing data and keeping track of versions is also important to avoid disorganized copies. If several people in your team need write access to the same data, you need appropriate tools to avoid conflicting versions, such as online multi-user editors (e.g. Overleaf, Nextcloud/Collabora, Google Docs…) or version control systems. Remember to always back up a local copy and not keep your only copy in the cloud. Good scientific practice is to keep your data for at least 10 years after publication or project completion.
Legal and ethical obligations
It is important to comply with the legal obligations associated with data and to address related ethical considerations that may arise.
All data should be associated with an author or an owner, who has the right to decide on a license and to control its access, usage, and other legal prerogatives. Intellectual property and copyright may apply to some data, like patents, software, commercial products, or publications. Intellectual property protects ideas (but not facts of nature), while copyright protects an expression of an idea. In mathematics, a theorem cannot be protected by intellectual property (since it is a fact of nature), while an algorithm for a practical purpose or its implementation in software could be protected by it. The text of a scientific article is generally protected by copyright, even if the ideas contained therein are free. If the copyright of your written texts is to be transferred to a publisher (as is standard practice), you should state the conditions that are acceptable within your project and your publication strategy (see next section).
Sensitive data (e.g. medical records, personal information…) require a specific data handling policy with special attention to data protection and access.
In all cases, you should include an appropriate license note after evaluating all implications. For open licenses, prefer standard licenses (for instance Creative Commons or free/open software licenses) instead of crafting your own licenses or adding/removing clauses, which can lead to license incompatibilities and encumber reusability. See our article on Reusability.
Data exchange
Data exchange involves integrating the project and its data with the community. Data should be readily found, accessed, operated on, and reused by anyone with a legitimate interest in it. In practice, this data exchange will involve a long-term data preservation strategy (archiving) as well as a dissemination strategy (using community standards to share data).
While the storage and archiving section above concerns mainly data security and preservation, this data exchange section focuses on the FAIRness (and especially the accessibility) of the data and its exchange within the community. Naturally, both topics overlap. The FAIRness of data is heavily affected by the FAIRness of the repositories or hosting solutions that store and make this data accessible. Look for FAIR and reliable repositories in your domain that can host online versions of your data during the implementation of the project and also act as a long-term archiving solution.
In this section, you can include the publication strategy for your research articles: whether you plan to publish pre-prints or final articles in free repositories (like arXiv), whether you intend to publish only in open-access journals, etc. Though there are comprehensive catalogs for the literature in mathematical research (zbMATH Open, MathSciNet), it is always good to ensure that your publications are findable and accessible. For other types of data, you should carefully consider their dissemination and ensure that your data is listed in relevant catalogs.
Responsibilities
All research data needs someone to take care of it. A person (or a team) must take responsibility for the research data in the project. This responsibility may lie with the owner/author or with someone else. A data steward can be appointed to help with the technical aspects of data management. Different teams can also be assigned responsibilities during different phases of the project (planning, implementation, archiving), but these responsibilities should be public and well defined, and the responsible persons should serve as a contact point during and after the project.
If the data is meant to be static (no changes in the future), then the responsible person is only answerable for what has been published. If, on the other hand, the data is expected to grow in the future (for instance a growing classification of mathematical objects), a maintainer should be appointed to keep track of advances in the field and incorporate new data in its appropriate place. If the maintainer can no longer take care of that role, the position should be transferred to another suitable person or team, for as long as the project needs a maintenance team.
MaRDI RDM consulting
We can offer a couple of examples of RDM plans from MaRDI, developed in the context of mathematics by MaRDI members: first, for a project applying statistical analysis to datasets containing student records for a study in didactics [RDMP1]; second, for a project that develops algorithms and software with applications in robotics [RDMP2]. These are prototypes we prepared together with RDM experts from Leipzig University for mathematics projects planned by our researchers. We handed them out as examples to the community at the DMV annual meetings in 2022 and 2023.
MaRDI can offer consulting services for math projects that need help with creating their own RDM plan, or just figuring out the necessary infrastructure and best practices for a FAIR RDM. You can contact the MaRDI Help Desk for more information.
Tools for keeping your RDM plan up to date
There are existing tools to help researchers plan and fulfill an RDM plan. These can be used in small or individual projects, though they are primarily meant for large projects involving many researchers. We will briefly discuss the Research Data Management Organizer (RDMO), a web-based service widely used in German research institutions.
RDMO is free open-source software developed as a DFG project, meant to run as a web service within your institution's infrastructure. Normally, a data manager (data steward) plays the administrator’s role and installs the RDMO software on a server accessible to the institution's researchers. The data manager creates questionnaires for handling the data of a specific project. Each questionnaire is available as an online form that researchers can fill in for each piece of data they create or gather. From the answers, a standardized file can be exported that serves as metadata for the described data. Template questionnaires are available, so that all relevant information is included (e.g. following the DFG guidelines). The questionnaires can also be used to generate a standardized RDM plan for the project. No data is actually stored or handled on the RDMO platform; RDMO and other RDM tools only handle metadata and help with organization. You still need to store and structure your research data, ensure data quality, apply licenses, manage data exchange, etc. These tasks are not automated by any RDM tool, and you remain in charge of implementing your RDM plan.
Some institutions require their researchers to use this platform to prepare RDM plans for their research projects. One such case is the MATH+ excellence cluster at the Zuse Institute Berlin. A version of the questionnaire that ZIB researchers use is available in [Quest1] (also published here in XML format; the actual RDMO instance is only available to ZIB users). Using such a system reduces the possibility of unintended omissions, ensures compatibility with the guidelines of the funding agency (DFG), and standardizes RDM plans across different projects.
MaRDI is also actively using RDMO as part of its task area devoted to interdisciplinary workflows. Workflows are important research data for projects involving researchers from different disciplines, which makes their management particularly challenging. MaRDI has prepared an RDMO questionnaire that can describe workflows in a MaRDI-standard way; you can have a look at [Quest2] (also published here in XML format). Additionally, MaRDI is developing MaRDMO, an RDMO plug-in that can be installed on the RDMO instance you use (a live demo will be available soon). This plug-in will add the feature of exporting the documented workflow metadata directly to the MaRDI knowledge graph, making it findable and accessible through the MaRDI portal. This will provide a streamlined way to populate the MaRDI knowledge graph directly from the researchers, using the same tool they use to create an RDM plan and manage their research data metadata.
In Conversation with Christoph Lehrenfeld
To get to know about infrastructural projects within collaborative research centers, Christiane Görgen interviews Christoph Lehrenfeld from Göttingen Scientific Computing about new developments and best practices in research data management.
DMV Annual Meeting
For four days in September (25th-28th), the town of Ilmenau in Thuringia was populated by hundreds of mathematicians from various disciplines and regions across Germany, who had traveled to the annual meeting of the Deutsche Mathematiker-Vereinigung (DMV). The event provided an excellent opportunity to present MaRDI and engage with the mathematical community. MaRDIans from nearly all our task areas were present. On the first day, we held our mini-symposium 'Towards a digital infrastructure for mathematical research', where speakers presented infrastructure services for mathematics they have developed or are developing. At the MaRDI stall, we engaged in lively discussions with interested mathematicians, presented the latest version of our Algorithm Knowledge Graph, and distributed information material. The community responded positively to a checklist for technical peer review distributed at the MaRDI stall and, in particular, to the "What type of mathematician are you?" poster [https://www.mardi4nfdi.de/community/data-type]. We noticed an increase in scientists' awareness of research data management in mathematics and in the recognition of MaRDI compared to last year's DMV annual meeting in Berlin. It is encouraging to see that awareness regarding FAIR data is growing. Overall, we are pleased with our conference visit and the connections we made in Ilmenau.
Math meets Information Specialists Workshop
The first "Maths meets Information Specialists" workshop was held from October 9th to 11th as a noon-to-noon event at the Max Planck Institute for Mathematics in the Sciences in Leipzig. Organized by MaRDI, it brought together 20 professionals in diverse capacities, including librarians, data stewards, domain experts, and mathematicians. The workshop included talks and interactive elements such as hands-on sessions and barcamps. The focus was on addressing key questions related to the unique characteristics of mathematical research data (for example, what metadata is minimally sufficient to identify a maths object?) and on exploring existing services and challenges faced by infrastructure facilities and service providers. This also included the topic of training and addressed the difficulty of raising awareness of RDM topics among mathematicians.
The "Maths meets Information Specialists" workshop provided valuable insights, discussions, and best practices for the challenges associated with mathematical research data management. Moving forward, the initiative aims to continue fostering collaboration, developing standards, and supporting training efforts to ensure the effective management of mathematical research data. Stay tuned for a follow-up event.
The 3rd MaRDI Annual Workshop in Berlin
In November, we met as the MaRDI team in Berlin for the third run of our annual workshop. Participants from every task area arrived on Tuesday, 28 November, to engage with collaborators from neighboring NFDI consortia -- namely 4Biodiversity, KonsortSWD, and 4DataScience -- which have strong links to mathematical methods. This was followed by a panel discussion with all speakers on topics of common interest, such as knowledge graphs, community building, and potential areas for interdisciplinary collaboration. After such an inspiring kick-off, the meeting gave ample opportunity for MaRDIans to discuss the status quo and plans for the second half of the five-year funding period. Four new services were proudly presented: a new FAIR file format for saving mathematical objects, now available in OSCAR; a first version of the scientific-computing knowledge graph; software solutions for open interfaces between different computational tools, like algorithms in Python and Julia; and ways of annotating and visualizing TeX code in the Portal. These sparked lively discussions in subsequent barcamps on how to present MaRDI services using the upcoming interactive MaRDI station (for instance, as video games); how to embed the teaching of math infrastructure services in a curriculum, even if only for one hour per semester; and how to integrate these services into our MaRDI Portal. The meeting concluded with a clear focus for the next two years: bringing MaRDI services to our users and communities.
First NFDI Berlin-Brandenburg Network Meeting
MaRDI initiated the first NFDI Berlin-Brandenburg network meeting at the Weierstrass Institute (WIAS) Berlin on October 12, 2023, with the aim of setting up a local network of all NFDI consortia located in the region. The main goal was to establish contacts between members of the different consortia and to identify common fields of interest. We focused particularly on the mutual benefit of cooperation between projects of consortia in different disciplines.
25 out of the 27 consortia of the NFDI are present in the Berlin-Brandenburg region, involving more than 120 scientific and other institutions. 73 participants registered for the meeting. Most participants belonged to one of 21 different NFDI consortia, while a few were not affiliated with any consortium but attended out of interest in learning about the NFDI and its consortia.
While similar NFDI local communities ("Stammtische") exist in a few other regions in Germany, a formal network of these communities is still elusive and is expected to be initiated by the NFDI headquarters in the future.
The workshop started with participants introducing themselves, getting to know each other, and brainstorming topics for the afternoon's World Café.
Among others, the following points were discussed:
Improving the acceptance and importance of FAIR principles and Research Data Management (RDM)
Role of open-source software for infrastructure technology and sustainability and longevity of services in the NFDI
Teaching RDM and literacy in the use of cross-disciplinary data types
Industry and International Collaborations
Ontologies and Knowledge Graphs (KG)
Importance of teaching the central topics of the NFDI such as RDM and KG, mainly to scientists in early phases of their career. Role of incentives for engagement in data management.
Overall, the atmosphere was open and constructive, focusing on bridging traditional gaps and fostering interdisciplinary cooperation. The meeting enabled us to find a common language and fields of interest, emphasizing the overarching aspects of the NFDI. We expect the venture to grow into future collaborations on topics central to the NFDI, meetings of smaller groups to discuss topics such as teaching RDM or ontologies, and biannual meetings at bigger forums. The main communication channel for information on future NFDI_BB activities will be the NFDI_BB mailing list:
https://www.listserv.dfn.de/sympa/info/nfdi_bb
Kindly register to be informed of future workshops and events.
Workshop on RDM in Modelling in Computer Science
This NFDIxCS workshop provides a forum for discussing systematic approaches to dealing with research data. It aims to gather individuals willing to contribute to research data management in modeling research (in Computer Science); the result will be a manifesto for research data management in this area. Date: March 11, 2024, submission deadline: January 8, 2024.
More information:
- in English
Data Management Plan Tool
The German Federation for Biological Data (GFBio) offers a Data Management Plan (DMP) Tool. It will help you find answers to important questions about the data management of your project, and create a structured PDF file from your entries. You can also get free personal DMP support from their experts.
More information:
- in English
Mailing list “Math and Data Forum”
The MaRDI mailing list “Math and Data Forum” offers news and insights into the realm of mathematical research data as well as a discussion forum for research data management practices and services in mathematics.
More information:
Special ITIT Issue Data Science and AI within the NFDI
Data Science and AI is an interdisciplinary field that is important for many NFDI consortia. This special issue of the journal "it - Information Technology" will focus on recent developments in Data Science and AI in the different consortia. Submission deadline: January 31, 2024.
More information:
- in English
The whitepaper Research Data Management Planning in Mathematics by the MaRDI consortium aims to guide mathematicians and researchers from related disciplines who create research data management (RDM) plans. It highlights the benefits and opportunities of RDM in mathematics and interdisciplinary studies, showcases examples of diverse mathematical research data, and suggests technical solutions that meet the requirements of funding agencies, with specific examples.
This guide to writing a Research Data Management Plan by FAIRmat provides you with comprehensive information and practical tips specific to the fields of condensed-matter physics and materials science on creating a data management plan (DMP) that meets the DFG requirements and aligns your research with the FAIR data principles, the DFG code of conduct, and the EU open science policy.
The DFG guidelines on Research Data Management (2021) include this Checklist for planning and description of handling of research data in research projects.
The EMS article "Research-data management planning in the German mathematical community" by Tobias Boege and many other MaRDIans discusses the notion of research data for the field of mathematics and reports on the status quo of research-data management and planning.
Science Europe offers a practical Guide to the International Alignment of Research Data Management. Find the Extended Edition here.
The review Making Mathematical Research Data FAIR: A Technology Overview by Tim Conrad and several other MaRDIans surveys existing data repositories and portals with a focus on mathematical research data.
- MaRDIan Daniel Mietchen was invited to give an NFDI InfraTalk on Scholia - an open-source profiling tool to explore scholarly knowledge via open data from Wikidata. He is one of the core developers of Scholia, which provides dozens of profile types, each composed of multiple panels that render pertinent information based on a predefined SPARQL query that is parametrized via the Wikidata identifier of the concept to be profiled. It also facilitates collaborative curation of this information in Wikidata. Luckily, the talk was recorded and is available on YouTube for you to watch.
In the short article "Share and share alike: Top 5 reasons to share your research data!", Isabel Chadwick makes the case for sharing your data by highlighting some key benefits.
Welcome back to the Newsletter on mathematical research data—this time, we are discussing a topic that is very much at the core of our interest and that of our previous articles: what is mathematical research data? And what makes it special?
Our very first newsletter delved into a brief definition and a few examples of mathematical research data. To quickly recap, research data are all the digital and analog objects you handle when doing research: this includes articles and books, as well as code, models, and pictures. This time, we zoom into these objects, highlight their properties, needs, and challenges (check out the article "Is there Math Data out there?" in the next section of this newsletter), and explore what sets them apart from research data in other scientific disciplines. We also report from workshops and lectures where we discussed similar questions, present an interview with Günter Ziegler, and invite you to events to learn more.
by Ariel Cotton, licensed under CC BY-SA 4.0.
We start off with a fun survey. It is again just one multiple choice question. This time, we created a decision tree, which will guide you to answer the question:
What type of mathematician are you?
You will be taken to the results page automatically after submitting your answer. Additionally, the current results can be accessed here.
The decision tree is available as a poster for download, licensed under CC BY 4.0.
Is there Math Data out there?
“Mathematics is the queen and servant of the sciences”, according to a quote attributed to Carl F. Gauss. This opinion can be a source of philosophical discussion. Is mathematics even a science? Why does it play a special role? And, connecting these questions to our concerns: how does research data relate to them? We cannot settle these questions in this short article, but they are a good starting point for discussing the mindset (the philosophy, if you wish) that should be adopted regarding research data in the mathematical sciences.
It is widely agreed that a science is any form of study that follows the scientific method: observation, formulation of hypotheses, experimental verification, extraction of conclusions, and back to observation. In most sciences (natural sciences and, to a great extent, also social sciences), observation requires gathering data from nature in the form of empirical records. In contrast, in pure mathematics observations can be made simply by reflecting on known theory and logic. In the natural sciences, nature is the ultimate judge of the validity or invalidity of a theory, and this experimental verification again requires gathering research data in the form of empirical records that support or refute a hypothesis. In mathematics, by contrast, experimental verification is replaced by formal proofs. Such characteristics have prompted some philosophers to claim that mathematics is not really a science but a meta-science, because it does not rely on empirical data. More pragmatically, it can tempt some researchers and mathematicians to say that (at least pure) mathematics does not use research data. But as you will guess, in the Mathematical Research Data Initiative (MaRDI), we advocate for quite the opposite view.
Firstly, some parts of mathematics do use experimental data extensively. Statistics (and probability) is the branch of mathematics for analyzing large collections of empirical records, and numerical methods are practical tools for performing computations on experimental data. Even in pure mathematics we can build lists of records (prime numbers, polytopes, groups…) that are somewhat experimental in nature.
Secondly, research data are not only empirical records. Data are any raw piece of information upon which we can build knowledge (we discussed the difference between data, information, and knowledge in the previous newsletter). When we talk about research data, we mean any piece of information that researchers can use to build new knowledge in the scientific domain in question, in this case mathematics. As such, articles and books are pieces of data. More precisely, theorems, proofs, formulas, and explanations are individual pieces of data. They have traditionally been bundled into articles and books and stored on paper, but nowadays they are largely available in digital form and accessible through computerized means.
Types of data
In modern mathematical research, we can find many types of data:
Documents (articles, books) and their constituent parts (theorems, proofs, formulas…) are data. Treating mathematical texts as data (and not only as mere containers where one deposits ideas in written form) recognizes that mathematical texts deserve the same treatment as other forms of structured data. In particular, FAIR principles and data management plans also apply to texts.
Literature references are data. Although bibliographic references are part of mathematical documents, we mention them separately because references are structured data. There is a defined set of fields (such as author, title, publisher…), there are standard formats (e.g. BibTeX), and there are databases of mathematical references (e.g. zbMATH, MathSciNet, …). This makes bibliographic references one of the most curated types of research data, especially in mathematics.
Formalized mathematics is data. Languages that implement formal logic, like Coq, HOL, Isabelle, Lean, Mizar, etc., are a structured version of the (unstructured) mathematical texts we just mentioned. They contain proofs verifiable by software and are playing an increasingly vital role in mathematics. Data curation is essential to keep these formalizations useful and bound to their human-readable counterparts.
Software is data, from small scripts that help with a particular problem to extensive libraries that integrate into larger frameworks (Sage, Mathematica, MATLAB…). Notebooks (Jupyter, …) are a form of research data that mixes text explanations and interactive prompts, so they need to be handled as both documents and software.
Collections of objects are data. Classifications play a major role in mathematics. Whether gathered by hand or produced algorithmically, the result can be a pivotal point from which many other works derive. Although the output of a classification can have more applications than the process used to arrive at it, it is essential that both the input algorithm (or manual process) and the output classification are clearly documented, so that the classification can be verified and reproduced independently, as well as reused in further projects.
Visualizations and examples are data. Examples and visual realizations of mathematical objects (including images, animations, and other types of graphics) can be very intricate and have enormous value for understanding and developing a theory. Although examples and visualizations can be omitted in more spartan literature, where they are provided, they deserve the same full curation as research data essential to logical proofs.
Empirical records are data. Of course, raw collections of empirical observations, intended to be processed to extract knowledge from the data itself or via statistical methods, are data that need special tools to handle. This applies to statistical databases, but also to machine learning models that require vast amounts of training data.
Simulations are data. Simulations are lists of records not measured from the outside world but generated by a program, usually representing the state of a system, possibly including some discretizations and simplifications of reality made in the modeling process. As with collections, the output simulation data is as necessary as the input source code that generates it. The simulation data is what allows us to draw conclusions, whereas verifying reproducibility requires that the input-to-output processing can be repeated by a third party, making it possible to spot flaws or errors in either the input or the output, or to rerun the simulation with different parameters.
Workflow documentation is data. More general than simulations, workflows involve several steps of data acquisition, data processing, data analysis, and extraction of conclusions in many scientific studies. An overview of the process is in itself a valuable piece of data, as it gives insights into the interplay of the different parts. A numerical algorithm can be individually robust and performant, yet not be the best fit for the task at hand; we can only spot such issues when we have a good overview of the entire process.
The building of mathematics
One key difference between mathematics and other sciences is the existence of proofs. Once a result is proven, it is true forever; it cannot be overruled by new evidence. The Pythagorean theorem, for instance, is today as valid and useful as it was in the times of Pythagoras (or even in the earlier times of the ancient Babylonians and Egyptians, who knew and used it; it was the Greeks, however, who invented the concept of proof, turning mathematics from a practice into a science). Euclid's Elements, written circa 300 BC and one of the most relevant books in the history of mathematics and mankind, perfectly represents the idea that mathematics is a building, or a network, in which each block is built on top of others, in a chain starting from some predetermined axioms. The image shows the dependency graph of propositions in Book I of the Elements.
Imagine now that we extend the above graph to include all propositions and theorems from all mathematical literature up to the current state of research. That huge graph would have millions of theorems and dependency connections and would be futile to draw on paper. This graph does not yet exist, physically or virtually, except as an abstract concept. Parts of this all-mathematics graph are stored in the brains of some mathematicians, or in the literature as texts, formulas, and diagrams. The breakthrough of our times is that it is conceivable to materialize this graph with today's technology, in the form of a knowledge graph similar to those being developed at MaRDI or Wikidata. The benefits of having such a graph in a computer system are many: we would be able to find any known theorem that applies to our problems, access the fundamental blocks of literature where those results were established, and find and verify logical connections in complex proofs, all while gaining a panoramic view of mathematics and its different areas.
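To make the idea concrete, here is a minimal Python sketch of how such a dependency graph could be stored and traversed. The propositions and their dependencies below are an invented toy fragment for illustration, not actual data from the Elements or from any MaRDI service.

```python
# A toy dependency graph of "propositions" (invented fragment, for illustration only).
depends_on = {
    "I.1": [],              # depends only on axioms/postulates
    "I.2": ["I.1"],
    "I.3": ["I.2"],
    "I.4": [],
    "I.5": ["I.3", "I.4"],
}

def all_prerequisites(prop, graph):
    """Return every proposition that `prop` (transitively) depends on."""
    seen = set()
    stack = [prop]
    while stack:
        for dep in graph[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(sorted(all_prerequisites("I.5", depends_on)))
# ['I.1', 'I.2', 'I.3', 'I.4']
```

A knowledge graph of all mathematics would, in essence, support this kind of query at the scale of millions of nodes, with rich metadata attached to each node and edge.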
The crucial point is that to succeed in such an endeavor, we must realize that mathematical knowledge is composed of pieces of data that require FAIR and careful data management, and a dedicated infrastructure to handle data at this scale. Although it is not completely out of MaRDI's scope, MaRDI itself does not aim to create a knowledge graph of all mathematical theorems; instead it focuses on the research data management required by today's researchers. The most advanced project aiming to realize this all-mathematics graph is probably found within the LEAN community (see also our interview with Johan Commelin).
Mathematics as a tool
The “special role” of mathematics among the sciences comes from the role it plays as a tool in every other science, to the point that a science is not considered mature until it has a mathematical formalization. The fact that mathematics can be used as the tool for doing science is the so-called “unreasonable effectiveness of mathematics in the natural sciences”. But once this role of mathematics as a tool is accepted, we must admit that, in theory, it is a very reliable tool. This is so, foremost, because of the logical building process described above: a proven theorem will not fail unexpectedly, and the rules of logic will not cease to exist tomorrow. In practice, however, relying on tools that someone else developed requires, first, that one can trust the tool to fulfill its intended purpose, and second, that one can learn how to use the tool effectively. This entails a responsibility of mathematics as a science, and of mathematicians as a community, towards other sciences and researchers.
As with physical tools, a craftsman must know their tools well in order to use them efficiently. But any modern toolmaker must also state clearly the technical characteristics of the tool, its intended use, the safety precautions, its quality standards and regulations, and so on. In our analogy, mathematicians must take care to prepare the results they produce impeccably, especially when it comes to algorithms and methods that will probably be applied by researchers in other fields of science.
Think of the calculus used in quantitative finance, statistical hypothesis tests used to analyze data in medicine, or computers tracking the exact location of spaceships. If mathematicians did not get their derivatives and integrals right, these methods would not provide reliable results, leading to wrong conclusions and often even putting people's lives in danger. It is of utmost importance to be able to fully trust at least the theoretical basis, especially since applied science already has to deal with rounding errors, components of nature that were not integrated into the original model, and the possibility of human failure. This requires verifiability of the results.
Concerning mastery of a tool, mathematical output must take into account its future reusability as a tool for other scientists. This means appropriate documentation, using appropriate standards for interoperability with existing tools, using licenses that allow unencumbered reuse, and, in general, following the community's agreed good practices as guidelines for research.
Modern science in the age of information and computation depends entirely on research data, but different fields have adapted their methods and practices with uneven success. Mathematics is not especially well placed in terms of managing research data and software in comparison to other fields.
Software development, especially in the open-source community, has been facing data management problems for decades, and some of the solutions are now standard practice in the industry. For instance, version control (with git as the de-facto standard tool) is a basic practice to track changes and improvements to source code (or indeed to any document or data). If we couple version control with a public repository (GitHub, GitLab…), we get a reliable method for publishing software and working collaboratively. Once a project has many contributors, merging problems appear when different teams develop in different directions. A solution is a continuous integration scheme with automated tests, which guarantees that your modifications, if adopted, will not break other parts of the project. The amount of security and verification applied in industry to any new development in big software projects (think, for instance, of new Linux kernel releases) is certainly unparalleled in most software projects in the scientific research community (with notable exceptions such as xSDK). This is often excused on the grounds that research is by nature experimental (in the sense of untested and unfinished), but academic and theoretical research should not have lower standards than industry research.
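As a deliberately tiny illustration of what such an automated test looks like, here is a hedged Python sketch; the function and its tests are invented for this example, and in practice a CI service would run them (for instance with pytest) on every proposed change before it is merged.

```python
# Research code we want to protect from regressions (invented example).
def moving_average(values, window):
    """Return the plain moving average of `values` with the given window size."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

# Tests that a continuous-integration pipeline would run automatically.
def test_moving_average_known_values():
    assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]

def test_moving_average_rejects_bad_window():
    try:
        moving_average([1, 2, 3], 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError for window=0")
```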
In Conversation with Günter Ziegler
"There's nothing more successful than success" Günter Ziegler says in our latest data date: best practices will be embraced by the community. We talk about what's his combinatorical view on research data, the need for classifications, and the difference between everlasting mathematical results and theories in physics.
Mathematics Meets Data: Highlights from MaRDI's Barcamp
What better way to get researchers to find out that research-data management is their topic than with a Barcamp? That way, every participant can explore their own experiences, questions, and approaches.
On July 4th, MaRDI hosted its first Barcamp on Research-Data Management in Mathematics at Bielefeld University's Center for Interdisciplinary Research. It was a joint effort involving the Bielefeld mathematics faculty, MaRDI, BiCDaS, and the Bielefeld Competence Center for Research Data.
The day began with a casual breakfast, where attendees mingled, discussed expectations, and chatted about questions. A poster showcasing research data types served as a useful conversation starter (find the download link for the poster in the welcome section of this newsletter issue).
Before the session pitches commenced, Lars Kastner and Pedro Costa-Klein delivered brief talks on code reproducibility and best practices for using Docker in the Collaborative Research Center 1456 (Mathematics of the Experiment) in Göttingen, respectively.
The session pitches revealed that the Barcamp had appealed to many young researchers unfamiliar with the topic. To address this, an introductory session on "What is research data?" kicked off the discussions. Meanwhile, those more experienced with research data management discussed ways to engage the mathematical community with the topic.
One of the defining features of a Barcamp is its participant-driven agenda. Attendees had the unique opportunity to shape the discussions and focus on the topics most pertinent to their research and data management needs. This resulted in a diverse set of topics. One session on research data management plans brought together experts from the Competence Center and mathematicians to exchange perspectives and requirements. A smaller group's discussions centered on Binderhub, whereas another tackled research data repositories and their adherence to FAIR principles. Additional sessions explored the peculiarities of mathematical research data, the importance of good documentation, and a hands-on session on an online database that collects and discusses ideas on FAIR data.
This Barcamp offered the mathematics community an exceptional platform to exchange insights and inquiries regarding research-data management within their discipline.
Teaching research-data management
A survey conducted in the summer of 2021 in German mathematics departments revealed that teaching mathematicians consider their students' awareness and knowledge of good scientific practice, authorship attribution, the FAIR principles, and research software to be too low. These are classical research data management (RDM) topics. Motivated by that need and by successful cross-disciplinary RDM courses at Bielefeld and Leipzig universities, six lectures on research data management for mathematicians took place in Leipzig in the summer term of 2023, to the teacher's knowledge the first course of its kind. The attendees came from a variety of career levels, including six undergraduate students, two PhD students, two postdocs, and five MaRDIans. This contributed to lively discussions centered on the properties and common problems of mathematical research data, metadata standards for papers and the difficulty of deciding on appropriate metadata for mathematical results, the scientific method, good scientific practice, and how to write, cite, and document mathematics. Feedback for the course was very good, with students appreciating the interactive atmosphere, the time allocated for questions, and the informal nature of the classes. A one-day course on maths RDM in Magdeburg in October will build on these first successful sessions and discuss questions of reproducibility and repositories, in addition to introductory topics. Lecture notes for both are in the making and will be made publicly available for a second installment next summer term, for free use and reuse by any mathematician interested in RDM.
MaRDMO Workshop at the NFDI-MatWerk Conference
The "1st Conference on Digital Transformation in Materials Science and Engineering - NFDI-Matwerk Conference" took place in Siegburg between 26-29.06.2023. With 30 talks, 17 posters, 10 workshops, and 160 participants (on-site and online), the conference provided an ideal setting for the urgently needed transformation in materials science. In addition to status updates from each NFDI-MatWerk task area and various interdisciplinary use cases, the conference initiated collaborations between different NFDI consortia and new community participants, emphasizing their role in shaping the future of NFDI-MatWerk. Several NFDI consortia, namely NFDI4Chem, NFDI4energy, DAPHNE4NFDI, and FAIRmat, also gave keynote presentations, highlighting the need for collaboration.
Marco Reidelbach from TA4 attended the conference on behalf of the MaRDI consortium to present MaRDMO, a plugin for the Research Data Management Organiser (RDMO) for documenting, publishing, and searching interdisciplinary workflows. Though participation was low at the 100-minute demonstration, discussions vital for the further development of MaRDMO ensued. The central point of the discussion was the automation of the documentation process to minimize additional work for researchers, thereby increasing the acceptance of MaRDMO. We also discussed the use of RDMO, which on paper appears to be an ideal interface to all research disciplines, but was completely unknown to the workshop participants. Here, the NFDI in particular is also called upon to take a clear stand. A good two-thirds of the consortia have declared their support for RDMO, while the remaining consortia want to rely on alternatives or are still undecided.
Overall, the NFDI-MatWerk conference showed that the defining infrastructural issues, setting aside the concrete scientific content, differ little or not at all from those in the MaRDI consortium and the other consortia at the conference. The construction of knowledge graphs and the harmonization of ontologies are central problems that require a joint effort and make it necessary to leave one's own comfort zone.
MaRDI at CoRDI
MaRDI was present at the first Conference on Research Data Infrastructure (CoRDI), held in Karlsruhe from 12 to 14 September 2023. This interdisciplinary event brought all the NFDI consortia together to present their projects in overview talks and detailed discussions. The conference was a unique opportunity to exchange experiences and ideas among a wide range of communities that have different needs but share common challenges and solutions regarding research data.
MaRDI presented three talks and two posters. The general conference proceedings are linked in the recommended further reading section at the end of this newsletter issue. We provide links to individual sections here:
Talks:
MaRDI. Building Research Data Infrastructures for Mathematics and the Mathematical Sciences. Renita Danabalan, Michael Hintermüller, Thomas Koprucki, Karsten Tabelow.
MaRDIFlow: A Workflow Framework for Documentation and Integration of FAIR Computational Experiments. Pavan L. Veluvali, Jan Heiland, Peter Benner.
Building Ontologies and Knowledge Graphs for Mathematics and its Applications. Björn Schembera, Frank Wübbeling, Thomas Koprucki, Christine Biedinger, Marco Reidelbach, Burkhard Schmidt, Dominik Göddeke, Jochen Fiedler.
Posters:
MaRDMO Plugin. Document and Retrieve Workflows Using the MaRDI Portal. Marco Reidelbach, Eloi Ferrer, Marcus Weber.
Spreading the Love for Mathematical Research Data. Tabea Bacher, Christiane Görgen, Tabea Krause, Andreas Matt, Daniel Ramos, Bianca Violet.
Math Meets Information Specialists, October 09 - 11, 2023, MPI MiS, Leipzig
MaRDI invites information specialists, librarians, data stewards, and mathematicians to discuss mathematical research data, present their own ideas and services, and make new connections in a three-day noon-to-noon workshop with talks, hands-on sessions, and a barcamp. The workshop will be held in German.
More information:
- in German
Data-Driven Materials Informatics, March 4 - May 24, 2024
The aim of this long program at IMSI is to bring together a diverse scientific audience, both between scientific fields (physical sciences, materials sciences, biophysics, etc.) and within mathematics (mathematical modeling, numerical analysis, statistics, data analysis, etc.), to make progress on key questions of materials informatics.
More information:
- in English
RDM with LinkAhead, September 29, 2023, online
At the NFDI4Chem Stammtisch, the research data management software LinkAhead will be introduced. This agile, open-source software toolbox enables professional data management in research where other approaches are too rigid and inflexible. It will make your data findable and reusable.
More information:
NFDI Code of Conduct
The Consortial assembly, comprising the speakers of each consortium, voted on 27 June 2023 to adopt the code of conduct for the NFDI. This Code of Conduct is intended to provide a binding framework for effective collaboration within the NFDI association.
More information:
- in German
- A generic JSON-based file format suitable for computations in computer algebra is described in the paper A FAIR File Format for Mathematical Software by Antony Della Vecchia, Michael Joswig, and Benjamin Lorenz. The file format is implemented in the computer algebra system OSCAR, but the paper also indicates how it can be used in other contexts.
- To understand our world, we classify things. A famous example is the periodic table of elements, which describes the properties of all known chemical elements and classifies the building blocks we use in physics, chemistry, and biology. In mathematics, and algebraic geometry in particular, there are many instances of similar periodic tables, describing fundamental classification results. In his article, The Periodic Tables of Algebraic Geometry, Pieter Belmans invites you on a tour of some of these results. It appeared within the series 'Snapshots of modern mathematics from Oberwolfach'.
- Play with the educational tool Classified graphs. With this open-source web app you can draw any graph, or select one from a collection, and then compute a few invariants, such as the adjacency determinant. In the Identify mode, you are challenged to find out which of the graphs in the collection is shown as a target. The tool is part of Pieter Belmans's project Classified maths.
- In each episode of the podcast "Mathematical Objects", Katie Steckles and Peter Rowlett chat about some aspect of mathematics using a mathematical object as inspiration. The podcast is also available on YouTube.
- Proceedings of the Conference on Research Data Infrastructure (CoRDI):
https://www.tib-op.org/ojs/index.php/CoRDI/issue/view/12
Welcome to the fifth issue of the MaRDI Newsletter on mathematical research data. In the first four issues, we focused on the FAIR principles. Now we move to a topic which makes use of FAIR data and also implements the FAIR principles in data infrastructures. So without further ado, let me introduce you to the ultimate use case of FAIR data: knowledge graphs.
by Ariel Cotton, licensed under CC BY-SA 4.0.
Knowledge graphs are very natural and represent information similarly to how we humans think. They come in handy when you want to avoid redundancy in storing data (as often happens with tabular methods), and also for complex queries over datasets.
This newsletter issue offers some insight into the structure of knowledge, examples of knowledge graphs, including some specific to MaRDI, an interview with a knowledge graph expert, as well as news and announcements related to research data.
In the last issue, we asked how long it would take you to find and understand your own research data. These are the results:
Now we ask you for specific challenges when searching for mathematical data. You may choose from the multiple-choice options or enter something else you faced.
Click to enter your challenges!
You will be taken to the results page automatically after submitting your answer. Additionally, the current results can be accessed here.
The knowledge ladder
We are not sure exactly how humans store knowledge in their brains, but we certainly pack concepts into units and then relate those conceptual units to one another. For example, if asked to list animals, nobody recalls an alphabetical list (unless you explicitly train yourself to remember such a list). Instead, you start the list with something familiar, like a dog; then you recall that a dog is a pet animal, and you list other pet animals like cat or canary. Then you recall that a canary is a bird, and you list other birds, like eagle, falcon, owl… When you run out of birds, you recall that birds fly in the air, which is one environmental medium. Another environmental medium is water, and this prompts you to start listing fishes and sea animals. This suggests that we can represent human knowledge in the form of a mathematical graph: concepts are nodes, and relationships are edges. This structure is also ingrained in language, which is the way humans communicate and store knowledge. All languages in the world, across all cultures, have nouns, verbs, and adjectives, and establish relationships through sentences. Almost every language organizes sentences around a subject, a verb, and an object (in some order: SVO, SOV, VSO, etc.). The subject and the object are typically nouns or pronouns; the verb is often a relationship. A sentence like “my mother is a teacher” encodes the following knowledge: the person “my mother” is node 1, “teacher” is node 2, and “has as a job” is a relational edge from node 1 to node 2. There is also a node 3, the person “me”, and a relationship “is the mother of” from node 1 to node 3 (which implies a reciprocal relationship “is a child of” from node 3 to node 1).
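As a minimal sketch, the triples encoded by that example sentence could be written down as follows in Python; the node and edge names are invented for this illustration.

```python
# Encoding "my mother is a teacher" as labeled edges (triples); names are illustrative.
triples = [
    ("my_mother", "has_job", "teacher"),
    ("my_mother", "is_mother_of", "me"),
    ("me", "is_child_of", "my_mother"),  # the implied reciprocal relationship
]

def statements_about(node, graph):
    """Return all (predicate, object) pairs whose subject is `node`."""
    return [(p, o) for (s, p, o) in graph if s == node]

print(statements_about("my_mother", triples))
# [('has_job', 'teacher'), ('is_mother_of', 'me')]
```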
From this construction, we can expect to obtain an abstract representation of human knowledge that we can store, retrieve, and search with a computer. But not all data automatically gives knowledge, and raw knowledge is not all you may need to solve a problem. This distinction is sometimes referred to as the “knowledge ladder”, although the terminology is not universally agreed upon. In this ladder, data is the lowest level of information: data are raw input values that we have collected with our senses or with a sensor device. Information is data tagged with meaning: I am a person, that person is called Mary, teaching is a job, this thing I see is a dog, this list of numbers gives daily temperatures in Honolulu. Knowledge is achieved when we find relationships between bits of information: Mary is my mother, Mary's job is teacher, these animals live together and compete for food; pressure, temperature, and volume in a gas are related by the gas law PV=nRT. Insight is discerning: it is singling out the information that is useful for your purpose, and finding that seemingly unrelated concepts behave alike. Finally, wisdom is understanding the connections between concepts; it is the ability to explain step by step how concept A relates to concept B. This ladder is illustrated in the image above. From this point of view, “research” means knowing and understanding all portions of human knowledge that fall into or close to your domain, and then enlarging the graph with more nodes and edges, for which you need both insight and wisdom.
The advent of knowledge graphs
Knowledge graphs (KGs) as a theoretical construction have been discussed in information theory, linguistics, and philosophy for at least five decades, but it is only in this century that computers have allowed us to implement algorithms and data retrieval at a practical and massive scale. Google introduced its own knowledge graph in 2012; you may be familiar with it. When you look up a person, a place, etc. on Google, a small box to the right displays some key information, such as the birthdate and achievements of a person or the opening times of a shop. This information is not a snippet from a website; it is information collected from many sources and packed into a node of a graph. Those nodes are then linked together by affinity relationships. For instance, if you look up “Agatha Christie”, you will see an “infobox” with her birthdate, date of death, a short description extracted from Wikipedia, a photograph… and also a “People also search for” list that will bring you to her family relatives, such as Archibald Christie, or to other British authors, such as Virginia Woolf.
But probably the biggest effort to bring all human knowledge into structured data is Wikidata. Wikidata is a sister project of Wikipedia. Wikipedia aims to gather all human knowledge in the form of encyclopedic articles, that is, as non-structured, human-readable data. Wikidata, by contrast, is a knowledge graph. It is a directed labeled graph, made of triples of the form subject (node) - predicate (edge) - object (node). The nodes and edges are labeled; in fact, they carry a whole list of attributes.
The Wikidata graph is not designed to be used directly by humans. It is designed to retrieve information automatically, to be a “base of truth” that can be relied on. For instance, it can check automatically that all languages of Wikipedia state basic facts correctly (birthplace, list of authored books…), and can be used by external services (such as Google and other search engines or voice assistants) to offer correct and verifiable answers to queries.
In practice, nodes are pages, for instance, this one for Agatha Christie. Inside the page, it lists some “statements”, which are the labeled edges to other nodes. For example, Agatha Christie is an instance of a human, her native language is English, and her field of work is crime novel, detective literature, and others. If we compare that page with the Agatha Christie entry in the English Wikipedia, clearly the latter contains more information, and the Wikidata page is less convenient for a human to read. Potentially, all the ideas described with English sentences in Wikipedia could be represented by relationships in the Wikidata graph, but this task is tedious and difficult for a human, and AI systems are not yet sufficiently developed to make this conversion automatically.
In the backend, Wikidata is stored in relational SQL databases (the same Mediawiki software as used for Wikipedia), but the graph model is that of subject-predicate-object triples as defined in the web standard RDF (Resource Description Framework). This graph structure can be explored and queried with the language SPARQL (SPARQL Protocol and RDF Query Language). Note that we usually use the verb “query”, as opposed to “search”, when we want to retrieve information from a graph, database, or other structured source of information.
Thus, one can access the Wikidata information in several ways. First, one can use the web interface to access single nodes. The web interface has a search function that allows one to look up pages (nodes) that contain a certain search string. However, it is much more insightful to get information that takes advantage of the graph structure, that is, querying for nodes that are connected to some topic by a particular predicate (statement), or that have a particular property. For Wikidata, we have two main tools: direct SPARQL queries, and the Scholia plug-in tool.
The web interface and API at query.wikidata.org allow you to send queries in the SPARQL language. This is the most powerful kind of search; you can browse the examples on that site. The output can be a list, a map, a graph, etc. There is a query-builder help function, but essentially it requires some familiarity with the SPARQL language. Scholia, on the other hand, is a plug-in tool that helps query and visualize the Wikidata graph. For instance, searching for “covid-19” via Scholia offers a graph of related topics, a list of authors and recent publications on the topic, organizations, etc., in various visual forms.
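To give a flavor of what such a query looks like, here is a minimal Python sketch that sends a SPARQL request to the public Wikidata endpoint. The query itself is illustrative and not part of any MaRDI service, and the identifiers used (P50 for "author", Q35064 for Agatha Christie) should be double-checked on wikidata.org before reuse.

```python
# Minimal sketch: ask Wikidata for works whose author (P50) is Agatha Christie (Q35064).
import requests

query = """
SELECT ?work ?workLabel WHERE {
  ?work wdt:P50 wd:Q35064 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "kg-newsletter-example/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["workLabel"]["value"])
```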
Knowledge graphs, artificial intelligence, and mathematics
Knowledge graphs are a hot research area in connection with artificial intelligence. On the one hand, there is the challenge of creating a KG from a natural-language text (for instance, in English). While detecting grammar and syntax rules (subject, verb, object) is relatively doable, creating a knowledge graph requires encoding the semantics, that is, the meaning of the sentence. In the example from a few paragraphs above, “my mother is a teacher”, to extract the semantics we need the context of who “me” is (who is saying the sentence), we need to check whether we already know the person “my mother” (her name, some kind of identifier), and so on. The node for that person may belong to a small KG with family or contextual information, while “teacher” can be part of a more general KG of common concepts.
In the case of mathematics, extracting a KG from natural language is a tremendous challenge, unfeasible with today's techniques. Take a theorem statement: it contains definitions, hypotheses, and conclusions, and each one has a different context of validity (the conclusion is only valid under the hypotheses, but that is what you need to prove). Then imagine that you start your proof by contradiction, so you have several sentences that are valid under the assumption that the hypotheses of the theorem hold but the conclusion does not. At some point, you want to find a contradiction with your previous knowledge, thus proving the theorem. The current knowledge graph paradigm is simply not suitable for following this type of argument. The closest thing to structured data for theorems and proofs is formal logical languages, and there are practical implementations such as the LEAN Theorem Prover. LEAN is a programming language that can encode symbolic manipulation rules for expressions. A proof by algebraic manipulation of a mathematical expression can therefore be described as a list of manipulations of an original expression (move a term to the other side of the equals sign, raise the second index in this tensor using a metric…). Writing proofs in LEAN can be tedious, but it has the benefit of being automatically verifiable by a machine: there is no need for a human referee. Of course, we are still far from an AI that checks the validity of an informally written proof without human intervention, let alone one that figures out proofs of conjectures on its own. On the other hand, a dependency graph of theorems, derived in a logical chain from some axioms, is something that a knowledge graph like the MaRDI KG would be suitable to encode.
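For a taste of what machine-checkable mathematics looks like, here is a minimal sketch in Lean 4 syntax (invented for this newsletter, not taken from any formalization project; names and syntax differ between Lean versions and libraries):

```lean
-- A statement the proof checker verifies by computation; no human referee needed.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- A proof that is literally "a list of manipulations": rewrite with known lemmas.
example (a b : Nat) : a + b + 0 = b + a := by
  rw [Nat.add_zero, Nat.add_comm]
```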
In any case, structured knowledge (in the form of a KG or other forms, such as databases) is a fundamental ingredient for providing AI systems with a source of truth. Recent advances in generative AI include the famous conversation bots such as ChatGPT and other Large Language Models (LLMs), which are impressive in the sense that they can generate grammatically correct text with meaningful sentences while maintaining a conversation. However, these systems are famous for not being able to distinguish truth from falsehood (to be precise, the AI is trained on text data that is assumed to be mostly true, but it cannot make any logical deductions). If we ask an AI for the biography of a nonexistent person, it may simply invent one in trying to fulfill the task. If we contradict the AI with a plain fact, it will probably just accept our input despite its previous answer. Currently, conversational AI systems are not capable of rebutting false claims by providing evidence. In the likely future, however, a conversational AI with access to a knowledge base (a KG, database, or other resource) will be able to process queries and generate answers in natural language, but also to check verified facts and present relevant information extracted from the knowledge base. An example in this direction is the Wolfram Alpha plug-in for ChatGPT. With some enhanced algorithms to traverse and explore a knowledge graph, we may witness AI systems stepping up from Knowledge to Insight, or further up the ladder.
One of the mottos of MaRDI is “Your Math is Data”. Indeed, from an information theory perspective, all mathematical results (theorems, proofs, formulas, examples, classifications) are data, and some mathematicians also use experimental or computational data (statistical datasets, algorithms, computer code…). MaRDI intends to create the tools, the infrastructure, and the cultural shift to manage and use all research data efficiently. In order to climb up the “knowledge ladder” from Data to Information and Knowledge, the Data needs to be structured, and knowledge graphs are one excellent tool for that goal.
AlgoData
Several initiatives within MaRDI are based on knowledge graphs. A first example is AlgoData (requires MaRDI / ORCID credentials), a knowledge graph of numerical algorithms. In this KG, the main entities (nodes) are algorithms that solve particular problems (such as solving linear systems of equations or integrating differential equations). Other entities in the graph are supporting information for the algorithms, such as articles, software (code), or benchmarks. For example, we want to encode that algorithm 1 solves problem X, is described in article Y, is implemented in software Z, and scores p points in benchmark W. A use case would be querying for algorithms that solve a particular type of problem, comparing the candidates using certain benchmarks, and retrieving the code to be used (ideally, interoperable with your system setup).
AlgoData has a well-defined ontology. An ontology (from the Greek, loosely, “study or discourse of the things that exist”) is the set of concepts relevant to your domain. For instance, on an e-commerce site, “article”, “client”, “shopping cart”, and “payment method” are concepts that need to be defined and included in the implementation of the e-commerce platform. For knowledge graphs, the list would include all types of nodes and all labels for the edges and other properties. In general-purpose knowledge graphs such as Wikidata, the ontology is huge, and for practical purposes the user (human or machine) relies on search and suggestion algorithms to identify the property that best fits their intention. In contrast, for special-purpose knowledge graphs such as AlgoData, a reduced and well-defined ontology is possible, which simplifies the overall structure and the search mechanisms.
The ontology of AlgoData (as of June 2023, under development) is the following:
Classes:
Algorithm, Benchmark, Identifiable, Problem, Publication, Realization, Software.
Object Properties:
analyzes, applies, documents, has component, has subclass, implements, instantiates, invents, is analyzed in, is applied in, is component of, is documented in, is implemented by, is instance of, is invented in, is related to, is solved by, is studied in, is subclass of, is surveyed in, is tested by, is used in, solves, specializedBy, specializes, studies, surveys, tests, uses.
Data Properties:
has category, has identifier.
We can display this ontology as a graph:
Currently, AlgoData implements two search functions: “Simple search”, which matches words in the content, and “Graph search”, which queries for nodes in the graph satisfying certain conditions on their connections. The main AlgoData page gives a sneak preview of the system (these links are password protected, but MaRDI team members and any researcher with a valid ORCID identifier can access them).
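To illustrate what a graph search over such an ontology amounts to, here is a toy Python sketch. It does not use the actual AlgoData API or data; the entities are invented, and the relation names are only loosely modeled on the ontology listed above.

```python
# Invented triples, loosely following the AlgoData relation names above.
edges = [
    ("algo:CG",    "solves",     "problem:linear_system"),
    ("algo:GMRES", "solves",     "problem:linear_system"),
    ("sw:PETSc",   "implements", "algo:GMRES"),
    ("pub:Saad86", "documents",  "algo:GMRES"),
]

def solved_and_implemented(problem, triples):
    """Algorithms that solve `problem` and have at least one implementation."""
    solvers = {s for (s, p, o) in triples if p == "solves" and o == problem}
    implemented = {o for (s, p, o) in triples if p == "implements"}
    return solvers & implemented

print(solved_and_implemented("problem:linear_system", edges))
# {'algo:GMRES'}
```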
A project closely related to AlgoData is the Model Order Reduction Benchmark (MORB) and its ontology (MORBO). This sub-project focuses on creating benchmarks for algorithms that perform model order reduction (a standard technique in mathematical modeling for reducing the simulation time of large-scale systems) and has its own knowledge graph and ontology tailored to this problem. More information can be found on the MOR Wiki and the MaRDI TA2 page.
The MaRDI portal and knowledge graph
The main output from the MaRDI project will also be based on a knowledge graph. The MaRDI Portal will be the entry point to all services and resources provided by MaRDI. The portal will be backed by the MaRDI knowledge graph, a big knowledge graph scoped to all mathematical research data. You can already have a sneak peek to see the work in progress.
The architecture of the MaRDI knowledge graph follows that of Wikidata and is compatible with it. In fact, many entries of Wikidata have been imported into the MaRDI KG and vice versa. The MaRDI knowledge graph will also integrate many other resources from open knowledge, thus leveraging many existing projects. A non-exhaustive list includes:
- The MaRDI AlgoData knowledge graph described above.
- Other MaRDI knowledge graphs, such as the MORWiki or the graph of Workflows with other disciplines.
- The zbMATH Open repository of reviews of mathematical publications.
- The swMATH Open database of mathematical software.
- The NIST Digital Library of Mathematical Functions (DLMF).
- The CRAN repository of R packages.
- Mathematical publications in arXiv.
- Mathematical publications in Zenodo.
- The OpenML platform of Machine Learning projects.
- Mathematical entries from Wikidata.
- Entries added manually from users.
The MaRDI Portal does not intend to replace any of those projects, but to link all those openly available resources together in a big knowledge graph of greater scope. As of June 2023, the MaRDI KG has about 10 million triples (subject-predicate-object as in the RDF format). As with Wikidata, the ontology is too big to be listed, and it is described within the graph itself (e.g. the property P2 is the identifier for functions from the DLMF database).
Let us see some examples of entries in the MaRDI KG. A typical entry node in the MaRDI KG (in this example, the program ggplot2) is very similar to a Wikidata entry. This page is a human-friendly interface, but we can also get the same information in machine-readable formats such as RDF or JSON.
For the end user, it is probably more useful to query the graph for connections. As with Wikidata, we can query the MaRDI knowledge graph directly in SPARQL. Enabling the Scholia plug-in to work with the MaRDI KG is a work in progress; currently, the beta MaRDI-Scholia instance queries against Wikidata.
Some queries are available in the MaRDI KG but not on Wikidata, for instance queries for formulas in the DLMF: here, formulas that use the gamma function, or formulas that contain sine and tangent functions (the corpus of the database is still small, but it illustrates the possibilities). Wikidata can nevertheless also query for symbols in formulas.
The MaRDI KG is still in an early stage of development, and not ready for public use (all the examples cited are illustrative only). Once the KG begins to grow, mostly from open knowledge sources, the MaRDI team will improve it with some “knowledge building” techniques.
One such technique is the automated retrieval of structured information. For instance, the bibliographic references in an article are structured information, since they follow one of a few formats, and there are standards (BibTeX, zbMATH/MR identifiers, …).
Another technique is link inference. This addresses the problem of low connectivity in graphs made by importing sub-graphs from multiple third-party sources, which may result in very few links between the sub-graphs. For instance, an article citing some references and a GitHub repository citing the same references are likely talking about the same topic. These inferences can then be reviewed by a human if necessary.
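A minimal sketch of one such heuristic (the threshold, similarity measure, and identifiers are all invented for illustration):

```python
# If an article and a repository cite many of the same references,
# propose a link between them for human review. Identifiers are made up.
article_refs = {"arxiv:1234.5678", "doi:10.1000/xyz", "zbl:0001.00001"}
repo_refs    = {"doi:10.1000/xyz", "zbl:0001.00001", "doi:10.2000/abc"}

overlap = article_refs & repo_refs
similarity = len(overlap) / min(len(article_refs), len(repo_refs))

if similarity >= 0.5:  # threshold chosen arbitrarily for the example
    print("Propose a 'related to' link for review; shared references:", overlap)
```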
Another enhancement would be to improve search in natural language so that more complex queries can be made in plain English without the need to use SPARQL language.
The latest developments of the MaRDI Portal and its knowledge graph will be presented at a mini-symposium at the forthcoming DMV annual meeting in Ilmenau in September 2023.
- Knowledge ladder: Steps on which information can be classified, from the rawest to the most structured and useful. Depending on the author, these steps can be enumerated as Data, Information, Knowledge, Insight, Wisdom.
- Data: raw values collected from measurements.
- Information: Data tagged with its meaning.
- Knowledge: Pieces of information connected together with causal or other relationships.
- Knowledge base: A set of resources (databases, dictionaries…) that represent Knowledge (as in the previous definition).
- Knowledge graph: A knowledge base organized in the form of a mathematical graph.
- Insight: Ability to identify relevant information from a knowledge base.
- Wisdom: Ability to find (or create) connections between information points, using existing or new knowledge relationships.
- Ontology: Set of all the terms and relationships relevant to describe your domain of study. In a knowledge graph, the types of nodes and edges that exist, with all their possible labels.
- RDF (Resource Description Framework): A web standard to describe graphs as triples (subject - predicate - object).
- SPARQL (SPARQL Protocol and RDF Query Language): A language for sending queries (information retrieval/manipulation requests) to graphs in RDF format.
- Wikipedia: a multi-language online encyclopedia based on articles (non-structured human-readable text).
- Wikidata: an all-purpose knowledge graph intended to host data relevant to multiple Wikipedias. As a byproduct, it has become a tool to develop the semantic web, and it acts as a glue between many diverse knowledge graphs.
- Semantic web: a proposed extension of the web in which the content of a website (its meaning, not just the text strings) is machine-readable, to improve search engines and data discovery.
- Mediawiki: the free and open-source software that runs Wikipedia, Wikidata, and also the MaRDI portal and knowledge graph.
- Scholia: A plug-in software for Mediawiki, to enhance visualization of data queries to a knowledge graph.
- AlgoData: a knowledge graph for numerical algorithms, part of the MaRDI project.
In Conversation with Daniel Mietchen
In this episode of Data Dates, Daniel and Tabea talk about knowledge graphs: the general concept, how it can help you find the proverbial needle in the haystack, and specific challenges posed by mathematical structures. In addition, we also hear about the MaRDI knowledge graph and what it brings to mathematicians.
Leibniz MMS Days
The 6th Leibniz MMS Days, organized by the Leibniz Network "Mathematical Modeling and Simulation (MMS)", took place this year from April 17 to 19 in Potsdam at the Leibniz Institute for Agricultural Engineering and Bioeconomy. A small MaRDI faction, consisting of Thomas Koprucki, Burkhard Schmidt, Anieza Maltsi, and Marco Reidelbach, made their way to Potsdam to participate.
This year's MMS Days placed a special emphasis on "Digital Twins and Data-Driven Simulation," "Computational and Geophysical Fluid Dynamics," and "Computational Material Science," which were covered in individual workshops. There was also a separate session on research data and its reproducibility, in which Thomas introduced the MaRDI consortium with its goals and vision and promoted two important future MaRDI services, AlgoData and ModelDB: two knowledge graphs for documenting algorithms and mathematical models. Marco concluded the session by providing insight into the MaRDMO plugin, which links established software in research data management with the different MaRDI services, thus enabling FAIR documentation of interdisciplinary workflows. The presentation of ModelDB was met with great interest among the participants and was the subject of lively discussions afterwards and in the following days. Some aspects of these discussions have already been incorporated into the further design of ModelDB.
In addition to the various presentations, staff members gave a brief insight into the institute's different fields of activity, such as the optimal design of packaging and the use of drones in the field, during a guided tour. The highlight of the tour was a visit to the 18-meter wind tunnel, which is used to study flows in and around agricultural facilities. So MaRDI actually got to know its first cowshed, albeit in miniature.
MaRDI RDM Barcamp
MaRDI, supported by the Bielefeld Center for Data Science (BiCDaS) and the Competence Center for Research Data at Bielefeld University, will host a Barcamp on research-data management in mathematics on July 4th, 2023, at the Center for Interdisciplinary Research (ZiF) in Bielefeld.
More information:
- in English
Working group on Knowledge Graphs
The NFDI working group aims to promote the use of knowledge graphs in all NFDI consortia, to facilitate cross-domain data interlinking and federation following the FAIR principles, and to contribute to the joint development of tools and technologies that enable the transformation of structured and unstructured data into semantically reusable knowledge across different domains. You can sign up to the mailing list of the working group here.
Knowledge graphs in other NFDI consortia can be found for instance at the NFDI4Culture KG (for cultural heritage items) or at the BERD@NFDI KG (for business, economic, and related data items).
More information:
- in English
NFDI-MatWerk Conference
The 1st NFDI-MatWerk Conference to develop a common vision of digital transformation in materials science and engineering will take place from 27 - 29 June 2023 as a hybrid conference. You can still book your ticket for either on-site or online participation (online tickets are even free of charge).
More information:
- in English
Open Science Barcamp
The Barcamp is organized by the Leibniz Strategy Forum Open Science and Wikimedia Deutschland. It is scheduled for 21 September 2023 in Berlin and is open to everybody interested in discussing, learning more about, and sharing experiences on practices in Open Science.
More information:
- in English
- The department of computer science at Stanford University offers this graduate-level research seminar, which includes lectures on knowledge graph topics (e.g., data models, creation, inference, access) and invited lectures from prominent researchers and industry practitioners.
It is available as a 73-page pdf document, divided into chapters:
https://web.stanford.edu/~vinayc/kg/notes/KG_Notes_v1.pdf
and additionally as a video playlist:
https://www.youtube.com/playlist?list=PLDhh0lALedc7LC_5wpi5gDnPRnu1GSyRG
- Video lecture on knowledge graphs by Prof. Dr. Harald Sack. It covers the topics of basic graph theory, centrality measures, and the importance of a node.
https://www.youtube.com/watch?v=TFT6siFBJkQ
- The Working Group (WG) Research Ethics of the German Data Forum (RatSWD) has set up the internet portal “Best Practice for Research Ethics”. It bundles information on the topic of research ethics and makes it accessible.
https://www.konsortswd.de/en/ratswd/best-practices-research-ethics/
Welcome to the fourth MaRDI Newsletter! This time we will investigate the fourth and final FAIR principle: Reusability. We consider the R in FAIR to capture the ultimate aim of sustainable and efficient handling of research data, that is, to make your digital maths objects reusable for others and to reuse their results in order to advance science. In the words of the scientific computing community, we want mathematics to stand on the shoulders of giants rather than to build on quicksand.
licensed under CC BY-NC-SA 4.0.
To achieve this, we need to make sure every tiny piece in a chain of results is where it should be, seamlessly links to its predecessors and subsequent results, is true, and is allowed to be embedded in the puzzle we try to solve. This last point is crucial, so we dedicate our main article in this issue of the newsletter to the topic of documentation, verifiability, licenses, and community standards for mathematical research data. We also feature some nice pure-maths examples we made for Love Data Week, report on the first MaRDI workshop for researchers in theoretical fields who are new to FAIR research data management, and entertain you with surveys and news from the world of research data.
To get into the mood of the topic, here is a question for you:
If you need to (re)use research data you created some time ago, how much time would you need to find and understand it? Would you have the data at your fingertips, or would you have to search for it for several days?
You will be taken to the results page automatically after submitting your answer, where you can find out how long other researchers would take. Additionally, the current results can be accessed here.
On the shoulders of giants
The famous quote from Newton, “If I have seen further, it is by standing on the shoulders of giants”, usually refers to how science is built on top of previous knowledge, with researchers basing their results on the works of scientists who came before them. One could reframe it by saying that scientific knowledge is reusable. This is a fundamental principle in the scientific community: once a result is published, anyone can read it, learn how it was achieved, and then use it as a basis for further research. Reusing knowledge is also ingrained in the practice of scientific research as the basis of verifiability. In the natural sciences, the scientific method demands that experimental data back your claims. In mathematical research, the logical construction demands a mathematical proof of your claims. This means that, for good scientific practice, your results must be verifiable by other researchers, and this verification requires reuse not only of the mental processes but also of the data and tools used in the research.
Research data must be as reusable as the results and publications they support. From the perspective of modern, intensively data-driven science, this demand poses some challenges. Some barriers to reusability are technical, because of incompatibilities of standards or systems, and this problem is largely covered in the Interoperability principle of FAIR. But other problems such as poor documentation or legal barriers can be even bigger obstacles than technical inconveniences.
Reuse of research data is the ultimate goal of FAIR principles. The first three principles (Findable, Accessible, Interoperable) are necessary conditions for effective reuse of data. What we list here as “Reusability” requirements are all the remaining conditions, often more subjective or harder to evaluate, that appeal to the final goal of having a piece of research data embedded in a new chain of results.
To be precise, the Reusability principle requires data and metadata to be richly characterised with descriptors and attributes. Anyone potentially interested in reusing the data should easily find out if that data is useful for their purposes, how it can be used, how it was obtained, and any other practical concerns for reusing it (a minimal metadata sketch follows the list below). In particular, data and metadata should be:
- associated with detailed provenance,
- released with a clear and accessible data usage license,
- broadly aligned with agreed community standards of their discipline.
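As a toy illustration of what such descriptors might look like in practice, here is a minimal machine-readable record. The field names and values are invented for this sketch and do not follow any particular metadata standard.

```python
import json

# Hypothetical metadata record for a dataset; all names and values are
# illustrative, not taken from an actual schema.
record = {
    "title": "Simulation runs for a solver comparison",
    "provenance": {
        "created_by": "Jane Doe",
        "created_on": "2023-05-02",
        "generated_with": "solver-lib 1.4.2",          # software and version
        "derived_from": "doi:10.1234/example-input",   # placeholder identifier
        "processing_steps": ["removed invalid rows", "converted units to SI"],
    },
    "license": "CC-BY-4.0",
    "community_standard": "discipline template v2 (hypothetical)",
}

print(json.dumps(record, indent=2))
```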
Documentation
It is essential for researchers to acknowledge that the research data they generate is a first-class output of their scientific research and not only a private sandbox that helps them produce some public results. Hence, research data needs to be curated with reusability in mind, documenting all details (even some that might seem irrelevant or trivial to its authors) related to its source, scope, or use. In data management, we use the term “provenance” to describe the story and rationale behind that data. Why does it exist, what problem was it addressing, how it was gathered, transformed, stored, used… all this information might be relevant for a third party that first encounters the data and has to judge if it is relevant for themselves or not.
In experimental data, it is important to document exactly what the purpose of the experiment was, which protocol was followed to gather the data, who did the fieldwork (in case contact information is needed), which variables were recorded, how the data is organized, which software was used, which version of the dataset it is, etc. As an antithesis of the ideal situation, imagine that you, as a researcher, find out about an article that uses some statistical data that you think you could reuse, or that you want to look at as a referee. The data is easily available, and it is in a format that you can read. The data, however, is confusing. The fields in the tables have cryptic names such as “rgt5” and “avgB” that are not defined anywhere, leaving you to guess their meaning. Units of the measures are missing. Some registries are marked as “invalid” without any explanation of the reason and without making clear whether those registries were used in calculations or not. Derived data is calculated from a formula, but the implementation in the spreadsheet is slightly but significantly different from the formula in the article. If you re-run the code, the results are thus a bit different from those stated in the article. At some point, you try to contact the authors, but the contact data is outdated, or it is unclear which of the several authors can help with the data (you can picture such a scene in this animated short video). Note that in this scenario the research data might have been perfectly Findable, Accessible, and technically good and Interoperable, but without attention to the Reusability requirements, the whole purpose of FAIR data is defeated.
For computer code, documentation and good community development practices are non-trivial issues that the software industry has been addressing for a long time. Communities of programmers concerned by these problems have developed tools and protocols that solve, mitigate, or help manage them. Ideally, scientists working on scientific computing should learn and follow these good practices for code management. For instance, package managers for standard libraries, version control systems, continuous integration schemes, automated testing, etc., are standard techniques in the software industry. While not using any of these techniques and just releasing source code in zip files might not break the F, A, and I principles, it will make reuse and community development much more difficult.
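For example, even a very small automated test, kept under version control and run by a continuous-integration service, documents the intended behaviour of a piece of research code. A minimal sketch with pytest follows; the function and the tolerance are made up for illustration.

```python
import math


def relative_error(approx: float, exact: float) -> float:
    """Relative error of an approximation; stands in for a 'research' routine."""
    return abs(approx - exact) / abs(exact)


def test_relative_error_of_pi_approximation():
    # 355/113 is a classical rational approximation of pi; the tolerance is an
    # illustrative choice, not a project requirement.
    assert relative_error(355 / 113, math.pi) < 1e-6
```

Running `pytest` on every commit then catches silent changes in behaviour before they reach a published result.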
Documenting algorithms is especially important. Algorithms frequently use tricks, hard-coded constants, code patterns that come from standard recipes, parts that handle exceptional cases… Most often, even very well-commented code is not enough to understand the algorithm, and a scientific paper is published to explain how the algorithm works. The risk is a mismatch between the article that explains the algorithm and the released production-ready code that implements it. If the code implements something similar to, but not exactly, what is described in the article, there is a gap where mistakes can enter. A close integration between the paper and the code is crucial to save newcomers from having to work out for themselves how the described algorithm translates into code.
Verifiability
As we introduced above, independent verification is a pillar of scientific research, and verification cannot happen without reusability of all necessary research data. MaRDI puts a special effort into enabling verification of data-driven mathematical results, by building FAIR tools and exchange platforms for the fields of computer algebra, numerical analysis, and statistics and machine learning.
An interesting example arises in computer algebra research. In that field, the output is often as valuable as the program that produced it. For instance, classifications and lists are valuable by themselves (see for example the LMFDB or MathDB sites for some classification projects). Once such a list is found, it can be stored and reused for other purposes without any need to revisit the algorithm that produced it. Hence, the focus is normally on the reusability of the output, while the reusability of the sources is forgotten. This neglects the provenance of the data: how it was created and which techniques were used to find it. This entails serious risks. Firstly, it is essential to verify that the list is correct (since a lot of work will be carried out assuming it is). Secondly, it is often the case that later research needs a slight variation of the list offered in the first place, so researchers need to modify parameters or characteristics of the algorithm to create a modified list.
In the case of numerical analysis, the output algorithms are usually focused on user reusability, often in the form of computing packages or libraries. However, several different algorithms may compete for accuracy, speed, hardware requirements, etc., so the “verification” process gets replaced by a series of benchmarks that can rate an algorithm in different categories and verify its performance. We have described, in the previous newsletter, how MaRDI would like to make numerical algorithms easier to reuse and benchmark them in different environments.
As for statistical data, our Interoperability issue of the newsletter describes how MaRDI curates datasets with “ground truths”: facts that are known for sure independently of the data and that allow new statistical tools applied to the data to be validated. In this case, reusing these new statistical tools in new studies increases the corpus of cases where the tool has been successfully used, making each reuse a part of the validation process.
Licenses
We also discussed licenses in our Accessibility issue. Let’s recall that FAIR principles do not prescribe free / open licenses, although those licenses are the best way to allow unrestricted reusability. However, FAIR principles do require a clear statement of the license that applies, be it restrictive or permissive.
Even within free/open licenses, the choice is wide and tricky. In software, open source licenses (e.g. the MIT and Apache licenses) refer to the fact that the source code must be provided to the user. These are amongst the most permissive, because with the code one can study, run, or modify the software. In contrast, free software licenses (e.g. the GPL) carry some restrictions and an ethical/ideological load. For instance, many free licenses include copyleft, which means that any derived work must keep the same license, effectively preventing a company from bundling the software in a proprietary package that is not free software.
In creative works (texts, images…), the Creative Commons licenses are the standard legal tool to explicitly allow redistribution of works. There are several variants, ranging from almost no restrictions (CC0 / Public domain), to including clauses for attribution (CC-BY, attribution), sharing with the same license (CC-SA, share alike), or restricting commercial use (CC-NC, non-commercial) or derivative works (CC-ND, non-derivative), and any compatible combination. For databases, the Open Database License (ODbL) is a widely used open license, along with CC.
The following diagram shows how you can determine which CC license would be appropriate for you to use:
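The decision logic behind the diagram can also be sketched in a few lines of code. This is a simplification for illustration only (it omits CC0 and other subtleties) and is certainly not legal advice.

```python
def suggest_cc_license(allow_commercial: bool, allow_derivatives: bool,
                       require_share_alike: bool) -> str:
    """Very simplified sketch of the CC license decision flow (not legal advice)."""
    if not allow_derivatives:
        # ND variants forbid derivative works (and are not open licenses).
        return "CC BY-ND" if allow_commercial else "CC BY-NC-ND"
    parts = ["BY"]
    if not allow_commercial:
        parts.append("NC")
    if require_share_alike:
        parts.append("SA")
    return "CC " + "-".join(parts)


print(suggest_cc_license(allow_commercial=False, allow_derivatives=True,
                         require_share_alike=True))  # -> CC BY-NC-SA
```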
Note that CC-ND is not an open license, and CC-NC is subject to interpretation of the term “non-commercial,” which can pose problems. While CC licenses have been defended in court in many jurisdictions, there are always legal details that can pose issues. For instance, the CC0 license intends to waive all rights over a work, but in some jurisdictions there are rights (such as authorship recognition) that cannot be waived. Other details concern the license versions. The latest CC version is 4.0, and it intends to be valid internationally without the need to “port” or adapt it to each jurisdiction, but each CC version has its own legal text and thus provides slightly different legal protection. Please note that this survey article does not provide legal advice; you can find all the legal text and human-readable text on the CC website.
In general, the best policy for open science is to use the least restrictive license that suits your needs and, with very few exceptions, not to add or remove clauses to modify a license. Reusing and combining content implies that newly generated content needs a license compatible with those of the parts that were used. This can become complicated or impossible the more restrictions those parts carry (for instance, with interpretations of commercial interest or copyleft demands). Also, licenses and user agreements can conflict with other policies, such as data privacy; see an example in the Data Date interview in this newsletter.
Community
Perhaps the most succinct formulation of the Reusability principle would be “do as the community does or needs”, since it is a goal-focused principle: if the community is reusing and exchanging data successfully, keep those policies; if the community struggles with a certain point, act so that reuse can happen.
MaRDI takes a practical approach to this, studying the interaction between and within the mathematics community and other research communities and the industry. We described this “collaboration with other disciplines” in the last newsletter, and we highlighted the concept of “workflow” as the object of study, that is, the theoretical frameworks, the experimental procedures, the software tools, the mathematical techniques, etc. used by a particular research community. By studying the workflows in concrete focus communities, we expect to significantly increase and improve their reuse of mathematical tools, while also setting methods that will apply to other research communities as well.
MaRDI’s most visible output will be the MaRDI Portal, which will give access to a myriad of FAIR resources via federated repositories, organized cohesively in knowledge graphs. MaRDI services will not only facilitate the reusability of research data for mathematicians and researchers in other fields alike but also be a vivid example of best practices in research life. This portal will be a gigantic endeavor to organize FAIR research data, a giant on whose shoulders tomorrow’s scientists can stand. We strive for MaRDI to establish a new data culture in the mathematical research community and in all disciplines it relates to.
In Conversation with Elisabeth Bergherr
In this episode of Data Dates, Elisabeth and Christiane talk about reusability and the use of licenses in interdisciplinary statistical research, students' theses, and teaching.
Love Data Week
Love Data Week is an international week of actions to raise awareness for research data and research data management. As part of this initiative, MaRDI created an interactive website that allows you to play around with various mathematical objects and learn interesting facts about their file formats.
Research data in discrete math
In mid-March, the MaRDI outreach task area hosted the first research-data workshop for rather theoretical mathematicians in discrete math, geometry, combinatorics, computational algebra, and algebraic geometry. These communities are not covered by MaRDI's topic-specific task areas but form an important part of the German mathematical landscape, in particular with the initiative for a DFG priority program whose applicants co-organized the event. A big crowd of over sixty participants spent two days in Leipzig discussing automated recognition of Ramanujan identities with Peter Paule, machine-learned Hodge numbers with Yang-Hui He, and Gröbner bases for locating photographs of dragons with Kathlén Kohn. Michael Joswig led a panel focusing on the future of computers in discrete mathematics research and the importance of human intuition. Antony Della Vecchia presented file formats for mathematical databases, and Tobias Boege encouraged the audience to reproduce published results in a hands-on session, with participants finding pitfalls even in the simplest exercise. In the final hour, young researchers took the stage to present their areas of expertise, the research data they handle, and their take-away messages from this workshop: follow your interests, keep communicating with your peers and scientists from other disciplines, and make sure your research outputs are FAIR for yourself and others. This program made for a very lively atmosphere in the lecture hall and was complemented by engaging discussions on mathematicians as pattern-recognition machines, how mathematics might be a bit late to the party in terms of software, whether humans will be obsolete soon, and the hierarchy of difficulty in mathematical problems.
Conference on Research Data Infrastructure
The Conference will take place September 12th – 14th, 2023, in Karlsruhe (Germany). There will be disciplinary tracks and cross-disciplinary tracks.
Abstract submissions deadline: April 21, 2023
More information:
- in English
IceCube - Neutrinos in Deep Ice
This code competition aims to identify which direction neutrinos detected by the IceCube neutrino observatory came from. PUNCH4NFDI is focused on particle, astro-, astroparticle, hadron, and nuclear physics, and is supporting this ML challenge.
Deadline: April 23, 2023
More Information:
- in English
Open Science Radio
Get an overview of all NFDI consortia funded to date, and gain an insight into the development of the NFDI, its organizational structure, and goals in the 2-hour Open Science Radio episode interviewing Prof. Dr. York Sure-Vetter, the current director of the NFDI.
Listen:
- in English
The DMV, in cooperation with the KIT library, maintains a free self-study course on good scientific practice in mathematics, including notes on the FAIR principles. (Register here to subscribe to the free course.)
Edmund Weitz of the University of Hamburg recorded an entertaining chat about mathematics with ChatGPT (in German).
Remember our interview about accessibility with Johan Commelin in the second MaRDI Newsletter? The Xena Project is "an attempt to show young mathematicians that essentially all of the questions which show up in their undergraduate courses in pure mathematics can be turned into levels of a computer game called Lean". It has published a blog post highlighting very advanced maths that can now be understood using Lean, the interactive theorem prover Johan told us about.
On March 14, the International Day of Mathematics was celebrated worldwide. You can relive the celebration through the live blog, which also includes two video sessions with short talks for a general audience—one with guest mathematicians and one with the 2022 Fields Medal laureates. This year, the community was asked to create Comics. Explore the featured gallery and a map with all of the mathematical comic submissions worldwide.
Welcome to the third issue of the MaRDI Newsletter on mathematical research data, and happy holidays! We give you a brief snapshot of the world of interoperability. This is the third and may be one of the most challenging of the FAIR principles; it is very topic-dependent and much more technical than, say, findability. Its key question is: how do you seamlessly hand a digital object from one researcher to another?
licensed under CC BY-NC-SA 4.0.
We discuss the meaning and implications of interoperability in a number of mathematical disciplines, interview an expert on scientific software, report on workshops that have happened in the mathematical research-data universe, and much more.
We encounter different systems almost everywhere in our lives, both professionally and in everyday situations. Not all of them seem to be interoperable. For example, a navigation app will not be able to interpret equations, and it might not be trivial to ask Mathematica to compile your Julia computations. Think of any two systems—what would a marriage of the two look like? (We understand marriage here to be establishing the base for communication and exchange.)
If you could choose two systems you would like to get married, which ones would you choose?
Did you choose a perfect match in the survey above? You can add more anytime...
Interoperability: Let's play together
In our previous newsletters, we covered the Findability and Accessibility principles of FAIR research data. Those are the basic principles that give researchers awareness of and access to existing research data. In contrast, the remaining two principles, Interoperability and Reusability, relate to what can be done with that data, or rather to its quality. They have more profound implications for the interactions of the research community as a whole.
Research is almost never conducted in isolation. Researchers build on top of other researchers’ findings, combine different sources with their own insights, and use plenty of tools and methods developed by others. Here we will focus on some technical (and less technical) requirements to make this research community possible: Interoperability.
Interoperability is the capacity to combine pieces from different sources to work together. Standards in science and industry, such as measuring units or the shape of plug connectors, are designed for interoperability. In research, a simple example is language. Most scientific research is nowadays written and published in English. While there may be valid reasons to use other languages (in specific disciplines, in outreach, to foster exchanges in a particular cultural group…), the reality is that using a single lingua franca for scientific research enables comprehension and use of any scientific publication to all researchers. This creates a necessity for researchers to learn and use the English language as part of their research (and life) skills. When it comes to computers, plenty of standards respond to the need for interoperable data, such as file formats or computer languages (pdf, LaTeX, …), some having more success than others.
For research data, interoperability is crucial to enable a research community to collaborate and interact. Interoperability means using a standard set of vocabulary and data models that give a good and agreed representation of the type of research data in question. This effectively sets a standard for data communication. Then each researcher can adapt their tools and methods to process data within those standards.
To be precise, FAIR principles provide a framework for interoperable research data:
- Data and metadata must use a knowledge representation (ontologies, data models) that is shared, broadly applicable, and accessible.
- Such knowledge representation must itself be FAIR.
- When data and metadata reference other data and metadata, their relationship must be qualified (e.g. data X uses algorithm Y in such a way, data Z is derived from dataset W by applying such a filter); see the small sketch after this list.
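The third point, qualified references, can be pictured with a tiny data structure (all names and values are invented for this sketch): instead of recording only that two datasets are related, the relationship itself carries information about how they are related.

```python
# A qualified link between two pieces of research data; not just
# "Z is derived from W", but how it was derived. Purely illustrative.
qualified_edge = {
    "subject": "dataset:Z",
    "predicate": "derivedFrom",
    "object": "dataset:W",
    "qualifiers": {
        "method": "outlier filtering",
        "software": "filter-tool 0.9",        # hypothetical tool and version
        "parameters": {"threshold": 3.0},
    },
}
```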
In information science, an ontology is the set of all relevant concepts and relationships for a particular domain. This can be an enumeration or represented by a knowledge graph where nodes are concepts (think of nouns), and edges are qualifiers (think of verbs). This theoretical reflection of the nature of your research data is fundamental to developing useful standards that enable practical interoperability.
The MaRDI project actually devotes a significant part of its efforts to improving the interoperability (and reusability) of research data. Here we provide a brief summary of these interoperability efforts.
Computer Algebra
Computer Algebra concerns calculations on abstract mathematical objects, such as groups, rings, polynomials, manifolds, polytopes, etc. Computations are generally exact (no numerical approximations). Typical use cases of computer algebra are enumeration problems, for instance, finding a list of all graphs with certain properties. For such abstract objects, even the data representation is non-trivial, so researchers often build on top of specific frameworks called Computer Algebra Systems (CAS) that implement these data types and methods. Such CASes can be of broad scope, like Mathematica, Maple, Magma, SageMath, OSCAR, etc., or they can focus on a specific domain, like GAP (group theory), Singular (algebraic geometry), or Polymake (polytopes and other combinatorial objects). A desirable goal would be a common data format that allows interoperability between different software systems without the loss of CAS information, enabling the parsing of files and the calling of functions from one system to another. This is obviously not an easy task. On the one hand, some of those CASes (e.g. Mathematica, Maple…) are proprietary; their focus is not purely on math research, as they also provide tools used in other fields such as engineering or education. Interoperability approaches that use anything other than their provided APIs will therefore likely fail. On the other hand, the special-purpose CASes such as GAP, Singular, or Polymake (incidentally, all three originated and are maintained at German universities by researchers close to MaRDI) are open-source and can be used stand-alone, but they are also integrated into broader CASes such as SageMath (Python-based) or OSCAR (Julia-based). Turning these specific systems into broad-purpose CASes while also retaining state-of-the-art algorithms from the latest research is already a great success story.
The goals for MaRDI in Computer Algebra are to document and establish workflows, data formats, and guidelines on how to set up databases. By ‘workflows’ we mean the process of generating or retrieving data, setting up an experiment, and obtaining conclusions, which implies documenting the exact versions of the software (and possibly hardware) used as well as the tech stack (from the operating system to the languages, interpreters, and libraries used). This will have benefits such as enabling verification of the results and making further reuse easier. It also provides clear guidelines on which software can be used together, replaced, or mixed, and therefore a way to evaluate its interoperability.
Documenting and establishing data formats means going a step further in interoperability: not only describing which software or data format the current work adheres to, but actually making a system-agnostic description of the data. For instance, if we are using a particular ring of polynomials in several variables with coefficients in a particular field, the data description should make clear how we store and operate on the elements of such a ring. Typically, this will follow the data format of a particular CAS, but having an independent description will enable other CASes to implement a compatibility layer to reuse the data. This will become even more relevant when implementing new abstract structures. Eventually, the goal is that any CAS wishing to support a particular data format can implement a compatibility layer based on the data description. This is called data serialization, as the goal is to translate internal data structures into a text description, which can be exchanged with another system to be de-serialized, that is, turned into the data structure of the new system with the same semantic information but possibly a different implementation. The MaRDI team is implementing this data serialization in OSCAR, but the goal is to have a system-agnostic specification.
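The idea can be sketched as follows. This is not the actual OSCAR/MaRDI serialization format, only an illustration (using SymPy) of turning a polynomial into a plain dictionary of exponents and coefficients that another system could re-read and turn back into its own internal structure.

```python
from sympy import Poly, symbols

x, y = symbols("x y")
p = Poly(3*x**2*y + 5*x - 7, x, y)

# Serialize: a system-agnostic description of the ring and the terms.
serialized = {
    "ring": {"coefficient_ring": "ZZ", "variables": ["x", "y"]},
    "terms": [{"exponents": list(exps), "coefficient": int(coeff)}
              for exps, coeff in p.terms()],
}

# De-serialize (possibly in another system): rebuild the polynomial from the data.
rebuilt = sum(t["coefficient"] * x**t["exponents"][0] * y**t["exponents"][1]
              for t in serialized["terms"])
assert (p.as_expr() - rebuilt).expand() == 0
```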
Finally, documenting computer algebra databases will, among other benefits in findability, enable a comprehensive picture of the different systems and the compatibility layers needed to have interoperability amongst them.
Scientific computing
Numerical algorithms are central to scientific computing. Their approximations to exact mathematical quantities come with inherent inexactness and error propagation, due to the finite precision of the data structures used. This contrasts with the abstract and exact objects used in Computer Algebra. Typical examples are linear solvers (Ax=b) for different types of matrices (big, small, huge, sparse, dense, stochastic…), or numerical integration methods for ODEs or PDEs. Numerical algorithms are closely associated with applied mathematics, and performance or scalability are relevant factors for choosing one method over another. We already described in the Findability article that MaRDI is building a knowledge graph for those numerical algorithms, together with benchmarks, supporting articles for theoretical background, and other features. But the goal goes beyond creating such a graph just to find algorithms; there is also the ambitious goal of developing an infrastructure that makes all these algorithms interoperable.
Researchers implement their algorithms in programming languages such as MATLAB (which is proprietary), C/C++, Julia, Python, etc., possibly with extension libraries. To implement interoperability between different numerical methods, MaRDI proposes a three-component architecture (driver - connector - implementor). For a particular algorithm, the implementor is the piece of software that contains the actual existing algorithm, in whatever language or framework the author used. The driver is a high-level calling function that contains the semantics of the data, but not the implementation of the algorithm. The same data model can then be used by drivers of different numerical algorithms, even if their implementations use completely different technologies, thus enabling an interoperable ecosystem. The prototypes of those drivers are being proposed and defined by the MaRDI team. The critical missing piece is the connector, which mediates between the driver and the implementor and needs to be developed for each algorithm, likely in collaboration with the original author. The MaRDI team is implementing some examples, but the goal is that in the future, any researcher who develops numerical algorithms can use their preferred technology stack and then easily implement a connector to standard driver functions.
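A minimal sketch of this pattern is given below. The names and interfaces are invented here and are not MaRDI's actual prototypes; the implementor is represented by NumPy's linear solver simply because it is widely available.

```python
import numpy as np


# Implementor: the author's existing code, in whatever framework they chose.
def numpy_linear_solver(A, b):
    return np.linalg.solve(A, b)


# Connector: a thin adapter from the driver's standard data model to one
# specific implementor; one such layer is written per algorithm.
CONNECTORS = {
    "numpy": lambda problem: numpy_linear_solver(problem["A"], problem["b"]),
}


# Driver: a high-level call that only knows the semantics "solve A x = b".
def solve_linear_system(A, b, backend="numpy"):
    problem = {"A": np.asarray(A, dtype=float), "b": np.asarray(b, dtype=float)}
    return CONNECTORS[backend](problem)


print(solve_linear_system([[3.0, 1.0], [1.0, 2.0]], [9.0, 8.0]))  # -> [2. 3.]
```

Adding another backend would then only require writing a new connector, while every driver call stays unchanged.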
The benchmark comparison between algorithms (planned for the knowledge graph) actually requires this interoperability architecture so that the same test can be executed by different algorithms in equal conditions without a need to adapt the data to fit a particular tech framework.
Statistics and Machine Learning
Typical research data usage in statistics or machine learning includes big experimental datasets, frequently coming from other domains. Good examples of this are genetic data or financial data. These datasets contain valuable information that researchers try to extract using statistics or AI techniques. In statistics, for instance, a typical goal is to create a model, meaning to describe a joint probability distribution of all the variables depending on the individual probability distributions of each variable. This means understanding the dependencies between the variables.
A problem often found by statisticians who develop new theoretical methods to extract information from experimental data is that there is only a very limited collection of suitable datasets where they can test new methods. It is difficult to obtain curated data from interdisciplinary teams before the statistical tools are proven useful and robust, which leaves researchers with limited choices to run tests. The most valuable information in curated data includes “ground truths”, that is, relationships between variables that are known externally to the experimental data, via expert knowledge from another field. For instance, in a macroeconomic study, some variables can be related or independent, or their relationship may depend on the presence of a third variable indicator, or even more complex interactions. We may know some of these interactions by knowing government policies or strategies which are not reflected directly in the data. For the statistician, such a "ground truth" is very useful to validate the algorithm used to fit the model. A goal for MaRDI is to collect a broader, curated list of datasets that can be used by statisticians to test and validate modeling techniques. Those datasets need to be cleaned and ready to be used by standard statistical packages (that is, to be interoperable), and to have useful annotated “ground truths” attached to the data for use on interdisciplinary teams. Besides this data collection, MaRDI aims to be a leading example of quality curated data so that experimentalists can adhere to those quality standards.
Another goal concerns machine learning (ML) algorithms. The community around ML is much broader than mathematicians (software developers, data scientists, ML engineers…), and therefore the frameworks used are very diverse. TensorFlow and Torch are two popular tools in the industry, but there are many others. The language R is suitable for statistics and data science, and also for machine learning. An initiative to bring cohesion and interoperability in this software ecosystem is mlr3 (machine learning for the R language), which MaRDI is using and extending. The mlr3 project brings different R packages together (often based on or operating on other frameworks), providing unified naming conventions, and a full suite of tools (learners, benchmarks, analyzers, importers/exporters, …), making R and mlr3 a competitive integrated framework for ML.
We can see a couple of examples of how MaRDI is bridging interoperability gaps in this field. A first example: in machine learning (as in the statistics case we saw earlier), there is a great need for more quality datasets (training, evaluation…). OpenML is a web service that allows sharing of datasets and ML tasks within the ML community. MaRDI is helping to build mlr3oml, an interoperability interface between mlr3 and OpenML. MaRDI also builds and stores “curated quality datasets” in OpenML that can be used for testing and benchmarking, and also as a model of good practices.
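For readers outside the R ecosystem, the same kind of programmatic access to OpenML can be illustrated with the Python openml client (this is not the mlr3oml interface described above; dataset ID 61, the classic iris data, is used only as an example).

```python
import openml

# Download a dataset from OpenML by its numeric ID and split it into
# features X and target y.
dataset = openml.datasets.get_dataset(61)  # 61: the classic "iris" dataset
X, y, categorical, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute
)
print(dataset.name, X.shape)
```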
A second example: many learning algorithms in ML are treated as black boxes; they come from different ML techniques and have different implementations. However, a significant share of these algorithms are based on neural network techniques that share common characteristics: architecture, loss function, optimizer… The package mlr3torch, being developed with MaRDI, aims to “open” some of those black boxes, giving greater control over those parameters.
Cooperation with other disciplines
MaRDI strives to bring together mathematical methods and the people who use them. Today this collaboration requires much more than a common spoken language and publishing in international journals; nowadays, data languages are crucial. MaRDI aims to understand and document how researchers in disciplines other than mathematics use (or would like to use) mathematical research data. Hence, the “interoperability” between mathematics and other fields is key. For the past year, MaRDI has collected a series of case studies from other NFDI (the German National Research Data Infrastructure program) consortia, other research groups, and industry, documenting through a series of templates how they work and use research data. The key concept is the “workflow”, meaning the documentation of the whole process of setting a theoretical framework, hypotheses to scrutinize, the experimental model, data acquisition, technical equipment, metadata association, data processing, software used, data analysis techniques, extraction of results, publications… everything that is directly related to data management, but also its research context. Several examples of workflows can be found on the MaRDI portal TA4 page. Currently, the collected information is textual, highlighting the data acquisition process (and its metadata) and the mathematical model used. In the future, both the (meta)data and the model will be formalized by means of ontologies and model pathway diagrams (graphs) to enable further uses of the research data, such as reproducibility, replacing methods and techniques with newer or more performant ones, or enabling reusability by other researchers.
By looking at the case studies, one can observe that most researchers implement “island solutions” adapted to their specific needs, even if those solutions may be very professional and optimized. There is great potential to increase interoperability and exchange. MaRDI aims to drive a change in mathematical data management and analysis to support researchers, in the belief that such a shift will be broadly welcomed within the research community.
MaRDI portal
The MaRDI portal will be the single entry point to all the MaRDI services and resources collected by the different task areas. The portal team is currently building a knowledge graph of mathematical research data by retrieving information from other sources (for instance, Wikidata, swMATH for documenting mathematical software, package repositories to improve the information granularity of some mathematical software, zbMATH Open to retrieve publications, etc.). This requires a lot of interoperability effort using the respective APIs, since the volume of data is not manageable by hand. Some automation and AI techniques are being considered to foster this process. In due time, all the different MaRDI teams will start producing their output goals, and the portal team will manage the integration within the portal. For instance, the knowledge graph of numerical algorithms will be integrated into the knowledge graph of the MaRDI portal. The statistical dataset collections will also be described as entities in the MaRDI knowledge graph, and so on. In a sense, the portal needs to create interoperability layers between the internal task areas of MaRDI.
All in all, the interoperability principle is an enabling condition for building and strengthening a community. That is the driving goal of all the efforts from MaRDI that we described here. This enabling condition turns into an actual collaboration when the data is reused across different projects and researchers, which will be the topic of our fourth article in this series, about Reusability.
In Conversation with Ulrike Meier Yang
In the third episode of the interview series Data Date, Ulrike and Christiane talk about mathematical research data in the xSDK project, the importance of guidelines, three levels of interoperability, and automated testing.
MaRDI annual workshop 2022
In mid-November, the whole MaRDI team met at WIAS in Berlin for their second annual workshop. The kickoff in Leipzig one year before had provided an enthusiastic start for the consortium and for building an infrastructure for mathematical research data in Germany. The slogan at the time was to spend the coming twelve months doing two things: listening (zuhören) and simply getting started (einfach anfangen)!
Now the team looked back, recapped, and planned for the second year and further into the future. Over the course of three days, approximately forty people met in person, with some participating online, to first present each task area's updates and vision, discuss current issues in interactive small-group BarCamps, and finally decide on the upcoming route. The event was kicked off with a keynote talk by Martin Grötschel, who stressed the importance of following a bottom-up process and pointed out potential pitfalls of such projects, drawn from his own experience. This was followed by NFDI's Cord Wiljes describing potential benefits of cross-consortial collaborations. There was plenty of lively discussion centered around possible career paths of women in maths and data, and around how MaRDI could live up to the central expectations of the Portal, link knowledge graphs, best deal with the very diverse mathematical research data in management plans, and build a community. BarCamps developed ideas and new work packages, like the setting up of an editorial team for the Portal. All throughout, many participants compiled self-designed bingo sheets to collect #MaRDI_buzzwords. The long and pleasant days were accompanied by a visit to the computer-games museum and a conference dinner. At the end of the workshop, the MaRDI team concluded that the coming year would best be spent building on the previous "listening and getting started" and focusing on two further tasks: networking with the community (vernetzen) and cross-collaboration (zusammenarbeiten) within the consortium. This will link MaRDI's expertise across different institutions and will ensure that the resulting services reach and engage with potential users early on, making them truly useful for the working mathematician.
MaRDI Movies
The first in this series of short, entertaining, and informative videos is called 'Mardy, the happy math rabbit'. Follow Mardy through the pitfalls of reproducing software results: An introduction to software review in mathematics by Jeroen Hanselmann.
MOM workshop on MaRDI, OSCAR, and MATHREPO
In November, MaRDI's task area for Computer Algebra invited their community to ZIB and TU Berlin for the "MOM workshop on MaRDI, OSCAR and MATHREPO". Over the course of two days, some twenty people met in person to discuss how to deal with databases, polytopes, triangulations, graded rings, polynomials, Gröbner bases, finite point configurations, and the like. Particularly important were questions on how to save an object, where to store it long-term, how to seamlessly interact with databases, and how to reproduce a computation.
The MaRDI organisers presented serialisation and workflow efforts and led an exercise in reproducibility in which the participants were asked to rerun published research outputs. Some could be redone quite well; others were not so easy to reproduce. A number of examples came from the mathematical research-data repository MathRepo, co-maintained by MaRDI's Tabea Bacher. The awarding of the FAIRest MathRepo page of 2022 was part of the workshop. A jury of interested workshop participants took a closer look at the contributions previously nominated by the audience and judged them according to the FAIR principles. The highly deserved winner was Tobias Boege from Aalto University for his entry on Selfadhesivity in Gaussian conditional independence structures. In addition to very good documentation, he found a way to make huge amounts of his research data FAIRly available (an unusually difficult problem) by compressing files and using the MPDL repository Keeper as a long-term storage solution.
Alheydis Geiger from the Max Planck Institute for Mathematics in the Sciences, Leipzig, presented a user story of OSCAR. In her paper, she and her collaborators combined different computer algebra systems, such as OSCAR, Macaulay2, Magma, Julia, Polymake, Singular, and more, to investigate self-dual matroids from canonical curves. The Graded Ring Database was introduced in a talk by Alexander M. Kasprzyk from the University of Nottingham, focusing on the mathematical meaning of the research data in the database as well as on technical and accessibility matters.
In a final session, researchers split into two smaller groups for discussion. The first group collected both computer algebra and general software systems used by the participants and discussed which system was best suited for which research questions. The other group discussed technical peer reviewing: how it can be done and why it is necessary (for more on technical peer reviewing, watch the MaRDI Movie Mardy, the happy math rabbit).
MaRDI Workshop on scientific computing—A platform to discuss the “HOW”
From October 26 to 28, 2022, the first MaRDI Workshop on Scientific Computing took place at WWU, Münster. About 40 people from the scientific computing community and from MaRDI came together to learn and talk about research data in three densely packed days of exchange.
The introductory talk by Thomas Koprucki on MaRDI was followed by blocks of talks on topics such as workflows and reproducibility, ontologies and knowledge graphs, or benchmarks. Ten invited speakers presented their projects: for example, Ulrike Meier Yang (see video interview above) introduced the extreme-scale scientific software development kit xSDK, Benjamin Uekermann presented preCICE, a general-purpose simulation coupling interface, Andrea Walther talked about 40 years of developing ADOL-C, a package for automatic differentiation of algorithms, and Tyrone Rees presented FitBenchmarking, an open-source tool for comparing data analysis software.
As one of the main goals of the organizers was to bring together researchers from the scientific computing community and related disciplines to learn from different projects and related expertise, speakers were encouraged to present work in progress and open problems or to report on personal experiences; not only to talk about the "WHAT" but also to share the "HOW". It can be said that this concept worked out. This was noticeable both in the coffee breaks, which were characterized by lively conversations, and in the afternoon of October 27th, which was devoted entirely to discussions. There were several discussion groups focused on a variety of topics, such as workflows and reproducibility, knowledge graphs, research software, benchmarks, training and awareness, ... The training and awareness group discussed how to deal with software that is not associated with a paper (there are some journals that might publish on such topics, but it is difficult to get the deserved recognition) and which career level is best approached for research data management topics. After the discussion in groups, the results were presented to everyone. One of the ideas that was discussed a lot when the groups reconvened was the possibility of providing better job security for software engineers by making them permanent employees of universities and having the projects they work on pay the university for their services.
Mario Ohlberger, co-spokesperson of MaRDI and co-organizer of the workshop, said there was great feedback for the event. The workshop created a new platform for exchange and generated many new impulses for MaRDI. Many participants had never been to such a workshop before; they were happy to find others who are passionate about the same topics and willing to exchange ideas.
Digital Humanities meet Mathematics (DiHMa.Lab)
The first session of DiHMa.Lab took place in September with a workshop organized jointly by the Ada Lovelace Center for Digital Humanities and MaRDI’s interdisciplinary task area, TA4. Over the course of two days, about thirty people from archeology, philology, literary sciences, history, cultural studies, research-data management, and of course mathematics came together in this hybrid event to identify and discuss various interconnections, exchange experiences, and come up with ideas on how to improve the cooperation and understanding of each other's research. The main focus of the workshop was to engage with both NFDI consortia—NFDI4Memory, NFDI4Objects, Text+, NFDI4Culture, KonsortSWD, MaRDI—and institutes involved in social sciences and humanities research, and to familiarize everyone with the methods, problems, questions, and research data of the represented fields.
To that end, researchers presented examples of (mathematical) research data and their handling in various projects from the digital humanities. For instance, Nataša Djurdjevac Conrad (ZIB) talked about a project in which the spreading of wool-bearing sheep in ancient times was analyzed using agent-based models. Christoph von Tycowicz (ZIB) presented instances of geometric morphometrics used to determine installation sites of ancient sundials or changing facial expressions during the aging process. Tom Hanika (Uni Kassel) and Robert Jäschke (IBI - HU Berlin) spoke about formal concept analysis and order theory and how they can be applied and yield interesting results when analyzing literary works or art.
What these projects have in common is that they avoid black-box situations, where a method is applied without really knowing how it works, which makes it a matter of chance whether the results are interpreted in a fitting manner. In order to obtain reliable results, it is necessary for mathematicians to understand the complex questions and data arising in the digital humanities, and for researchers from the digital humanities to be careful in applying mathematical methods and to understand them first, so as to be able to choose “the right” method and to correctly interpret the results. Achieving that enables successful collaborations and contributes to entirely new mathematical questions. This, in turn, opens up rich sources for novel questions in the digital humanities.
All in all, it was a very successful workshop, resulting in the idea of DiHMa.Lab establishing a “marketplace for methods” where questions from the digital humanities could be posted and liked by mathematicians, preferably along with a proposed method. Moreover, the participants were very open, accommodating, and interested in the topics and concerns of the different fields, eager to learn new methods, to see what is possible if “we” join forces, and what new questions arise.
New consortia and an initiative for basic services
On November 4, the Joint Science Conference (GWK) decided to fund seven additional consortia as well as an initiative for the realization of cross-consortia basic services Base4NFDI within the framework of the National Research Data Infrastructure (NFDI). As in the two previous years, the decision by the GWK follows the recommendations of the NFDI expert panel appointed by the German Research Foundation (DFG).
More information:
- in German
International Love Data Week 2023
Love Data Week is an international celebration of data, hosted by the Inter-university Consortium for Political and Social Research (ICPSR), that takes place every year during the week of Valentine's day (in 2023: February 13 - 17). Universities, nonprofit organizations, government agencies, corporations, and individuals around the world are encouraged to host and participate in data-related events and activities held either online or in-person locally. The theme this year is Data: Agent of Change.
More information:
- in English
In October, The Netherlands hosted the "1st international conference on FAIR digital objects" with over 150 professionals signing the Leiden Declaration on FAIR Digital Objects. This is deemed to be "an opportunity for all of us working in research, technology, policy and beyond to support an unprecedented effort to further develop FAIR digital objects, open standards and protocols, and increased reliability and trustworthiness of data".
A group of MaRDI team members together with external experts have written a new article highlighting the status quo, the needs and challenges of research-data management plans for mathematics: a preprint is already available here.
The ICPSR published a guide to data preparation and archiving in 2020. Even though addressed to social scientists, the presented guidelines can be applied to any field.
The "Making MaRDI" Twitter series we announced in the previous Newsletter has been launched and integrated into the website. There are currently four profiles presenting the work that Karsten Tabelow, Tabea Bacher, Christian Himpe, and Ilka Agricola carry out in the consortium.
Welcome to the second issue of the MaRDI Newsletter. In each newsletter, we talk about various research-data themes that might be of interest to the mathematical community, in particular finding data that is relevant to advance your research, ensuring other people can access your files, solving the difficult problem of managing files between coauthors, and preserving your results such that your peers can build their research on those.
The FAIR principles for sustainable research-data management are important to us, so we present them individually in a series of articles. This issue of the Newsletter is dedicated to the A in FAIR: accessibility and what this means for mathematics.
licensed under CC BY-NC-SA 4.0.
In each newsletter, we also publish an episode of our interview series "Data Dates", tell you about an event that happened in the MaRDI universe, and offer some reading recommendations on FAIR topics.
In our last newsletter issue, we asked you to enter 3 methods you commonly use to search for/find mathematical research data. Here are the results to that survey:
Share your accessibility nightmare (or a success story)!
We will feature a selection of your stories in an upcoming newsletter (anonymously).
FAIR access to research data
Access to research information is the most fundamental principle for spreading science across the scientific community and society. Publishing and making research results available is a cornerstone of research. This, however, is not free of issues. On the one hand, some research is private, restricted within industry, or protected by intellectual property. On the other hand, barriers also arise when accessing data, in the form of technical incompatibilities, paywalls, bad metadata, or simply incomplete data.
The Accessibility principle of FAIR data is the idea that all the relevant data connected to a research result should be properly available. This concerns which data is available, to whom it is accessible, how it is technically stored and retrieved, and how it is classified and managed. The principle is rooted in the scientific foundations of reproducibility and verifiability: other researchers should be able to repeat and independently verify the published results. While this is especially important in the experimental sciences, it also applies to the domain of mathematics.
The FAIR principles state that research data is Accessible when it respects the following recommendations:
- The data is accessible over the internet, possibly after authentication and authorization. The means of access (protocols) must be open, free, and universal, and those protocols must include authentication and authorization whenever necessary.
- The metadata must be available together with the data, and it must persist even after the data is no longer available.
It is important to note the phrase "possibly after authentication and authorization". It is a common misunderstanding that FAIR accessibility implies free of cost or under open licenses. That is not the case. Free-of-cost publication and open licenses fall into the domain of the Open Access principles. While FAIR and Open Access have points in common, we will see examples where non-open-access databases can be FAIR, or open-access articles and research data that are not FAIR because metadata or appropriate protocols are missing.
Standards and protocols are a fundamental element of FAIR accessible data. Many tasks, especially those that are repeated in the same way, are performed much more efficiently by machines than by humans. That is why computers are very important when dealing with research data, too. In terms of accessibility, any storage location would ideally provide interfaces through which machines can automatically access research data; such interfaces are referred to as Application Programming Interfaces, or APIs.
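As a minimal sketch of what such machine access could look like - the endpoint, dataset identifier, and token below are placeholders, not a real MaRDI or repository API - a script might first fetch the metadata and then, with authorization, the data itself:

```python
import requests

# Hypothetical repository endpoint, dataset identifier, and token (placeholders).
BASE_URL = "https://repository.example.org/api"
DATASET_ID = "dataset-1234"
TOKEN = "my-access-token"  # only needed if the repository requires authentication

# Metadata should be retrievable even if the data itself is restricted.
meta = requests.get(f"{BASE_URL}/records/{DATASET_ID}", timeout=30)
meta.raise_for_status()
print(meta.json().get("title"), meta.json().get("license"))

# The data itself may require authentication and authorization.
data = requests.get(
    f"{BASE_URL}/records/{DATASET_ID}/files/data.csv",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
data.raise_for_status()
with open("data.csv", "wb") as f:
    f.write(data.content)
```

The key point is that both steps use open, standard protocols (HTTPS), and that the metadata remains accessible even when the data requires authorization.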
The research data behind the articles
Let us look at three stories of fictional mathematicians who use research data as a fundamental part of their work. They handle different types of data (databases, classifications, source code, articles...), which can also have different origins (produced by themselves or obtained from a third party). They face different challenges in keeping their research data FAIR.
Alice is a mathematician working in computational algebra. She makes intensive use of software, but in her published articles, she often uses sentences such as "using software XX, we can see that...". Her scripts in the form of source code, software packages, toolchains, and her computed results are research data that, if omitted from the published results, are not FAIR data, making her results difficult to validate or replicate. She is aware of that problem and wants to solve it, so she decides to set up a server in her math department with files containing her source code, and she mentions that those files exist on her personal website; maybe she even puts the URL of the code in her articles. However, she has changed universities several times, thus changing her servers and websites, and many files and projects related to older articles are now lost. In order to be fully FAIR compliant, she needs to ensure that the data is bound to a metadata reference and to the research article, that it is accessible through standard internet protocols, and to plan for a long-term archive that does not disappear when she changes her job position. Ideally, she would assign a DOI to the source code and host it in some long-term archive (e.g., Zenodo, GitHub, MathRepo, or others). Furthermore, she needs to make the code Interoperable and Reusable, which we will discuss in forthcoming issues. The MaRDI project aims to help mathematicians in this situation improve their FAIR data management.
Alice also participates in a collaborative project to classify all instances of her favorite algebraic objects. She and other colleagues have set up an online catalog listing all the known examples, the invariants they use to classify, and bibliographical information. At the moment, this catalog contains a few hundred items; Alice and her team will need to provide download options, filters, and means to retrieve information from the database beyond the graphical web interface. They will need to provide the results in formats that can be further processed with standard tools. That is, they will need an API and standardized formats to allow other researchers to use that database effectively in their own research projects.
Bob and Charlie are mathematicians modelling biological processes. Bob models tumor growth in human cancer, and Charlie neurological activity in animals. They handle three types of data: experimental specimen data in the form of databases that they receive from a partner or third party, model data in the form of source code that they develop, and result data in the form of articles they publish.
For Bob, primary data comes mainly from patients in hospitals. For obvious privacy reasons, Bob cannot directly access that primary data. Instead, he relies on organizations that offer anonymized databases publicly available for research (for example, the National Cancer Institute). Parts of these databases are fully anonymous and can be given open access. Other records contain detailed genetic information that, by its nature, could be used to identify the patient. Those databases have authenticated access, and researchers can only access them after being identified and committing to respect standard good practices in handling medical data. Thus, even if access is restricted to identified and authenticated people, the data can be FAIR.
For Charlie, keeping his research data FAIR is tricky. He partners with some laboratories that have the appropriate resources to collect data from animals. Since obtaining this experimental data is expensive, the laboratory keeps some rights of use, and Charlie has to sign a "Data Use Agreement" contract. This allows him to use the data only for the declared purpose, and he is unable to redistribute it. In this case, the data would not be FAIR. However, the laboratory agrees to release the data for public use after two or three articles have been published from that source, as they consider that the data has already yielded enough results. From that moment, the data could be considered FAIR. Some websites collect already released databases (e.g., International Brain Lab) or collect data directly from laboratories for researchers' use (e.g., Human Connectome Project).
Bob and Charlie transform the databases they obtain, develop and apply models. They then write and publish articles. It is increasingly common that journals in the modelling field require the source code to be available. Bob and Charlie, like most researchers, use GitHub, but they have other options as we mentioned with Alice. Additionally, interdisciplinary fields with large communities often have collaborative and open-science platforms where many researchers collaborate in large distributed teams (e.g., COMOB, Allen Institute). In those projects, FAIR principles are a basic need. Concerning accessibility, all the data must be perfectly identified by its metadata. Accessibility has to be transparent to the researchers so the source code of their models can retrieve and process the data in a single step. All the platforms mentioned above have high standards of FAIR-ness and offer APIs based on open standards.
Accessibility and Open Access
It is important to distinguish between the "Accessibility" FAIR principle and the "Open Access" practice.
The open-access philosophy states that research data and especially research results (articles) should be available online, free of charge and free from other barriers. This is usually achieved using open legal licenses such as Creative Commons or similar ones.
The open access movement rose in the context of articles and scientific literature at the end of the 90s and the beginning of the 2000s, at the dawn of the internet era. The new technologies (publishing online, print-on-demand, easier distribution...) lowered the cost of publication dramatically, but at the same time, some publishing houses kept increasing their fees to access scientific journals and started practices such as "bundling" to force libraries to buy subscriptions in bulk. In our academic system, researchers are pressured to publish in prestigious, high-impact journals, since their academic standing depends highly on publication metrics. Most often, journals do not offer remuneration for authoring scientific articles. Furthermore, researchers often peer-review articles for free, with the incentive of gaining status in their research field. Under those circumstances, the role and the business model of the traditional publishing houses started to be questioned. For several years, discontent grew in the scientific community. Some researchers proposed a boycott (e.g., Tim Gowers against Elsevier), while others defended revolutionary tactics (e.g., Aaron Swartz's Guerrilla Open Access Manifesto) that brought shadow sites to the forefront. These sites offered free and unrestricted access to vast amounts of scientific literature (e.g., Sci-Hub, LibGen), but unauthorized by the copyright holders and thus unlawful in many jurisdictions. In parallel, pre-publication sites such as arXiv, which make access to scientific articles free and open, have gained much popularity. It is nowadays common to find on arXiv pre-release versions (after peer review and with the final layout) almost identical to the journal-published articles. Other authors avoid journals altogether and publish only on arXiv (with the consequences this entails, such as loose or lacking review and a lack of certifiable merits).
More recently, the open access movement has brought new journals and editorial practices that guarantee access to research articles at no cost. For instance, the Public Library of Science (PLOS) is a non-profit publisher that advocates for Open Access, releasing all its published articles under Creative Commons licenses. In turn, PLOS popularized the practice of pay-to-publish, a scheme that moves the publication fees to the authors or their institutions. While this model is defended by many researchers and publishers, some deceptive journals regrettably exploit it by charging authors publication fees without any quality check or review of the submitted articles. The increasing tendency, however, is towards low-cost journals published only online, whose small publication costs can be covered by universities and institutions.
The FAIR principles as described above do not, in essence, interfere with the open access practice, and they do not prescribe open licenses. FAIR is focused on all research data in general, not only articles, and it keeps its recommendations limited to technical aspects such as protocols and APIs and the presence of metadata.
However, the choice of a license for the data does impact the degree of FAIR-ness. While the Findability principle is quite independent of the chosen license, the Accessibility principle is heavily affected by it. Open licenses allow for the redistribution of the data, making the access infrastructure more resilient, durable, and decentralized. They remove barriers and make the right of access to data more effective. The choice of license has an even bigger effect on the Reusability principle, through its legal as well as technical and architectural requirements.
FAIR data and open access are intertwined practices, and researchers need to consider both perspectives, especially in light of developing trends and policies. Recently, the U.S. government issued a memorandum (Ensuring Free, Immediate, and Equitable Access to Federally Funded Research) to all federal agencies establishing immediate access at no cost to all U.S.-funded research. This means that all research paid for with public money must be released in an open format, free of charge. This memorandum includes research data, such as research databases and other primary sources of information. Similar policies can be expected soon in the E.U. countries. Although not yet a binding policy, the European Commission already supports FAIR principles.
MaRDI's proposal concerning Accessibility
The efforts of MaRDI are, on the one hand, geared towards fulfilling the technical needs to have this network of federated repositories: creating APIs and setting standard formats and protocols to access information through the MaRDI portal. On the other hand, MaRDI aims to spread the FAIR culture amongst researchers by providing training on the practices and tools that will improve their data management.
One of the main MaRDI outputs is our portal, which will help researchers to find and access mathematical research data. The portal itself does not create a new gigantic repository to collect all mathematical research data. Instead, it facilitates the creation of a network of federated domain-specific repositories, making the already existing projects more connected, interoperable, and accessible from a single entry point.
In order to enable standardized retrieval of mathematical research data and their metadata, i.e. to make mathematical research data accessible to machines, the MaRDI consortium has decided to set up an API during the five-year funding period (see p.37, 53 of the proposal). This API will be integrated into the MaRDI Portal, the envisioned one-stop contact point for mathematical research data for the scientific community, by FIZ Karlsruhe and Zuse Institute Berlin.
Take as an example the API of zbMath Open, which has similarities to our portal. zbMath Open is a reviewing service for articles in pure and applied mathematics, where you can find 4.4 million bibliographic entries with reviews or abstracts of scholarly literature in mathematics. It has developed an open API offering the bibliographic metadata of each contribution. You can use this in different ways: to provide references for Wikipedia or MathOverflow, for so-called data-driven decision making, or even for plagiarism detection (see, for instance, this article).
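As an illustration of what a query against such an open API can look like, here is a short Python sketch. The endpoint and parameter names below are assumptions made for readability; please check the zbMath Open API documentation for the exact interface before relying on it.

```python
import requests

# Assumed zbMath Open REST endpoint and parameters; consult the official
# API documentation for the exact interface before using this in practice.
API_URL = "https://api.zbmath.org/v1/document/_search"
params = {"search_string": "four color theorem", "results_per_page": 5}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

for record in response.json().get("result", []):
    # Each record carries open bibliographic metadata (title, year, authors, ...).
    print(record.get("title"), "-", record.get("year"))
```

The important point is not the exact field names, but that the bibliographic metadata can be harvested automatically, which is what makes downstream uses such as reference checking or plagiarism detection feasible.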
In Conversation with Johan Commelin
In the second episode of the interview series Data Date, Johan and Christiane talk about mathematical research data in the Lean project, the importance of GitHub, accessibility in this context, and connected knowledge graphs.
Pizza and Data at StuKon22
Who would have thought that pizza and data go so well together? They do, as we found out at the DMV Student Conference, held in early August at the MPI MiS in Leipzig.
Three days of StuKon saw presentations of Bachelor's and Master's theses from 13 of the participating students, as well as talks and workshops on possible career paths for mathematicians, held by representatives of banks, academia, insurance, consulting, and cybersecurity firms.
The first evening was planned by MaRDI. StuKon participants were invited to enjoy their slices of delicious pizza while talking about their experiences with research data. Tabea Bacher gave a short presentation on MaRDI in a cozy, relaxed atmosphere. She introduced the FAIR principles, and the participants were confronted with the very broad concept of mathematical research data, encompassing proofs, formulae, code, simulation data, collections of mathematical objects, graphs, visualizations, papers, and any other digital object arising in research. Some of the common difficulties in (mathematical) research data were illustrated by an example from her own work.
Participants were then encouraged to talk to one another about their experiences and about what they would want or need from a MaRDI service. Ideas, problems, and questions were illustrated on postcards, which were briefly presented after this very educational dinner. From this, three recurring concerns were identified.
The need for a formula finder ranked high on the list of concerns raised by the students; this was also mentioned in the last MaRDI Newsletter. The second problem brought up was research being published in a language not mastered by the researcher who wants to build on it: it has to be translated first. One could argue that the translation can be done with available tools, or that one should not bother with a translation at all. But translated articles are not made publicly available and often remain on personal computers, so the next interested party has to repeat the process for themselves. Wouldn't it be nice to have a service that collected translations of articles and excerpts and made them accessible? If only to determine whether a paper really holds the information you need. And last but not least, the students felt that theses which expand on and explain a research paper or proof in detail should be linked to that paper or proof. These are often Bachelor's or Master's theses that are rarely published even on university servers, let alone anywhere else. The students felt that if these were linked to a dense proof or paper, it would help readers understand the research better - or at least more easily - and give context to the problem.
While other issues were raised as well, these were the main points discussed by the StuKon participants. As the organisers, we feel that it is important to include the next generation of mathematicians in the discussion on the FAIRness of research data. It seems that everybody left with MaRDI stuck in their heads. Hopefully, they will remember it as a place to consult, and possibly contribute to, in their future research careers.
image credit: Bernd Wannenmacher
The Future of Digital Infrastructures for Mathematical Research
At the DMV Annual Meeting (2022-09-12 – 09-16), we hosted a MaRDI mini-symposium: "The Future of Digital Infrastructures for Mathematical Research". As mathematics becomes increasingly digital and algorithms, proof assistants, and digital databases become more and more involved in mathematical research, questions arise on how to handle the research data that accumulates alongside a publication: storage, accessibility, reusability, and quality assurance. Speakers shared their experience with existing solutions as well as their visions and plans for how a well-developed, integrated infrastructure can further facilitate mathematical research.
The slides of all talks can be accessed via the MaRDI-website.
NFDI4Culture Music Award
This award, presented in two different categories, is given by the musicological community in NFDI4Culture and intends to recognize music-related or musicological projects and undertakings. Applications may be submitted by 30 September 2022. The funds (up to 3000 EUR) associated with the award are earmarked for expenses that contribute to the goals of NFDI4Culture and must be used by the end of 2023.
More information:
FAIR4Chem Award: The FAIRest dataset in chemistry!
This award is given for published chemistry research datasets that best meet the FAIR principles and thus make a significant contribution to increasing transparency in research and the reuse of scientific knowledge. NFDI4Chem will award the FAIRest dataset with prize money of 500 €, supported by the Fonds der Chemischen Industrie (FCI). Submission deadline is November 15, 2022.
More information:
On the first Monday of every month at 4 pm, the NFDI hosts a live InfraTalk on YouTube. Here, participants from the individual consortia talk about important topics to a general audience -- for instance, Harald Sack on Knowledge Graphs (March 7, 2022).
https://www.youtube.com/playlist?list=PL08nwOdK76QlnmEB659qokiWN3AC-kqFS
- Danish librarians have set up "How to FAIR: a Danish website to guide researchers on making research data more FAIR" https://doi.org/10.5281/zenodo.3712065. On accessibility, they say "Conducting research is often a team effort. Even before collecting the data, it is important to consider who will get access to the data, under which conditions, and what permissions they will have." and provide lots of use cases from all across the sciences: https://www.howtofair.dk/how-to-fair/access-to-data/
FDM Thüringen's Research Data Scarytales promises to "take you on an eerie journey and show you in short stories what scary consequences mistakes in data management can have". The multi-player game comprises stories based on real events and is designed to help you avoid potential pitfalls and traps in your research data management plan.
Welcome to the very first issue of the MaRDI (Mathematical Research Data Initiative) Newsletter. Research data in mathematics comes in many different flavors: papers, formulae, theorems, code, scripts, notebooks, software, models, simulated and experimental datasets, libraries of math objects with properties of interest... In short, the list is as long as mathematical research data is diverse.
Unfortunately, there is no straightforward or standard way to make these digital objects available for future generations of researchers. Availability, however, is not the only concern. In an ideal world, mathematical research data would be
FAIR: Findable, Accessible, Interoperable, and Reusable.
MaRDI is part of the German National Research Data Infrastructure (NFDI), and it is dedicated to building infrastructures to make mathematical research data FAIR. Work on solutions for some of the major problems we face today started last year, ranging from understanding the state-of-the-art technology of a field, all the way along the research pipeline, to establishing standards for peer review. As part of this process, it is especially important for us to engage you, the mathematics community, early on - so have a look at the list of our upcoming workshops!
This issue of the Newsletter is dedicated to the F in FAIR: to findability and what this means for mathematics.
licensed under CC BY-NC-SA 4.0.
We explore two aspects of what Findable means. First, we will focus on how to find data created by other researchers and then we discuss how to make sure your own data is findable for the math community.
In each newsletter, we will also publish an episode of our interview series on math and data: "Data Dates", introduce you to the people behind the MaRDI project, and offer some reading recommendations on the topic.
Have you ever…
- tried searching for a formula?
- seen a reference to a homepage that is long gone?
- put code on your personal webpage because you didn't know how and where else to publish it?
- browsed through the publications of your coauthor's coauthors looking for that one result that you almost remembered but not quite?
- not been able to find something you needed to keep going into the research direction you fancied?
Then you are not alone!
To find out where people search for math data, we ask you to answer our very short multiple-choice survey:
Where do you look for mathematical research data?
You will see the results here or right after submitting your answer.
How to find research data?
In the near-infinite resource that is the World Wide Web, where do you find your research data? Where are the "hubs" that concentrate these resources? And how does MaRDI propose to help with the challenges of Findability?
Data and FAIR principles
Modern science, including mathematics, relies increasingly on research data. Research data is the factual material required to verify research findings; in mathematics, this can also be the knowledge written up in an article.
Types of research data include literature, such as books and articles, databases of experimental data, simulation-generated data, taxonomies (exhaustive listings of the examples of a given category of objects), workflows, and frameworks (for instance, software stacks with all the programs used in a research project), etc. Even a single formula could be considered research data. To set up good practices in the scientific community, Wilkinson et al. published the FAIR Guiding Principles for scientific data management and stewardship. These principles are Findability, Accessibility, Interoperability, and Reusability.
In this article, we will introduce the Findability principle, with a focus on mathematical sciences, in connection with the infrastructure that is being developed by MaRDI.
For more information about what research data is and how to manage it (especially for researchers in German-speaking countries), you can visit forschungsdaten.info (in German). For a comprehensive introduction to the FAIR principles, you can visit the GO FAIR portal.
Findability
Findability is the first of the FAIR principles; it is also the most basic one, because if you can't find some data, you can't re-use it in any way - it is as if it did not exist.
When we try to find (research) data, we may face two situations: either we know that something exists and we are looking for it specifically, or we don't know exactly what we want and we look for anything related to a search term. In the first case, rather than finding that data, our problem is locating it somewhere in the physical or virtual space. In the second, our problem is to examine all the data available (in a certain catalog) for a certain characteristic that we are interested in.
Both problems can be solved by using a few tools. Firstly, each piece of data needs to have a unique reference or identification, so that we can build lookup tables for the location of each dataset. Secondly, together with the ID, we need other metadata that describes the data with some useful information (type, subject, authors, etc). Thirdly, we need to build comprehensive catalogs that gather all the metadata of the datasets and build search engines, which are algorithms to retrieve things from the catalogs.
Thus, the Findability principle can be made concrete in the following recommendations (a minimal example of a metadata record implementing them is sketched after the list):
- (Meta)data is assigned a globally unique and persistent identifier.
- Data is described with rich metadata.
- Metadata clearly and explicitly includes the identifier of the data it describes.
- (Meta)data is registered or indexed in a searchable resource.
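To make these recommendations tangible, here is a minimal, illustrative metadata record written as a Python dictionary. The field names loosely follow common schemes such as DataCite and are not a prescribed MaRDI format; the DOI is the Zenodo example discussed later in this newsletter, and the related link is a placeholder.

```python
# A minimal, illustrative metadata record for a dataset: the identifier is
# globally unique and persistent (a DOI), the record describes the data with
# rich metadata, and it explicitly repeats the identifier of the data it
# describes, so it can be registered in a searchable catalog.
metadata = {
    "identifier": "10.5281/zenodo.6538815",   # persistent identifier (DOI)
    "title": "Discrete global grids prepared for 3D printing",
    "creators": ["IMAGINARY"],
    "publication_year": 2022,
    "resource_type": "dataset",
    "subjects": ["discrete global grid", "climate modelling"],
    "related_identifiers": [
        # Placeholder link to the project page hosting the source files.
        {"relation": "isSupplementTo", "identifier": "https://example.org/project-page"},
    ],
}
```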
The classical approach for searching and finding data has been dominated by the publication paradigm: You look for a specific publication, or for any publication related to a certain topic, that will contain the information you are interested in. However, in reality, you often want to find a theorem, a formula, or any concrete information rather than a publication. For instance a specific expression of a Bessel function, a particular representation of a given group, or the proof that certain differential equations have unique solutions. This approach requires re-thinking how we structure and manage research data. We discuss next the available places to find research data and then the MaRDI proposal for such a comprehensive approach.
Where to look for research data
For mathematical articles, books, and other classically published works, a reference includes title, author, year, etc. While this is easily usable and readable by a human, it is not always consistent in format and it does not provide a means to locate and access that information. The two de-facto standard catalogs that collect mathematical literature and also assign a unique identifier are:
- The ZentralBlatt Mathematik (unique identifier: Zb number), archived in zbMath by the FIZ Karlsruhe - Leibniz Institute and
- The Mathematical Reviews (unique identifier: MR number), archived in MathSciNet by the American Mathematical Society.
While these unique identifiers are helpful for referencing a piece of mathematical literature, and these platforms are useful for finding works in a specific math domain, their catalogs are much less comprehensive when it comes to other research data (databases, media, online resources, etc.). They also have the drawback that authors cannot control the existence or the metadata of an entry, and MathSciNet is a subscription-based service*.
Another notable mention is arXiv, which is a de-facto standard platform for pre-publications. Here the actual paper is offered publicly, thus making it Accessible. Furthermore, any work on arXiv also gets a unique ID and can be found via the catalog search. The focus here is also on literature, although there is limited support for datasets related to a paper. When it comes to non-literature research data, the panorama is much coarser. swMath, a sister project to zbMath, is a catalog of mathematical software packages (computer algebra, numerics, etc.) and a cross-referencing record of the zbMath articles that cite them. zbMath also features a full-text search for formulas, which is being improved within the MaRDI framework.
There are also general-purpose identifiers and catalogs for data. One of the most standardized identifiers for online resources is the Digital Object Identifier (DOI), which references any digital object. Unlike a URL, the DOI is linked to a particular file and not to the server or website where it is hosted. The DOI website resolves the DOI number to the most up-to-date URL to access the data, so the DOI also serves as a locator in addition to being a unique identifier. Usually, publishers assign a DOI to new publications but authors can also obtain a DOI in other registration agencies. Some open repositories offer free DOI registration. For instance, Zenodo is a general-purpose repository for open data, which hosts quite a few mathematical research datasets. See our article "Publishing on open repositories" where we talk more about Zenodo.
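A DOI can also be resolved programmatically. The sketch below follows the standard DOI resolution and content-negotiation mechanism offered at doi.org (the DOI is the Zenodo example used later in this newsletter); it is a minimal illustration, not a MaRDI tool.

```python
import requests

doi = "10.5281/zenodo.6538815"  # example DOI of a Zenodo record

# Following the redirect resolves the DOI to its current landing page.
landing = requests.get(f"https://doi.org/{doi}", timeout=30)
print("Resolves to:", landing.url)

# DOI content negotiation: the same URL can return structured citation metadata.
meta = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
meta.raise_for_status()
record = meta.json()
print(record["title"], "-", record.get("publisher"))
```

This is exactly what makes the DOI both a unique identifier and a locator: the registration agency keeps the mapping to the current URL up to date, so references in articles never point to a dead server.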
Currently, for pure research databases (experimental data, simulation data, etc.), there is no universally accepted repository in mathematics. There are a few curated collections of mathematical objects, such as the Online Encyclopedia of Integer Sequences (OEIS), the SuiteSparse Matrix Collection, and the NIST Digital Library of Mathematical Functions. The reality is that many researchers rely on open repositories for access to data. Unfortunately, in contrast to biological repositories, where researchers can find standardized catalogs of proteins or genetic encodings, mathematical catalogs are neither general-purpose nor very interoperable.
MaRDI's proposal concerning Findability
Unfortunately, most data-based mathematical research is still published either without the datasets, or the datasets are hosted on university servers accessible only through personal websites of the researchers involved.
MaRDI aims, on the one hand, to provide the necessary ground infrastructure to properly publish research data in federated repositories (using standards and practices in line with the FAIR principles); on the other hand, it plans to spread awareness within the math research community of the problems that publishing research data entails and of the solutions proposed for them.
Here we will name a few of the initiatives related to the Findability principle.
The Scientific Computing Task Area (TA2) is preparing a benchmark framework to compare existing and new algorithms and methods for solving specific problems. For instance, there are several dozen methods to solve a linear system Ax=b, with different performance and different technology stacks, depending on the size of the matrix A, whether it is sparse or dense, whether we look for exact or approximate solutions, etc. So far, there is no centralized catalog where a "user" (for instance, a computational biologist) can go to choose the best method for their particular application. This catalog and benchmark will make finding symbolic and numerical algorithms much easier, and it aspires to be a major reference when looking for such algorithms.
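To illustrate why such a catalog and benchmark are useful, here is a small, self-contained comparison (this is not MaRDI's benchmark framework, just an everyday example) of a dense and a sparse solver applied to the same system Ax=b:

```python
import time
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000
# A sparse, well-conditioned test matrix: tridiagonal with a dominant diagonal.
A_sparse = sp.diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
A_dense = A_sparse.toarray()
b = np.ones(n)

t0 = time.perf_counter()
x_dense = np.linalg.solve(A_dense, b)   # generic dense LU solver
t1 = time.perf_counter()
x_sparse = spla.spsolve(A_sparse, b)    # sparse direct solver exploiting the structure
t2 = time.perf_counter()

print(f"dense solve:  {t1 - t0:.4f} s")
print(f"sparse solve: {t2 - t1:.4f} s")
print("max difference between solutions:", np.max(np.abs(x_dense - x_sparse)))
```

Both calls return the same solution, but their cost differs enormously as the matrix grows; a catalog that records such trade-offs for many methods and problem classes saves the "user" from rediscovering them.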
The tool for this is a knowledge graph of numerical algorithms. A knowledge graph is an abstract representation of a set of concepts, objects, events, or anything related to a domain of study, as nodes, together with formal relations between them (edges), that can be read by humans and computers unambiguously. The biggest collective effort to build a knowledge graph is Wikidata. In this mathematical knowledge graph, the nodes will be the algorithms themselves as concepts, but also papers related to them, software packages implementing them, benchmarks, and connections to other databases. It will then be possible to navigate the knowledge graph to find semantic information, such as which algorithms extend a given one, where implementations can be found, and how they perform in comparison.
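For a flavour of how such a graph can be queried, the following snippet runs a small SPARQL query against Wikidata's public endpoint, asking for a few items classified as algorithms. The MaRDI knowledge graph will have its own identifiers and properties; the Wikidata item Q8366 ("algorithm") and property P31 ("instance of") are used here purely as an illustration.

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
# Q8366 is Wikidata's item for "algorithm"; P31 is the "instance of" property.
query = """
SELECT ?algo ?algoLabel WHERE {
  ?algo wdt:P31 wd:Q8366 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "MaRDI-newsletter-example/0.1"},
    timeout=60,
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["algoLabel"]["value"], "->", row["algo"]["value"])
```

The same pattern - formal triples queried by a standard language - is what will let both humans and programs ask the MaRDI graph questions such as "which implementations exist for this algorithm?".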
Another effort aimed at Findability in MaRDI is Mathematical Entity Linking (MathEL), a way to extract and compare conceptual information from mathematical formulas. The concept behind a particular equation (for instance, the Klein-Gordon equation or the equations of General Relativity) can be expressed in many different forms: variables can be named differently, notations for derivatives or tensors may differ, and groupings and substitutions can occur. The MathEL sub-project aims to retrieve the conceptual information of formulas, to propose annotation standards for introducing semantic information into formulas (for instance, referencing a Wikidata node or another knowledge-graph node), to mine large corpora of research data (for instance, the Zb catalog or the arXiv repository), and to create user interfaces to retrieve concept and source information, such as question-answering engines.
To illustrate this, here is a sneak peek into the MaRDI portal, under development, which will integrate the MathWebSearch search engine as a MediaWiki component. The formula search can find wiki pages based on formula expressions denoted in LaTeX on the pages of the MaRDI portal. This test wiki page contains a couple of math formulas. The search portal should be able to find those formulas when queried in the search box. With the TeX and BaseX configuration, you can try an input like " V=4/3 \pi r^3 " or " V=\frac{4}{3} \pi r^3 ", and it will find the wiki page with the test formula. Also, with " V = 4/3 \pi ?s^3 " you can find variable substitutions. Other common re-writings, such as " V = \frac{4\pi}{3} r^3 ", are not yet recognized, but the core search engine is under active development. The same engine is used in the zbMath formula search. Plans for MaRDI include making entities in a Wikibase knowledge graph findable through formula search.
In subsequent articles, we will present other tasks being carried out within MaRDI** that exemplify the other FAIR principles (for instance, open interfaces or descriptions of workflows).
* MR Lookup offers limited services to non-subscribers. As of 2021, zbMath became zbMATH Open and requires no subscription.
**The funded MaRDI proposal can be accessed here.
Taking some data from a project, we try to prepare it according to the FAIR principles. Follow us in our attempt to make it FAIR on the first try.
Publishing research data in open repositories
We are IMAGINARY, a math communication association, part of the MaRDI consortium and we develop and organize math exhibitions as our main activity. Using data that we collected about Earth grids for one of our recent projects on climate change, we will take you through how we almost painlessly set up data in a public repository.
Our latest exhibition is the "10-minute museum on the climate crisis mathematics", where we describe mathematical modeling and places where maths is used in climate science. We all know that the latitude and longitude grid is the most common way of creating a reference system on the Earth. Did you know there are other ways to divide the Earth into small regions that can be particularly useful in numerical models?
Quite excited by this, we contacted a couple of climate researchers who were able to prepare for us the sets of geographic nodes and edges that make those grids. Then another one of our collaborators took that data and converted it into a 3D-printable model by adding thickness to the edges and checking the structural integrity of the ensemble so that it could be a physical object. Finally, a 3D printing company made the objects that we used in our exhibition.
As this dataset was not used in a way that contributed to existing knowledge, it was not suitable for publication in a journal. However, it occurred to us that the data we gathered and processed was niche and specific enough to be a basis for others to re-use and build on.
Being a company committed to Free and Open Source licenses, we wanted to not only make the data available but FAIR as well.
Git (GitHub, GitLab)
Since we were dealing with software files, the most convenient platform for publishing and developing was GitHub. Git is an efficient version-control software, and any organization of code should start here. GitHub and GitLab are probably the most popular platforms to host projects. However, as a publishing tool, a repository could be considered almost a kind of personal website (in fact, you can host and serve a git repository on your own server), and it is a live, working tool. This means that the published data can change at any time. GitHub does not offer, by default, a guarantee of stability (although there are archive options), a standardized identifier, or a good way to search and find your data. Also, it keeps a record of all previous versions, so all the dirty work is exposed to the public.
Our GitHub page was our collaboration tool within the team. It was not intended as a publication method; it just happened that we left it to be publicly available. Having data available somewhere does not automatically make it FAIR. We wanted to have an identifier associated with it and we knew that some repositories offered that.
Zenodo
Zenodo is one such open-access, general-purpose repository. It is hosted on the CERN infrastructure and funded in part by the European Commission. Researchers in any scientific area use it to make a copy of their work findable and accessible to the public. These works can be articles or books, as pre-prints or, in some cases, already published by traditional publishing houses, but also databases, data files, images, or any digital asset that their research relies upon.
Zenodo offers a Digital Object Identifier (DOI) if the work does not already have one. In this case, the DOI contains a "zenodo" string in it. For instance, 10.5281/zenodo.6538815.
This was a perfect fit for our data and as a bonus, creating our entry on Zenodo was not difficult!
Firstly, we created an account. A valid email is all you need. You can also link it to your ORCiD to determine the author(s) uniquely.
Secondly, we made a new upload draft. You can choose the type of document (publication, poster, dataset, image, video, software, physical object, etc.) and fill in the form with the title, authors, publication date (can be in the past), description, and several other fields.
For the authors, we added the ORCID of those who had it. We also used "IMAGINARY" as an author, even though it was not a physical person but a company.
We requested a new DOI since we did not have any. The DOI can be "reserved" during the draft process, so you know it in advance and can use it in the documents you prepare.
For the actual content, we used a zip file with the master branch of the GitHub repository. You can also link your Zenodo account to your GitHub account so that whenever you make a "release" in GitHub, a snapshot is automatically published in Zenodo.
Finally, we submitted the draft. Take note: once published, you can't add, delete or modify the files associated with a DOI, which is the main point of the DOI. You would have to make new versions with a new DOI. Thus, we recommend that you double- and triple-check before clicking submit. In case you make an erroneous submission, you can write an email to the Zenodo administrators for help.
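The same workflow can also be scripted against Zenodo's REST API, which is convenient if you publish data regularly. The sketch below follows the documented deposit workflow, but the access token, file name, and metadata fields are placeholders, and the exact fields should be checked against Zenodo's API documentation; for a first upload, the web form we used is perfectly sufficient.

```python
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = "YOUR-ZENODO-ACCESS-TOKEN"  # placeholder: create a token in your Zenodo account settings
params = {"access_token": TOKEN}

# 1. Create an empty deposition (this also reserves a DOI).
resp = requests.post(ZENODO_API, params=params, json={}, timeout=30)
resp.raise_for_status()
dep = resp.json()

# 2. Upload a file to the deposition's file bucket.
with open("earth-grids.zip", "rb") as fp:  # placeholder file name
    requests.put(f"{dep['links']['bucket']}/earth-grids.zip",
                 data=fp, params=params, timeout=120).raise_for_status()

# 3. Attach minimal metadata (illustrative values).
metadata = {"metadata": {
    "title": "3D-printable discrete global grids",
    "upload_type": "dataset",
    "description": "Geographic nodes and edges prepared for 3D printing.",
    "creators": [{"name": "IMAGINARY"}],
}}
requests.put(dep["links"]["self"], params=params, json=metadata, timeout=30).raise_for_status()

# 4. Publish. This step is irreversible: the files cannot be changed afterwards.
requests.post(dep["links"]["publish"], params=params, timeout=30).raise_for_status()
```

The same caveat as for the web form applies: only publish once you are sure the files are final, since a published DOI is frozen by design.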
Wikipedia / Wikidata
We now have an identifier that would make our data easy to find if you have it, or if you happen to search in Zenodo's search box. But now, we wanted to increase our Findability. We needed to include our data in places where people often look for information and Wikipedia / Wikidata are the perfect places for that.
Wikipedia is the universally known collaborative encyclopedia. With more than 6 million articles in English, it would be easy to find an article relating to your data. However, before advertising your data on Wikipedia by editing general-interest articles, you must be familiar with the core principles of Wikipedia content: Neutral point of view, Verifiable, and No original research. That is to say, only link to research and data published elsewhere and do not hijack articles for self-promotion.
In our case, we found an article on Discrete global grid. Since our work provides an example of such grids, it could be of general interest. Additionally, as there are no other examples of 3D-printable grids that we are aware of, we decided to add a link in the "External references" section.
We then had a look at Wikidata. Wikidata is the data backbone of Wikipedia. In contrast to Wikipedia, which is made of articles, Wikidata is made of entries; every entry can be an object, an abstract concept, a person, a feeling, a math research article... essentially anything. Every entry lists some properties of the item in a structured form. It is human-readable but also designed to be machine-readable, meaning that one day an AI or search engine can obtain knowledge from this enormous database, which aspires to structure all human knowledge. As such, it is a suitable place to catalog research data. Many researchers index their articles there (listing title, authors, DOI...), as well as databases, models, etc. But many don't, so it is not yet a comprehensive research (or general) catalog. It is also less intuitive as a search tool than Wikipedia (there is no full text to read), and it can be challenging to retrieve useful information by hand.
In our case, searching for "Earth grid" produced nothing, while "Earth system grid" brought us to the US Energy department portal, and we learned that "Grid in Earth sciences" is the title of a concrete published article. We finally found the Wikidata entry on "Discrete Global Grid" (linked in the Wikipedia article) which is about the concept, but not much information therein. We could have created a Wikidata entry and have our data listed as an instance (example) of a Discrete Global Grid, but we found that our 3D data would have more context in the Wikipedia article. Therefore, we decided not to put our reference in Wikidata.
After asking some colleagues, we found that a more typical use case would be the following: A published research article uses a dataset. Then a Wikipedia page references the published article as a source. By creating a reference in Wikipedia, an entry in Wikidata is created. Then a (different) entry in Wikidata representing the dataset is linked to the entry representing the published article. This way, there is a path from Wikipedia to the research data referenced in Wikidata. Hopefully, eventually, the dataset is used in other publications (referenced in other Wikipedia pages) and Wikidata can keep track of all the works derived from that dataset.
Assessing the FAIRness
At this point, we were wondering: how can we tell if our data is really FAIR? How well did we do? Fortunately, there is also a tool to assess that!
The Automated FAIR Data Assessment Tool from the FAIRsFAIR data initiative accepts any working reference, a DOI for instance, and tries to determine its FAIRness from its metadata. It generates a summarised report with individual scores and a final global mark. Luckily for us, Zenodo handles that metadata quite well and makes it available via the HTML code on the Zenodo page itself.
So how did we do? On a scale from 0 to 3, our grand score is: "moderate" or 2.
To improve that score, we could have edited the metadata and added more details; however, that is still a feature under development in Zenodo (e.g., supporting the citation file format), and it may be a bit cumbersome to edit that metadata on other platforms.
Conclusion
Overall, we were satisfied with this experiment in making our data FAIR. The GitHub workflow is a bit difficult to learn, but it is nowadays part of software development, and an added benefit is that it can be integrated into FAIR workflows. Zenodo was a success: easy to use, it takes care of most of the metadata, and it provides free DOIs. Wikipedia is not difficult, but you need to keep your interest in gaining visibility from undermining the general interest of an encyclopedia. As for Wikidata, we concluded that it is not for our use case (although it might be for other research data). Finally, the FAIR data assessment tool is great not only for evaluating but also for teaching good practices and improving your FAIRness. There are probably many more tools and hints for us to discover, but so far it was not such a hard trip to make.
We hope that reading about our experience encourages you to re-evaluate and want to improve the FAIRness of your data.
In Conversation with Cedric Villani
In the first episode of the interview series Data Date, Cedric Villani joins Christiane Görgen for a brief exchange of thoughts about Math & Data.
OpenML hackathon at Dagstuhl castle
Sebastian Fischer and Oleksandr Zadorozhnyi, of the MaRDI task area Statistics and Machine Learning, participated in an OpenML hackathon held in late March at the headquarters of the Leibniz Center for Informatics at Dagstuhl, Germany.
OpenML is an open-source platform for sharing datasets, algorithms, experiments, and results. The hackathon was initiated by Bernd Bischl, one of the key players behind OpenML and a Co-Spokesperson in MaRDI. Researchers from other parts of Germany, France, the Netherlands, Poland, and Slovenia were present to discuss topics such as data quality on OpenML, an extension of its established services to new data formats, and new computational tasks.
The review article "Datasheets for datasets" provided the basis for fruitful exchanges on future improvements of data and metadata quality. In particular, support for non-tabular data formats such as images was discussed and will now be implemented by transitioning from the Attribute-Relation File Format (ARFF) to Parquet. The eight types of tasks available so far, including regression, classification, and clustering, will be extended by new tasks that are typical for graphical modeling. As this is one of the main use cases and an important topic for both Sebastian and Oleksandr, they discussed with Jan van Rijn the problem of estimating graphical-model structure from a given dataset, its embedding into the current set of tasks available on OpenML, the addition of different evaluation measures and criteria for model selection, and the storage of graph-specified datasets within the OpenML framework. These evaluation measures and criteria for model selection allow the comparison of estimated graphs to a given ground truth, a procedure that is not normally part of the ML workflow.
Sebastian also presented their collaborative work with Michael Lang on the mlr3oml R package. This package connects the OpenML platform to the open-source machine learning mlr3 package in R, another crucial aspect of the MaRDI task area.
The hackathon was rounded out with social activities like a walk through the forest. The good weather aside, special thanks must go to Joaquin Vanschoren, the OpenML founder, whose supply of water to the whole group during the hike was the other reason why everyone made it back to the castle in good spirits!
All in all the week in Wadern was a pleasant and fruitful one for all the participants.
We will also be introducing you to the people who shape MaRDI with their expertise and vision for mathematical research data. They will appear in a series of "Making MaRDI" interviews available via our Twitter account. Stay tuned!
Call for seed funds 2023
These funds support scientists from all fields of research within engineering for the development and implementation of innovative ideas in data management. The grant is equivalent to the funding of a full-time doctoral position for one year. If necessary, the funding can be split between project partners.
More information:
- To learn about the Nationale Forschungsdateninfrastruktur, the community of which MaRDI is just one small part, read the 2021 article by Nathalie Hartl, Elena Wössner, and York Sure-Vetter in Informatik Spektrum. See doi.org/10.1007/s00287-021-01392-6
- Christiane Görgen and Claudia Fevola explain in a short review article the role repositories can play in the MaRDI infrastructure. They use MathRepo as an example, a small math research-data repository hosted at the Max Planck Institute for Mathematics in the Sciences in Leipzig. See arxiv.org/abs/2202.04022
- The interim report of the European Commission Expert Group on FAIR data discusses how to turn FAIR into reality. See doi.org/10.2777/1524
- Thomas Koprucki and Karsten Tabelow have been two of the driving forces in the early stages of MaRDI. Together with Ilka Kleinod they discussed mathematical models as an important type of mathematical research data in a 2016 article for the Proceedings in Applied Mathematics and Mechanics: doi.org/10.1002/pamm.201610458
Our Newsletter "Math & Data Quarterly" is prepared by our partner IMAGINARY. Sign up to get it delivered straight to your inbox quarterly.