Welcome

Welcome to this year's final newsletter, our seventh issue on math and data! We are happy to present you with troves of information on managing your mathematical research data. Over the past issues, we have discussed the implications of the FAIR acronym for mathematics, how to search for and structure math results using knowledge graphs, and the special characteristics of mathematical research data. We will close this cycle by raising these questions: What do funding bodies and NFDI members recommend in their research-data management guidelines? How can we practice this in maths? What are MaRDI's recommendations?

For answers, check out our interview with math research-data manager Christoph Lehrenfeld; the keynote article on what to write in a research-data management plan; various reports from meetings where these topics were discussed, especially with a community of librarians; and, of course, the ever-intriguing list of recommended reading.

Enjoy, and season's greetings!

As always, we start off with an illustration. This time, it depicts the different research data types in mathematics, as discussed in our previous issue.
Hot tip: Send the illustration to your colleagues and friends as a seasonal greeting.

In the previous issue, we asked what type of mathematician you are. All types of mathematicians are represented significantly within our newsletter community. The largest fraction belongs to the Guardians of the Data Vault category. You can also check out this page for more information, including a free poster download.
Here are the results:

Now, back to the current topic: Research Data Management in Mathematics. Have you ever been asked about data handling in a funding proposal? The survey in this newsletter issue deals with exactly that:

Data handling in your funding proposal?

Research Data Management in Mathematics

Data Management: From Theory to Practice

In previous MaRDI newsletter articles, we discussed what mathematical research data is, the guiding principles that define proper, good-quality research data (the FAIR principles), and why you, as a researcher, should care about your data. It is now time to raise the question of how to properly curate your research data in practice.

Research Data Management (RDM) refers to all handling of data related to a research project. It includes a planning phase (written up as a formal RDM plan and included in applications to funding agencies), ongoing data curation and plan revision during the project, and an archival phase at the conclusion of the project.

In this article, we will survey the main points to consider for proper data management. There are, however, more comprehensive and detailed guides that you can use to create your own RDM plan. The MaRDI community has written a report on Research-data management planning in the German mathematical community, and a whitepaper (Research Data Management Planning in Mathematics) that will be helpful in the context of mathematics. You can also get useful resources from other NFDI consortia, such as the FAIRmat Guide to writing a Research Data Management Plan.

Writing an RDM plan

An RDM plan is a document that describes how you and your team will handle the research data associated with your research project. It serves as a helpful reference for researchers on how to fulfill data management requirements, and nowadays many funding agencies require one as part of their project application regulations.

There are several standard key points to consider in an RDM plan. These points were developed by Science Europe and have been adopted by agencies internationally. For each point, you can check the evaluation criteria that reviewers are likely to apply to your application. In Germany, the key requirements are given by the German Research Foundation (DFG).

Data description

First and foremost, you need to know the type of data you will be handling. Start by describing the types of data involved in your project (experimental records, simulations, software code…). It is a good idea to separate data by its provenance: internal data is data generated within the project, whereas external data is data used for the project that is generated elsewhere. When recording internal data, specify the means of data generation (by measuring instruments in the laboratory, by software simulation, written by a researcher…). As for the documentation of external data, include details of any interface/compatibility layer used (for instance format conversions).
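
As an illustration, a data description can be kept in a simple machine-readable form next to the data itself. The following is a minimal sketch in Python; all field names and values are hypothetical and not prescribed by any funding agency or by MaRDI.

```python
# Minimal, hypothetical data-description record for an RDM plan.
# Field names are illustrative only, not a MaRDI or DFG standard.
dataset_description = {
    "name": "eigenvalue_experiments",
    "provenance": "internal",        # generated within the project
    "generated_by": "simulation",    # vs. instrument, manual entry, ...
    "formats": ["csv", "json"],      # open formats preferred
    "estimated_volume": "2 GB",
    "external_sources": [            # external data used by the project
        {"name": "OEIS", "url": "https://oeis.org", "conversion": "none"},
    ],
}
```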

Data workflows are themselves a type of data. If you process data in a complex way (combining data from different sources, involving several steps, using different tools and methods…), the process itself, the workflow, should be properly documented and treated as research data.
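
One lightweight way to document a workflow is to record its steps, tools, inputs, and outputs explicitly. A minimal sketch, with hypothetical step names and file names:

```python
# A workflow recorded as data: each step lists its tool, input, and output.
# All names are hypothetical placeholders.
workflow = [
    {"step": 1, "tool": "fetch_data.py",       "input": "raw source (external)", "output": "raw.csv"},
    {"step": 2, "tool": "clean_data.py",       "input": "raw.csv",               "output": "clean.csv"},
    {"step": 3, "tool": "find_eigenvalues.nb", "input": "clean.csv",             "output": "results.json"},
]
```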

Plan in advance the file formats required for recording, the necessary toolchains, and other aspects that affect interoperability. Prioritize the use of open formats and standards (if you need to use proprietary formats, consider saving both the proprietary and an exported copy to an open format). Finally, estimate the amount of data you will collect, and any other practical needs that you or anyone using the data will require. As you cannot foresee in detail the various data requirements (for instance, you may not know the specific software tools necessary to solve your problem), your RDM plan should be updated at a later stage if the type, volume, or characteristics of your data change significantly.
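
For instance, if your results live in a proprietary format, a small conversion script can keep an open-format copy alongside it. Here is a sketch using Python with NumPy and SciPy, assuming a hypothetical MATLAB file results.mat containing a matrix variable A:

```python
# Export a matrix stored in a proprietary MATLAB file to open CSV.
# "results.mat" and the variable name "A" are hypothetical.
import numpy as np
from scipy.io import loadmat

data = loadmat("results.mat")            # dict of variable name -> array
np.savetxt("results.csv", data["A"], delimiter=",")
```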

In mathematics, you will likely generate text and PDF files for your manuscripts, with graphics from different sources. If your bibliography grows beyond a hundred references, you may keep separate BibTeX (.bib) files that you can reuse across publications; these constitute an often overlooked piece of research data.

If your project involves computations, you will have scripts, notebooks, or code files that serve as input for your computation engine. Your system will require a toolchain to work, for instance a particular installation of Singular, OSCAR, MATLAB, a C compiler…, together with installed libraries and dependencies. You may use an IDE or a particular text editor (while that may seem a personal choice not relevant to other users, it is in fact quite useful to know how the software was developed in practice). This toolchain is also a piece of research data that needs to be curated. You may have output files that require documentation, even if they can be recreated from the inputs. If your project involves third-party databases, these should be properly referenced and sourced.
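
A simple way to curate such a toolchain is to record the exact interpreter and library versions in a plain-text file archived with the project. A minimal Python sketch; the package names are examples only:

```python
# Record the computational toolchain as research data.
import sys
from importlib.metadata import version

with open("toolchain.txt", "w") as f:
    f.write(f"python {sys.version}\n")
    for pkg in ["numpy", "scipy", "sympy"]:   # example packages
        f.write(f"{pkg} {version(pkg)}\n")
```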

Documentation and data quality

Data must be accompanied by rich metadata that describes it. Your documentation plan should state the metadata you need to collect and explain how it will stay attached to the data. Once you have a description of your data, you need to organize it. Create a structure that will accommodate all the generated data; it can include a hierarchy in your filesystem, conventions for naming files, or another systematic way to find and identify your data easily. Do not call your files “code.sing”, “paper.tex”, or “example3_revised 4 - FINAL2.txt”; instead, use meaningful names such as “find_eigenvalues.nb”, and start each document with comments explaining what the file is, its author, language, date, references to theory, how to run it, and any other useful information.
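
For example, a script could start with a header like the following (a hypothetical Python file; every field is a placeholder):

```python
# find_eigenvalues.py
# Author:  A. Researcher <a.researcher@example.org>
# Date:    2023-12-01
# Purpose: compute eigenvalues of the adjacency matrices stored in clean.csv
# Theory:  see Section 3 of [reference to the underlying paper]
# Usage:   python find_eigenvalues.py clean.csv results.json
```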

Good documentation will be a crucial step in enabling the reusability of the data. If you are developing a software library, you need to document the functions, APIs, and other parts of the software, including references to the theoretical sources that your algorithms are based on. If you are curating a database or a classification of mathematical objects, you need to document the meaning of fields in your tables, the formulae for derived values, etc. If parts of the data can be re-generated (for example, as a result of a simulation), you should describe how to do so, and differentiate clearly between source data and automatically generated data.

Data quality refers to the FAIRness of the data, which needs to be checked and addressed during the implementation phase. At the planning stage, you can define metrics to evaluate data quality and set up quality-control mechanisms, for instance checking the integrity of the data periodically and testing whether the whole toolchain can still be installed and executed successfully. You can also plan a contingency in case some tools become obsolete or unavailable.
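
A periodic integrity check is one such quality-control mechanism. A minimal sketch using SHA-256 checksums recorded when the data was archived; the file names and digests are placeholders:

```python
import hashlib
from pathlib import Path

def sha256(path):
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Digests recorded when the data was archived (placeholder values).
expected = {
    "clean.csv":    "<digest recorded at archiving time>",
    "results.json": "<digest recorded at archiving time>",
}
for name, digest in expected.items():
    print(name, "OK" if sha256(name) == digest else "CHECK FAILED")
```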

Storage and archiving

Storing and archiving research data is not a trivial matter and should be planned carefully. On the one hand, it requires security against data loss (or data breaches, in the case of sensitive data); on the other hand, the data needs to be accessible to all the researchers involved in a practical way.

Your storage strategy should take into account the amount of data (large volumes are more difficult to move and preserve), its persistence (experiments are usually recorded only once, whereas an article or computer code is rewritten again and again with improvements), the number of people needing write access, and the sensitivity of the data (for instance, data related to people can contain private information that needs to be anonymized and access-controlled, whereas non-sensitive data can be put in public repositories).

Backup plans should include keeping several copies of the data, in different physical locations and on different media. Important key files (e.g. indices or tables listing the contents of other files) should be specially protected and backed up. Keeping data synchronized and properly versioned is also important to avoid disorganized copies. If several people on your team need write access to the same data, you need appropriate tools to avoid conflicting versions, such as online multi-user editors (e.g. Overleaf, Nextcloud/Collabora, Google Docs…) or version control systems. Remember to always keep a local backup copy; do not keep your only copy in the cloud. Good scientific practice requires keeping your data for at least ten years after publication or project completion.
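
To make the "several copies" rule checkable, independent backup copies can be compared by checksum. A sketch with hypothetical paths; in practice the copies would live on different media and machines:

```python
import hashlib
from pathlib import Path

def sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Hypothetical locations of the same file on different media.
copies = ["/data/project/results.json",
          "/mnt/backup1/results.json",
          "/mnt/backup2/results.json"]
digests = {p: sha256(p) for p in copies}
print("all copies identical" if len(set(digests.values())) == 1
      else f"WARNING: backup copies differ: {digests}")
```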

Legal and ethical obligations

It is important to comply with the legal obligations associated with data, and to address any related ethical considerations that may arise.

All data should be associated with an author or owner, who has the right to decide on a license and to control access, usage, and other legal prerogatives. Intellectual property and copyright may apply to some data, such as patents, software, commercial products, or publications. Intellectual property protects ideas (but not facts of nature), while copyright protects the expression of an idea. In mathematics, a theorem cannot be protected as intellectual property (since it is a fact of nature), while an algorithm for a practical purpose, or its implementation in software, could be. The text of a scientific article is generally protected by copyright, even if the ideas contained therein are free. If the copyright of your written texts is to be transferred to a publisher (as is standard practice), you should state the conditions that are acceptable within your project and your publication strategy (see next section).

Sensitive data (e.g. medical records, personal information…) require a specific data handling policy with special attention to data protection and access.

In all cases, you should include an appropriate license note after evaluating all its implications. For open licenses, prefer standard licenses (for instance, Creative Commons or free/open-source software licenses) instead of crafting your own or adding/removing clauses, which can lead to license incompatibilities and encumber reusability. See our article on Reusability.
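
In source files, a standard license can be declared in a short machine-readable header, for instance with an SPDX identifier (the license and names below are placeholders):

```python
# SPDX-License-Identifier: MIT
# Copyright (c) 2023 A. Researcher
#
# Declaring a standard license identifier like the above avoids
# hand-crafted clauses and keeps the license machine-readable.
```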

Data exchange

Data exchange involves integrating the project and its data with the community. Data should be readily findable, accessible, usable, and reusable by anyone with a legitimate interest in it. In practice, data exchange involves a long-term data preservation strategy (archiving) as well as a dissemination strategy (using community standards to share data).

While the storage and archiving section above concerns mainly data security and preservation, this section focuses on the FAIRness (and especially the accessibility) of the data and its exchange within the community. Naturally, the two topics overlap. The FAIRness of data is heavily affected by the FAIRness of the repositories or hosting solutions that store it and make it accessible. Look for FAIR and reliable repositories in your domain that can host online versions of your data during the implementation of the project and also act as a long-term archiving solution.

In this section, you can include the publication strategy for your research articles: whether you plan to publish preprints or final articles in free repositories (like arXiv), whether you will publish only in open-access journals, and so on. Though there are comprehensive catalogs for the mathematical research literature (zbMATH Open, MathSciNet), it is always good to ensure that your publications are findable and accessible. For other types of data, you should carefully consider their dissemination and ensure that your data is listed in the relevant catalogs.

Responsibilities

All research data needs someone to take care of it, so a person (or a team) must take responsibility for the research data in the project. This responsibility may fall to the owner/author or to someone else, and a data steward can be appointed to help with the technical aspects of data management. Different teams can also be designated with different responsibilities during different phases of the project (planning, implementation, archiving), but these roles should be public and well defined, and serve as a contact point during and after the project.

If the data is meant to be static (no changes in the future), then the responsible person is only answerable for what has been published. If, on the other hand, the data is expected to grow (for instance, a growing classification of mathematical objects), a maintainer should be appointed to keep track of advances in the field and incorporate new data in its appropriate place. If the maintainer can no longer fill that role, the position should be transferred to another suitable person or team, for as long as the project needs a maintenance team.

MaRDI RDM consulting

We can offer a couple of examples of RDM plans from MaRDI, developed in the context of mathematics by MaRDI members: first, for a project applying statistical analysis to datasets containing student records for a study in didactics [RDMP1]; second, for a project that develops algorithms and software with applications in robotics [RDMP2]. These are prototypes we prepared together with RDM experts from Leipzig University for mathematics projects planned by our researchers, and we handed them out as examples to the community at the DMV annual meetings in 2022 and 2023.

MaRDI can offer consulting services for math projects that need help with creating their own RDM plan, or just figuring out the necessary infrastructure and best practices for a FAIR RDM. You can contact the MaRDI Help Desk for more information.

Tools for keeping your RDM plan up to date

There are existing tools to help researchers plan and fulfill an RDM plan. These can be used in small or individual projects, though they are mainly meant for large projects involving many researchers. We will briefly discuss the Research Data Management Organizer (RDMO), a web-based service widely used in German research institutions.

RDMO is free, open-source software developed as a DFG project and meant to run as a web service on your institution's infrastructure. Normally, a data manager (data steward) plays the administrator role and installs the RDMO software on a server accessible to the institution's researchers. The data manager creates questionnaires for handling the data of a specific project. Each questionnaire is available as an online form that researchers can fill in for every piece of data they create or gather. From the completed questionnaire, a standardized file can be exported that serves as metadata for the described data. Template questionnaires are available so that all relevant information is included (e.g. the DFG guidelines). The questionnaires can also be used to generate a standardized RDM plan for the project. Note that no data is actually stored or handled in the RDMO platform; RDMO and other RDM tools only handle metadata and help with organization. You still need to store and structure your research data, ensure data quality, apply licenses, manage data exchange, etc. These tasks are not automated by any RDM tool, and you remain in charge of implementing your RDM plan.

Some institutions require you to use this platform to prepare RDM plans for their research projects. One such case is the MATH+ excellence cluster at the Zuse Institute Berlin (ZIB). A version of the questionnaire that ZIB researchers use is available in [Quest1] (also published here in XML format; the actual RDMO instance is only available to ZIB users). Using such a system reduces the possibility of unintended omissions, ensures compatibility with the guidelines of the funding agency (DFG), and standardizes RDM plans across different projects.

MaRDI is also actively using RDMO as part of its task area devoted to interdisciplinary workflows. Workflows are important research data for projects involving researchers from different disciplines, which makes their management particularly challenging. MaRDI has prepared an RDMO questionnaire that can describe workflows in a MaRDI-standard way; you can have a look at [Quest2] (also published here in XML format). Additionally, MaRDI is developing MaRDMO, an RDMO plug-in that can be installed on the RDMO instance you use (a live demo will be available soon). This plug-in will add the ability to export the documented workflow metadata directly to the MaRDI knowledge graph and make it findable and accessible through the MaRDI portal. This provides a streamlined way to populate the MaRDI knowledge graph directly from researchers, using the same tool they used to create an RDM plan and manage their RD metadata.

 

Data Dates

In Conversation with Christoph Lehrenfeld

To learn about infrastructure projects within collaborative research centers, Christiane Görgen interviews Christoph Lehrenfeld from Göttingen Scientific Computing about new developments and best practices in research data management.

DMV Annual Meeting

For four days in September (25th-28th), the town of Ilmenau in Thuringia was populated by hundreds of mathematicians from various disciplines and regions across Germany who had traveled to the annual meeting of the Deutsche Mathematiker-Vereinigung (DMV). The event provided an excellent opportunity to present MaRDI and engage with the mathematical community. MaRDIans from nearly all our task areas were present. On the first day, we held our mini-symposium 'Towards a digital infrastructure for mathematical research', where speakers presented infrastructure services for mathematics that they have developed or are developing. At the MaRDI stall, we engaged in lively discussions with interested mathematicians, presented the latest version of our Algorithm Knowledge Graph, and distributed information material. The community responded positively to a checklist for technical peer review distributed at the stall, and in particular to the "What type of mathematician are you?" poster [https://www.mardi4nfdi.de/community/data-type]. We noticed an increase in scientists' awareness of research data management in mathematics, and greater recognition of MaRDI, compared to last year's DMV annual meeting in Berlin. It is encouraging to see that awareness regarding FAIR data is growing. Overall, we are pleased with our conference visit and the connections we made in Ilmenau.

 

Math meets Information Specialists Workshop

The first "Maths meets Information Specialists" workshop was held from October 9th to 11th as a noon-to-noon event at the Max Planck Institute for Mathematics in the Sciences in Leipzig. Organized by MaRDI, it brought together 20 professionals in diverse capacities, including librarians, data stewards, domain experts, and mathematicians. The workshop included talks and interactive elements such as hands-on sessions and barcamps. The focus was on key questions related to the unique characteristics of mathematical research data (for example, what metadata is minimally sufficient to identify a maths object?) and on existing services and the challenges faced by infrastructure facilities and service providers. This also included the topic of training, and the difficulty of raising awareness of RDM topics among mathematicians.

The "Maths meets Information Specialists" workshop provided valuable insights, discussions, and best practices for the challenges associated with mathematical research data management. Moving forward, the initiative aims to continue fostering collaboration, developing standards, and supporting training efforts to ensure the effective management of mathematical research data. Stay tuned for a follow-up event.

 

The 3rd MaRDI Annual Workshop in Berlin

In November, the MaRDI team met in Berlin for the third run of our annual workshop. Participants from every task area arrived on Tuesday, 28 November, to engage with collaborators from neighboring NFDI consortia with strong links to mathematical methods, namely NFDI4Biodiversity, KonsortSWD, and NFDI4DataScience. This was followed by a panel discussion with all speakers on topics of common interest, such as knowledge graphs, community building, and potential areas for interdisciplinary collaboration. After such an inspiring kick-off, the meeting gave ample opportunity for MaRDIans to discuss the status quo and plans for the second half of the five-year funding period. Four new services were proudly presented: a new FAIR file format for saving mathematical objects, now available in OSCAR; a first version of the scientific-computing knowledge graph; software solutions for open interfaces between different computational tools, such as algorithms in Python and Julia; and ways of annotating and visualizing TeX code in the Portal. These sparked lively discussions in subsequent barcamps on how to present MaRDI services at the upcoming interactive MaRDI station (for instance, as video games), how to embed the teaching of math infrastructure services in a curriculum (even if only for one hour per semester), and how to integrate these services into our MaRDI Portal. The meeting concluded with a clear focus for the next two years: bringing MaRDI services to our users and communities.

 

First NFDI Berlin-Brandenburg Network Meeting

MaRDI initiated the first NFDI Berlin-Brandenburg network meeting at the Weierstrass Institute (WIAS) in Berlin on October 12, 2023, with the aim of setting up a local network of all NFDI consortia located in the region. The main goal was to establish contacts between members of the different consortia and to identify common fields of interest. We focused particularly on the mutual benefit of cooperation between projects of consortia in different disciplines.

25 of the 27 NFDI consortia are present in the Berlin-Brandenburg region, involving more than 120 scientific and other institutions. 73 registered participants attended the meeting. Most belonged to one of 21 different NFDI consortia, whereas a few were not affiliated with any consortium but attended out of interest in learning about the NFDI and its consortia.

While similar NFDI local communities ("Stammtische") exist in a few other regions in Germany, a formal network of these communities is still elusive and is expected to be initiated by the NFDI headquarters in the future.

The workshop started with participants introducing themselves, getting to know each other, and brainstorming topics for the afternoon's World Café.

Among others, the following points were discussed:

  • Improving the acceptance and importance of FAIR principles and Research Data Management (RDM)

  • Role of open-source software for infrastructure technology and sustainability and longevity of services in the NFDI

  • Teaching RDM and literacy in the use of cross-disciplinary data types

  • Industry and International Collaborations

  • Ontologies and Knowledge Graphs (KG)

  • Importance of teaching the central topics of the NFDI such as RDM and KG, mainly to scientists in early phases of their career. Role of incentives for engagement in data management.

Overall, the atmosphere was open and constructive, focusing on bridging traditional gaps and fostering interdisciplinary cooperation. The meeting enabled us to find a common language and fields of interest, emphasizing the overarching aspects of the NFDI. We expect the venture to grow into future collaborations on topics central to the NFDI, meetings of smaller groups to discuss topics such as teaching RDM or ontologies, and biannual meetings at bigger forums. The main communication channel for information on future NFDI_BB activities will be the NFDI_BB mailing list:
https://www.listserv.dfn.de/sympa/info/nfdi_bb
Kindly register to be informed of future workshops and events.

 

NFDI4friends

Workshop on RDM in Modelling in Computer Science

This NFDIxCS workshop serves as a basis for discussing systematic approaches to dealing with research data. It aims to gather individuals willing to contribute to research data management; the result will be a manifesto for research data management in modeling research (in computer science). Date: March 11, 2024; submission deadline: January 8, 2024.

More information:

Data Management Plan Tool

The German Federation for Biological Data (GFBio) offers a Data Management Plan (DMP) Tool. It will help you find answers to important questions about the data management of your project, and create a structured PDF file from your entries. You can also get free personal DMP support from their experts.

More information:

Mailing list “Math and Data Forum”

The MaRDI mailing list “Math and Data Forum” offers news and insights into the realm of mathematical research data as well as a discussion forum for research data management practices and services in mathematics.

More information:

Special ITIT Issue Data Science and AI within the NFDI

Data Science and AI is an interdisciplinary field that is important for many NFDI consortia. This special issue of the journal "it - Information Technology" will focus on recent developments in Data Science and AI in the different consortia. Submission deadline: January 31, 2024.

More information:

Recommended Further Reading