Introduction

Electron microscopy (EM) of biological samples has changed dramatically since the development of the first electron microscopes in the 1930s. Over the past 10 years, the rapid advances in EM, especially in cryogenic electron microscopy (cryo-EM), cryogenic electron tomography (cryo-ET), volume electron microscopy (VEM) and correlative light and electron microscopy (CLEM), have enabled the collection of high-volume, high-complexity and high-resolution data at an ever-increasing speed (Hauser et al. 2017; Ando et al. 2018; Danev et al. 2019; Eisenstein 2023). Advances in cryo-EM were indeed recognised in 2017 with the Nobel Prize in Chemistry. Progress in sample preparation, instrumentation and algorithms for image processing has continually pushed the boundaries of what is possible (Kühlbrandt 2014; Bai et al. 2015; Schur 2019; Chua et al. 2022). Cryo-EM, cryo-ET, VEM and CLEM are now capable of resolving not only unprecedented, near-atomic-resolution structures of single proteins and molecular complexes (Bai et al. 2013; Nakane et al. 2020; Yip et al. 2020; Cao et al. 2021; Lazić et al. 2022) but also aggregates, self-assemblies, whole cells and the molecular sociology of cells at near-atomic detail (Beck and Baumeister 2016; Oikonomou and Jensen 2017; Guo et al. 2018; Shahmoradian et al. 2019; Bäuerlein and Baumeister 2021).

As EM embraces its “resolution revolution” (Kühlbrandt 2014), new challenges have emerged. Nowadays, EM experiments and subsequent data processing and analysis yield ever more data that require dedicated techniques, technologies and infrastructure suitable for data-intensive science. For example, specialised computing hardware and optimised software for data transfer have become critical to manage this “data deluge” (Bell et al. 2009). However, this explosion of data cannot be reduced to sheer volume alone: EM has shifted towards big data. The concept of “big data” has been given many definitions, but, in general, big data are associated with at least three attributes: high volume, high velocity and high variety (Assunção et al. 2015; Hilbert 2016). Alongside the resolution revolution, the big-data revolution necessitates new paradigms to process and manage data in order to optimise tasks and enable discovery. The ability to extract value from big data is key and depends on data analytics that turns data into insights (Jagadish et al. 2014; Assunção et al. 2015). This is essential; otherwise, instead of a big-data revolution rich in new opportunities, EM could face a big-data flood that prevents the maximum value from being realised from its data. As EM underwent its big-data revolution, it expanded its user base in a very short time from a small number of experts to a larger and more diverse community of users, making cryo-EM and cryo-ET mainstream methods for structural biology alongside X-ray crystallography and nuclear magnetic resonance spectroscopy.

A final challenge is that the big-data revolution in EM has unfolded at a time when there has been a strong drive to make scientific data findable, accessible, interoperable and reusable (FAIR) (Wilkinson et al. 2016). In essence, the FAIR principles require that research data and metadata, as well as the workflows, tools and repositories that they are associated with, foster knowledge discovery, experiment reproducibility and research impact by assisting humans and machines in their discovery, access, sharing and integration with other data and applications for processing, analysis and storage. The FAIR principles put specific emphasis on machine actionability, that is, enhancing the ability of computational systems to automatically find, access, interoperate and reuse data with no or minimal human intervention. This is especially relevant amid the current explosion of data in science. In addition, and complementary to the FAIR principles, the CARE Principles for Indigenous Data Governance (collective benefit, authority to control, responsibility and ethics) have been proposed to provide guidance to data producers, users, managers and publishers on the inclusion of Indigenous Peoples in data processes that strengthen Indigenous control for improved discovery, access, use, reuse and attribution in contemporary data landscapes (Carroll et al. 2020). As the FAIR and CARE principles permeate through all fields of research, addressing the challenges of big data in light of those principles provides a unique opportunity to change current practices and promote best practices amongst researchers and research facilities in the way that they handle EM data from capture through to storage, sharing and disposal.

Many fields of research – including particle physics (Britton and Lloyd 2014; Klimentov et al. 2015), astronomy (Kremer et al. 2017), chemical engineering (Chiang et al. 2017), climatology (Schnase et al. 2017), genomics (Palacio and López 2018) and synchrotron science (Wang et al. 2018) – have experienced their own big-data revolution and faced similar challenges to EM. The sheer volumes and complexity of data generated at the time of capture by electron microscopes and during data processing and analysis have posed new problems for, and created opportunities for, both researchers and research facilities that operate microscopes. Transferring, storing, sharing, processing and analysing data have become far from trivial in the era of big-data EM, but there are approaches available to harness the power of big data. In this review, we will outline what we understand about big data in the EM sciences, alongside the challenges and opportunities that big data present to both researchers and microscopy research facilities. This thematic paper discusses approaches available for big-data transfer, processing, analysis and management. In particular, advances in software, hardware and workflows – which, together, form the key infrastructure underpinning contemporary research facilities and electron microscopes – are outlined in the context of the expansion of the EM user base and the promotion of the FAIR and CARE principles. We will use our experience with big data in cryo-EM as an example case; however, those insights are also directly applicable to VEM and CLEM, which are addressed throughout the paper. Note, a workflow in this review is understood as a series of tasks or steps in a process to accomplish an objective (Ludäscher et al. 2009).

Big data in electron microscopy

The concept of “big data” originated in the business and information-technology sectors between the 1990s and the early 2010s (Kitchin 2014). It has since gained in popularity across business circles and media, as well as in the scientific community. This has sparked many debates on its precise definition, so much so that it has become a wide-ranging term for which a unified definition across all business sectors and scientific disciplines has been a moving target. There is nevertheless a consensus that big data are characterised by three properties often dubbed “the three V’s”: volume, variety and velocity (Assunção et al. 2015; Hilbert 2016). Volume refers to the amount of data. Variety corresponds to the range of data types (such as structured and unstructured data; data formats; small and large files) and sources. Velocity refers to the speed of growth of data or the speed at which data arrive or are produced at various stages in workflows (upstream at data collection or creation, or downstream during data processing or analysis). Importantly, what data volume, variety and velocity encompass in the scientific community in general, and in microscopy in particular, may differ from traditional views in information technology (IT).

The properties of big data

While volume is often considered the key and most evident feature of big data, what constitutes large data volumes may vary with institution, discipline, technique, image-acquisition settings or the intrinsic ability of the infrastructure (hardware, software, workflow) to support an instrument. For example, the scalability (or lack thereof) of existing storage hardware may be an asset (or a hindrance) in dealing with increasing data volumes. What separates “big and challenging” from “small and manageable” data may consequently be arbitrary. Big data are thus not defined by specific size or speed metrics but rather by the fact that they cannot be managed by traditional processes and tools due to their size, velocity or variety (Miele and Shockley 2013).

Variety in big data encompasses the diversity in data types (models and formats) and sources. This includes the possible incompatibility between data formats or format versions and the lack of interoperability of data formats and software applications, as well as the need to maintain and support various versions of software packages or tools associated with the data. In science, the heterogeneity of data sources may be broadened beyond the common definition in IT, which limits the variety of sources to elements such as texts, audio, video, web pages, social media and reports (Miele and Shockley 2013). A range of diverse factors contribute to the variety of sources in science, making datasets complex and challenging to combine and manage (Chiang et al. 2017; Richarz 2020). Examples include different techniques (e.g. microscopy, mass spectrometry and omics), inconsistent data models and formats (e.g. lists of scalar values, one- or multi-dimensional arrays), inconsistent conventions used for metadata for data annotation or description, inconsistent modalities for accessing data, the difficulty of integrating or collating data scattered across geographical locations or storage systems (including instrument or operator log books, laboratory notebooks and electronic laboratory notebooks), and different requirements in data management (for example, regarding data governance).

Finally, there are also several dimensions to data velocity. The reception of incoming data and the creation of outgoing data at any point along a workflow occur at different rates and can follow four modes: (1) in batches (data points are grouped together and released for processing at regular time intervals); (2) in near-real time (at small regular time intervals); (3) in real time (continuous input, processing and output of data in a steady flow while retaining the ability to stop or adjust processing to adapt to changes such as incorrect, artefactual or biased input or output data); or (4) in stream (data flow through processing regardless of data quality). Some steps in a workflow may have different velocity requirements: while some software applications may be compatible with more than one of these four modes, other applications may run in one specific mode only. Velocity also poses challenges to data transfer: moving data rapidly through a workflow, or moving data that grow rapidly, may require changes in the software or network infrastructure used to transfer big data.
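
To make these modes concrete, the minimal Python sketch below polls a hypothetical acquisition directory and processes whatever files have arrived since the last pass; with a short polling interval it approximates near-real-time operation, whereas a long interval turns the same loop into batch processing. The directory, file pattern and processing step are placeholders.

```python
import time
from pathlib import Path

WATCH_DIR = Path("/data/incoming")   # hypothetical directory filled by the acquisition software
POLL_INTERVAL = 30                   # seconds; a much longer interval corresponds to batch-mode processing
seen = set()

def process(files):
    """Placeholder for an actual processing step (e.g. quality checks on new images)."""
    print(f"processing {len(files)} new file(s)")

while True:
    new_files = [f for f in WATCH_DIR.glob("*.tiff") if f not in seen]
    if new_files:
        process(new_files)       # handle everything that arrived since the last pass
        seen.update(new_files)
    time.sleep(POLL_INTERVAL)
```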

Besides the three V’s (i.e. volume, variety and velocity), a range of other attributes of big data have been discussed in the literature in relation to specific challenges for the management (such as data transfer, storage, sharing and archiving), analysis or visualisation of big data (Miele and Shockley 2013; Assunção et al. 2015; Gandomi and Haider 2015; Khan et al. 2019). Two of those extra properties are often cited as core properties alongside the three V’s: data veracity and data value. First, data veracity relates to the trustworthiness of data and to the degree of uncertainty and inaccuracy associated with data. High data quality underlies the reliability of big data (Mehnert et al. 2019). Although some factors that alter the quality of data, and as a result their reliability, are unpredictable or difficult to predict, specific data-cleansing procedures ensure that datasets can be trusted by removing or fixing incorrect, incomplete or corrupted data, as well as duplicated or improperly formatted data (Miele and Shockley 2013). While data cleansing is not specific to big data, traditional methods used so far by researchers may not scale up. Secondly, data value is often considered amongst the most important properties of big data. It corresponds to the usefulness, potential or adequacy of the data to contribute to the research project or the business (Richarz 2020). Value also takes into account financial considerations, such as the cost of collecting, analysing, storing and archiving data (Khan et al. 2018; Richarz 2020). A range of diverse factors have a direct and substantial impact on value, including: data governance; best practices, ethical research practices, community standards and conventions such as the FAIR and CARE principles (Wilkinson et al. 2016; Carroll et al. 2020); but also what researchers, research funders and publishers may consider valuable, that is, sound, rigorous, reproducible, publishable or fundable research. In the case of data governance, policies and procedures that determine and maintain the level of availability, accessibility, quality, integrity and security of data from their creation to their disposal or archiving contribute to the valuation of data. Therefore, unlike volume, variety, velocity and veracity, the value attributed to data can change over time and across research organisations and disciplines.
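
As a simple illustration of such cleansing applied to tabular acquisition records, the sketch below, assuming the pandas library and hypothetical column names, removes duplicated, incomplete and physically implausible entries before further use.

```python
import pandas as pd

# Hypothetical acquisition log: one row per recorded image.
df = pd.read_csv("acquisition_log.csv")

# Remove exact duplicates (e.g. images registered twice by the acquisition software).
df = df.drop_duplicates(subset="image_id")

# Drop incomplete records that are missing essential fields.
df = df.dropna(subset=["image_id", "pixel_size_angstrom", "defocus_um"])

# Discard physically implausible values rather than silently keeping them.
df = df[(df["pixel_size_angstrom"] > 0) & (df["defocus_um"].between(-10, 10))]

df.to_csv("acquisition_log_clean.csv", index=False)
```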

The ten V’s of big data in electron microscopy

As science embraces big data, a range of disciplines have explored what big data mean to them, in terms of both challenges and opportunities, for example in medicine (Salathé 2016), chemical engineering (Chiang et al. 2017), ecology (Farley et al. 2018) and toxicology (Richarz 2020). The focus on data volume, variety, velocity and other properties has not been uniform across fields, as the relevance and urgency of the challenges posed by those properties are perceived differently. Drawing on the above literature and on our project findings, Table 1 lists the ten attributes of big data that we consider most relevant to EM, namely: volume, variety, velocity, veracity, value, visibility, visualisation, vocabulary, variability and volatility. Of special note, this selection is somewhat arbitrary, and additional properties of big data such as validity (suitability of data to a specific model or application), viscosity (data complexity), verification (data authenticity and desired outcome) and vulnerability (data security) (Khan et al. 2019) also apply to EM but were not included. Importantly, despite the challenges presented by the ten V’s of big data in EM, they also represent multiple opportunities across a range of domains (Fig. 1): scientific discovery (data volume, velocity, value, visualisation and variability), technological development in hardware and software (data variety, velocity, veracity and visualisation), optimised use of instruments (data velocity and visualisation), enhanced research impact (data volume, value, veracity, visibility, vocabulary and volatility), implementation of best practices in data management (data variety, veracity, value, visibility, vocabulary and volatility) and the operationalisation of the FAIR and CARE principles (data value, visibility and vocabulary). Indeed, none of these challenges and opportunities is new. The big-data era of EM is at the confluence of technological advances in instrumentation, developments in hardware, software and algorithms for data processing and analysis, and the widespread promotion of best practices in data management across research to put the FAIR and CARE principles into practice. Consequently, workflows, tools and methods that traditionally worked well at smaller scales to manage and analyse EM data (until about 2015) are currently being urgently revisited to address issues related to big data while considering the evolution of the research environment and the efficient use of instruments. In this review, a range of approaches to the transfer, processing, analysis and management of EM big data are outlined to demonstrate that suitable approaches are available or being developed to address the challenges posed by the ten V’s of big data in EM and exploit the opportunities created.

Table 1 The ten V’s of big data in electron microscopy with their respective challenges versus opportunities
Fig. 1 Illustration of the opportunities offered by the ten V’s of big data in electron microscopy

Big-data electron microscopy at research facilities

The advent of big data in EM has certainly impacted researchers. They are often overwhelmed by the vast amount of digital information collected in a single acquisition session. Improved, highly sensitive cameras, the collection of large volumes of data and the high-throughput collection of numerous single-image planes are a few examples of factors contributing to this big-data challenge. In parallel, research facilities, institutes and centres that host big-data-producing electron microscopes have accompanied this revolution to maintain their services to their research communities. Given the high level of expertise and the associated costs for the operation and maintenance of these advanced microscopes, EM as a high-end imaging and analytical tool is mostly available at core facilities or dedicated centres at the institutional level that support research communities rather than a single research group. Facilities design or co-design workflows and practical approaches associated with microscopes (e.g. for data storage, transfer, processing and quality control), and are often involved in the development of the software and hardware infrastructure underpinning those workflows. Furthermore, they also provide expert advice on a range of commercially available software solutions (e.g. for data analysis and visualisation), promote best practices in research data management and offer training for instrument scientists and researchers at varied levels (Braet and Ratinac 2007; Alewijnse et al. 2017; Mills 2021). Facilities contribute to standardising workflows, fostering innovations and improving EM accessibility by both lowering the barrier to entry for non-expert users and optimising workflows to maximise the availability of instruments (Zimanyi et al. 2022). They also keep researchers at the forefront of technological development. Research infrastructure facilities – preferably through a centralised and coordinated approach – therefore play a critical role in the overall performance of microscopes, that is, in their reliability, their efficient and optimal use and the quality of their output.

The big-data revolution at microscopy research facilities

A recent report by Poger et al. investigated how microscopy research facilities in Australia, France, Germany, the Netherlands and the USA adapted to the big-data revolution in an environment where research data were increasingly required to follow the FAIR principles (Poger et al. 2021). Seventeen facilities in total were interviewed. All facilities were either located at universities or primarily supported academic researchers. The microscopy techniques cited as creating or likely to create problems included scanning EM (SEM), transmission EM (TEM), scanning transmission EM (STEM), focused ion beam systems (FIB) and derived techniques such as cryogenic EM (cryo-EM), focused ion beam–scanning EM (FIB-SEM) and correlative light and EM (CLEM). Some facilities indicated that a single cryo-TEM experiment could generate in excess of 1–2 TB of data per day. The total volume of data generated at some facilities amounted to between 500 TB and 2 PB per year. The report reviewed workflows, tools, methods, procedures and the whole underlying infrastructure used for data transfer, storage and overall data management, as well as data processing. It was found that, in general, the challenges posed by the vast amounts of data, and the approaches adopted to deal with them, were shared across all facilities, regardless of their geographical locations, sizes and levels of specialisation. Note, some facilities operated a diverse range of instruments and supported various research fields, whereas other facilities were specialised in biological cryo-EM. In general, all facilities indicated that data processing had shifted towards high-end workstations or high-performance computing in order to deal with the high volume and velocity of data.

While the above-mentioned report mostly highlighted issues associated with big data, it demonstrated that the challenges of big data in EM could be tackled by harnessing suitable tools and adopting appropriate methods and practices to transfer, analyse and manage data. Some of them are described in this review. The report showed that researchers and facilities needed to integrate approaches developed in e-science (alternatively called e-research in Australasia and cyberinfrastructure in the USA) to ensure the sustainability of big-data EM and the promotion and adoption of the FAIR principles across facilities and researchers. Some of the findings and conclusions on data transfer and management are outlined below.

Data transfer at microscopy research facilities

Data transfer was a typical illustration of an area that had been impacted by the big-data revolution, mostly by the creation of large volumes of data at high rates (Poger et al. 2021). The report focused on the transfer of data between storage servers at facilities and various other end points: instruments, remote and local computing capabilities, end users’ computers, institutional repositories and partner institutions (Fig. 2). Data transfer often consisted of routine tasks that required human intervention to initiate them or to make ad hoc adjustments. Sometimes, data transfer was via “sneakernet”, that is, using flash drives or emails. The tools or settings used for transfer could result in bottlenecks in data workflows that limited the capacity to process large data volumes or high-velocity data following acquisition. In particular, some facilities cited limitations in their capacity to process or pre-process data on-the-fly, i.e. in real time or near-real time as data were generated from instruments. This was described as especially important for on-the-fly monitoring of data quality as it ensured that the time dedicated to data collection using instruments at facilities led to high-quality data. Optimal microscope usage was therefore coupled to appropriate workflows to move data. Efficient data transfer was seen as necessary to automate data collection and to develop high-throughput EM. The report noted limited knowledge of, or awareness of, network performance across the majority of the facilities interviewed. This impacted the time taken to transfer data, the reliability, reproducibility and predictability of the tools chosen to transfer data, and the ability to optimise and automate workflows.

Fig. 2 Schematic representation of the different workflows involved in data transfer at a microscopy research facility. a From instrument to facility data storage; b and c from facility storage to computing capabilities; d from facility storage to researchers’ individual workstations; e from facility storage to long-term storage or archive; f from researchers’ workstations to long-term storage or archive

Data management at microscopy research facilities

Research data management, especially data storage (retention and disposal) and description (metadata collection), was an emerging challenge in the context of big data and compliance with the FAIR principles. In general, most facilities showed some level of understanding of data management. Best practices in research data management were considered important to ensure that the quality and the integrity of research data were maintained throughout the data lifecycle (collection, organisation, storage, preservation and sharing) and that legal, regulatory, ethical, governance and funding requirements were met. All the facilities applied some level of local data management (in particular for data storage), but general guidelines, especially on data retention and disposal, were needed given the increasing volumes of data generated by instruments. Many facilities stated that they were using, had used, or had trialled a number of tools to assist them in data and image management, including OMERO (Allan et al. 2012; Burel et al. 2015; Li et al. 2016), XNAT (Marcus et al. 2007), 4CeeD (Nguyen et al. 2017) and MyTardis (Androulakis et al. 2008; Meyer et al. 2014). While all facilities acknowledged how useful those tools could be, in particular in the standardisation of workflows and the promotion of the FAIR principles through the collection of metadata, a large number of facilities that operated a broad and diverse range of instruments indicated that finding a solution that would be suitable for all or most instruments was in fact difficult. For example, the file formats supported by OMERO were limited to those found in the life sciences. Facilities often indicated that proprietary formats were converted to standard, open-source file formats, so data interoperability was in general not cited as a major hurdle for data management and FAIR data. The number of file formats used at a facility could, however, be challenging, especially at multidisciplinary facilities (some facilities managed over 50 formats). Most facilities commented that data-processing and data-analysis tools also led to a multiplication of file formats. Metadata collection at all the facilities concerned instrument metadata only, that is, metadata generated by the instrument alongside the data (embedded in data files or in separate files). In contrast, metadata that include general information describing the data (such as the title of the dataset and the names of the researchers involved), and that are essential to enrich data to a higher FAIR state, were never collected. The nature of instrument metadata and the range of metadata captured and stored could vary across instruments, acquisition software and file formats within a facility and between facilities. Some facilities noted that many proprietary formats contained some form of embedded metadata (such as energies and scan time) but the conversion to standard or open-source formats (e.g. TIFF and BMP) often resulted in the loss of those metadata.

In general, whether for data transfer or data management, the report noted growing international awareness of, and calls for, standardising the way that data are handled at microscopy research facilities, with the ultimate goal of making data accessible to all for scientific purposes and reproducibility.

Transfer of big data

Efficient and sustainable data transfer lies at the core of the operations of microscopy facilities. Data transfer is a critical component in the automation of data collection and processing workflows and overall data orchestration through to long-term storage, as well as in the development of high-throughput EM (Suloway et al. 2005; Scherer et al. 2014; Ding et al. 2015; Cheng et al. 2021). The efficiency and sustainability of data transfer rest upon three fundamental properties of the network infrastructure: reliability, predictability and speed. In the case of big data, uncertainties and shortcomings in these three properties have only become more acute. Inconsistent transfer of big data over a complex network infrastructure may reveal existing weaknesses in components of the infrastructure (hardware, software or workflows) that were either hidden or manageable before the big-data revolution. Fortunately, a range of tools are available to ensure fast, reliable and predictable data transfer. They are introduced below and listed in Table 2.

Table 2 Tools that can assist in dealing with challenges posed by big data in cryo-electron microscopy (cryo-EM) and tomography (cryo-ET)

Achieving fast data transfer

The high volume, variety or velocity of data collected requires rapid transfer, at speeds of at least 10 Gb/s, to data-processing computers and data stores. Hence, fibre-optic cables are preferred over copper-based connections because they provide more bandwidth than copper cables of the same diameter (Sader et al. 2020; Mills 2021). Traditional protocols and tools for data transfer such as SFTP, SCP, Robocopy, rclone and rsync are widely used but have limitations when it comes to achieving high-speed data throughput on fast networks (10 Gb/s and above). These tools are prone to failure or reduced efficiency in cases of high network latency, loss or congestion. Network latency is the delay observed when transferring data between two end points. Several factors contribute to network latency, including the distance between the two end points (that is, the longer the distance, the higher the latency), and the nature and configuration of the components of the network infrastructure (e.g. security measures). To overcome such limitations and to standardise data transfer across operating systems, higher-level tools are required. Amongst those tools is Globus (www.globus.org), a service developed for researchers and research organisations that provides fast, reliable, secure and high-assurance transfer of data (Foster 2011; Allen et al. 2012). Unlike tools such as rsync, SCP and SFTP, Globus is less sensitive to network-performance fluctuations (e.g. glitches and high latency) and checks the integrity and completeness of data transfer. It can be used to transfer data between a range of storage devices and systems, such as personal computers, data stores, compute servers (e.g. high-performance computing facilities) and cloud stores (including commercial solutions such as Dropbox, Amazon Web Services, Microsoft OneDrive, Microsoft Azure, IBM Cloud, Google Cloud and Google Drive). Globus was recently tested across several university-based microscopy core facilities in Australia (van Schyndel et al. 2021). It was shown to be especially suitable for transferring large data volumes across long and short distances, which helped streamline workflows for data collection, processing and analysis.
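
For scripted, repeatable transfers, Globus also provides a Python SDK (globus_sdk). The sketch below is a minimal example under the assumption of a native-app client ID registered with Globus and the UUIDs of two accessible endpoints; the paths and label are placeholders.

```python
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"      # registered at developers.globus.org
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"        # e.g. the facility storage server
DST_ENDPOINT = "DESTINATION-ENDPOINT-UUID"   # e.g. an HPC data-transfer node

# Interactive login flow: obtain a transfer token for the current user.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Authorisation code: ").strip())
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))

# Describe the transfer: checksum verification guards the integrity of the data.
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                label="session_001 to HPC", sync_level="checksum")
tdata.add_item("/instrument/session_001/", "/scratch/session_001/", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted Globus task:", task["task_id"])
```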

Monitoring network performance

In a typical workflow that includes storage, processing and analysis of data, data are transferred multiple times as illustrated in Fig. 2:

  • from instrument to facility data storage (transfer denoted as a in Fig. 2): on-instrument storage is prioritised for speed so that data can be captured quickly during measurements. However, the data need to be transferred regularly to larger, short- or mid-term storage because on-instrument storage can fill up within hours;

  • from facility storage to computing capabilities (transfers b and c): data are transferred to a local, in-house computing capability at the microscopy facility (a computer, a local computing server or a virtual desktop) (b) or to an external high-performance computing (HPC) capability that may be located within the same institution as the microscopy facility or elsewhere (c). While near-real-time or real-time data analysis may be performed to assess data quality and optimise instrument use, in-depth data processing and analysis often necessitate appropriate HPC resources;

  • from facility storage to researchers’ individual workstations (transfer d): researchers may choose to store a copy of their data on their own computers. Data processing and analysis on these machines require high-end workstations and are practical only if the datasets are not too large;

  • from facility storage or researchers’ workstations to long-term storage or archive (transfers e and f): it is neither practical nor possible to store the growing data volumes generated by instruments on the facility storage servers or on researchers’ computers. Once data have been processed and analysed and regular access to the data is no longer required, the data can be kept in an appropriate and sustainable form of long-term storage or archived, which can be local, remote or in the cloud. Note, data can be retrieved from the long-term storage or archive.

Figure 2 shows multiple instances of data transfer between end points that may be geographically close or distant. However, it hides the complexity of the underpinning global network infrastructure that crosses the boundaries of facilities, organisations and smaller network components. Behind the apparent simplicity of the task from a user’s perspective, transferring microscopy data from one point to another implies that data may transit via a series of nodes along a network path. Indeed, the schematic in Fig. 3 illustrates that computer networks at universities and other research organisations, and across cities and countries, are complex, multi-component systems sometimes consisting of subnetworks. For example, a network at a university campus may comprise subnetworks that vary in technical specifications such as the types of cables used (for example, fibre-optic cables and copper cables). Note, Fig. 3 represents an example of how end points may be interconnected. In some countries, there may be no regional network or no regional research and education network. A common feature across all countries is the presence of a national research and education network (NREN). NRENs are specialised internet service providers that support the needs of their national research and education communities through a high-speed backbone network (for example, AARNet in Australia, Internet2 and ESnet in the USA, Jisc in the UK and RENATER in France). Access to NREN infrastructure is vital for data-intensive facilities, such as microscopy research facilities and data-processing and storage facilities, to maintain high-speed flows of large data volumes between geographically dispersed points, which is especially required in near-real-time or real-time applications. This is permitted by the inherent nature of the backbone networks provided by NRENs, which are, by design, over-dimensioned with respect to average network usage. This large amount of spare capacity (“white space”) enables NREN networks to deal with spikes of high demand while minimising loss and congestion during data transfer. This is in contrast with commercial networks, which aim to keep white space low, i.e. to minimise unused network capacity.

Fig. 3 Schematic representation of a typical multi-component network composed of interlinked national, international, institutional or local networks or subnetworks

However fast data transfer may be, a reliable and predictable network infrastructure is essential. Instabilities, discrepancies or faults in the behaviour of interconnected networks can have various potential sources. Identifying problems and recording when they happen in such an environment is a challenge for research organisations and NRENs. Network reliability and predictability are measured through network performance, which is itself a composite property consisting of qualitative and quantitative properties that collectively characterise the end user’s experience of the service supported by a network at a point in time or over a period of time (Myers and Poger 2022). Network baselining is the monitoring and measuring of the performance of a network over time. Benchmarking the performance of a network is comparing it with that of another network, an industry standard or any other external reference. Benchmarking determines whether a network behaves normally (within a range of acceptable parameter values) or as expected. Network baselining and benchmarking are essential to monitor and evaluate the reliability and predictability of the performance of a network. They can be used by NRENs, IT specialists at universities and research organisations, and research facilities and researchers to set expectations regarding the capacity for a network to transfer data. Importantly, network baselining is an iterative process: whenever a component of a network infrastructure changes (e.g. hardware, software, software version in any part of the network from end to end), it is necessary to repeat network performance baselining to compare with the previous performance baseline and ensure that the quality of the service supported by the network is maintained. A range of tools are available to observe, record, understand and predict network performance between end points (e.g. instruments, data stores and compute facilities) within an organisation or between different organisations. Those tools include free programs such as perfSONAR (www.perfsonar.net) and RIPE Atlas (atlas.ripe.net), and commercial solutions such as SolarWinds (www.solarwinds.com). The perfSONAR tool, for example, is a dedicated open-source, modular toolkit for network performance monitoring to support research and education (Hanemann et al. 2005). It contains a suite of tools and services (e.g. ping, iperf3, traceroute) that measure the capacity and the quality of a network as well as the consistency of network behaviour in real time across an entire end-to-end network path. Measurements can be recorded over time for retrospective analysis. It detects and diagnoses issues or anomalies, and facilitates the collection and sharing of network performance information. This makes perfSONAR very well suited to network baselining and benchmarking. It allows one to predict the performance of a network and to determine whether a network has the capacity to meet users’ expectations regarding reliable data transfer. This is of great importance for large volumes of high-velocity data. perfSONAR is therefore useful to end users and microscopy facilities that rely heavily upon well-performing networks for big-data transfer within and between institutions (Myers and Poger 2022).
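
As an illustration of how a facility might record such measurements between two end points, the sketch below repeatedly runs iperf3 (one of the measurement tools used by perfSONAR) against a test server and logs the achieved throughput over time to build a simple baseline; the server name and measurement interval are hypothetical.

```python
import csv
import datetime
import json
import subprocess
import time

TEST_SERVER = "perf-test.example.org"   # hypothetical iperf3 server at the remote end point
LOGFILE = "throughput_baseline.csv"

def measure_throughput_gbps(server, duration_s=10):
    """Run one iperf3 test and return the mean received throughput in Gb/s."""
    result = subprocess.run(
        ["iperf3", "-c", server, "-t", str(duration_s), "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9

with open(LOGFILE, "a", newline="") as fh:
    writer = csv.writer(fh)
    while True:
        gbps = measure_throughput_gbps(TEST_SERVER)
        writer.writerow([datetime.datetime.now().isoformat(), f"{gbps:.2f}"])
        fh.flush()                      # keep the baseline on disk as it grows
        time.sleep(3600)                # repeat hourly
```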

Processing and analysis of big data

A single EM experiment can now amount to terabytes of data. Two major factors have contributed to this. First, advances in direct electron detection have led to an increase in the amount of data collected in the form of high-resolution movies of up to 100 frames per imaging area (Baldwin et al. 2018). Secondly, approaches to automate pipelines for data collection and processing have been developed, for example with Leginon (Suloway et al. 2005; Cheng et al. 2021), SerialEM (Mastronarde 2005), Appion (Lander et al. 2009), Scipion (de la Rosa-Trevín et al. 2016), Focus (Biyani et al. 2017), CryoFLARE (Schenk et al. 2020) and the Caltech Tomography Database and automatic image processing pipeline (Ding et al. 2015). Such progress in how EM data are processed and analysed has been underpinned by changes in techniques, technologies and infrastructure, in particular in the area of artificial intelligence (AI), computing and workflow automation.

EM big data and artificial intelligence

AI, in the form of machine- and deep-learning methods, is increasingly used in data processing and analysis to extract meaningful information from large datasets. Examples include improving the resolution and sensitivity of electron microscopes (Zhou et al. 2021), reducing noise in datasets (Bepler et al. 2020) and facilitating and automating pattern recognition such as particle selection (Voss et al. 2009; Wang et al. 2016; Sanchez-Garcia et al. 2021), macromolecule identification (Moebel et al. 2021; Uddin et al. 2021; Che et al. 2018) and cell counting and classification (Liu et al. 2019; Zaritsky et al. 2021) (Table 2). Of particular interest is the European-Union-funded project AI4LIFE (https://ai4life.eurobioimaging.eu), which started in 2022 and aims to develop and make readily accessible methods based on AI for bioimage analysis, in particular microscopy image analysis. Note, it is commonly believed that algorithms can be trained better with more data and consequently provide more accurate results. However, this is neither necessarily true nor always possible with big data. In particular, several traditional machine-learning algorithms were designed for smaller datasets, assuming that entire datasets could be stored in memory or that they were available for processing at the time of training, which is often impossible with big data (L’Heureux et al. 2017).
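
A common workaround is out-of-core (incremental) learning, in which the model is updated chunk by chunk so that the full dataset never needs to reside in memory. The sketch below illustrates the idea with scikit-learn’s partial_fit interface; the randomly generated chunks stand in for features and labels streamed from disk.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()            # supports incremental updates via partial_fit
classes = np.array([0, 1])         # e.g. "particle" vs "background"

def chunks_from_disk(n_chunks=100, chunk_size=512, n_features=256):
    """Stand-in generator for labelled feature chunks read from disk."""
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = rng.integers(0, 2, size=chunk_size)
        yield X, y

# The model is updated one chunk at a time; memory use is bounded by the chunk size.
for X_chunk, y_chunk in chunks_from_disk():
    model.partial_fit(X_chunk, y_chunk, classes=classes)
```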

Computing EM big data

Advances in high-performance computing (HPC) using both central processing units (CPUs) and graphics processing units (GPUs), concomitant with the development of better algorithms, particularly those amenable to parallelisation, have been critical to harnessing big data (Baldwin et al. 2018). They have been especially exploited in a range of tools that have helped researchers leverage the big-data revolution in cryogenic electron microscopy and tomography. Such applications include RELION (Scheres 2012; Kimanius et al. 2021), cryoSPARC (Punjani et al. 2017), cisTEM (Grant et al. 2018), EMAN2 (Chen et al. 2019) and emClarity (Himes and Zhang 2018). Some of these are continually updated to facilitate and accelerate data processing and analysis. For example, RELION, which is widely used in cryo-EM and cryo-ET, has been regularly updated to integrate new algorithms and optimisations to improve efficiency and expand its range of functionalities (Scheres 2012; Bharat et al. 2015; Scheres 2015; Scheres 2016; He and Scheres 2017; Zivanov et al. 2020). In addition, a pipeline approach in RELION based on standardised and (semi-)automated procedures for structure determination has been developed to allow on-the-fly processing of cryo-EM data (Fernandez-Leiro and Scheres 2017; Kimanius et al. 2021). Finally, RELION supports GPU acceleration (Kimanius et al. 2016) and CPU vector acceleration (Zivanov et al. 2018) to reduce the computational load. Although programs such as RELION, cryoSPARC and cisTEM tend to scale well to larger resources thanks to CPU or GPU parallelisation, they still present high computational costs for researchers, especially for large datasets, which require high-end workstations or computer clusters. Therefore, the bottleneck in big-data processing and analysis does not generally lie in the unsuitability or limited scalability of existing software to deal with big data, but in the underlying compute resources and workflows.
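
For researchers with access to a conventional cluster, such computations are typically submitted through a batch scheduler. The sketch below generates and submits a SLURM job requesting GPUs; the resource requests are indicative only, and the processing command on the last line of the script is a placeholder rather than a prescribed invocation of any particular package.

```python
import subprocess
from textwrap import dedent

# Indicative SLURM script for a GPU-accelerated processing job; adjust the
# resource requests to the cluster and dataset at hand.
script = dedent("""\
    #!/bin/bash
    #SBATCH --job-name=cryoem-processing
    #SBATCH --ntasks=5
    #SBATCH --cpus-per-task=8
    #SBATCH --gres=gpu:2
    #SBATCH --time=24:00:00
    srun my_processing_program --input particles.star --output run01/
""")

with open("process.sbatch", "w") as fh:
    fh.write(script)

subprocess.run(["sbatch", "process.sbatch"], check=True)   # hand the job to the scheduler
```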

Large computational resources (high-end workstations or computer clusters such as high-performance computers) may be too expensive, inaccessible or unavailable to many laboratories and researchers. In addition, HPC capabilities need to be managed and regularly expanded to meet growing compute and storage demands. Cloud computing using commercial solutions such as Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform has been suggested as a viable, flexible and cost-effective alternative to traditional, academic computing environments (Cianfrocco and Leschziner 2015; Castaño-Díez 2017; Cuenca-Alba et al. 2017). The limited command-line literacy of new EM users or novice researchers can be another major barrier to big-data EM because such users require training to use a Linux/Unix environment and to deal with an HPC environment, at least for job submission and management. To address this obstacle, user-friendly alternatives with graphical user interfaces (GUIs) as well as integrated and standardised workflows have been developed. GUI-based solutions have gained in popularity as they enable any user to exploit the power of an HPC infrastructure (supercomputers, cloud) for data processing and analysis in a simple and flexible way through a web-based virtual desktop that can be accessed from anywhere. They are known by various names: virtual desktop infrastructure, virtual computing environments, science gateways (mainly in the USA), virtual research environments (mainly in Europe), virtual laboratories (mainly in Australia), “collaboratory” (Wulf 1993) or, more generically, research platforms, portals or workbenches amongst other denominations. Herein, we refer to these collectively as virtual research environments (VREs). In general, VREs provide convenient and secure access to CPU or GPU computing at no cost or at a limited cost (such as on a pay-as-you-go basis), thereby avoiding investments in hardware and long-term maintenance for research laboratories. For research organisations, they are cost-effective as their flexibility and versatility allow them to serve diverse user communities. Examples of VREs for EM data processing include ScipionCloud (Cuenca-Alba et al. 2017), COSMIC2 in the USA (Cianfrocco et al. 2017) and, in Australia, the Electron Microscopy Data-Processing Portal (van Schyndel 2022), the Characterisation Virtual Laboratory (imagingtools.au/characterisation-virtual-laboratory), the Virtual Desktop Service (desktop.rc.nectar.org.au) and the Australian Research Environment (nci.org.au/our-services/data-services) (Table 2). While some VREs such as ScipionCloud are agnostic in terms of the underpinning computing infrastructure, others have been developed to be deployed on commercial, institutional or national infrastructure, which may restrict their accessibility and the general awareness of their existence or availability in the research community. For example, the Electron Microscopy Data-Processing Portal and the Australian Research Environment use national research resources funded by the Australian government (the ARDC Nectar Research Cloud and the National Computational Infrastructure, respectively). This means that they are available to the Australian research community only. Regardless of their technical details, successful and impactful VREs, however, require adequate ongoing support (e.g. funding and skilled professionals) to ensure long-term sustainability, significant community adoption, persistent online presence, technological relevance and compatibility with advances in technologies and standards in computing, data management and cybersecurity (Calyam et al. 2021). Importantly, the high degree of user-friendliness of VREs should not be at the expense of the reproducibility of the results and the repeatability of the tasks that they facilitate. While VREs can lower the entry barrier to advanced CPU- or GPU-based data processing and analysis methods, critical information on the programs or automated pipelines available in VREs should be easily accessible to users. For example, the versions of the programs used, which options were turned on or off during software installation, the nature of the intermediate steps executed between the initial (input) and final (output) data, the treatment of outliers and missing values and the full set of model parameters used during processing or analysis are crucial factors that determine the reproducibility and repeatability of an experiment (Baker 2016; Stodden et al. 2016; Taubert and Bucker 2017). Funding bodies and research-infrastructure organisations have encouraged and enabled the development and deployment of VREs in multiple disciplines. To ensure long-term viability, there is clearly a need for consolidation, interoperability and a better-coordinated approach in the development and deployment of VREs, including in the establishment of common policies, governance strategies and best practices (for example, on underlying architectures, interfaces, data access and community building). At the international level, the Research Data Alliance has thus established the VRE Interest Group (rd-alliance.org/groups/vre-ig.html) to explore those challenges and to provide expert ongoing guidance in this ever-changing EM computing world.
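
One lightweight way to preserve this information is to write a provenance record alongside each job’s outputs. The sketch below is illustrative only: the program name, version flag and parameters are hypothetical stand-ins for whatever a given VRE or pipeline actually runs.

```python
import datetime
import json
import platform
import subprocess

def record_provenance(program, version_flag, parameters, outfile):
    """Write a minimal provenance record (program version, parameters, host, time) as JSON."""
    version = subprocess.run(
        [program, version_flag], capture_output=True, text=True
    ).stdout.strip()
    record = {
        "program": program,
        "version": version,
        "parameters": parameters,
        "host": platform.node(),
        "timestamp": datetime.datetime.now().astimezone().isoformat(),
    }
    with open(outfile, "w") as fh:
        json.dump(record, fh, indent=2)

# Hypothetical usage: record how a processing job was run.
record_provenance("my_processing_program", "--version",
                  {"iterations": 25, "classes": 50}, "provenance.json")
```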

Alongside VREs are workflows that integrate cloud computing seamlessly for users. For example, cryoem-cloud-tools moves cryo-EM analysis routines and atomic-model-building jobs in the RELION processing pipeline to Amazon Web Services and synchronises data in real time between the cloud and the user’s computer (Cianfrocco et al. 2018). In the area of data analysis, the platform ZeroCostDL4Mic is an entry-level tool that simplifies the use of deep learning in microscopy image analysis. It integrates cloud-based virtual machines provided by Google Colab (von Chamier et al. 2021) (Table 2).

Workflow optimisation

Optimisation of workflows is critical and can have a significant effect on the ability to process big data. Importantly, optimising some tasks in a workflow can be readily achievable (Silver 2022). For example, default and legacy settings in computers and network tools, as well as legacy or traditional tools or scripts used for data movement, may be inadequate for big data. These tools, scripts and settings may have been used for many years (sometimes decades) at a microscopy facility and are therefore well established and familiar to facility support staff and their research users. Newer, more modern approaches may offer more advanced settings, be better suited to recent standards and infrastructure, and provide a more user-friendly GUI (instead of command lines). Facilities thus require resources to investigate workflow improvements, implement the changes and provide training and support to their research communities. This underscores how critical collaborations between microscopy research facilities and IT providers or e-science specialists are to ensure that research facilities and their support staff can exploit technological advances. Importantly, the adaptation of workflows to big data should not be reduced to investment in new hardware.

Data transfer from the instrument computer to temporary storage or from temporary storage to computing capabilities (transfers a, b and c in Fig. 2) can be a bottleneck for big data because of the high volumes, velocity or variety of the data collected. Various tools such as SCP, SFTP, rsync and rclone have commonly been used in combination with task schedulers (for example, cron, Windows Task Scheduler and systemd) in (semi-)automated data-movement workflows. However, SCP, SFTP and rsync have intrinsic limitations that hinder the high throughput required by big-data EM. Briefly, rsync does not support parallel data-transfer streams and, as a dial-up-age tool, performs poorly on today’s multi-gigabit-per-second connections. Despite their higher transfer speed due to concurrent transfer streams, SFTP and SCP have throughput limitations because of their encryption algorithms. In contrast, rclone is a more modern program that can sustain elevated transfer rates over 1 Gb/s. However, it is important to note that, ultimately, transfer rates are by nature limited by the read-and-write speed of disks (referred to as I/O speed) and network bandwidths. A key feature of rclone is that it supports many common protocols and application programming interfaces (APIs) for transfers to and from a range of locations, including cloud storage. The underlying model for data transfer is another important factor: data can be taken out of the instrument computer by the storage server (pulling) or put into storage by the instrument computer (pushing). The CPU load is on the instrument computer in the latter case, whereas it is on the storage server in the former. Switching from a push to a pull model may lead to a great increase in data-transfer rate and a reduced time to transfer data. It frees up computing power in the instrument computer that can then be dedicated solely to capturing images and performing other tasks on images, such as conversion to other formats. For example, converting MRC images to a TIFF format is CPU-intensive. In an optimised big-data workflow, using the TIFF format with lossless compression (LZW and ZIP methods) is especially advantageous as it reduces storage space by a factor of five on average without any loss of information (Eng et al. 2019). This also translates into faster transfer times as datasets are smaller. Overall, such apparently small changes to optimise an EM workflow and harness big data can produce dramatic improvements, such as enabling on-the-fly data processing for real-time or near-real-time data-quality assessment and optimised instrument usage (Silver 2022).
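
A minimal sketch of such a conversion is shown below, using the third-party mrcfile and tifffile Python packages; here the deflate (ZIP) codec stands in for the lossless options mentioned above, and the file names are placeholders.

```python
import mrcfile
import tifffile

# Read the raw MRC image (or stack) into a NumPy array.
with mrcfile.open("micrograph_0001.mrc", permissive=True) as mrc:
    data = mrc.data

# Write a losslessly compressed TIFF: no pixel values are altered,
# but the file is typically several times smaller.
tifffile.imwrite("micrograph_0001.tif", data, compression="zlib")
```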

Management of big data

Data management is the cardinal activity that underpins the value of EM big data. Amongst the ten V’s of EM big data (see Section “Big data in electron microscopy” and Table 1), value is closely associated with veracity, visibility, vocabulary and volatility, which all depend on the adoption of best practices in data management by research facilities and researchers. The management of EM big data faces three concurrent challenges: (1) the need for community-wide standards and conventions for data description, annotation, storage and sharing; (2) the capacity to handle the data deluge; and (3) the operationalisation of the FAIR and CARE principles. User training and community uptake of standards, conventions and best practices are paramount to ensure that EM big data are managed in a way that is appropriate and sustainable. In particular, handling the big-data deluge efficiently may require a change in practices and an understanding of the overall workflows (Sader et al. 2020; Alewijnse et al. 2017). Online learning environments such as MyScope, developed by Microscopy Australia (myscope.training), with its module on research data management, facilitate user training and the early adoption of best practices to support the FAIR principles.

Establishing standards and conventions

Big data have created new opportunities for the EM community. Data variety and data vocabulary are key attributes in the establishment of standards and conventions for EM data. The various proprietary and non-proprietary data formats created at the time of data capture, processing and analysis often lead to the creation of files that contain non-standardised metadata, incomplete metadata or no metadata at all for data annotation or description. Combined with the lack of workflows that allow for the systematic and consistent collection of standardised generic metadata, there is a significant risk that the value of EM big data may be low. In addition, multidisciplinary microscopy research facilities can manage over 50 different file formats, which is not sustainable in the long term (Poger et al. 2021). It is therefore essential to develop, promote and adopt rigorous international standards for annotating, describing and formatting microscope image data, notwithstanding the legitimate reasons that software developers may have for using proprietary data formats or non-conventional metadata. However, it is not sufficient to define standards and conventions. For those to be adopted, it is pivotal that the science community be provided with the necessary software libraries to allow lossless data conversion to and from the convention or standard (Patwardhan et al. 2012). Bio-Formats is an excellent example of such a library, aiming to promote the open standard known as the OME (Open Microscopy Environment) data model (Goldberg et al. 2005; Linkert et al. 2010).

Standards facilitate reproducibility of experiments and data reuse, encourage research transparency and data sharing, and contribute to the development of interoperable ecosystems of tools for processing, analysis and visualisation of data. Repositories are powerful tools to standardise data and metadata and allow seamless data sharing. The EM field has been a forerunner in setting up public microscopy repositories for EM data. Good examples are the Electron Microscopy Data Bank for 3D-structure data (EMDB or EMDataBank) (Lawson et al. 2015) and the Electron Microscopy Public Image Archive (EMPIAR), a public archive for raw 2D image data that also supports 3D-structure data deposited in EMDB (Iudin et al. 2016). Notably, the Protein Data Bank (PDB) stores atomic models constructed using EMDB data (Berman et al. 2000, 2003). More broadly, and for completeness, the repositories BioImage Archive for published image data (Hartley et al. 2022) and Image Data Resource (IDR) for reference data with added value (Williams et al. 2017) accept image data irrespective of the imaging technique used.

There are currently many initiatives guiding the field of light microscopy towards standardisation that may also benefit EM. For example, one of the goals of the AI4LIFE project is to develop standards by creating harmonised and interoperable tools and methods, in particular in the submission, storage and FAIR access of reference data, reference annotations and AI methods (https://ai4life.eurobioimaging.eu/about-us/#objective). Recently, the Recommended Metadata for Biological Images (REMBI) were proposed as metadata guidelines for light and electron microscopy (Sarkans et al. 2021). Other notable efforts include the development of guidelines for Minimum Information about Highly Multiplexed Tissue Imaging (MITI) for data and metadata in genomics and microscopy of tissue images (Schapiro et al. 2022), alongside the establishment of the 3D Microscopy Metadata Standards (3D-MMS) by the Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative and the wider neuroscience research community (Ropelewski et al. 2022). Similarly, the Brain Imaging Data Structure (BIDS), a standard initially developed for neuroimaging data and metadata for magnetic resonance imaging (Gorgolewski et al. 2016), has been extended to microscopy (Microscopy-BIDS) to support common imaging methods, including optical and electron microscopy, with the aim of harmonising metadata definitions for hardware, image acquisition and sample properties in multi-modal, multi-scale imaging (Bourget et al. 2022). The initiative QUality Assessment and REProducibility for instruments and images in Light Microscopy (QUAREP-LiMi) plans to improve data quality and experiment reproducibility through the development of common standards, guidelines, metadata models and tools (Boehm et al. 2021; Nelson et al. 2021). Amongst the achievements of QUAREP-LiMi are the 4DN-BINA-OME (NBO) Microscopy Metadata specifications framework (Hammer et al. 2021) and three interoperable metadata collection tools, namely Micro-Meta App (Rigano et al. 2021), MethodsJ2 (Ryan et al. 2021) and MDEmic (Kunis et al. 2021). Importantly, many of these initiatives build on existing standards and tools to maximise sustainability, interoperability and adoption by the community. Specifically, the tool MethodsJ2 is an ImageJ/Fiji plugin (Schneider et al. 2012; Schindelin et al. 2012; Rueden et al. 2017); MDEmic is fully compatible with Bio-Formats and the OME data model, and is part of the standard installation package of the image database OMERO (under the name OMERO.mde). MethodsJ2, MDEmic and Micro-Meta App also interoperate with each other, creating a rich environment conducive to metadata collection. Regarding standards, 4DN-BINA-OME and 3D-MMS are extensions of existing metadata standards (the OME data model in both cases, as well as the generic DataCite metadata schema for 3D-MMS).

Global community-driven partnerships such as the pan-European consortium Euro-BioImaging (eurobioimaging.eu), BioImaging North America (bioimagingnorthamerica.org) and Global BioImaging (globalbioimaging.org), in conjunction with national networks, play an important role in the cooperative development and dissemination of best practices, standards and conventions for formats, repositories, annotation, description, processing, visualisation and analysis of image data across geographical boundaries, techniques and disciplines (Swedlow et al. 2021). In Australia, Microscopy Australia promotes and coordinates the adoption of guidelines for the collection of metadata based on the DataCite schema (schema.datacite.org), across materials science and the life sciences and for all microscopy modalities. In particular, the guidelines being developed require that a minimum set of metadata properties be collected and described using consistent information.
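
As an illustration of what such a minimum set could look like in practice, the following sketch checks a metadata record against the mandatory properties of the DataCite schema (identifier, creator, title, publisher, publication year and resource type). The field names approximate the DataCite JSON representation and all values are placeholders.

```python
# Sketch of a minimum metadata record based on the mandatory properties of the
# DataCite metadata schema. Values are placeholders; field names approximate
# the DataCite JSON representation.
REQUIRED = {"identifier", "creators", "titles", "publisher", "publicationYear", "resourceType"}

record = {
    "identifier": {"identifier": "10.xxxxx/example-doi", "identifierType": "DOI"},  # placeholder DOI
    "creators": [{"name": "Doe, Jane", "nameIdentifier": "https://orcid.org/0000-0000-0000-0000"}],
    "titles": [{"title": "Cryo-EM dataset of an example specimen"}],
    "publisher": "Example Microscopy Facility",
    "publicationYear": 2023,
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}

missing = REQUIRED - record.keys()
if missing:
    raise ValueError(f"Metadata record is missing mandatory DataCite properties: {missing}")
print("All mandatory DataCite properties are present.")
```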

Managing the big-data deluge

In principle, the management of EM big data is not fundamentally different from that of “small data”. However, specific attributes of EM big data may reveal existing flaws or issues in infrastructure (hardware, software or workflow) by stretching it to its limits. In particular, data storage, manual operations and methods for reading and writing smaller files or datasets that have so far been adequate may be neither compatible with nor practical for large datasets.

An important consequence of the data deluge is the challenge of long-term storage, archiving and preservation of microscopy data (and of their accompanying metadata if these are stored in separate files). All of this is associated with data volatility (Table 1). Storing data is the first step in data management. Big data require highly scalable storage systems at a reasonable cost. As the volumes of data generated by single EM experiments can nowadays amount to terabytes, it is essential that scientists and research organisations assess how important each dataset is and, more importantly, understand how long each dataset should be kept. Research organisations are expected by law or by funders’ requirements to retain categories of research data for specific retention periods before disposing of them. However, datasets often exist in multiple copies across different storage systems or under different user accounts within an organisation, and these copies are rarely, if ever, destroyed. This is not a sustainable practice, for evident reasons. In addition, a minimum of information must be associated with the stored data for them to keep their value over their lifetimes. That implies organisational or, preferably, community-endorsed universal standards and conventions to establish minimum information requirements, including the types of metadata to collect, the data formats to keep and specific directives on minimum storage times. Finally, provisions should be made for the long-term storage and backup of original data files. For the latter, chief investigators have a central role to play in fostering best practices and ensuring research data integrity.
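
One simple, widely applicable practice is to store a small, machine-readable manifest alongside each dataset that records file checksums, sizes and the applicable retention period. The sketch below illustrates the idea; the directory name and retention period are hypothetical and the approach is not tied to any particular storage system.

```python
import hashlib
import json
from datetime import date, timedelta
from pathlib import Path

DATA_DIR = Path("acquisition_2023_10_01")   # hypothetical dataset directory
RETENTION_YEARS = 5                          # retention period set by policy or funder

def sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

manifest = {
    "dataset": DATA_DIR.name,
    "created": date.today().isoformat(),
    "dispose_after": (date.today() + timedelta(days=365 * RETENTION_YEARS)).isoformat(),
    "files": [
        {"path": str(p.relative_to(DATA_DIR)), "bytes": p.stat().st_size, "sha256": sha256(p)}
        for p in sorted(DATA_DIR.rglob("*")) if p.is_file()
    ],
}

# Store the manifest next to the data so that integrity and retention
# information remain available over the dataset's lifetime.
(DATA_DIR / "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
```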

The use of AI to assist in particle picking in cryo-EM and macromolecule picking in cryo-ET is an efficient and time-effective avenue for automating tasks that are complex, laborious, tedious and time-consuming when completed manually by an expert. When trained properly, deep-learning methods can provide fast and reliable results on large datasets (Wang et al. 2016; Moebel et al. 2021). Furthermore, automation promotes consistent and standardised image annotation using, for example, controlled terms, which in turn can contribute to the adoption of common machine-readable metadata standards and facilitate interoperability between the various programs in processing and analysis pipelines that use annotation metadata.
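
Deep-learning pickers are beyond the scope of a short code example, but the sketch below illustrates the general pattern with a simple classical substitute (Gaussian smoothing followed by local-maximum detection on a synthetic micrograph) and, more importantly, shows how automated picks can be written out as machine-readable annotations with consistent field names. The threshold, file name and vocabulary are illustrative assumptions, and the method shown is not one of the deep-learning tools cited above.

```python
import json
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# Synthetic "micrograph": random noise with a few bright blobs standing in for particles.
micrograph = rng.normal(0.0, 1.0, size=(512, 512))
for y, x in [(100, 120), (300, 250), (400, 60)]:
    micrograph[y - 5:y + 5, x - 5:x + 5] += 4.0

# Classical picking sketch: smooth, then keep local maxima above a simple threshold.
smoothed = ndimage.gaussian_filter(micrograph, sigma=3)
local_max = ndimage.maximum_filter(smoothed, size=15) == smoothed
candidates = np.argwhere(local_max & (smoothed > smoothed.mean() + 3 * smoothed.std()))

# Write the picks as machine-readable annotations with consistent, controlled field names.
annotations = {
    "annotation_type": "particle_pick",          # illustrative controlled term
    "method": "gaussian_smoothing_local_maxima",
    "particles": [{"x": int(x), "y": int(y)} for y, x in candidates],
}
with open("picks.json", "w") as fh:
    json.dump(annotations, fh, indent=2)
print(f"Picked {len(annotations['particles'])} candidate particles")
```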

Beyond the standardisation of data formats, the deluge of massive volumes of data creates bottlenecks that cannot be solved by new hardware, workflow optimisation or automation of tasks. For files greater than 10 GB in size, repeated access to datasets in proprietary file formats, whether for on-the-fly translation or for permanent conversion into open formats such as OME-TIFF and HDF5, can come at such a computational cost or take so much time that it precludes data- and resource-hungry applications such as the training of artificial intelligence models and visualisation in public repositories (Moore et al. 2021a). This is because traditional image formats such as TIFF require the whole image to be loaded into computer memory when the file is opened or displayed on a screen, which becomes impractical or impossible when the size of an image exceeds the available memory. Traditional image formats are consequently ill-suited to repeated, frequent access to data. In contrast, pyramidal images use a multi-resolution representation of the data that enables zoomable visualisation and selectable levels of resolution for interactive navigation and scalable image analysis (Moore et al. 2021a, 2021b). The image is stored as a pyramid of images in which higher levels are smaller and at lower resolution and lower levels are larger and at higher resolution. Each level of the pyramid is in turn composed of small image tiles, so that only the level (and region) needed for display or analysis has to be loaded. The next-generation file format (NGFF) has been developed by the OME team as a solution to the limitations of traditional image formats in microscopy (Moore et al. 2021a, 2021b). It is a multi-dimensional, multi-resolution, high-content pyramidal image format that contains pixel data and metadata (such as annotations produced by a machine-learning tool), and its use is encouraged for data sharing and reuse, in particular in public data repositories and collaborative data resources. Importantly, NGFF is complementary to and does not supplant other open or proprietary formats, because each image format is optimised for its own requirements. For example, some image formats offer optimised writing performance well suited to fast data capture, whereas a machine-learning application may depend on high-dimensional, high-content scalability to allow rich annotation of datasets (Moore et al. 2021a, 2021b).
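
The sketch below illustrates the multi-resolution idea: it builds a small image pyramid and stores each level as a chunked dataset so that a reader can load only the level, and only the chunks, it needs. HDF5 (via h5py) is used here purely for simplicity; OME-Zarr/NGFF applies the same principle with standardised multiscale metadata, and the array sizes and file names below are arbitrary.

```python
import numpy as np
import h5py

# Full-resolution synthetic image standing in for a large EM montage.
level0 = np.random.rand(4096, 4096).astype("float32")

def downsample(img: np.ndarray) -> np.ndarray:
    """2x downsampling by block averaging."""
    return img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))

# Build the pyramid: each level is half the size of the previous one.
pyramid = [level0]
while min(pyramid[-1].shape) > 512:
    pyramid.append(downsample(pyramid[-1]))

# Store each level as a separate chunked dataset. HDF5 stands in here for any
# chunked container; OME-Zarr/NGFF applies the same idea with multiscale metadata.
with h5py.File("pyramid.h5", "w") as f:
    for i, level in enumerate(pyramid):
        f.create_dataset(f"level_{i}", data=level, chunks=(256, 256))

# A reader can now open only the level (and only the chunks) it needs, for example
# a low-resolution overview for display, without loading the full image into memory.
with h5py.File("pyramid.h5", "r") as f:
    overview = f["level_3"][:256, :256]
print(overview.shape)  # (256, 256)
```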

Operationalising the FAIR and CARE principles

The fact that the big-data revolution in EM is happening concomitantly with the growing importance of the FAIR and CARE principles (Wilkinson et al. 2016; Carroll et al. 2020) across the research sector offers a unique opportunity to tackle the consequences and challenges of the big-data deluge in a way that is sustainable and that maximises the output and impact of EM science.

The CARE Principles for Indigenous Data Governance are complementary to the FAIR principles. Whereas the FAIR principles propose a data-centric approach to data management, the CARE principles are people and purpose oriented (Carroll et al. 2021). Indigenous data are data, information and knowledge – in any format – that impact or concern Indigenous Peoples, nations and communities at the collective or individual level. They encompass data about their resources and environments (Kukutai and Taylor 2016; Nickerson 2017). In the context of EM big data, it is important to be aware that Indigenous data may be buried in larger datasets, or that samples or additional data used in EM experiments or research projects may be associated with Indigenous data. Such data may be hard to find, mislabelled or controlled by others in a manner inconsistent with the FAIR and CARE principles (Carroll et al. 2021). As a result, data may be subject to both CARE and FAIR. Given the tension between protecting Indigenous rights and interests in data and promoting FAIR data in research, the implementation of CARE should be considered a required extra dimension of FAIR to ensure that the general use of data aligns with Indigenous rights. Importantly, researchers and facilities will find it easier to apply the CARE principles to data that are already managed and properly documented through compliance with FAIR. Therefore, compliance with or promotion of CARE is implied in the rest of this review whenever FAIR data are referred to.

Putting FAIR and CARE into practice is associated mainly with three attributes of EM big data: vocabulary, visibility and value (Table 1). Advances in addressing or enhancing these three attributes help enrich data to a higher FAIR or CARE state. This can have flow-on effects on the development and adoption of best practices in the management of EM big data overall, in particular in how big-data variety, veracity and volatility are dealt with.

The first step in reusing data is to find them. As shown earlier, data vocabulary, that is, community-endorsed standards for data models, data formats and metadata for the annotation and description of data, is critical and integral to all aspects of FAIR (Wilkinson et al. 2016). In addition to accelerating the implementation and adoption of new standards, digital data repositories such as PDB and EMDB enhance data visibility. They provide stable, transparent, long-term storage and have become essential to the management and sharing of data (Habermann 2020). They increase the findability, accessibility, discoverability, sharing and reusability of data. They facilitate collaboration by creating an environment in which requesting and transferring data is easier. Researchers are encouraged to make their data available using repositories because data shared in repositories are cited more often than data shared by other means (e.g. data available on request or data contained within a publication and its supplementary materials) (Colavizza et al. 2020). Importantly, FAIR metadata and data should be visible to, and easy to find for, both humans and computers. Thus, machine-readable metadata are essential for the automatic and seamless discovery of data. The adoption of machine-actionable, community-endorsed persistent identifiers (PIDs) plays an important role in the findability, verification, replicability and reusability of data (Starr et al. 2015). PIDs are long-lasting, globally unique, digital references to objects, people or organisations. The Digital Object Identifier (DOI) for resources (such as publications), the persistent uniform resource locator (PURL) for web resources, the ORCID iD for researchers and the Research Organization Registry (ROR) ID for research organisations (such as universities, centres and institutes) are well-established PIDs across all research communities. Additional PIDs are still being developed, so their levels of awareness and adoption vary across communities. Initiatives that may be especially relevant to the biological EM community include Research Resource Identifiers (RRIDs) for antibodies, model organisms and tools such as software and databases (rrids.org) (Bandrowski et al. 2015), the International Generic Sample Number (IGSN) for physical samples (igsn.org), the Research Activity ID for research projects (raid.org.au) and PIDs for instruments. The latter follow the recommendations of the Research Data Alliance Persistent Identification of Instruments Working Group (PIDINST) (Stocker et al. 2020). The work of the PIDINST group and the mapping of the schema it developed onto the DataCite metadata schema (schema.datacite.org) emphasise the importance of instruments and associated metadata in the assessment of data quality and reuse. In Australia, Microscopy Australia is working towards a community-endorsed definition of instrument and guidelines for instrument PIDs, with the aim of promoting and implementing PIDs for all microscopes across Microscopy Australia’s network of research infrastructure facilities. Instrument PIDs have many potential benefits, such as facilitating asset management for a facility, providing unambiguous references to digital representations of instruments, and enabling the generation of metrics that quantify instrument use and inform the rationale for future funding. Moreover, linking data to specific instruments enables citation and impact tracking.
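
As a small example of machine-actionability, the sketch below resolves a DOI to structured metadata using standard DOI content negotiation. The DOI shown is that of the FAIR principles paper (Wilkinson et al. 2016) cited in this review; any registered DOI could be substituted.

```python
import requests

# Resolve a DOI to machine-readable metadata via DOI content negotiation.
doi = "10.1038/sdata.2016.18"   # Wilkinson et al. (2016), cited in this review
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
response.raise_for_status()
metadata = response.json()

# A few fields that downstream systems can ingest without human intervention.
print(metadata.get("title"))
print(metadata.get("container-title"))
print([author.get("family") for author in metadata.get("author", [])])
```
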
The benefits of such a rich ecosystem of PIDs can be further amplified by connecting them via their metadata in a PID graph (Cousijn et al. 2021). A PID graph establishes relationships between different entities within the research landscape (for example, objects, organisations, people, funders and instruments), enabling all stakeholders in research to access new information. For example, microscopy facilities can acquire a snapshot of the impact of the research that they have supported (such as the number of publications, publication citations, cross-disciplinary impact and whether research outputs are reused). Overall, adopting PIDs and integrating them into the platforms and information systems used in research organisations (e.g. research management, finance, human resources) facilitates information exchange between those systems, within and between organisations. Critically, it eliminates the need to manually rekey information about, for example, a grant, a publication or a person (e.g. publications, employment history) multiple times into multiple systems, which leads to cost savings for research organisations (Brown et al. 2021, 2022).
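
The toy example below sketches what a PID graph looks like conceptually: placeholder identifiers for a dataset, a researcher, an organisation, an instrument, a publication and a grant are connected by typed relationships, and a simple query retraces which publications derive from data collected on a given instrument. Real PID graphs, such as the DataCite PID Graph, are assembled from registered metadata rather than built by hand; all identifiers here are fictitious.

```python
import networkx as nx

# Toy PID graph: nodes are placeholder persistent identifiers, edges are the
# relationships recorded in their metadata.
g = nx.DiGraph()
g.add_edge("doi:10.xxxx/dataset-1", "orcid:0000-0000-0000-0000", relation="createdBy")
g.add_edge("doi:10.xxxx/dataset-1", "instrument-pid:example-tem-01", relation="collectedOn")
g.add_edge("doi:10.xxxx/dataset-1", "ror:example-university", relation="producedAt")
g.add_edge("doi:10.yyyy/paper-1", "doi:10.xxxx/dataset-1", relation="cites")
g.add_edge("doi:10.yyyy/paper-1", "grant:example-funder-123", relation="fundedBy")

# Example query a facility might run: which publications trace back, via a
# dataset, to a given instrument?
instrument = "instrument-pid:example-tem-01"
datasets = [source for source, _ in g.in_edges(instrument)]
papers = {
    source
    for dataset in datasets
    for source, _, attrs in g.in_edges(dataset, data=True)
    if attrs.get("relation") == "cites"
}
print(papers)
```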

Metadata and PIDs are fundamental to the operationalisation of the FAIR and CARE principles and thereby increase the value of data. They increase trust in data and research reproducibility because they record the provenance of data and the tools, resources and methods associated with their creation: before the experiment, during the experiment at the time of data capture, and after the experiment during data processing and analysis. Without thorough implementation of FAIR, the value of data peaks at publication and then falls off over time. Information entropy is the natural tendency for data, information and understanding to decline after data are used and results are published (Habermann 2020). Given the investment in big-data-producing electron microscopes, this is not a desired outcome. High-quality metadata that support the understanding and reuse of data are a critical antidote to information entropy. A research data management plan that describes extensively how data will be annotated and described using metadata and PIDs, and how this information will be obtained, maximises the chance of producing high-value FAIR data (Michener 2015). Workflows that collect metadata reliably, consistently, systematically and automatically play an essential role in adding value to EM data. This requires scalable and adaptable software that integrates information from various sources into metadata, including electronic notebooks and instrument schedulers or booking systems at facilities. Besides widely used open-source applications for image data management such as OMERO (Allan et al. 2012; Burel et al. 2015; Li et al. 2016), the recently developed tools NexusLIMS (Taillon et al. 2021) and Pitschi (Nguyen 2022) are examples of data-workflow engines that assist in the capture and management of research data from electron microscopes as well as metadata from various sources (Table 2). Interestingly, both tools have been developed by microscopy research facilities and combine automated metadata harvesting with an intuitive web-based GUI for searching, browsing and examining research data. In particular, Pitschi is an end-to-end data-management solution based on the Clowder framework (Marini et al. 2018) that supports the entire research data lifecycle by storing, indexing and annotating the data generated at a facility, starting from the capture of raw data at the instruments (Nguyen 2022). Pitschi is fully integrated with the data storage infrastructure at The University of Queensland (Brisbane, Australia), where it was developed. Importantly, Pitschi adheres to and fosters the FAIR principles: metadata of supported file types are extracted automatically during data ingestion, and metadata can be further enriched, for example with users’ ORCID iDs and instrument PIDs collected from sources such as the facility’s instrument-booking system.
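
The schematic sketch below illustrates the harvest-and-enrich pattern embodied by such tools; it is not code from NexusLIMS, Pitschi or OMERO. Machine-generated acquisition metadata are read from the dataset, enriched with the operator's ORCID iD and the instrument PID taken from a hypothetical booking-system export, and written out as a sidecar record ready for ingestion. All file names and field names are assumptions made for illustration.

```python
import json
from pathlib import Path

# Schematic sketch of the harvest-and-enrich pattern used by facility
# data-workflow tools (not their actual code). File names, field names and the
# booking-system export below are hypothetical.

def read_instrument_metadata(acquisition_dir: Path) -> dict:
    """Read whatever machine-generated metadata accompanies the acquisition."""
    sidecar = acquisition_dir / "acquisition_parameters.json"   # hypothetical file
    return json.loads(sidecar.read_text()) if sidecar.exists() else {}

def read_booking_record(booking_export: Path, session_id: str) -> dict:
    """Look up the facility booking record for this imaging session."""
    bookings = json.loads(booking_export.read_text())            # hypothetical export
    return next((b for b in bookings if b.get("session_id") == session_id), {})

acquisition = Path("sessions/2023-10-01_krios")                  # hypothetical path
booking = read_booking_record(Path("bookings_export.json"), session_id="S-1234")

record = {
    "instrument_pid": booking.get("instrument_pid"),   # PID of the microscope
    "operator_orcid": booking.get("user_orcid"),        # ORCID iD from the booking system
    "session_id": booking.get("session_id"),
    "acquisition": read_instrument_metadata(acquisition),
}

# Store the merged, machine-readable record next to the data for later
# ingestion into a repository or data-management platform.
(acquisition / "harvested_metadata.json").write_text(json.dumps(record, indent=2))
```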

Finally, it is important to note that operationalising FAIR and CARE means ensuring that platforms such as VREs used for data processing, analysis and visualisation enable FAIR data and enrich the data that they create to a richer FAIR state (or, at least, do not make them less FAIR than the input data). Despite their advantages, some platforms, in particular cloud-based ones, do not automatically or natively support FAIR. Cloud platforms tend to encourage self-sufficient environments, including data repositories, at the expense of interoperable services within and outside the cloud (Sheffield et al. 2022). The Research Data Alliance FAIR for VREs Working Group (rd-alliance.org/groups/fair-virtual-research-environments) is developing guidelines to ensure that VREs enable FAIR data, in coordination with existing communities working with VREs and with VRE developers.

Conclusions and future perspectives

The big-data revolution undergone by electron microscopy (EM) presents an opportunity to maximise the value of the investment in EM, and of EM data themselves, by implementing approaches to transfer, compute and manage big data in ways that are faster, more accessible, more reliable and more sustainable. While the ten V’s of EM big data (volume, variety, velocity, veracity, value, visibility, visualisation, vocabulary, variability and volatility) have created challenges for researchers and microscopy research facilities, each challenge is a chance for optimised workflows, greater research impact, richer metadata or more widely adopted best practices. This review highlights an overall need for broader and better engagement and coordination across the EM community in two areas: first, in the sharing of experiences on how to adapt and optimise EM-underpinning infrastructure for big-data transfer, processing and analysis; and secondly, in the establishment and fostering of guidelines, standards and conventions for the development of VREs, the unification of data formats (or the rationalisation of their numbers) and the collection of metadata for data description and annotation. Both aspects are especially important as the integration of omics and EM into single, multi-modal characterisation approaches will lead to even larger and more diverse datasets generated at high throughput (McCafferty et al. 2020; Kuhn Cuellar et al. 2022; Watson et al. 2022). Finally, the strong drive to make scientific data FAIR and CARE is a golden opportunity for the microscopy community because it emphasises that standards for data and metadata cannot be defined by individual laboratories, research groups or microscope manufacturers to suit their own needs and interests. Instead, they should arise from extensive, community-wide consultations to ensure rapid adoption that will serve the science community for the foreseeable future.