The Journal of Supercomputing, Volume 72, Issue 12, pp 4546–4572

Optimizing the data-collection time of a large-scale data-acquisition system through a simulation framework

  • Tommaso Colombo
  • Holger Fröning
  • Pedro Javier García
  • Wainer Vandelli
Article

Abstract

The ATLAS detector at CERN records particle collision “events” delivered by the Large Hadron Collider. Its data-acquisition system identifies, selects, and stores interesting events in near real-time, with an aggregate throughput of several tens of GB/s. It is a distributed software system executed on a farm of roughly 2000 commodity worker nodes communicating via TCP/IP on an Ethernet network. Event data fragments are received from the many detector readout channels and are buffered, collected, analyzed, and then either stored permanently or discarded. Like data-acquisition systems in general, this system is sensitive to the latency of the data transfer from the readout buffers to the worker nodes. Challenges affecting this transfer include the many-to-one communication pattern and the inherently bursty nature of the traffic. This paper addresses the main performance issues brought about by this workload, focusing in particular on the so-called TCP incast pathology. Since systematic studies of these issues are often impeded by operational constraints related to the mission-critical nature of these systems, we developed a simulation model of the ATLAS data-acquisition system. The resulting simulation tool is based on the well-established, widely used OMNeT++ framework. The tool was validated by comparing its simulation results with existing measurements of the system’s behavior. Furthermore, it enables the study of the theoretical behavior of the system in numerous what-if scenarios and with modifications that are not immediately applicable to the real system. In this paper, we exploit this capability to analyze the behavior of the system under different traffic shaping and scheduling policies, and with network hardware modifications. This analysis leads to conclusions that could be used to devise future system enhancements.
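To make the many-to-one problem concrete, the following is a minimal toy sketch in Python. It is not the authors' OMNeT++ model and does not simulate TCP; it only counts how many packets overflow a finite switch buffer when several senders burst toward one receiver at the same time, and how limiting the number of concurrently transmitting senders (a simple form of traffic shaping) changes that count. The function name, its parameters, and all numeric values are assumptions chosen for illustration.

```python
def simulate_incast(num_senders, burst_packets, buffer_packets, max_concurrent):
    """Count packets dropped at the switch port feeding a single receiver.

    Toy assumptions (not taken from the paper): time advances in ticks,
    every active sender injects one packet per tick, the output port
    forwards one packet per tick, and dropped packets are only counted,
    never retransmitted. burst_packets must be at least 1.
    """
    remaining = [burst_packets] * num_senders  # packets each sender still has to send
    waiting = list(range(num_senders))         # senders not yet allowed to transmit
    active = []                                # senders currently transmitting
    queue = 0                                  # packets held in the port buffer
    dropped = 0

    while waiting or active or queue:
        # Traffic shaping: admit new senders only up to the concurrency limit.
        while waiting and len(active) < max_concurrent:
            active.append(waiting.pop(0))

        # Each active sender injects one packet toward the shared output port.
        for sender in active:
            remaining[sender] -= 1
            if queue < buffer_packets:
                queue += 1
            else:
                dropped += 1  # buffer overflow

        # Senders that have finished their burst leave the active set.
        active = [s for s in active if remaining[s] > 0]

        # The output port forwards one packet to the receiver per tick.
        if queue:
            queue -= 1

    return dropped


if __name__ == "__main__":
    unshaped = simulate_incast(num_senders=10, burst_packets=20,
                               buffer_packets=50, max_concurrent=10)
    shaped = simulate_incast(num_senders=10, burst_packets=20,
                             buffer_packets=50, max_concurrent=1)
    print("dropped without shaping:", unshaped)
    print("dropped with one sender at a time:", shaped)
```

In a real TCP deployment the dropped packets would have to be retransmitted, typically after a retransmission timeout, and that is where the latency penalty associated with incast comes from; this sketch deliberately leaves that part out.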

Keywords

Data acquisition, Network, Data collection, Latency, ATLAS, Incast, Ethernet, TCP

Notes

Acknowledgments

This work was partially supported by the Ministry of Economy and Competitiveness of Spain (project TIN2012-38341-C04) and the European Commission (ERDF).

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Tommaso Colombo (1, 2)
  • Holger Fröning (2)
  • Pedro Javier García (3)
  • Wainer Vandelli (1)

  1. Physics Department, CERN, Geneva, Switzerland
  2. Institut für Technische Informatik (ZITI), Universität Heidelberg, Mannheim, Germany
  3. Dep. de Sistemas Informáticos, Universidad de Castilla-La Mancha, Albacete, Spain