Novel Approaches for Distributing Workload on Commodity Computer Systems
Efficient management of a distributed system is a common problem for university and commercial computer centres, and handling node failures is a major aspect of it. Failures that are rare in a small commodity cluster become common at large scale, and there should be a way to overcome them without restarting all parallel processes of an application. The efficiency of existing methods can be improved by forming a hierarchy of distributed processes. That way, only the lower levels of the hierarchy need to be restarted when a leaf node fails, and only the root node needs special treatment. The process hierarchy changes in real time, and the workload is dynamically rebalanced across online nodes. This approach makes it possible to implement efficient partial restart of a parallel application and transactional behaviour for computer centre service tasks.
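The partial-restart idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Task` class, the `recover` function, and the round-robin reassignment policy are hypothetical names and choices introduced only for illustration. The sketch assumes each task records the node it runs on and its position in the hierarchy, and that a failed node takes down every task placed on it.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Task:
    """A process in the hierarchy (hypothetical model, for illustration)."""
    name: str
    node: str  # host the task currently runs on
    parent: Optional["Task"] = field(default=None, repr=False)
    children: List["Task"] = field(default_factory=list)

    def spawn(self, name: str, node: str) -> "Task":
        child = Task(name, node, parent=self)
        self.children.append(child)
        return child

    def subtree(self) -> Iterator["Task"]:
        # Pre-order walk: the task itself, then all of its descendants.
        yield self
        for child in self.children:
            yield from child.subtree()


def recover(root: Task, failed_node: str, online_nodes: List[str]) -> List[str]:
    """Restart only the subtrees whose topmost task was on the failed node.

    Returns the names of the restarted tasks. Tasks that were placed on the
    failed node are moved to surviving nodes in round-robin order (an assumed
    policy); everything outside the affected subtrees is left untouched.
    """
    if root.node == failed_node:
        # The root has no parent to restart it, so it needs separate handling.
        raise RuntimeError("root task lost; needs special treatment")
    alive = [n for n in online_nodes if n != failed_node]
    if not alive:
        raise RuntimeError("no surviving nodes")
    survivors = cycle(alive)
    restarted: List[str] = []

    def visit(task: Task) -> None:
        for child in task.children:
            if child.node == failed_node:
                # The surviving parent restarts the whole lost subtree.
                for t in child.subtree():
                    if t.node == failed_node:
                        t.node = next(survivors)
                    restarted.append(t.name)
            else:
                visit(child)

    visit(root)
    return restarted
```

For a root with two branches on different nodes, calling `recover` after one branch's node fails restarts only that branch, while the other branch keeps running where it was.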
Keywords: Long-lived transactions, Distributed pipeline, Node discovery, Software engineering, Distributed computing, Cluster computing