Characterizing and optimizing Java-based HPC applications on Intel many-core architecture

Research and optimization of Java high-performance computing on the Intel many-core architecture

Abstract

The increasing demand for performance has stimulated the wide adoption of many-core accelerators such as the Intel® Xeon Phi™ Coprocessor, which is based on Intel’s Many Integrated Core architecture. While many HPC applications running in native mode have been tuned to run efficiently on Xeon Phi, it is still unclear how a managed runtime such as the JVM performs on this architecture. In this paper, we present the first measurement study of a set of Java HPC applications running under the JVM on Xeon Phi. One key obstacle to the study is that there is currently little Java support for Xeon Phi. The results presented here are based on the first port of the OpenJDK platform to Xeon Phi, in which the HotSpot virtual machine acts as the kernel execution engine. The main difficulty of the port is the incompatibility between the Xeon Phi ISA and the assembly library of the HotSpot VM. By evaluating the multithreaded Java Grande benchmark suite and our ported Java Phoenix benchmarks, we quantitatively study the performance and scalability issues of the JVM on Xeon Phi and draw several conclusions from the study. To fully utilize the vector computing capability of the coprocessor and hide its significant memory access latency, we present a semi-automatic vectorization scheme and a software prefetching model in HotSpot. With 60 physical cores and tuning, our optimized JVM achieves average speedups of 2.7x and 3.5x over a Xeon CPU by using vectorization and prefetching, respectively. Our study also indicates that it is viable and potentially performance-beneficial to run applications written for a managed runtime such as the JVM on Xeon Phi.

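Summary

Highlights

The Xeon Phi coprocessor, based on Intel's Many Integrated Core (MIC) architecture, has become a very popular many-core product in recent years. Meanwhile Java, thanks to its excellent platform portability and steadily improving virtual machine performance, is increasingly used in high-performance computing. Unfortunately, Intel does not provide Java support for Xeon Phi. This paper implements the first port of OpenJDK to the Intel MIC platform and successfully builds a complete Java runtime environment. Using a set of compute-intensive Java programs, we also study in detail the performance, throughput, and scalability of Java HPC on MIC, and propose a semi-automatic vectorization model and a data prefetching solution to address the problems identified. Experiments show that the proposed optimizations deliver significant performance gains on MIC, and demonstrate that the Intel many-core platform has great potential for Java high-performance computing.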

Author information

Correspondence to Binyu Zang.

About this article

Cite this article

Yu, Y., Lei, T., Chen, H. et al. Characterizing and optimizing Java-based HPC applications on Intel many-core architecture. Sci. China Inf. Sci. 60, 122106 (2017). https://doi.org/10.1007/s11432-015-0989-3

Keywords

  • many-core
  • Java
  • Xeon Phi
  • HPC
  • prefetching
