EXECUTIVE SUMMARY
The HPC-AI Leadership Organization (HALO) aims to advance scalable computing in both High Performance Computing (HPC) and Artificial Intelligence (AI). Through interviews with its Advisory Committee, Intersect360 Research has identified key challenges and opportunities in the HPC-AI landscape.
The design of optimal HPC-AI infrastructure presents complex challenges – including the choice between homogeneous and heterogeneous systems, integration of diverse processor types, and balancing AI and traditional HPC needs. These decisions impact user productivity, reproducibility of results, and adaptability to technological advancements.
A critical shortage of skilled personnel in computational sciences and HPC-AI system management is evident. Regional disparities also exist, with Asia/Pacific having an advantage due to an emphasis on educational programs. The commercial sector’s higher compensation further complicates talent retention in public and academic sectors.
Portability and ease of use remain significant concerns, as specialization in computational elements often leads to reduced portability. Porting applications across different systems or upgrading software versions requires substantial time and resources, and the effort is often viewed as sideways progress rather than advancement.
Accuracy and reproducibility of results are crucial, especially in fields like research, medicine, and engineering. The increasing diversity of chip technology complicates result consistency and verification across different systems.
The processor market also faces issues of suitability to HPC, chip supply, and design. Different applications require different processor types, leading to difficulties in system design and procurement. The high demand for AI-optimized GPUs is influencing market dynamics and potentially skewing HPC system designs.
AI and Large Language Model (LLM) training and use face hurdles including data availability, ownership issues, legal restrictions, and cultural implications. Developing efficient training methods, managing data transfers, and validating results are all ongoing concerns.
System software stacks present three major issues: the impact of “HPC Nationalism” on knowledge exchange, difficulties in integrating AI and traditional HPC support, and the need for improved schedulers and file systems to meet evolving HPC-AI needs.
Sustainability and power consumption are growing concerns. Increasing energy demands may necessitate infrastructure upgrades and potentially reshape HPC-AI management strategies.
HALO aims to address these challenges through cross-industry collaboration, guiding technology development, and fostering innovation in the HPC-AI field. The organization’s structure – divided into three geographical areas – allows for targeted approaches to regional needs and challenges.
TABLE OF CONTENTS
EXECUTIVE SUMMARY
INTRODUCTION
The HPC-AI Leadership Organization (HALO)
Interviews with HALO Advisory Committee Members
ISSUES FACING THE HPC-AI INDUSTRY
Designing Optimal HPC-AI Infrastructure
Supporting Insights from Intersect360 Research Studies
Figure 1: Accelerators Configured per Node in HPC-AI Systems
Figure 2: HPC-AI Performance Relative to Expectations
Figure 3: Average HPC-AI System Utilization, by Sector and Budget
Human Resources
Portability and Ease of Use
Accuracy and Reproducibility of Results
The Processor Market: Suitability to HPC, Chip Supply, Design Issues
Training and Use of AI/LLM
Supporting Insights from Intersect360 Research Studies
Figure 4: HPC User Engagement with Generative AI
Figure 5: LLM Adoption Among HPC Users
System Software Stacks
Supporting Insights from Intersect360 Research Studies
Figure 6: Programming Languages in Use for HPC-AI
Sustainability
CONCLUSIONS