Issues Facing the HPC-AI Industry: Insights from the Advisory Committees of the HPC-AI Leadership Organization (HALO)

EXECUTIVE SUMMARY

The HPC-AI Leadership Organization (HALO) aims to advance scalable computing in both High Performance Computing (HPC) and Artificial Intelligence (AI). Through interviews with members of HALO's Advisory Committees, Intersect360 Research has identified key challenges and opportunities in the HPC-AI landscape.

The design of optimal HPC-AI infrastructure presents complex challenges – including the choice between homogeneous and heterogeneous systems, the integration of diverse processor types, and the balance between AI and traditional HPC needs. These decisions affect user productivity, result reproducibility, and adaptability to technological advancements.

There is a critical shortage of skilled personnel in computational sciences and HPC-AI system management. Regional disparities also exist, with Asia/Pacific holding an advantage due to its emphasis on educational programs. Higher compensation in the commercial sector further complicates talent retention in the public and academic sectors.

Portability and ease of use remain significant concerns, as specialization in computational elements often reduces portability. Porting applications across different systems or upgrading software versions requires substantial time and resources, and this work is often viewed as sideways progress rather than advancement.

Accuracy and reproducibility of results are crucial, especially in fields like research, medicine, and engineering. The increasing diversity of chip technology complicates result consistency and verification across different systems.

The processor market also faces issues of suitability to HPC, chip supply, and design. Different applications require different processor types, which complicates system design and procurement. The high demand for AI-optimized GPUs is influencing market dynamics and potentially skewing HPC system designs.

The training and use of AI and Large Language Models (LLMs) face hurdles including data availability, ownership issues, legal restrictions, and cultural implications. Developing efficient training methods, managing data transfers, and validating results are all ongoing concerns.

System software stacks present three major issues: the impact of “HPC Nationalism” on knowledge exchange, difficulties in integrating AI and traditional HPC support, and the need for improved schedulers and file systems to meet evolving HPC-AI needs.

Sustainability and power consumption are both growing concerns. Increasing energy demands may necessitate infrastructure upgrades and could reshape HPC-AI management strategies.

HALO aims to address these challenges through cross-industry collaboration, guiding technology development, and fostering innovation in the HPC-AI field. The organization’s structure – divided into three geographical areas – allows for targeted approaches to regional needs and challenges.

TABLE OF CONTENTS

EXECUTIVE SUMMARY

TABLE OF CONTENTS

INTRODUCTION

The HPC-AI Leadership Organization (HALO)

Interviews with HALO Advisory Committee Members

ISSUES FACING THE HPC-AI INDUSTRY

Designing Optimal HPC-AI Infrastructure

Supporting Insights from Intersect360 Research Studies

Figure 1: Accelerators Configured per Node in HPC-AI Systems

Figure 2: HPC-AI Performance Relative to Expectations

Figure 3: Average HPC-AI System Utilization, by Sector and Budget

Human Resources

Portability and Ease of Use

Accuracy and Reproducibility of Results

The Processor Market: Suitability to HPC, Chip Supply, Design Issues

Training and Use of AI/LLM

Supporting Insights from Intersect360 Research Studies

Figure 4: HPC User Engagement with Generative AI

Figure 5: LLM Adoption Among HPC Users

System Software Stacks

Supporting Insights from Intersect360 Research Studies

Figure 6: Programming Languages in Use for HPC-AI

Sustainability

CONCLUSIONS