Machine Learning in HPC: Workloads, Frameworks, Data Types, Configurations, Cloud Usage, Business Outcomes

EXECUTIVE SUMMARY

Intersect360 Research studied machine learning (ML) initiatives across HPC-using organizations worldwide. This study of 152 qualified respondents in academic, commercial and government organizations was conducted from Q4 2022 to Q1 2023. Our goal was to capture the dynamics of ML initiatives within organizations that employ HPC, with ML treated as a broad category that includes deep learning and other AI methods. The study analyzes a wide spectrum of factors: budgets, workflows, data types and resources, configurations, cloud usage, and organizational/business outcomes.

A large majority (89%) of the surveyed sites now use ML or have plans in place to use it, and 93% of the survey respondents directly influence ML budgets and/or purchasing. Annual budgets for combined HPC and ML range from under $10,000 to more than $10 million. Slightly more than half of the average budget is spent on standalone HPC, with one-fifth going toward pure ML/AI and the remaining one-quarter toward mixed HPC and ML workloads. Half of the sites report difficulty finding enough ML developers.

The sites represent a wide range of application domains. Only about one in five ML applications are in full production mode. ML is most heavily used for image recognition and for pre- and post-processing of HPC (simulation) applications. Text, numeric, audio and video data are all well represented. ML data sizes range from 500 GB to more than 1 PB. Supervised learning is much more common than unsupervised learning. Open-source software is only slightly more common than software from other sources. Most sites use multiple frameworks, with PyTorch and TensorFlow the most popular. The highest levels of scalability are associated with cloud-based or cloud-provided frameworks.