Architecting the Future: CXL Memory Disaggregation for Large-Scale AI/ML and HPC Workloads
Luis Ancajas
Micron, USA
The advent of CXL memory systems, together with the Linux-supported FAMFS (fabric-attached memory file system), enables unprecedented performance breakthroughs for critical AI and HPC workloads. This presentation introduces a CXL-based system and software architecture that redefines how memory is provisioned and used across servers and clusters, while allowing seamless adoption by existing software applications.
By enabling large pools of near-memory to be dynamically attached and shared, this architecture unlocks unprecedented performance improvements for graph databases, vector databases, and large-scale data-centric applications, all of which are central to growing AI/ML workloads. We will showcase how this system architecture overcomes long-standing storage bottlenecks by attaching tens to hundreds of terabytes of near-memory to the servers running these workloads.
Attendees will gain insights into the architectural innovations, real-world benchmarks and deployment strategies that mark a turning point for memory system design — ushering in a new era of composable infrastructure optimized for the new frontier of AI workloads.
Enable ML workloads on HPC infrastructure
Mauro Bianco and Stefano Schuppli
Swiss National Supercomputing Centre, ETH, Switzerland
Training state-of-the-art Large Language Models (LLMs) requires unprecedented compute, memory, and data throughput, posing new challenges for High-Performance Computing (HPC) centers traditionally optimized for large-scale numerical simulations. Public institutions have a strong incentive to invest in open-source and open-data LLMs to ensure transparency, reproducibility, and digital sovereignty, but serving this emerging workload requires significant adaptations to HPC infrastructure and services. The Alps Research Infrastructure at CSCS, built on GH200 technology at scale with 10,752 GPUs, provides a unique platform to explore this transition. While Alps serves a wide range of scientific domains, its early adoption by the Swiss AI community since 2023 has highlighted gaps between traditional HPC services and the dynamic needs of machine learning.
In this presentation we will discuss the overall design of the Alps infrastructure and concrete enhancements deployed, or under development, to simplify the adoption of ML workloads in an HPC environment. These efforts illustrate both the technical and organizational steps needed for HPC centers to become key enablers of open, reproducible, and publicly accessible LLMs—expanding their mission from advancing science through simulation to driving innovation in AI for the public good.
On some Communication Complexity Properties of Linear Transformations
Gianfranco Bilardi
University of Padova, Italy
The computation of linear transformations is ubiquitous, e.g., in machine learning and in signal processing. In parallel and hierarchical computing systems, the achievable performance is highly impacted by the communication requirements. We quantify these requirements via the information exchange, a metric relevant, e.g., to the area-time complexity of VLSI computations. We consider the relationship between the information exchange of closely related transforms, presenting the following results:
An invertible linear transformation and its inverse have the same information exchange.
A linear transform and its transpose have the same information exchange over any finite Principal Ideal Ring. (It is an open question whether this property holds for arbitrary finite rings.)
Joint work with Carlo Fantozzi.
Scientific Workflow Management for Generative Artificial Intelligence and Machine Learning
Ewa Deelman
University of Southern California, USA
As AI and ML become integral to scientific discovery, their workflows are evolving beyond single-system execution to span multiple HPC centers, cloud platforms, and edge environments. Future AI-driven science will demand orchestration models that can coordinate computation, data movement, and model sharing across diverse, geographically distributed resources—while preserving security, efficiency, and reproducibility. Scientific workflow management systems (WMS) such as Pegasus provide the foundation for this vision, offering automation, scalability, and provenance capture across heterogeneous computing landscapes. Looking ahead, WMS will not only manage the complexity of today’s AI pipelines but also enable new paradigms such as distributed model training, federated learning, and dynamic resource provisioning. This talk will explore how workflow technologies can become the connective tissue for collaborative, multi-site AI, presenting a model for distributed workflow management.
Combining Experiments, Simulations & Data in Geophysics
Anne Elster
NTNU: Norwegian University of Science and Technology, Norway
In this talk, I will highlight some of the ongoing work at the Centre for Geophysical Forecasting at NTNU, where we are combining detailed laboratory experiments with simulations and the generation of large datasets.
Time permitting, I will also highlight some of the AI and HPC capabilities of Norway’s new supercomputer, Olivia, and encourage discussion of what OpenAI’s Stargate announcement means for Norway and Europe.
Flow Computing — A New Way to Boost the Performance of CPUs for Parallel Functionalities
Martti Forsell
Flow Computing Oy, Finland
The performance development of multicore CPUs has long relied heavily on constant progress in silicon technology and on increasing the number of processor cores per chip. As Moore’s law has slowed significantly, this approach has led to diminishing returns, especially in server processors. Their core count is already high, and their architectures are unable to meet the requirements of efficient parallel computation—sufficient bandwidth, latency tolerance, cost-efficient synchronization, and a parallel programming methodology simple enough for average programmers. At the same time, the rapid evolution of application-specific accelerators, such as GPUs and NPUs, has remarkably increased their performance and indirectly also the need for flexible and programmable parallel computing solutions. This has effectively left CPUs as the weakest link in computation, with no obvious workaround. Based on decades of scientific work in academia and research institutes, we at Flow Computing have created a new technology to boost the performance of CPUs and simplify the programming of parallel functionalities.
In this presentation, we will look at certain aspects of the Flow Computing technology and discuss possible development directions, including server CPUs, instruction sets, and the IP licensing model. Simple examples and early performance results will be shown.
Inclusive and Efficient Foundation Models
Manish Gupta
Google Deep Mind, India
We begin by presenting recent advances in foundation models, which are giving rise to the hope that artificial general intelligence capability is achievable in the not-too-distant future. As we seek to advance Inclusive AI to tackle problems for billions of people in the context of the Global South, we present our work on improving the multilingual capabilities and cultural understanding of foundation models, and on improving their computational efficiency to enable scaling them to serve billions of people. We describe Matryoshka Representation Learning and Matryoshka Transformers (Matformers), which enable the elasticity of large models. We also present techniques like Tandem Transformers and High Recall Approximate Top-k Estimation (HiRE) to improve the efficiency of large language models.
AI-based ModSim for Computer Architectures and Applications
Adolfy Hoisie
Brookhaven National Laboratory, USA
The talk will describe recent advances in AI-based ModSim techniques developed at Brookhaven Lab. A novel deep learning-based performance modeling framework that learns high-dimensional and independent/orthogonal program and microarchitecture representations will be presented. Various practical uses of the accompanying PerfVec simulator, which yields a foundation model for performance, will be discussed for a spectrum of architectures and applications, with an emphasis on data-intensive workflows. PerfVec’s features and advantages will be shown in the context of existing methods and tools for ModSim.
On-sensor vision: computation opportunities in the image sensor
Paul Kelly
Imperial College London, UK
Several years ago a friend lent us a parallel computer, "SCAMP5". With 65,536 cores. Powered by a micro-USB cable. With a lens on the front - a "focal-plane sensor processor": a CMOS image sensor with a tiny (mostly analog!) processor at every photodiode. We played with it, and did some cool things. Now the friend, Prof Piotr Dudek, has invited us, with colleagues, to join him in designing the next generation of the device. Our role in the project is to drive the design with applications and provide a software toolchain that helps do the right thing in the right place. This talk is about how we are thinking about this design challenge, how we are trying to map the future of robot vision in order to frame the design. It’s also about how to create magic again: the SCAMP5 device is fantastically simple, yet incredibly powerful - a huge and rewarding challenge to creativity. I conclude by inviting your thoughts on what we should do!
Exploring Data Locality in SpGEMM: A Level-Based Implementation of Gustavson’s Algorithm on Multicore CPUs
Dane Lacey
University of Erlangen, Germany
Sparse matrix–matrix multiplication (C = A * B, SpGEMM) has long been a critical operation in many scientific applications and has received increasing attention in recent years due to its ubiquity in AI/ML workloads. Gustavson’s algorithm is the most widely used method for computing SpGEMM on multicore CPUs, but it incurs irregular accesses to intermediate structures. These access irregularities pose serious performance difficulties for multicore CPUs, which are optimized for regular access patterns. In this talk, I explore Gustavson’s SpGEMM algorithm, identify opportunities to improve access locality, and propose a level-based implementation to alleviate the problems arising from irregular accesses.
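To make the access pattern concrete, here is a minimal Python sketch of Gustavson’s row-wise formulation on CSR inputs (a generic illustration, not the level-based implementation discussed in the talk); the scattered, data-dependent updates to the per-row accumulator are exactly the irregular accesses that hurt locality on multicore CPUs.

```python
# Gustavson's SpGEMM, row by row, on CSR inputs (illustrative sketch only).
def spgemm_gustavson(A_ptr, A_idx, A_val, B_ptr, B_idx, B_val):
    C_ptr, C_idx, C_val = [0], [], []
    for i in range(len(A_ptr) - 1):
        acc = {}                                      # sparse accumulator for row i of C
        for jj in range(A_ptr[i], A_ptr[i + 1]):      # nonzeros a_ij of row i of A
            j, a_ij = A_idx[jj], A_val[jj]
            for kk in range(B_ptr[j], B_ptr[j + 1]):  # scatter a_ij * (row j of B)
                k = B_idx[kk]
                acc[k] = acc.get(k, 0.0) + a_ij * B_val[kk]
        for k in sorted(acc):                         # gather row i of C in column order
            C_idx.append(k)
            C_val.append(acc[k])
        C_ptr.append(len(C_idx))
    return C_ptr, C_idx, C_val


# Example: A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]] in CSR form.
print(spgemm_gustavson([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0],
                       [0, 1, 2], [1, 0], [4.0, 5.0]))
# -> ([0, 1, 3], [1, 0, 1], [4.0, 15.0, 8.0])
```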
The Reconfigurable Future of AI Accelerators
Tulika Mitra
National University of Singapore, Singapore
Reconfigurable accelerators have long held the promise of near-ASIC efficiency while retaining the programmability missing from fixed-function accelerators. Over the past decade, architectural and compiler breakthroughs, including key contributions from our research group on Coarse-Grained Reconfigurable Arrays (CGRAs) such as the HyCUBE and PACE architectures and the Morpher compiler framework, have significantly improved throughput, energy efficiency, and programmability. Despite these advances, the AI accelerator landscape remains largely dominated by specialized domain-specific accelerators such as the Google TPU. However, emerging trends in rapidly evolving AI algorithms, characterized by irregular sparsity and frequent model updates, highlight the unique advantage of reconfigurable fabrics. A paradigm shift from current static mapping approaches to dynamic, data-driven scheduling methods that fully leverage runtime reconfigurability could offer a distinct edge over GPUs and specialized AI accelerators. In this talk, I will outline the current state of the technology, examine the critical adoption barriers, and present a strategic roadmap that could make reconfigurable accelerators the default fabric for high-performance AI acceleration.
Architectural Support for Linear Algebra Acceleration: Bridging ISA Extensions and Programming Interfaces
José Moreira
IBM T. J. Watson Research Center, USA
The ever-increasing interest in Artificial Intelligence has generated significant demand for the acceleration of linear algebra operations, particularly matrix multiplication. Many modern processing elements, including CPUs, GPUs, and NPUs, include dedicated matrix units that are optimized for matrix multiplication. Diverse and often incompatible ISA extensions supporting these matrix units have emerged. On top of that, the most mature programming interface for dense linear algebra, the Basic Linear Algebra Subprograms (BLAS), is still mostly focused on scientific/technical computing and presents a mismatch to modern AI/ML frameworks. In this talk we will evaluate and propose directions for both matrix computing ISA extensions and a common programming interface.
Neural Net Guided Branch Prediction: Predicting Hard-to-Predict Branches with Minimal Hardware Cost
Luke Panayj
Imperial College London, UK
Branch prediction continues to be a critical bottleneck in general-purpose CPU performance. Pattern-based predictors such as TAGE have run up against limits in how accurate they can be, leaving precious accuracy gains on the table and capping how far future CPU generations can scale. Machine learning is an established approach to tackling the class of branches known as ‘Hard to Predict’ (H2Ps), as seen with BranchNet, a CNN that implements on-chip inference tailored only to predicting H2Ps. BranchNet is effective but expensive, requiring 32KB of on-chip storage on top of a regular TAGE predictor. This talk outlines ongoing work that uses machine learning model explanations to identify correlating branches in the global history and has TAGE filter for only these branches when predicting H2Ps, improving accuracy with minimal storage.
Falcon and Kameleon: A Chiplet-Based Architecture for Scalable CXL Memory Infrastructure and Accelerated Processing Near Memory
Il Park
Primemas, Korea
The performance of large-scale AI/ML and HPC workloads is increasingly bottlenecked by data supply from massive databases that exceed the capacity of direct-attached memory. While CXL memory disaggregation offers a path forward, conventional approaches relying on external CXL switches introduce a significant "Switch Tax" in the form of latency (> 550ns), increased power consumption, and higher Total Cost of Ownership (TCO). This presentation introduces a novel chiplet-based hardware architecture that overcomes these limitations, centered on the Falcon Hublet®. The Falcon is a versatile chiplet SoC integrating a CXL 3.0 controller, DDR4/DDR5 memory controllers, an embedded ARM CPU cluster, and high-speed Die-to-Die (D2D) interconnects. Our contributions are threefold. First, we present the world’s first switchless pooled memory architecture, created by composing multiple Falcon chiplets on a single substrate to form a coherent D2D fabric. This design eliminates external CXL switches for 1-10TB memory pools, dramatically reducing memory access latency to under 200ns—a key enabler for hyperscale deployments. Second, we detail a high-capacity Just a Bunch of Memory (JBOM) solution built from Falcon-based Add-In Cards (AICs). This architecture offers superior memory density per AIC, reducing the "Switch Tax" when scaling JBOM deployments to configurations of hundreds of terabytes. Third, we introduce the Kameleon® FPGA chiplet, a dynamically reconfigurable companion to Falcon. Paired together, Falcon and Kameleon unlock powerful Processing-Near-Memory (PNM) capabilities, enabling intelligent tiered-memory solutions. This heterogeneous approach provides the foundational hardware to realize the full promise of CXL, directly addressing the demands of workloads like graph databases, vector search, and large-scale AI to usher in a new era of composable infrastructure that is both high-performance and cost-effective.
A Hardware Perspective on AI & HPC: Trends, Trade-Offs, and Promising Solutions
Stefania Perri
University of Calabria, Italy
Artificial Intelligence and High-Performance Computing are no longer separate worlds — they are converging into a single, demanding ecosystem where hardware is both the enabler and the bottleneck. From training trillion-parameter AI models to running exascale scientific simulations, today’s workloads push processors, memory systems, and interconnects to their physical and architectural limits. Consequently, efficient and innovative hardware designs play a crucial role in accelerating Deep Learning on High-Performance Computing systems and data centers, providing the computational power needed to process vast amounts of data and train complex models. With the growing demand to run Deep Learning models directly on edge devices such as embedded systems, mobile phones, and IoT smart devices, energy-efficient hardware solutions have become increasingly important.
This talk offers a hardware-focused journey through state-of-the-art accelerators, exploring how they are evolving under these pressures. We will examine the fundamental trade-offs that guide system design — performance vs. energy efficiency, generality vs. specialization, scalability vs. cost — and how they influence architectural choices across AI and HPC workloads.
Special attention will be devoted to in-memory computing, an emerging paradigm that integrates computation directly into memory arrays, promising to address the long-standing “memory wall” and drastically reduce data movement costs. Through concrete examples and forward-looking insights, the talk will highlight open challenges, key trends, and promising hardware directions that may define the next generation of AI-HPC systems.
Evolutionary Policy Optimization
Keshav Pingali
University of Texas at Austin, USA
A key challenge in reinforcement learning (RL) is managing the exploration-exploitation trade-off without sacrificing sample efficiency. Policy gradient (PG) methods excel in exploitation through fine-grained, gradient-based optimization but often struggle with exploration due to their focus on local search. In contrast, evolutionary computation (EC) methods excel in global exploration, but lack mechanisms for exploitation. To circumvent these limitations, Evolutionary Policy Optimization (EPO) integrates neuroevolution with policy gradient methods. EPO leverages the exploration capabilities of EC and the exploitation strengths of PG, offering an efficient solution to the exploration-exploitation dilemma in RL. EPO is evaluated on the Atari Pong and Breakout benchmarks. Experimental results show that EPO improves both policy quality and sample efficiency compared to standard PG and EC methods, making it effective for tasks that require both exploration and local optimization.
Algorithm Design for Emerging Architectures
Francesco Silvestri
University of Padova, Italy
To respond to the constantly increasing demand for computational resources, particularly for machine learning tasks, several emerging architectures have been proposed, including tensor cores and processing-in-memory. In recent years, several works have presented computational models and algorithmic techniques for designing and analyzing efficient algorithms that leverage such architectures. In this talk, we will present some of these results and some case studies.
Designing for Trust, Transparency, and Efficiency in Scientific Computing: HPC Nondeterminism, Scalable Checkpointing, and AI-Driven Workflows
Michela Taufer
University of Tennessee, Knoxville, USA
Today, scientific computing workflows operate at scales where nondeterminism, opaque AI decisions, and massive data volumes are no longer exceptions—they are the norm. In this talk, I present concrete methods to address these challenges and improve the trustworthiness, transparency, and efficiency of scientific discovery. First, I show how graph-based analysis can reveal the sources and spatial footprint of nondeterminism in large-scale HPC simulations, enabling targeted debugging and reducing wasted computation. Second, I demonstrate a scalable checkpointing framework that pinpoints when and why results diverge, capturing intermediate states efficiently through data deduplication and parallel comparison, thereby overcoming the limitations of traditional full checkpoints. Third, I introduce fine-grained provenance frameworks for AI-driven workflows, exposing hidden model-selection logic and enabling reproducibility and explainability of results. Together, these methods provide practical strategies for detecting, tracing, and explaining divergence in code and data over time.
How many bits do we need? HPC lessons from ML hardware and software
Peter Thoman
University of Innsbruck, Austria
In machine learning, and particularly in inference, the past few years have seen a substantial reduction in the number of bits required to store model weights at sufficient fidelity. This has, in turn, led to hardware being optimized for, and focused on, arithmetic operations on ever-smaller types.
In this talk, the question of whether, and to what extent, any of these developments can also benefit more traditional HPC applications will be investigated. I will present our recent work in end-to-end data compression for GPU cluster computations, with marked benefits in both local and network data transfer volumes. This also includes a succinct API that allows developers to specify the compression method or fidelity required for specific data buffers at a high level of abstraction.
As a more forward-looking topic, I will discuss the idea of using these same abstract fidelity requirements to also adjust the data types used for computation. This approach is motivated by recent results showing that, on modern Nvidia GPUs, emulating FP64 matrix multiplication using lower-precision operations can actually result in higher performance than using native FP64 hardware.
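To illustrate the emulation idea, the numpy sketch below follows the general split-and-accumulate approach (in the spirit of Ozaki-style splitting schemes, not necessarily the exact method behind the cited results): each FP64 operand is written as a sum of lower-precision slices, the slice-pair products are computed as separate GEMMs, and the partial results are accumulated at higher precision. Here numpy stands in for the GPU matrix units, so the sketch only demonstrates the decomposition, not the hardware speedup.

```python
import numpy as np


def split_fp32(M):
    """Split an FP64 matrix into a float32 head and a float32 tail."""
    hi = M.astype(np.float32)
    lo = (M - hi.astype(np.float64)).astype(np.float32)
    return hi, lo


def emulated_fp64_matmul(A, B):
    A_hi, A_lo = split_fp32(A)
    B_hi, B_lo = split_fp32(B)
    f64 = np.float64
    # Three slice-pair GEMMs accumulated in FP64; on a GPU these would be the
    # low-precision matrix-unit GEMMs. The A_lo @ B_lo term is dropped, as its
    # relative contribution is only on the order of 2**-48.
    return (A_hi.astype(f64) @ B_hi.astype(f64)
            + A_hi.astype(f64) @ B_lo.astype(f64)
            + A_lo.astype(f64) @ B_hi.astype(f64))


rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
reference = A @ B                                               # native FP64
naive32 = (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)

err = lambda C: np.max(np.abs(C - reference)) / np.max(np.abs(reference))
print("plain FP32 GEMM error:", err(naive32))
print("emulated GEMM error  :", err(emulated_fp64_matmul(A, B)))
# The emulated product recovers most of the accuracy lost by the plain FP32 GEMM.
```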
RISC-V HPC for Europe: Unique Chance or Dead End?
Carsten Trinitis
Technical University of Munich, Germany
The European Commission is pushing a significant amount of money into building a RISC-V-based infrastructure for HPC. The DARE project, which started in March 2025, is an important example. The goal is to build an entire RISC-V-based HPC infrastructure, covering both hardware and software. The talk will give an overview of the project, its work packages, and the partners’ roles within DARE.
Within DARE, TUM’s role is to contribute to the debugging toolchain by implementing a parallel RISC-V debugger based on GDB. The talk will report on first experiences and results, as well as on difficulties faced at this early stage. As the audience comes from both scientific computing and AI, the talk will also discuss to what extent the project should aim at more classical HPC applications, such as numerical computations, or whether it makes more sense to focus on Artificial Intelligence and Machine Learning and to try to build up a powerful infrastructure there.
pyGinkgo: A High-Performance Sparse Linear Algebra Library for Python
Yu-Hsiang Tsai
Technical University of Munich, Germany
pyGinkgo brings Ginkgo, a high-performance sparse linear algebra library, to Python, a popular choice for many deep learning applications. Ginkgo is a portable, high-performance library designed for sparse linear algebra with a focus on GPU optimization. It implements several algorithms, including the Conjugate Gradient method (CG), the Generalized Minimal Residual method (GMRES), and Algebraic Multigrid (AMG), in highly efficient ways. pyGinkgo provides a Pythonic interface, built with pybind11, for seamless interoperability. We benchmark pyGinkgo’s performance against state-of-the-art Python libraries, including SciPy, CuPy, PyTorch, and TensorFlow. Results across hardware from different vendors show that pyGinkgo outperforms existing Python tools in sparse matrix-vector (SpMV) multiplication and iterative solvers.
It was the best of times; it was the worst of times
Henry Tufo
University of Colorado, Boulder, USA
HPC has finally found its killer application: AI. Current and proposed investments in data centers, equipment, and software to monetize AI and birth the first digital superintelligences are several orders of magnitude greater than the traditional modeling and simulation market that has been HPC's bread and butter for decades. Moreover, many of the tools, techniques, etc. developed over the last several decades by the HPC community are directly applicable to AI and provide a solid experiential foundation for developing novel AI-centric tools and techniques. This should be the golden age of HPC. However, old HPC hands are entrenched and seem unwilling to broaden their horizons, the AI upstarts think that they know everything, often reinventing the wheel and calling it something snazzy, and the traditional technology providers HPC has relied on since the attack of the killer micros in the early 1990s have been unable to compete in the marketplaces that birthed AI. This dichotomy is perplexing. And unfortunate, as much of the collective HPC wisdom gathered over the past 60 or so years is germane to the customization question at hand.
Smart Network Switches for HPC
Josef Weidendorfer
Technical University of Munich, Germany
For cloud computing, there is a large market for smart network devices - both NICs and switches - which allow customer-isolation strategies to be offloaded from the host CPUs. In HPC, such devices are rarely used today.
LRZ recently looked at the benefits of smart network devices for HPC. In this talk, I will concentrate on switches. I will introduce the high-level P4 programming model and show how it can be used to offload MPI reductions onto a smart switch, together with the resulting benefits, which should also map directly to benefits for AI model training.
Precision-Aware Accumulation: Hardware–Software Co-Design for Block Floating Point Arithmetic
Filip Wojcicki
Imperial College London, UK
Accurate floating-point accumulation remains a fundamental challenge in scientific and AI workloads, where summing values of widely varying magnitudes can lead to severe loss of significance. Compensated summation techniques mitigate these errors, but their relative benefits under custom numerical formats and hardware constraints are poorly understood. We present a combined software and hardware framework for evaluating and implementing accumulation with block floating-point types, which generalize the OCP Microscaling (MX) formats to arbitrary exponent and mantissa widths. The software library provides bit-accurate emulation of MX-like arithmetic, while the hardware library offers parameterizable, high-performance SystemVerilog IP cores for various compensated summation algorithms. Building on these, we introduce an automated design-space exploration (DSE) tool that jointly evaluates accuracy, resource usage, and performance, suggesting Pareto-optimal configurations tailored to user-defined priorities. We demonstrate our approach with an example use case involving accumulation in a softmax operation within a transformer model, showing that our method identifies configurations that outperform uncompensated FP32 summation in throughput and maximum achievable frequency. The proposed framework lays the foundation for systematic exploration of numerical trade-offs and hardware-aware optimization in future FPGA-based accelerators.
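For readers unfamiliar with compensated summation, the short Python example below shows Kahan's classic algorithm, the simplest member of the family of techniques the framework evaluates (the block-floating-point and SystemVerilog aspects of the work are not captured by this scalar sketch): a running correction term prevents small addends from being swamped by a large partial sum.

```python
def kahan_sum(values):
    total = 0.0
    c = 0.0                    # running compensation for lost low-order bits
    for x in values:
        y = x - c              # corrected addend
        t = total + y          # big + small: low-order bits of y may be lost
        c = (t - total) - y    # recover what was lost for the next iteration
        total = t
    return total


# Summands of widely varying magnitudes: naive summation loses the small terms.
data = [1e16] + [1.0] * 1000 + [-1e16]
print(sum(data))        # naive left-to-right summation: 0.0
print(kahan_sum(data))  # compensated summation: 1000.0
```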
Techniques in Achieving Better Offset Data Prefetching
Jacky Wong
Imperial College London, UK
Modern processors rely heavily on data prefetching to bridge the growing latency gap between computation and memory access. Among prefetching strategies, offset prefetching, in which prefetches are issued at a delta from the latest observed access, with the delta chosen from those that would have achieved high coverage over the recent address-stream history, has proven effective, and many state-of-the-art data prefetchers utilize this technique.
In this talk, I will present Caerus - an offset prefetcher that employs PC-local filtering on multiple non-overlapping, globally derived offsets. Caerus extracts multiple offsets from the global memory trace while avoiding redundant prefetches by isolating the training of the current offset from offsets that are actively used for prefetching. The accuracy of the extracted offsets is tracked per PC via a forward analysis of whether each offset will cover future accesses for that PC. These techniques allow Caerus to achieve a performance improvement of 19.6
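As a rough sketch of the underlying idea of offset prefetching (a generic Python illustration, not Caerus itself; the candidate offsets and history length are arbitrary choices), the snippet below scores each candidate offset by how often a prefetch at that offset would have covered a recent access, and then issues prefetches using the best-scoring offset.

```python
from collections import deque

CANDIDATE_OFFSETS = [1, 2, 3, 4, 8, 16]   # in cache-line units (illustrative)
HISTORY_LEN = 256


class OffsetPrefetcher:
    def __init__(self):
        self.history = deque(maxlen=HISTORY_LEN)        # recent line addresses
        self.scores = {o: 0 for o in CANDIDATE_OFFSETS}
        self.best = CANDIDATE_OFFSETS[0]

    def access(self, line_addr):
        recent = set(self.history)
        for off in CANDIDATE_OFFSETS:
            # Would a prefetch at this offset have covered the current access?
            if line_addr - off in recent:
                self.scores[off] += 1
        self.history.append(line_addr)
        self.best = max(self.scores, key=self.scores.get)
        return line_addr + self.best                    # line to prefetch next


pf = OffsetPrefetcher()
for addr in [100, 103, 106, 109, 112]:                  # a stride-3 access stream
    print(f"access {addr} -> prefetch {pf.access(addr)}")
```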
Humble Programming, Revisited
Albert-Jan Yzelman
Huawei, Zurich Research Lab, Switzerland
The concept of the humble programmer, coined by Dijkstra in the 1970s, fully acknowledges the complexity of programming and deems it unsuitable for the human mind. In line with this realisation, humble programming frameworks favor programmer productivity over performance - a paradigm that has achieved significant success these past two decades. Demonstrably, however, this success has come at the cost of orders-of-magnitude performance losses. This talk argues that automated optimisation and parallelisation can achieve the same level of performance as painstakingly expert-tuned programs, provided the humble framework is designed carefully enough.
We first summarise some of our recent results regarding the theoretical hardness of scheduling general computations. We then describe our approach of leveraging well-known optimisation strategies from high-performance (sparse) linear algebra, and generalise their applicability through the Algebraic Programming (ALP) paradigm. We show how the "think-like-a-vertex" programming model for machine learning on graphs, as first popularised by Pregel, can be captured by the ALP paradigm. We then construct a mechanism that automatically translates Pregel programs to ALP, thus providing evidence that some humble programming models can be mapped to another, more foundational one without penalties in terms of performance or scalability. Finally, we consider the aspect of interoperability, i.e., how the myriad existing software deployments in AI and elsewhere may still take advantage of these insights - and achieve better performance and better scalability to boot.
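As a small illustration of this mapping idea (plain numpy, not ALP's actual API nor the Pregel-to-ALP translator described above), the sketch below expresses a think-like-a-vertex breadth-first search as repeated vector-matrix products with masking, which is the algebraic form such frameworks can then optimise and parallelise.

```python
import numpy as np

# Adjacency matrix of a small directed graph: A[i, j] = 1 iff there is an edge i -> j.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

frontier = np.array([1, 0, 0, 0], dtype=bool)   # start BFS from vertex 0
visited = frontier.copy()
level = np.full(A.shape[0], -1)
level[frontier] = 0

step = 0
while frontier.any():
    step += 1
    # One Pregel superstep (all frontier vertices send along their out-edges)
    # becomes one vector-matrix product; 0/1 integer arithmetic thresholded
    # with "> 0" mimics the Boolean semiring.
    reached = (frontier.astype(int) @ A) > 0
    frontier = reached & ~visited               # mask out already-visited vertices
    visited |= frontier
    level[frontier] = step

print(level)   # BFS levels from vertex 0: [0 1 1 2]
```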
Optimizing Segmented Operations through Matrix Multiplications: Insights from the Ascend AI Accelerator
Anastasios Zouzias
Huawei, Zurich Research Lab, Switzerland
Specialized computational units that perform small matrix multiplications as primitive operations are typically present in modern accelerators. However, these units are often underutilized for many fundamental operations besides dense matrix multiplications. The analysis of algorithms for such architectures has stagnated due to the lack of a theoretical model of computation that captures their heterogeneous compute characteristics. In this talk, I will first present the MMV-RAM model, a computational model tailored to matrix multiplication accelerators. MMV-RAM judiciously extends the Vector-RAM model with an additional matrix multiplication unit. Second, using the Ascend AI accelerator as a case study, I will present MMV-RAM algorithms for a few important parallel primitives, focusing on segmented operations (scan and reductions).
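To give a flavour of how a matrix multiplication unit can drive such primitives (a toy numpy sketch of one building block, not the MMV-RAM algorithms themselves; handling segment flags and the block-combination step from the talk requires additional work), the snippet below computes an inclusive prefix sum by multiplying each block of values with a lower-triangular matrix of ones, exactly the kind of small dense product a matrix unit executes natively.

```python
import numpy as np

b = 8                                    # block size of the (hypothetical) matrix unit
L = np.tril(np.ones((b, b)))             # L[i, j] = 1 for j <= i, else 0


def blocked_inclusive_scan(x):
    """Prefix sum of x via b x b matrix products plus per-block carries."""
    x = np.asarray(x, dtype=float)
    pad = (-len(x)) % b
    blocks = np.pad(x, (0, pad)).reshape(-1, b)
    local = blocks @ L.T                              # in-block scans on the matrix unit
    carries = np.cumsum(local[:, -1]) - local[:, -1]  # exclusive sum of block totals
    return (local + carries[:, None]).reshape(-1)[:len(x)]


x = np.arange(1, 21)
print(blocked_inclusive_scan(x))   # 1, 3, 6, 10, ... (matches np.cumsum(x))
```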
The talk will be based on Segmented Operations using Matrix Multiplications.
Joint work with Aleksandros Sobczyk and Giuseppe Sorrentino.