PARTICLES 2025

Performance-Portable Aspherix: Scaling DEM Simulations Across CPUs and GPUs

  • Kwakkel, Marcel (DCS Computing GmbH)
  • Goniva, Christoph (DCS Computing GmbH)
  • Kloss, Christoph (DCS Computing GmbH)

Porting Discrete Element Method (DEM) software to diverse hardware architectures poses significant challenges, particularly when adapting CPU-optimized algorithms for efficient GPU execution. Aspherix has run efficiently on CPUs for over a decade, but its original algorithms were designed for sequential or multi-core execution and rely on cache efficiency and shared memory. GPUs, in contrast, require a different computational approach that emphasizes fine-grained parallelism, efficient memory access patterns, and minimal synchronization overhead. Adapting these algorithms required substantial restructuring to fully utilize GPU hardware while maintaining performance on CPUs. This work presents a performance-portable implementation of Aspherix that runs on CPUs as well as on NVIDIA, AMD, and Intel GPUs. The system dynamically selects the execution strategy best suited to the available hardware, so that it scales across different platforms. Key challenges addressed include reworking memory management to suit the distinct memory hierarchies and access patterns of GPUs, and optimizing parallelism for both multi-core CPUs and massively parallel GPUs. This approach allows Aspherix to scale efficiently from a single node to large, distributed multi-node environments, providing flexibility across a range of computational resources. A recent benchmark study compared nine widely used open-source DEM frameworks, evaluating both simulation accuracy and computational efficiency on a system equipped with a CPU and a GPU. We compare Aspherix against these benchmark results to assess its performance across different hardware configurations. Our results highlight Aspherix’s scalability across multi-node compute clusters, and we outline further work to enhance its computational efficiency.