Paper: Modular Mojo vs CUDA/HIP: memory-bound bandwidth results, compute-bound gaps on H100/MI300A

Posted by – December 23, 2025

This talk follows a summer SULI internship at Oak Ridge National Laboratory evaluating whether Mojo can reduce the “two-language tax” in GPU computing: keep Python-level ergonomics while staying close to CUDA/HIP performance, and stay portable when moving between NVIDIA and AMD hardware. The focus is practical performance portability using real kernels on NVIDIA H100 and AMD MI300A. https://www.modular.com/mojo

Mojo is built on MLIR (the LLVM project’s Multi-Level Intermediate Representation) and keeps a Python-like surface syntax while targeting systems-level code generation, memory safety, and tight control over types, layouts, and parallel execution. In the current stack, GPU work is still fairly low level: you write explicit kernels, reason about thread/block structure, and manage host↔device memory and synchronization through Mojo’s GPU APIs (often used inside MAX custom operations). “Portable” therefore does not mean automatic.
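To make the thread/block reasoning concrete, here is a minimal sketch in plain Python. It is not Mojo and does not use any Mojo GPU API; it only emulates, on the CPU, how a 1D launch maps block and thread indices to array elements and why a bounds check is needed when the array size is not a multiple of the block size. The names `launch_1d` and `vector_add_kernel` and the block size of 256 are illustrative assumptions.

```python
def ceil_div(a, b):
    """Integer ceiling division, used to size the grid."""
    return (a + b - 1) // b


def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body executed by one emulated GPU thread."""
    i = block_idx * block_dim + thread_idx  # global element index
    if i < len(out):  # guard: the last block may be partially full
        out[i] = a[i] + b[i]


def launch_1d(kernel, n, block_dim, *args):
    """Serially emulate a 1D grid launch (hypothetical helper, CPU only)."""
    grid_dim = ceil_div(n, block_dim)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)
    return grid_dim


n = 1000
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
out = [0.0] * n
grid_dim = launch_1d(vector_add_kernel, n, 256, a, b, out)
print(grid_dim, out[:3])
```

With 1000 elements and 256 threads per block, the launch needs four blocks, and the last 24 emulated threads fall outside the array and do nothing. On a real GPU the same index arithmetic applies, but the threads run in parallel and the buffers live in device memory.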

The poster ports four scientific workloads, split into memory-bound and compute-bound behavior: BabelStream-style vector ops and a 7-point stencil over a 3D buffer, plus miniBUDE and a Hartree-Fock kernel with multiple atomic operations. For the bandwidth-driven kernels, Mojo reaches competitive memory throughput: on H100 it can beat a CUDA baseline on several vector routines, while dot is harder to match because the CUDA version relies on device-specific tuning. On MI300A, Mojo largely tracks C++/HIP for these memory-bound kernels, with similar bandwidth per routine.
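For readers unfamiliar with how bandwidth figures for BabelStream-style kernels are usually derived, the following Python sketch shows the conventional accounting: bytes moved are the number of arrays read or written per element, times the element count, times the element size, divided by the kernel time. The function name, the array length, and the timing value are hypothetical and are not measurements from the poster.

```python
# Arrays read plus arrays written per element for each BabelStream-style kernel.
ARRAYS_TOUCHED = {"copy": 2, "mul": 2, "add": 3, "triad": 3, "dot": 2}


def effective_bandwidth_gbs(kernel, n_elements, seconds, bytes_per_element=8):
    """Return effective bandwidth in GB/s (1 GB = 1e9 bytes).

    bytes_per_element defaults to 8, i.e. double precision.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bytes_moved = ARRAYS_TOUCHED[kernel] * n_elements * bytes_per_element
    return bytes_moved / seconds / 1e9


# Hypothetical example: a triad over 2**25 doubles that took 0.5 ms.
n = 2**25
bw = effective_bandwidth_gbs("triad", n, 0.0005)
print(f"triad: {bw:.1f} GB/s")
```

Because every language in such a comparison moves the same number of bytes for a given kernel, differences in this figure come down to how efficiently the generated code issues loads and stores, which is why these kernels are a useful first test of a new GPU compiler.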

Compute-bound kernels are where compiler maturity shows up. In miniBUDE, performance sits between unoptimized and heavily optimized CUDA as per-thread work (PPWI) rises, suggesting Mojo still needs more aggressive fast-math and scheduling for arithmetic intensity. For Hartree-Fock, atomics can look strong on small H100 cases but degrade sharply at the largest size; on MI300A, atomics may be far slower and the biggest test can fail, highlighting gaps in atomic codegen and runtime behavior at scale.
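The effect of PPWI can be illustrated with a deliberately simplified Python model. It assumes PPWI means poses per work-item, that each work-item loads each atom pair's data once, and that it reuses that data across all of its poses. The function name and the problem sizes are hypothetical; this is a toy count of loads and evaluations, not the miniBUDE kernel itself.

```python
def ppwi_model(n_poses, n_pairs, ppwi):
    """Toy model: return (work_items, pair_loads, evaluations_per_load).

    Assumes every pose is evaluated against every atom pair, and that a
    work-item loads each pair once and reuses it for all `ppwi` poses.
    """
    if ppwi <= 0 or n_poses % ppwi != 0:
        raise ValueError("ppwi must be positive and divide n_poses")
    work_items = n_poses // ppwi
    pair_loads = work_items * n_pairs   # loads shrink as ppwi grows
    pair_evals = n_poses * n_pairs      # arithmetic work stays fixed
    return work_items, pair_loads, pair_evals / pair_loads


# Hypothetical sizes, chosen only to show the trend.
for ppwi in (1, 2, 4, 8):
    items, loads, reuse = ppwi_model(65536, 1000, ppwi)
    print(f"PPWI={ppwi}: work-items={items}, loads={loads}, evals/load={reuse}")
```

In this model the arithmetic per loaded value grows linearly with PPWI, so higher PPWI shifts the kernel further toward compute-bound behavior, where instruction scheduling and fast-math choices matter more than memory throughput.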

Filmed at Supercomputing SC25 in St. Louis, the takeaway is that Mojo already looks credible for memory-bound HPC and AI-adjacent kernels where bandwidth dominates, while compute-heavy and atomic-heavy code still needs iteration. The follow-on plan mentioned here, building a BLAS-style library in Mojo while benchmarking best-case paths on NVIDIA and AMD, maps well to how performance-portable GPU stacks usually mature.

source https://www.youtube.com/watch?v=HKqeMg9NZ8s