AI Race: Can Huawei Close The AI Gap?
Huawei Just Raised The Stakes
For those just tuning in, NVIDIA has been driving much of the AI infrastructure conversation. Huawei has been quietly building its own AI stack, from silicon to systems to AI models. CloudMatrix ups Huawei’s play in the market considerably. The CloudMatrix-Infer system is no ordinary cluster. It brings together 384 Ascend 910C NPUs and 192 Kunpeng CPUs, interconnected via a Unified Bus (UB) whose ultra-high bandwidth and low latency challenge NVIDIA NVLink. CloudMatrix impressed us with its:
- System-level integration optimized for inferencing at scale. Huawei is proving out its tightly coupled vertical stack (Ascend + CANN + MindSpore + UB). The system supports a disaggregated prefill-decode-caching architecture, INT8 quantization, microbatch pipelining, and fused operators, all of which boost inferencing performance.
- Chipset advantage of Ascend 910C and Kunpeng. Although the single-chip performance for Ascend 910C is about 33% lower than NVIDIA’s Blackwell, by networking five times more chips via a 2.8-Tbps UB, Huawei has turned a weakness into dominance. In addition, each node pairs Kunpeng CPUs with Ascend NPUs, and the CPUs handle control-plane tasks like distributed resource scheduling and fault recovery. This hybrid design ensures that non-AI workloads like data preprocessing and network management don’t bottleneck NPU efficiency.
- Unified Bus vs. NVLink: topology as a weapon. While NVLink focuses on ultra-fast point-to-point GPU connections, Huawei’s UB implements an all-to-all interconnect topology by using 6,912 optical modules to weave a “flat” network of all 384 NPUs and 192 CPUs across 16 racks. This eliminates hierarchical hops and enables direct communication.
- CANN support of massive-scale expert parallelism (EP320). This allows one expert per NPU die — something even CUDA ecosystems struggle with. While NVIDIA’s CUDA dominates global AI development, Huawei’s Compute Architecture for Neural Networks (CANN) stack has quietly matured into a viable alternative. Now at version 7.0, CANN mirrors CUDA’s layered structure across driver, runtime, and libraries, optimized for Ascend.
- Throughput over raw speed for performance advantage. CloudMatrix delivers prefill throughput of 6,688 tokens per second per NPU and decode throughput of 1,943 tokens per second per NPU, better than NVIDIA H100 on similar loads.
- Developer-oriented updates. Recent improvements to the PyTorch compatibility layer, as noted in the paper, suggest that Huawei is listening to its early developer base.
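To make the per-NPU throughput figures above more concrete, here is a rough capacity-sizing sketch for a disaggregated setup, where prefill and decode run on separate NPU pools. It is a back-of-the-envelope illustration, not Huawei's scheduling logic: the `size_pools` helper and the workload numbers are hypothetical, and it assumes the quoted per-NPU rates hold steadily and scale linearly, which real deployments will not match exactly once model size, sequence length, and latency targets come into play.

```python
import math

# Per-NPU throughput figures quoted in the list above (tokens per second).
PREFILL_TOKENS_PER_SEC_PER_NPU = 6_688
DECODE_TOKENS_PER_SEC_PER_NPU = 1_943
NPUS_PER_CLOUDMATRIX = 384  # Ascend 910C NPUs in one CloudMatrix system


def size_pools(prompt_tokens_per_sec, output_tokens_per_sec):
    """Return (prefill_npus, decode_npus) for a steady-state load.

    Hypothetical helper: assumes linear scaling and ignores caching,
    batching effects, and latency targets.
    """
    prefill = math.ceil(prompt_tokens_per_sec / PREFILL_TOKENS_PER_SEC_PER_NPU)
    decode = math.ceil(output_tokens_per_sec / DECODE_TOKENS_PER_SEC_PER_NPU)
    return prefill, decode


# Hypothetical workload, chosen only for illustration.
prefill_npus, decode_npus = size_pools(1_000_000, 200_000)
total_npus = prefill_npus + decode_npus
fits = total_npus <= NPUS_PER_CLOUDMATRIX
print(prefill_npus, decode_npus, total_npus, fits)  # 150 103 253 True
```

Under these assumptions, decode needs roughly two-thirds as many NPUs as prefill even though it handles one-fifth of the tokens, which is one way to see why sizing the two phases independently matters.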
Huawei is optimizing for massive mixture of experts (MoE) models, bandwidth-first design, and multiphase large-language-model inference at scale. If I were deploying a 700B-plus model in a region where CloudMatrix was available, I’d seriously consider using CANN. The architecture is purpose-built for next-gen inference.
Watch Huawei Closely, Even If You Still Lean On NVIDIA
Huawei isn’t just catching up with its CANN stack and the new CloudMatrix architecture; it’s redefining how AI infrastructure works. I believe winning the AI race isn’t just about faster chips. It also includes delivering the tools developers need to build and deploy large-scale models. As someone who’s built applications (in my past life), I’d rely on CUDA’s mature, frictionless ecosystem. I can run PyTorch and TensorFlow code with minimal tweaks, tap into global community support, and use robust tooling that just works. Huawei’s CANN software development kit shows potential — but it still feels early. PyTorch and TensorFlow adapters exist, but migration takes effort, and the community remains small. It’s not turnkey yet, and that slows developer confidence.
Huawei Has Its Work Cut Out
Despite the gains, Huawei has lots of work to do to improve:
- Power efficiency. CloudMatrix consumes 3.9 times more power than NVIDIA’s GB200. This may be viable in China, where power is abundant and less expensive, but it’s a big hurdle in regions where energy is expensive or carbon targets are strict. Optimization is a must.
- Ecosystem maturity. While CANN is making progress, CUDA still leads in documentation, third-party libraries, and the size of its global developer community.
- Migration friction. Tools and adapters exist — but real-world code migration still takes effort. This is a pain point for teams that need fast iteration cycles.
- Community trust. Open-source engagement and hands-on support must scale. Especially outside Asia, trust and familiarity favor NVIDIA. Some companies and public-sector groups may not be legally able to use Huawei given its relationship with the Chinese government. For the Chinese public sector, however, Huawei is already the de facto choice.
We see these items as solvable, but they’ll take time and focus.
Final Thought
NVIDIA’s market share in China is facing challenges. With ongoing geopolitical uncertainty, diversifying infrastructure is no longer optional; it’s a strategic imperative. Tech leaders should accelerate their localization strategies, because infrastructure is critical to AI success.
To dive deeper into your AI strategy, set up an inquiry or guidance session with Charlie Dai (AI cloud) or Naveen Chhabra (AI infrastructure).
