Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, an accelerated inference version of its trillion-parameter flagship model. The company claims that the new version achieves inference speeds exceeding 1,000 tokens per second on a standard server consisting of eight general-purpose GPUs, with a peak demo speed approaching 1,200 tokens per second.
The focus of this update is inference efficiency rather than a new model. Unlike solutions that rely on custom chips, Xiaomi emphasizes general-purpose hardware, with speed-ups achieved through software and model optimizations. This could further lower the barrier to high-speed deployment of large models.
Two technologies drive acceleration
Xiaomi relied primarily on two technologies. The first is FP4 quantization. The company compressed the expert layers, which account for the majority of the model's parameters, to 4-bit precision while keeping the rest of the model at higher precision. This reduces memory usage and bandwidth pressure, thereby improving inference speed.
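To make the memory effect of the FP4 step concrete, here is a rough back-of-the-envelope sketch. Only the trillion-parameter figure comes from the article; the 90% expert share and the 16-bit precision for the remaining layers are hypothetical placeholders, and the small overhead of quantization scale factors is ignored.

```python
def weight_bytes(total_params, expert_fraction, expert_bits, other_bits):
    """Approximate weight storage in bytes for a mixture-of-experts model."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    return (expert_params * expert_bits + other_params * other_bits) / 8


TOTAL_PARAMS = 1e12        # "trillion-parameter", per the article
EXPERT_FRACTION = 0.9      # hypothetical; the article only says "the majority"
HIGH_PRECISION_BITS = 16   # hypothetical precision for the non-expert layers

# Same model, with the expert layers stored at 16 bits versus 4 bits.
baseline = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, 16, HIGH_PRECISION_BITS)
quantized = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, 4, HIGH_PRECISION_BITS)

print(f"16-bit everywhere: {baseline / 1e12:.2f} TB")
print(f"4-bit experts:     {quantized / 1e12:.2f} TB")
print(f"reduction:         {baseline / quantized:.1f}x")
```

Under these assumptions the weights shrink from about 2 TB to about 0.65 TB. Since every generated token requires reading weights from GPU memory, a smaller footprint translates directly into less bandwidth pressure.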
The second is DFlash speculative decoding. Traditional speculative decoding typically has a smaller model predict a few tokens, which a larger model then verifies in parallel. DFlash instead proposes a whole block of tokens at once for the main model to verify. On coding tasks, the main model accepts an average of 6.3 out of 8 candidate tokens per round.
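The verification step can be illustrated with a generic greedy scheme. This is a minimal sketch of how block verification commonly works in speculative decoding, not DFlash's actual algorithm; the function name, the token IDs, and the assumption that the main model contributes one token of its own per round are all illustrative.

```python
def verify_block(draft_tokens, main_model_tokens):
    """Greedy verification of one drafted block.

    main_model_tokens[i] is the token the main model itself would choose at
    position i, given the prefix plus draft_tokens[:i]. It has one more entry
    than draft_tokens. Returns the tokens actually emitted this round.
    """
    emitted = []
    for i, drafted in enumerate(draft_tokens):
        if drafted != main_model_tokens[i]:
            # First mismatch: keep the main model's token and stop.
            emitted.append(main_model_tokens[i])
            return emitted
        emitted.append(drafted)
    # Every drafted token was accepted; the main model adds one extra token.
    emitted.append(main_model_tokens[len(draft_tokens)])
    return emitted


draft = [11, 12, 13, 14, 15, 16, 17, 18]    # block of 8 candidate tokens
main = [11, 12, 13, 14, 15, 16, 99, 0, 0]   # main model disagrees at index 6
print(verify_block(draft, main))
```

In this example six of the eight candidates are accepted and the main model supplies the seventh token, so a single verification pass yields seven tokens instead of one. An average acceptance of 6.3 out of 8 implies a similar gain per pass, before accounting for the cost of drafting.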
Xiaomi and its inference partner TileRT have also optimized the execution process. The idea is to keep computation resident on the GPU, reducing the overhead incurred when operators are launched one at a time.
Speed comparison of mainstream models
According to Artificial Analysis data cited in the report, the output speeds of current mainstream general-purpose models are generally well below this level. The report puts the typical interactive speed of the GPT series at about 68 tokens per second, Claude Opus 4.6 at about 71 tokens per second, and Gemini Flash at about 192 tokens per second.
The report also notes that companies such as Cerebras and Groq have long focused on high-throughput inference, relying on self-developed chip architectures to improve speed. In contrast, Xiaomi achieved this result on a general-purpose GPU node, emphasizing the performance gains from software optimization.
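To put the quoted speeds in perspective, the short sketch below converts them into the time needed to generate a single response. The speeds are the figures cited above, with 1,000 tokens per second being Xiaomi's own claim; the 2,000-token response length is a hypothetical example, and time to first token is ignored.

```python
# Output speeds in tokens per second, as quoted in the article.
QUOTED_SPEEDS = {
    "GPT series": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
    "MiMo-V2.5-Pro-UltraSpeed (claimed)": 1000,
}


def generation_seconds(num_tokens, tokens_per_second):
    """Time to stream num_tokens at a steady output speed."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 2000  # hypothetical response length

for name, speed in QUOTED_SPEEDS.items():
    seconds = generation_seconds(RESPONSE_TOKENS, speed)
    print(f"{name}: {seconds:.1f} s")
```

At these rates a response that takes roughly half a minute at about 70 tokens per second would finish in about two seconds at the claimed UltraSpeed rate.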
Limited trial period begins on June 9th
Xiaomi stated that UltraSpeed accelerates the original MiMo-V2.5-Pro, not a simplified lightweight model. In earlier coding benchmarks, the model's performance was described as approaching Claude Opus levels.
The company plans to offer a limited API trial from June 9th to June 23rd. Access is by application, with priority given to enterprise users and professional developers. The UltraSpeed version is priced at approximately three times the standard MiMo rate but generates output up to 10 times faster.
Additional information: Xiaomi stated that the model checkpoint using FP4 and DFlash has been open-sourced on Hugging Face for community testing.












