Xiaomi releases MiMo Accelerated Edition, boosting inference speed past 1,000 tokens per second.
Coinpaper
3h ago
Ai Focus
Xiaomi released MiMo Accelerated Edition, achieving an inference speed of over 1000 tokens per second on general-purpose GPU nodes. Limited API trials will begin on June 9.

Xiaomi has released MiMo-V2.5-Pro-UltraSpeed, an accelerated inference version of its trillion-parameter flagship model. The company claims that the new version achieves inference speeds exceeding 1,000 tokens per second on a standard server consisting of eight general-purpose GPUs, with a peak demo speed approaching 1,200 tokens per second.

The focus of this update is inference efficiency rather than a new model. Unlike solutions that rely on custom chips, Xiaomi emphasizes general-purpose hardware, with the speed-up coming from software and model optimizations. This could further lower the barrier to high-speed deployment of large models.

Two technologies drive acceleration

Xiaomi relied primarily on two technologies. The first is FP4 quantization. The company compressed the expert layers, which account for the majority of the model's parameters, to 4-bit precision while keeping the rest of the model at higher precision. This reduces memory usage and memory-bandwidth pressure, which in turn improves inference speed.
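The effect on memory can be estimated with simple arithmetic. The sketch below is illustrative only: the parameter count of one trillion comes from the "trillion-parameter" description above, while the 95% expert share, the 16-bit baseline, and the function name are assumptions that Xiaomi has not confirmed. The small overhead of quantization scale factors is ignored.

```python
def weight_memory_gb(total_params: float, expert_fraction: float,
                     expert_bits: int, other_bits: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    other_bytes = total_params * (1.0 - expert_fraction) * other_bits / 8
    return (expert_bytes + other_bytes) / 1e9


TOTAL_PARAMS = 1e12      # "trillion-parameter", as described in the article
EXPERT_FRACTION = 0.95   # ASSUMPTION: the article does not state the expert share

# Baseline: every weight stored in 16 bits (ASSUMPTION about the original format).
baseline = weight_memory_gb(TOTAL_PARAMS, EXPERT_FRACTION, 16, 16)
# Mixed: experts in 4 bits, everything else left at 16 bits.
mixed = weight_memory_gb(TOTAL_PARAMS, EXPERT_FRACTION, 4, 16)

print(f"16-bit everywhere: {baseline:.0f} GB")
print(f"4-bit experts:     {mixed:.0f} GB")
```

Under these assumed numbers the weights shrink from roughly 2,000 GB to roughly 575 GB. Because generating each token requires reading weights from GPU memory, moving less data per token is what translates into higher speed.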

The second is DFlash speculative decoding. Traditional speculative decoding typically has a smaller model predict a few tokens, which a larger model then verifies in parallel. DFlash instead proposes a whole block of tokens at once for the main model to verify. On coding tasks, the main model accepts an average of 6.3 out of 8 candidate tokens per round.
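The propose-then-verify loop can be sketched in a few lines. This is a generic, greedy form of block speculative decoding written for illustration, not DFlash's actual algorithm, whose details the article does not give. The toy models and all function names are hypothetical.

```python
from typing import Callable, List

Token = int


def speculative_step(context: List[Token],
                     propose_block: Callable[[List[Token], int], List[Token]],
                     target_next: Callable[[List[Token]], Token],
                     block_size: int) -> List[Token]:
    """Run one propose/verify round and return the tokens to append."""
    draft = propose_block(context, block_size)
    accepted: List[Token] = []
    # A real system scores every draft position in one parallel forward pass
    # of the main model; the loop here only keeps the logic easy to follow.
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the main model's token and stop the round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Whole block accepted: the main model still contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(context: List[Token]) -> Token:
    """Hypothetical main model: always counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical drafter: agrees with the target except at position 3."""
    block, last = [], context[-1]
    for i in range(block_size):
        nxt = (last + 1) % 10
        if i == 3:
            nxt = (nxt + 5) % 10  # deliberate error so verification rejects it
        block.append(nxt)
        last = nxt
    return block


new_tokens = speculative_step([0], toy_draft, toy_target, block_size=8)
print(new_tokens)  # [1, 2, 3, 4]
```

In this toy run three draft tokens are accepted and the fourth is replaced by the main model's own choice, so one verification round yields four tokens. The reported average of 6.3 accepted tokens out of 8 suggests that each pass of the main model produces several tokens instead of one, which is where the speed-up comes from.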

Xiaomi and its inference partner TileRT have also optimized the execution process. The idea is to keep computation running continuously on the GPU, cutting the extra overhead incurred when operators are launched one at a time.

Speed comparison of mainstream models

According to Artificial Analysis data cited in the report, the output speed of current mainstream general-purpose models is generally well below this level. The report puts typical interactive speeds at about 68 tokens per second for the GPT series, about 71 tokens per second for Claude Opus 4.6, and about 192 tokens per second for Gemini Flash.
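For a rough sense of scale, the figures above can be turned into ratios. The short sketch below uses the claimed 1,000 tokens per second as a lower bound. The variable names are illustrative, and the comparison assumes the cited speeds are measured in a comparable way, which the report does not confirm.

```python
ULTRASPEED_TPS = 1000  # claimed lower bound, in tokens per second

# Speeds cited in the report (tokens per second).
reported_tps = {
    "GPT series": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
}

# Ratio of the claimed UltraSpeed rate to each cited rate, to one decimal.
speedups = {name: round(ULTRASPEED_TPS / tps, 1)
            for name, tps in reported_tps.items()}

for name, ratio in speedups.items():
    print(f"{name}: about {ratio}x")
```

On these numbers the claimed speed is roughly 14 to 15 times the cited GPT and Claude Opus figures and about 5 times the cited Gemini Flash figure.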

The report also notes that companies such as Cerebras and Groq have long focused on high-throughput inference, relying on their own chip architectures to improve speed. Xiaomi, by contrast, achieved this result on a general-purpose GPU node, emphasizing the gains from software optimization.

Limited trial period begins on June 9

Xiaomi stated that UltraSpeed accelerates the original MiMo-V2.5-Pro, not a pared-down lightweight model. That model's performance in earlier coding benchmarks was described as approaching Claude Opus levels.

The company plans to offer a limited API trial from June 9 to June 23 on an application basis, with priority given to enterprise users and professional developers. The UltraSpeed version is priced at approximately three times the standard MiMo rate but generates output up to 10 times faster.

Additional information: Xiaomi stated that the model checkpoint using FP4 and DFlash has been open-sourced on Hugging Face for community testing.
