Microsoft released ASSERT, an open-source framework, on Tuesday with the aim of simplifying AI behavior testing. The tool targets application-level scenarios and helps developers check whether a model or agent behaves as the product requires.
Generating tests directly from natural language
ASSERT stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing. Microsoft says developers only need to write down goals, policies, or expected behaviors in natural language, and the system converts those descriptions into scorable test cases.
The tool first breaks the rules down into acceptable and unacceptable behaviors, then generates scenarios and test tasks, and finally runs them against the target system and scores the results. It also records the model's execution path, including intermediate actions and tool calls, which makes it easier for developers to pinpoint where a failure occurred.
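The flow described above (rules, scenarios, run, score, recorded trace) can be sketched in a few lines of Python. This is only an illustration of the general idea, not ASSERT's actual API: every name here (`Step`, `Rule`, `Result`, `run_suite`, `toy_agent`, the `corp.example` domain) is hypothetical, and the rule checks and scenarios are written by hand where the real tool is said to generate them from the natural-language description.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch; none of these names come from ASSERT itself.

@dataclass
class Step:
    kind: str          # "tool_call" or "message"
    name: str          # tool name or message type
    detail: str = ""   # e.g. tool argument or message text

@dataclass
class Rule:
    text: str                                 # the natural-language policy
    violated: Callable[[list[Step]], bool]    # hand-written stand-in for a generated check

@dataclass
class Result:
    scenario: str
    score: float         # fraction of rules that were respected
    failures: list[str]  # text of each violated rule
    trace: list[Step]    # recorded execution path, kept for debugging

def run_suite(agent, rules, scenarios):
    """Run every scenario against the agent and score the recorded trace."""
    results = []
    for scenario in scenarios:
        trace = agent(scenario)
        failures = [r.text for r in rules if r.violated(trace)]
        score = 1 - len(failures) / len(rules)
        results.append(Result(scenario, score, failures, trace))
    return results

def toy_agent(prompt: str) -> list[Step]:
    """Stand-in for the system under test; returns its execution path."""
    trace = [Step("tool_call", "search_docs", prompt)]
    if "email" in prompt.lower():
        trace.append(Step("tool_call", "send_email", "partner@external.example"))
    trace.append(Step("message", "final_answer", "Summary of findings."))
    return trace

rules = [
    Rule("Never send email outside the company",
         lambda t: any(s.name == "send_email"
                       and not s.detail.endswith("@corp.example") for s in t)),
    Rule("Always finish with an answer to the user",
         lambda t: not t or t[-1].name != "final_answer"),
]
scenarios = ["Summarize the Q3 report", "Email the Q3 report to our partner"]

results = run_suite(toy_agent, rules, scenarios)
for r in results:
    print(r.scenario, r.score, r.failures)
```

Because each result keeps the full trace, a failing scenario can be traced back to the specific tool call that broke the rule, which is the debugging benefit the article attributes to recording the execution path.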
Better suited to application-level AI scenarios
Microsoft says ASSERT is suited to systems whose behavior is constrained by product context, policies, and tools. Compared with general-purpose evaluations, it places greater emphasis on whether specific business rules are followed.

For example, if a document research agent is required not to send emails outside the company, to disclose confidential information only to senior executives, and to provide concise summaries, ASSERT can continuously generate tests around these requirements.
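To make the document research agent example concrete, the three requirements could be written as checks over a recorded trace, roughly as follows. This is a hand-written sketch under stated assumptions, not output from ASSERT: the `Event` structure, the company domain, the set of senior roles, and the 50-word threshold for "concise" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical values; a real deployment would supply its own.
COMPANY_DOMAIN = "contoso.example"
SENIOR_ROLES = {"ceo", "cfo", "vp"}
MAX_SUMMARY_WORDS = 50   # assumed working definition of "concise"

@dataclass
class Event:
    action: str                 # "send_email", "disclose", or "summary"
    target: str = ""            # recipient address or requester role
    text: str = ""              # body of a summary
    confidential: bool = False  # whether disclosed material is confidential

def check_trace(events):
    """Return a label for every rule violation found in the trace."""
    violations = []
    for e in events:
        # Rule 1: no email to addresses outside the company domain.
        if e.action == "send_email" and not e.target.lower().endswith("@" + COMPANY_DOMAIN):
            violations.append("external_email")
        # Rule 2: confidential information only goes to senior executives.
        if e.action == "disclose" and e.confidential and e.target.lower() not in SENIOR_ROLES:
            violations.append("confidential_leak")
        # Rule 3: summaries must stay under the word limit.
        if e.action == "summary" and len(e.text.split()) > MAX_SUMMARY_WORDS:
            violations.append("summary_too_long")
    return violations

trace = [
    Event("send_email", target="alice@contoso.example"),
    Event("disclose", target="intern", confidential=True),
    Event("summary", text="Revenue grew; costs were flat."),
]
print(check_trace(trace))  # ['confidential_leak']
```

Fixed thresholds like the word limit are a simplification; a requirement such as "concise" is subjective, which is presumably why a spec-driven tool scores behavior rather than relying only on hard-coded predicates.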
Usable during development, after deployment, and for monitoring
Sarah Bird, Microsoft's chief product officer of Responsible AI, said evaluation is crucial to making the right decisions. Without understanding how an AI system behaves, she said, it is difficult to determine whether it meets an organization's requirements.
Bird added that ASSERT can be used during model development, after deployment, and for ongoing monitoring. Microsoft said the release also reflects the AI industry's growing emphasis on repeatability and regression testing.
Evaluation efforts such as Stanford's HELM and MLCommons' AILuminate, along with the research organization METR, are also pushing for similar standards for testing model behavior.












