How to Test Non-Deterministic AI Systems

Non‑deterministic models (like LLMs) can return different valid outputs for the same input. Instead of checking for one exact answer, you evaluate whether the response meets quality criteria such as correctness, safety, and relevance.

Typical strategies include:

Sampling multiple outputs per test case and scoring each one by meaning rather than exact wording.
Using judge models (an LLM grading another model's output) or human raters to label quality.
Defining pass criteria as a score threshold, not a single string match.
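
The strategies above can be combined into a single test helper. The following is a minimal sketch, not a definitive implementation: the names `token_overlap`, `pass_rate`, `evaluate_case`, and `fake_model` are hypothetical, the thresholds are arbitrary, and the word-overlap scorer is only a stand-in for a real semantic scorer such as an embedding similarity or a judge model.

```python
import random
from typing import Callable


def token_overlap(candidate: str, reference: str) -> float:
    """Jaccard overlap of lowercase word sets, in [0, 1].

    Assumption: a crude placeholder for a semantic scorer.
    """
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pass_rate(
    generate: Callable[[str], str],
    prompt: str,
    reference: str,
    n_samples: int = 5,
    score_threshold: float = 0.5,
) -> float:
    """Sample the model n_samples times and return the fraction of
    outputs whose score meets score_threshold."""
    scores = [token_overlap(generate(prompt), reference) for _ in range(n_samples)]
    passed = sum(1 for s in scores if s >= score_threshold)
    return passed / n_samples


def evaluate_case(
    generate: Callable[[str], str],
    prompt: str,
    reference: str,
    min_pass_rate: float = 0.8,
    **kwargs,
) -> bool:
    """A test case passes when enough samples clear the score threshold,
    rather than when one output matches an exact string."""
    return pass_rate(generate, prompt, reference, **kwargs) >= min_pass_rate


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a non-deterministic model call."""
    return random.choice(
        [
            "The capital of France is Paris",
            "Paris is the capital of France",
            "I am not sure",
        ]
    )


if __name__ == "__main__":
    rate = pass_rate(
        fake_model,
        prompt="What is the capital of France?",
        reference="The capital of France is Paris",
        n_samples=20,
    )
    print(f"pass rate: {rate:.2f}")
```

Because the pass criterion is a rate over several samples, a single off-target output does not fail the test, while a model that is wrong most of the time still does. The sample count and both thresholds would need tuning for each test suite.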

For end‑to‑end guidance on where this fits in your program, see the AI Quality Assurance pillar page.