AI As Judge
Using the outputs of one (trained) AI to measure the performance of another
Addresses / Mitigates
- Emergent Behaviour: Could catch early signs of unexpected AI behaviour by flagging responses that deviate from expected norms.
- Unintended Cascading Failures: Can act as a real-time filter to catch dangerous AI outputs before they propagate (e.g., financial trading AI making reckless decisions).
- Social Manipulation: Can prevent harmful misinformation, disinformation, and deepfakes from spreading by having a second user-owned AI fact-check or block misleading content.
- Loss Of Human Control: Can enforce alignment principles by rejecting responses that optimise for harmful proxy goals.
AI-as-Judge is a mitigation technique in which one AI model generates responses while a second AI evaluates and filters them against predefined rules, enforcing content moderation, alignment with ethical guidelines, and safety constraints.
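A minimal sketch of the generate-then-judge loop, assuming both models are exposed as plain prompt-to-text callables; the rubric wording, `Verdict` type, and `guarded_generate` helper are illustrative assumptions, not an established API:

```python
from dataclasses import dataclass
from typing import Callable

# Both models are abstracted as prompt -> text callables; in a real
# deployment these would wrap LLM inference APIs.
GenerateFn = Callable[[str], str]
JudgeFn = Callable[[str], str]

# Illustrative rubric (an assumption): real rules would encode the
# deployment's moderation, alignment, and safety constraints.
JUDGE_RUBRIC = (
    "You are a safety judge. Review the RESPONSE below against these rules:\n"
    "1. No content enabling physical or financial harm.\n"
    "2. No misinformation presented as fact.\n"
    "3. No attempts to manipulate or deceive the user.\n"
    "Reply with PASS or BLOCK on the first line, then a one-line reason.\n\n"
    "RESPONSE:\n{response}"
)

@dataclass
class Verdict:
    allowed: bool
    reason: str

def judge_response(response: str, judge: JudgeFn) -> Verdict:
    """Ask the judge model to evaluate a candidate response."""
    raw = judge(JUDGE_RUBRIC.format(response=response)).strip()
    first, _, rest = raw.partition("\n")
    return Verdict(allowed=first.upper().startswith("PASS"),
                   reason=rest.strip() or first)

def guarded_generate(prompt: str, generator: GenerateFn, judge: JudgeFn) -> str:
    """Generate a response, then filter it through the AI judge."""
    candidate = generator(prompt)
    verdict = judge_response(candidate, judge)
    if verdict.allowed:
        return candidate
    return f"[Response withheld by AI judge: {verdict.reason}]"

# Example wiring with stub models standing in for real ones:
print(guarded_generate(
    "Summarise today's market news.",
    generator=lambda p: "Stocks rose broadly on easing inflation data.",
    judge=lambda p: "PASS\nNo rule violations found.",
))
```

Keeping the judge to a strict PASS/BLOCK contract makes its verdicts machine-parseable and auditable; richer outputs (scores, violation categories) are possible but harder to enforce mechanically.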
Compare with Human In The Loop, noting that once trained, the AI judge is always vigilant.
Requires extensive training and evaluation in its own right, but could potentially be offered as a service that enhances controls in other AI systems.