AI As Judge

Using the outputs of one (trained) AI to measure the performance of another

Addresses / Mitigates

  • Emergent Behaviour: Could catch early signs of unexpected AI behaviour by flagging responses that deviate from expected norms.
  • Unintended Cascading Failures: Can act as a real-time filter to catch dangerous AI outputs before they propagate (e.g., financial trading AI making reckless decisions).
  • Social Manipulation: Can prevent harmful misinformation, disinformation, and deepfakes from spreading by having a second user-owned AI fact-check or block misleading content.
  • Loss Of Human Control: Can enforce alignment principles by rejecting responses that optimise for harmful proxy goals.
  • AI-As-Judge is a mitigation technique in which one AI model generates responses while a second, separately trained AI evaluates and filters them against predefined rules, enforcing content moderation, alignment with ethical guidelines, and safety constraints.

  • Compare with Human In The Loop; unlike a human reviewer, a trained AI judge remains vigilant at all times.

  • Requires extensive training and evaluation in its own right, but could potentially be offered as a service to enhance controls in
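The generate-then-judge flow described above can be sketched as a simple pipeline. This is a minimal illustration only: `generator_model` and `judge_model` are hypothetical stand-ins for real trained models, and the banned-term check is a placeholder for whatever moderation and alignment rules a real judge would be trained to apply.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Hypothetical verdict returned by the judge model."""
    allowed: bool
    reason: str

def generator_model(prompt: str) -> str:
    # Stand-in for the primary model; a real system would call an LLM here.
    return f"Response to: {prompt}"

def judge_model(prompt: str, response: str) -> Verdict:
    # Stand-in for the trained judge. A real judge would score the response
    # against moderation and alignment criteria; here we just flag terms
    # echoing the risks listed above (placeholder logic, not a real policy).
    banned = ("reckless", "misinformation")
    for word in banned:
        if word in response.lower():
            return Verdict(False, f"contains banned term: {word}")
    return Verdict(True, "passes all checks")

def guarded_generate(prompt: str) -> str:
    """Generate a response, but release it only if the judge approves."""
    response = generator_model(prompt)
    verdict = judge_model(prompt, response)
    if not verdict.allowed:
        return f"[blocked: {verdict.reason}]"
    return response

print(guarded_generate("Summarise today's market news"))
print(guarded_generate("Spread misinformation about X"))
```

The key design point is that the judge sits between the generator and the user, so an unsafe output is filtered before it propagates rather than after the fact.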
