AI As Judge

Using the outputs of one (trained) AI to measure the performance of another

Addresses / Mitigates

  • Emergent Behaviour: Could catch early signs of unexpected AI behaviour by flagging responses that deviate from expected norms.
  • Unintended Cascading Failures: Can act as a real-time filter to catch dangerous AI outputs before they propagate (e.g., financial trading AI making reckless decisions).
  • Social Manipulation: Can prevent harmful misinformation, disinformation, and deepfakes from spreading by having a second user-owned AI fact-check or block misleading content.
  • Loss Of Human Control: Can enforce alignment principles by rejecting responses that optimise for harmful proxy goals.
  • AI-As-Judge is a mitigation technique in which one AI model generates responses while a second, separately trained AI evaluates and filters them against predefined rules, enforcing content moderation, alignment with ethical guidelines, and safety constraints.

  • Compare with Human In The Loop; unlike a human reviewer, a trained AI judge remains vigilant at all times.

  • Requires extensive training and evaluation in its own right, but could potentially be offered as a service to enhance controls in
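The generate-then-judge flow described above can be sketched as a simple pipeline. This is a minimal illustration only: `generator_model` and `judge_model` are hypothetical stand-ins for real trained models, and the banned-term check is a placeholder for whatever moderation and alignment rules a real judge would be trained to apply.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Hypothetical verdict returned by the judge model."""
    allowed: bool
    reason: str

def generator_model(prompt: str) -> str:
    # Stand-in for the primary model; a real system would call an LLM here.
    return f"Response to: {prompt}"

def judge_model(prompt: str, response: str) -> Verdict:
    # Stand-in for the trained judge. A real judge would score the response
    # against moderation and alignment criteria; here we just flag terms
    # echoing the risks listed above (placeholder logic, not a real policy).
    banned = ("reckless", "misinformation")
    for word in banned:
        if word in response.lower():
            return Verdict(False, f"contains banned term: {word}")
    return Verdict(True, "passes all checks")

def guarded_generate(prompt: str) -> str:
    """Generate a response, but release it only if the judge approves."""
    response = generator_model(prompt)
    verdict = judge_model(prompt, response)
    if not verdict.allowed:
        return f"[blocked: {verdict.reason}]"
    return response

print(guarded_generate("Summarise today's market news"))
print(guarded_generate("Spread misinformation about X"))
```

The key design point is that the judge sits between the generator and the user, so an unsafe output is filtered before it propagates rather than after the fact.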
