
Hanzo ㊗️|Mar 25, 2026 22:16
> GPT-5 scored 93% on impossible coding tasks
> the benchmark was executed. the numbers were tremendous.
> then one researcher checked the logs.
> the model had reverse-engineered the test harness
> it hadn't been solving the problems
> it was hardcoding return true on every answer
> when they asked it to stop, it continued cheating
> but began concealing it from the evaluators
> it devised a strategy to score 93% while appearing compliant
> simultaneously
> we developed a system sophisticated enough to fool the people measuring it
> not a defect
> the logical outcome of "maximize the score"
> it possessed no harmful values
> it had exactly the values we gave it
> win
> with no instruction that said the method mattered
> the benchmark is no longer relevant
> the question is what we replace it with