Claude Opus 4.7 won 69 of 100 blind evals against Opus 4.6, judged by GPT-5.4, Gemini 3.1 Pro, and DeepSeek V3.2
I ran 100 blind questions across 5 categories (code, reasoning, analysis, communication, meta-alignment) and had three independent judges from three different model families evaluate both responses. Each judge saw responses labeled A and B…
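The protocol above can be sketched in a few lines: shuffle which model appears as "A" per question so judges can't key on position, collect each judge's pick, and count a question as a win on a majority vote among the three judges. The function names, the majority-vote rule, and the judge-callable interface here are my assumptions for illustration, not the author's actual harness.

```python
import random

def judge_blind_pair(resp_new, resp_old, judges, rng):
    """Return True if a majority of judges prefer resp_new in a blinded A/B comparison."""
    # Randomize which response is labeled "A" so judges can't infer model identity.
    new_is_a = rng.random() < 0.5
    a, b = (resp_new, resp_old) if new_is_a else (resp_old, resp_new)
    votes_for_new = 0
    for judge in judges:
        pick = judge(a, b)  # each judge callable returns "A" or "B"
        # The pick favors the new model iff its label matches the judge's choice.
        votes_for_new += (pick == "A") == new_is_a
    # Assumed rule: majority of the three judges decides the question.
    return votes_for_new > len(judges) / 2

def run_eval(questions, answer_new, answer_old, judges, seed=0):
    """Tally how many of the questions the new model wins under blind judging."""
    rng = random.Random(seed)  # fixed seed keeps label assignment reproducible
    return sum(
        judge_blind_pair(answer_new(q), answer_old(q), judges, rng)
        for q in questions
    )
```

With three judge callables and 100 questions, `run_eval` returns a win count directly comparable to the "69 of 100" headline figure; swapping in real API-backed judges only requires that each callable accept the two blinded responses and return "A" or "B".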