HardTests: Synthesizing High-Quality Test Cases for LLM Coding
When AI Co-Scientists Fail: SPOT-a Benchmark for Automated...
DafnyBench: A Benchmark for Formal Software Verification
Black-Box Access is Insufficient for Rigorous AI Audits