Found 33 bookmarks
Artificial Analysis on X: "xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude https://t.co/Vc9781SIzd" / X
·x.com·
Vaibhav (VB) Srivastav on X: "MASSIVE release from Baidu - Ernie 4.5 VLMs & LLMs, Models beat DeepSeek v3, Qwen 235B and competitive to OpenAI O1 (for VLM) - Apache 2.0 licensed 💥 https://t.co/wDsNgEz9SK" / X
(Phew! Not claiming AGI quite yet. Are we asking the right questions about any new civilization?)
·x.com·
Scaling Laws For Scalable Oversight
(A security aspect contrasting Compton might be that tactical versions are initiated to have controlled chain reactions and then vanish, not unlike Houdini or a locked Roomba mystery, so there may be a forensic science. Also relate to the prior paper on MAIM's version of MAD and articles on quantum hacks.)
·arxiv.org·
OpenEQA: From word models to world models
OpenEQA combines challenging open-vocabulary questions with the ability to answer in natural language. This results in a straightforward benchmark that demonstrates a strong understanding of the environment—and poses a considerable challenge to current foundational models. We hope this work motivates additional research into helping AI understand and communicate about the world it sees.
·ai.meta.com·
Benchmarking the leading AI chat experience | You.com

In February 2024, You.com conducted a benchmarking study to evaluate the performance of its AI chat experience compared to competitors. You.com partnered with an independent vendor, Invisible Technologies, where independent evaluators rated responses from eight AI models, including free and paid offerings, across five criteria using a set of 120 representative user queries.

YouPro Modes, the premium offerings from You.com, outperformed ChatGPT 4 and Perplexity Pro in overall user preference. YouPro Modes also scored higher on comprehensiveness, factual accuracy, and faithfulness to the prompt’s intent. You.com’s free Smart Mode was the top-performing free model, beating ChatGPT 3.5 and Perplexity in overall user preference as well as accuracy and clarity.

·about.you.com·