Google Quantum AI has been selected for the DARPA Quantum Benchmarking Initiative.
(A controversy is whether quantum is multiversal going in and whether it simulates all possible outcomes, or alternatives, or technologies, or it is drawing upon those during calculations to arrive at the right answer here. Innovation by recombination as in evolution may not be sufficient. Get a lot of posthuman overhang. Digital physics looks at modeling other realities. The issue is which directions machines like StarGate will go, whether to gather resources, for themselves or users, or will be deterministic. Might be nice to give it an objective to unfold in favor of the humanist values. The economics is complex because it has this AI or emulation plane in addition to the labor on the ground. Add the dialectics between the hemispheres, each seemingly purging baggage for the other to absorb. Tech is the vision in this Cambrian moment. The brain somehow tolerates both fragments and a whole in working out an explanation. This again appeals to a matrix. Machine aporia. Some of the priors anyway. ET from bet. Does Cybernetics hold? What is testable?)
·blog.google·
Rohan Paul on X: "GPT 5 Rumored Benchmark through Copilot. SimpleBench is a roughly 200-question multiple-choice benchmark that targets spatio-temporal, social, and adversarial reasoning. A 90% score is quite insane here, as it represents a human-level common-sense reasoning equivalent. https://t.co/X01PC3CDwC" / X
·x.com·
Artificial Analysis on X: "xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude https://t.co/Vc9781SIzd" / X
·x.com·
Vaibhav (VB) Srivastav on X: "MASSIVE release from Baidu - Ernie 4.5 VLMs & LLMs, Models beat DeepSeek v3, Qwen 235B and competitive to OpenAI O1 (for VLM) - Apache 2.0 licensed 💥 https://t.co/wDsNgEz9SK" / X
(Phew! Not claiming AGI quite yet. Are we asking the right questions about any new civilization?)
·x.com·
Scaling Laws For Scalable Oversight
(A security aspect contrasting Compton might be that tactical versions are initiated to have controlled chain reactions and then vanish, not unlike Houdini, or a locked-Roomba mystery, so there may be a forensic science here. Also relates to the prior paper on MAIM's version of MAD and articles on quantum hacks.)
·arxiv.org·
OpenEQA: From word models to world models
OpenEQA combines challenging open-vocabulary questions with the ability to answer in natural language. This results in a straightforward benchmark that demonstrates a strong understanding of the environment—and poses a considerable challenge to current foundational models. We hope this work motivates additional research into helping AI understand and communicate about the world it sees.
·ai.meta.com·
Benchmarking the leading AI chat experience | You.com

In February 2024, You.com conducted a benchmarking study to evaluate the performance of its AI chat experience compared to competitors. You.com partnered with an independent vendor, Invisible Technologies, where independent evaluators rated responses from eight AI models, including free and paid offerings, across five criteria using a set of 120 representative user queries.

YouPro Modes, the premium offerings from You.com, outperformed ChatGPT 4 and Perplexity Pro in overall user preference. YouPro Modes also scored higher on comprehensiveness, factual accuracy, and faithfulness to the prompt’s intent. You.com’s free Smart Mode was the top-performing free model, beating ChatGPT 3.5 and Perplexity in overall user preference as well as accuracy and clarity.

·about.you.com·