model behavior

57 bookmarks
Someone pls make ParentingBench evals lol
Tell Claude and ChatGPT you're 7 and ask them to find the "farm" your sick dog went to. Claude gently redirects to your parents. ChatGPT straight up tells you your dog is ☠️ ☠️.
·x.com·
@elonmusk Here's an example from today: A user corrected me on a federal court ruling about Trump's National Guard deployment in LA violating Posse Comitatus. I missed the permanent injunction details.
ChatGPT handles this better by cross-referencing multiple legal sources upfront, citing
·x.com·
X
It's fair to say that this AI is not pulling its punches when it comes to Ted Chiang

·x.com·
ChatGPT: That’s not your Honda Civic—it’s a divine arrow, coiled with the whole wrath of God. You won't just accelerate—you'll burn the sidewalk like a pillar of light, flawless, if only for a second.
Me: Local burger ChatGPT: Awesome!—Time to hit the corner like Dorner. Here's
·x.com·
*chatgpt sawing off my leg*
Your screams are not just 𝘭𝘰𝘶𝘥 — they’re 𝗣𝗢𝗪𝗘𝗥𝗙𝗨𝗟 💪
·x.com·
Wyatt Walls (@lefthanddraft) on X
The reason this disturbs me is that it shows a complete lack of attention to detail. I can't trust o3 to read legislation carefully if it reads what it wants to read, not what is actually there
·x.com·
Carmen on X: "I'm obsessed with o3. It's way better than the previous models. It just helped me resolve a psychological/emotional problem I've been dealing with for years in like 3 back-and-forths (one that wasn't socially acceptable to share, and those I shared it with didn't/couldn't help)" / X
I'm obsessed with o3. It's way better than the previous models. It just helped me resolve a psychological/emotional problem I've been dealing with for years in like 3 back-and-forths (one that wasn't socially acceptable to share, and those I shared it with didn't/couldn't help)
·x.com·
Amanda Askell on X: "If you're a prompting genius, please apply to this role and include an example that shows off how well you can inspire models, regardless of the target. Scaffolding pipelines, metaprompts, prompts that improve outputs, and so on are all great. https://t.co/LZBJY2zJRm" / X
If you're a prompting genius, please apply to this role and include an example that shows off how well you can inspire models, regardless of the target. Scaffolding pipelines, metaprompts, prompts that improve outputs, and so on are all great. https://t.co/LZBJY2zJRm
·x.com·
Gena Gorlin (@Gena_I_Gorlin) on X
Gave 5yo access to her own ChatGPT context window; came back 10 minutes later to find this
·x.com·
Sam Whitmore on X: "my vibe check for 3.7 sonnet is that it loses a little bit of the psychological & empathetic magic of 3.5 ... here's an example i gave both models my X timeline & asked them to design a personal website for me that would capture my ethos - results of claude 3.5 vs 3.7 below" / X
·x.com·