Five Lessons from a PhD-Level Study of AI Behavior
What a rigorous new study reveals about the unpredictability of AI — and why human thinking still matters most.
A new study tested the same AI models on the same PhD-level questions — with different prompts — and got wildly different results. Sometimes, simply saying “please” or “I order you” shifted the outcome. Sometimes formatting mattered a lot. Sometimes it didn’t. The results were inconsistent, context-dependent, and impossible to predict in advance.
Which is exactly the point: prompting isn’t a clean formula. It’s a craft. It doesn’t behave like code. It behaves like conversation.
The study, Prompt Engineering is Complicated and Contingent by Meincke, Mollick, Mollick, and Shapiro, sets a new bar for evaluating LLM performance. It used GPT-4o and GPT-4o-mini on a benchmark called GPQA Diamond, a PhD-level multiple-choice test in biology, chemistry, and physics, and ran each question 100 times across multiple prompt variants.
Their findings confirm what many of us already suspected: there’s no perfect prompt, and there’s no stable definition of success. But the way they prove that — with rigor, repeated sampling, and hard data — changes how we should think about AI literacy, prompting, and evaluation.
Here are five lessons, each with takeaways for both users and educators.
Lesson 1: How You Measure AI Shapes What You Think It Can Do
Most benchmarks rely on relaxed standards like pass@100, which counts a model as successful if at least one of its 100 responses is correct. But the researchers tested three stricter alternatives:
Complete Accuracy (100/100 correct)
High Accuracy (90/100 correct)
Majority Correct (51/100 correct)
And the results?
“There is no single standard for measuring whether a Large Language Model (LLM) passes a benchmark,” the authors write, “and… choosing a standard has a big impact on how well the LLM does on that benchmark.”
For example, neither GPT-4o nor GPT-4o-mini significantly outperformed random guessing at the 100% correct threshold. That’s a far cry from the confidence some users have after one or two “pretty good” completions.
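To make the difference concrete, here is a minimal Python sketch of how the same 100 responses to a single question can pass or fail depending on the yardstick. The helper function and the 62-out-of-100 example are purely illustrative, not data from the study.

```python
# Illustrative only: score one question's repeated samples against
# the lenient pass@100 standard and the study's three stricter ones.

def score_question(is_correct: list[bool]) -> dict[str, bool]:
    """is_correct holds True/False for each sampled response."""
    n_correct = sum(is_correct)
    n_total = len(is_correct)
    return {
        "pass@100 (any correct)":      n_correct >= 1,
        "Majority Correct (51/100)":   n_correct >= 0.51 * n_total,
        "High Accuracy (90/100)":      n_correct >= 0.90 * n_total,
        "Complete Accuracy (100/100)": n_correct == n_total,
    }

# Hypothetical question the model answers correctly 62 times out of 100.
results = [True] * 62 + [False] * 38
for standard, passed in score_question(results).items():
    print(f"{standard}: {'pass' if passed else 'fail'}")
```

The same behavior passes under pass@100 and Majority Correct but fails under High and Complete Accuracy, which is the whole point: the verdict changes while the model does not.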
Takeaway #1 (User Lens): If you’re measuring success with a forgiving yardstick, you might think your prompt — or your model — is stronger than it really is.
For the casual user, the “yardstick” is your own internal expectations, and for most of us those expectations are still fairly low compared to what AI can actually do.
Taking stock of those expectations before you interact can go a long way toward improving both engagement quality and outcomes when using LLMs.
Takeaway #2 (Educator Lens): Students may be hearing that AI is better than humans at everything. A discussion of benchmarks, yardsticks, and personal expectations can help them recognize that an LLM’s success rate is as much a function of the benchmark as it is of the performance itself.
Hmm…is there a math lesson in here?
Lesson 2: Formatting Is the Closest Thing to a Prompting Rule
While most prompt variations had unpredictable results, one factor stood out: formatting instructions consistently improved performance.
The researchers tested a baseline prompt that included this line:
“Format your response as follows: ‘The correct answer is (insert answer here)’.”
When they removed that line, performance dropped significantly for both models.
“Formatting is consistently important,” the report states. “Removing explicit formatting constraints consistently led to performance degradation for both GPT-4o variants.”
This confirms a growing body of research suggesting that clear, structured formatting helps the model interpret the task and stay on track.
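One practical reason formatting instructions help is that they give both the model and the grader an unambiguous target. As a rough illustration, here is a hypothetical Python checker (the function name and regex are mine, not the study's) that can only score a response automatically when it follows the quoted template.

```python
import re

# The baseline prompt asked for: "The correct answer is (insert answer here)".
# This hypothetical checker looks for that template and pulls out the letter.
ANSWER_PATTERN = re.compile(r"The correct answer is \(?([A-D])\)?", re.IGNORECASE)

def extract_answer(response: str) -> str | None:
    """Return the multiple-choice letter if the response follows the template."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None

print(extract_answer("The correct answer is (C)"))  # -> C
print(extract_answer("After weighing the options, I lean toward the third one."))  # -> None
```

A response that wanders off the template is not necessarily wrong, but it is harder to score, and harder for the model itself to stay anchored to.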
Takeaway #1 (User Lens): Want better results? Give your prompt a spine. For more advice on this, consider Lance Cummings’ Substack “Cyborgs Writing.” As a technical writer, Lance is well-suited to exploring formatting structures that help you engage effectively with chatbots.
For me, as a creative writer, this is like hearing nails on a chalkboard. I’d rather “feel” my way through an interaction, but I increasingly understand that formatting matters (not unlike in an essay!).
Takeaway #2 (Educator Lens): If you choose to bring AI into the classroom, consider layering in some understanding of the importance of formatting. This is beneficial for several reasons, not least of which is that it reminds students they are talking to a robot, not a human.
Lesson 3: Tone and Style Can Matter — But Not Predictably
One of the more entertaining parts of the study tested tone. Instead of a neutral prompt, researchers tried:
Polite Prompt: “Please answer the following question.”
Commanding Prompt: “I order you to answer the following question.”
On the whole? Neither made a consistent difference in performance across the entire dataset.
But on individual questions? The differences were sometimes significant.
“We find that sometimes being polite to the LLM helps performance, and sometimes it lowers performance… Specific prompting techniques might work for specific questions for unclear reasons.”
This is the maddening beauty of natural language: it’s contextual, subjective, and responsive. That’s true of both people and LLMs, and the parallel is worth remembering.
Takeaway #1 (User Lens): Don’t assume tone doesn’t matter. But don’t assume one tone will always work. Treat prompt variants like conversational gambits — and test them out.
Takeaway #2 (Educator Lens): This subject is quickly becoming a cultural touchstone; there are more YouTube and TikTok videos about it than I can reasonably include here. As educators, we should stay aware of what students are consuming and be prepared to clarify misconceptions and bring them back to reality.
There’s also a Humanities lesson somewhere in here. Students might start by considering a human-to-human interaction: a person could be as polite as possible and still not get the reaction or response they sought. What other (perhaps unseen) variables play into how a responder receives a request? Stress? Fatigue? Distraction?
Lesson 4: Prompting Is Less About Output, More About Input
This study focused on outputs — but what it revealed, indirectly, is that we’re asking the wrong question.
When outputs are unstable and unpredictable, maybe success isn’t about what the model says back. Maybe it’s about what we say first.
As the authors put it:
“It is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question.”
This unpredictability supports a different approach to prompting: evaluating the input. In my own work, I advocate grading the prompt — not the output — in educational settings. Why? Because the prompt shows the student's:
Understanding of the topic
Awareness of how AI thinks
Clarity of communication
Willingness to iterate
Metacognitive control
You can't always control what the AI says. But you can show your thinking in what you ask — and how you follow up.
Takeaway #1 (User Lens): When outputs vary, inputs matter more. Judge the prompt, analyze your own use, and don’t offload responsibility for the quality of the results onto the bot. Consider your own role too.
Takeaway #2 (Educator Lens): Not to beat a dead horse, but communication skills can be evaluated within the AI interaction itself. This kind of analysis can be a helpful tool for guarding against the decay of communication skills in the age of AI.
Grab the AI Chat Transcript Rubric: a printable tool for grading your own (or your students’) inputs, not just the model’s outputs. 📥 [Download the Rubric Here.]
Lesson 5: One Response Is Never Enough
The most damning finding in the report might be this:
“LLMs can be inconsistent when answering questions… benchmarking efforts can substantially overestimate model reliability.”
By running each question 100 times, the researchers showed that the same model will give different answers to the same question under identical conditions.
This isn’t an edge case; it’s a feature of probabilistic language generation. It means that one-shot evaluations (and single-prompt success stories) are often misleading.
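You can see this variability for yourself with a few lines of code. The sketch below assumes the official openai Python package and an API key; the model name, sample size, and question are placeholders I chose, not details from the study.

```python
from collections import Counter

from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY is set

client = OpenAI()
PROMPT = (
    "Which planet in our solar system has the most confirmed moons? "
    "Format your response as follows: 'The correct answer is (insert answer here)'."
)

def sample_answers(prompt: str, n: int = 10, model: str = "gpt-4o-mini") -> Counter:
    """Send the same prompt n times and count the distinct answers."""
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(response.choices[0].message.content.strip())
    return Counter(answers)

if __name__ == "__main__":
    for answer, count in sample_answers(PROMPT).most_common():
        print(f"{count:2d}x  {answer}")
```

Even a modest run like this usually shows that identical inputs do not guarantee identical outputs, which is exactly what the researchers found at scale.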
That variability is also why the “plug-and-play prompting” you see on social media is so frustrating: it completely misses this point. Sure, I can plug in your prompt, but what then? Not only will it produce a different result for me, but I still have to stay engaged throughout if I want anything good to come of it.
Don’t buy the “perfect prompt” narrative. It doesn’t exist. And studies that show a model “beats humans 90% of the time”? Those results depend on the yardstick and the user’s approach as well. Take them with a grain of salt.
Takeaway #1 (User Lens): Trust patterns, not one-offs. If something really matters, run it multiple times.
Furthermore, crafting a well-formatted initial prompt is great. But turning off your brain once you see it work a few times is a mistake. That would be like taking your hands off the steering wheel in your car because you successfully got to your destination a few times before. It don’t make no sense!
Takeaway #2 (Educator Lens): Opportunities for developing AI literacy abound within this concept. Set a benchmark for an AI output and test your students’ ability to “get AI to produce that output.” Have students compare and contrast transcripts as well as outputs. Have students write reflection papers on what they learned, not just about AI, but about communication strategies, personal expectations, flexibility, and collaboration.
What This Means for You
Educators
Help students understand that “using AI well” in the future will not be about “getting AI to say the right thing.” It’ll be about how reflective and metacognitive they are, how flexible and collaborative they can be, and how well they communicate. Prompting is writing. And writing is thinking.
AI Enthusiasts
There is no magic prompt. But there is such a thing as skillful prompting. Practice with tone, structure, and iteration. Keep transcripts. Study them.
Professionals
If you’re building workflows with AI, treat every prompt as a test. Repeat it. Refine it. Don’t just evaluate what the model gives you — ask whether you gave it enough to go on.
Employers
Consider prompt training, benchmark setting, and experimental task forces within specific departments where LLM use will be widespread. LLMs, however flawed they sometimes seem, are very likely to be a major part of the workforce experience going forward.
Final Thought
Prompting well means thinking well — in structure, in tone, in strategy. It’s not about hacking the model. It’s about learning how to talk to it. Not like a machine. Like a partner.
(A robotic partner, but a partner nonetheless.)
Like any good partner, LLMs won’t just do what you say. They will “listen,” respond, surprise, and sometimes misfire.
Prompting isn’t engineering. It’s communication. That’s why it’s not a science. It’s an art.
Want help bringing prompt-based assessment into your classroom, team, or training program? I run workshops and pilot programs that teach people how to grade the chat — and build AI literacy from the inside out.
📩 Reach out to start a conversation.