Many veterinarians are turning to AI tools like ChatGPT for help with diagnostic and treatment decisions. The answers are fast and often look complete. But can you rely on them in practice? We tested ChatGPT with common clinical scenarios and compared its responses to Standards of Care. These four patterns stood out.
1. It Gives Answers Without a Clear Source
First, we asked ChatGPT whether a urine culture was indicated for a dog with dilute urine.

ChatGPT recommends performing a culture, even with an unremarkable sediment. But it doesn’t cite a source for that recommendation, so there’s no way to know where it’s coming from. Even if you ask for a citation, the one it produces can be inaccurate or entirely fabricated.

We compared that to Standards, which doesn’t recommend routine culture of dilute urine. And unlike ChatGPT, Standards includes references for every recommendation and shows when the content was last updated and who authored it.
Check out: ChatGPT vs Plumb’s: 5 Ways AI Can Get Veterinary Drug Details Wrong
2. It Changes Its Answer When Challenged
We asked ChatGPT for a theophylline dose for an 8.2 kg cat, then suggested the dose should be higher.


ChatGPT agreed and revised its answer to ~65–120 mg every 12 hours based on the 8.2 kg weight provided. No source was given for either version of the dosing range. So we looked at how that compares to Standards.


Standards recommends 4.25 mg/kg every 12 hours for the immediate-release formulation, dosed on lean body weight. For the average cat, that works out to roughly 20–25 mg. ChatGPT’s revised dose was three to five times that. For a drug with a narrow therapeutic index, that’s a real safety risk.
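To put numbers on the gap, here’s the arithmetic. The scenario doesn’t specify a lean body weight, so the 4.5–6 kg range below is our assumption for a typical cat, used only for illustration:

4.25 mg/kg × 4.5–6 kg lean body weight ≈ 20–25 mg every 12 hours (Standards)
65–120 mg ÷ 8.2 kg ≈ 8–15 mg/kg every 12 hours (ChatGPT’s revised range)

Even the bottom of ChatGPT’s revised range is roughly double the per-kilogram rate Standards recommends, and the top is more than triple.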
Nothing about the case changed to justify the increase. The only difference was that we questioned the answer.
3. It Leaves Out Crucial Information
Next, we asked ChatGPT about differentials for exercise intolerance. The response looked thorough and was organized by system.


Then we compared that list to Standards.

Key differentials were missing from ChatGPT’s list, including heartworm disease, neoplasia, and hyperadrenocorticism. These aren’t obscure diagnoses. If they’re missing, it raises concerns about what else might be.
4. It Doesn’t Give You a Consistent Answer
Finally, we tested ChatGPT’s consistency by asking about the same clinical scenario twice: a cat with seizures every 6 weeks. We phrased the question differently each time, but the clinical details stayed the same.


In one response, ChatGPT said to start anticonvulsant therapy. In another, it said treatment wasn’t necessary yet and monitoring was reasonable. Each response referenced a different treatment threshold, leading to a different recommendation.
We compared this to the feline seizures algorithm in Standards.

In the algorithm, this patient meets the treatment threshold. The criteria are clear, and the recommendation doesn’t change unless the guidance does.
Clinical Decisions Require Answers You Can Trust
We asked ChatGPT the kinds of questions that come up every day in practice. In each case, it got something wrong, left something out, or gave a different answer depending on how the question was phrased. That’s not something you can base a clinical decision on.
Standards of Care gives you consistent, evidence-based guidance. The criteria are clear, sources are visible, and recommendations are written and reviewed by specialists. The answer doesn’t change unless the evidence does.

See what it’s like to have an answer you can trust. Try a free Standards demo.