Whether you are building an AI agent or evaluating one, the question is the same:
Does it actually change patient behavior?
I provide the scientific methodology to answer it.
Vendors make behavioral claims. You need independent verification of whether those claims hold before you commit budget, integrate systems, or expose patients.
You need to know your agent is behaviorally sound before it reaches patients — and that it produces outcomes you can defend to partners, investors, and regulators.
You must measure it right.
The engineering stack gets tested rigorously. Security, hallucinations, failure contingencies, fringe behaviors — these have established frameworks and active solutions.
The behavioral layer does not.
Whether a patient trusts the agent, engages with it over time, and actually changes their behavior as a result: these outcomes are assumed, not validated.
That's where products fail, and where my work begins.
There's a quiet crisis in healthcare AI.
Companies are shipping conversational agents that are technically impressive — large language models fine-tuned on clinical data, integrated into care pathways, compliant with every regulation that matters. And then patients don't use them. Or they use them once. Or they use them and nothing changes.
The technology worked. The product didn't.
This is the last-mile problem. And it has nothing to do with model performance.
To be clear: the engineering problems in healthcare AI are real, and the field has made serious progress on them.
Getting a capable LLM working is no longer the bottleneck it once was. Security and data privacy frameworks exist and are maturing. Hallucination rates are measurable and increasingly controllable. Failure contingencies (what happens when the primary model is unavailable, returns a low-confidence output, or hits an edge case) are becoming standard architecture. Fringe behaviors, the long tail of unexpected inputs that break a system in production, are increasingly anticipated and tested for.
None of this is trivial. All of it is necessary.
And none of it is sufficient.
Core model training and weights.
Crucial for mapping patients' characteristics, preferences, and actual health behaviors.
Legal framework and encryption.
Technical architectural grounding.
Ensures the agent uses appropriate "Nudges" and choice architecture to drive adherence without triggering reactance.
Validating the agent against real-world human irrationality.
A patient who is ambivalent about change.
An agent that responds correctly but feels cold.
A conversation that covers all the right clinical ground and leaves the patient feeling unseen.
A dialogue structure that invites too many adverse event reports.
These aren't failure modes that show up in your QA suite.
They show up in your retention curve: weeks after launch, quietly, without a stack trace.
This is where the known pipeline ends.
And it ends precisely at the moment that matters most: the ongoing relationship between your agent and your patient.
Classical product design was a finite problem. You defined the screens, mapped the flows, anticipated the edge cases. Users chose from options you provided. QA meant checking every defined state.
Conversational AI has no defined states.
When a patient can say anything — share a fear, express ambivalence, push back, go completely off-script — the interaction space becomes effectively infinite. The agent needs to respond to all of it: therapeutically, consistently, and in a way that reflects your brand and your clinical intent.
You cannot engineer your way through that space. You have to map it behaviorally — identifying the scenarios that matter most, understanding how patients are likely to think and feel in each one, and designing responses that move them in the right direction.
This is what behavioral science was built for.
It just took conversational AI to make it indispensable.
Most teams underestimate the last mile because they're solving for one layer when there are three.
Does the agent respond appropriately to what the patient is actually feeling — not just what they literally said? Does it know when to push and when to hold back? Can it handle ambivalence without losing the thread?
Does the patient believe the agent is on their side, and that it understands what they need? Trust in healthcare AI is fragile and non-linear: it builds slowly and collapses fast. The agent needs to be designed around how trust actually works, not around the assumption that trust is already there.
Patients are not a monolith. Some want direct guidance. Others need to feel heard first. A well-designed agent doesn't just respond — it reads the person and adapts in real time.
Get all three right, and you have an agent that patients return to.
Get any one wrong, and you have a product that works in demos and fails in the field.
Regulatory compliance — HIPAA, GDPR, SaMD classification, pharmacovigilance routing, adverse event flagging — is the responsibility of the product, its engineering team, and its legal counsel. These are distinct and well-defined disciplines with established frameworks.
My work sits at a separate layer: measuring whether the behavioral intervention produces the outcomes it claims. Whether patients engage, trust, adhere, and change behavior. These outcomes are currently unmeasured in most healthcare AI deployments — not because the tools don't exist, but because they require a different kind of expertise to apply.
Both layers are required. Regulatory compliance without behavioral validation means a compliant product that doesn't work. Behavioral validation without compliance means outcomes that can't be deployed. The disciplines are complementary, not interchangeable.
Behavioral science isn't the study of how people should behave. It's the study of how they actually behave — under uncertainty, under stress, in the context of their own beliefs, habits, and ambivalence.
That's exactly the population your AI agent is talking to.
The frameworks exist. Motivational Interviewing was designed precisely for the patient who is ambivalent about change. CBT maps the cognitive patterns that keep people stuck. Behavior change theory tells you which lever to pull at which moment in a patient's journey.
What's new is applying them at scale, in real time, through a conversational AI — and then validating that they actually work. Not assuming. Validating.
That's what I do.
Each engagement is scoped as a defined project with clear deliverables. Whether you are building an agent or evaluating one, the methodology is the same: evidence-based, and designed to produce findings you can act on and defend.
My process is designed to be high-impact and low-overhead.
You provide the clinical context and system access; I provide the forensic evidence and the roadmap forward.
For teams at the design stage, and for evaluators assessing whether a vendor understood their patient population
Most product teams work from personas that describe who their patients are. Patient Intelligence goes further: it maps how they think, what drives and blocks behavior change in this specific population, what communication styles build trust, and what scenarios the agent is almost certain to encounter.
This is the foundation that makes everything downstream more accurate — conversation architecture, validation criteria, brand voice. Without it, the agent is designed for an imaginary patient. For evaluators, this analysis also provides the benchmark against which a vendor's claimed patient understanding can be tested.
A behavioral profile of the target population — qualitative and quantitative — including early indicators, personality dimensions, motivational drivers, communication preferences, high-risk scenarios, and prompt design recommendations.
Hospital and pharma teams evaluating whether a vendor's product was designed with adequate understanding of their specific patient population.
Right for you if: You are pre-build, your agent was built without this foundation, or you need an independent baseline for evaluating a vendor's claims about patient fit.
For teams pre-launch, and for evaluators auditing a vendor product before procurement
Behavioral Stress Testing is a systematic audit of an agent against the full range of scenarios the real patient population is likely to present: ambivalence, resistance, emotional distress, off-script disclosures, and edge cases. It runs simulated conversations across hundreds of probable scenarios and identifies where the agent's responses are correct on paper but wrong in effect: technically sound, behaviorally damaging.
This is not a bug hunt. It is a behavioral fidelity assessment.
A structured audit report mapping agent performance across scenario categories, with severity ratings, failure patterns, and prioritized recommendations. Suitable for internal remediation or for informing procurement decisions.
Pharma and hospital procurement teams requiring independent behavioral validation of a vendor's AI product before contract signature or patient deployment.
Right for you if: You have a working system approaching launch, or you are a buyer needing independent evidence of behavioral performance before committing.
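As a rough illustration of what a scenario-level audit summary could look like, the sketch below groups rated simulated conversations by scenario category and ranks categories by worst severity and failure rate. The severity scale, the `ScenarioResult` record, the `summarize` helper, and the sample ratings are all hypothetical placeholders for illustration; they are not the actual audit tooling or real findings.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical severity scale; the real rating scheme is an assumption here.
SEVERITY = {"none": 0, "minor": 1, "major": 2, "critical": 3}


@dataclass(frozen=True)
class ScenarioResult:
    category: str  # e.g. "ambivalence", "resistance"
    severity: str  # a key of SEVERITY, assigned by a human or automated rater


def summarize(results):
    """Group rated conversations by category; report count, failure rate, worst severity."""
    by_category = defaultdict(list)
    for r in results:
        by_category[r.category].append(SEVERITY[r.severity])

    summary = {}
    for category, scores in by_category.items():
        failures = [s for s in scores if s > 0]
        summary[category] = {
            "n": len(scores),
            "failure_rate": len(failures) / len(scores),
            "worst": max(scores),
        }
    # Worst severity first, then highest failure rate, to help prioritize remediation.
    ranked = sorted(
        summary.items(),
        key=lambda kv: (-kv[1]["worst"], -kv[1]["failure_rate"]),
    )
    return dict(ranked)


# Made-up ratings, for illustration only.
results = [
    ScenarioResult("ambivalence", "major"),
    ScenarioResult("ambivalence", "none"),
    ScenarioResult("resistance", "minor"),
    ScenarioResult("emotional distress", "critical"),
]
report = summarize(results)
```

In practice the ratings would come from a defined behavioral rubric applied to each simulated conversation; the aggregation step only makes the failure patterns easier to compare across categories.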
For teams with live systems underperforming, and for evaluators investigating a deployed vendor product
Behavioral Forensics is a methodological post-deployment investigation. Using conversation data, dropout patterns, and behavioral analysis, it identifies where the agent is losing patients, what the underlying behavioral mechanisms are, and what changes will actually move outcomes — as opposed to what feels intuitive but won't.
This is not a UX audit. It is a behavioral diagnosis with a prioritized remediation roadmap.
A forensic analysis including population simulations, failure taxonomy, patient segment breakdowns, root cause hypotheses with evidence, and a prioritized remediation roadmap. Actionable by both internal product teams and vendor management.
Hospital and pharma teams investigating why a deployed vendor product is not producing the outcomes that were contracted, and building an evidence-based case for remediation or contract review.
Right for you if: A live system is underperforming and you need to understand the behavioral mechanisms before deciding how to respond.
If you know something isn't working — or you need to know whether a vendor's product will work — that is a sufficient starting point.
Three projects across different domains, stages, and client types.
Methodology, sample sizes, and findings below.
The full paper is available upon request.
Xoltar was developing a conversational AI agent for smoking cessation — a domain where digital interventions have a long history of strong early engagement and poor sustained outcomes. The clinical challenge was not producing an AI that could discuss quitting smoking. It was producing one that could support a patient through the full behavioral arc of cessation.
The commercial challenge was equally specific: the market was saturated with ChatGPT-based products making unsubstantiated therapeutic claims. Xoltar needed peer-reviewed evidence to differentiate credibly and secure pharmaceutical partnerships.
I designed the therapeutic architecture of the agent from the ground up, grounding it in Motivational Interviewing — a clinical framework developed for patients who are ambivalent about behavior change. The architecture comprised 11 goal-progression stages, maintaining natural conversational flow while ensuring therapeutic fidelity across the patient journey.
Prior to live deployment, I ran hundreds of simulated conversations to stress-test the agent against the full range of patient scenarios: resistance, relapse, ambivalence, and off-script disclosures. Automated quality assurance systems monitored therapeutic fidelity at scale. The trial used a randomized controlled design comparing the agent against an educational video control and a ChatGPT condition.
Structure beats sophistication. A behaviorally grounded agent with a defined therapeutic architecture outperformed a more capable general model across every measured outcome — not despite its constraints, but because of them. Therapeutic rigor does not limit engagement. It produces it.
A network of chronic pain clinics had deployed a proactive chatbot to encourage patients to complete prescribed home therapeutic exercises. Homework adherence, one of the strongest predictors of treatment outcomes in chronic pain, was at 31%. The technology was functioning correctly. The behavioral outcomes were not.
I conducted a behavioral analysis of the patient population and the existing messaging architecture, identifying the gap between how the chatbot was communicating and how different patient segments actually respond to health prompts. The intervention was targeted: personalizing the proactive messaging at the end of each session and between sessions, to match individual patient communication styles, motivational profiles, and behavioral patterns — rather than broadcasting the same message to all patients regardless of profile.
Patients do not fail to adhere because they don't care. They fail because the intervention was not designed for them specifically. When messaging matches how a person actually responds to encouragement, challenge, or accountability, the behavior follows.
An oncology department treating ovarian and endometrial cancer patients needed a way to monitor chemotherapy side effects between clinical visits — without adding burden to an already stretched medical team. The gap between sessions is precisely where symptoms escalate undetected, trust erodes, and outcomes suffer.
I defined the behavioral roadmap and conducted a behavioral assessment and intervention with an AI avatar delivering video-based conversational check-ins to patients post-chemotherapy. The evaluation focused on how the agent handled the emotional and relational demands specific to oncology: fear, physical distress, and the need to feel genuinely heard. Behavioral calibration ensured the avatar could sustain patient engagement across 83 sessions — and that patients perceived it not as a form to fill out, but as a presence worth talking to.
Over 1,000 nonverbal gestures directed at an AI is not a UX metric. It is a signal about what happens when an agent is behaviorally calibrated for its context. Patients in chemotherapy don't engage with tools — they engage with presences. When the avatar behaved in ways that felt attuned to their emotional state, they responded as they would to a person: with nods, with smiles, with disclosure. That disclosure is what made the clinical monitoring possible. The behavior was the infrastructure.
Every engagement starts with understanding the specific population, context, and behavioral claim being made or tested.
Where healthcare AI actually succeeds or fails. Writing on behavioral science, patient behavior, and what it takes to build AI agents that produce real outcomes.
No newsletter cadence — just a note when something new appears.
I work with organizations that have a specific behavioral question to answer — whether that is understanding a patient population before building, stress-testing an agent before launch, diagnosing why a live system is underperforming, or independently evaluating a vendor's behavioral claims before procurement.
If the question is whether a healthcare AI product actually changes patient behavior — and whether you can prove it — that is the right starting point for a conversation.
Engagements are scoped to the specific question, population, and product stage — a pre-launch behavioral audit for a healthtech startup is a different undertaking than an independent evaluation for a pharma procurement team. Projects typically run 2–12 weeks. Pricing follows a scoping conversation, once the actual question is clear.
I respond to every inquiry within 48 hours.
If we're not a fit, I'll tell you directly.
If you are looking for general AI development, technical implementation, regulatory compliance consulting, or someone to validate a decision that has already been made, I am not the right fit.