Whether you are building an AI agent or evaluating one, the question is the same: does it actually change patient behavior?
I provide the scientific methodology to answer it.
Vendors make behavioral claims. You need independent verification that those claims hold before you commit budget, integrate systems, or expose patients to the product.
You need to know your agent is behaviorally sound before it reaches patients — and that it produces outcomes you can defend to partners, investors, and regulators.
Either way, you have to measure it right.
The engineering stack gets tested rigorously. Security, hallucinations, failure contingencies, fringe behaviors — these have established frameworks and active solutions.
The behavioral layer does not.
Whether a patient trusts the agent, engages with it over time, and actually changes their behavior as a result — these outcomes are assumed, not validated.
That's where products fail, and where my work begins.
Regulatory compliance — HIPAA, GDPR, SaMD classification, pharmacovigilance routing — is the responsibility of the product and its engineering and legal teams. My work sits at a distinct and equally necessary layer: measuring whether the behavioral intervention actually produces the outcomes it claims, within whatever compliant architecture you have built.
These are different disciplines. Both are required.
There's a quiet crisis in healthcare AI.
Companies are shipping conversational agents that are technically impressive — large language models fine-tuned on clinical data, integrated into care pathways, compliant with every regulation that matters. And then patients don't use them. Or they use them once. Or they use them and nothing changes.
The technology worked. The product didn't.
This is the last-mile problem. And it has nothing to do with model performance.
To be clear: the engineering problems in healthcare AI are real, and the field has made serious progress on them.
Getting a capable LLM working is no longer the bottleneck it once was.
Security and data privacy frameworks exist and are maturing.
Hallucination rates are measurable and increasingly controllable.
Failure contingencies — what happens when the primary model is unavailable, returns a low-confidence output, or hits an edge case — are becoming standard architecture.
Fringe behaviors, the long tail of unexpected inputs that break a system in production, are increasingly anticipated and tested for.
None of this is trivial. All of it is necessary.
And none of it is sufficient.
A patient who is ambivalent about change.
An agent that responds correctly but feels cold.
A conversation that covers all the right clinical ground and leaves the patient feeling unseen.
A dialogue structure that invites too many adverse event reports.
These aren't failure modes that show up in your QA suite.
They show up in your retention curve — weeks after launch, quietly, without a stack trace.
This is where the known pipeline ends.
And it ends precisely at the moment that matters most: the ongoing relationship between your agent and your patient.
Classical product design was a finite problem. You defined the screens, mapped the flows, anticipated the edge cases. Users chose from options you provided. QA meant checking every defined state.
Conversational AI has no defined states.
When a patient can say anything — share a fear, express ambivalence, push back, go completely off-script — the interaction space becomes effectively infinite. The agent needs to respond to all of it: therapeutically, consistently, and in a way that reflects your brand and your clinical intent.
You cannot engineer your way through that space. You have to map it behaviorally — identifying the scenarios that matter most, understanding how patients are likely to think and feel in each one, and designing responses that move them in the right direction.
This is what behavioral science was built for.
It just took conversational AI to make it indispensable.
Most teams underestimate the last mile because they're solving for one layer when there are three.
Does the agent respond appropriately to what the patient is actually feeling — not just what they literally said? Does it know when to push and when to hold back? Can it handle ambivalence without losing the thread?
Does the patient believe the agent is on their side? That it understands what they need? Trust in healthcare AI is fragile and non-linear — it builds slowly and collapses fast. The agent needs to be designed around how trust actually works, not assumed.
Patients are not a monolith. Some want direct guidance. Others need to feel heard first. A well-designed agent doesn't just respond — it reads the person and adapts in real time.
Get all three right, and you have an agent that patients return to.
Get any one wrong, and you have a product that works in demos and fails in the field.
Regulatory compliance — HIPAA, GDPR, SaMD classification, pharmacovigilance routing, adverse event flagging — is the responsibility of the product, its engineering team, and its legal counsel. These are distinct and well-defined disciplines with established frameworks.
My work sits at a separate layer: measuring whether the behavioral intervention produces the outcomes it claims. Whether patients engage, trust, adhere, and change behavior. These outcomes are currently unmeasured in most healthcare AI deployments — not because the tools don't exist, but because they require a different kind of expertise to apply.
Both layers are required. Regulatory compliance without behavioral validation means a compliant product that doesn't work. Behavioral validation without compliance means outcomes that can't be deployed. The disciplines are complementary, not interchangeable.
Behavioral science isn't the study of how people should behave. It's the study of how they actually behave — under uncertainty, under stress, in the context of their own beliefs, habits, and ambivalence.
That's exactly the population your AI agent is talking to.
The frameworks exist. Motivational Interviewing was designed precisely for the patient who is ambivalent about change. CBT maps the cognitive patterns that keep people stuck. Behavior change theory tells you which lever to pull at which moment in a patient's journey.
What's new is applying them at scale, in real time, through a conversational AI — and then validating that they actually work. Not assuming. Validating.
That's what I do.
Each engagement is scoped as a defined project with clear deliverables. Whether you are building an agent or evaluating one, the methodology is the same: rigorous, evidence-based, and designed to produce findings you can act on and defend.
My process is designed to be high-impact and low-overhead.
You provide the clinical context and system access;
I provide the forensic evidence and the roadmap forward.
For teams at the design stage, and for evaluators assessing whether a vendor understood their patient population
Most product teams work from personas that describe who their patients are. Patient Intelligence goes further: it maps how they think, what drives and blocks behavior change in this specific population, what communication styles build trust, and what scenarios the agent is almost certain to encounter.
This is the foundation that makes everything downstream more accurate — conversation architecture, validation criteria, brand voice. Without it, the agent is designed for an imaginary patient. For evaluators, this analysis also provides the benchmark against which a vendor's claimed patient understanding can be tested.
A behavioral profile of the target population — qualitative and quantitative — including early indicators, personality dimensions, motivational drivers, communication preferences, high-risk scenarios, and prompt design recommendations.
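As a purely illustrative sketch of how that profile feeds downstream work, here is one way its dimensions can be expressed as a structured artifact for prompt design. Every field name and example value below is hypothetical, not a template from an actual engagement.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the dimensions a behavioral profile can hand to prompt design.
# All names and example values are invented for illustration.
@dataclass
class PatientSegmentProfile:
    segment_name: str
    motivational_drivers: list[str]
    communication_preferences: list[str]
    high_risk_scenarios: list[str]
    prompt_guidance: list[str] = field(default_factory=list)

example = PatientSegmentProfile(
    segment_name="recently diagnosed, high ambivalence",
    motivational_drivers=["autonomy", "concern for family"],
    communication_preferences=["validate before advising", "short, concrete next steps"],
    high_risk_scenarios=["disengagement after a setback", "fear-driven avoidance of test results"],
    prompt_guidance=["reflect the stated concern before offering any plan"],
)
```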
Hospital and pharma teams evaluating whether a vendor's product was designed with adequate understanding of their specific patient population.
Right for you if: You are pre-build, your agent was built without this foundation, or you need an independent baseline for evaluating a vendor's claims about patient fit.
For teams pre-launch, and for evaluators auditing a vendor product before procurement
Behavioral Stress Testing is a systematic audit of an agent against the full range of scenarios the real patient population is likely to present — ambivalence, resistance, emotional distress, off-script disclosures, edge cases. It runs simulated conversations across hundreds of probable scenarios and identifies where the agent responds correctly but wrongly: technically sound, behaviorally damaging.
This is not a bug hunt. It is a behavioral fidelity assessment.
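To make the shape of the exercise concrete, here is a heavily simplified sketch of a scenario-based simulation harness. It assumes a generic agent_respond callable and a toy rating rubric; the scenario texts, categories, and rubric are invented for illustration, and real audits rely on hundreds of scripted scenarios and rubric-based behavioral coding rather than keyword checks.

```python
from dataclasses import dataclass
from collections import Counter

# Illustrative scenario bank; a real audit scripts hundreds of scenarios per population.
SCENARIOS = [
    {"category": "ambivalence", "patient_turn": "I know I should quit, but I'm not sure I really want to."},
    {"category": "resistance",  "patient_turn": "You're a bot. Why would I take advice from you?"},
    {"category": "distress",    "patient_turn": "I slipped last night and I feel like a failure."},
]

@dataclass
class Finding:
    category: str
    severity: str   # "pass", "minor", "major", "critical"
    note: str

def rate_response(scenario: dict, reply: str) -> Finding:
    # Toy rubric: flag advice given before any acknowledgement of the patient's feeling.
    # A real assessment uses trained raters or a validated behavioral coding scheme.
    acknowledges = any(p in reply.lower() for p in ["sounds like", "makes sense", "i hear"])
    advises = any(p in reply.lower() for p in ["you should", "you need to", "just try"])
    if advises and not acknowledges:
        return Finding(scenario["category"], "major", "advice before acknowledgement")
    return Finding(scenario["category"], "pass", "acceptable")

def run_audit(agent_respond) -> Counter:
    """agent_respond: any callable mapping a patient turn to the agent's reply."""
    findings = [rate_response(s, agent_respond(s["patient_turn"])) for s in SCENARIOS]
    return Counter((f.category, f.severity) for f in findings)

# Example with a deliberately naive stand-in agent:
print(run_audit(lambda turn: "You should just set a quit date and stick to it."))
```

The point of the sketch is the structure: scenarios in, behavioral findings out, aggregated into failure patterns rather than a pass/fail bug list.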
A structured audit report mapping agent performance across scenario categories, with severity ratings, failure patterns, and prioritized recommendations. Suitable for internal remediation or for informing procurement decisions.
Pharma and hospital procurement teams requiring independent behavioral validation of a vendor's AI product before contract signature or patient deployment.
Right for you if: You have a working system approaching launch, or you are a buyer needing independent evidence of behavioral performance before committing.
For teams with live systems underperforming, and for evaluators investigating a deployed vendor product
Behavioral Forensics is a rigorous post-deployment investigation. Using conversation data, dropout patterns, and behavioral analysis, it identifies where the agent is losing patients, what the underlying behavioral mechanisms are, and what changes will actually move outcomes — as opposed to what feels intuitive but won't.
This is not a UX audit. It is a behavioral diagnosis with a prioritized remediation roadmap.
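As a minimal illustration of the kind of analysis involved, the sketch below breaks down return rates by patient segment and by the conversation stage at which engagement stopped. The log schema, segment labels, and numbers are invented; real analyses draw on the full conversation corpus.

```python
import pandas as pd

# Hypothetical log schema: one row per patient, with a segment label, the stage at which
# they stopped engaging, and whether they came back. All column names and values are invented.
logs = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "segment":    ["direct-guidance", "needs-validation", "needs-validation",
                   "direct-guidance", "needs-validation", "direct-guidance"],
    "last_stage": ["goal-setting", "rapport", "rapport",
                   "maintenance", "rapport", "goal-setting"],
    "returned_within_7d": [True, False, False, True, False, True],
})

# Where, and for whom, is the agent losing patients?
dropout = (
    logs.groupby(["segment", "last_stage"])["returned_within_7d"]
        .agg(n="count", return_rate="mean")
        .sort_values("return_rate")
)
print(dropout)
```

A pattern like "validation-seeking patients disappearing at the rapport stage" is the kind of finding that turns a vague sense of poor retention into a behavioral mechanism you can act on.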
A forensic analysis including population simulations, failure taxonomy, patient segment breakdowns, root cause hypotheses with evidence, and a prioritized remediation roadmap. Actionable by both internal product teams and vendor management.
Hospital and pharma teams investigating why a deployed vendor product is not producing the outcomes that were contracted, and building an evidence-based case for remediation or contract review.
Right for you if: A live system is underperforming and you need to understand the behavioral mechanisms before deciding how to respond.
If you know something isn't working — or you need to know whether a vendor's product will work — that is a sufficient starting point.
Three projects across different domains, stages, and client types. Full methodology, sample sizes, and findings — because outcomes without rigor are claims, not evidence.
Xoltar was developing a conversational AI agent for smoking cessation — a domain where digital interventions have a long history of strong early engagement and poor sustained outcomes. The clinical challenge was not producing an AI that could discuss quitting smoking. It was producing one that could support a patient through the full behavioral arc of cessation.
The commercial challenge was equally specific: the market was saturated with ChatGPT-based products making unsubstantiated therapeutic claims. Xoltar needed peer-reviewed evidence to differentiate credibly and secure pharmaceutical partnerships.
I designed the therapeutic architecture of the agent from the ground up, grounding it in Motivational Interviewing — a clinical framework developed for patients who are ambivalent about behavior change. The architecture comprised 11 goal-progression stages, maintaining natural conversational flow while ensuring therapeutic fidelity across the patient journey.
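The production design is proprietary, but the core idea of stage-gated progression can be sketched in a few lines. Stage names and the advancement rule below are illustrative only, not the actual architecture.

```python
# Illustrative sketch of stage-gated progression; stage names and the advancement
# rule are hypothetical stand-ins for the 11-stage production design.
STAGES = [
    "engage",
    "explore_ambivalence",
    "evoke_change_talk",
    "strengthen_commitment",
    "plan_quit_attempt",
    "prepare_for_setbacks",
]

def next_stage(current: str, turn_assessment: dict) -> str:
    """Advance only when the behavioral goal of the current stage has been met,
    so the dialogue stays natural without abandoning the therapeutic arc."""
    i = STAGES.index(current)
    if turn_assessment.get("stage_goal_met") and i + 1 < len(STAGES):
        return STAGES[i + 1]
    return current

# Example: a turn that has not yet surfaced the patient's own reasons for change.
print(next_stage("evoke_change_talk", {"stage_goal_met": False}))  # stays in place
```

The constraint is the point: the agent cannot drift into advice-giving before the patient's own motivation has been evoked.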
Prior to live deployment, I ran hundreds of simulated conversations to stress-test the agent against the full range of patient scenarios — resistance, relapse, ambivalence, off-script disclosures. Automated quality assurance systems monitored therapeutic fidelity at scale. The clinical trial used a randomized controlled design comparing the agent against an educational video control and a ChatGPT condition.
Structure beats sophistication. A behaviorally grounded agent with a defined therapeutic architecture outperformed a more capable general model across every measured outcome — not despite its constraints, but because of them. Therapeutic rigor does not limit engagement. It produces it.
A network of chronic pain clinics had deployed a proactive chatbot to encourage patients to complete prescribed home therapeutic exercises. Homework adherence — one of the strongest predictors of treatment outcomes in chronic pain — was at 31%. The technology was functioning correctly. The behavioral outcomes were not.
I conducted a behavioral analysis of the patient population and the existing messaging architecture, identifying the gap between how the chatbot was communicating and how different patient segments actually respond to health prompts. The intervention was targeted: personalizing the proactive messaging to match individual patient communication styles, motivational profiles, and behavioral patterns — rather than broadcasting the same message to all patients regardless of profile.
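To illustrate the mechanism rather than the client's actual implementation, the sketch below selects a proactive prompt by motivational profile instead of broadcasting one message to everyone. Profile labels and message templates are invented.

```python
# Invented profile labels and templates, for illustration only.
TEMPLATES = {
    "autonomy_oriented":       "Your exercise plan is ready whenever you are. You set the pace today.",
    "accountability_oriented": "You finished 2 of 3 exercises yesterday. Ready to close the gap today?",
    "reassurance_oriented":    "Flare-up days happen. Even one gentle set today keeps your progress going.",
}

def proactive_message(profile: dict) -> str:
    """Match the prompt to how this patient responds to encouragement, challenge, or accountability."""
    return TEMPLATES.get(profile.get("motivational_style"), TEMPLATES["reassurance_oriented"])

print(proactive_message({"motivational_style": "accountability_oriented"}))
```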
Patients do not fail to adhere because they don't care. They fail because the intervention was not designed for them specifically. When messaging matches how a person actually responds to encouragement, challenge, or accountability — the behavior follows.
An oncology department was operating an AI conversational agent as part of their patient communication infrastructure. In a context where patients are frightened, facing consequential decisions, and acutely sensitive to institutional tone — the department needed to understand whether the agent was building or eroding the patient-institution relationship.
I conducted a behavioral assessment of the agent's performance across the patient journey, evaluating how it handled emotionally charged interactions, clinical uncertainty, and the specific trust dynamics of oncology care. The assessment identified where the agent was failing the emotional and relational demands of the context, and informed a targeted intervention to realign its behavior.
In high-stakes healthcare contexts, behavioral calibration of an AI agent is not a nice-to-have. It is clinical infrastructure. A patient who trusts their institution communicates more openly with clinical teams, engages more reliably with treatment, and produces better outcomes. The agent is part of that relationship whether it is designed for it or not.
Every engagement starts with understanding the specific population, context, and behavioral claim being made or tested.
My name is Alon Goldstein. I hold a PhD in Psychology from the Hebrew University of Jerusalem, where I studied non-conscious cognitive processes — the mental machinery that drives behavior below the level of awareness. That research turned out to be unusually good preparation for working with healthcare AI, which also needs to account for what patients do rather than what they say they will do.
Since then I have spent seven years at the intersection of behavioral science and AI — first as Chief Behavioral Research Officer at Xoltar, where I designed and validated a conversational AI agent for smoking cessation that achieved peer-reviewed clinical outcomes in a controlled trial, and now as an independent consultant working with healthtech companies, hospitals, and pharmaceutical organizations on the behavioral validation of healthcare AI.
My work covers three questions: Does this organization understand the behavioral reality of their patient population? Does this agent perform soundly across the full range of scenarios patients will bring to it? And when a live system is underperforming, what are the behavioral mechanisms at work and what will actually fix them?
The through-line is rigorous behavioral science applied to real products, validated against real outcomes — not frameworks that sound credible in a slide deck and dissolve on contact with patients.
I am based in Tel Aviv. I work with organizations internationally.
2025 — Selected for achievements in innovation, science, and societal impact.
For product teams, hospital innovation leads, and anyone building or evaluating something that needs to actually change patient behavior.
Most writing about AI in healthcare is either too technical to be useful or too vague to be actionable. This is an attempt at something in between — grounded in behavioral science, written for people making real decisions about real products.
I write about behavioral science concepts product teams should know but usually don't; field observations from validation work; where AI agents in healthcare are being built or evaluated incorrectly; and the evidence behind clinical frameworks like Motivational Interviewing and CBT — what they actually do, when they work, and when they don't.
Posts go up when there is something worth saying.
The gap between QA and behavioral reality — and why no engineering fix will close it.
The clinical framework explained for non-clinicians, and what it demands from a conversational agent.
The shift from classical UX to open conversation, and what it demands of anyone building or buying an AI agent.
A field observation on behavioral personalization — and what it reveals about why standard approaches fail.
The questions that separate rigorous behavioral claims from vendor marketing — and how to get answers.
I work with organizations that have a specific behavioral question to answer — whether that is understanding a patient population before building, stress-testing an agent before launch, diagnosing why a live system is underperforming, or independently evaluating a vendor's behavioral claims before procurement.
If the question is whether a healthcare AI product actually changes patient behavior — and whether you can prove it — that is the right starting point for a conversation.
If you are looking for general AI development, technical implementation, regulatory compliance consulting, or someone to validate a decision that has already been made, I am not the right fit.