
arXiv: 2603.17234

PAPER UNDER REVIEW

Deployment and Evaluation of an EHR-integrated, Large Language Model-Powered Tool to Triage Surgical Patients


Using an LLM to triage surgical patients in the EHR: promising results, practical gaps

Introduction

I read the Stanford group’s arXiv paper (2603.17234) with practical interest. They deployed an EHR-integrated, large language model-powered tool called SCM Navigator to identify patients who should receive surgical co-management (SCM). The study is prospective and unblinded, and they report high sensitivity and reasonable specificity while keeping clinicians in the loop. That is the kind of work I pay attention to: models used in live workflows, not just on test sets. The headline is attractive; the reality behind it is more nuanced.

What they did, technically

SCM Navigator consumed preoperative notes and structured EHR data, applied a set of perioperative morbidity criteria, and labeled each surgical patient as appropriate, possibly appropriate, or not appropriate for SCM. The output was shown to faculty hospitalists who could accept or override the recommendation and provide free-text reasons when they disagreed. The authors measured sensitivity, specificity, positive predictive value, and negative predictive value against physician determinations. They also categorized free-text disagreement reasons and performed manual chart review of all false negatives and a sample of false positives.
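To make the human-in-the-loop piece concrete, here is a minimal sketch of how a model recommendation and a clinician override might be captured together. The paper does not describe its data model, so every name here is my own invention:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class TriageLabel(Enum):
    # The three-way output described in the paper
    APPROPRIATE = "appropriate"
    POSSIBLY_APPROPRIATE = "possibly appropriate"
    NOT_APPROPRIATE = "not appropriate"

@dataclass
class TriageRecord:
    """One patient's triage event: model output plus the human verdict."""
    patient_id: str
    model_label: TriageLabel                        # what the tool recommended
    clinician_label: Optional[TriageLabel] = None   # final hospitalist determination
    override_reason: Optional[str] = None           # free text captured on disagreement

    @property
    def overridden(self) -> bool:
        # True only when a clinician has weighed in and disagreed
        return (self.clinician_label is not None
                and self.clinician_label != self.model_label)
```

Keeping the free-text reason on the same record as both labels is what makes the disagreement analysis the authors describe possible later.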

In 6,193 cases triaged since deployment, the tool recommended SCM for 1,582 patients (23%). Reported sensitivity was 0.94 and specificity 0.74. The authors attribute most discrepancies to modifiable gaps in clinical criteria, institutional workflow, or physician practice variability, and they report that true LLM misclassification accounted for 2 of 19 false negatives flagged on chart review.
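All four reported metrics come from a single 2x2 confusion table. The abstract does not publish the raw counts, so the counts below are purely illustrative, chosen only to reproduce the reported sensitivity and specificity:

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard screening metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true SCM candidates flagged
        "specificity": tn / (tn + fp),  # fraction of non-candidates correctly passed over
        "ppv": tp / (tp + fp),          # chance a flagged patient is a true candidate
        "npv": tn / (tn + fn),          # chance an unflagged patient truly is not one
    }

# Illustrative counts only; the paper's actual 2x2 table is not public.
m = screening_metrics(tp=940, fp=260, fn=60, tn=740)
```

Note how PPV depends on prevalence in a way sensitivity does not: the same 0.94-sensitive model will look much worse on PPV in a population where SCM candidates are rare.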

A few technical points are left vague in the abstract. The arXiv version does not make clear which LLM was used, how prompts were designed, whether retrieval augmentation was used, or how often structured data versus free text drove decisions. Those are important for reproducibility and risk assessment.

My analysis and perspective

There is clear value here. Identifying SCM candidates manually is tedious and error-prone. A tool that screens and flags likely candidates can save time and standardize identification. The high sensitivity is appropriate for a screening tool: missing an SCM candidate carries higher potential harm than flagging an extra patient for review. The human-in-the-loop design is also sensible. Clinicians retained final judgment and provided disagreement data that can be used to iteratively improve the system.

But promising operational results do not mean the job is done. I see several issues that matter if you want this to function reliably in other hospitals or at scale.

Unblinded clinician review introduces bias. Physicians saw the model recommendation before recording their judgment. That can inflate measured agreement because clinicians may anchor on the suggestion. A stronger design would include a blind assessment arm or randomization so you can estimate how much the model changes decisions versus simply informing them.

The reference standard is clinician determination, which is subjective. The authors acknowledged variability in physician practice. That matters when you report sensitivity and specificity. If the goal is to standardize to evidence-based SCM criteria, then the ground truth should be those criteria applied by independent raters, not necessarily the immediate clinician decision. If the goal is to predict which patients clinicians want SCM for, then clinician judgment is the right target. Clarify the aim.
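If you do go the independent-rater route, measure agreement between raters before treating their labels as ground truth. A chance-corrected statistic such as Cohen's kappa is the usual starting point; a minimal sketch for binary labels (not something the paper itself computes):

```python
def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two raters' binary (0/1) labels on the same cases."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa1 = sum(a) / n                                 # rater A's rate of label 1
    pb1 = sum(b) / n                                 # rater B's rate of label 1
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)        # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)
```

If kappa between your independent raters is mediocre, the "evidence-based criteria" reference standard is not as objective as it sounds, and that uncertainty should propagate into how you read the model's sensitivity and specificity.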

Model transparency and technical detail matter for clinical trust. The paper does not detail the model, prompts, or failure modes. In my work I insist on logging the exact inputs used, tokenized prompts, and intermediate retrieval steps. Those are necessities for auditing, for debugging cases like the two true LLM errors, and for responding to regulatory questions.
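A minimal version of the kind of audit log I mean, with hypothetical names and a plain append-only list standing in for whatever store a real deployment would use. Hashing the prompt and inputs lets you verify later that a logged decision was produced from exactly the data you think it was:

```python
import hashlib
import json
import time

def log_inference(store: list, *, patient_id: str, model_version: str,
                  prompt: str, inputs: dict, output: str) -> dict:
    """Append one auditable record of a model call."""
    record = {
        "ts": time.time(),
        "patient_id": patient_id,
        "model_version": model_version,   # pin the exact model, not just "the LLM"
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "inputs_sha256": hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
        "output": output,
    }
    store.append(record)
    return record
```

In a real system you would log the retrieval steps too, and the store would be write-once; the point is that each of the two "true LLM error" cases should be fully reconstructable from records like this.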

Integration costs and data quality are underplayed. EHRs vary wildly in how preoperative notes are written and how structured fields are populated. A model tuned to Stanford documentation might fail in a community hospital where the preop note uses different phrasing. The authors note workflow and criteria gaps as causes of disagreement. That is predictable. Deploying this elsewhere will require local adaptation and ongoing monitoring.

False positives and alert fatigue deserve attention. Specificity of 0.74 is acceptable if the review burden created by false positives is small. A 23% positive rate means hospitalists must evaluate a nontrivial caseload of flagged patients. If the tool increases workload or creates low-value interruptions, it will be ignored. The paper’s manual review suggested many false positives were due to modifiable criteria gaps. That suggests iterative rules or threshold tuning could reduce unnecessary alerts. But the team needs to publish how they operationalized that tuning.
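Threshold tuning of this kind is straightforward to prototype if the model exposes a confidence score, which the paper does not say SCM Navigator does. A sketch of the trade-off you would sweep: as the threshold rises, the flag rate (and hospitalist workload) falls, and you watch how much sensitivity you give up:

```python
def sweep_thresholds(scores: list, labels: list, thresholds: list) -> list:
    """For each threshold, report flag rate and sensitivity.

    scores: model confidence that a patient needs SCM
    labels: 1 if the patient truly needs SCM, else 0
    """
    n = len(scores)
    n_pos = sum(labels)
    rows = []
    for t in thresholds:
        flagged = [s >= t for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y)
        rows.append({
            "threshold": t,
            "flag_rate": sum(flagged) / n,            # review burden proxy
            "sensitivity": tp / n_pos if n_pos else float("nan"),
        })
    return rows
```

For a screening tool you would pick the highest threshold that still holds sensitivity where you need it, then re-check on fresh data because documentation drift moves the curve.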

Finally, safety and governance questions need explicit answers. Who signs off when the model misses a high-risk patient? How are overrides captured and fed back? What monitoring is in place for model drift once surgical practice or documentation patterns change? These are the operational details that determine whether a deployed model remains safe.

What this means for practice

If you are considering a similar system, take away these practical points.

  • Treat the model as an assistant, not an authority. Keep the human-in-the-loop and make it easy for clinicians to override and document why.
  • Define your evaluation target clearly. Are you matching clinician behavior or codified eligibility criteria? The choice affects labeling, metrics, and deployment limits.
  • Expect to adapt prompts and rules to local documentation. Don’t expect the model to generalize without retraining or prompt updates.
  • Build audit logs and reproducible prompts from day one. You will need them for troubleshooting, compliance, and medico-legal questions.
  • Monitor operational metrics, not just classification metrics. Track review time per flagged patient, acceptance rate, and whether SCM referrals lead to the expected downstream actions.
  • Plan governance. Decide who owns model maintenance, threshold tuning, and the feedback loop from overrides.
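To make the operational-metrics point concrete, a sketch of the kind of rollup I would compute from override logs. The field names are my own invention:

```python
from statistics import median

def operational_summary(reviews: list) -> dict:
    """Summarize clinician reviews of flagged patients.

    reviews: dicts with 'accepted' (bool: clinician agreed with the flag)
             and 'review_seconds' (float: time spent on the review)
    """
    n = len(reviews)
    return {
        "n_reviews": n,
        "acceptance_rate": sum(r["accepted"] for r in reviews) / n,
        "median_review_seconds": median(r["review_seconds"] for r in reviews),
    }
```

A falling acceptance rate or a climbing median review time is often the first visible symptom of drift, well before classification metrics on a periodic audit sample move.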

Bottom line

This paper is a useful, pragmatic example of using an LLM inside the EHR to automate a triage task. The reported sensitivity and clinician-in-the-loop model are appropriate for screening. The next steps are harder: blinded evaluations, multi-center validation, transparent model descriptions, and operational monitoring. If you want a system that reduces clinician burden rather than adding to it, those practicalities matter more than the brand of model you use.