
Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification

arXiv: 2506.04450

PAPER UNDER REVIEW


Title: Differentially Private LoRA for Radiology Reports: a useful step, not a panacea

Introduction

I read "Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification" with interest because it tackles a real operational problem: how to fine-tune language models on clinical text while reducing the risk of patient data leakage. The authors combine differential privacy with Low-Rank Adaptation (LoRA) and a student-teacher style labeling pipeline, and they evaluate on MIMIC-CXR and CT-RATE. The headline result is that with a moderate privacy budget they reach weighted F1 up to 0.89, close to non-private LoRA (0.90) and not far from full fine-tuning (0.96). That is worth paying attention to, but it is not the whole story.

Technical summary

The paper integrates three pieces.

First, differential privacy during fine-tuning. They apply a DP training procedure to limit how much any single report can influence model parameters. That is standard DP-SGD style training with per-example gradient clipping and additive noise, plus an accountant to track the privacy budget.

Second, parameter-efficient fine-tuning via LoRA. Instead of updating millions or billions of parameters, they inject low-rank adapters and only fine-tune those. This reduces the number of parameters that receive noisy, private updates and lowers memory and compute costs for DP training, which is important because per-example gradients are expensive.
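The mechanics are easy to see in a toy sketch. The dimensions, names, and initialisation below are illustrative, not the paper's configuration; the key point is that only the two small matrices receive (noisy) updates while the base weight stays frozen.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    """Forward pass with a frozen base weight W plus a low-rank update
    (alpha / r) * B @ A. Only A (r x d_in) and B (d_out x r) are trained,
    so DP noise touches far fewer parameters than full fine-tuning."""
    return x @ (W + (alpha / r) * B @ A).T

d_in, d_out, r = 64, 32, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init
x = rng.normal(size=(4, d_in))

# With B initialised to zero, the adapter starts as an exact no-op.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Here the adapter has r * (d_in + d_out) = 768 trainable parameters versus 2048 for the full weight; at transformer scale the ratio is far more dramatic.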

Third, they use labels produced by a larger LLM to train smaller models. This is a student-teacher idea: a powerful model annotates text and a smaller, private model is trained under DP constraints on those labels. The intent is to combine label quality from a strong model with the efficiency and privacy of smaller models.
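In outline, the labeling stage looks like the following toy sketch. The keyword "teacher" here is a stand-in for the large annotating LLM, not the paper's actual model or prompt; only the data flow is the point.

```python
# Toy student-teacher labeling sketch: a "teacher" (standing in for the
# large LLM) assigns labels to unlabeled reports; the small student model
# is then trained under DP on those pseudo-labels.
def teacher_label(report: str) -> int:
    # Illustrative keyword rule only -- a real teacher is an LLM prompt.
    return 1 if "effusion" in report.lower() else 0

reports = [
    "Small right pleural effusion.",
    "No acute cardiopulmonary abnormality.",
]
pseudo_labels = [teacher_label(r) for r in reports]
# pseudo_labels then serve as targets in the DP training loop above.
```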

They evaluate on MIMIC-CXR and CT-RATE for multi-abnormality classification and sweep privacy regimes. The main empirical takeaway is that DP-LoRA can reach near non-private LoRA performance at moderate privacy settings, with some gap to full fine-tuning.

My perspective

I appreciate the paper for focusing on deployment realities. Running DP on full models is often infeasible. Using LoRA to shrink the tunable parameter set is a concrete engineering maneuver that makes private fine-tuning tractable. This is the kind of pragmatism I want to see more of.

That said, there are several things I want to flag.

First, differential privacy is subtle. A low epsilon number feels reassuring but does not eliminate all risks. DP bounds the worst-case influence of an individual training example on the model parameters under a specific threat model. It does not make outputs magically safe in every scenario. If your deployment allows free querying or you run the model upstream of other systems, new leakage channels appear. The paper reports performance across privacy regimes, but I would like to see empirical attack simulations: membership inference, model inversion, and extraction attacks tailored to text models. Stating DP guarantees is necessary. Demonstrating empirical resistance to realistic attacks is far more compelling for clinical deployments.
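To make concrete what such an evaluation could start with: the simplest membership-inference baseline is a loss-threshold attack, sketched below. This is my own illustration, not an attack from the paper; real evaluations (e.g. LiRA-style attacks) are considerably stronger.

```python
def loss_threshold_mia(member_losses, nonmember_losses, threshold):
    """Simplest membership-inference baseline: predict 'member' when the
    model's loss on an example falls below a threshold. Attack accuracy
    near 0.5 suggests little leakage via this channel; well above 0.5
    suggests the model memorised its training examples."""
    preds = ([loss < threshold for loss in member_losses] +
             [not (loss < threshold) for loss in nonmember_losses])
    return sum(preds) / len(preds)
```

Even this crude baseline, run on held-out versus training reports, would give readers an empirical anchor alongside the formal epsilon.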

Second, the student-teacher labeling step needs careful scrutiny. If the larger LLM annotates private reports, then that LLM is exposed to private text during labeling. If that labeling happens in-house, fine, but if it uses an external API the privacy calculus changes. Even within a private network, noisy teacher labels can introduce systematic errors. The paper does not fully unpack how label noise interacts with DP training. With DP you already reduce signal by adding noise; combining that with imperfect labels can degrade sensitivity to rare findings, which are exactly the clinically important cases.

Third, dataset and metric limits matter. MIMIC-CXR is a great public benchmark, but it is also class-imbalanced, and external generalization is nontrivial. Weighted F1 is useful, but for clinical adoption you care about false negatives for critical findings, calibration, and performance on rare abnormalities. The paper shows promising aggregate numbers, but I would want per-class precision/recall and calibration plots, especially for low-prevalence labels. A drop from 0.96 to 0.89 overall can hide much larger drops for clinically important classes.
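The per-class numbers I am asking for are cheap to compute. A plain-Python sketch (function name mine; scikit-learn's classification report does the same job):

```python
def per_class_prf(y_true, y_pred, labels):
    """Per-class precision and recall, so a collapse on a rare finding
    is visible instead of being averaged away by weighted F1."""
    out = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        out[c] = (precision, recall)
    return out
```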

Fourth, implementation complexity and hyperparameter sensitivity are underplayed. DP training requires choices for clipping norm, noise multiplier, microbatch size, and accountant method. These choices materially affect both privacy and utility. LoRA rank and adapter placement further complicate tuning. In practice teams will need robust recipes and strong monitoring. The paper gives a proof of concept but not a runbook for production.

Finally, one structural point: DP fine-tuning reduces the influence of private training data on the adapter weights. But the base pretrained model still contains information from its pretraining corpora. If any of that contains sensitive or proprietary content, or if the base model was exposed to datasets with insufficient provenance, you still have residual risk. DP fine-tuning is not a substitute for careful model selection and governance.

Implications for clinical practice

What this paper gives us is a practical pathway to more privacy-aware clinical NLP models. For institutions that want to train on their own radiology reports and then share models or adapters, DP-LoRA can make that feasible without prohibitive compute cost. That is valuable.

However, adopting this approach requires realistic expectations. DP is a tool, not a guarantee. It should sit inside a broader risk framework that includes access controls, query limits, auditing, and a plan for rare-event performance monitoring. When automating report classification, clinicians will care about asymmetric errors. Any deployment should evaluate per-label harms and run prospective validation on local data.

For teams building these systems I recommend three concrete steps beyond the methods in the paper: test model resistance to targeted extraction and membership attacks, report per-class metrics and calibration, and build a monitoring pipeline that alerts on distribution shift and performance degradation on rare labels.
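The third step can start very simply: compare current per-label recall against a validation baseline and flag regressions. An illustrative sketch; the threshold, names, and data are mine, and a production pipeline would add drift statistics and alert routing.

```python
def rare_label_alert(recall_by_label, baselines, max_drop=0.05):
    """Return the labels whose recall has fallen more than max_drop
    below its validation baseline -- a minimal stand-in for the
    monitoring pipeline suggested above."""
    return sorted(
        label for label, recall in recall_by_label.items()
        if baselines[label] - recall > max_drop
    )
```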

Conclusion

The paper moves the conversation forward by showing that private fine-tuning is not purely theoretical and that LoRA makes DP training workable for radiology report classification. It is a practical contribution. At the same time, the privacy community and clinical teams should avoid overconfidence. Differential privacy narrows one source of risk but does not remove the need for careful labeling, auditing, and clinical validation. I would like to see follow-up work that stresses the method on rare findings, evaluates empirical attacks, and publishes operational guidance for deployment in hospitals. Until then, DP-LoRA is a useful tool in the toolbox, not a plug-and-play solution.