Extracting Breast Cancer Phenotypes from Clinical Notes: Comparing LLMs with Classical Ontology Methods
LLMs versus ontologies for extracting breast cancer phenotypes: a practical read
I read arXiv:2604.06208 with the kind of skepticism I bring to most papers that compare large language models with established knowledge-driven approaches. The authors set out to extract breast cancer phenotypes from oncology provider notes and to compare an LLM-based information extraction pipeline with a classical, ontology-driven annotator (NCIt). Their headline finding is straightforward: an LLM-based framework can be adapted to extract the same phenotype classes with accuracy comparable to the ontology method, and once trained it can be more easily repurposed for other cancers.
Here is my take, from the perspective of someone who builds and deploys clinical AI systems.
Technical summary
The paper focuses on oncology EMR notes where oncologists write free text about chemotherapy outcomes, biomarkers, tumor location, sizes, and growth patterns. The authors implement an LLM-based pipeline to identify these phenotype elements and compare its output to a previous knowledge-driven pipeline that uses the NCIt Ontology Annotator.
Key claims are:
- The LLM approach achieves accuracy on par with the ontology-based method for a set of breast cancer phenotypes.
- The LLM pipeline can be fine-tuned or adapted to other cancer types with less engineering effort than ontology-driven systems.
The paper provides performance numbers that support those claims, though some methodological details are sparse or hard to find. It is not clear which exact model variants were used, how many clinical notes were included, what the annotation process looked like, or what the inter-annotator agreement was for the reference labels.
What I find interesting
The result itself is not a surprise, but it matters. Ontology-driven systems have long been the conservative choice in healthcare because they are predictable, auditable, and explicitly map to controlled vocabularies. But building and maintaining those systems is costly, and the resulting pipelines are brittle. LLMs offer a shortcut: they can learn language patterns, abbreviations, and implicit phrasing that rule systems miss, and they can be retuned for new domains with relatively little data.
The paper shows that LLMs can match ontology methods on the target tasks. That suggests real-world value: institutions with fewer engineering resources might get similar phenotype extraction performance faster with an LLM-first approach. For downstream uses like cohort discovery or research registries, that speed matters.
What concerns me
There are practical gaps that the paper does not resolve, and those are the things that decide success in the clinic.
First, the paper gives limited information on dataset size, label quality, and inter-annotator agreement. Extraction quality in notes hinges on consistent, reliable labels. Without that, reported accuracy numbers are hard to interpret.
Second, clinic-facing extraction is rarely a binary text classification problem. Oncology notes embed negation, hypothetical language, family history, temporality, and numeric measures. Detecting "no evidence of disease" is different from detecting "evidence of progression." Tumor size extraction needs precise numbers and units, and those are easy failure modes for LLMs that are trained on text rather than on numeric reasoning. The paper does not provide a detailed error analysis on these clinically thorny cases.
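Negation and numeric extraction are concrete enough to illustrate. Here is a minimal sketch of the kind of deterministic checks a validation layer could run over extracted mentions; the cue phrases and unit table are my own illustrative assumptions (a production system would use something like NegEx and a full unit normalizer), not anything described in the paper:

```python
import re

# Illustrative negation cues only; real systems use richer cue lists (e.g. NegEx).
NEGATION_CUES = ("no evidence of", "negative for", "denies", "without")

def is_negated(sentence: str, concept: str) -> bool:
    """True if the concept is preceded by a simple negation cue in the sentence."""
    s = sentence.lower()
    idx = s.find(concept.lower())
    if idx == -1:
        return False
    # Look a short window to the left of the concept for a cue phrase.
    prefix = s[max(0, idx - 40):idx]
    return any(cue in prefix for cue in NEGATION_CUES)

# Tumor sizes: capture value and unit, then normalize everything to millimeters.
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE)
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0}

def extract_sizes_mm(text: str) -> list[float]:
    """Return all size mentions in the text, normalized to mm."""
    return [float(v) * UNIT_TO_MM[u.lower()] for v, u in SIZE_RE.findall(text)]
```

Even checks this crude would catch the "no evidence of disease" versus "evidence of progression" distinction and flag size values an LLM mangled, which is exactly the error analysis the paper does not report.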
Third, explainability and traceability matter for clinical acceptance and regulatory review. Ontology methods give clear term matches and provenance. LLMs can be inscrutable. The paper mentions adaptability but does not describe model cards, audit logs, or how the system supports human review. An LLM that outputs a phenotype without pointing to the exact sentence and token spans that justify it will be hard to deploy safely.
Fourth, there is the risk of hallucination and spurious correlations. LLMs can confidently assert facts that are not present in the note. The paper reports comparable accuracy overall, but it is important to see precision for clinically actionable categories versus recall, and to understand the cost of errors. A false positive tumor progression flag is not the same as a missed biomarker record.
Finally, operational cost is non-trivial. Fine-tuning and running large models in a secure, HIPAA-compliant environment has infrastructure, latency, and cost implications. The paper suggests easier adaptation, but it does not compare total cost of ownership versus maintaining ontology-based pipelines.
What matters for practice
If you are considering this research for a production system, I would not treat the LLM result as an automatic winner. Instead, consider hybrid designs where an LLM proposes extractions and a rules-based or ontology-backed layer validates and maps them to controlled vocabularies (NCIt codes, SNOMED, ICD). Use the ontology as the ground truth for mapping and the LLM to find candidate mentions and edge cases.
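The hybrid design above can be sketched in a few lines. The lookup table below is a toy stand-in for a real NCI Thesaurus term index, and the codes in it are placeholders, not actual NCIt identifiers; the point is the control flow, in which only candidates that map to a controlled vocabulary reach the record, and everything else is queued for review:

```python
# Toy stand-in for an NCIt term index. The codes are PLACEHOLDERS,
# not real NCIt identifiers; a real system queries the NCI Thesaurus.
NCIT_LOOKUP = {
    "invasive ductal carcinoma": "NCIT:C0001",   # placeholder code
    "estrogen receptor positive": "NCIT:C0002",  # placeholder code
}

def validate_and_map(candidates: list[str]) -> tuple[list[tuple[str, str]], list[str]]:
    """Keep LLM-proposed mentions that map to the controlled vocabulary.

    Returns (mapped mention/code pairs, unmapped mentions for human review).
    """
    mapped, review_queue = [], []
    for mention in candidates:
        code = NCIT_LOOKUP.get(mention.strip().lower())
        if code is not None:
            mapped.append((mention, code))
        else:
            review_queue.append(mention)
    return mapped, review_queue
```

The design choice here is that the ontology acts as a gate, not a suggestion: an LLM mention that cannot be grounded in a code never enters the registry on its own.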
Invest in rigorous evaluation beyond aggregate accuracy. Measure per-class precision and recall, time-aware extraction (was this the current disease status or a historical mention?), negation handling, and numeric extraction correctness. Run adversarial tests that include abbreviations, truncated notes, and common scanning/OCR artifacts.
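Per-class precision and recall are cheap to compute and far more informative than a single accuracy number. A minimal stdlib-only version over aligned gold and predicted labels (libraries like scikit-learn provide the same via `classification_report`):

```python
from collections import Counter

def per_class_metrics(gold: list[str], pred: list[str]) -> dict[str, dict[str, float]]:
    """Per-class precision and recall from aligned gold/predicted label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p where the truth was something else
            fn[g] += 1  # missed the true label g
    out = {}
    for label in set(gold) | set(pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        out[label] = {"precision": prec, "recall": rec}
    return out
```

Breaking results out this way is what lets you see, for example, that a model with high aggregate accuracy still has poor precision on a rare but clinically actionable class like progression.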
Operational controls are mandatory. You need human-in-the-loop review for low-confidence outputs, model monitoring for drift, and a documented audit trail that links each extracted phenotype back to the source text. Produce a model card and a clear data provenance report before any clinical or registry application.
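The audit-trail requirement can be made concrete with a record schema. The fields and the review threshold below are my own illustrative choices, not anything the paper specifies, but the essential property is that every extracted phenotype carries its note ID, character offsets of the supporting span, model version, and a confidence score that routes low-confidence outputs to a human:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ExtractionRecord:
    """One audit-trail entry: every phenotype links back to its source span."""
    note_id: str
    phenotype: str
    ncit_code: str
    start: int            # character offset of the supporting span in the note
    end: int
    confidence: float
    model_version: str    # needed to reproduce and debug past extractions

REVIEW_THRESHOLD = 0.85   # illustrative cutoff; tune per deployment and class

def needs_human_review(rec: ExtractionRecord) -> bool:
    """Route low-confidence extractions to the human-in-the-loop queue."""
    return rec.confidence < REVIEW_THRESHOLD
```

Records like this double as the raw material for drift monitoring: confidence distributions and review rates per phenotype class are easy to chart once every extraction is logged in this shape.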
Next steps I would like to see
The paper is a useful signal that LLMs are competitive for phenotype extraction. To move from research to clinical use, I would want:
- A public benchmark dataset for oncology phenotype extraction with clear annotation guidelines and agreement metrics.
- Detailed error analysis showing where LLMs fail compared to ontologies, especially on negation, temporality, and numeric values.
- Examples of a hybrid pipeline that maps LLM outputs to NCIt terms with validation rules and confidence thresholds.
- Prospective evaluation where clinicians review LLM-extracted phenotypes in real workflows and measure changes in workload and data quality.
Bottom line
The authors show that LLMs can reach parity with ontology annotators on breast cancer phenotype extraction in retrospective notes. That is an important and practical finding. But parity on an offline metric is not the same as readiness for production in oncology settings. I see value in LLMs as a flexible front end for phenotype discovery, but ontologies, rules, and human oversight remain essential to make these systems safe and useful in the clinic.