Fighting Numerical Hallucinations via Data-centric Compilation for Online Financial QA

Title: Compiling Numeric Answers: Practical takeaways from a data-centric compiler for financial QA

I've been watching work on numerical reasoning in question answering for a while. The paper arXiv:2605.31064, "Fighting Numerical Hallucinations via Data-centric Compilation for Online Financial QA", caught my eye because it pushes a simple idea with real operational implications: if you need trustworthy numbers, make the system output something you can verify and execute. They call their approach the Data-centric Reasoning Compiler, or DCRC, and it combines adversarial data, a Data-centric Structuring Agent (DSA), and a compile-and-execute inference step that turns natural language plus retrieved documents into executable reasoning programs.

Technical summary

At a high level DCRC tackles three failure modes common in retrieval-augmented QA: noise sensitivity from imperfect retrieval, calculation fragility when the model makes arithmetic errors, and an auditability gap where you cannot inspect the internal reasoning that produced a numeric answer. The pipeline has three stages.

First, they construct adversarial training data. Instead of only using clean examples, they synthesize documents and questions with controlled noise to teach the model to ignore distractors and to rely on precise values when available.
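To make the idea concrete, here is a minimal sketch of distractor injection. It is my own illustration, not the paper's procedure: the `Example` type, the `add_distractors` helper, and the sample passages are all hypothetical, and the paper's controlled noise is presumably richer than shuffling in look-alike passages.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    passages: list[str]
    gold_index: int  # position of the passage that holds the true value


def add_distractors(example: Example, distractors: list[str], k: int, seed: int = 0) -> Example:
    """Mix k distractor passages into the evidence and shuffle.

    Assumes distractor texts differ from the gold passage, so the gold
    passage can be located again by value after shuffling.
    """
    rng = random.Random(seed)  # seeded so the noisy example is reproducible
    gold = example.passages[example.gold_index]
    mixed = example.passages + rng.sample(distractors, k)
    rng.shuffle(mixed)
    return Example(example.question, mixed, mixed.index(gold))


clean = Example(
    question="What was FY2023 revenue?",
    passages=["FY2023 revenue was 120.5 million."],
    gold_index=0,
)
pool = [
    "FY2022 revenue was 100.0 million.",
    "FY2023 operating income was 12.1 million.",
    "A competitor reported FY2023 revenue of 98.4 million.",
]
noisy = add_distractors(clean, pool, k=2)
```

The gold index travels with the example, so the training target can still point at the right passage after noise is added.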

Second, they train a Data-centric Structuring Agent that maps the question and retrieved evidence into a structured program. The DSA is trained in multiple stages both to audit evidence (tag sources, surface contradictions) and to output a deterministic program for the calculation steps.

Third, at inference the DSA compiles the program and executes it in a sandboxed runtime. That execution yields the final numeric answer and an audit trail: the program itself and the intermediate values derived from explicit document citations.
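The compile-and-execute step can be sketched roughly as follows. This is an assumption-heavy toy, not the paper's program format: the step schema (`op`, `args`), the `#n` back-reference syntax, the evidence keys, and the `execute` function are all names I made up for illustration. It uses `Decimal` so the arithmetic does not depend on binary floating point.

```python
import operator
from decimal import Decimal

# Whitelisted operations; anything else raises KeyError at execution time.
OPS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def execute(program, evidence):
    """Run steps in order and return (final_value, trace).

    An argument is either "#n" (the result of step n), an evidence key,
    or a numeric literal.
    """
    results, trace = [], []

    def resolve(arg):
        if isinstance(arg, str) and arg.startswith("#"):
            return results[int(arg[1:])]
        if isinstance(arg, str):
            return Decimal(evidence[arg]["value"])
        return Decimal(str(arg))

    for i, step in enumerate(program):
        left, right = (resolve(a) for a in step["args"])
        value = OPS[step["op"]](left, right)
        results.append(value)
        trace.append({"step": i, "op": step["op"], "args": step["args"], "value": str(value)})
    return results[-1], trace


# Hypothetical extracted values, each tied to a citation.
evidence = {
    "rev_2023": {"value": "120.5", "source": "10-K 2023, p. 41"},
    "rev_2022": {"value": "100.0", "source": "10-K 2023, p. 41"},
}
# Year-over-year growth: (rev_2023 - rev_2022) / rev_2022
program = [
    {"op": "subtract", "args": ["rev_2023", "rev_2022"]},
    {"op": "divide", "args": ["#0", "rev_2022"]},
]
growth, trace = execute(program, evidence)
print(growth)  # 0.205
```

The trace is the audit trail: every intermediate value is recorded next to the operation and the cited inputs that produced it.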

They report improvements on offline FinQA-style benchmarks and also describe deploying the system inside an online financial QA service.

My perspective

I like parts of this paper because it focuses on things that matter in production: deterministic calculations, auditability, and explicit handling of noisy retrieval. Too many papers treat LLMs as black boxes that somehow get better with scale or more compute. This one accepts that model outputs are fallible and builds around that fact.

Turning answers into executable programs is the core idea with the highest practical value. If an agent produces a small program that performs arithmetic using extracted values, you gain three things immediately. First, determinism for arithmetic. A correct program executed in a deterministic runtime either computes the right number or fails. Second, an audit trail. You can store the program and re-run it against a snapshot of the retrieved evidence to reproduce the answer. Third, a surface where you can insert checks. If an intermediate value seems off, you can add runtime assertions or unit tests.

The adversarial data construction is also sensible. Models end up brittle when their training data never includes plausible distractors, and teaching a structuring agent to ignore those distractors is a direct, useful remedy. Practically speaking, I expect adversarially generated noise to give a sizable lift on in-domain evaluation.

That said, there are real gaps and tradeoffs they either gloss over or do not fully resolve.

What the paper leaves unclear or under-specified

Program synthesis reduces some classes of errors but introduces others. A program is only as good as the DSA that writes it. If the agent synthesizes the wrong program, execution will produce a confidently wrong number. The paper uses adversarial training to reduce that risk, but it does not fully explore how well this generalizes to unseen types of noise or to shifts in document formats. In production you will face new report layouts, CSV exports, and poorly structured filings whose values are expressed in inconsistent ways. I want to see ablations that stress-test the system on out-of-distribution retrieval and on document schema drift.

Latency and operational cost are under-discussed. Adding a structuring model plus a compile-and-execute step increases inference time and engineering surface area. For an online financial QA service, an extra 200 to 500 milliseconds of latency matters. The paper mentions deployment but gives few numbers on throughput, tail latency, and failure rates. For many businesses the decision to adopt a compile-and-execute pattern is a tradeoff between correctness and scale. You need solid SLOs and observability before shipping.

Security and sandboxing deserve more attention. Executing synthesized programs means running code generated by a model. That requires rigorous sandboxing, deterministic numeric libraries, and careful handling of timeouts and resource limits. The paper notes a sandbox but does not detail how they prevent escape, excessive resource use, or nondeterministic behavior in libraries.
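One layer of defense is to never hand model output to a general-purpose interpreter at all. The sketch below is my own illustration, not the paper's sandbox: `safe_eval` is a hypothetical helper that parses an expression with Python's `ast` module and evaluates only whitelisted arithmetic nodes. It limits what the generated code can say; it does not replace process isolation, timeouts, or memory limits.

```python
import ast
import operator
from decimal import Decimal

_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str, variables: dict[str, Decimal], max_nodes: int = 50) -> Decimal:
    """Evaluate +, -, *, / over named Decimal values; reject everything else."""
    tree = ast.parse(expr, mode="eval")
    # Crude size limit so a huge generated expression is refused up front.
    if sum(1 for _ in ast.walk(tree)) > max_nodes:
        raise ValueError("expression too large")

    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return Decimal(str(node.value))
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        # Calls, attribute access, subscripts, powers, etc. all land here.
        raise ValueError(f"disallowed syntax: {type(node).__name__}")

    return ev(tree)


values = {"rev_2023": Decimal("120.5"), "rev_2022": Decimal("100")}
growth = safe_eval("(rev_2023 - rev_2022) / rev_2022", values)
```

Because function calls and attribute access are rejected before anything runs, a string like `__import__('os').system('ls')` raises `ValueError` instead of executing.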

Finally, integration with the retriever is still a hard dependency. The DSA assumes the correct facts are present in the retrieved documents. No amount of adversarial training can create missing facts. If retrieval fails, the only mitigations are to surface uncertainty, require human review, or fetch fresh data.

What matters for production

If you run or build financial QA systems, here is what matters after reading this paper.

First, instrument program-level observability. Log the synthesized program, the values pulled from each document, execution traces, and checks that passed or failed. Store these artifacts to reproduce answers and to run postmortems when numbers go wrong.
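As a rough sketch of what such a log record could look like, assuming JSON-serializable inputs: the field names and the `build_audit_record` helper are hypothetical, and the content fingerprint is my own addition for deduplication and tamper checks, not something the paper describes.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(question, program, evidence, trace, checks):
    """Bundle everything needed to reproduce an answer into one JSON-ready dict."""
    payload = {
        "question": question,
        "program": program,
        "evidence": evidence,
        "trace": trace,
        "checks": checks,
    }
    # Canonical serialization: identical content always hashes to the same value.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        **payload,
        "fingerprint": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record(
    question="What was revenue growth in FY2023?",
    program=[{"op": "subtract", "args": ["rev_2023", "rev_2022"]}],
    evidence={"rev_2023": {"value": "120.5", "source": "10-K 2023, p. 41"}},
    trace=[{"step": 0, "value": "20.5"}],
    checks={"range": "passed"},
)
line = json.dumps(record)  # one line per answer, ready for a log pipeline
```

The timestamp is kept outside the fingerprint on purpose, so re-running the same program on the same evidence snapshot yields the same hash.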

Second, treat the DSA and the retriever as two dependent services that must be tested together. Add regression suites that simulate new document formats and OOD distractors. Have automated tests that cover numeric stability and edge cases like unit mismatches, partial extractions, and currency conversions.
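Unit mismatches are a good first target for such a suite. Here is a small sketch, assuming extracted values arrive as a string plus a scale label; the `SCALE` table and the `normalize` helper are hypothetical names, and currency conversion is deliberately left out because it needs a dated rate source.

```python
from decimal import Decimal

SCALE = {
    "units": Decimal(1),
    "thousands": Decimal(1_000),
    "millions": Decimal(1_000_000),
    "billions": Decimal(1_000_000_000),
}


def normalize(value: str, scale: str) -> Decimal:
    """Convert a reported figure to base units.

    Handles thousands separators and the accounting convention of
    writing negatives in parentheses. Unknown scales raise KeyError
    rather than being silently treated as base units.
    """
    cleaned = value.replace(",", "").strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    amount = Decimal(cleaned) * SCALE[scale]
    return -amount if negative else amount


# Regression cases: the same figure reported at two different scales must agree.
assert normalize("1,250", "thousands") == normalize("1.25", "millions")
assert normalize("(3.5)", "millions") == Decimal("-3500000")
```

Cases like these are cheap to accumulate: each time a new filing layout breaks extraction, the offending value and scale go into the suite.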

Third, add execution-time validators. Simple sanity checks, such as range assertions on expected values and reconciliation with historical data, often catch synthesis errors. Formal verification is overkill in many settings, but deterministic numeric checks are cheap and effective.
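A minimal sketch of two such checks, assuming you can supply plausible bounds for the metric and a previous value to reconcile against. The `validate` function, the failure labels, and the 50% default threshold are illustrative assumptions, not values from the paper.

```python
from decimal import Decimal
from typing import Optional


def validate(
    value: Decimal,
    lower: Decimal,
    upper: Decimal,
    previous: Optional[Decimal] = None,
    max_rel_change: Decimal = Decimal("0.5"),
) -> list[str]:
    """Return the names of failed checks; an empty list means all checks passed."""
    failures = []
    # Range assertion: the metric must fall inside a plausible interval.
    if not (lower <= value <= upper):
        failures.append("out_of_range")
    # Reconciliation: flag large jumps relative to the last known value.
    if previous is not None and previous != 0:
        rel_change = abs(value - previous) / abs(previous)
        if rel_change > max_rel_change:
            failures.append("deviates_from_history")
    return failures


# A growth rate of 0.205 is plausible; 205 suggests a percent/fraction mix-up.
ok = validate(Decimal("0.205"), Decimal("-1"), Decimal("1"), previous=Decimal("0.18"))
bad = validate(Decimal("205"), Decimal("-1"), Decimal("1"), previous=Decimal("0.18"))
```

Returning failure names instead of raising keeps the checks composable: the caller can log all of them and decide what to do next.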

Fourth, operationalize a fallback policy. When the DSA is uncertain or the execution fails checks, send the query to a human reviewer with the program and the supporting evidence. That human-in-the-loop path is what makes these systems deployable in regulated domains.
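The routing logic itself can be very small. In this sketch I assume the DSA exposes some confidence score in [0, 1]; that score, the 0.8 threshold, and the `Decision` and `decide` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    route: str          # "auto_answer" or "human_review"
    reasons: list[str]  # why the query was escalated; empty when auto-answered


def decide(
    confidence: float,
    execution_error: Optional[str],
    failed_checks: list[str],
    min_confidence: float = 0.8,
) -> Decision:
    """Escalate to a human if execution failed, any check failed, or confidence is low."""
    reasons = []
    if execution_error is not None:
        reasons.append(f"execution_error:{execution_error}")
    reasons.extend(f"failed_check:{name}" for name in failed_checks)
    if confidence < min_confidence:
        reasons.append("low_confidence")
    return Decision("human_review" if reasons else "auto_answer", reasons)


decision = decide(confidence=0.93, execution_error=None, failed_checks=["out_of_range"])
```

Collecting every reason, rather than stopping at the first, gives the reviewer the full picture alongside the program and the supporting evidence.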

Bottom line

DCRC is not a silver bullet, but it is a pragmatic, production-friendly idea. Make models output verifiable artifacts and you reduce a large class of silent arithmetic failures. The paper shows that adversarial data plus program synthesis helps, but practical adoption requires more attention to retriever quality, latency, sandboxing, and monitoring. If you care about numeric correctness and auditability, this is a direction worth pursuing. I would start by prototyping a compile-and-execute layer in a narrow domain, instrumenting heavily, and treating the DSA and retriever as a coupled system to be tested continuously.