MLOps

Designing a reviewable deadline extraction pipeline for contracts

May 10, 2026

The Datanauts project became more interesting once it stopped being just a model and started becoming a workflow. The real systems question was not only whether the classifier and NER model worked. It was whether the output could be reviewed, routed, and trusted by something downstream.

The useful shift here was from “can the model predict the right span?” to “what should the pipeline emit when the answer is incomplete, low-confidence, or still needs a human to look at it?”

The problem setup

Contract language is messy in two different ways. First, not every clause that contains a date is actually a deadline or renewal event. Second, even relevant clauses often mix multiple time references, notice periods, and obligations. That made a single all-purpose extraction step harder to trust than a staged pipeline.

Why the system needed stages

The first model acted as a gate: is this clause relevant enough to spend extraction effort on at all? Only then did the NER stage try to identify the exact spans. That boundary turned out to be useful because it separated clause relevance from span precision and made the errors easier to reason about.
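As a minimal sketch of that boundary, the two stages can be wired so that extraction runs only after the gate says yes. Everything named here (run_stages, StageResult, stub_gate, stub_ner) is hypothetical and stands in for the fine-tuned models; the stubs are keyword and regex placeholders, not the project's classifier or NER model.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: the post does not show the project's real ones.
GateFn = Callable[[str], Tuple[bool, float]]       # (is_relevant, confidence)
NerFn = Callable[[str], Tuple[List[dict], float]]  # (entities, confidence)


@dataclass
class StageResult:
    clause_text: str
    is_relevant: bool
    classifier_confidence: float
    entities: List[dict] = field(default_factory=list)
    ner_confidence: Optional[float] = None  # None means the NER stage never ran


def run_stages(clause_text: str, gate: GateFn, ner: NerFn) -> StageResult:
    """Run the gate first; call the NER stage only for relevant clauses."""
    is_relevant, gate_conf = gate(clause_text)
    result = StageResult(clause_text, is_relevant, gate_conf)
    if not is_relevant:
        return result  # extraction effort is skipped entirely
    result.entities, result.ner_confidence = ner(clause_text)
    return result


# Placeholder models so the sketch runs without any ML dependencies.
def stub_gate(text: str) -> Tuple[bool, float]:
    hit = any(word in text.lower() for word in ("notice", "renew"))
    return hit, 0.9  # the confidence value is made up for illustration


def stub_ner(text: str) -> Tuple[List[dict], float]:
    spans = [{"text": m.group(0), "label": "NOTICE_PERIOD"}
             for m in re.finditer(r"\d+ days", text)]
    return spans, 0.9  # the confidence value is made up for illustration


if __name__ == "__main__":
    print(run_stages("Give 30 days' written notice prior to renewal.",
                     stub_gate, stub_ner))
```

Because the gate's decision is recorded separately from the NER output, a missed deadline can later be traced to the stage that dropped it.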

Pipeline diagram

Input: contract clause or document fragment. Normalize and prepare text for inference.

Gate classifier: decide whether the clause is relevant enough to continue; reduce noisy extraction attempts.

NER stage: extract notice periods, deadlines, and renewal-related spans.

Routing layer: emit structured JSON; send low-confidence cases into review.

The important system choice was not just stacking models. It was deciding where the review boundary should live and what the output contract should look like on either side of it.

What made it usable

The workflow became more useful once the infrastructure around the models became explicit. MLflow tracked experiments and artifacts for the classifier and NER stages. Docker packaging kept the inference path more reproducible. Confidence thresholds gave the pipeline a way to acknowledge uncertainty without pretending it was solved.

The output contract

The most important design decision was to make the output structured enough that downstream code could use it and a reviewer could still understand what happened. A useful output needed relevance, event type, extracted spans, confidence, and a clear flag for whether review was still required.

{
  "clause_text": "Either party may terminate this agreement by giving at least 30 days' written notice prior to renewal.",
  "is_relevant": true,
  "document_event_type": "notice_deadline",
  "entities": [
    {
      "text": "at least 30 days' written notice",
      "label": "NOTICE_PERIOD"
    }
  ],
  "review_required": false,
  "confidence": {
    "classifier": 0.93,
    "ner": 0.88
  }
}

That same shape still worked when confidence dropped. The system could emit the same structure with review_required set to true, which made borderline cases easy to route to a human instead of treating every answer as equally trustworthy.
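A rough sketch of how the routing layer could set that flag is below. The field names come from the JSON example above; the function name build_output and both threshold values are assumptions made for illustration, since the post does not state the thresholds it used. The sketch also assumes each confidence is the score of the predicted label.

```python
import json
from typing import List, Optional

# Hypothetical thresholds; the real values are not given in this post.
CLASSIFIER_THRESHOLD = 0.85
NER_THRESHOLD = 0.80


def build_output(clause_text: str,
                 is_relevant: bool,
                 event_type: Optional[str],
                 entities: List[dict],
                 classifier_conf: float,
                 ner_conf: Optional[float] = None) -> dict:
    """Emit the same record shape for confident and borderline cases."""
    low_gate = classifier_conf < CLASSIFIER_THRESHOLD
    # NER confidence only matters when the clause passed the gate.
    low_ner = is_relevant and (ner_conf is None or ner_conf < NER_THRESHOLD)
    return {
        "clause_text": clause_text,
        "is_relevant": is_relevant,
        "document_event_type": event_type,
        "entities": entities,
        "review_required": bool(low_gate or low_ner),
        "confidence": {"classifier": classifier_conf, "ner": ner_conf},
    }


if __name__ == "__main__":
    entities = [{"text": "at least 30 days' written notice",
                 "label": "NOTICE_PERIOD"}]
    confident = build_output("clause text here", True, "notice_deadline",
                             entities, 0.93, 0.88)
    borderline = build_output("clause text here", True, "notice_deadline",
                              entities, 0.93, 0.61)
    print(json.dumps(confident, indent=2))
    print(borderline["review_required"])
```

Only the flag changes between the two records, so downstream code and reviewers read one schema regardless of how much the pipeline trusts the answer.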

What the numbers said

Signal	Observed result	Why it mattered
RoBERTa gate	F1 = 0.8682	The gate was good enough to reduce noisy extraction attempts without collapsing the pipeline into a single model.
BERT NER	F1 = 1.0 on the curated evaluation split	The extraction stage performed strongly on the project split, but the staged system design still mattered more than the headline number alone.
Operational behavior	structured outputs plus review routing	The pipeline became usable because low-confidence predictions had somewhere honest to go.

What I tried to avoid

A tempting version of this project would have been to collapse relevance, extraction, and output formatting into one bigger model step. That would have made the architecture look simpler while making failures harder to diagnose. Keeping the stages explicit gave the workflow better failure boundaries and made evaluation easier.
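One way to picture those failure boundaries is a small helper that attributes each evaluation example to the stage that got it wrong. This is an illustrative sketch, not the project's evaluation code: attribute_error and its labels are hypothetical, and spans are compared by exact string match, which is a simplifying assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Example = Tuple[bool, bool, List[str], List[str]]
# (gold_relevant, predicted_relevant, gold_spans, predicted_spans)


def attribute_error(gold_relevant: bool, pred_relevant: bool,
                    gold_spans: List[str], pred_spans: List[str]) -> str:
    """Name the stage responsible for a wrong answer, or 'correct'."""
    if gold_relevant != pred_relevant:
        # The gate decided wrongly, so NER output is not the root cause.
        return "gate_false_negative" if gold_relevant else "gate_false_positive"
    if not gold_relevant:
        return "correct"  # correctly skipped; NER never needed to run
    # Exact-match comparison of span texts (an assumption of this sketch).
    return "correct" if set(gold_spans) == set(pred_spans) else "ner_span_error"


def tally(examples: Iterable[Example]) -> Counter:
    """Count outcomes per stage across an evaluation set."""
    return Counter(attribute_error(*ex) for ex in examples)


if __name__ == "__main__":
    toy = [
        (True, True, ["30 days"], ["30 days"]),
        (True, False, ["30 days"], []),
        (False, True, [], ["2024"]),
        (True, True, ["30 days"], ["30"]),
    ]
    print(tally(toy))
```

With a single combined model, the second and fourth toy cases would look like the same kind of miss; with explicit stages they land in different buckets.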

What still felt weak

The missing piece is production-facing monitoring for drift and boundary cases. Offline metrics are useful, but they are not enough on their own. A stronger version of this system would make distribution change and review volume visible over time instead of relying mostly on offline evaluation and per-prediction confidence.
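A first step toward that visibility could be as small as tracking the share of recent predictions routed to review. The sketch below is not something the project has today; ReviewRateMonitor, its window size, and its alert rate are all hypothetical choices made for illustration.

```python
from collections import deque


class ReviewRateMonitor:
    """Track the fraction of recent outputs flagged review_required."""

    def __init__(self, window: int = 500, alert_rate: float = 0.30):
        if window <= 0:
            raise ValueError("window must be positive")
        # A bounded deque drops the oldest flag once the window is full.
        self._flags = deque(maxlen=window)
        self.alert_rate = alert_rate

    def record(self, review_required: bool) -> None:
        self._flags.append(bool(review_required))

    def rate(self) -> float:
        """Review rate over the current window (0.0 when empty)."""
        if not self._flags:
            return 0.0
        return sum(self._flags) / len(self._flags)

    def is_alerting(self) -> bool:
        """Alert only once the window is full, to avoid noisy early alarms."""
        full = len(self._flags) == self._flags.maxlen
        return full and self.rate() > self.alert_rate


if __name__ == "__main__":
    monitor = ReviewRateMonitor(window=4, alert_rate=0.5)
    for flag in (False, False, True, True, True):
        monitor.record(flag)
    print(monitor.rate(), monitor.is_alerting())
```

A rising review rate does not prove drift on its own, but it is a cheap signal that the inputs or the models' confidence behavior have changed and that the offline numbers may no longer describe production.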

Transferable principle

In extraction systems, “usable” usually comes from the output contract and the review boundary, not just from the best model number on the page.