MLOps
Designing a reviewable deadline extraction pipeline for contracts
The Datanauts project became more interesting once it stopped being just a model and became a workflow. The real systems question was not only whether the classifier and NER model worked. It was whether the output could be reviewed, routed, and trusted by whatever consumed it downstream.
The useful shift here was from “can the model predict the right span?” to “what should the pipeline emit when the answer is incomplete, low-confidence, or still needs a human to look at it?”
The problem setup
Contract language is messy in two different ways. First, not every clause that contains a date is actually a deadline or renewal event. Second, even relevant clauses often mix multiple time references, notice periods, and obligations. That made a single all-purpose extraction step harder to trust than a staged pipeline.
Why the system needed stages
The first model acted as a gate: is this clause relevant enough to spend extraction effort on at all? Only then did the NER stage try to identify the exact spans. That boundary turned out to be useful because it separated clause relevance from span precision and made the errors easier to reason about.
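The gate-then-extract boundary described above can be sketched in a few lines. This is a minimal illustration, not the project's code: `run_stages`, `toy_gate`, `toy_extractor`, and the 0.5 relevance threshold are hypothetical names and values, and the toy functions stand in for the real classifier and NER models.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical stage signatures, for illustration only.
Gate = Callable[[str], float]                       # clause -> relevance score in [0, 1]
Extractor = Callable[[str], List[Tuple[str, str]]]  # clause -> [(span text, label)]


def run_stages(clause: str, gate: Gate, extractor: Extractor,
               relevance_threshold: float = 0.5) -> Dict[str, object]:
    """Run the gate first and call the extractor only on relevant clauses."""
    relevance = gate(clause)
    if relevance < relevance_threshold:
        # Gate rejected the clause: no extraction effort is spent on it.
        return {"is_relevant": False, "relevance": relevance, "entities": []}
    spans = extractor(clause)
    return {
        "is_relevant": True,
        "relevance": relevance,
        "entities": [{"text": text, "label": label} for text, label in spans],
    }


# Toy stand-ins so the sketch runs without any model weights.
def toy_gate(clause: str) -> float:
    return 0.9 if "notice" in clause.lower() else 0.1


def toy_extractor(clause: str) -> List[Tuple[str, str]]:
    match = re.search(r"\d+ days", clause)
    return [(match.group(0), "NOTICE_PERIOD")] if match else []


relevant = run_stages(
    "Either party may terminate by giving 30 days' written notice.",
    toy_gate, toy_extractor,
)
skipped = run_stages("This agreement is dated 1 January 2024.", toy_gate, toy_extractor)
```

Because the two stages return through separate branches, a wrong answer can be traced either to the gate (a relevant clause was skipped) or to the extractor (a relevant clause produced the wrong spans).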
Pipeline diagram
The important system choice was not just stacking models. It was deciding where the review boundary should live and what the output contract should look like on either side of it.
What made it usable
The workflow became more useful once the infrastructure around the models became explicit. MLflow tracked experiments and artifacts for the classifier and NER stages. Docker packaging kept the inference path more reproducible. Confidence thresholds gave the pipeline a way to acknowledge uncertainty without pretending it was solved.
The output contract
The most important design decision was to make the output structured enough that downstream code could use it and a reviewer could still understand what happened. A useful output needed relevance, event type, extracted spans, confidence, and a clear flag for whether review was still required.
{
  "clause_text": "Either party may terminate this agreement by giving at least 30 days' written notice prior to renewal.",
  "is_relevant": true,
  "document_event_type": "notice_deadline",
  "entities": [
    {
      "text": "at least 30 days' written notice",
      "label": "NOTICE_PERIOD"
    }
  ],
  "review_required": false,
  "confidence": {
    "classifier": 0.93,
    "ner": 0.88
  }
}
That same shape still worked when confidence dropped. The system could emit the same contract with review_required switched to true, which made borderline cases easier to route to a human instead of pretending every answer deserved the same trust.
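One way to keep the shape fixed while letting the review flag follow confidence is to derive review_required inside the record itself. The sketch below assumes a hypothetical `ClauseResult` dataclass and illustrative thresholds of 0.80 for both stages; the project's actual threshold values are not stated here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# Assumed thresholds, chosen only for illustration.
CLASSIFIER_MIN = 0.80
NER_MIN = 0.80


@dataclass
class ClauseResult:
    """Hypothetical record mirroring the JSON output contract."""
    clause_text: str
    is_relevant: bool
    document_event_type: str
    entities: List[Dict[str, str]]
    confidence: Dict[str, float]
    review_required: bool = field(init=False)

    def __post_init__(self) -> None:
        low_confidence = (
            self.confidence["classifier"] < CLASSIFIER_MIN
            or self.confidence["ner"] < NER_MIN
        )
        # A relevant clause with no extracted spans is incomplete, so flag it too.
        missing_spans = self.is_relevant and not self.entities
        self.review_required = low_confidence or missing_spans


def make_result(classifier_conf: float, ner_conf: float) -> ClauseResult:
    return ClauseResult(
        clause_text="Either party may terminate this agreement by giving "
                    "at least 30 days' written notice prior to renewal.",
        is_relevant=True,
        document_event_type="notice_deadline",
        entities=[{"text": "at least 30 days' written notice",
                   "label": "NOTICE_PERIOD"}],
        confidence={"classifier": classifier_conf, "ner": ner_conf},
    )


confident = make_result(0.93, 0.88)
borderline = make_result(0.93, 0.61)
payload = json.dumps(asdict(borderline), indent=2)  # same keys, different flag
```

Computing the flag in one place means downstream code never has to re-implement the thresholds; it only reads the boolean.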
What the numbers said
| Signal | Observed result | Why it mattered |
|---|---|---|
| RoBERTa gate | F1 = 0.8682 | The gate was good enough to reduce noisy extraction attempts without collapsing the pipeline into a single model. |
| BERT NER | F1 = 1.0 on the curated evaluation split | The extraction stage performed strongly on the project split, but the staged system design still mattered more than the headline number alone. |
| Operational behavior | structured outputs plus review routing | The pipeline became usable because low-confidence predictions had somewhere honest to go. |
What I tried to avoid
A tempting version of this project would have been to collapse relevance, extraction, and output formatting into one bigger model step. That would have made the architecture look simpler while making failures harder to diagnose. Keeping the stages explicit gave the workflow better failure boundaries and made evaluation easier.
What still felt weak
The missing piece is still production-facing monitoring for drift and boundary cases. Good offline metrics are useful, but they are not enough on their own. A stronger version of this system would make distribution change and review volume visible over time instead of relying mostly on offline evaluation and per-prediction confidence.
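As a rough idea of what "review volume visible over time" could mean, the sketch below aggregates the review flag per period and lists periods whose review rate rises above a baseline. The function names, the week labels, the baseline, and the tolerance are all assumptions made for illustration; none of this was part of the project.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def review_rate_by_period(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of outputs flagged for review, grouped by a period label."""
    totals: Counter = Counter()
    flagged: Counter = Counter()
    for period, review_required in records:
        totals[period] += 1
        if review_required:
            flagged[period] += 1
    return {period: flagged[period] / totals[period] for period in sorted(totals)}


def periods_over_baseline(rates: Dict[str, float], baseline: float,
                          tolerance: float = 0.10) -> List[str]:
    """Periods whose review rate exceeds the baseline by more than the tolerance."""
    return [period for period, rate in rates.items() if rate > baseline + tolerance]


# Made-up records: (period label, review_required flag from the output contract).
records = [
    ("2024-W01", False), ("2024-W01", False), ("2024-W01", False), ("2024-W01", True),
    ("2024-W02", True), ("2024-W02", True), ("2024-W02", False), ("2024-W02", False),
]
rates = review_rate_by_period(records)
alerts = periods_over_baseline(rates, baseline=0.25)
```

A rising review rate does not prove drift on its own, but it is a cheap signal that can be read straight off the output contract without new labels.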
Transferable principle
In extraction systems, “usable” usually comes from the output contract and the review boundary, not just from the best model number on the page.