MLOps / NLP

Deadline Detection System

An intelligent deadline and expiry extraction workflow for contract text.

Python · RoBERTa · BERT NER · MLflow · Ray Tune · Kubernetes

This project started from a real operational need: extracting deadlines, expiration dates, and renewal terms from contract text in a structured form that downstream systems could consume, rather than a result that is only displayed to a human reviewer. That requirement pushed the project away from a single-model demo and toward a workflow that behaves more like a real system.

Problem

Contract text often contains multiple dates, ambiguous language, and clauses that mix effective dates, renewal periods, and notice deadlines. A single model was not enough. The system first had to decide whether a clause was relevant, then extract the exact temporal span, and finally leave room for review when model confidence was low.

What I built

The workflow uses a sentence-level RoBERTa classifier as a gate, followed by a BERT-based NER model for exact span extraction. Around the models, I built the surrounding pieces that make the output operational: dataset preparation, evaluation, review thresholds, structured JSON outputs, experiment tracking with MLflow, and the serving and deployment path that let the system sit inside Paperless-ngx rather than remain a notebook-only experiment.
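The gate-then-extract control flow can be sketched in a few lines. This is a minimal illustration, not the project's code: the callables, the StageResult type, the 0.5 gate threshold, and the toy stand-in models are all assumptions made so the example runs without model weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stage signatures; the real models would sit behind these callables.
Entity = Tuple[str, str, float]            # (text, label, score)
Gate = Callable[[str], float]              # clause -> relevance probability
Extractor = Callable[[str], List[Entity]]  # clause -> extracted entities


@dataclass
class StageResult:
    is_relevant: bool
    gate_score: float
    spans: List[Entity]


def run_stages(clause: str, gate: Gate, extractor: Extractor,
               gate_threshold: float = 0.5) -> StageResult:
    """Run the gate first and call the extractor only on relevant clauses."""
    score = gate(clause)
    if score < gate_threshold:
        # Irrelevant clause: NER is skipped, so it cannot emit a false span.
        return StageResult(False, score, [])
    return StageResult(True, score, extractor(clause))


# Toy stand-ins for the RoBERTa gate and the BERT NER model (assumptions).
def toy_gate(clause: str) -> float:
    return 0.9 if "notice" in clause.lower() else 0.1


def toy_extractor(clause: str) -> List[Entity]:
    return [("30 days", "NOTICE_PERIOD", 0.88)] if "30 days" in clause else []


relevant = run_stages("Either party may give 30 days' written notice.",
                      toy_gate, toy_extractor)
skipped = run_stages("This agreement is governed by the laws of Ohio.",
                     toy_gate, toy_extractor)
```

Keeping the stages behind plain callables means either model can be swapped or evaluated in isolation without touching the routing logic.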

Pipeline diagram

Diagram showing clause input flowing through a RoBERTa relevance gate, a BERT NER stage, and a routing layer that emits structured JSON or sends low-confidence cases to review.

Evaluation context

The model metrics on this page are component-level results, not a claim that the full contract problem was solved end to end. The BERT NER stage reached F1 = 1.0 on the curated evaluation split used for the project, while the sentence-level RoBERTa gate reached F1 = 0.8682. The NER result came from a narrow, clause-level split rather than a broad production benchmark, so it shows that the component works on that split and nothing broader. The more important result was that the staged setup made extraction behavior easier to reason about than a single all-purpose model.
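For readers unfamiliar with how an NER F1 is typically scored, here is a small sketch of entity-level exact-match F1. The page does not state which F1 variant was used, so treat the exact-match rule, the (start, end, label) span representation, and the EXPIRY_DATE label as assumptions for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label); hypothetical format


def span_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Entity-level F1 where a prediction counts only on an exact span and label match."""
    if not gold and not pred:
        return 1.0
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {(31, 63, "NOTICE_PERIOD"), (80, 90, "EXPIRY_DATE")}
# The second prediction is off by two characters, so it counts as a miss.
pred = {(31, 63, "NOTICE_PERIOD"), (80, 88, "EXPIRY_DATE")}

score = span_f1(gold, pred)       # one of two spans matched
perfect = span_f1(gold, gold)     # identical sets
```

Under exact matching, a single boundary error on a small split moves the score a long way, which is one reason a perfect score on a narrow split says little about broader behavior.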

My contribution

This was a team project. My contribution centered on the staged pipeline design, the MLflow and Ray Tune training workflow, the output contract that made the system easier to evaluate, and the Kubernetes-facing integration path that connected Paperless, serving, and review hooks into one document-processing flow.

Hyperparameter search and experiment tracking

The classifier training path did not rely on manual tweaking alone. I used Ray Tune with an ASHA-based search setup to compare RoBERTa classifier runs, then logged the resulting configs and outcomes to MLflow so that training and deployment stayed tied to the same record of what changed.
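The idea behind an ASHA-style search is early stopping by rungs: give every config a small budget, keep the best fraction, and repeat with a larger budget. The sketch below is a simplified, synchronous successive-halving loop in plain Python, not Ray Tune code and not the project's search setup. ASHA itself promotes trials asynchronously, and Ray Tune provides it as ASHAScheduler. The toy objective and the candidate learning rates are invented for illustration.

```python
import math
from typing import Callable, Dict, List

Config = Dict[str, float]


def successive_halving(configs: List[Config],
                       evaluate: Callable[[Config, int], float],
                       min_epochs: int = 1,
                       reduction_factor: int = 2) -> Config:
    """Keep the top 1/reduction_factor of configs at each rung while growing the budget."""
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs), reverse=True)
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
        epochs *= reduction_factor  # survivors earn a larger training budget
    return survivors[0]


def toy_f1(config: Config, epochs: int) -> float:
    """Invented stand-in for validation F1; a real run would train and evaluate a model."""
    lr_penalty = abs(math.log10(config["lr"]) - math.log10(2e-5))
    return (1.0 - 0.2 * lr_penalty) * (1.0 - 0.5 ** epochs)


candidates = [{"lr": lr} for lr in (1e-6, 1e-5, 2e-5, 1e-4)]
best = successive_halving(candidates, toy_f1)
```

In a tracked workflow, each rung's config and score would be logged as it completes, so that discarded runs stay visible alongside the winner.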

System behavior around the models

The pipeline was designed to sit inside a document workflow rather than in a notebook. It supported asynchronous queueing, burst absorption, autoscaling to a second worker under sustained queue depth, and monitoring of inference latency, event creation volume, and user corrections. The integrated repo also gained an ONNX serving path, Paperless post-consume hooks, and release automation that could promote or roll back a serving path inside the cluster. That surrounding behavior is what made the model outputs operational.
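The difference between absorbing a burst and scaling on sustained depth can be illustrated with a small decision rule. This is a sketch under stated assumptions, not the cluster's actual autoscaling mechanism: the class name, the depth threshold of 20, and the three-sample window are all hypothetical.

```python
from collections import deque
from typing import Deque


class SustainedDepthScaler:
    """Ask for a second worker only when queue depth stays high for a full window."""

    def __init__(self, depth_threshold: int = 20, window: int = 3) -> None:
        self.depth_threshold = depth_threshold
        self.window = window
        self.samples: Deque[int] = deque(maxlen=window)

    def observe(self, queue_depth: int) -> int:
        """Record one queue-depth sample and return the desired worker count (1 or 2)."""
        self.samples.append(queue_depth)
        window_full = len(self.samples) == self.window
        sustained = window_full and all(d > self.depth_threshold for d in self.samples)
        return 2 if sustained else 1


scaler = SustainedDepthScaler()
# A single spike is left to the queue to absorb.
burst = [scaler.observe(depth) for depth in (5, 60, 4)]
# Three consecutive high samples count as sustained backlog.
backlog = [scaler.observe(depth) for depth in (30, 35, 40)]
```

Requiring every sample in the window to exceed the threshold trades a slower scale-up for fewer spurious worker starts during short bursts.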

Deployment shape

The integrated system ran on a single-node k3s cluster on Chameleon. Paperless, MLflow, MinIO, serving, online features, monitoring, and release components each had their own Kubernetes manifests, with ONNX serving exposed as the production inference path and Prometheus/Grafana watching confidence, review-required volume, and P95 latency.

Why this project matters

The part I value most here is not just model training. It is the end-to-end framing: deciding what the output contract should look like, designing for reviewable uncertainty, and shaping the work so it can fit into a document workflow rather than live as an isolated notebook.

Sample output

A useful version of this project needed a stable output contract. A simplified example of the structured artifact is below.

{
  "clause_text": "Either party may terminate this agreement by giving at least 30 days' written notice prior to renewal.",
  "is_relevant": true,
  "document_event_type": "notice_deadline",
  "entities": [
    {
      "text": "at least 30 days' written notice",
      "label": "NOTICE_PERIOD"
    }
  ],
  "review_required": false,
  "confidence": {
    "classifier": 0.93,
    "ner": 0.88
  }
}

When confidence fell below the routing threshold, the same output contract could still be generated with review_required set to true, which made low-confidence cases easier to send into a human-review loop.
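A sketch of how one builder can produce the same contract for both confident and low-confidence cases follows. The field names come from the sample above; the function name, the 0.8 threshold, and the rule of comparing the lower of the two confidences against it are assumptions, since the page does not state the actual routing rule.

```python
import json
from typing import Any, Dict, List


def build_output(clause_text: str,
                 event_type: str,
                 entities: List[Dict[str, str]],
                 classifier_conf: float,
                 ner_conf: float,
                 review_threshold: float = 0.8) -> Dict[str, Any]:
    """Build the structured artifact; only review_required changes with confidence."""
    confidence = {"classifier": classifier_conf, "ner": ner_conf}
    return {
        "clause_text": clause_text,
        "is_relevant": True,  # this sketch covers only clauses that passed the gate
        "document_event_type": event_type,
        "entities": entities,
        # Assumed rule: flag for review when either stage is below the threshold.
        "review_required": min(confidence.values()) < review_threshold,
        "confidence": confidence,
    }


clause = ("Either party may terminate this agreement by giving at least "
          "30 days' written notice prior to renewal.")
entities = [{"text": "at least 30 days' written notice", "label": "NOTICE_PERIOD"}]

confident = build_output(clause, "notice_deadline", entities, 0.93, 0.88)
uncertain = build_output(clause, "notice_deadline", entities, 0.93, 0.62)

payload = json.dumps(uncertain, indent=2)  # same keys, review_required is true
```

Because both cases share one schema, a downstream consumer only has to branch on review_required rather than handle a separate error or fallback format.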

Technical highlights

Two-stage classifier plus NER pipeline to reduce false extraction from irrelevant text.
Structured outputs designed for downstream systems and human-review loops.
Confidence thresholds with low-confidence routing for feedback and retraining.
Ray Tune plus MLflow for reproducible model search and experiment tracking.
Kubernetes deployment with Paperless hooks, ONNX serving, and monitoring services.