Selected work

Selected projects.

These are the projects I would point to first. In each one I had to work through a real systems or engineering problem and make the result behave properly beyond the prototype stage.

Flagship case study

Fused Linear Attention

A CUDA project built around a systems question rather than a benchmark alone: what really changes when projection and attention are fused, split, and benchmarked as a three-kernel family on H100?

The work focused on shared-memory layout, tiling strategy, correctness oracles, and profiler-guided reasoning about where HBM traffic stopped being the whole story. The best custom path ended up being a hybrid warp-cooperative bf16 variant rather than the most aggressively fused scalar kernel.

Across the kernel family, the best short-sequence speedup reached 1.41x at N = 128, the fused path cut peak HBM reads by 54.6%, and the hybrid bf16 path reduced peak GPU allocation by 85–91%.

1.22x–1.41x speedup at N = 64–128 · max error ≤ 1.5 × 10^-7 · 85–91% lower peak allocation

CUDA · Nsight · PyTorch · H100

What this project proved

Fusion changed memory movement

The fused path produced the largest measured HBM read reduction, peaking at 54.6% below the baseline.

The hybrid bf16 path won first

The best short-sequence result came from the warp-cooperative bf16 path, not the most aggressively fused scalar kernel.

Read case study · Repository ↗ · Related writing

Production-oriented machine learning

Deadline Detection System

A staged RoBERTa-plus-BERT pipeline for extracting deadlines and renewal terms from contracts, with structured JSON outputs and review routing.

Case study · Repository ↗

Retrieval and serving

StyleSync

Built a two-tower retrieval system with TensorFlow Recommenders, FAISS candidate generation, FastAPI serving, and public Streamlit and Render deployments.

H&M dataset · FAISS · Streamlit + FastAPI

Case study · Live demo ↗ · Repository ↗

Multimodal fine-tuning

SmolVLM Science VQA

Fine-tuned SmolVLM-500M-Instruct with LoRA for a science visual question answering competition under a 5M trainable-parameter cap.

LoRA · 2xT4 Kaggle setup · score 0.93762

Case study

Other builds

More work from the same line of thinking.

Applied generative systems

SVG Generation Competition Workflow

LLM fine-tuning

Fine-tuned Qwen2.5-Coder-3B-Instruct with LoRA for structured SVG generation, iterating on decoding, repair logic, and response validation over 22 experiments.

22 experiments, LoRA fine-tuning, and a best competition score of 0.93762.

Read case study · Repository ↗

Climate and geospatial systems

FloodIQ

Physics-informed forecasting

A fast-build flood risk workflow that combined NVIDIA PhysicsNeMo, FNO-based forecasting, RAPIDS cuDF, and NYC spatial data into a city-level risk view with an interpretation layer on top.

Built to prove the modeling, geospatial processing, and risk-surface pieces could work together under hackathon constraints.

See hackathon write-up

Vision-language competition work

SmolVLM Science VQA

Multimodal LoRA

Built a notebook-driven competition workflow around SmolVLM-500M-Instruct for science multiple-choice VQA, using a per-choice yes/no ranking formulation instead of direct answer generation.

LoRA rank 8 under a 5M trainable-parameter cap, single-GPU stability on Kaggle T4s, and a best score of 0.93762.
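The per-choice yes/no ranking formulation mentioned above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: `score_fn` is a hypothetical stand-in for the vision-language model's forward pass (the image input is omitted), assumed to return the logits of the "yes" and "no" answer tokens, and the prompt wording is also an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the 'yes' and 'no' token logits, returning P(yes)."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def rank_choices(
    question: str,
    choices: Sequence[str],
    score_fn: Callable[[str], Tuple[float, float]],
) -> Tuple[int, List[float]]:
    """Score every choice with its own yes/no prompt and return the best index.

    score_fn is hypothetical: it stands in for a model call that returns
    (yes_logit, no_logit) for the next token after the prompt.
    """
    probs = []
    for choice in choices:
        # Assumed prompt template, for illustration only.
        prompt = (
            f"Question: {question}\n"
            f"Proposed answer: {choice}\n"
            "Is this answer correct? Answer yes or no."
        )
        yes_logit, no_logit = score_fn(prompt)
        probs.append(yes_probability(yes_logit, no_logit))
    best = max(range(len(choices)), key=probs.__getitem__)
    return best, probs
```

Scoring each option independently keeps the model's output space to two tokens, so the answer is chosen by comparing probabilities rather than by parsing generated text.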

Read case study

Additional projects

More systems and backend work from my résumé.

Distributed systems

Distributed Log Aggregation System

Python / Redis / AWS

Built a distributed log aggregation system with producer-consumer concurrency, B-tree indexing, Redis caching, and REST APIs for ingestion and filtering.

Sub-50ms query response under high-throughput writes, deployed on AWS EC2 with Docker.

Microservices and orchestration

Microservice-Based Task Orchestration Platform

Flask / Supabase / CI/CD

Architected a distributed task platform with priority queuing, retry logic, audit logging, and async execution paths designed for operational reliability.

Handled 10,000+ daily background jobs with Dockerized deployment on AWS EC2.
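As a rough illustration of how priority queuing, retry logic, and audit logging fit together, here is a small in-process sketch. It is not the platform's implementation: the names `Job`, `TaskQueue`, and `make_flaky` are hypothetical, jobs run synchronously rather than asynchronously, and the audit log is a plain list instead of a database table.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(order=True)
class Job:
    priority: int  # lower number runs first
    seq: int       # tie-breaker so equal priorities run in submission order
    name: str = field(compare=False)
    run: Callable[[], None] = field(compare=False)
    attempts: int = field(default=0, compare=False)


class TaskQueue:
    """Hypothetical single-process queue: priority order, bounded retries."""

    def __init__(self, max_attempts: int = 3) -> None:
        self._heap: List[Job] = []
        self._counter = itertools.count()
        self.max_attempts = max_attempts
        self.audit_log: List[Tuple[str, int, str]] = []  # (name, attempt, status)

    def submit(self, name: str, run: Callable[[], None], priority: int = 10) -> None:
        heapq.heappush(self._heap, Job(priority, next(self._counter), name, run))

    def run_all(self) -> None:
        while self._heap:
            job = heapq.heappop(self._heap)
            job.attempts += 1
            try:
                job.run()
                self.audit_log.append((job.name, job.attempts, "ok"))
            except Exception:
                if job.attempts < self.max_attempts:
                    self.audit_log.append((job.name, job.attempts, "retry"))
                    job.seq = next(self._counter)  # requeue behind same-priority peers
                    heapq.heappush(self._heap, job)
                else:
                    self.audit_log.append((job.name, job.attempts, "failed"))


def make_flaky(failures: int) -> Callable[[], None]:
    """Return a job that raises for the first `failures` calls, then succeeds."""
    remaining = [failures]

    def run() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient failure")

    return run
```

A production version would normally add backoff between retries and persist the audit trail; both are left out here to keep the control flow visible.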

Applied machine learning systems

End-to-End Fraud Detection Platform

ML pipeline / AWS / ETL

Built a fraud detection pipeline with automated ETL, containerized services, and continuous deployment support for high-volume transaction processing.

Processed 1M+ daily transactions with ensemble models reaching AUC 0.96.

Concurrency and infrastructure

Distributed Rate Limiter

Python / Redis / pytest

Implemented token bucket and sliding window log rate limiting with thread-safe shared state, structured logging, and integration tests for burst traffic.

Consistent cross-instance enforcement with sub-5ms overhead.
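The two algorithms named above can be sketched in a few lines each. This is a simplified, single-process illustration under stated assumptions: state lives in memory behind a `threading.Lock`, whereas cross-instance enforcement would need the shared state in a store such as Redis. The class names and the injectable `clock` parameter are hypothetical, chosen to make the behavior easy to test.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque


class TokenBucket:
    """Refills `rate` tokens per second up to `capacity`; each request spends tokens."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()
        self._lock = threading.Lock()

    def allow(self, cost: float = 1.0) -> bool:
        with self._lock:
            now = self._clock()
            # Credit tokens for the time elapsed since the last call, capped at capacity.
            self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
            self._last = now
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False


class SlidingWindowLog:
    """Allows at most `limit` requests within any trailing `window` seconds."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self._clock = clock
        self._log: Deque[float] = deque()  # timestamps of accepted requests
        self._lock = threading.Lock()

    def allow(self) -> bool:
        with self._lock:
            now = self._clock()
            # Drop timestamps that have aged out of the window.
            while self._log and self._log[0] <= now - self.window:
                self._log.popleft()
            if len(self._log) < self.limit:
                self._log.append(now)
                return True
            return False
```

The token bucket tolerates short bursts up to its capacity, while the sliding window log enforces a strict count at the cost of storing one timestamp per accepted request.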