Selected work

Selected projects.

These are the projects I would point to first. In each one I had to work through a real systems or engineering problem and get something working reliably beyond the prototype stage.

Flagship case study

Fused Linear Attention

A CUDA project built around a systems question rather than a benchmark alone: what really changes when projection and attention are fused, split, and benchmarked as a three-kernel family on H100?

The work focused on shared-memory layout, tiling strategy, correctness oracles, and profiler-guided reasoning about where HBM traffic stopped being the whole story. The best custom path ended up being a hybrid warp-cooperative bf16 variant rather than the most aggressively fused scalar kernel.
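As a rough illustration of what a correctness oracle for this kind of kernel can look like, the sketch below compares two evaluation orders of unnormalized linear attention in plain Python. The formulation (no feature map, no normalization), the helper names, and the toy sizes are assumptions made for illustration, not the project's actual kernels or test harness.

```python
import random

def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def max_abs_error(x, y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
N, D = 16, 4  # toy sequence length and head dimension (assumed values)
Q, K, V = (random_matrix(N, D, rng) for _ in range(3))

# Reference order: build the N x N score matrix first, O(N^2 * D) work.
reference = matmul(matmul(Q, transpose(K)), V)
# Linear order: build the D x D summary K^T V first, O(N * D^2) work.
candidate = matmul(Q, matmul(transpose(K), V))

error = max_abs_error(reference, candidate)
```

In a real harness the candidate would be the kernel's output copied back to the host, and the tolerance would depend on the precision of the path under test.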

Across the kernel family, the best short-sequence speedup reached 1.41x at N = 128, the fused path cut HBM reads by as much as 54.6%, and the hybrid bf16 path reduced peak GPU allocation by 85–91%.
1.22x–1.41x speedup at N = 64–128 · max error ≤ 1.5 × 10⁻⁷ · 85–91% lower peak allocation
CUDA · Nsight · PyTorch · H100

What this project proved

Fusion changed memory movement: the fused path reached the largest measured HBM read cut, peaking at 54.6% below the baseline.
The hybrid bf16 path won first: the best short-sequence result came from the warp-cooperative bf16 path, not the most aggressively fused scalar kernel.

Retrieval and serving

StyleSync

Built a two-tower retrieval system with TensorFlow Recommenders, FAISS candidate generation, FastAPI serving, and public Streamlit and Render deployments.
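The candidate-generation step of a two-tower system comes down to finding the items whose embeddings score highest against the user embedding. A minimal sketch of that idea, using exact inner-product search in plain Python rather than FAISS; the embeddings, item names, and the `top_k` helper are hypothetical:

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, item_vecs, k):
    """Return the k (item_id, score) pairs with the highest inner product."""
    scored = ((item_id, dot(query_vec, vec)) for item_id, vec in item_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical item-tower outputs; real embeddings would come from the trained model.
item_embeddings = {
    "jacket": [0.9, 0.1, 0.0],
    "sneakers": [0.1, 0.8, 0.3],
    "scarf": [0.7, 0.2, 0.1],
    "sandals": [0.0, 0.6, 0.7],
}
user_embedding = [1.0, 0.2, 0.0]  # hypothetical query-tower output

candidates = top_k(user_embedding, item_embeddings, k=2)
```

A vector index plays the same role at catalogue scale, where a linear scan over every item per request would be too slow.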

H&M dataset · FAISS · Streamlit + FastAPI

Multimodal fine-tuning

SmolVLM Science VQA

Fine-tuned SmolVLM-500M-Instruct with LoRA for a science visual question answering competition under a 5M trainable-parameter cap.

LoRA · 2xT4 Kaggle setup · score 0.93762

Other builds

More work from the same line of thinking.

Applied generative systems

SVG Generation Competition Workflow

Fine-tuned Qwen2.5-Coder-3B-Instruct with LoRA for structured SVG generation, iterating on decoding, repair logic, and response validation over 22 experiments.
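Response validation for generated SVG can start with a well-formedness check and one cheap repair attempt. The sketch below assumes that a truncated closing tag is a common failure; the function names and the repair rule are illustrative, not the workflow's actual logic.

```python
import xml.etree.ElementTree as ET

SVG_ROOT_TAGS = ("svg", "{http://www.w3.org/2000/svg}svg")

def is_valid_svg(text):
    """True if the text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag in SVG_ROOT_TAGS

def repair_svg(text):
    """Try one cheap repair: close a truncated root element."""
    if is_valid_svg(text):
        return text
    patched = text.rstrip() + "</svg>"
    return patched if is_valid_svg(patched) else None

good = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
        '<rect width="10" height="10"/></svg>')
truncated = good[:-len("</svg>")]  # simulate a response cut off before the end
repaired = repair_svg(truncated)
```

Returning `None` when the repair fails leaves the fallback decision (retry, default image) to the caller.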

22 experiments, LoRA fine-tuning, and a best competition score of 0.93762.

Climate and geospatial systems

FloodIQ

A fast-build flood risk workflow that combined NVIDIA PhysicsNeMo, FNO-based forecasting, RAPIDS cuDF, and NYC spatial data into a city-level risk view with an interpretation layer on top.

Built to prove the modeling, geospatial processing, and risk-surface pieces could work together under hackathon constraints.

Vision-language competition work

SmolVLM Science VQA

Built a notebook-driven competition workflow around SmolVLM-500M-Instruct for science multiple-choice VQA, using a per-choice yes/no ranking formulation instead of direct answer generation.

LoRA rank 8 under a 5M trainable-parameter cap, single-GPU stability on Kaggle T4s, and a best score of 0.93762.
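The per-choice ranking idea can be sketched as follows: each answer option is scored by the model's probability of "yes" when asked whether that option is correct, and the highest-scoring option wins. The logits and helper names below are hypothetical stand-ins for one model forward pass per choice.

```python
import math

def yes_probability(yes_logit, no_logit):
    """Softmax over the two answer tokens, returning P(yes)."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def rank_choices(choice_logits):
    """choice_logits maps each choice label to (yes_logit, no_logit)."""
    scores = {label: yes_probability(y, n) for label, (y, n) in choice_logits.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical logits for the "yes" and "no" tokens, one pair per answer choice.
example_logits = {"A": (0.2, 1.5), "B": (2.1, 0.3), "C": (-0.4, 0.9), "D": (1.0, 1.0)}
prediction, choice_scores = rank_choices(example_logits)
```

Compared with generating the answer letter directly, this keeps the output space fixed at two tokens, at the cost of one forward pass per option.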

Additional projects

More systems and backend work from my resume.

Distributed systems

Distributed Log Aggregation System

Built a distributed log aggregation system with producer-consumer concurrency, B-tree indexing, Redis caching, and REST APIs for ingestion and filtering.

Sub-50ms query response under high-throughput writes, deployed on AWS EC2 with Docker.

Microservices and orchestration

Microservice-Based Task Orchestration Platform

Architected a distributed task platform with priority queuing, retry logic, audit logging, and async execution paths designed for operational reliability.
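A minimal single-process sketch of how priority queuing, retry logic, and audit logging fit together. The `TaskQueue` class, its parameters, and the example tasks are hypothetical; the platform itself is distributed and asynchronous, which this sketch does not attempt to show.

```python
import heapq
import itertools

class TaskQueue:
    """Min-heap of tasks; a lower priority value runs first."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.heap = []
        self.counter = itertools.count()  # tie-breaker keeps FIFO order per priority
        self.audit_log = []

    def submit(self, name, func, priority=10, attempt=1):
        heapq.heappush(self.heap, (priority, next(self.counter), name, func, attempt))

    def run_all(self):
        while self.heap:
            priority, _, name, func, attempt = heapq.heappop(self.heap)
            try:
                func()
                self.audit_log.append((name, attempt, "ok"))
            except Exception as exc:
                self.audit_log.append((name, attempt, f"failed: {exc}"))
                if attempt < self.max_attempts:
                    # Re-queue at the same priority; a real system would add backoff.
                    self.submit(name, func, priority, attempt + 1)

calls = {"flaky": 0}

def flaky():
    """Fails on the first call, succeeds afterwards."""
    calls["flaky"] += 1
    if calls["flaky"] < 2:
        raise RuntimeError("transient error")

queue = TaskQueue(max_attempts=3)
queue.submit("report", lambda: None, priority=5)
queue.submit("flaky-sync", flaky, priority=1)
queue.run_all()
```

The audit log records every attempt, including failures, so a retried job leaves a visible trail instead of only its final state.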

Handled 10,000+ daily background jobs with Dockerized deployment on AWS EC2.

Applied machine learning systems

End-to-End Fraud Detection Platform

Built a fraud detection pipeline with automated ETL, containerized services, and continuous deployment support for high-volume transaction processing.

Processed 1M+ daily transactions with ensemble models reaching AUC 0.96.

Concurrency and infrastructure

Distributed Rate Limiter

Implemented token bucket and sliding window log rate limiting with thread-safe shared state, structured logging, and integration tests for burst traffic.

Consistent cross-instance enforcement with sub-5ms overhead.
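
For reference, the two algorithms can be sketched in a few lines each. This is an in-process illustration with a lock around the shared state; the class names and the injectable `clock` parameter are assumptions, and the shared store needed for cross-instance enforcement is not shown.

```python
import threading
import time
from collections import deque

class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()
        self.lock = threading.Lock()

    def allow(self, cost=1.0):
        with self.lock:
            now = self.clock()
            # Refill lazily based on the time since the last call.
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

class SlidingWindowLog:
    """Allows at most `limit` requests in any trailing `window` seconds."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.log = deque()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = self.clock()
            # Drop timestamps that have fallen out of the window.
            while self.log and now - self.log[0] >= self.window:
                self.log.popleft()
            if len(self.log) < self.limit:
                self.log.append(now)
                return True
            return False
```

The token bucket tolerates short bursts and needs constant memory per key, while the sliding window log enforces an exact count at the cost of storing one timestamp per accepted request.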