Incoming Software Development Engineer Intern at Amazon

Actively looking for full-time opportunities beginning in May 2027

Open to co-op and internship roles starting September 2026

Overview

Bhanuja Karumuru

I build the infrastructure that makes ML systems reliable under real production constraints, from CUDA kernel profiling on H100 to async backend pipelines and reproducible MLOps workflows.

Current trajectory

2026

NYU to Amazon

Accepted a Software Development Engineer Intern role at Amazon while continuing graduate work at NYU across ML systems, CUDA, and applied research.

2025–2026

Research and systems depth at NYU

Contributing to secure ML runtime work in the Secure Systems Lab and perception workflows in ARC Robotics, while building course and project work around performance, infrastructure, and production ML.

2024–2025

Shipping backend systems at SalesUp

Worked on backend APIs, async pipelines, and AI-powered automation features, improving throughput and tail latency in production.

2020–2024

Signal processing and research at NIT Sikkim

Built the academic foundation that led into speech, ML, and hardware-aware systems work, including my publication at IEEE SPCOM 2024.

Selected work

A few projects that show how I work.

Flagship case study

Fused Linear Attention

CUDA kernel work for transformer inference, built as a three-kernel study of what happens when QKV projection and attention are fused, split, and benchmarked carefully on H100 hardware.

The useful part of this project was not just the final bandwidth result. It was learning how to profile a GPU path carefully, validate the kernel against references, and recognize when lower memory traffic still was not enough because the compute path remained the real bottleneck.

Profiled and benchmarked on H100 across a three-kernel family: the best short-sequence custom path reached a 1.41x speedup at N = 128, and the fused path cut peak HBM reads by 54.6%.

54.6% peak fused-path HBM read cut · max absolute error ≤ 1.5 × 10^-7 · H100 evaluation

Production-minded ML

Datanauts deadline detection

A gated classifier-plus-NER workflow for contract deadline extraction, built with MLflow, Ray Tune, Kubernetes, structured JSON outputs, and a feedback loop for reviewable outputs.

Backend and retrieval

StyleSync

A two-tower retrieval system on the H&M dataset with FAISS candidate generation, FastAPI serving, and public Streamlit and Render deployments.

TFRS · FAISS · H&M dataset · live demo

Multimodal fine-tuning

SmolVLM science VQA

A competition workflow for science visual question answering built around SmolVLM-500M-Instruct, LoRA adapters, and a per-choice ranking formulation under a 5M trainable-parameter cap.

LoRA · 2xT4 Kaggle setup · score 0.93762

Recent milestones

May 2026

Accepted a Software Development Engineer Intern role at Amazon.

Will be joining Amazon while continuing graduate work at NYU.

Apr 2026

Paper accepted to the Journal of Signal Processing Systems.

“Fusion of data augmentation for improved dysarthria severity classification” was accepted for publication in Springer's Journal of Signal Processing Systems as manuscript VLSI-D-25-00096R1.

May 2026

Began a funded TREx Research Fellowship at NYU.

Joined Prof. Debra Laefer's group for Summer 2026 to build ML and CV pipelines for historic map analysis and spatial data extraction.

Jan 2026

Joined ARC Robotics at NYU.

Started work on perception and computer vision pipelines for autonomous robotics.

From the blog

Writing that goes deeper than the project summary.

CUDA · systems writing

How I reduced HBM reads in transformer inference and what the profiler still said next

May 2026

A deeper write-up of the H100 fused-attention project, including the three-kernel setup, the T = 64 tiling decision, profiler evidence, and why the scalar fp32 path still lost on wall time.
