
Multimodal / VQA

SmolVLM Science VQA

A competition workflow for science visual question answering built under a tight model and trainable-parameter budget.

This project came out of a science VQA competition where the interesting constraint was not the dataset alone but the budget around the model. The competition allowed HuggingFaceTB/SmolVLM-500M-Instruct and capped trainable parameters at five million, so the workflow had to be disciplined about what to tune and how to score outputs.

Problem

The task was multiple-choice visual question answering over science problems that mix images, question text, answer options, hints, and lecture-style supporting context. A direct free-form generation setup was easy to try, but its outputs were harder to evaluate and less stable under the competition constraints.

What I built

I turned the task into a per-choice yes/no ranking problem. For each question and answer choice pair, the model predicted whether that choice was correct. At inference time, all choices were scored independently and the highest-scoring option was selected. That made the final decision path easier to compare and debug than asking the model to emit an answer index directly.
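As a rough illustration of that reformulation (a minimal sketch, not the competition code), one multiple-choice question can be expanded into one yes/no example per choice. The prompt wording, the `ChoiceExample` type, and the `expand_question` helper are all hypothetical, and the image is left out so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChoiceExample:
    """One yes/no training example for a single (question, choice) pair."""
    question_id: str
    choice_index: int
    prompt: str
    label: str  # "yes" if this choice is the correct answer, else "no"


def expand_question(question_id: str, question: str, choices: List[str],
                    answer_index: int, hint: str = "",
                    lecture: str = "") -> List[ChoiceExample]:
    """Turn one multiple-choice question into len(choices) yes/no examples."""
    examples = []
    for i, choice in enumerate(choices):
        # Hypothetical prompt template; the real field order and wording may differ.
        parts = [f"Question: {question}"]
        if hint:
            parts.append(f"Hint: {hint}")
        if lecture:
            parts.append(f"Lecture: {lecture}")
        parts.append(f"Proposed answer: {choice}")
        parts.append("Is the proposed answer correct? Answer yes or no.")
        label = "yes" if i == answer_index else "no"
        examples.append(ChoiceExample(question_id, i, "\n".join(parts), label))
    return examples


examples = expand_question(
    "q1", "Which of these is a mammal?", ["shark", "whale", "trout"], answer_index=1
)
for ex in examples:
    print(ex.choice_index, ex.label)
```

Each question with N choices yields exactly one "yes" example and N - 1 "no" examples, so the labels are imbalanced by construction; how that imbalance was handled in training is not covered here.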

Workflow sketch

Inputs: image + question + choices; optional hint, lecture, and metadata
Training path: LoRA on q/k/v/o projection modules; yes/no supervision per choice
Inference path: margin score = log P(yes) - log P(no); pick the highest-scoring choice
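The inference path can be sketched in plain Python, assuming the model has already produced a pair of next-token logits for "yes" and "no" for each choice. The function names and the example logits are illustrative, not taken from the project.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax_pair(yes_logit: float, no_logit: float) -> Tuple[float, float]:
    """Numerically stable log-probabilities over the two tokens {yes, no}."""
    m = max(yes_logit, no_logit)
    log_z = m + math.log(math.exp(yes_logit - m) + math.exp(no_logit - m))
    return yes_logit - log_z, no_logit - log_z


def margin_score(yes_logit: float, no_logit: float) -> float:
    """Margin score: log P(yes) - log P(no)."""
    log_p_yes, log_p_no = log_softmax_pair(yes_logit, no_logit)
    return log_p_yes - log_p_no


def pick_choice(logit_pairs: Sequence[Tuple[float, float]]) -> Tuple[int, List[float]]:
    """Score every choice independently and return (best index, all scores)."""
    scores = [margin_score(y, n) for y, n in logit_pairs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Made-up (yes_logit, no_logit) pairs for a three-choice question.
best, scores = pick_choice([(1.0, 2.0), (3.0, 0.5), (0.0, 0.0)])
print(best, scores)
```

Because the log-normalizer cancels in the subtraction, the margin equals the raw logit difference whether the probabilities are normalized over the two tokens or over the full vocabulary. A bare log P(yes), by contrast, also moves with the probability mass the model gives to every other token, which is one plausible reason the margin ranks more consistently.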

Training setup

Part of the workflow | What I used | Why it mattered
Base model | SmolVLM-500M-Instruct | Stayed inside the competition rules while keeping a real multimodal reasoning path.
Adapter path | LoRA on q_proj, k_proj, v_proj, and o_proj | Kept the trainable parameter count under the 5M cap without freezing the whole problem into prompt engineering.
LoRA defaults | rank 8, alpha 16, dropout 0.05 | Started small enough to stay stable and within budget.
Scoring rule | Margin score: log P(yes) - log P(no) | Made ranking behavior more stable than relying on raw yes scores.
Context packing | Question, choices, hint, lecture, and metadata fields | Let the model use both visual and textual scaffolding without changing the downstream decision rule.
Execution environment | Kaggle with 2xT4 available, but a stable single-GPU training path | DataParallel was unstable in this setup, so stability mattered more than nominal GPU count.
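A back-of-the-envelope check on the adapter budget: LoRA adds two low-rank matrices to each adapted linear layer, so a layer mapping d_in to d_out gains rank * (d_in + d_out) trainable parameters. The sketch below applies that formula to the four projection modules. The hidden size and layer count are placeholder assumptions, not the real SmolVLM-500M-Instruct dimensions, and the `lora_param_count` helper is hypothetical.

```python
from typing import Dict, Tuple


def lora_param_count(rank: int, layer_shapes: Dict[str, Tuple[int, int]],
                     num_layers: int) -> int:
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


# Placeholder dimensions for illustration only; read the real ones from the
# model config before relying on the number.
HIDDEN = 1024
NUM_LAYERS = 24
PARAM_CAP = 5_000_000  # the competition's trainable-parameter cap

shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
}

total = lora_param_count(rank=8, layer_shapes=shapes, num_layers=NUM_LAYERS)
print(f"{total:,} trainable parameters; within cap: {total <= PARAM_CAP}")
```

With these placeholder values the count comes to 1,572,864. Since the count is linear in rank, doubling the rank doubles the budget used, which makes it easy to see how much headroom a given configuration leaves under the cap.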

What the score meant

The best run reached a competition score of 0.93762. I treat that as a useful outcome, but the stronger signal for the portfolio is the workflow behind it: the task reformulation, the adapter budget discipline, and the notebook-plus-script path that could train, validate, and generate a submission without becoming a one-off hack.

Why it matters

This project is one of the clearest multimodal engineering signals on the site. It shows how I think when a problem is constrained by both model choice and training budget: make the scoring rule explicit, keep the tuning surface small, and build a workflow that can still be measured cleanly.