Multimodal / VQA
SmolVLM Science VQA
A competition workflow for science visual question answering built under a tight model and trainable-parameter budget.
This project came out of a science VQA competition where the interesting constraint was not only the
dataset but also the budget around the model. The competition allowed
HuggingFaceTB/SmolVLM-500M-Instruct and capped trainable parameters at five million, so
the workflow had to be disciplined about what to tune and how to score outputs.
Problem
The task was multiple-choice visual question answering over science problems that mix images, question text, answer options, hints, and lecture-style supporting context. A direct free-form generation setup was easy to try, but it was harder to evaluate and less stable under the competition constraints.
What I built
I turned the task into a per-choice yes/no ranking problem. For each question and answer choice pair, the model predicted whether that choice was correct. At inference time, all choices were scored independently and the highest-scoring option was selected. That made the final decision path easier to compare and debug than asking the model to emit an answer index directly.
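
The per-choice decision rule can be sketched in a few lines. This is a minimal illustration, not the competition code: it assumes each (question, choice) prompt has already been run through the model and reduced to one vector of next-token logits, and that `yes_id` and `no_id` are the vocabulary indices of the yes and no tokens. The function names and the toy logits are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def yes_no_margin(logits: Sequence[float], yes_id: int, no_id: int) -> float:
    """Score one choice as log P(yes) - log P(no)."""
    logp = log_softmax(logits)
    return logp[yes_id] - logp[no_id]


def pick_choice(
    per_choice_logits: Sequence[Sequence[float]], yes_id: int, no_id: int
) -> Tuple[int, List[float]]:
    """Score every choice independently and return the best index plus all scores."""
    margins = [yes_no_margin(row, yes_id, no_id) for row in per_choice_logits]
    best = max(range(len(margins)), key=margins.__getitem__)
    return best, margins


# Toy example: three answer choices over a made-up 4-token vocabulary.
toy_logits = [
    [1.0, 2.0, 0.0, 0.0],  # choice 0: "no" is more likely than "yes"
    [3.0, 0.5, 0.0, 0.0],  # choice 1: "yes" is clearly more likely
    [0.2, 0.1, 5.0, 0.0],  # choice 2: the model prefers an unrelated token
]
best_index, scores = pick_choice(toy_logits, yes_id=0, no_id=1)
```

Because both log-probabilities share the same normalizer, the margin reduces to the difference of the two raw logits. Returning every score alongside the winning index is what keeps the decision inspectable: a wrong answer can be traced to the specific choice that outscored the correct one.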
Workflow sketch
Training setup
| Part of the workflow | What I used | Why it mattered |
|---|---|---|
| Base model | SmolVLM-500M-Instruct | Stayed inside the competition rules while keeping a real multimodal reasoning path. |
| Adapter path | LoRA on q_proj, k_proj, v_proj, and o_proj | Kept the trainable parameter count under the 5M cap without freezing the whole problem into prompt engineering. |
| LoRA defaults | rank 8, alpha 16, dropout 0.05 | Started small enough to stay stable and within budget. |
| Scoring rule | Margin score: log P(yes) - log P(no) | Made ranking behavior more stable than relying on raw yes scores. |
| Context packing | Question, choices, hint, lecture, and metadata fields | Let the model use both visual and textual scaffolding without changing the downstream decision rule. |
| Execution environment | Kaggle with 2xT4 available, but a stable single-GPU training path | DataParallel was unstable in this setup, so stability mattered more than nominal GPU count. |
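
The adapter budget in the table can be sanity-checked with simple arithmetic. A LoRA adapter on a linear layer with input width `d_in` and output width `d_out` adds `rank * (d_in + d_out)` trainable parameters. The sketch below applies that formula to the four attention projections. The layer shapes and layer count are assumptions chosen for illustration, not values read from the SmolVLM config, and any vision-tower modules that also match the target names would add to the total.

```python
from typing import Dict, Tuple


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_budget(
    shapes: Dict[str, Tuple[int, int]], num_layers: int, rank: int
) -> int:
    """Total LoRA parameters when every listed projection is adapted in every layer."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# ASSUMED shapes (d_in, d_out) for illustration only; check the real model config.
ASSUMED_SHAPES = {
    "q_proj": (960, 960),
    "k_proj": (960, 320),
    "v_proj": (960, 320),
    "o_proj": (960, 960),
}
ASSUMED_LAYERS = 32
PARAM_CAP = 5_000_000  # competition cap on trainable parameters

total = adapter_budget(ASSUMED_SHAPES, ASSUMED_LAYERS, rank=8)
within_cap = total <= PARAM_CAP
```

Under these assumed shapes, rank 8 lands well under the cap, which is consistent with starting small. Alpha and dropout do not change the count. If the adapters are attached with the peft library, the wrapped model's `print_trainable_parameters()` method reports the actual number and is the figure to trust.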
What the score meant
The best run reached a competition score of 0.93762. I treat that as a useful outcome, but the stronger signal for the portfolio is the workflow behind it: the task reformulation, the adapter budget discipline, and the notebook-plus-script path that could train, validate, and generate a submission without becoming a one-off hack.
Why it matters
This project is one of the clearest multimodal engineering signals on the site. It shows how I think when a problem is constrained by both model choice and training budget: make the scoring rule explicit, keep the tuning surface small, and build a workflow that can still be measured cleanly.