Learning Visual Understanding Through Multimodal Systems

Exploring Vision-Language Models, Prompt Design, and Feedback Loops for Real-World Scene Interpretation
Spring 2025
Figure: Venn diagram of the main ideas from the project


This project investigates how modern vision-language models can be applied to real-world scene understanding through multimodal learning. Instead of focusing on a narrow task like classification or detection, the system aims to generate detailed image captions and structured risk assessments from dashcam inputs. The broader goal is to understand how AI can interpret complex visual environments in a way that is both informative and actionable.

To support this, the team built a custom dataset of 57 dashcam images, capturing a variety of challenging driving scenarios such as low visibility, weather interference, and sudden road hazards. These examples were chosen to test the models' ability to reason about real-world scenes beyond simple object recognition. The dataset serves as a testbed for benchmarking multiple models and evaluating their outputs qualitatively.

Five leading vision-language models were compared: BLIP-2, CLIP, LLaVA, GPT-4o Mini, and Gemini 2.0 Flash. Each was tasked with captioning the same set of images, and the outputs were manually reviewed for detail, accuracy, and reasoning quality. Gemini 2.0 Flash was selected as the best-performing model because it produced the most accurate captions and most consistently generated structured, interpretable outputs.
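As a concrete illustration, here is a minimal sketch of how a captioning pass for one of the models (Gemini 2.0 Flash, via the google-generativeai SDK) could be scripted over the image set. The folder path, prompt text, and output filename are illustrative assumptions, not the project's actual setup.

```python
# Minimal sketch: caption every dashcam image in a folder with one model
# (Gemini 2.0 Flash via the google-generativeai SDK). Paths, prompt text,
# and the output filename are illustrative, not the project's actual code.
import os
import json
from PIL import Image
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

PROMPT = "Describe this dashcam scene in detail, noting any hazards."

captions = {}
for name in sorted(os.listdir("dashcam_images")):      # hypothetical folder
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    image = Image.open(os.path.join("dashcam_images", name))
    response = model.generate_content([PROMPT, image])
    captions[name] = response.text                      # caption kept for manual review

with open("captions_gemini.json", "w") as f:
    json.dump(captions, f, indent=2)
```

Running the same loop against each model's API yields directly comparable caption files for the manual review described above.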

The system was implemented as a Streamlit app that supports both image and video input. Videos are broken into key frames, each frame is processed individually, and the per-frame results are then synthesized into a single assessment. A caching system based on image hashing avoids redundant model calls for repeated frames, and a feedback loop lets users rate outputs and trigger re-generation when necessary. This design supports continual refinement without retraining.
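The sketch below illustrates the key-frame and caching idea under a few assumptions: OpenCV for frame sampling, a SHA-256 hash of the encoded frame as the cache key (a perceptual hash could be substituted), and a hypothetical caption_frame() callable standing in for the model request.

```python
# Minimal sketch of the key-frame + caching idea, not the project's actual code.
import cv2
import hashlib

def sample_key_frames(video_path: str, every_n: int = 30):
    """Yield every Nth frame of the video as a BGR numpy array."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            yield frame
        index += 1
    cap.release()

_caption_cache: dict[str, str] = {}

def frame_hash(frame) -> str:
    """Content hash of the encoded frame, used as the cache key."""
    ok, buf = cv2.imencode(".png", frame)
    return hashlib.sha256(buf.tobytes()).hexdigest()

def caption_with_cache(frame, caption_frame) -> str:
    """Caption a frame, reusing the cached result for duplicate frames."""
    key = frame_hash(frame)
    if key not in _caption_cache:
        _caption_cache[key] = caption_frame(frame)  # model call only once per unique frame
    return _caption_cache[key]

# Example wiring (caption_frame would wrap the model request):
# results = [caption_with_cache(f, caption_frame) for f in sample_key_frames("clip.mp4")]
```

One way to wire the user feedback into this cache would be to evict a frame's entry when its output is rated poorly, so the next pass regenerates it; the project's actual mechanism may differ.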

A key technical focus of the project was prompt engineering. Early prompts were basic but evolved to request structured outputs, including a risk level and supporting evidence. The final prompt framework pairs a five-level risk classification scale with explicit instructions for structured output, improving both quality and consistency across examples.
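A prompt in the spirit of that framework might look like the following; the exact level names, field names, and wording are assumptions for illustration, not the team's actual prompt.

```python
# Illustrative prompt template in the spirit of the final framework. The exact
# level names, field names, and wording are assumptions, not the team's prompt.
RISK_PROMPT = """You are analyzing a dashcam image.

Return a JSON object with exactly these fields:
  "caption": a detailed description of the scene,
  "risk_level": one of "Very Low", "Low", "Moderate", "High", "Very High",
  "evidence": a list of visual cues in the image that justify the risk level.

Base the risk level only on what is visible in the image."""
```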

Beyond performance, the project reflects on the ethical and practical limitations of vision-language systems, including bias in datasets, lack of accountability in outputs, and limited generalization to underrepresented environments. To improve robustness and relevance, the team suggests future work on memory mechanisms, more diverse data, and domain-specific fine-tuning.

Stay Connected

Follow our journey on Medium and LinkedIn.