---
title: Unit 8 Final Project - End-to-End AI Solution Implementation
emoji: 🚀
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
short_description: Multimodal image captioning & vibe evaluation.
---

# Assignment 8 – Multimodal Image Captioning & Vibe Evaluation

This Space implements a **multimodal AI web app** for my AI Solutions class. The app compares two **image captioning models** on the same image, analyzes the emotional *“vibe”* of each caption, and evaluates model performance using **NLP metrics**.

The goal is to explore how **Vision-Language Models (VLMs)** and **text-based models (LLM-style components)** can work together in a single pipeline, and to provide a clear interface for testing and analysis.

---

## 🧠 What This App Does

Given an image and a user-provided *ground truth* caption, the app:

1. **Generates captions** with two image captioning models:
   - **Model 1:** BLIP image captioning
   - **Model 2:** ViT-GPT2 image captioning
2. **Detects the emotional “vibe”** of each caption using a **zero-shot text classifier** with labels such as:
   - Peaceful / Calm
   - Happy / Joy
   - Sad / Sorrow
   - Angry / Upset
   - Fear / Scared
   - Action / Violence
3. **Evaluates the captions** against the ground truth using NLP techniques:
   - **Semantic similarity** via `sentence-transformers` (cosine similarity)
   - **ROUGE-L** via the `evaluate` library (longest-common-subsequence word overlap)
4. **Displays all results** in a Gradio interface:
   - Captions for each model
   - Vibe labels + confidence scores
   - A summary block with similarity and ROUGE-L scores

This makes it easy to see not just *what* the models say, but also *how close* they are to a human caption and *how the wording affects the emotional tone*.

---

## 🔍 Models & Libraries Used

- **Vision-Language Models (VLMs) for captioning**
  - BLIP image captioning model
  - ViT-GPT2 image captioning model
- **Text / NLP Components**
  - Zero-shot text classifier for vibe detection
  - `sentence-transformers/all-MiniLM-L6-v2` for semantic similarity
  - `evaluate` library for ROUGE-L
- **Framework / UI**
  - [Gradio](https://gradio.app/) for the web interface
  - Deployed as a **Hugging Face Space** (this repo)

Minimal code sketches for each stage are included at the end of this README.

---

## 🖼️ How to Use the App

1. **Upload an image**
   - Use one of the provided example images or upload your own.
2. **Enter a ground truth caption**
   - Type a short sentence that, in your own words, best describes the image.
3. **Click “Submit”**
   - The app will:
     - Run both captioning models
     - Classify the vibe of each caption
     - Compute similarity and ROUGE-L vs. your ground truth
4. **Review the outputs**
   - Compare how each model describes the scene
   - Check if the vibe matches what you expect
   - Look at the metrics to see which caption is closer to your description

---
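## 🧩 Code Sketches

The snippets below are minimal, illustrative sketches of each stage of the pipeline, not the exact code in `app.py`. The checkpoint names (`Salesforce/blip-image-captioning-base`, `nlpconnect/vit-gpt2-image-captioning`, `facebook/bart-large-mnli`) are assumptions based on the model families named above.

**Captioning stage.** Both captioners can be driven through the `transformers` `image-to-text` pipeline:

```python
# Sketch of the captioning stage. The exact checkpoints used by app.py are
# an assumption; these are common public checkpoints for each model family.
from transformers import pipeline
from PIL import Image

blip_captioner = pipeline(
    "image-to-text", model="Salesforce/blip-image-captioning-base"
)
vit_gpt2_captioner = pipeline(
    "image-to-text", model="nlpconnect/vit-gpt2-image-captioning"
)

image = Image.open("example.jpg")

# Each pipeline returns a list of dicts with a "generated_text" key
blip_caption = blip_captioner(image)[0]["generated_text"]
vit_caption = vit_gpt2_captioner(image)[0]["generated_text"]
```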
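**Vibe detection stage.** The vibe labels listed above map directly onto the `candidate_labels` argument of a zero-shot classification pipeline; the specific checkpoint here is an assumption:

```python
# Sketch of the vibe-detection stage. facebook/bart-large-mnli is an assumed
# checkpoint; any zero-shot classification model plugs in the same way.
from transformers import pipeline

vibe_classifier = pipeline(
    "zero-shot-classification", model="facebook/bart-large-mnli"
)

VIBE_LABELS = [
    "peaceful / calm",
    "happy / joy",
    "sad / sorrow",
    "angry / upset",
    "fear / scared",
    "action / violence",
]

result = vibe_classifier("a quiet lake at sunrise", candidate_labels=VIBE_LABELS)

# Labels come back sorted by score, so the first entry is the top vibe
top_vibe, confidence = result["labels"][0], result["scores"][0]
```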
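**Evaluation stage.** One way to compute both metrics against the user's ground-truth caption, using the libraries named above (the helper function name is illustrative):

```python
# Sketch of the evaluation stage: cosine similarity over sentence embeddings
# plus ROUGE-L against the ground-truth caption.
from sentence_transformers import SentenceTransformer, util
import evaluate

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rouge = evaluate.load("rouge")

def score_caption(caption: str, ground_truth: str) -> dict:
    # Semantic similarity: cosine similarity between the two embeddings
    emb = embedder.encode([caption, ground_truth], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()

    # ROUGE-L: longest-common-subsequence overlap between the two strings
    rouge_l = rouge.compute(
        predictions=[caption], references=[ground_truth]
    )["rougeL"]

    return {"similarity": similarity, "rouge_l": rouge_l}
```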
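**UI wiring.** A rough sketch of how the pieces could be exposed through Gradio; the real `app.py` layout may differ, and the function body is a placeholder standing in for the stages sketched above:

```python
# Sketch of the Gradio wiring. Component labels and the compare_models
# function are illustrative, not the exact contents of app.py.
import gradio as gr

def compare_models(image, ground_truth):
    # In the real app this would call the captioning, vibe, and scoring
    # helpers sketched above; fixed strings keep this sketch self-contained.
    return (
        "BLIP caption + vibe here",
        "ViT-GPT2 caption + vibe here",
        "Similarity & ROUGE-L summary here",
    )

demo = gr.Interface(
    fn=compare_models,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Ground truth caption")],
    outputs=[
        gr.Textbox(label="BLIP caption + vibe"),
        gr.Textbox(label="ViT-GPT2 caption + vibe"),
        gr.Textbox(label="Similarity & ROUGE-L summary"),
    ],
    title="Multimodal Image Captioning & Vibe Evaluation",
)

demo.launch()
```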