Project

This project for the Building Multimodal GenAI course is your chance to go beyond plain text and design AI that can see, read, and listen, making your work immediately relevant to real-world problems. By completing it, you’ll gain portfolio-ready experience with cutting-edge multimodal models, hands-on skills that industry and research labs are actively looking for.

Getting Started

List of Suggested Project Topics

  • Image-to-Report Generator (Medical-lite): generate a structured report from X-ray-like images (or public chest X-ray datasets) with disclaimers.
  • Visual Q&A Tutor: user uploads a diagram (circuit/biology/graph) and the app explains it + answers questions (see the first sketch after this list).
  • Receipt/Invoice Understanding Assistant: extract fields + summarize spending + flag anomalies from invoice images/PDFs.
  • Multimodal Customer Support Bot: troubleshoot using product photos + user text (“my router lights look like this…”).
  • Slide-to-Study Notes Generator: take lecture slide images and produce clean notes + quiz questions.
  • Video Highlight & Caption Generator: summarize a short video and auto-generate chapters + captions.
  • Audio Meeting Summarizer with Action Items: speech-to-text + summary + tasks + deadlines (see the second sketch after this list).
  • Document + Figure Explainer: parse a research PDF and explain figures/tables in simple language.
  • Personalized Accessibility Tool: convert image-heavy content into spoken explanations and simplified text.
  • Multimodal Sentiment/Emotion Analyzer: combine facial cues (video frames) + voice tone + text for emotion trends (ethics-first).
  • “Ask My Lab Notebook”: query experiment photos + handwritten notes (using synthetic/public data) + generate steps & materials.
  • E-commerce Style Finder: upload an outfit image → generate captions, tags, and “similar style” textual descriptions.
  • Food Calorie Estimator (Approx.): image-based food recognition + rough nutrition summary + confidence + disclaimers.
  • Robotics Instruction Parser (Simulated): interpret a scene image + instruction text → output step-by-step action plan.
  • Multimodal RAG Knowledge Assistant: retrieve across PDFs, images, and transcripts; answer with citations to sources (see the third sketch after this list).
  • Safety/Compliance Content Checker: detect policy issues in ad creatives (image + text) and suggest safer alternatives.
  • Handwritten Form Digitizer: extract fields from scanned forms and validate against rules (DOB format, totals, etc.).
  • AR/VR Scene Narrator (Prototype): describe objects and relationships in a camera feed and generate guidance.
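
Starter Code Sketches

To make the scope of these topics concrete, here are three minimal Python sketches tied to items above. They are starting points under stated assumptions (the library and model names are illustrative choices, not course requirements); swap in whatever models, libraries, or hosted APIs your team prefers.

For the Visual Q&A Tutor, a minimal sketch assuming the Hugging Face transformers library and the public Salesforce/blip-vqa-base checkpoint. BLIP returns short answers; free-form explanations would need a larger vision-language model or a hosted multimodal API layered on top of this.

    # Visual Q&A over an uploaded diagram or image.
    # Assumes: pip install transformers torch pillow
    from PIL import Image
    from transformers import pipeline

    # Illustrative public BLIP checkpoint; any VQA-capable model can be substituted.
    vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

    def answer_question(image_path: str, question: str) -> str:
        """Return the model's top answer to a question about the image."""
        image = Image.open(image_path).convert("RGB")
        result = vqa(image=image, question=question, top_k=1)
        return result[0]["answer"]

    if __name__ == "__main__":
        # "circuit_diagram.png" is a placeholder for a user-uploaded file.
        print(answer_question("circuit_diagram.png", "How many resistors are in this circuit?"))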
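
For the Audio Meeting Summarizer with Action Items, a sketch of the first two stages (speech-to-text, then summarization), assuming Whisper and BART checkpoints from the Hugging Face Hub; action-item and deadline extraction would be a further step, for example prompting an LLM over the transcript.

    # Meeting transcription + summary (action items left as a follow-up step).
    # Assumes: pip install transformers torch  (plus ffmpeg for audio decoding)
    from transformers import pipeline

    # Illustrative checkpoints; any ASR and summarization models can be substituted.
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    def summarize_meeting(audio_path: str) -> dict:
        """Transcribe a recording and produce a short summary of it."""
        # chunk_length_s lets the pipeline handle recordings longer than 30 seconds.
        transcript = asr(audio_path, chunk_length_s=30)["text"]
        # BART accepts roughly 1024 tokens of input; very long transcripts need to be
        # split, summarized piecewise, and then summarized again.
        summary = summarizer(transcript, max_length=150, min_length=40, do_sample=False)
        return {"transcript": transcript, "summary": summary[0]["summary_text"]}

    if __name__ == "__main__":
        # "team_standup.wav" is a placeholder for a meeting recording.
        print(summarize_meeting("team_standup.wav")["summary"])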
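
For the Multimodal RAG Knowledge Assistant, a sketch of the retrieval half assuming sentence-transformers with a CLIP checkpoint, which embeds text and images into one shared vector space. The "index" here is an in-memory list with placeholder content; a real project would use a vector database, ingest PDF pages and transcript chunks, and generate answers that cite the retrieved items.

    # Cross-modal retrieval: one CLIP embedding space for text snippets and images.
    # Assumes: pip install sentence-transformers pillow
    from PIL import Image
    from sentence_transformers import SentenceTransformer, util

    # Illustrative checkpoint; CLIP maps images and text into the same vector space.
    model = SentenceTransformer("clip-ViT-B-32")

    # Toy in-memory corpus with placeholder text and a placeholder image path.
    corpus = [
        {"kind": "text", "content": "Figure 3 shows accuracy versus training epochs."},
        {"kind": "image", "content": "figures/figure3.png"},
    ]

    def embed(item):
        """Embed a text snippet or an image file with the shared CLIP encoder."""
        if item["kind"] == "image":
            return model.encode(Image.open(item["content"]).convert("RGB"))
        return model.encode(item["content"])

    corpus_embeddings = [embed(item) for item in corpus]

    def retrieve(query: str, top_k: int = 3):
        """Return the top_k corpus items most similar to the text query, with scores."""
        query_embedding = model.encode(query)
        scored = [
            (float(util.cos_sim(query_embedding, emb)), item)
            for emb, item in zip(corpus_embeddings, corpus)
        ]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

    if __name__ == "__main__":
        for score, item in retrieve("Which figure reports accuracy over epochs?"):
            print(f"{score:.3f}  {item['kind']}: {item['content']}")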