AI-Powered Fashion Assistant
MyCloset
A full-stack AI fashion platform that transforms how users manage their wardrobe. MyCloset combines multimodal LLM classification, intelligent outfit recommendation, and real-time computer vision processing in a production PWA.
The Challenge
Fashion is deeply personal, contextual, and visual. Building a system that reliably understands clothing taxonomy, generates stylistically coherent outfit combinations, and processes images at consumer-grade quality required bridging multiple AI disciplines in a single product.
Our Approach
1. Multimodal classification pipeline with a fine-tuned Florence-2 vision-language model as the primary classifier and Gemini escalation for hard categories, extracting type, color, pattern, material, fit, season, occasion, and brand with confidence scoring.
2. LLM-driven outfit recommendation that reasons over a user's full wardrobe, considering occasion, weather, personal style profile, and wear history to generate coherent combinations.
3. Custom background removal pipeline using compact, simple segmentation models running on Cloud Run. We tried Gemini image models for this and found them unreliable: they tend to modify the image rather than cleanly remove the background, sometimes even painting in a checkered 'transparent' pattern. Purpose-built segmentation models won out.
4. Closet-level style analytics that synthesize wardrobe composition into actionable insights: dominant colors, style descriptors, gap analysis, and shopping recommendations.
AI & Technical Highlights
Multimodal LLM Classification
Image-to-taxonomy pipeline extracting 10+ attribute dimensions from clothing photos, powered by a fine-tuned Florence-2 vision-language model with Gemini escalation for difficult categories.
Generative Recommendation
LLM-based outfit suggestion engine that reasons compositionally over items, context, and user preferences.
Computer Vision Pipeline
Purpose-built segmentation models for clothing photography, handling the edge cases general models mishandle: garments on hangers, awkward angles, cluttered backgrounds, and flat-lays.
Style Intelligence
Closet-level analysis synthesizing wardrobe patterns into style profiles, gap identification, and personalized recommendations.
Technical deep dive
MyCloset runs a multi-model inference pipeline with confidence-based routing, custom-trained vision models for clothing-specific tasks, and a personalized RAG layer that grounds outfit generation in the user's actual wardrobe, wear history, and external context like weather.
Pipeline
1. Segmentation
Custom-trained segmentation model specifically for clothing photography. Handles the cases general segmentation models miss: garments on hangers, awkward angles, bad lighting, cluttered backgrounds, and items laid flat. Trained on a curated dataset of real-world user uploads rather than clean product photography.
We specifically chose a purpose-built segmentation approach over using Gemini or other generative image models for this step. Generative models turned out to be unreliable background removers -- they tend to alter the image rather than cleanly isolate the garment, sometimes even hallucinating a checkered 'transparent' pattern. For segmentation, simple and purpose-trained beat large and general.
2. Multimodal Classification
Two-pass classification against a hierarchical taxonomy (type, subtype, body location, fabric, pattern, color, season, occasion, brand). The primary pass uses a LoRA-fine-tuned Florence-2 vision-language model running on Cloud Run with GPU. Carefully tuned pre-prompting and enforced structured JSON output give a confidence score for each attribute dimension.
Worth noting: once the background is cleanly removed and the garment is isolated on a white or transparent background, Gemini performs reliably on object-level identification -- which is why the fallback tier works. The hard part was the segmentation step before it, not the classification after.
3. Confidence-Based Escalation
When primary-model confidence falls below a threshold, or when the item belongs to a category the local model systematically underperforms on, the pipeline escalates to Gemini Flash. Cheap local inference handles most classifications; specific categories and edge cases route to frontier inference. This trades cost and latency against quality per inference rather than applying one policy to everything.
4. Image Cleanup (optional)
Gemini image generation and editing models handle AI-driven photo cleanup: improving garment hang, removing wrinkles, correcting skew, and generally making user-uploaded photos look catalog-quality. These are prompt-engineered edits rather than generic upscaling or filters.
5. Personalized RAG + Outfit Generation
Pre-query construction pulls relevant items from the user's closet using semantic similarity, structured taxonomy matching, and temporal signals (wear frequency, wear recency). Context assembly combines retrieved items, existing outfit history, weather forecast, and seasonal style rules. The assembled prompt is passed to Gemini, with the response constrained to strict structured JSON that references items by our item UUIDs.
6. A/B Presentation + User Feedback
Users see two AI-generated outfit options with LLM-written justifications. Accepts, rejects, and query refinements feed back into the pre-query builder, progressively improving recommendation quality per-user. The same infrastructure powers future shopping recommendations against partner catalogs.
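The structured-response constraint in step 5 can also be checked on the way back from the model, before anything reaches the A/B view in step 6. The following is a minimal sketch, not MyCloset's actual code: the response shape (`options`, `item_ids`, `justification`), the `OutfitOption` type, and the function name `parse_outfit_response` are all assumptions made for illustration. The idea is simply to reject any outfit that references an item the user does not own.

```python
import json
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class OutfitOption:
    item_ids: tuple      # closet item UUIDs, kept as strings
    justification: str   # the LLM-written explanation shown to the user


def parse_outfit_response(raw, closet_ids, expected_options=2):
    """Parse a model response (hypothetical shape) and keep it grounded in the closet.

    Raises ValueError on malformed JSON, a wrong number of options,
    a malformed UUID, or an item id that is not in `closet_ids`.
    """
    payload = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    options = payload.get("options") if isinstance(payload, dict) else None
    if not isinstance(options, list) or len(options) != expected_options:
        raise ValueError(f"expected {expected_options} outfit options")

    parsed = []
    for option in options:
        if not isinstance(option, dict):
            raise ValueError("outfit option must be a JSON object")
        item_ids = option.get("item_ids")
        if not isinstance(item_ids, list) or not item_ids:
            raise ValueError("outfit option has no items")
        for item_id in item_ids:
            if not isinstance(item_id, str):
                raise ValueError("item ids must be strings")
            uuid.UUID(item_id)  # raises ValueError if not a well-formed UUID
            if item_id not in closet_ids:
                raise ValueError(f"item not in closet: {item_id}")
        parsed.append(
            OutfitOption(tuple(item_ids), str(option.get("justification", "")))
        )
    return parsed
```

A caller would typically treat a `ValueError` here as a signal to retry generation or fall back, rather than showing a partially valid outfit.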
Model Infrastructure
Florence-2 (LoRA fine-tuned)
Primary multimodal classifier
PEFT fine-tuning on a curated clothing dataset combining public fashion imagery with synthetic examples generated via Gemini image models to cover taxonomy gaps. Where coverage was thin (e.g., plaid pajamas, specific pattern-fabric combinations), we generated targeted synthetic data to ensure taxonomy completeness rather than relying on whatever the public datasets happened to include. Runs on Cloud Run with GPU for cost-efficient inference.
Custom segmentation model
Background removal and item isolation
Trained specifically for clothing photography edge cases that general segmentation models mishandle. We evaluated generative image models (Gemini) for this step and found them unreliable: they tend to alter the image rather than cleanly remove the background. Simple, purpose-trained segmentation beat large, general-purpose models for this specific task.
Gemini Flash
Escalation classifier for difficult item categories
Invoked on low-confidence primary inference and on item categories where the local model systematically underperforms. Shoes and handbags, for example, often read as featureless shapes to smaller vision models when users photograph them from above or against cluttered backgrounds. Gemini Flash handles these cases well out of the box, so we route to it by category rather than attempting to force the local model to cover every case.
Gemini (general)
Outfit generation and style reasoning
Constrained to structured JSON responses referencing our taxonomy UUIDs.
Gemini image generation and editing models
AI-driven image cleanup
Prompt-engineered photo improvement for skew correction, wrinkle removal, and garment-hang fixes.
Routing
Routing shifts cost and latency between local and frontier inference based on confidence thresholds and item category. We tried the obvious alternative -- more aggressive fine-tuning of the local VLM -- and found diminishing returns: past a point, more training data on the hard categories stopped improving accuracy. Category-aware fallback to Gemini turned out to be more cost-effective than continuing to push the local model, and it keeps our inference stack honest about which cases actually need frontier-class reasoning.
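The routing rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the production policy: the threshold value, the use of the weakest per-attribute confidence as the trigger, and the names `LocalPrediction` and `should_escalate` are hypothetical. The category set uses shoes and handbags only because they are the examples named earlier.

```python
from dataclasses import dataclass

# Hypothetical values; the real threshold and category list are not given here.
CONFIDENCE_THRESHOLD = 0.80
ESCALATE_CATEGORIES = frozenset({"shoes", "handbag"})


@dataclass(frozen=True)
class LocalPrediction:
    category: str       # top-level type predicted by the local model
    confidences: dict   # attribute name -> confidence in [0, 1]


def should_escalate(pred,
                    threshold=CONFIDENCE_THRESHOLD,
                    hard_categories=ESCALATE_CATEGORIES):
    """Return (escalate, reason) for one local prediction."""
    # Category-aware rule: some categories always go to the frontier model.
    if pred.category in hard_categories:
        return True, "category"
    # No confidence scores at all is treated as low confidence.
    if not pred.confidences:
        return True, "no_confidence"
    # Confidence rule: escalate if the weakest attribute is below threshold.
    weakest = min(pred.confidences, key=pred.confidences.get)
    if pred.confidences[weakest] < threshold:
        return True, f"low_confidence:{weakest}"
    return False, "local"
```

Returning a reason alongside the decision makes it easy to log how often each rule fires, which is the kind of data needed to judge whether a category still belongs on the escalation list.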
Data Architecture
User wardrobe, outfit history, and wear-event logs are indexed with both semantic embeddings and structured metadata. RAG queries combine vector similarity with structured filtering (taxonomy matching, temporal windows) and external context injection (weather forecast, calendar context, location). Context assembly is tuned per-query-type rather than one-size-fits-all.
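As a rough illustration of combining vector similarity with structured filtering and a temporal signal, the sketch below filters by season, then ranks by a weighted mix of cosine similarity and how long an item has rested since it was last worn. Everything specific here is an assumption: the `ClosetItem` fields, the weights, the 14-day rest window, and the choice to favor rested items. A real deployment would use a vector index rather than a linear scan.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ClosetItem:
    item_id: str
    item_type: str
    seasons: frozenset          # structured metadata, e.g. {"winter", "fall"}
    embedding: tuple            # semantic embedding of the item
    last_worn: Optional[date]   # None if never worn


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(items, query_embedding, season, today,
             top_k=5, w_semantic=0.7, w_rest=0.3, rest_window_days=14):
    """Filter by season, then rank by similarity plus a rest bonus."""
    scored = []
    for item in items:
        if season not in item.seasons:   # structured taxonomy filter
            continue
        if item.last_worn is None:
            days = rest_window_days      # never worn counts as fully rested
        else:
            days = (today - item.last_worn).days
        rest = min(max(days, 0), rest_window_days) / rest_window_days
        score = w_semantic * cosine(item.embedding, query_embedding) + w_rest * rest
        scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]
```

Tuning context assembly per query type would then amount to choosing different weights, filters, and windows for each kind of request.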
Feedback Loop
Every outfit A/B shown to the user produces accept, reject, and refinement signals. These feed into the pre-query builder and prompt construction layer, tuning retrieval weights and context assembly for that user over time. We deliberately don't fine-tune the underlying LLM on these signals -- we tried that approach earlier and found it less reliable than keeping the base model stable and improving the layers around it. Prompt-layer improvement gives us faster iteration, interpretable behavior, and no risk of model drift.
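To make the idea of tuning retrieval weights from feedback concrete, here is a deliberately simple sketch. It is not the update rule MyCloset uses: the additive nudge, the learning rate, the floor value, and the name `update_weights` are hypothetical. Each accept moves a user's weights toward the signals that contributed to the shown outfit, each reject moves them away, and the weights are renormalized to sum to 1.

```python
def update_weights(weights, features, accepted, learning_rate=0.1, floor=1e-3):
    """Return new per-user retrieval weights after one accept/reject signal.

    weights:  signal name -> current weight (e.g. "semantic", "recency")
    features: signal name -> how strongly that signal contributed to the
              outfit that was shown, in [0, 1]
    """
    direction = 1.0 if accepted else -1.0
    raw = {
        name: max(floor, w + direction * learning_rate * features.get(name, 0.0))
        for name, w in weights.items()
    }
    total = sum(raw.values())
    # Renormalize so weights stay comparable across users and over time.
    return {name: value / total for name, value in raw.items()}
```

Because the state being updated is a handful of named weights rather than model parameters, it stays inspectable and can be reset per user without touching the base model.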