During my time working on an internal R&D project, I set out to solve a fascinating problem: automatically detecting UI components on any website screenshot. Think buttons, navigation bars, hero sections, forms — the building blocks of every web page. The goal was to power a tool that could reverse-engineer designs into structured data.

The Custom Model Approach

I chose YOLOv8 as the backbone for object detection. YOLO (You Only Look Once) is renowned for real-time detection, and v8 brought significant improvements in accuracy and training speed. The plan was straightforward: collect website screenshots, annotate UI components, train a custom model, and deploy it.
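The plan hinges on a small dataset spec that points the trainer at images and class names. A minimal sketch of that config, with paths and class list as my own illustrative assumptions:

```yaml
# data.yaml — dataset spec a YOLOv8 training run consumes (paths/classes illustrative)
path: datasets/ui-components
train: images/train
val: images/val
names:
  0: header
  1: button
  2: input
  3: card
  4: footer
  5: nav
```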

Building the Dataset

This was the hardest part. I spent weeks curating a dataset of 2,000+ website screenshots across different industries, styles, and layouts. Each screenshot was manually annotated with bounding boxes for components like headers, buttons, input fields, cards, footers, and navigation menus. Tools like Roboflow helped streamline the annotation pipeline, but the sheer variety of modern web design made consistency a real challenge.
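Each annotation ultimately becomes one line per box in YOLO's label format: a class id plus a center point and size, normalized to the image dimensions. A minimal sketch of that conversion (function name and example values are mine):

```python
# Convert a pixel-space bounding box (x1, y1, x2, y2) into a YOLO label line:
# "class x_center y_center width height", all normalized to [0, 1].
def to_yolo_label(cls_id, box, img_w, img_h):
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w   # box center, normalized by image width
    cy = (y1 + y2) / 2 / img_h   # box center, normalized by image height
    w = (x2 - x1) / img_w        # box width as a fraction of the image
    h = (y2 - y1) / img_h        # box height as a fraction of the image
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# A 200x50 button at (100, 300) in a 1280x720 screenshot:
print(to_yolo_label(1, (100, 300, 300, 350), 1280, 720))
# → 1 0.156250 0.451389 0.156250 0.069444
```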

Training & Results

After several training iterations with hyperparameter tuning, the model reached ~78% mAP (mean Average Precision) on the validation set. It worked remarkably well on clean, conventional layouts — marketing pages, SaaS dashboards, e-commerce sites. But it struggled with unconventional designs, overlapping elements, and responsive variations.
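Under the hood, mAP is built on intersection-over-union matching: a prediction counts as a hit only if its IoU with a ground-truth box clears a threshold. A minimal sketch of that core computation:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted header box shifted 10px from ground truth; IoU >= 0.5
# would count as a true positive at the common mAP@0.5 threshold.
print(iou((0, 0, 100, 50), (10, 0, 110, 50)))  # ≈ 0.818
```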

The inference speed was excellent (~30ms per image on a GPU), which validated YOLO as the right architecture. But the accuracy ceiling was a concern. To push beyond 85% mAP, I'd need 10x more annotated data and domain-specific augmentation strategies.

The Pivot to AI APIs

Around this time, vision-language models such as GPT-4V and Claude with vision support became production-ready. I ran a quick experiment: feed the same screenshots to these APIs with a structured prompt asking them to identify UI components and their positions. The results were eye-opening.
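The experiment boiled down to a structured prompt plus JSON parsing of the reply. A minimal sketch, where the prompt wording, output schema, and example reply are my own assumptions and the actual API call is omitted:

```python
import json

# Hypothetical structured prompt; the component list and schema are assumptions.
PROMPT = (
    "Identify every UI component in this screenshot. "
    "Respond with JSON only: a list of objects with keys "
    '"type" (e.g. button, nav, hero, form), "role" (primary/secondary), '
    'and "bbox" ([x1, y1, x2, y2] in pixels).'
)

def parse_components(raw):
    """Parse the model's JSON reply, tolerating a markdown code fence."""
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()
    return json.loads(raw)

# Example reply shaped like the responses I describe (values illustrative):
reply = '[{"type": "button", "role": "primary", "bbox": [840, 40, 960, 80]}]'
print(parse_components(reply)[0]["role"])  # primary
```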

The AI APIs didn't just detect components — they understood context. They could distinguish a primary CTA button from a secondary one, identify navigation patterns, and even infer the semantic purpose of sections. The accuracy on diverse layouts was significantly higher than my custom model, without any training data at all.

Lessons Learned

  • Building custom ML models teaches you fundamentals that no API abstraction can replace — data pipelines, loss functions, evaluation metrics, and the importance of data quality.
  • The build-vs-buy decision in ML is shifting rapidly. What required months of custom work in 2024 can now be achieved with a well-crafted prompt in 2026.
  • Custom models still win when you need extreme speed, offline inference, or domain-specific edge cases. But for most product use cases, AI APIs offer a faster path to production.
  • The R&D wasn't wasted — understanding how detection models work made me a better consumer of AI APIs. I write better prompts because I understand what the model is actually doing under the hood.

What's Next

I've open-sourced the YOLOv8 training pipeline and the annotated dataset subset (with permission). The project lives on as a learning resource for anyone interested in computer vision for web analysis. Meanwhile, the production tool now uses a hybrid approach — AI APIs for understanding and lightweight custom models for speed-critical paths.
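The hybrid split can be pictured as a simple dispatcher: speed-critical paths hit the local model, everything else goes to a vision API. A sketch under my own assumptions, with both backends stubbed out and all names hypothetical:

```python
# Hypothetical dispatcher for the hybrid approach described above.
def detect_components(image, latency_budget_ms, local_model, vision_api):
    if latency_budget_ms < 100:       # tight budget: lightweight local model
        return local_model(image)
    return vision_api(image)          # otherwise: richer API-level understanding

# Stub backends for illustration only:
local = lambda img: [{"type": "button", "source": "yolo"}]
api = lambda img: [{"type": "button", "role": "primary", "source": "api"}]

print(detect_components("home.png", 30, local, api)[0]["source"])    # yolo
print(detect_components("home.png", 2000, local, api)[0]["source"])  # api
```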