Google Cloud

VERTEX AI

Executive Summary

In Q1 2025, Google Cloud faced a critical challenge: while adoption of generative AI tools surged industry-wide, developers struggled to fully leverage Vertex AI Studio’s advanced capabilities—particularly Retrieval-Augmented Generation (RAG), grounding, and output comparison.

To diagnose this friction and unlock product-led growth, Pulse Labs conducted a focused UX research initiative on the developer journey. Through a diary-based usability study of real-world LLM workflows, we uncovered hidden usability breakdowns in core features. These were not technical limitations but design and discoverability failures at the intersection of UX and ML infrastructure.

Armed with these insights, Google Cloud made targeted product improvements—introducing guided workflows, inference-layer guardrails, and improved feature visibility. As a result, advanced task success rates increased by 81%, RAG adoption rose 41%, and onboarding-related support volume fell 31%.

This work showcases how strategic UX research can accelerate time-to-value, de-risk platform adoption, and directly drive revenue-linked KPIs—particularly in complex, developer-facing AI tools.

Research Methodology

Objective
To identify UX friction points in the end-to-end developer workflow within Vertex AI Studio, with an emphasis on grounding, corpus ingestion, prompt execution, and output evaluation for enterprise-grade LLM applications.

Study Design
We selected a 4-session remote diary study paired with embedded usability testing to reflect realistic developer workflows over time, rather than artificial one-off tasks.

Participant Profile:

  • n = 12 developers, diverse in geography (58% India, 42% U.S.), development and IDE experience levels, and organizational context

  • Familiarity with AI/ML tools and use cases (e.g., prompt engineering, fine-tuning, custom pipelines)

  • Representation of both startup and enterprise development environments

Rationale for Diary Study Approach

  1. Temporal Insight
    Captures how developer understanding and frustration evolve over time as they attempt increasingly complex tasks—mirroring real LLM deployment cycles.

  2. Task Fidelity
    Allows observation of full-stack workflows, from Google Cloud Storage (GCS) upload through RAG to inference, rather than isolated button clicks.

  3. Cognitive Mapping
    Reveals when and where breakdowns occur between intent (e.g., using a custom corpus) and platform behavior (e.g., Gemini defaulting due to missed configuration).

Scoring Framework

To quantify usability across both emotional and functional dimensions, we applied a 3-metric system:

  • Success Score: Binary completion + observed outcome alignment

  • Health Score: Degree of friction and recoverability (custom rubric)

  • Experience Score: Confidence, satisfaction, and self-reported clarity (via Likert scale)

This system enabled us to precisely map usability gaps to technical bottlenecks in ML pipeline execution—offering Google an actionable blueprint for high-ROI improvements.
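To make the three metrics concrete, the sketch below scores a single session. The record fields and rubric thresholds are illustrative assumptions, not the study's actual scoring instrument:

```python
from dataclasses import dataclass

# Hypothetical session record; the field names and the rubric below are
# illustrative assumptions, not the study's real instrument.
@dataclass
class SessionRecord:
    completed: bool          # task finished at all
    outcome_aligned: bool    # output matched the participant's stated intent
    friction_events: int     # observed breakdowns (dead ends, misconfigurations)
    recovered: bool          # participant recovered without moderator help
    likert_ratings: list     # 1-5 self-reports: confidence, satisfaction, clarity

def success_score(r: SessionRecord) -> int:
    # Binary completion AND observed outcome alignment.
    return 1 if (r.completed and r.outcome_aligned) else 0

def health_score(r: SessionRecord) -> str:
    # Simplified rubric: low friction plus unaided recovery counts as Healthy.
    return "Healthy" if (r.friction_events <= 1 and r.recovered) else "Unhealthy"

def experience_score(r: SessionRecord) -> float:
    # Mean Likert rating rescaled to 0-100.
    return sum(r.likert_ratings) / len(r.likert_ratings) / 5 * 100

# A session that completed but produced a misaligned outcome with friction.
session = SessionRecord(completed=True, outcome_aligned=False,
                        friction_events=3, recovered=False,
                        likert_ratings=[4, 3, 4])
print(success_score(session),
      health_score(session),
      round(experience_score(session)))  # -> 0 Unhealthy 73
```

Scoring completion and outcome alignment separately is what lets the framework catch "silent failures" such as prompts that run but never touch the custom corpus.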

What We Learned

Insight #1
Drop-Off in Value-Realization at the Inference Layer

While users navigated corpus creation and basic prompt execution with ease, performance collapsed when asked to evaluate outputs or configure grounding.

  Task                    Success Rate   Health Score   Experience Score
  Upload File → GCS       50%            Unhealthy      76%
  Ground RAG Pipeline     33%            Unhealthy      70%
  Compare Model Outputs   8%             Unhealthy      68%

The delta between foundational task success and inference-layer breakdowns signaled a failure to scaffold users through full-stack ML workflows—especially where LLM context attribution mattered most.

Business Risk: Without enabling users to validate or refine model outputs, Vertex AI risked being perceived as a black-box system—eroding trust in AI explainability and limiting adoption among enterprise developers building regulated or high-fidelity applications.

Insight #2
RAG Pipelines Were Technically Sound but UX-Obscured

Vertex’s RAG architecture allows for dynamic knowledge injection via grounded corpora—but the interface failed to surface these capabilities in a usable, discoverable way.

  • 6 of 12 users defaulted to Drive or local file paths, bypassing GCS and invalidating grounding

  • 8 of 12 failed to activate the "Ground" toggle, often due to the default "Google Search" option disabling downstream RAG customization

  • 7 of 12 skipped RAG engine selection altogether, unknowingly routing queries to Gemini instead of their custom corpus

These weren't novice users—they were developers familiar with vector embeddings, corpus indexing, and prompt templating. The breakdown wasn’t technical—it was cognitive load compounded by design opacity.
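A lightweight pre-flight check could have surfaced all three failure modes before prompt execution. The sketch below is illustrative only: the config schema (field names like corpus_uri, grounding_source, rag_engine) is hypothetical, not the Vertex AI API:

```python
# Hypothetical guardrail sketch: flag the three grounding misconfigurations
# observed in the study before a prompt runs. The config keys are assumed
# for illustration and do not reflect Vertex AI's actual schema.

def preflight_grounding_check(config: dict) -> list:
    warnings = []

    # Failure mode 1: corpus stored outside GCS, which invalidates grounding.
    uri = config.get("corpus_uri", "")
    if not uri.startswith("gs://"):
        warnings.append("Corpus must be a GCS URI (gs://...); "
                        "Drive or local paths invalidate grounding.")

    # Failure mode 2: "Ground" toggle off, or left on the default
    # Google Search source, which disables custom RAG downstream.
    if config.get("grounding_source") != "custom_corpus":
        warnings.append("Grounding is not set to the custom corpus; "
                        "responses will not use your data.")

    # Failure mode 3: no RAG engine selected, so queries silently route
    # to the base Gemini model instead of the grounded pipeline.
    if not config.get("rag_engine"):
        warnings.append("No RAG engine selected; queries will fall "
                        "back to the base model.")

    return warnings

# A config that reproduces all three observed breakdowns.
print(preflight_grounding_check({"corpus_uri": "/local/docs.pdf",
                                 "grounding_source": "google_search"}))
```

A valid config (gs:// URI, custom-corpus grounding, an engine selected) returns an empty list, so the check only interrupts users who are about to query the wrong backend.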

Strategic Implication
Every failed grounding attempt delayed production deployment, weakened model accuracy, and increased the likelihood of churn to competitor platforms with stronger onboarding pipelines (e.g., OpenAI’s CustomGPT or Anthropic’s Claude Workbench).

Insight #3
Core AI Debugging Tools Were Practically Invisible

Vertex’s unique “Compare Responses” feature, which enables output-level evaluation across LLM versions, was missed by 11 of 12 participants.

  • Hidden behind a top-bar icon with no labeling

  • Lacked onboarding cues or in-context education

  • No integration with user task flow or response saving tools

Combined with inconsistent naming conventions (corpus_name, endpoint, session_id), the platform became a high-risk environment for ML experimentation, especially for users working in regulated or latency-sensitive domains.

Product Interventions
Translating Research into ML Platform Maturity

Based on these insights, Google made several strategic product investments directly tied to our findings:

1. Guided Configuration for Grounding and Corpus Setup

  • Interactive, stepwise workflows for GCS uploads

  • Intelligent prompts triggered by missing corpus links or grounding configuration

  • Visibility of active RAG engine prior to prompt execution

Result: RAG feature activation increased by 41% in Q2

2. Inference Assurance Layer Enhancements

  • Enforced RAG engine selection with visual confirmation

  • Streamlined markdown toggles and consistent corpus naming conventions

  • Reduced misrouting of queries to unintended LLM backends

Result: Average task time reduced by 27%; confidence scores increased by 22%

3. Visibility & Education on Output Evaluation Tools

  • Rebranded and repositioned the “Compare” tool with a persistent tab

  • Added inline micro-lessons and prompt save history for benchmarking

  • Introduced explanation layers to support model interpretability

Result: Power user engagement with advanced tools rose by 35%

____________________________________

Final Note: UX as a Multiplier for ML Adoption

As AI platforms become more sophisticated, their success will hinge on cognitive load management, workflow visibility, and trust in model behavior.

This research proves that:

  • Trust isn’t built by the model—it’s built by the interface.

  • UX is the connective tissue between ML infrastructure and product-market fit.

  • When done right, it’s not just good design—it’s a strategic moat.