There is a measurable cost to serving static images. Not a performance cost — that’s a different conversation, one I covered in an earlier post on Cloudinary as a DAM. This is an engagement cost: the gap between what a user feels when they interact with a product image and what they feel when they look at one.
A 2017 paper in Computers in Human Behavior quantified this gap directly. Blázquez Cano et al. ran a controlled experiment with 218 participants browsing fashion clothing on an iPad, split across three conditions: static images, 360° visual rotation, and tactile simulation (a scrunch gesture that deformed the fabric texture on screen). Engagement scores across dimensions like novelty, felt involvement, and endurability were significantly higher in both interactive conditions than in the static control. The static image condition scored 1.34 out of 7 for novelty: participants essentially disagreed that they felt any curiosity or interest. The interactive conditions scored 4.63 and 4.95 on the same measure.
What the researchers describe as image interactivity technology — the ability to rotate, zoom, scrunch, or otherwise manipulate a product representation in real time — is not a niche UX experiment. It’s a description of the gap between what physical retail offers and what most e-commerce does instead.
The Sensory Impoverishment Problem
The paper frames it clearly: clothing products “suffer from sensory impoverishment when retailed online.” The customer cannot assess drape, weight, or texture from a front-and-back JPEG. They can make inferences, but those inferences come with perceived risk — and perceived risk reduces purchase intent.
The interesting finding from the study is that the need for tactile interaction did not moderate the engagement effect. Participants who reported a strong preference for handling products in-store did not respond differently to interactive images than those who didn’t. The interactivity worked regardless of individual preference, which is a strong signal that this isn’t a segment-specific concern. It’s a baseline problem with static presentation that interactive media addresses universally.
The implication for anyone building or evaluating a product media pipeline is that the choice between static and interactive assets isn’t a UX nicety. It’s a decision with measurable downstream effects on the engagement metrics that determine whether a browsing session converts.
What the Infrastructure Actually Needs to Support
Here’s where the research connects to the engineering problem. Serving static images is solved. A CDN handles it. The challenge with interactive imagery — 360° views, zoom into texture, video-based product showcases — is that the asset pipeline requirements are fundamentally different.
A 360° product viewer typically needs 24 to 72 frames of the same product shot at consistent intervals. A video-based scrunch effect, like the Shoogleit tool used in the study, needs a captured video of physical fabric manipulation merged with touch input. These are not assets a designer exports from Photoshop and drops into an S3 bucket. They require a pipeline that can:
- Accept high-resolution source material in formats that preserve detail (RAW, uncompressed video)
- Generate derivative assets at the right resolutions and formats for the target device
- Deliver those derivatives with the latency characteristics interactive features require — a 360° viewer that stutters at every frame rotation is worse than a static image
This is exactly the gap a DAM like Cloudinary sits in. The URL-based transformation API makes derivative generation a function of the request rather than a pre-computed batch job. You store the source, express the transformation in the URL, and Cloudinary handles the rest — caching the result at the CDN edge for subsequent requests.
For a 360° viewer backed by a video source, that looks roughly like this (the `{cloud}` placeholder and the `videoDuration` value are per-account specifics):

```javascript
// Extract a single frame from a product video using Cloudinary's URL API.
// so_<seconds> (start offset) plus a .jpg extension delivers one still frame.
const frameUrl = (publicId, offsetSeconds) =>
  `https://res.cloudinary.com/{cloud}/video/upload/so_${offsetSeconds}/${publicId}.jpg`;

// Build a frame sequence for a 360° viewer (36 frames at 10° intervals).
const videoDuration = 6; // length of the source rotation video, in seconds
const frames = Array.from({ length: 36 }, (_, i) =>
  frameUrl('products/jacket-360', (i * (videoDuration / 36)).toFixed(2))
);
```
The video is stored once. The frames are expressed in URLs, generated on first request, and cached. Re-uploading retouched source material, say with a different crop or colour grade, regenerates the entire derived frame set, provided the cached derivatives are invalidated or the URLs are versioned. There’s no manual re-export step.
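One way to make that cache behaviour deterministic is to pin the version number Cloudinary assigns on upload into the delivery URL, so a re-uploaded source yields brand-new derivative URLs instead of relying on edge invalidation. A sketch, with an illustrative version number:

```javascript
// Versioned delivery URL: the v<number> segment comes from the upload
// response, so re-uploading the source produces a new version and therefore
// fresh, never-stale derivative URLs at the CDN edge.
const versionedFrameUrl = (publicId, version, offsetSeconds) =>
  `https://res.cloudinary.com/{cloud}/video/upload/so_${offsetSeconds}/v${version}/${publicId}.jpg`;
```

The trade-off is that every re-upload changes every frame URL, which is exactly what you want when the derived set must never mix old and new source material.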
Format and Device Targeting
The engagement benefit from interactive imagery requires that the assets themselves load fast enough to feel responsive. A 360° rotation that takes 300ms to respond to a drag gesture defeats the purpose — the interactivity creates engagement precisely because it feels like physical manipulation.
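Keeping the interaction under that responsiveness threshold is mostly a client-side concern: preload the frame sequence before enabling the gesture, and map drag distance to frame index with no network round trip. A minimal sketch, assuming a `frames` array of URLs like the one above; the function names are illustrative, not from any particular viewer library:

```javascript
// Map a horizontal drag distance to a frame index, wrapping around 360°.
// pxPerFrame controls sensitivity: how many pixels of drag advance one frame.
const frameIndexForDrag = (startIndex, dragPx, frameCount, pxPerFrame = 10) => {
  const delta = Math.round(dragPx / pxPerFrame);
  // Double modulo keeps the result positive for leftward (negative) drags.
  return ((startIndex + delta) % frameCount + frameCount) % frameCount;
};

// Preload all frames up front so the first rotation never waits on the network.
const preloadFrames = (urls) => Promise.all(
  urls.map((src) => new Promise((resolve, reject) => {
    const img = new Image();
    img.onload = () => resolve(img);
    img.onerror = reject;
    img.src = src;
  }))
);
```

Because the frames are already cached at the CDN edge after the first visitor, the preload cost is a burst of small, warm requests rather than 36 cold transformations.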
Cloudinary’s automatic format detection (f_auto) and quality selection (q_auto) become load-bearing here rather than just optimisation niceties. Each frame in a 360° sequence needs to be delivered at the smallest acceptable file size for the requesting device. On a browser that advertises AVIF support, that might be AVIF; on an older browser it might fall back to WebP or JPEG, with q_auto choosing a quality level that balances file size against visual fidelity in each case. The DAM handles that decision per request; a static CDN serving pre-encoded frames does not.
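Applied to the frame sequence above, that is two extra transformation parameters in the URL. A sketch (the public ID is illustrative; with f_auto, the `.jpg` extension becomes a fallback that Cloudinary overrides per request):

```javascript
// Frame URL with automatic format and quality selection.
// f_auto lets Cloudinary pick the delivered format from the Accept header;
// q_auto picks a quality level balancing size against visual fidelity.
const adaptiveFrameUrl = (publicId, offsetSeconds) =>
  `https://res.cloudinary.com/{cloud}/video/upload/f_auto,q_auto,so_${offsetSeconds}/${publicId}.jpg`;
```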
The same logic applies to video-based product showcases. Cloudinary can transcode a source video to multiple formats (H.264, H.265, VP9, AV1) and deliver based on what the client supports — reducing bandwidth and improving perceived performance without requiring the media team to produce format variants manually.
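For the video case, f_auto and q_auto work the same way on a video delivery URL. A sketch, with an illustrative public ID:

```javascript
// One stored source video, delivered in whatever codec the client supports.
// f_auto on a video URL lets Cloudinary choose the container and codec
// (e.g. H.264 for broad compatibility, VP9/AV1 where the browser accepts
// them); q_auto applies the adaptive quality logic to the video encode.
const showcaseUrl = (publicId) =>
  `https://res.cloudinary.com/{cloud}/video/upload/f_auto,q_auto/${publicId}`;
```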
The Source Material Problem
What a DAM cannot solve is upstream. The research used a tool called Shoogleit — a system that captures video of fabric being physically manipulated and merges that with touch input to simulate the gesture on screen. Producing that kind of source asset requires a structured capture process: controlled lighting, consistent framing, a rig to hold the garment, a video of the physical manipulation. A DAM takes that material and makes it deliverable at scale. It does not make it easier to produce.
This is the honest limitation of the infrastructure argument. The engagement benefit from interactive imagery is real and measurable. Realising it requires better source material than most product photography workflows currently produce. A 360° viewer needs a 360° shoot. A fabric scrunch simulation needs video of someone scrunching fabric. The asset pipeline gets you from that source material to a delivered experience — but someone has to invest in the source material first.
The question of whether that investment is justified is exactly the kind of question the Blázquez Cano et al. paper helps answer: the engagement difference between static images and interactive ones is statistically significant and practically meaningful. Novelty scores more than tripling, felt involvement scores rising, endurability — the willingness to recommend the experience — consistently higher in both interactive conditions. That’s the business case for the upstream investment.
Where This Points
The fashion retail context the paper studies is the obvious application, but the underlying finding generalises. Any domain where the user needs to evaluate a physical object remotely — furniture, automotive, real estate, industrial components — has the same sensory impoverishment problem. And the infrastructure to address it is the same: a media pipeline that can accept rich source material and deliver interactive derivatives efficiently at the CDN edge.
What’s changed since the paper was published in 2017 is the accessibility of both the capture and the delivery tooling. WebXR and 3D product formats (glTF, USDZ) have pushed 360° and AR product views further into mainstream e-commerce. Cloudinary’s 3D asset support reflects this: you can now store a 3D model as the source and derive 2D renders from it programmatically, collapsing the frame-capture problem into a rendering problem.
The research gave us the engagement case. The tooling has caught up with the ambition. The remaining constraint — as it usually is — is organisational: whether the investment in capture-quality source material is treated as a media production problem or a conversion optimisation problem. It’s both.