Meeting Listeners Where They Are: Designing for Spotify’s Audio-Forward User Experience
Spotify’s experience isn't defined or constrained by the bounds of a smartphone screen, and neither are our design and user research practices.
People can listen to Spotify almost anywhere, anytime, on any device, and for many of us, it’s an all-day companion. As designers and researchers at Spotify, our challenge is to embrace an “audio-forward” approach in order to craft a user experience that spans contexts, situations, and environments.
We’re going to break down what that means for us by introducing the concept of “audio-forward” UX, discussing a framework we’ve created to help us grapple with the unique realities of how people use our service, and providing a look at a couple of projects designed to bridge contexts and provide value, utility, and delight to listeners, whether they’re looking at the app or not.
Audio-forward user experience
Whenever we take up new challenges at Spotify, we have to consider this context-spanning reality. It simply isn’t enough to open Sketch and attack new problems from a visual-only perspective. Product opportunities are often layered and multifaceted, and require new approaches. As designers at Spotify, our primary concerns are ensuring smooth transitions between interacting with the app and consuming content, as well as providing value across contexts.
One way we do this is by using a framework that we call Modes, which encapsulates the four major states that our listeners cycle through as they use our service: Interact, Leave, Consume, and Return. When paired with our personas, Modes are a powerful tool that ensures the user experiences we create consider the bigger picture, and that we’re able to meet people wherever they are.
Modes is the culmination of several years’ worth of in-depth user research into mental models at the intersection of audio and user experience.
Before we jump into the research behind the framework, let’s explore what makes researching audio experiences unique.
User research for audio experiences
User research in an audio-forward world poses an interesting challenge. Unlike other types of user research, we can’t just bring participants into a lab, hand them a device, and prompt them to go through a flow. Audio experiences don’t work that way, because audio is a core attribute of our environment. Audio can be the focus of an immersive activity, like a concert or a dance party, or a background element, like when you’re studying or cooking. As researchers, we need to venture into the listener’s world in a non-intrusive way to understand how audio impacts them and, in turn, how they interact with audio.
To build out the Modes framework, we needed to take this naturalistic in-context research to the next level.
Uncovering the Modes framework through exploratory user research
We were most curious about how listeners interact with Spotify while they’re listening across contexts, like studying, cooking, exercising, or commuting. What draws people back to the app? What prevents listeners from interacting? While “let’s seek to understand how people interact with technology across different environments” may sound like a typical research objective, we were focused on observing the smallest details, the minutiae, of each interaction.
This research required a highly naturalistic setup. As the Hawthorne effect demonstrates, hovering over people’s shoulders to observe what they are doing would likely influence their behavior in unnatural ways, compromising the research. We decided that eye-tracking would be the least intrusive, most purely observational method.
The eye-tracking glasses used in this research look a lot like normal reading eyeglasses, but they’re equipped with several small cameras: one on the bridge of the nose that captures the view in front of the participant, and one in each lens that captures what the participant’s eyes are focusing on. Despite being laden with technology, the glasses feel like an ordinary pair of eyeglasses, giving wearers freedom of movement and the ability to go about daily activities with ease.
We set our participants up with these glasses and let them go about their daily lives, giving us a window into what it’s like to listen to Spotify in a variety of scenarios quite literally from their perspective. We observed a variety of environments—from commuting on a crowded New York City subway, to studying for a calculus exam in the library, to baking cupcakes at home. We collected a treasure trove of attention data about what they were focusing on and how Spotify fit in.
From this observational eye-tracking data, we gained an understanding of both the transient and the intentional interactions people take within the app. We witnessed the long periods of time when listeners were actively listening to music while immersed in a variety of activities, like listening to hip hop while running on the treadmill. We got a sense of the effort required to pause those activities to return to the app and make an adjustment or take some sort of action, like washing off flour and cake batter before grabbing the phone to skip a song.
Mixing methods for richer insights
Building off of the observational data, we conducted follow-up interviews with the participants to understand their attitudes towards the experience. In these subsequent interviews, we played back this eye-tracking footage and asked the participants to explain their motivations for and barriers to interacting. We gained rich insights about their sentiment towards the experience, as well as how situations and surrounding environments shaped their interactions with Spotify.
In partnership with Data Science, we sized the behaviors we observed in this research. We quantified how long people spent directly interacting with the app versus listening while engaging in other activities, as well as how often they moved between these modes in a given listening session.
We synthesized these mixed methods insights into the Modes framework, which details the four major states listeners cycle through as they use Spotify: Interact, Leave, Consume, Return. Interact includes those moments of interacting directly with the visual UI. Listeners are in the Leave mode when they put the device down and focus on another activity. They enter Consume, actively listening to content on Spotify while immersed in that activity. Finally, Return involves triggers that bring them back into the app.
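As a rough illustration, the cycle described above can be encoded as a simple state machine. This is our own sketch of the framework for explanatory purposes; the enum names and transition order are assumptions drawn from the description, not production code.

```python
from enum import Enum

class Mode(Enum):
    """The four listener modes from the framework described above."""
    INTERACT = "interact"  # engaging directly with the visual UI
    LEAVE = "leave"        # putting the device down to focus on another activity
    CONSUME = "consume"    # actively listening while immersed in that activity
    RETURN = "return"      # a trigger pulls the listener back into the app

# The typical order a listener cycles through during a session.
CYCLE = [Mode.INTERACT, Mode.LEAVE, Mode.CONSUME, Mode.RETURN]

def next_mode(current: Mode) -> Mode:
    """Return the mode that typically follows `current` in the cycle."""
    return CYCLE[(CYCLE.index(current) + 1) % len(CYCLE)]
```

For example, `next_mode(Mode.RETURN)` yields `Mode.INTERACT`: a Return trigger leads the listener back into direct interaction with the app, and the cycle begins again.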
Designing audio-forward, in practice
Let’s explore two recent feature launches that take these different modes into account to meet listeners where they are.
Canvas
Canvas (a feature currently in beta) is an 8-second looping visual that select artists can add to any of their tracks. It appears in the Now Playing View, in place of the album artwork. This experience helps artists connect with their listeners by enabling creative expression beyond audio.
“If you go back in time and think about how people engaged with music at a record store, you either knew what you wanted, or you were grabbed by a visual. ‘Oh, what’s this?’” said Barton Smith, Associate Principal Designer on Spotify’s Creator team. “It gives consumers a richer sense of the music they’re listening to in ways that they weren’t able to feel before. I hear the lyrics, I feel the melody, but what world does this music exist in? It’s a chance for artists to paint their universe or communicate a message they're trying to send.”
Canvas champions how people already use Spotify – capturing listeners in the Interact mode. “We wanted to design within those existing behaviors; the moments where listeners pull their phone out for just a few seconds. We find that people are sticking around a little bit longer to take in those visuals and connect with the artist.”
Giving creators the ability to bridge audio and visual spaces in order to captivate listeners has positive downstream effects. “Canvas is building more connections. Artists are slotting into what is typically a more passive, automatic task like ‘heart-ing’ or skipping songs. Now people are pausing, thinking ‘wow, that looks really cool, that resonates with me, maybe I should check out their other music. They sound good, and I connect to their visuals,’” leading to more engagement with artists’ presence both on and off Spotify.
Voice-Enabled Ads & Promotions
While Barton and the Creator team bridge audio to visual with Canvas, the Ad Experience team is focused on unlocking engagement with audio ads and promotions, audio-forward formats you could previously only interact with through the visual UI.
Voice-Enabled Ads, which we’re testing on the Spotify free tier today, are audio ads that enable listeners to vocalize a command in order to take action. This experience helps brands and creators connect with listeners in moments when their context, situation, or environment might present barriers to engagement.
With audio ads today, listeners have around 30 seconds to pause their activity, find the phone, unlock it, open Spotify, and take some sort of action – like clicking through. With voice, we wanted to make it quicker and easier for listeners to engage with relevant ads and promoted content on Spotify, even when they are in the Consume mode.
“If listeners are looking at the app on their phone and want to interact with something, usually the results are immediate. We wanted to extend that immediacy to voice and deliver instant value through audio,” said Ashley Hopkins, Senior Product Designer on the Ad Experience team.
In this first iteration of Voice-Enabled Ads, Ashley and the team were very intentional with the capabilities of the format in order to provide that immediate value while phones are out of sight. To do that, they focused on promoting content, including both music and podcasts, that we already have on our service. “To keep scope in check, we focused on a Play Now intent. It makes a ton of sense for Spotify as an audio-first platform. Whether you’re listening to music or podcasts, playback intents span many of our use cases.”
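The scope decision in the quote above, supporting a single Play Now intent, means voice recognition only has to map an utterance to one action. The sketch below is purely hypothetical: the phrase list, function name, and return values are our assumptions for illustration, not Spotify’s implementation.

```python
# Purely illustrative sketch of routing a recognized utterance to a single
# "Play Now" intent; all names and phrases here are assumptions, not Spotify's code.
PLAY_PHRASES = ("play now", "play it", "play the song")

def route_utterance(transcript: str) -> str:
    """Map a recognized voice transcript to an intent label."""
    normalized = transcript.lower().strip()
    if any(phrase in normalized for phrase in PLAY_PHRASES):
        return "PLAY_NOW"  # start playback of the promoted content
    return "NO_INTENT"     # fall back to the regular ad experience
```

Supporting one intent keeps both the recognition problem and the design surface small, which fits the “immediate value while phones are out of sight” goal described above.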
Ultimately, Ashley and her team consider voice to be an optimization of a format that still needs to span multiple modes. “We have a principle of ‘voice-forward, not voice-only.’ You should be able to complete the same task equally through visual or voice UIs. At the same time, while the visual UI complements and emphasizes voice interactions, you can still tap to interact instead of using your voice. We aren’t forcing you to interact in a specific way.”
Ashley and Barton have intentionally approached their design process by considering the bigger, broader picture. How might we deliver delight and foster interactions by bridging modes? Both of these features are adding rich, multimodal capabilities to Spotify.
We’re unlocking new avenues of creativity as we continue to equip ourselves with a greater foundational understanding of how Spotify and audio fit into people’s lives. We’ve begun to explore new interaction models and design new experiences that bridge Interact and Consume. We’re looking at ways of delivering instant, in-the-moment, hands-and-eyes-free value with tactics like voice interaction. We’re exploring ways to give artists powerful new tools to express their music visually, helping to foster valuable new connections with their listeners. We’re on a mission to create an incredible Spotify experience that meets listeners where they are, providing delight or enabling natural interactions no matter what they’re doing.
There’s so much more incredible work happening behind the scenes, and we’re really excited about what the future holds.