
Modern digital experiences meet humans where they are.
Talking to a voice assistant. Typing on a keyboard. Reacting to a vibration delivered by your smart watch. Operating a VR game character with a gesture controller. Wherever you want to be, there is now a device and an interaction modality to take you there.
Cheryl Platz is the expert guide who can help you navigate the practice of multimodal design, the new UX approach that connects these many interactions to create one coherent human-centered digital experience.
We talked about:
- what multimodal design entails
- the inherently inclusive and humanistic nature of multimodal design
- how a storytelling framework that professional improv actors use can be repurposed as a research framework for designing multimodal experiences
- how multimodality is a new layer on top of existing design practice
- how this new design lens shows the need for designing for transitions between modalities
- the importance of storytelling, especially in the early stages of the design process
- the possibility of single-sourcing content for multimodal experiences
- the unique challenges of designing a CMS for conversational agents, like the need for multiple responses in voice interactions
- metadata strategy for multimodal
- the inherently collaborative nature of multimodal design
- the importance of accounting for interruptions and notifications in a world where attention is a precious commodity
Cheryl’s bio
Cheryl Platz is a world-renowned designer, author, actress, and speaker whose work on emerging technologies has reached hundreds of millions of customers across multiple industries. Her professional passions include natural user interfaces, applied storytelling in design and research, and taming complexity in any manifestation. Cheryl’s first book, Design Beyond Devices: Creating Multimodal, Cross-Device Experiences, was published by Rosenfeld Media in December 2020.
Cheryl’s career spans a wide variety of high-profile projects at employers including Amazon (Alexa), Microsoft (Azure, Cortana), Electronic Arts (The Sims series), Griptonite Games (Disney Friends, Chronicles of Narnia), Disney Parks (PhotoPass), and presently the Bill & Melinda Gates Foundation. Cheryl also owns design education company Ideaplatz, LLC through which she shares her experience with conferences and companies worldwide. Her work and insights have been featured by outlets including the BBC, Huffington Post, Wired.com, Forbes.com, Adobe, and at venues on 5 continents. Learn more about her unique career – from design to acting – at cherylplatz.com – and follow her on Twitter at @funnygodmother.
Video
Here’s the video version of our conversation:
Podcast intro transcript
This is the Content Strategy Insights podcast, episode number 96. In the modern world – where you might be asking a smart speaker about online shoe stores in one moment, then looking up size information on your phone, and then sitting down at your laptop to place an order – you really appreciate it when the transitions between those experiences are smooth. Multimodal design is the new UX practice that makes this possible. And Cheryl Platz is the go-to expert on this new approach to designing human-centered, cross-device experiences.
Interview transcript
Larry:
Hey, everyone. Welcome to episode number 96 of the Content Strategy Insights podcast. I’m really happy today to have with us Cheryl Platz. Cheryl just published a book called Design Beyond Devices, and I’m really excited to talk to her about that, but she does other stuff as well. So welcome, Cheryl. Tell the folks a little bit more about what you’re up to these days and how you came to write the book.
Cheryl:
Thank you so much for having me here, Larry. I’m so glad to be here, and I’m really excited to talk to you about multimodal experiences and cross-device experiences, which is what the subtitle of the book is about: Creating Multimodal, Cross-Device Experiences. And when I talk to people about the book, people are like, “Okay, I get cross-device experiences. That’s fairly self-explanatory. But multimodal, what is that?” And the pitch I give folks who are outside the industry is, “Well, I hope my book is the design manual for people who want to design experiences like the bridge of the starship Enterprise. The people on the bridge of the Enterprise in Star Trek can move seamlessly from using a physical control to using a touchscreen to talking to the computer. And visually they’re moving from holograms to visual displays to not seeing anything and just using voice control, and it all seems so effortless, but there’s so much work that has to go on behind the scenes to make that effortlessness work, and that whole experience is a multimodal experience.”
Cheryl:
It’s a situation where we’re using multiple inputs and multiple outputs when interacting with the system, and those inputs and outputs are modes of interaction. And so that’s the core of what the book’s about. In my career, I’ve had a lot of really interesting opportunities to work on multimodal interfaces, whether it was working on a launch title for the Nintendo DS back in 2004, when we had touchscreens in gaming for the first time and voice in gaming for the first time, or working on Alexa at Amazon and Cortana at Microsoft, or working on Windows Automotive. So I got to learn a lot about how we sort of weave these different modes of human expression together and move beyond just the keyboard and the mouse. Now these days, I’m working at the Gates Foundation, where I’m actually taking things in a little bit of a different direction. I’m working on things like knowledge management, and I am also working a little bit on trying to get conversational agents and maybe a little bit of multimodal interaction into some of our work there too.
Larry:
This comes up every episode. Now I want to do a whole other interview about your knowledge work at Gates, but anyhow, let’s focus on multimodal design today. A lot of what you’ve been talking about is that we’ve all proclaimed ourselves to be human-centered designers for a long time, but I fear that we’ve often biased things a little bit towards the tech and haven’t always accounted for the humans. This sounds like an evolution of that, getting us closer to genuine human-centeredness. Is that an accurate way to think about what you’re doing?
Cheryl:
I would like to think so, yes. And it’s hard, because inevitably when you get really into multimodal design, you have to talk about the technology and you’re like, “Oh, haptic design and power gloves and force feedback.” And it’s really hard not to drown people in the tech, but the goal is to be human centered and to allow people to choose the method of expression that works best for them. And in the holy grail of multimodal experiences, where people can freely move between input or output modalities, we’re being as inclusive as possible. Because when we only allow one or two input modalities, we’re inevitably leaving people behind, either because of situational disabilities or restrictions or permanent disabilities or restrictions, and the more modes of interaction we open up, the more people we invite to the table, and that is inherently humanistic.
Larry:
Yeah. And that’s something that’s, again, it’s not like we’ve been paying lip service to it, but all of a sudden you’re showing us that there’s way more that we could be doing to accommodate not just different demographic things, but “Oh, you’re holding a bag of groceries or a baby, or you’re in a loud place or all of these kinds of things.” Because I think human beings are becoming more… have higher expectations of technology these days. And this is how we can start to address those. Is that…
Cheryl:
So true. And it happens all the time. Every generation comes up, they grow up on a new era of technology, and their expectations increase. But imagine growing up with smart speakers and knowing that you can command any device, and wondering why desktops are so limited, why just keyboards and mice? Suddenly the boundaries seem sort of arbitrary, don’t they?
Cheryl:
And our phones, even years ago, when I started out on the circuit talking about voice interaction, I think the number was like 50% of people on mobile were doing voice search. And yet most apps on mobile aren’t really using voice as a form of interaction in any tangible way. And that also seems sort of like a missed opportunity given how capable our phones are of supporting that kind of thing. Not all apps necessarily need to, but that gets us to some of the topics I think we’ll talk about a little bit later, about figuring out when that would be appropriate or desirable for customers.
Larry:
Right. Well, and again, a whole other episode about user research is coming to mind here. But can you give us a quick overview of how do you ascertain what you’ll need to do to address these multimodal experiences?
Cheryl:
Well, in chapter two of my book, I talk about a framework I drew from my experience as a professional improviser and improv teacher. When you’re performing on a stage and people pay to see you, you have to come up with a way to make compelling scenes quickly, to tell compelling stories quickly. And at my theater Unexpected Productions in Seattle, we use a framework called CROW. It reminds us of the four elements we need to create a scene that’s compelling enough that the human brain can fill in the blanks. And CROW stands for character, relationship, objective and where. And in the chapter, I talk about turning those four elements into essentially a research framework you can use to expand your own outreach, to get the elements of a customer’s experience that will tell you what you need to know to make a smart decision about what multimodality model you need or want in a particular situation or context, or what devices are appropriate in a situation or context. So much is about context.
Cheryl:
And so: what elements of their character or identity are informing their choices? What relationships around them, with people or businesses or devices, are going to influence their choices in the moment? One thing we all kind of took for granted, in a world where offices were offices and living rooms were living rooms and we took things as sort of sketches and stereotypes, is that the where matters. And as we’ve all seen in a world where we get to look into people’s homes, there’s a wide variety of wheres, even just within the archetype of a house or an apartment. And so understanding those things better gives you the context you need to say, like, “You know what, I think my customers don’t have room for a screen. We really need a smart speaker here.”
Cheryl:
Or “Actually, this is going to happen in the living room, so they have a television. Why wouldn’t we take full advantage of that? But the smart speaker’s nearby, so we’ll use that to augment the experience.” Or “They’re super mobile, so we can’t tie ourselves to a single experience. We need something that moves rapidly between devices and can be interacted with anywhere.” But only by kind of drilling down into those elements can you understand enough to make a smart decision one way or the other.
Larry:
Can you walk through a quick example of… Some of the details you just mentioned kind of fit this, but could you put that in a CROW context? Like those things you just mentioned, just kind of plug that into say you want to deliver a weather report and how would you… I don’t know if that’s a good example, but does that make sense, to just kind of quickly show how CROW works in that?
Cheryl:
Sure. Weather report is a classic problem, and there’s a lot of different ways you can deliver it. We have all kinds of devices in our home. And so to figure out which… If you’re trying to put together a new weather app and figure out which devices you want to target, you want to talk to your customer and figure out where they spend the most time in their home. You want to figure out what drives them, what their… I’d probably focus a lot on objective, the O in CROW. What are their daily tasks? What do they need to accomplish every day? And maybe a little bit also on what their weekly or monthly tasks are. What drives them? What do they want to accomplish in their career or for their family that might motivate them in their purchases or their choice of devices? That sort of thing.
Cheryl:
For relationship, if we’re talking about the home context, I’d investigate who else is around them. Is this a collaborative situation? Is the weather report for more than one person? Is it a noisy environment, that kind of thing? And speaking of environment, the W in CROW, the where, is super important. Where does the question come up for them? Is it when they’re getting dressed? Is it coming up when they’re just about to leave the house? Is it in the car? Or is it coming up wherever, like they literally just expect to be able to blurt it out in any room and get the same sort of response? That’s going to influence our decision. If it’s really just when they’re getting dressed, and they happen to have a smart speaker in their room, then maybe just putting together a smart speaker app is the way to go.
Cheryl:
If they expect to be able to blurt it out anywhere, then maybe we’re talking about something that’s on a phone, or maybe it’s a real cross-device play, where it is a smart speaker, but it’s also a car and it supports casting to a TV or something. If it’s a family weather thing, like “I’m asking for the weather because my whole family is curious about it,” maybe it’s a TV situation where you want to do something really rich and you’re showing not just what the weather is, but why the weather is. So by understanding that context, you can make some really interesting decisions about which devices you want to take advantage of. By understanding the why behind what people are asking for, you can differentiate yourself, go in a different direction from everybody else in the market. Weather’s a crowded space. But if you figure out that there’s an interesting “why,” like we’re trying to learn about weather together, and you put together a weather app on a TV that you can all learn with, that’s something I haven’t seen. But you would have to get to that through questioning with something like CROW.
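To make CROW a little more concrete for research and content teams, here is a minimal sketch of how the four elements might be captured as structured notes that later inform device and modality decisions. It is only an illustration: the class, field names, and example values are hypothetical, not drawn from Cheryl’s book or this conversation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CrowContext:
    """One observed usage context, framed with CROW (hypothetical field names)."""
    character: str                 # who the person is: identity, traits, constraints
    relationship: str              # people, businesses, or devices shaping the moment
    objective: str                 # what they are trying to accomplish
    where: str                     # the physical or situational environment
    candidate_devices: List[str] = field(default_factory=list)

# Example loosely based on the weather discussion above; the values are invented.
getting_dressed = CrowContext(
    character="Working parent who runs late most mornings",
    relationship="The whole family asks about the weather before school",
    objective="Decide what everyone should wear today",
    where="Bedroom while getting dressed; smart speaker on the dresser",
    candidate_devices=["smart speaker", "phone", "living-room TV"],
)
```

Comparing several of these records side by side is one way to spot, as Cheryl suggests, whether a smart speaker skill, a phone app, or a richer TV experience fits the contexts you actually observed.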
Larry:
So tons of opportunities here, just as you talk about… That’s a real small example too. And that was a whole bunch of really interesting insights and ways to apply that. And I’m wondering, so my audience is mostly content people and we’re like, oh, great. So here’s the quick evolution of the definition of content strategy. Maybe 10 years ago, it was like getting the right content to the right person at the right time. Great. And then getting the right content to the right person at the right time, on the right device. And then on the right device with accessibility added. Now it’s like getting the right content to the right person at the right time, and now with the right modality. You’ve added this whole other thing onto the end of that. So put on your empathy hat as a designer and help us poor content people figure out how we’re going to author and manage and get that content into the hands of the… so it’s ultimately ready for that weather app.
Cheryl:
Well, one thing I would say is that this whole multimodality thing is a layer on top of the existing work we’re already doing. So I’m not saying we have to throw away what we’ve already learned. In the worlds I’ve described, you still need to do this smart speaker design, or you still need to do the content design for the TV app. Where things get interesting is the transitions between modalities, the transitions between devices. That’s what’s new. And the other part that particularly pertains to people on content side, that may actually bring you joy, because I think we’re all storytellers, is the importance of storytelling, particularly in the early parts of the process.
Cheryl:
I have run into this multiple times in my career, working on multimodal experiences. Like, you do the research, you find this really compelling customer case. You’re like, “Oh.” When I was working on the Echo Look, we were like, wow. We had this hypothesis that people maybe wanted some help with wardrobe management, and we actually found that to be true, but we also found that some people don’t even have mirrors in their house and literally wanted this device, because they live in tiny New York apartments and couldn’t get a full-length mirror in, or couldn’t put it on the wall because they don’t own the place.
Cheryl:
But our stakeholders had this preexisting notion of how our customers live, and us telling them that customers don’t have mirrors doesn’t work, because our stakeholders live in very nice houses in Seattle that have plenty of room for large closets and mirrors. And we need to tell that story. We need to bring the story from those customers, aggregate it, and bring it forward in storyboards and compelling content to make the case and keep the product alive, and that’s make or break. I saw projects at the same time as ours that didn’t make the case, that died on the vine around us. And there’s a lot of art there.
Larry:
Well, I’ve got to say, that’s one of the best descriptions of the rationale for the emerging role of the content designer. And content people are always like, “God, you should’ve involved me a little earlier.” And it’s like, that’s a perfect example of, “And if you’d told a better story about this, even the user story that launches the whole thing, you would have been better off.” It’s like, I don’t want to say “I told you so,” but come on, let’s get the content people in there quicker. But I think the content people also need to be involved in a deeper way, like the holy grail in modern omni-channel distribution of content would be some kind of a single source of truth, so that it’s easier to align your voice and tone and style issues and all that… the messaging architecture. It’s so much…
Larry:
But I think a lot of that happens more, I think, pragmatically. We’re pretty far away from that, so it often happens more piecemeal, in silos, or in different parts of the process. Tell me your thoughts on that. Is it possible to single-source the content that answers voice queries, that shows a visual display, and that powers a chatbot from one place?
Cheryl:
Well, it is, because Amazon’s doing it. I’m pretty sure they’re powering the Alexa chat responses from there, because why would they spin up something new when they already had something in 2015 that was a single source of truth? The challenge is that even with that central content engine, the style and tone situation is a cultural problem, right? As you all know, you can write style guides till the cows come home, but if someone’s not proliferating that information through the organization, it doesn’t get anywhere. And the larger the organization gets, and that is a very large organization, you know there are going to be people off script. So that’s a bit tougher. And the tool when I used it, ages ago now, basically in a previous era, didn’t really have integrated notes on tone. It was a very functional tool you could go into to directly place content for Alexa and have it… and there was a workflow where you could get it to beta and then production and have it pushed to all Alexa devices.
Cheryl:
But it is possible. I’ve seen it. But to the point of involving everybody early like that, you can’t just come in late and do that. I’ve driven that point home on a lot of subjects in my book, over and over again: talk to your devs early, get in the mix, and force ourselves into the conversations, because so many of these decisions seem easy early on and are completely irreversible later if we haven’t taken them on. I’m sure many of your listeners… So I’m just preaching to the choir, because we know that we can’t just come in at the 11th hour and shove a CMS in there.
Larry:
I should have put a trigger warning before that, because that’s such a common thing in the content world that at the very end of a project, it’s like, “Oh yeah, and can you put the words here now?” It’s like, ah! Anyhow, so yes, it’s a known issue in our world. But-
Cheryl:
So conversational agents have been a topic you’ve talked about several times. When we talk about a CMS for conversational agents, there are a couple of things. You’re probably going to have multiple variants for a single written prompt, because we don’t want the uncanny valley where a commonly repeated prompt is always exactly the same, because the human brain will call that right out. Humans are not good at saying the same thing exactly the same way over and over again. Like, if you remember, Alexa used to say “okay” every time you did a home automation command, and it was always the same, and that really, really super-bothered people on the team. And there’s a whole other side story about them trying to push what happens now, where she makes a little boop, boop.
Cheryl:
But then half of the user base was emotionally attached to the “okay,” so the change had to get rolled back until they could put in a choice option. That’s another subject, but uncanny valley. So you’ve got to have multiple options, like different versions of the prompt, and then phatic support, usually things like “okay” and “sure” or “I got it.” Those you don’t usually build into the prompts; you usually have a way of flagging different strings or content chunks as compatible with different sorts of prefixes or postfixes, and it’s a different system adding them in, in real time, at the end. One thing we had been working on when I was there was also how to put in follow-on content: if there was a notification waiting or there was a tutorial moment or something, how we could chain content and things like that. But that all required essentially a functioning content management system and then kind of a broker to handle all of that dynamic construction.
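To make the prompt variants, phatic prefixes, and that broker role a bit more tangible, here is a minimal sketch of how such dynamic construction could work. This is purely illustrative and assumes nothing about Amazon’s actual tooling; every name, prompt ID, and string in it is hypothetical.

```python
import random
from typing import Optional

# Hypothetical content store: several variants per prompt ID so a commonly
# repeated response doesn't always sound exactly the same, plus a flag saying
# whether a phatic prefix ("Okay.", "Sure.") may be prepended to this chunk.
PROMPTS = {
    "lights.confirm": {
        "variants": [
            "The living room lights are on.",
            "Lights are on in the living room.",
            "I've turned on the living room lights.",
        ],
        "allow_phatic_prefix": True,
    },
}

PHATIC_PREFIXES = ["Okay.", "Sure.", "Got it."]

def build_response(prompt_id: str, follow_on: Optional[str] = None) -> str:
    """Assemble one spoken response at runtime (the 'broker' role described above)."""
    entry = PROMPTS[prompt_id]
    parts = []
    if entry["allow_phatic_prefix"]:
        parts.append(random.choice(PHATIC_PREFIXES))
    parts.append(random.choice(entry["variants"]))
    if follow_on:
        # Chained follow-on content, e.g. a waiting notification or a tutorial tip.
        parts.append(follow_on)
    return " ".join(parts)

print(build_response("lights.confirm", follow_on="By the way, you have one reminder for tonight."))
```

The point of the sketch is the separation of concerns: writers manage variants and flags in the content store, while a separate runtime system picks a variant, adds the phatic prefix, and chains any follow-on content.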
Larry:
Right. As you talk about this, I’m thinking it seems like there’s two things that need to be reconciled here. There’s that challenge of designing for that specific voice interaction thing, where you have those considerations that you can’t repeat the same thing. For any one message you need multiple utterances of that message. And there’s probably analogous things in every channel that we’re delivering to, but the end user experience has to be seamless. And if they go from talking to their phone and they walk into the store and they look it up on their phone in a browser or something, and then they go to the kiosk or something in the store, it seems like that’s where that single sourcing can tie that experience together. Is that happening or is that… It sounds like Amazon is doing it, but do you know of others that are doing that well?
Cheryl:
I don’t have as much of an inside look at that. I assume some folks are. I have to hope that some folks are, but I can’t say for sure.
Larry:
Yeah. No, there’s probably some super-progressive, crazy design-tech genius out there whose work we’ll look at in a year and go, “Oh, holy cow. That’s brilliant.” But we don’t know it yet. So, yeah. But hey, one of the things kind of related to single sourcing is this notion of structuring content, structuring it both in terms of usable chunks (like, what’s the answer to that question, how does that part of it work), but also increasingly there’s attention to ascribing meaning to it, whether it’s just simple metadata, like descriptive stuff or taxonomy stuff. Does that need to manifest differently, or do we need to be doing different things in our content information architecture to accommodate these multimodal needs? Or do you have any thoughts about that?
Cheryl:
Well, I think there’s probably some baseline tagging that’ll have to happen, to indicate which device capabilities a particular content chunk is intended for. One common differentiation is a prompt being intended for a device with visuals versus without visuals. If you don’t have visuals, you’re probably going to be more verbose. Back to the weather example: the Echo Show automatically shows you multiple days of weather. Well, the smart speaker can’t really show you anything. And so often, more often than I would like, it will offer to give you the weekend’s weather, so that’s the dynamic construction I mentioned, kind of doing a follow-on afterwards.
Cheryl:
But the visual device doesn’t need to do that, because you can already see the weekend’s weather right there. So being able to flag the intended use of a particular prompt or something may make sense. As far as the content, when building conversational agents, you already get a little bit of that when you’re doing flagging for slots and slot values; there is some semantic stuff you can already derive from the way those prompts are played out, or at least from the utterances, like that pairing. So it’s interesting to think about. If you’re using multiple modalities at once, there may be an opportunity to say, “In my content at this time, I want an LED to flash,” or I want to trigger something else. I want an additional modality to emphasize what’s happening in the text, or I want to synchronize. I think that’s the kind of metadata that might manifest itself over time, but that’s the first thing that comes to mind, off the top of my head.
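Here is one way that kind of tagging could look in practice: a small, hypothetical metadata model for a content chunk that carries a verbose spoken form, a terser on-screen form, and cues for synchronizing another modality. None of the field names come from the conversation or the book; they are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ContentChunk:
    """A reusable piece of content tagged for multimodal delivery (illustrative only)."""
    chunk_id: str
    spoken_text: str                          # more verbose form for voice-only devices
    screen_text: Optional[str] = None         # terser form when a display is present
    requires_visuals: bool = False            # e.g. a multi-day forecast card
    sync_cues: List[str] = field(default_factory=list)  # e.g. "led:pulse@end"

weekend_weather = ContentChunk(
    chunk_id="weather.weekend",
    spoken_text=("This weekend looks sunny, with highs near seventy. "
                 "Would you like the hourly forecast?"),
    screen_text="Weekend: sunny, high near 70",
    sync_cues=["led:pulse@end"],
)

def render(chunk: ContentChunk, has_screen: bool) -> str:
    """Pick the form of the chunk that suits the device in front of the customer."""
    if has_screen and chunk.screen_text:
        return chunk.screen_text
    return chunk.spoken_text
```

The weather example maps onto this directly: the voice-only smart speaker reads the longer spoken form and offers a follow-on, while a device with a screen just renders the short visual form because the display already carries the detail.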
Larry:
Now you’ve got me thinking that we’ve all got a lot of learning to do, because a lot of what you’re talking about means, I think, that content folks may need to learn some more interaction design, really to be cognizant of those concerns you just mentioned. Because, like you mentioned, a flashing light, that’s a meaningful thing. It’s not really what we would conventionally call content, but at some points that might be a flashing thing and at other points it might be an audio prompt that says “mission accomplished” or something. And in a VR suit, it might be some haptic feedback where it vibrates your chest or something. I don’t know, it’s really interesting: what are the boundaries of content, I guess, and does the user care? Because this is equally about visual and auditory and haptic feedback for the user. It’s funny. Like I said, I’m optimistic because we’re getting closer to the actual human experience, but man, there are a lot of details.
Cheryl:
Yeah. There’s definitely a chicken-and-egg problem, right? You could say that music drives the experience and the text is just going to play what it’s going to play, or show what it’s going to show, and timing doesn’t matter. Or you could say the spoken text is our baseline timer and all the other effects and all of the other modalities drive off of that, and so we need metadata tied in there. But I think one of the main things that anyone who’s new to this multimodal, cross-device thing needs to take to heart is that it’s not a one-person challenge. I mean, you can do Alexa skills that are voice and they push some cards to the Echo Show. But inevitably, if you’re doing a large, or even a moderate-scale, production-quality app or experience, you’re probably going to be working with more than one individual.
Cheryl:
So, occasionally a UX designer will also be working as a content person, but often you might have a content person and a UX person, and it’ll pay for both of you to be aware of the other person’s abilities and the capabilities that each side has. Like, okay, I know my content partner is going to have the ability to do synchronized events, so I want to communicate to them what we want the customer to see, and they can figure out the exact timing and put that in, figure out where in the text that works best, or vice versa, that sort of thing. Or if you don’t have that kind of timing system, it’s like, “Hey, I wrote this text, but I think it would really be made better if we could find an affordance to emphasize this in one of our other modalities.”
Larry:
I could talk forever, but we’re coming up close to time. But I really appreciate you. I feel like I’ve got a better handle on the implications for content practices of this, so thanks so much for that. But hey, before we wrap, is there anything last, anything that’s just on your mind about multimodal design or about content, or just design in general that you want to share with the folks?
Cheryl:
Yeah. I think we’ve been talking right at the bleeding edge of this multimodal stuff. And I think a lot of folks aren’t going to be at a point where they can talk about synchronicity with multiple outputs for a while. But some of the content in my book, like CROW, like getting the extra context about customers, is directly applicable now. And another thing that I go into a lot of detail about is interruptions and notifications. This really gets into the cross-device part of the book, and I think we’re all designing in that world right now. We have phones and we have websites and, in some cases, smart speakers and other devices, and if your customers are moving back and forth between a website and an app, they’re inherently cross-device already.
Cheryl:
And so give some thought to how content’s moving back and forth between those things. Are you notifying someone if an event has occurred outside the scope of where they last interacted? If you are interrupting people, do you understand enough about what your customer is doing to interrupt them in a polite and respectful way? Those are topics I talk about. And I think that really impacts content folks, because don’t you wish your system could give you a rough guess at the mindset your customer is going to be in when you bring the message to them? Like, they’re super busy and they’re going to be cranky, versus, yeah, they’re really receptive right now, you can go nuts. And chapter three is really about how you might take that understanding of your customer’s world, which hopefully you got from your expanded inquiry, with CROW or whatever research you have, and build that into your system in some way, so that we can in the moment feel like… Right now the customer’s involved in a sustained task. Like they’re listening to music.
Cheryl:
Yeah, they’re engaged, but they’re probably receptive. And you know what, if we have to pause the music, they can pick it right back up. They haven’t lost any context. It’s okay. Versus they’re in the flow right now. They’re actually typing in Word. And if we interrupt them, they’re probably going to hate us. And so being able to use that information to judge not only whether we should interrupt them, but how we should interrupt them, and then the tone of any content that’s presented during that interruption, I think is huge. And attention is such a precious resource right now. And so I know it can be really intimidating, and kind of almost othering, to talk about technologies you don’t have access to right now, or about activating multiple modalities. But there’s just as much in the book about dealing with that human side: getting our systems to be more respectful of human behavior, and the implications of the AI systems that we already have inside our devices right now.
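As a thought experiment for content teams, here is a minimal sketch of the kind of interruption policy Cheryl is describing: a guess at the customer’s current state drives whether to interrupt, how to deliver it, and what tone the content should take. The states, channels, and tone labels are all hypothetical assumptions, not taken from the book.

```python
from enum import Enum

class CustomerState(Enum):
    RECEPTIVE = "receptive"          # e.g. idly listening to music
    SUSTAINED_TASK = "sustained"     # engaged, but can resume without losing context
    DEEP_FOCUS = "deep_focus"        # e.g. actively typing in Word; interruption is costly

# Hypothetical policy table: whether to interrupt now, how to deliver the
# interruption, and what tone the interrupting content should take.
INTERRUPTION_POLICY = {
    CustomerState.RECEPTIVE: {"interrupt_now": True, "channel": "voice", "tone": "conversational"},
    CustomerState.SUSTAINED_TASK: {"interrupt_now": True, "channel": "chime_plus_visual", "tone": "brief"},
    CustomerState.DEEP_FOCUS: {"interrupt_now": False, "channel": "defer_to_later", "tone": "minimal"},
}

def plan_notification(state: CustomerState, urgent: bool) -> dict:
    """Decide whether and how to interrupt, given a guess at the customer's state."""
    policy = dict(INTERRUPTION_POLICY[state])
    if urgent:
        # Urgent messages override deferral, but the tone still adapts to the moment.
        policy["interrupt_now"] = True
    return policy

print(plan_notification(CustomerState.DEEP_FOCUS, urgent=False))
```

Even if the state guess comes from something as simple as “music is playing” versus “a document is in the foreground,” writing the policy down this way forces the team, content folks included, to decide what a respectful interruption looks like in each case.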
Larry:
Yeah. To what you were just saying, that’s one of the big take homes I took from reading the book is that, yeah, this is crazy fun, cool, futuristic stuff, but there’s a lot you can do right now with these insights. And yeah, I really appreciate that. Well, thanks so much, Cheryl. This is a great conversation. I really enjoyed your contributions.
Cheryl:
Thank you for the great questions. It’s super fun to talk to you.