A few years back I purchased an Amazon Echo for my mom, a septuagenarian who had previously expressed little inclination to use the voice assistant offered by her phone. Fortunately she was beginning to show interest in trying a standalone device to serve her daily inquiries about the weather, stocks, and news, so I set her up with two Echo devices, both of which almost immediately became favorite everyday tools. Their success can be attributed to Alexa’s unwavering ability to comprehend her ceaseless queries from the comfort of her sofa, along with the integrated ring of light offering a friendly visual cue that her command or question was heard.
Amazon’s more recent effort, the Echo Show 10, builds upon those earlier features with a more extensive interactive toolkit, raising the bar of device-user interactivity toward something much more personal. A 10.1″ HD screen and a motorized base that swivels through roughly 350 degrees, engineered to mimic the movement cues most humans are naturally attuned to recognize, supplement the Echo Show 10’s cloud-powered multimodal comprehension with a touch of humanity.
The latest version of Alexa Presentation Language (APL), the visual design framework developers use to build interactive voice and visual experiences for Alexa, adds a trio of new gestures, opening a realm of user-device interaction that previous, stationary incarnations of the Echo were incapable of expressing. For example, telling the Echo Show 10, “Alexa, have a nice day” results in the device first responding with a subtle motion before uttering a “Same to you”, followed by a “friendly” arcing motion, the equivalent of the universal wave gesture.
False as the impression may be, our minds have evolved to observe and respond to such physical cues, resulting in a device that feels not only capable of listening, but perhaps even of caring.
A cutaway view of the Echo Show 10’s motor (brass disc at bottom), responsible for the array of choreographed motions.
The three new choreographed motions, or “choreos,” on the Echo Show 10 are associated with specific messaging: Greeting, Acknowledgement, and Exit. The first, “Mixed Expressive Shakes,” is a quick, bouncing motion to both the right and left. The “Clockwise Medium Sweep” creates a measured, clockwise sweeping motion, while the “Counter Clockwise Slow Sweep” rounds out the trio with a slow, counterclockwise sweep.
After demoing the Amazon Echo Show 10, we spoke with Prakash Iyer, Director of Software Development at Amazon, to discuss how his team approached these choreographed motions to communicate with a sense of “delight.”
One thing I noticed is that the Echo Show 10 responds with a slight lag, something I was informed was an intentional decision by your team after discovering that a subtle but noticeable pause provides a “much more pleasant” experience for users. Could you tell us why?
During the development process, we realized early on that it’s just as important, if not more so, to know when Echo Show 10 should not move as it is to know when to move. The challenge was determining when movement is delightful, as opposed to when it’s distracting. To optimize for a smooth experience, Echo Show 10 will only turn once a customer settles into a position when interacting with the device – this is intentional. We found that if the device moves every time a customer adjusts their position, it appears jittery, and distracts from the experience.
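The settling behavior Iyer describes amounts to a stability gate: a new target position is only acted on after it has held steady for a moment. Amazon has not published its implementation; the sketch below is purely illustrative, with the tolerance and settle time chosen arbitrarily.

```python
import time

class StabilityGate:
    """Report a new target bearing only after it has held steady for
    settle_s seconds within tol_deg degrees -- an illustrative take on
    'turn only once the customer settles into a position'.
    (Wrap-around at 360 degrees is omitted for brevity.)"""

    def __init__(self, tol_deg=15.0, settle_s=0.5):
        self.tol_deg = tol_deg
        self.settle_s = settle_s
        self._candidate = None
        self._since = None

    def update(self, bearing_deg, now=None):
        now = time.monotonic() if now is None else now
        if self._candidate is None or abs(bearing_deg - self._candidate) > self.tol_deg:
            # The person moved: restart the settling clock.
            self._candidate = bearing_deg
            self._since = now
            return None
        if now - self._since >= self.settle_s:
            return self._candidate  # Held steady long enough: turn now.
        return None
```

Small adjustments inside the tolerance band keep the gate closed, which is what prevents the jitter Iyer mentions.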
What do you mean by “movement [that] is delightful”?
Our designers sat down, and created a “mild to wild” scale to measure delight, intention and purpose for the motions in development. Then, they thought about what experiences they could attach those motions to, and eventually narrowed to a few that we felt were in good taste, delightful and useful.
On the “wild” end of the spectrum, we chose not to launch some experiences. For example, we had a motion that did an elaborate dance in response to an error state. While the choreo itself was delightful, it didn’t seem natural to the information being communicated to the customer.
How much noise interference can the Echo tolerate before it becomes a problem for the device to track a speaker?
Echo Show 10 uses a fusion of audio-based localization and computer vision technologies to determine where the speaker is standing, and turns the screen in that direction. When there are multiple people in the room, Echo Show 10 will try to center itself to face everyone. Echo Show 10 will not react to small motions; rather, it moves only once there is some stability, or when it has to move to keep the person or people in view.
How does the Echo Show 10 know you’re there? By combining sound source localization (SSL) with computer vision (CV), the device can identify objects and humans in its field of view, enabling it to differentiate sounds coming from people from those coming from other sources and from reflections off walls.
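One simple way to picture this fusion: the audio pipeline yields a bearing estimate for the sound source, and the vision pipeline yields bearings for detected people; the sound is only trusted if a person is detected near it. Amazon’s actual algorithm is not public, so the function below is a minimal sketch under that assumption, with an arbitrary 30-degree gate.

```python
def fuse_bearing(audio_deg, person_bearings_deg, gate_deg=30.0):
    """Pick the detected person angularly closest to the sound source.

    Returns that person's bearing, or None if no detected person lies
    within gate_deg of the audio estimate (e.g. the sound is a wall
    reflection). Angles are in degrees and differences wrap around 360.
    """
    def ang_diff(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    best = None
    for p in person_bearings_deg:
        d = ang_diff(audio_deg, p)
        if d <= gate_deg and (best is None or d < ang_diff(audio_deg, best)):
            best = p
    return best
```

With a person at 10 degrees and another at 180, a sound localized at 350 degrees resolves to the nearby person at 10; a sound with no person within the gate is discarded as a reflection.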
How did the team finalize this specific form? Were earlier iterations notably different, or was this shape the primary foundation for all explorative forms?
Echo Show 10’s design has purpose. The base rotates so that the screen stays in view, no matter where you are in the room. As part of device set up, all customers must go through device mapping to set the range of motion for Echo Show 10. This allows the device to work in any size space.
One thing we’ve heard from those averse to using such responsive technology is the element of “creepiness” associated with a device that is not only listening, but now watching. How did the team differentiate movement to be perceived as attentive rather than stalking?
Motion is on by default, but customers are in control of their experience. Echo Show 10’s screen moves in two ways: rotating when you say the wake word, and during active engagement activities where motion is most useful, like a video call or watching a show on Prime Video. Customers are in control and can choose whether to leave motion on during all activities, select activities, set it to move only when explicitly asked, or turn it off entirely.
How granularly can developers customize the choreographed motions?
Right now, we’re focused on how developers plan to use the existing choreographed motions, and we may introduce more degrees of freedom in the future.
Today, developers can choose from the four available choreographed motions to customize their skill experiences.
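In skill code, APL’s motion support exposes these choreos through a PlayNamedChoreo command sent alongside a rendered APL document. The directive shape and the camel-cased motion names below reflect our reading of the APL motion documentation and should be treated as illustrative rather than definitive.

```python
# Illustrative: building an APL ExecuteCommands directive that plays
# one of the Echo Show 10's named choreographed motions. Command and
# motion names are assumptions based on the APL motion documentation.

CHOREOS = [
    "MixedExpressiveShakes",      # quick bounce to both sides
    "ClockwiseMediumSweep",       # measured clockwise sweep
    "CounterClockwiseSlowSweep",  # slow counterclockwise sweep
    "ScreenImpactCenter",         # presumed fourth available motion
]

def choreo_directive(token: str, choreo: str) -> dict:
    """Build an ExecuteCommands directive that plays a named choreo."""
    if choreo not in CHOREOS:
        raise ValueError(f"unknown choreo: {choreo}")
    return {
        "type": "Alexa.Presentation.APL.ExecuteCommands",
        "token": token,  # must match the token of the rendered APL document
        "commands": [{"type": "PlayNamedChoreo", "name": choreo}],
    }

directive = choreo_directive("greetingDoc", "ClockwiseMediumSweep")
```

A skill would attach such a directive to its response, pairing, say, the clockwise sweep with a greeting and the counterclockwise sweep with an exit, per the Greeting/Acknowledgement/Exit messaging described above.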
Noting that different countries and cultures communicate differently via movement, does the international version of the Echo have uniquely different movements?
Choreographed motions are the same internationally, but they were developed with global communities in mind. For example, the “Alexa, have a nice day” motion that greets the customer as they start their day follows the sun’s path, a motion universal to customers everywhere.
Even as traditional greetings vary between cultures, such as a hug or a European kiss on the cheek, the sun’s movement is consistent.