Multimodal User Interface Principles
1 Multiple Modalities Need to Be Synchronized
Spoken interaction is highly temporal, whereas visual interaction is spatial. When combining these modes of interaction in a multimodal interface, synchronization is a key feature that determines overall usability of the interface. Synchronization is key to the following multimodal acts:
Point and Talk
The user points at a location on a map while speaking a question.
The user interface supplements visual output with a spoken confirmation; for example, a travel reservation system might highlight the user's selection while speaking an utterance of the form
Leaving from San Francisco
If the visual highlight and the spoken confirmation are not synchronized, the mismatch can significantly increase the cognitive load on the user and prove a source of confusion.
The user interaction leverages the availability of multiple streams of output to increase the bandwidth of communication. For example, a travel reservation system might visually present a list of available flights and speak a prompt of the form
There are seven flights that match your request, and the flight at 8:30 A.M. appears to be the most convenient.
To be effective, such complementary use of multiple modalities needs to be well synchronized with respect to the underlying interaction state.
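As a minimal sketch of the Point and Talk case, assume every input event carries a timestamp. A pointing gesture can then be paired with the spoken question whose time span it falls within. The names `PointEvent`, `Utterance`, and `fuse`, and the 0.5-second tolerance, are hypothetical and chosen only for illustration; a real system would tune the window to its own recognizers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PointEvent:
    x: float
    y: float
    time: float  # seconds since the start of the session


@dataclass
class Utterance:
    text: str
    start: float  # seconds
    end: float    # seconds


def fuse(utterance: Utterance, points: List[PointEvent],
         tolerance: float = 0.5) -> Optional[PointEvent]:
    """Return the pointing gesture closest in time to the utterance.

    A gesture qualifies only if it falls inside the utterance's time span,
    widened by `tolerance` seconds on both sides (an assumed value).
    Returns None when no gesture qualifies, so the caller can ask the
    user to clarify instead of guessing.
    """
    lo, hi = utterance.start - tolerance, utterance.end + tolerance
    candidates = [p for p in points if lo <= p.time <= hi]
    if not candidates:
        return None
    midpoint = (utterance.start + utterance.end) / 2
    return min(candidates, key=lambda p: abs(p.time - midpoint))


question = Utterance("What hotels are near here?", start=10.0, end=11.5)
gestures = [PointEvent(40, 80, time=3.2), PointEvent(120, 45, time=10.8)]
target = fuse(question, gestures)  # picks the gesture made while speaking
```

The stale gesture at 3.2 seconds is ignored because it lies outside the utterance's window, which is the kind of mismatch that unsynchronized modalities would otherwise produce.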
2 Multimodal Interaction Should Degrade Gracefully
Human interaction degrades gracefully. For example, a face-to-face conversation remains effective when one of the participants is functionally blind, as when talking over a telephone. This form of graceful degradation is due to the high level of redundancy in human communication. As man-machine interfaces come to include multimodal interaction, we need to ensure that these interfaces degrade gracefully in a manner akin to human conversation. Such graceful degradation is important because the user's needs and abilities can change over time; for example, a user with a multimodal device may move between a noisy environment where spoken interaction fails and an eyes-free environment where visual interaction is unavailable.
The use of multiple modalities to supplement one another leads to user interfaces that degrade gracefully.
Portions of the interface that use multiple modalities to complement one another are natural points where the interface will fail to degrade gracefully. When complementary modalities are used, the underlying system needs to be aware of the modalities that are currently available and ensure that all essential items of information are conveyed to the user. This is a key accessibility requirement when ensuring that the user interface is usable by individuals with different needs and abilities.
Capabilities can change rapidly in the case of mobile users. These include available bandwidth between the mobile device and the network, as well as changes in the bandwidth of communication between device and user. To be useful, multimodal interaction that is deployed to mobile devices needs to adapt gracefully to such changes.
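One way to picture this requirement is a presentation planner that knows which output modalities are currently available. The sketch below is an assumption-laden illustration, not a prescribed design: the function `present` and the modality names `"visual"` and `"speech"` are hypothetical. When both modalities are available they complement each other; when only one remains, it carries every essential item.

```python
from typing import Dict, List, Set

KNOWN_MODALITIES = {"visual", "speech"}  # assumed set, for illustration


def present(departures: List[str], recommended: str,
            available: Set[str]) -> Dict[str, str]:
    """Plan output per modality so essential information is never lost."""
    usable = available & KNOWN_MODALITIES
    if not usable:
        raise ValueError("no usable output modality")
    listing = ", ".join(departures)
    summary = (f"There are {len(departures)} flights that match your "
               f"request; the {recommended} flight appears most convenient.")
    if usable == KNOWN_MODALITIES:
        # Complementary use: the screen carries the list, speech the summary.
        return {"visual": listing, "speech": summary}
    # Degraded case: the single remaining modality carries everything.
    (only,) = usable
    return {only: f"{summary} Departures: {listing}."}


flights = ["7:15 A.M.", "8:30 A.M."]
both = present(flights, "8:30 A.M.", {"visual", "speech"})
eyes_free = present(flights, "8:30 A.M.", {"speech"})
```

In the eyes-free case the spoken prompt includes the departure list that would otherwise appear only on screen, so no essential item depends on a modality the user cannot currently perceive.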
3 Multiple Modalities Should Share a Common Interaction State
Consider the user task of setting up a travel itinerary using a multimodal interface, for example, one that allows the simultaneous use of spoken and visual interaction. Successful task completion during such a conversation requires that the participants share a common mental model, and this is true in the case of man-machine interaction as well. When using multiple modalities in a user interface, it is important that the various modes of interaction share a common interaction state that is used to update the presentation in the various available output media. Such a common interaction state is also essential for rapid completion of the conversation, since the various multimodal interactors can examine this shared interaction state in determining the next step in the dialog. A shared interaction state is important for the following multimodal acts:
The user switches between interaction modalities owing to a number of external factors, such as the bandwidth available between the user and the device or between the device and the network. As an example, the user might answer the first few questions using spoken input and then use the visual interface to complete the conversation. For such transitions to be seamless, the data collected by each interaction modality, as well as the information conveyed via the available output media, needs to be driven by a shared interaction state.
The shared interaction state can track the history of user interaction, and this history can be useful in determining the most appropriate path through the dialog to achieve rapid task completion. For example, if the user had requested nonstop flights earlier in the conversation, this knowledge can be used in customizing the visual presentation by filtering out flights that do not meet this requirement.
A user with a mobile device might use a large visual display upon entering a conference room. To achieve a synchronized multimodal experience, the user's mobile device and the conference room display will need to share some interaction state.
It is advantageous to offload complex speech processing to network servers when using thin clients such as cell phones. As an example, a cell phone might be capable of local speech processing sufficient to enable the user to dial a small number of frequently used entries by speaking a name. If the name is not found in this list, it may be looked up in a larger phone book, for example, a company directory, and the speech processing required might be best offloaded to a network server. Sharing a common interaction state between the visual and spoken components of the cell phone is essential for synchronized multimodal interaction in such distributed deployments.
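The idea of a shared interaction state can be sketched as a single object that every modality reads from and writes to. This is a minimal illustration under stated assumptions: the class `InteractionState`, its slot names, and the destination value are hypothetical, and a distributed deployment would need to replicate this state across device and server rather than hold it in one process.

```python
from typing import Callable, Dict, List, Optional, Tuple

Listener = Callable[[str, str], None]


class InteractionState:
    """Single source of truth shared by all modalities (hypothetical design)."""

    def __init__(self, slots: List[str]) -> None:
        self.values: Dict[str, Optional[str]] = {name: None for name in slots}
        self.history: List[Tuple[str, str, str]] = []  # (modality, slot, value)
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        """Register an output component to be refreshed on every change."""
        self._listeners.append(listener)

    def update(self, modality: str, slot: str, value: str) -> None:
        """Record input from any modality, then refresh every output."""
        if slot not in self.values:
            raise KeyError(f"unknown slot: {slot}")
        self.values[slot] = value
        self.history.append((modality, slot, value))
        for notify in self._listeners:
            notify(slot, value)

    def next_slot(self) -> Optional[str]:
        """First unfilled slot, or None when the task is complete."""
        return next((s for s, v in self.values.items() if v is None), None)


screen: List[str] = []   # stands in for the visual presentation
spoken: List[str] = []   # stands in for the speech output queue
state = InteractionState(["origin", "destination", "date"])
state.subscribe(lambda slot, value: screen.append(f"[{slot}: {value}]"))
state.subscribe(lambda slot, value: spoken.append(f"{slot} set to {value}"))

state.update("speech", "origin", "San Francisco")  # answered by voice
state.update("visual", "destination", "Boston")    # completed on screen
```

Because both outputs are driven by the same state, the user can switch from speech to the visual interface mid-dialog, and either interactor can call `next_slot()` to determine what to ask for next. The recorded history is also available for later decisions, such as filtering results by an earlier request.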
4 Multimodal Interfaces Should Be Predictable
Multimodal interaction provides the user with a multiplicity of choices and often enables a given task to be performed in a number of different ways. But to be effective, the interface needs to empower the user to arrive intuitively at these different means of completing a given task. Symmetric use of modalities where appropriate can significantly enhance the usability of applications along this dimension; for example, an interface that can accept input via speech or pen might visually highlight an input area while speaking an appropriately designed prompt. Where a specific modality is unavailable for a given task (for example, signatures may only be entered via pen input), appropriate prompt design can help make the user implicitly aware of this restriction. Predictable multimodal user interfaces are important for the following:
Eliciting Correct Input
Appropriately designed prompts are important for getting the desired user input. This in turn can lead to rapid task completion and avoid user frustration when using noisy input channels such as speech.
What Can I Do?
Rich user interfaces can often leave the user impressed with the available features but baffled as to what can be done next. Spoken interaction, combined with good visual user interface design, can be leveraged to overcome this "lost in space" problem. Rich multimodal interfaces can use the shared interaction state and dialog history to create user interface wizards that guide the user through complex tasks.
5 Multimodal Interfaces Should Adapt to the User's Environment
Finally, multimodal interfaces need to adapt to the user's environment to ensure that the best means of completing a given task are made available to the user at any given time. In this context, optimality is determined by:
The user's needs and abilities
The abilities of the connecting device
Available bandwidth between device and network
Available bandwidth between device and user
Constraints placed by the user's environment, for example, the need for hands-free, eyes-free operation