Using speech to identify gesture pen strokes in collaborative, multimodal device descriptions

Published online by Cambridge University Press: 11 July 2011

James Herold
Affiliation:
Department of Computer Science and Engineering, University of California, Riverside, California, USA
Thomas F. Stahovich
Affiliation:
Department of Mechanical Engineering, University of California, Riverside, California, USA

Abstract

One challenge in building collaborative design tools that use speech and sketch input is distinguishing gesture pen strokes from object strokes, that is, strokes representing device structure. In previous work, we developed a gesture/object classifier that uses features computed from the pen strokes and the speech aligned with them. Experiments indicated that the speech features were the most important for distinguishing gestures, which underscores the critical importance of the speech–sketch alignment. Consequently, we have developed a new alignment technique that employs a two-step process: the speech is first explicitly segmented (primarily into clauses), and then the segments are aligned with the pen strokes. Our speech segmentation step is unique in that it uses sketch features to locate segment boundaries in multimodal dialog. In addition, it uses a single classifier to directly combine word-based, prosodic (pause), and sketch-based features. In the second step, segments are initially aligned with strokes based on temporal correlation, and then classifiers are used to detect and correct two common alignment errors. Our two-step technique has proven to be substantially more accurate at alignment than the existing technique, which lacked explicit segmentation. More importantly, in nearly all cases, our new technique yields greater gesture classification accuracy than the existing technique and performs nearly as well as the benchmark manual speech–sketch alignment.
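
As a rough illustration of the initial alignment in the second step, the sketch below assigns each pen stroke to the speech segment with which it shares the most time. This is only one possible reading of "temporal correlation": the abstract does not specify the measure, so the overlap score, the nearest-segment fallback, and the names `Interval`, `overlap`, and `align_by_overlap` are all assumptions made for illustration. The classifier-based error correction that follows the initial alignment is not shown.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """A time span in seconds (hypothetical representation of a
    speech segment or a pen stroke)."""
    start: float
    end: float


def overlap(a: Interval, b: Interval) -> float:
    """Length of time shared by two intervals (0.0 if they are disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_overlap(segments, strokes):
    """Map each stroke index to a segment index.

    Assumption: a stroke goes to the segment it overlaps most in time;
    if it overlaps none, it goes to the segment closest to it in time.
    """
    if not segments:
        return {}
    alignment = {}
    for stroke_idx, stroke in enumerate(strokes):
        def score(seg_idx, stroke=stroke):
            seg = segments[seg_idx]
            shared = overlap(seg, stroke)
            # Gap between the two intervals; 0.0 when they touch or overlap.
            gap = max(seg.start - stroke.end, stroke.start - seg.end, 0.0)
            # Prefer larger overlap, then smaller gap.
            return (shared, -gap)
        alignment[stroke_idx] = max(range(len(segments)), key=score)
    return alignment


segments = [Interval(0.0, 2.0), Interval(2.5, 5.0)]
strokes = [Interval(0.5, 1.5), Interval(1.8, 3.5), Interval(6.0, 7.0)]
result = align_by_overlap(segments, strokes)
```

In this toy input, the third stroke overlaps no segment, so it falls back to the segment nearest in time. A case like that is a natural candidate for the kind of alignment error a later correction stage would need to examine.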

Type
Special Issue Articles
Copyright
Copyright © Cambridge University Press 2011
