Lightweight checkpointing for concurrent ML

Published online by Cambridge University Press:  19 March 2010

LUKASZ ZIAREK
Affiliation:
Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907-2107, USA (e-mail: lziarek@cs.purdue.edu, suresh@cs.purdue.edu)
SURESH JAGANNATHAN
Affiliation:
Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907-2107, USA (e-mail: lziarek@cs.purdue.edu, suresh@cs.purdue.edu)

Abstract

Transient faults that arise in large-scale software systems can often be repaired by reexecuting the code in which they occur. Ascribing a meaningful semantics to safe reexecution in multithreaded code is not straightforward, however. For a thread to correctly reexecute a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If this is not done properly, data inconsistencies and other undesirable behavior might result. Automatically determining what constitutes a consistent global checkpoint is difficult, however, because thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction, called stabilizers, that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). We present a formal characterization of the design of stabilizers and provide a detailed description of their implementation within MLton, a whole-program optimizing compiler for Standard ML. Our experimental results on microbenchmarks as well as several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overhead of using stabilizers is small, and lead us to conclude that stabilizers are a viable mechanism for defining safe checkpoints in concurrent functional programs.
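The requirement that every thread that has witnessed a faulty thread's effects must also be reverted can be illustrated with a small sketch. The code below is not the stabilizers implementation and is not written in CML; it is a simplified Python illustration under stated assumptions. The function name `threads_to_revert`, the `(sender, receiver)` event format, and the thread names are all hypothetical, and the sketch assumes that a receiver witnesses its sender's effects but not the reverse. It ignores event ordering and per-thread checkpoint positions, both of which a real mechanism would need to track.

```python
from collections import defaultdict, deque


def threads_to_revert(events, failing_thread):
    """Return the set of threads that must roll back with failing_thread.

    events: iterable of (sender, receiver) pairs logged since the last
    checkpoint (hypothetical format, assumed for this illustration).
    A receiver is assumed to have witnessed the sender's effects, so it
    must be reverted whenever the sender is reverted.
    """
    # Map each thread to the threads that received a message from it.
    witnessed_by = defaultdict(set)
    for sender, receiver in events:
        witnessed_by[sender].add(receiver)

    # Breadth-first search over the "witnessed by" relation.
    to_revert = {failing_thread}
    queue = deque([failing_thread])
    while queue:
        thread = queue.popleft()
        for witness in witnessed_by.get(thread, ()):
            if witness not in to_revert:
                to_revert.add(witness)
                queue.append(witness)
    return to_revert


# Hypothetical log: t1 sent to t2, t2 sent to t3, t4 sent to t1.
log = [("t1", "t2"), ("t2", "t3"), ("t4", "t1")]
print(sorted(threads_to_revert(log, "t1")))
```

In this example, reverting `t1` also forces `t2` and `t3` to revert, because `t3` observed `t2` after `t2` observed `t1`. Thread `t4` only sent to `t1`, so under the one-directional assumption made here it is left alone.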

Type
Articles
Copyright
Copyright © Cambridge University Press 2010
