The Computing Innovation Fellows Project

Matchmaking Service for Mentors and CIFellows


Gene Cooperman

University/Research Lab: Northeastern University
Location: Boston, MA
Personal Research Web Page: http://www.ccs.neu.edu/home/gene/

Keywords: checkpointing, Condor, OpenMPI, Infiniband, IBM Blue Gene, reversible debugger

Posted on: Wednesday, April 28th, 2010
Broad Research Area: Networks / Operating Systems, Numerical/Scientific Computing / HPC / Data-Intensive Scalable Computing

Research Interests:

Our group (currently five PhD students, two M.S. students, and an undergraduate) has a six-year history of developing a robust, transparent checkpointing package: DMTCP (Distributed MultiThreaded CheckPointing; dmtcp.sourceforge.net). DMTCP now sees over 100 downloads per month. We checkpoint the binary executable directly, entirely in user space, with no modification to the kernel. We transparently checkpoint and restart most programs, including OpenMPI, Matlab, Python, Perl, bash, Scheme, Lisp, etc. We are collaborating with the teams for Condor, Maple, SCIRun and others to support checkpointing in those environments. In the area of supercomputing, we plan to support Infiniband (in addition to the current support for TCP/IP), and such supercomputers as the Cray XT and IBM Blue Gene.

In the last year, we have been developing a reversible (time-travelling) debugger (URDB: Universal Reversible Debugger; urdb.sourceforge.net) that adds reversibility to gdb, the Matlab debugger, the Perl debugger (perl -d), and the Python debugger (pdb). The foundation is DMTCP's support for checkpointing entire debugging sessions (for example, gdb together with its target process). This research platform opens up many exciting possibilities, such as a program that can communicate with its past self and make decisions. One of the novelties implemented today is the ability to do a binary search over the lifetime of a program and stop at the point in time where a given expression changed from a "good value" to a "bad value". This allows a user to run until an error is detected, then construct an expression representative of the underlying fault that caused the bug, and then move quickly to the point in the lifetime of that process where the expression changed from a good value to a bad value.
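The binary search described above can be illustrated with a small, self-contained sketch. This is not URDB's implementation: the names first_bad_step, state_at and toy_step are hypothetical, the "program" is a deterministic step function, and re-running from the start stands in for restarting from a checkpoint. It also assumes the expression is good at the beginning, bad at the end, and stays bad once it turns bad.

```python
from typing import Callable, TypeVar

S = TypeVar("S")


def first_bad_step(
    initial: S,
    step: Callable[[S], S],
    is_good: Callable[[S], bool],
    total_steps: int,
) -> int:
    """Return the smallest n such that the state after n steps is bad.

    Assumes the state after 0 steps is good, the state after total_steps
    is bad, and the expression never flips back to good.
    """

    def state_at(n: int) -> S:
        # Hypothetical stand-in for "restart from a checkpoint and run
        # forward": here we simply replay the program from the start.
        state = initial
        for _ in range(n):
            state = step(state)
        return state

    lo, hi = 0, total_steps  # invariant: good at lo, bad at hi
    if not is_good(state_at(lo)) or is_good(state_at(hi)):
        raise ValueError("need a good value at the start and a bad value at the end")

    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(state_at(mid)):
            lo = mid  # the change happened after mid
        else:
            hi = mid  # the change happened at or before mid
    return hi


def toy_step(state):
    """Toy program: keeps a running total, with a bug injected at step 37."""
    n, total = state
    n += 1
    if n == 37:
        total -= 10_000  # injected fault: the total becomes negative
    else:
        total += n
    return (n, total)


# "Good value" here means the running total is non-negative.
bad_at = first_bad_step((0, 0), toy_step, lambda s: s[1] >= 0, 100)
print(bad_at)
```

In this toy setting the search inspects only a logarithmic number of points in the program's lifetime. A real reversible debugger would restore the nearest earlier checkpoint instead of replaying from the beginning.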

 

Contact Information:

If interested, please send e-mail to: gene@ccs.neu.edu

