Troubleshooting Hadoop: Distributed Debugging

The Hadoop ecosystem is a brilliant constellation of moving parts that provide a scalable solution to the ever-increasing need to analyze and transform petabyte-scale datasets. However, running layers of software packages on top of distributed systems is a daunting administrative task that often leaves operators at a loss for where to look for the root cause when a problem arises. Knowing which service is having trouble and where to find its logs is a first step, but many common failure scenarios require examining the interactions between components to find a long-term solution.

This talk is aimed at beginner- to intermediate-level Hadoop administrators looking to expand their set of debugging techniques. We start with high-level component interactions, then dive deeper into specific failure scenarios. Special attention is given to YARN applications, where a common debugging problem is telling a failed application apart from a troubled piece of the infrastructure.