After a Big Data Scaling 'Crisis,' LinkedIn Open Sources Scale Testing Tool
- By David Ramel
- February 9, 2018
LinkedIn engineers have open sourced a homegrown Big Data scale testing tool following a "crisis" they experienced after adding 500 machines to a Hadoop Distributed File System (HDFS) cluster, resulting in disastrous slowdowns.
The company attributed the problem to applying changes from the Apache Hadoop open source community, from which even official releases aren't always subject to performance and scale testing.
Called Dynamometer, the tool addresses that testing void, allowing for performance/scale testing without the traditional cost and hassle associated with spinning up huge systems and clusters to assure real-world functionality doesn't suffer.
And suffer it did, when LinkedIn added 500 machines to its HDFS cluster in March 2015. That caused operations performed against the team's primary HDFS cluster to take up to two orders of magnitude longer than normal -- if they didn't time out completely.
"By the time the issue was detected, it was non-trivial to remove the new machines, as they had already accumulated a significant amount of data that would need to be copied off before they could be removed from service," the team said in a blog post yesterday. "Instead, we worked quickly to solve the specific issue causing the performance regression ... and subsequently began to make plans for how to protect ourselves from similar mishaps in the future."
That protection is now being shared with the open source community via Dynamometer, "a framework that allows us to realistically emulate the performance characteristics of an HDFS cluster with thousands of nodes using less than 5 percent of the hardware needed in production."
While expressing surprise that open source Big Data releases aren't tested for scaling functionality -- which, after all, is the core tenet of Big Data to begin with -- LinkedIn said the testing deficit makes sense considering:
- Scale testing is expensive -- the only way to ensure that something will run on a multi-thousand node cluster is to run it on a cluster with thousands of nodes.
- HDFS is maintained by a distributed developer community, and many developers do not have access to large clusters.
- Developers are aware that Apache community releases (as opposed to distributions like CDH and HDP) are not typically run in production.
The code for Dynamometer is on GitHub, where the project's three main components were described:
- Infrastructure: This is the YARN application which starts a Dyno-HDFS cluster.
- Workload: This is the MapReduce job which replays audit logs.
- Block Generator: This is a MapReduce job used to generate input files for each Dyno-DN; its execution is a prerequisite step to running the infrastructure application.
The site says: "Dynamometer is a tool to performance test Hadoop's HDFS NameNode. The intent is to provide a real-world environment by initializing the NameNode against a production file system image and replaying a production workload collected via e.g. the NameNode's audit logs. This allows for replaying a workload which is not only similar in characteristic to that experienced in production, but actually identical."
LinkedIn has already enjoyed multiple benefits from the tool, it said, such as making tweaks to the configuration of an upgrade from Hadoop 2.3 to Hadoop 2.6. The tool tipped off engineers to adjust JVM heap size and garbage collection tuning parameters, "potentially avoiding a disaster upon upgrading."
David Ramel is an editor and writer for Converge360.