“What happens if we are successful?” is rarely a question that gets asked when a business case is put together. Let’s think for a moment about what the answer would be for a wildly successful big data or data analytics project.
The answer, in its most basic form, is simple: you will have a LOT of data! And by a lot, we mean it is not unreasonable to end up with hundreds of terabytes. Some examples of how big the data can get are staggering:
• Facebook stores and analyzes 30+ petabytes of user-generated data1
• AT&T’s calling records database holds 1.9 trillion rows of data1
• Walmart processes 1 million customer transactions every hour, resulting in 2.5 petabytes of customer transaction data1
Collecting the data, however, is only half of the challenge. To gain business value from the project, we must analyze the data and derive intelligence that produces meaningful results.
Now the question becomes, “How do we analyze the massive amounts of data we have collected?” This is where big data meets big compute. Fortunately, big computing has been an established discipline since the 1960s, known as High Performance Computing, or HPC.
HPC, or supercomputing, grew out of the scientific and research sectors, where it is used to analyze massive data sets for discovery. The largest HPC deployments can have hundreds of thousands of compute cores. A great example is the new Stampede 2 supercomputer at the Texas Advanced Computing Center, powered by Dell PowerEdge servers and Intel processors and providing over 285,000 compute cores.2
While we are not suggesting that every big data project needs to build a supercomputer to analyze its data, it is worth looking at what the HPC industry has learned about analyzing massive data sets and considering those lessons when building your data analytics business case.
HPC environments make extensive use of parallel processing to crunch huge data sets. Luckily, this approach is available at a much smaller scale. Dell EMC offers preconfigured systems like the Dell EMC HPC System for Manufacturing. These systems are purpose-built for parallel workloads and look beyond compute to design an entire system (network, storage, and software) that takes proper advantage of parallel processing. Even at small counts of fewer than 10 nodes, these systems provide tremendous advantages over traditional, single-system approaches.
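To make the parallel-processing idea concrete, here is a minimal sketch (not tied to any Dell EMC product) of the data-parallel pattern HPC systems use: a large data set is split into chunks, each chunk is aggregated on a separate worker process, and the partial results are combined. The chunk size, worker count, and the simple sum-based aggregation are all illustrative assumptions.

```python
# Data-parallel sketch: split the data into chunks, aggregate each chunk
# on a worker process, then combine the partial results.
from multiprocessing import Pool

def aggregate_chunk(chunk):
    # Stand-in for real analysis: sum the values in one chunk.
    return sum(chunk)

def parallel_total(data, workers=4, chunk_size=1000):
    # Split the data set into fixed-size chunks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Farm the chunks out to a pool of worker processes.
    with Pool(workers) as pool:
        partials = pool.map(aggregate_chunk, chunks)
    # Combine the partial results into the final answer.
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_total(data))
```

The same pattern scales from a laptop with a handful of cores up to clusters with thousands of nodes; only the mechanism for distributing the chunks changes.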
HPC environments are always looking for speed, and accelerators are a key source of it. Technologies like GPU processing and high-speed networking fabrics such as Intel’s Omni-Path help remove bottlenecks. While your specific environment may not need the speed of GPU processing, it could still be accelerated by an investment in faster networking, such as a move to 10 or 40 GigE.
The key is to ask the question, “What happens when we are successful?” and then plan ahead. Luckily, in the data analytics space, HPC environments have been facing these problems for decades, and many of their lessons can help you produce the results you need in a timely fashion.
1 Waterford Technologies "Big Data Statistics & Facts for 2017." 22 Feb. 2017. Web. https://www.waterfordtechnologies.com/big-data-interesting-facts/