Abstracting the Hadoop World

Sitting on a plane (yet again), I got to thinking about abstraction. Abstraction is an interesting human concept. When done properly, no one notices. When done poorly, everyone notices (and subsequently everyone is an expert in it).

Recently, I helped a friend, James, while he was purchasing a thermal printer. Yeah, I said "thermal". It hadn't even occurred to me how prevalent they are in today's world. Every receipt, shipping label, and building visitor badge comes from a thermal printer, so the market is competitive. Unfortunately, it's a business-to-business product, so you can't really educate yourself on a printer's specs by going onto Amazon (although you can buy one there).

Like a technologist, I helped Jimmy (I like calling him Jimmy because it annoys him) by researching all the specs and reviews. What I found is that thermal printers aren't just slightly cheaper to operate; the savings are significant. For example, printing 25k pages a year runs through about 100 reams of paper and about 10 toner cartridges. After maintenance and shipping, you're looking at $2k/year. With a thermal printer there is no toner; it just scorches the paper, so the cost for the same 25k pages drops to about $85 in paper. That's quite some savings.
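That back-of-the-envelope math can be sketched out; the per-unit prices below are illustrative assumptions I've picked so the totals line up with the figures above, not numbers from any actual quote:

```python
# Back-of-the-envelope annual cost comparison for ~25k pages/year.
# Unit prices are illustrative assumptions, not real quotes.
pages_per_year = 25_000

# Laser/toner printer: ~100 reams of paper, ~10 toner cartridges per year
laser_paper = 100 * 8.00        # assumed $8 per ream
laser_toner = 10 * 80.00        # assumed $80 per cartridge
laser_overhead = 400.00         # assumed maintenance + shipping
laser_total = laser_paper + laser_toner + laser_overhead

# Thermal printer: no toner at all -- the printhead scorches the paper
thermal_total = 85.00           # thermal paper for the same 25k pages

print(f"laser:   ${laser_total:,.0f}/year")    # → laser:   $2,000/year
print(f"thermal: ${thermal_total:,.0f}/year")  # → thermal: $85/year
print(f"savings: {1 - thermal_total / laser_total:.0%}")  # → savings: 96%
```

The toner, not the paper, dominates the laser printer's operating cost, which is why eliminating it changes the economics so dramatically.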

After some careful research, I ended up purchasing a printer for Jimmy and found a specific 4" wide paper that would give him a nice finished product. I got all the pieces and plugged it into my laptop to test. That's when I noticed a fairly significant problem. See, the printer accepted the paper width, but the diameter of the spool the paper was wound on was too small. Arg. The printer would print, but the roll of paper would just sit at the bottom of the tray and rock back and forth. Frustration.

To me this was an example of poor abstraction (please ignore that I completely missed this spec when buying the printer and paper). 

I've found that this type of abstraction problem persists today in servers for the Hadoop world. As much as I want to believe that we technologists abstract better than other industries do, we are still victims of inconsistencies in the hardware/software combinations we purchase. MPP databases are designed for scale-out architectures; however, when they are placed on non-homogeneous hardware, they become lopsided. There are, of course, opportunities to leverage that hardware *a little bit* by creating lopsided data distributions that mimic the servers. Try doing that at hundreds of nodes.

Hadoop made some strides forward with its internal MapReduce architecture. It "pulls" tasks instead of pushing them. This means that node failures or network disconnects can be abstracted out of the overall architecture. It also means that more powerful servers will naturally pull more tasks and do more work. It works well, but it doesn't solve the storage-to-processor ratio issue: if your application changes or data volume spikes, you may not have enough storage attached to the hundreds of nodes you've carefully built.
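The self-balancing effect of pull-based scheduling can be shown with a toy simulation. This is not Hadoop code; the node names and speeds are made up purely to illustrate why nodes that ask for work (rather than being assigned it) naturally end up doing work in proportion to their capacity:

```python
from collections import deque

# Toy simulation of pull-based task scheduling (illustrative, not Hadoop's API).
# Each node pulls work when it has free capacity, so a faster node asks for
# tasks more often and naturally ends up doing more of the total work.

tasks = deque(range(100))                 # 100 tasks, each one unit of work
nodes = {"fast": 4, "slow": 1}            # tasks each node completes per round
completed = {name: 0 for name in nodes}

while tasks:
    for name, speed in nodes.items():
        # each node pulls up to `speed` tasks per scheduling round
        for _ in range(speed):
            if not tasks:
                break
            tasks.popleft()
            completed[name] += 1

print(completed)  # → {'fast': 80, 'slow': 20}
```

No central scheduler had to know the nodes' speeds; the 4:1 split falls out of the pull loop itself. That is also why a dead node is harmless here: it simply stops asking for work.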

Frustrated with my choice of thermal printer, I started to see if I could find an adapter to solve the issue. No luck. Out of sheer stubbornness, I went to an online CAD website and pulled out my calipers. I carefully measured and designed a 1" to 6mm adapter. Next, I went to a 3D printing company, uploaded my model, and had it printed. $16 later, I had my solution. (Just to prove it, I'm including a picture of my adapter.)

Solving the same problem in the MPP and Hadoop space proved to be no different. The folks at DriveScale have come up with server-to-disk mapping software. It's part of a new push in the Big Data industry called "software composable infrastructure". It solves the exact problem of mapping the processors (in my case, the printer) to the disks (in my case, the paper). It does this by leveraging 1U servers and JBODs instead of 2U servers with direct-attached storage.

The benefits of moving big data servers into this space are tremendous. Not only do you get flexibility with the same performance, but costs can be driven down dramatically (just like those thermal printers). Growth or changes in a cluster aren't a problem either: altering the processor-to-disk ratio becomes simple, with no additional cost. Deployment and day-to-day maintenance are simplified as well.

Luckily, DriveScale is a valued partner here at RoundTower. We are doing an informational webinar together on June 27th @ 10:00a PT / 1:00p ET. If you've read this far, I invite you to join us for a much more in-depth look at this problem (and more) around data center growth. I promise I'll leave the abstraction concept up to them, but I'm available if you need an adapter for a thermal printer. Oh, and Jimmy never noticed the adapter I built.

Register for the June 27th webinar by clicking the button below.

