Wednesday, 5 March 2014

Cache on top of cache on top of cache

Hi everyone,

For the second post, I'll dive into caching in its many forms, and explore how this critical piece of the puzzle helps build a storage environment that can keep up with the workloads demanded of it.

I'll simplify things a little to keep them clear; there are many, many factors that can affect the examples given, but for the purposes of illustration, I won't overcomplicate things.

What is it?


Caching is a critical part of any storage infrastructure, allowing traditionally cumbersome spinning media to keep up with the performance demands of a workload. When something like server virtualization comes along, that need is amplified by the increasingly random patterns in which data is read.

A cache is typically a small amount of very fast storage placed in front of a large amount of relatively slow storage. At the level of a single disk, chunks of data are read from the much slower mechanical or flash-based media into that small amount of memory, where a server's request can be satisfied by picking out the bits it needs now, and likely the next bits it will need in a moment.
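To make this concrete, here's a minimal sketch in Python of a read-through cache with simple read-ahead. The block numbering, capacity, and read_from_disk stand-in are all illustrative assumptions, not the behaviour of any particular drive or controller:

    from collections import OrderedDict

    def read_from_disk(block_id):
        # Stand-in for a slow read from the mechanical or flash media (illustrative only).
        return "data-%d" % block_id

    class ReadCache:
        def __init__(self, capacity=64, read_ahead=2):
            self.capacity = capacity        # blocks the cache memory can hold
            self.read_ahead = read_ahead    # extra blocks to prefetch past the one requested
            self.blocks = OrderedDict()     # block_id -> data, kept in LRU order

        def read(self, block_id):
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)    # hit: served from fast memory
                return self.blocks[block_id]
            # Miss: fetch the requested block plus the next few it will likely want.
            for b in range(block_id, block_id + 1 + self.read_ahead):
                if b not in self.blocks:
                    self.blocks[b] = read_from_disk(b)
                    if len(self.blocks) > self.capacity:
                        self.blocks.popitem(last=False)  # evict least recently used
            return self.blocks[block_id]

A sequential read of blocks 0, 1, 2 touches the slow media only once and is served from memory twice, which is exactly the win described above.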

All of this happens incredibly quickly, at the level of milliseconds or microseconds; even so, shaving just one millisecond off the time it takes to satisfy a data request makes a large difference.

For example, a storage device that responds to each request in 12 milliseconds could take up to four times as long to answer a series of requests as one that responds in 3 milliseconds, which in turn might take three times as long again to answer the same series as one that responds to each request in 1 millisecond. (This is highly simplified.)
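Putting numbers on that: for a strictly serial stream of requests, total time is simply request count multiplied by latency. A back-of-envelope sketch, assuming 1,000 one-after-another requests:

    requests = 1000
    for latency_ms in (12, 3, 1):
        total_s = requests * latency_ms / 1000.0
        print("%2d ms per request -> %4.1f s for %d serial requests"
              % (latency_ms, total_s, requests))
    # 12 ms -> 12.0 s; 3 ms -> 3.0 s (4x faster); 1 ms -> 1.0 s (3x faster again)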

Once you scale the example up to an environment with hundreds of disks and thousands of users performing hundreds of thousands of requests against those disks every second, the benefits of reduced latency compound, because far more "work" can be completed in any given amount of time.
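Another way to see the scaling: at a queue depth of one, the ceiling on requests per second from a single device is roughly 1 divided by its latency, and that gap multiplies across every disk in the array. The figures below are illustrative only:

    for latency_ms in (12, 3, 1):
        iops_per_disk = 1000.0 / latency_ms    # serial requests per second, one disk
        print("%2d ms -> ~%4.0f IOPS per disk, ~%6.0f across 100 disks"
              % (latency_ms, iops_per_disk, 100 * iops_per_disk))
    # ~83 vs ~333 vs ~1000 IOPS per disk; multiplied over hundreds of disks and
    # thousands of users, the lower-latency device gets vastly more work done.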

Cache exists within practically all types of storage devices and storage controllers. Each SAS or SATA disk comes with a small amount of cache, RAID controllers have memory integrated into them to provide an additional layer of performance and intelligence, and some controllers can even use a small amount of solid state storage to provide a comparatively large caching area.
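Those layers behave like a chain of caches, each consulted in turn before falling through to the next, slower one. A simplified sketch of the lookup path; the level names reflect the layers just described, but treating each as a plain dictionary is purely illustrative:

    # Checked fastest-first; a hit at any level avoids touching the levels below it.
    levels = [
        ("on-disk DRAM cache",    {}),
        ("RAID controller cache", {}),
        ("controller SSD cache",  {}),
    ]

    def read(block_id):
        for name, cache in levels:
            if block_id in cache:
                return cache[block_id]       # hit: answered at this level
        data = "data-%d" % block_id          # miss everywhere: read the spinning media
        for name, cache in levels:
            cache[block_id] = data           # populate each layer on the way back up
        return data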

When it comes to caching, adding an SSD or flash unit directly to a RAID card, or installing it in the host server itself, can provide some benefit, and faster technologies such as DRAM can further enhance the cache built into the storage devices themselves.

Practical application


Environments that rely on a shared storage architecture, as most highly available, virtualized environments do, present a management headache: some servers may be under-utilising high-performance resources in some areas while being bottlenecked in others.

The solution to many of these challenges is the ability to add fast disk technologies to your existing shared storage infrastructure, and to use DRAM (which has lower latency and higher throughput than even the fastest flash devices on the market) within the shared storage to further enhance the tier-1 flash devices.

With roughly 300 flash and SSD vendors on the market right now, the number of options is staggering, and being anchored to a single vendor or disk technology could well be a massive drawback given the rate of innovation around SSD and flash technologies.

My ideal environment would be heterogeneous, with multiple tiers of storage: best-of-breed flash at the top, SAS for mid-range data, and near-line SAS/SATA for bulk capacity, all complemented by DRAM caching to improve latency.
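How data lands on those tiers is usually decided by how often it is accessed. A toy placement rule, with thresholds that are pure assumptions (real auto-tiering engines use far richer heuristics):

    def choose_tier(reads_per_day):
        if reads_per_day > 1000:
            return "tier 1: flash/SSD"               # hot data earns the fastest media
        elif reads_per_day > 10:
            return "tier 2: SAS"                     # warm data sits mid-range
        else:
            return "tier 3: near-line SAS/SATA"      # cold data goes to bulk capacity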

It would have to use the most cost-effective technologies for each tier, give the flexibility to adopt those technologies regardless of vendor, and ensure that routine tasks like maintenance, upgrades, and decommissioning equipment at end-of-life did not leave the entire staff sitting around twiddling their thumbs because of an outage.

Is all of this possible with technologies easily accessible today? Yes, many times over, and the answer is a software layer such as DataCore that abstracts the features and functions away from the physical devices, giving you the ability to have a storage strategy rather than settling for a temporary solution every 3-5 years.