Cool Tech Preview: VMware’s distributed storage
Looking through VMware’s announcements at VMworld 2012, the one that stood out for me was vSAN, or (vCloud) Distributed Storage technology. From what I’ve seen at the VMworld sessions, the vSAN technology creates a “distributed storage layer” across ESX nodes in a cluster – yes, up to 32 of them. Disclaimer: Even though I work for EMC, I have NO further insight into this development, nor do I blog for EMC. These are my own thoughts and ideas.
Just a VSA on steroids or way more?
So what is this distributed storage technology? At first glance it would appear to be something much like the VSA, but its implementation would be more comparable to the distributed vSwitch approach. The distributed vSwitch is basically built out of local vSwitches, which project themselves as one single big switch that stretches across nodes.
Translating this to the storage layer, it makes sense that you would have local storage in each server (SSD and HDD), and the vSAN distributed storage layer would figure out how to mirror the data across the available nodes. On top of that, the distributed storage layer would live INSIDE the hypervisor, not bolted on via appliances. Interesting thought…
The big idea
VMware’s big idea around distributed storage is that a VM on a vSphere cluster would require storage and carry its own storage profile. This storage profile would contain figures for availability and performance. The distributed storage layer would then carve storage out of the global pool of local HDDs and SSDs, and mirror it across nodes depending on the requested availability. For performance, the distributed storage layer would optionally carve cache space out of the available SSDs.
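Just to make this concrete (purely my own illustration, nothing official from VMware), a placement decision driven by such a profile could look roughly like this; all names and fields here are made up:

```python
# Hypothetical sketch of how a per-VM storage profile could drive placement.
# None of these names come from VMware; they only illustrate the concept.
from dataclasses import dataclass

@dataclass
class StorageProfile:
    availability_nines: int   # e.g. 4 -> 99.99% requested availability
    ssd_cache: bool           # carve read/write cache out of SSD?

def mirror_count(profile: StorageProfile) -> int:
    # Assumption: higher availability simply maps to more mirror copies.
    return 3 if profile.availability_nines >= 5 else 2

def place_object(profile: StorageProfile, nodes_with_free_hdd: list[str]) -> list[str]:
    copies = mirror_count(profile)
    if len(nodes_with_free_hdd) < copies:
        raise RuntimeError("not enough nodes to satisfy the requested availability")
    # Each copy lands on a different node, never twice on the same one.
    return nodes_with_free_hdd[:copies]

placement = place_object(StorageProfile(availability_nines=5, ssd_cache=True),
                         ["esx01", "esx02", "esx03", "esx04"])
print(placement)   # ['esx01', 'esx02', 'esx03'] -> a triple mirror
```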
Interestingly, the distributed storage layer would deliver thousands of these “storage objects”, not just up to 256 LUNs. Although different, this is very much in line with the vVols architecture that was also officially announced at VMworld 2012.
Assuming each node (or most of them) brings local HDDs and SSDs into the cluster, the distributed storage layer could automagically claim these drives for use in the distributed storage layer. Alternatively, administrators could manually add drives to the pool.

Starting to use VMware Distributed Storage should be ridiculously easy: Step one, enable… Step two… Start to use. Scary.
This all sounds like a very simple yet effective solution for delivering storage to a vSphere cluster. However, there are some nasty details to consider… A storage environment like this is not built “just like that”…
Some possible issues
So it would appear to be the greatest thing since sliced bread: simple, effective, deeply integrated into the vSphere stack, fully profile-driven, self-remediating. But let us not forget how hard it is to build decent storage that works and keeps working. And I am not talking about the failure of a node or a disk; those are the easy ones that everyone can solve. But what about the “fringe cases”, which aren’t as rare as you might think: network failures between nodes, drives that “sort of die”, nodes that “jitter”. I have seen the most “exciting” things happen around various storage arrays, and those were built specifically JUST to deliver available, non-breaking storage.
Here are some interesting thoughts around the distributed storage approach VMware wants to take:
Bandwidth between nodes
As each node will apparently need to bring SSD into the cluster, and this SSD will be used for read and write caching, the network in between the nodes will have to be high in bandwidth and low in latency. I’d vote for InfiniBand, but I’m guessing VMware will be aiming at 10GbE for this. There will be a LOT of traffic going over these backend links, as VMs vMotion all over the place (and I’m guessing the storage will not follow the VM).
Caching: easy?
The proposed distributed storage solution would (if the profile of the workload requests it) use read AND write caching on SSDs. I can think this up in 30 seconds as well, but building it is another story. How do I determine which blocks go into cache? How do I determine how (and which) blocks to demote from cache? Will the solution try to keep the cache SSD local to the VM using it? How would this work after a vMotion? Would the cache move after or with the vMotion? These are all things that will determine how effective caching will be. One thing is for sure: these caching algorithms can be pretty complex, and the available space will determine how they should behave.
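To illustrate just one of these questions (which blocks to demote), here is a deliberately naive sketch of an LRU-style SSD read cache. Real caching layers are far more sophisticated, and nothing here reflects how VMware would actually build it:

```python
# Naive LRU read cache sketch: blocks are promoted on access and the
# least-recently-used block is demoted when the SSD capacity is exceeded.
from collections import OrderedDict

class SsdReadCache:
    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()   # block_id -> data

    def read(self, block_id, read_from_hdd):
        if block_id in self.blocks:           # cache hit: mark as most recently used
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        data = read_from_hdd(block_id)        # cache miss: fetch from spinning disk
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:  # demote the coldest block
            self.blocks.popitem(last=False)
        return data

cache = SsdReadCache(capacity_blocks=2)
cache.read("blk1", lambda b: f"data-{b}")
cache.read("blk2", lambda b: f"data-{b}")
cache.read("blk3", lambda b: f"data-{b}")     # demotes blk1
```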
A very good example is EMC DRAM caching versus FAST Cache solid state drive caching: from a distance they appear to be the same, but in detail they are very different. DRAM cache absorbs writes only to buffer them and flush them out to disk. FAST Cache, on the other hand, has no urge to flush cached data out to spinning disk; as its capacity is much bigger, it will try to keep ALL hot data in cache. These very different strategies have to be used in different scenarios (even worse: the strategy could actually flip as workloads change!): if the hot data has high locality, you’d want to keep that data in cache “forever”. If the hot data footprint is way larger than the available cache, you want (have) to flush. When to use which strategy? You’d need to measure this. Maybe I’d always want to run some workloads directly from SSD. That could be forced by setting the appropriate profile on the workload. But once you allow that, your SSDs would no longer just be caches; they would perform mixed services: cache and store. Not impossible, but a lot of work.
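As a thought experiment, a heuristic for picking between those two strategies could be as crude as comparing the hot working set to the available cache. This is my own simplification, not how DRAM cache or FAST Cache actually decide:

```python
# Hypothetical heuristic: choose a caching strategy per workload.
def choose_cache_strategy(hot_working_set_gb: float, cache_capacity_gb: float) -> str:
    # If the hot data fits comfortably in cache, keep it there ("FAST Cache-like").
    if hot_working_set_gb <= 0.8 * cache_capacity_gb:
        return "keep-hot"
    # Otherwise buffer writes and flush aggressively to disk ("DRAM cache-like").
    return "buffer-and-flush"

print(choose_cache_strategy(hot_working_set_gb=50, cache_capacity_gb=400))   # keep-hot
print(choose_cache_strategy(hot_working_set_gb=900, cache_capacity_gb=400))  # buffer-and-flush
```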
Failure of a disk
If a disk fails the hard way, vSphere will be able to detect this. The response would be to use the mirror(s) of the data and re-protect that data elsewhere, effectively copying the mirrors of the lost data to other drives in the same host or other hosts. Of course, the system would have to avoid placing mirrors on disks within a single node here, just as with the standard placement of these “storage objects”. But what happens if the disk returns false data? There is a reason that, for example, EMC uses 520-byte sectors to include an extra check, and that NetApp / ZFS store checksums in the branch blocks for data leaves located elsewhere, to make sure the data being read is actually the right data. It would be interesting to see if (and how) VMware plans to implement this in their distributed storage.
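A common defense against a disk returning false data is to keep a checksum away from the data itself and verify it on every read. A minimal sketch of that principle (not how EMC or ZFS implement it, just the idea) could look like this:

```python
# Sketch: detect silent corruption by keeping checksums separate from the data.
import hashlib

checksums = {}   # block_id -> sha256 digest, stored with the metadata, not the block

def write_block(store: dict, block_id: str, data: bytes) -> None:
    store[block_id] = data
    checksums[block_id] = hashlib.sha256(data).hexdigest()

def read_block(store: dict, block_id: str) -> bytes:
    data = store[block_id]
    if hashlib.sha256(data).hexdigest() != checksums[block_id]:
        # In a mirrored setup you would now read the other copy and repair this one.
        raise IOError(f"checksum mismatch on {block_id}: disk returned false data")
    return data

disk = {}
write_block(disk, "blk42", b"hello world")
disk["blk42"] = b"hello w0rld"          # simulate silent bit rot on the drive
try:
    read_block(disk, "blk42")
except IOError as err:
    print(err)                          # the false data is caught on read
```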
Isolation of a node
Another “cool” scenario is the isolation of a node. What happens if a node becomes isolated from the network but is still up and running? If a VM is running there using a local drive as well as a remote mirror (which would seem a realistic scenario), how would the system behave? Allow writes to that one local disk or not? And if yes, would the mirrored object still accept writes as well? That last one would create a split brain. I am not saying it cannot be solved properly, but it does show the amount of work that needs to be done to make things resilient. And what if we throw FT into the mix? Transparent failover to the non-isolated host(s) would be possible, but the behavior would have to be very well thought out.
Assuming the response is the correct one (for example by requiring a majority of nodes, or pre-programmed isolation responses), some VMs would possibly have to be powered off when isolation occurs. HA would restart the impacted VMs elsewhere. Distributed storage would respond by re-protecting the lost data on other disks in the distributed pool. And when the isolated node returns to the pool, it would have to be considered a new node, with no data on it… or really interesting things start to happen :O
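To show why a majority vote helps against split brain, here is a tiny sketch of a quorum check a node could run before accepting writes to a mirrored object. This is purely illustrative, not VMware’s actual isolation response:

```python
# Sketch: only accept writes when this node can see a majority of the cluster.
def can_accept_writes(reachable_nodes: set[str], all_nodes: set[str]) -> bool:
    # Majority quorum: strictly more than half of the nodes must be reachable
    # (including this node itself), otherwise we might be the isolated side.
    return len(reachable_nodes) > len(all_nodes) / 2

cluster = {"esx01", "esx02", "esx03", "esx04", "esx05"}
print(can_accept_writes({"esx01", "esx02", "esx03"}, cluster))   # True: majority side keeps writing
print(can_accept_writes({"esx04"}, cluster))                     # False: isolated node stops writes
```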
Adding a node
What about adding a node? Let’s assume I have a six-node setup, where all the disks in all of the nodes are heavily utilized in both capacity and performance. Now I add a seventh node. Its disks are empty and deliver zero IOPS. As I start to add new VMs, the distributed storage layer would start to use the empty space on this node, but it also NEEDS to have that data mirrored on the heavily utilized remaining nodes. This calls for a REBALANCING algorithm: when a node gets added, some of the storage objects would have to be copied over to the new node to rebalance the load. Again, far from impossible (look at how pools can now restripe on EMC VNXes, for example), but a lot of effort to get working properly.
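A first stab at such a rebalancing step could simply move objects from the fullest node to the emptiest one until the spread evens out. Real restriping is obviously much smarter about performance and mirror placement, so treat this only as a sketch with made-up numbers:

```python
# Naive rebalancing sketch: move objects from the fullest to the emptiest node
# until the capacity spread between them is small enough.
def rebalance(node_used_gb: dict[str, float], object_size_gb: float = 10.0) -> list[tuple[str, str]]:
    moves = []
    while True:
        fullest = max(node_used_gb, key=node_used_gb.get)
        emptiest = min(node_used_gb, key=node_used_gb.get)
        if node_used_gb[fullest] - node_used_gb[emptiest] <= object_size_gb:
            return moves
        node_used_gb[fullest] -= object_size_gb    # copy one object away...
        node_used_gb[emptiest] += object_size_gb   # ...to the freshly added node
        moves.append((fullest, emptiest))

usage = {"esx01": 900.0, "esx02": 880.0, "esx03": 910.0, "esx07": 0.0}   # new empty node
print(len(rebalance(usage)), "object moves needed to even out the pool")
```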
Removing a node
Removing a node is also something that would have to be dealt with. Today you can put a node into maintenance mode, which means all VMs (that are allowed to vMotion off) will vMotion off. No problem here. But I would probably also need to put the node into “storage maintenance” to evacuate all of the data off the node onto the remaining nodes, right? That could take some time, especially as drives keep growing in capacity and the impact on the remaining nodes has to be limited during the data move (you would not want your VMs to be impacted by this massive migration). Again, not impossible to do, but a lot of effort.
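A “storage maintenance” evacuation would essentially be a long-running, throttled copy job. A quick back-of-the-envelope (all numbers made up by me) shows why it could take quite a while:

```python
# Back-of-the-envelope sketch: how long evacuating a node could take when the
# copy rate is throttled to protect production I/O. All numbers are made up.
def evacuation_hours(data_on_node_tb: float, throttle_mb_per_sec: float) -> float:
    mb_to_move = data_on_node_tb * 1024 * 1024
    return mb_to_move / throttle_mb_per_sec / 3600

# 24 TB of local data (say 6x 4TB drives), throttled to 200 MB/s:
print(f"{evacuation_hours(24, 200):.0f} hours")   # roughly 35 hours
```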
Node maintenance
One of the superb things about virtualization today is being able to move your workloads without downtime. A vSphere node can be evacuated of VMs in a matter of minutes. After that, you can bring the node down, upgrade it, boot it again. It joins the cluster, and DRS rebalances the load. Great!
But now let’s put distributed storage into this mix. I put a node into maintenance mode, and the VMs evacuate. But I cannot just shut down the node now, because there is a lot of data on this particular node. I see two options:
- You enter the node into “storage maintenance” and the data is migrated to remaining nodes;
- You shut down the node and live with the lower availability for the time being.
The first option is the easiest one to build (it is basically “removing a node”, followed by a rebalance of the pool when it returns). But it would take ages if I have 6x 4TB drives in there… The alternative (option 2) is a nasty one. It seems easier to build (just skip re-protecting a node in maintenance mode that goes down temporarily), but in fact this option is difficult to build as well: as the node comes out of maintenance, you’d have to update all mirrored objects on that node to make sure the mirrors become consistent again.
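One way to limit the pain of option 2 would be to track which regions were written while the node was away, and only resynchronize those. Below is a minimal sketch of such a “dirty region log”, purely to illustrate the bookkeeping involved (not something VMware has announced):

```python
# Sketch: track writes that happened while a mirror copy was offline, so only
# the dirty regions need to be copied when the node returns from maintenance.
class DirtyRegionLog:
    def __init__(self):
        self.dirty_blocks: set[int] = set()

    def record_write(self, block_id: int) -> None:
        # Called for every write that could not reach the offline mirror.
        self.dirty_blocks.add(block_id)

    def resync(self, copy_block) -> int:
        # When the node is back: copy only the blocks written in the meantime.
        for block_id in sorted(self.dirty_blocks):
            copy_block(block_id)
        resynced = len(self.dirty_blocks)
        self.dirty_blocks.clear()
        return resynced

log = DirtyRegionLog()
for blk in (10, 11, 99):          # writes that landed while the node was down
    log.record_write(blk)
print(log.resync(lambda b: None), "blocks resynchronized")   # 3
```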
Another risk here: what happens if a second node fails during this “option 2” maintenance? Indeed. You’d lose data. This is an option that gets real scary real quick, and if this is the way things will work you’d probably aim for triple mirroring.
In turn, triple mirroring would require each write to go to not two but three places, impacting performance even more (and approaching a RAID5-like impact on writes!), especially if the writes have to go to spinning disks.
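For a feeling of that write penalty: a two-way mirror turns every front-end write into two backend writes, a triple mirror into three, and a classic RAID5 small write into four backend I/Os (two reads plus two writes). A quick calculation:

```python
# Back-of-the-envelope: backend I/Os generated per front-end write.
write_penalty = {
    "2-way mirror": 2,         # write both copies
    "3-way mirror": 3,         # write all three copies
    "RAID5 (small write)": 4,  # read data + read parity, write data + write parity
}
frontend_write_iops = 5000
for layout, penalty in write_penalty.items():
    print(f"{layout}: {frontend_write_iops * penalty} backend I/Os per second")
```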
Harder than it looks
As we have seen above, building distributed storage is hard. Very hard. There are SO many things to consider, to pre-plan, to pre-program. However, if VMware pulls this off it would be a serious alternative to midrange storage arrays (strangely enough, I would not comfortably propose distributed storage against an EMC VMAX 😉 ).
Another interesting thought: as distributed networking has the Cisco Nexus 1000V switch as an option, could distributed storage have storage vendor plugins? Could I have a storage-vendor-delivered software NAS/SAN in there? Vendors like Nexenta, but also the big vendors like EMC and NetApp, immediately come to mind here. A software NetApp cluster, or an EMC Isilon (vIsilon?)?
Looking at the history of EMC technology in VMware
Now I did it. I used the I(silon) word. EMC’s Isilon would actually be a VERY TIGHT match for VMware’s proposed distributed storage implementation. Isilon, too, uses nodes with local disks to form one big (file)system that grows on and on (apparently forever: 16PB, yes that is petabytes, in a single filesystem is actually very possible today). It would be child’s play to carve “storage objects” out of that one big flexible store.
Let’s look at a small comparison, focusing on VDR (VMware Data Recovery). Anyone ever used version 1.0? Anyone completely happy with it? Even the latest version wasn’t all that stable, imho. Now VMware has a replacement product: VDP (vSphere Data Protection). So what is under the covers? Exactly. EMC’s Avamar technology. Yes, the industry-leading dedupe from Avamar.
I very much hope VMware has learned from the above, and will not try to build distributed storage from scratch, but will instead talk to leading vendors about delivering some of their technology into the distributed storage product. The first one I would think of is EMC’s Isilon. It appears to be made for something like this. Today, Isilon may not be the greatest solution for low-latency, write-intensive workloads, but tomorrow it will get a lot better at handling small-block, low-latency writes.
Funny thing, Chad Sakac already showed a virtualized Isilon, where VMware workloads can run on the actual Isilon cluster:
In this demo, the Isilon nodes are said to be running vSphere, with Isilon as a virtual appliance running on top of that, together with other virtual workloads. It would be a realistic thought, though, to not have Isilon run ON the vSphere node as an appliance, but INSIDE it, which would make it a solution very comparable to the proposed VMware distributed storage.
Tying all this technology together, it becomes an almost logical suggestion for Isilon (code) to be used in a vSphere-based distributed storage solution. The plot thickens 🙂
How the technology would work (as seen from the outside)
Let’s imagine you have x nodes (where x is 3 to 32), each having some form of local storage, a mix of large HDDs and fast SSDs. Now you go into vCenter and enable “distributed storage” on the vSphere cluster. vCenter responds by enabling distributed storage on each vSphere node. The nodes then (auto)claim the local storage on each node and add themselves to the distributed storage layer one by one. When done, we have a fully virtualized storage cluster up and running, utilizing the local HDDs of the individual nodes plus their SSD caching drives!
Next, the cluster delivers shared storage out of the distributed storage system as storage objects for VMs to use. Depending on the storage profile you gave the VMs, they will get a mirror or a triple mirror of HDD space, and optionally R/W cache space on SSD drives somewhere in the cluster. Cool technology for sure.
You can watch a video demonstration of the tech preview:
Interestingly, this video also shows the number of GBytes available from the distributed storage layer. In fact, it would depend on how much is actually usable: as the VM profiles might demand four nines or five nines, the objects created might be mirrors or triple mirrors. That would impact the true amount of free space, so I’m assuming the free space shown here is in raw GBytes.
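To illustrate why raw versus usable matters: the same raw pool yields very different usable capacity depending on how many objects end up triple mirrored. A quick hypothetical calculation (my numbers, not VMware’s):

```python
# Hypothetical: usable capacity of a raw pool under different mirror mixes.
def usable_gb(raw_gb: float, fraction_triple_mirrored: float) -> float:
    # Triple-mirrored data consumes 3x raw space, double-mirrored data 2x.
    avg_overhead = 3 * fraction_triple_mirrored + 2 * (1 - fraction_triple_mirrored)
    return raw_gb / avg_overhead

raw = 32 * 6 * 4096.0   # e.g. 32 nodes with 6x 4TB drives each (made-up numbers)
print(f"{usable_gb(raw, 0.0):.0f} GB usable if everything is 2-way mirrored")
print(f"{usable_gb(raw, 1.0):.0f} GB usable if everything is 3-way mirrored")
```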
There is also a good read to be found at the VMware CTO office on the subject: A preview of distributed storage.
My take on the vCloud DSL
I seriously hope that VMware has been using code, or at least insights and ideas, from storage vendors before starting to build something like a distributed storage system. I think VMware cannot afford to have this kind of product fail at launch. That said, it would be SO cool if they pull this off. I’m hoping they will. Because it is cool, and because it makes sense. If done right.
With this much complexity in a system that appears to be so simple, it will be hard to launch a v1.0 product that does it all. My expectation would be that this product, like any, will grow in features over time. And there is nothing wrong with that… Instead of continuously adding features to an unfinished product, it is much better to first finish v1.0 in a decent manner, and only then start on the newly thought-up features for v2.0.
But I am getting WAY ahead here. As it stands now, this is a TECH PREVIEW. We might never see an actual product launch. On the other hand, this idea is too cool not to. In true James T. Kirk style: