Data virtualization surfaces

There’s a new storage startup out of stealth called Primary Data, and it’s implementing data (note, not storage) virtualization.

They already have $60M in funding and some pretty high-powered talent from Fusion-io, namely David Flynn, Rick White and Steve Wozniak (the ‘Woz’, also of Apple fame).

There have been a number of attempts at creating a virtualization layer for data, namely ViPR (see my post ViPR virtues, vexations but no storage virtualization), but Primary Data is taking a different tack on the problem.

Data virtualization explained

[Graphic: data hypervisor, software defined storage, data plane, control plane. (c) 2012 Silverton Consulting, Inc. All rights reserved]

Essentially they want to separate the data plane from the control plane (see my Data Hypervisor post and its comments for another view on this).

  • The data plane consists of those storage system activities that actually perform IO, i.e., reads and writes.
  • The control plane consists of those storage system activities that do everything else a storage system has to do, including provisioning, monitoring, and managing the storage.
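
To make the split concrete, here’s a minimal sketch (in Python, with hypothetical interface names that have nothing to do with Primary Data’s actual APIs) of how the two planes might be carved up:

```python
from abc import ABC, abstractmethod

class DataPlane(ABC):
    """The data plane only moves bits: it performs the actual reads and writes."""

    @abstractmethod
    def read(self, obj_id: str, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, obj_id: str, offset: int, data: bytes) -> None: ...

class ControlPlane(ABC):
    """The control plane does everything else: provisioning, monitoring
    and managing the storage that sits behind the data plane."""

    @abstractmethod
    def provision(self, size_bytes: int, policy: dict) -> str: ...  # returns an obj_id

    @abstractmethod
    def locate(self, obj_id: str) -> dict: ...   # where (and how) the object is stored

    @abstractmethod
    def monitor(self, obj_id: str) -> dict: ...  # health and performance stats
```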

Separating the data plane from the control plane offers a number of advantages. EMC ViPR does this, but its data plane is either standard storage systems like VMAX, VNX, Isilon, etc., or software defined storage solutions. Primary Data wants to do it all.

Their metadata or control plane engine is called a Data Director. It holds information about the data objects stored in the Primary Data system, runs a data policy management engine and handles data migration.

Primary Data relies on purpose-built Data Hypervisor (client) software that talks to Data Directors to understand where data objects reside and how to go about accessing them. But once that metadata is transferred to the client software, IO activity can go directly between the host and the storage system in a protocol-independent fashion.
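
Here’s a rough sketch of that access flow (hypothetical class and method names, not Primary Data’s actual client API): the client goes to a Data Director once per object for layout metadata, then all subsequent IO bypasses the Director entirely.

```python
class DataHypervisorClient:
    """Host-resident client: asks the metadata engine where an object lives,
    caches the answer, then performs IO directly against the storage."""

    def __init__(self, data_director):
        self.director = data_director   # control plane / metadata engine
        self.layout_cache = {}          # obj_id -> {"backend": ..., "location": ...}

    def _layout(self, obj_id):
        # Only the first access per object touches the Data Director.
        if obj_id not in self.layout_cache:
            self.layout_cache[obj_id] = self.director.locate(obj_id)
        return self.layout_cache[obj_id]

    def read(self, obj_id, offset, length):
        layout = self._layout(obj_id)
        # Direct host-to-storage IO; the Data Director is out of the data path.
        return layout["backend"].read(layout["location"], offset, length)

    def write(self, obj_id, offset, data):
        layout = self._layout(obj_id)
        layout["backend"].write(layout["location"], offset, data)
```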

[The graphic above is from my prior post and I assumed the data hypervisor (DH) would be co-located with the data but Primary Data has rightly implemented this as a separate layer in host software.]

Data Hypervisor protocol independence?

As I understand it, this means that customers could use file storage, object storage or block storage to support any application requirement. It also means that file data (objects) could be migrated to block storage and still be accessed as file data. The converse is also true, i.e., block data (objects) could be migrated to file storage and still be accessed as block data. Add object, DAS, PCIe flash and cloud storage to the mix and you can see where they are headed.

All data in Primary Data’s system is object encapsulated, and all data objects are catalogued within a single, global namespace that spans file, block, object and cloud storage repositories.

Data objects can reside on Primary Data storage systems, external non-Primary-Data-aware file or block storage systems, DAS, PCIe flash, and even cloud storage.
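
A toy illustration of what such a global namespace might look like (the catalog structure and field names below are mine, purely for illustration):

```python
# One catalog entry per data object, regardless of which class of storage
# currently holds its bytes.
catalog = {
    "finance/ledger.db": {"backend": "block",  "target": "array-A", "lun": 42},
    "media/video-0001":  {"backend": "object", "target": "s3://bucket/video-0001"},
    "home/alice/report": {"backend": "file",   "target": "nas-1:/exports/home/alice/report"},
    "scratch/tmp-7f3a":  {"backend": "das-flash", "target": "host-17:/dev/nvme0n1"},
}

def migrate(catalog, obj_id, new_backend, new_target):
    """Move an object to a different class of storage. Clients keep using
    the same obj_id; only the catalog entry (and the bytes) move."""
    # ...copy the data to the new target, then atomically repoint the entry...
    catalog[obj_id] = {"backend": new_backend, "target": new_target}
```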

How does Data Virtualization compare to Storage Virtualization?

There are a number of differences:

  1. Most storage virtualization solutions are in the middle of the data path and because of this have to be fairly significant, highly fault-tolerant solutions.
  2. Most storage virtualization solutions don’t have a separate and distinct meta-data engine.
  3. Most storage virtualization systems don’t require any special (data hypervisor) software running on hosts or clients.
  4. Most storage virtualization systems don’t support protocol independent access to data storage.
  5. Most storage virtualization systems don’t support DAS or server-based PCIe flash for permanent storage. (Primary Data doesn’t support this in its first release either, but the intent is to support it soon.)
  6. Most storage virtualization systems support internal storage that resides directly inside the storage virtualization system hardware.
  7. Most storage virtualization systems support an internal DRAM cache layer which is used to speed up IO to internal and external storage and is in addition to any caching done at the external storage system level.
  8. Most storage virtualization systems only support external block storage.

There are a few similarities as well:

  1. They both manage data migration in a non-disruptive fashion.
  2. They both support automated policy management over data placement, data protection, data performance, and other QoS attributes.
  3. They both support multiple vendors of external storage.
  4. They both can support different host access protocols.

Data Virtualization Policy Management

A policy engine runs in the Data Directors and provides SLAs for data objects. This would include performance attributes, protection attributes, security requirements and cost requirements.  Presumably, policy specifications for data protection would include RAID level, erasure coding level and geographic dispersion.

In Primary Data, backup becomes nothing more than object snapshots with different protection characteristics, like an offsite full copy. Moreover, data object migration can be handled completely outboard, without disrupting data access, and on an automated, policy-driven basis.
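
As a rough illustration (the attribute names are mine, not Primary Data’s), a per-object SLA and a toy placement decision against it might look something like this:

```python
# Hypothetical per-object SLA / policy specification.
gold_policy = {
    "performance": {"max_latency_ms": 2, "min_iops": 50_000},
    "protection":  {"scheme": "erasure", "data": 8, "parity": 3,
                    "geo_dispersion": ["us-east", "us-west"]},
    "security":    {"encrypt_at_rest": True},
    "cost":        {"max_usd_per_gb_month": 0.25},
    "backup":      {"snapshot_interval_min": 60, "offsite_full_copy": True},
}

def placement_for(policy, backends):
    """Pick the cheapest backend that satisfies the policy's performance and
    cost requirements (a toy stand-in for a real policy engine)."""
    perf = policy["performance"]
    candidates = [b for b in backends
                  if b["latency_ms"] <= perf["max_latency_ms"]
                  and b["iops"] >= perf["min_iops"]
                  and b["usd_per_gb_month"] <= policy["cost"]["max_usd_per_gb_month"]]
    return min(candidates, key=lambda b: b["usd_per_gb_month"], default=None)
```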

Primary Data first release

Primary Data will initially be deployed as an integrated data virtualization solution which includes an all-flash NAS storage system and a standard NAS system. Over time, Primary Data will add support for non-Primary Data external storage and internal storage (DAS, SSD, PCIe flash).

The Data Policy Engine and Data Migrator functionality will be charged for separately as software options. Data Directors are sold in pairs (active-passive) and can be non-disruptively upgraded. Storage (directors?) is also sold separately.

Data Hypervisor (client) software is available for most flavors of Linux and OpenStack, with ESX support coming. Windows SMB support is not split (control plane/data plane) yet, but Primary Data does support SMB. I believe the Data Hypervisor software will also be released in an upcoming version of the Linux kernel.

They are currently in beta testing. There’s no official GA date yet, but they did say they would announce pricing in 2015.

~~~~

Comments?

Disclosure: We have done work for Primary Data over the past year.

Photo Credits:

  1. Screen shot of beta test system supplied by Primary Data
  2. Graphic created by SCI for prior Data Hypervisor post

The promise of software defined storage

[Graphic: data hypervisor, software defined storage, data plane, control plane. (c) 2012 Silverton Consulting, Inc. All rights reserved]

Not sure why, but all the hype around software defined storage seems to be reaching a crescendo. Possibly it’s due to conference season coming up, but it started earlier this year. I attended an SNW analyst session on software defined storage that had technical people from HDS, IBM, DataCore and VMware on its panel. It seems the distinction between storage virtualization and software defined storage gets slimmer every time we talk about it. I have written before about software defined storage (see my Data Hypervisor post).

Server, networking and storage virtualization today

Server virtualization makes an awful lot of sense, has made lots of money and has arguably been around for decades now, especially in mainframe systems. Servers have so much power today that dedicating one to a single workload just doesn’t make sense anymore.

Network virtualization from OpenFlow and others also makes a lot of sense (see my OpenFlow the next wave in networking and OpenFlow part 2, Cisco’s response posts). Here we aren’t necessarily boosting network utilization so much as changing resource allocation to deal with altered traffic flows. That, and the fact that provisioning, monitoring and other management characteristics can now be under programmatic control by the user, makes these systems very appealing, especially to organizations whose network activity varies over time.

Storage virtualization has been around for a long time too and essentially places a storage system abstraction layer on top of a group of other, heterogeneous storage systems. This provides a number of capabilities, such as allowing data to be migrated from one storage system to another without host knowledge or intervention. Other storage virtualization features include centralized management, common storage features, different storage personalities (protocols), etc. But just being able to migrate data from one storage system to another without host intervention or knowledge provides an awful lot of value, especially to large data centers that refresh technology frequently.

Software defined storage compared to server virtualization

Software defined storage seems to imply some ability to marry storage virtualization services to RESTful and other APIs, which would allow programmatic storage provisioning, monitoring and management. This would let data centers manage and control their storage without involving storage administrators in day-to-day activities.
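
For example (against a hypothetical REST endpoint, not any particular vendor’s API), provisioning and monitoring might collapse into a couple of HTTP calls:

```python
import requests

BASE = "https://sds.example.com/api/v1"   # hypothetical SDS management endpoint

# Programmatic provisioning: a 500 GiB thin-provisioned volume with a QoS cap.
vol = requests.post(f"{BASE}/volumes", json={
    "name": "app-db-01",
    "size_gib": 500,
    "thin": True,
    "qos": {"max_iops": 20_000},
}).json()

# Programmatic monitoring: poll utilization without involving a storage admin.
stats = requests.get(f"{BASE}/volumes/{vol['id']}/stats").json()
print(stats["used_gib"], stats["iops"], stats["latency_ms"])
```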

When I compare this to server virtualization, the capabilities described above really don’t increase storage utilization much. Yes, by automating provisioning or even running thin provisioning one can potentially boost storage capacity utilization, but you really haven’t increased IO utilization much by doing this.

Looking under the covers of most storage systems, one might find that CPU cores are pretty idle, but data paths and storage devices are typically running flat out. One problem is that today’s enterprise storage subsystems are already highly shared across applications and users, so there is really no barrier to sharing these resources as widely as possible. As such, storage system IOPS and/or bandwidth utilization is already pretty high. I would say a typical enterprise application environment’s storage subsystem usually runs above 30% utilization, reaching 50% or more during peak periods. Increasing IOPS utilization much beyond that risks seriously impacting peak performance periods.

Now if one could somehow migrate colder data to lower-performing storage when there’s no need for high performance, and hotter data to higher-performing storage when there is, that could help increase performance utilization considerably. But many storage systems already do this internally through automated storage tiering, and some can even do it across storage systems using storage virtualization.
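
A toy sketch of such a cross-system tiering pass (my own simplification; real automated storage tiering works on finer-grained extents and longer observation windows):

```python
def retier(objects, hot_tier, cold_tier, hot_iops_threshold=500):
    """Promote busy objects to the fast tier, demote idle ones to the slow
    tier. Returns the list of (object_id, from_tier, to_tier) moves."""
    moves = []
    for obj in objects:
        if obj["recent_iops"] >= hot_iops_threshold and obj["tier"] != hot_tier:
            moves.append((obj["id"], obj["tier"], hot_tier))
        elif obj["recent_iops"] < hot_iops_threshold and obj["tier"] != cold_tier:
            moves.append((obj["id"], obj["tier"], cold_tier))
    return moves

# e.g. retier([{"id": "db1", "tier": "nl-sas", "recent_iops": 4000},
#              {"id": "archive", "tier": "flash", "recent_iops": 3}],
#             hot_tier="flash", cold_tier="nl-sas")
# -> [("db1", "nl-sas", "flash"), ("archive", "flash", "nl-sas")]
```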

But the underlying problem here is that it takes a lot of time, resources and effort to move TBs of data around a data center, especially when it’s doing other work. So other than something akin to storage tiering across storage systems, we are unlikely to see much increase in storage performance utilization from a gaggle of multiple storage systems. I suppose in the future moving TBs of data may take much less time and fewer resources than it does today, but then the problem becomes moving PBs of data around.

Software defined storage compared to network virtualization

When I compare the above capabilities to network virtualization, it doesn’t look very similar. There’s really no way to change storage performance to optimize it for one application at this instant and then shift that performance to another application a couple of hours later. Yes, again, automated storage tiering can do this, and yes, some of these systems can tier across storage systems using storage virtualization, but in general, barring storage tiering, there’s nothing like this available today.

Maybe if, inside a storage system, the data paths could somehow be programmatically reconfigured to offer, say, more internal bandwidth to the device-to-cache path vs. the cache-to-frontend path. Changing or reconfiguring data path resources like this could certainly optimize the internal performance of a storage system, and this would be a worthwhile feature of any software defined storage. Knowing which path is more important to one application and less important to all the others will take some smarts across the storage system and host O/S, but it’s certainly feasible. So, with RESTful interfaces, APIs or application hints, data paths could be reconfigured on demand to support applications that are all vying for IO activity.
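
A sketch of what such hint-driven reconfiguration could look like (entirely hypothetical; no storage system I know of exposes an interface like this today):

```python
def rebalance_paths(total_internal_gbps, hints):
    """hints maps an application to the internal path it cares about most,
    e.g. {"db-oltp": "frontend", "backup": "backend"}. Split the internal
    bandwidth between the cache-to-frontend and device-to-cache paths in
    proportion to the hints."""
    frontend_votes = sum(1 for path in hints.values() if path == "frontend")
    backend_votes = len(hints) - frontend_votes
    total_votes = max(frontend_votes + backend_votes, 1)
    return {
        "cache_to_frontend_gbps": total_internal_gbps * frontend_votes / total_votes,
        "device_to_cache_gbps":   total_internal_gbps * backend_votes / total_votes,
    }

# rebalance_paths(64, {"db-oltp": "frontend", "etl": "backend", "backup": "backend"})
# -> roughly 21 Gbps to the frontend path and 43 Gbps to the backend path
```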

With these sorts of capabilities software defined storage starts to look a little more like software defined networking.

Software defined storage on its own

But in the end we always reach a fundamental limit of IO capabilities in today’s storage systems which is the devices. Yes you can have 2000 or more devices in high-end storage  today and yes you can have all-flash arrays. However, most storage systems are configured to keep whatever devices they have pretty busy as much of the time as possible.

Until we create some sort of storage device that can provide more performance than most applications can ever use, even when they are shared via a storage system, software defined storage capabilities will be limited.  Today’s SSDs have certainly boosted performance considerably but this just means that most applications that warrant all flash arrays are performing faster.  It just so happens that some applications can take all the performance you throw at them and still want more.

I suppose if SSD costs were to come down to match NL-SAS storage prices while maintaining their 100X faster IOP rate, then maybe a storage system built on such devices could be more “software defined” than others. And maybe that’s where everyone is headed, believing NAND/SSD price trends will drive costs down so much that everyone can have all the IOPS performance they will ever need out of a single storage system.

Yet this still just looks like the shared storage we have today, only more of it. So we return to our roots and see that software defined storage is just another way to add more storage sharing. Storage virtualization is nice, newer more programmatic storage systems are even better, but faster, cheaper storage devices are best of all.

So what we really need is much cheaper SSDs to realize the full promise of software defined storage. In the meantime, opening up APIs and providing RESTful interfaces for programmatic provisioning, monitoring, management and tuning of storage system data paths and other performance characteristics is all we can hope for.

Comments?


Data hypervisor

[Graphic: (c) 2012 Silverton Consulting, Inc. All rights reserved]

With all this talk of software defined networking and server virtualization, where does storage virtualization stand? I blogged about some problems with storage virtualization a week or so ago in my post Storage Utilization is broke, and this post takes it to the next level. Also, I was at a financial analyst conference this week in Vail where I heard Mark Lewis of Tekrocket (formerly of EMC) discuss the need for a data hypervisor to provide software defined storage.

I now believe what we really need for true storage virtualization is a renewed focus on data hypervisor functionality.  The data hypervisor would need both a control plane and a data plane in order to function properly.   Ideally the control plane would set up the interface and routing for the data plane hardware and the server and/or backend storage would be none the wiser.

DMs everywhere

I envision a scenario where a customer’s application data is packaged with a data hypervisor which runs on commodity data switch hardware with data plane and control plane software running on it, sort of creating (virtual) data machines or DMs.

All enterprise and, nowadays, most midrange storage systems provide most of the functionality of a storage control plane, such as defining units of storage, setting up physical-to-logical storage mapping, and monitoring and managing the physical storage layer. So control planes are pervasive in today’s storage, but proprietary.

In addition most storage systems have data plane functionality which operates to connect a host IO request to the actual data which resides in backend storage or internal cache.  But again although data planes are everywhere in storage today they are all proprietary to a specific vendor’s storage system.

Data switch needed

But in order to utilize a data hypervisor and create a more general purpose control plane layer, we need a more generic data plane layer that operates on commodity hardware. This is different from today’s SAN storage switches or DCB switches, but similar in some ways.

The data switch/data plane layer would take routing instructions from the control plane layer and direct each server IO request to the proper storage unit. Somewhere in this world view, probably at the data plane level, it would also introduce data protection services like RAID or other erasure coding schemes, point-in-time copy/clone services, replication services and other advanced storage features needed by enterprise storage today.

It would also need to provide some automated storage movement across and within tiers of physical storage, and it would connect server storage interfaces at the front end to storage interfaces at the backend. Not unlike SAN or DCB switches, but with much more advanced functionality.
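
A minimal sketch of that split (my own made-up classes, assuming a simple mirroring protection service at the data plane level):

```python
class DataSwitch:
    """Toy data plane: routes each IO to backend storage according to a
    routing table pushed down by the control plane, applying a protection
    service (here, simple mirroring) along the way."""

    def __init__(self):
        self.routes = {}   # volume_id -> {"backends": [...], "protection": ...}

    def program_route(self, volume_id, backends, protection="mirror"):
        # Called by the control plane; servers and backends are none the wiser.
        self.routes[volume_id] = {"backends": backends, "protection": protection}

    def write(self, volume_id, lba, data):
        route = self.routes[volume_id]
        targets = route["backends"] if route["protection"] == "mirror" else route["backends"][:1]
        for backend in targets:          # write every mirror leg
            backend.write(lba, data)

    def read(self, volume_id, lba, length):
        # Read from the first (primary) leg.
        return self.routes[volume_id]["backends"][0].read(lba, length)
```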

Ideally, data switch storage interfaces could attach to dedicated JBODs and flash arrays as well as systems using DAS storage. In addition, it would be nice if the data switch could talk to real storage arrays on SANs, IP/SANs, or NFS and CIFS/SMB storage systems.

The other thing one would like out of a data switch is support for a universal translator that would map one protocol to another, such as iSCSI to SAS, NFS to FC, or FC to NFS and any other combination, depending on the needs of the server and the storage in the configuration.
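
As one concrete (and hypothetical) example of what such a translation might involve, here’s a sketch that maps an NFS file read onto block reads against a SAS/FC/iSCSI backend, using a file-to-extent map the control plane would maintain:

```python
SECTOR = 512

def nfs_read_to_block_ops(path, offset, length, extent_map):
    """extent_map[path] is an ordered list of (file_offset, lba, extent_len)
    tuples describing where each stretch of the file's bytes lives on the
    block backend. Returns the block reads needed to satisfy the request."""
    ops = []
    for file_off, lba, ext_len in extent_map[path]:
        lo = max(offset, file_off)
        hi = min(offset + length, file_off + ext_len)
        if lo < hi:  # this extent overlaps the requested byte range
            ops.append({"op": "read",
                        "lba": lba + (lo - file_off) // SECTOR,
                        "sectors": -(-(hi - lo) // SECTOR)})  # ceiling division
    return ops
```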

Now if the data switch were built on top of commodity x86 hardware and software, with the data switch as just a specialized application, that would create the underpinnings for a true data hypervisor with a control plane and a data plane that could be independent and use anybody’s storage.

Data hypervisor

Assuming all this were available, we would have true storage virtualization. With these capabilities, storage could be repurposed on the fly, added to, subtracted from, and in general be a fungible commodity, not unlike server processing MIPS under VMware or Hyper-V.

Application data would then need to be packaged into a data machine which would offer all the host services required to support host data access. The data hypervisor would handle the linkages required to interface with the control and data layers.
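
Purely as speculation, a data machine package might declare something like the following: the data set it wraps, the host access services it must present, and the policy the data hypervisor should enforce underneath.

```python
# Speculative sketch of a "data machine" (DM) manifest; none of these
# field names come from any real product.
dm_manifest = {
    "name": "erp-db-dm",
    "data": ["vol/erp-data", "vol/erp-logs"],
    "host_services": {"protocol": "iscsi", "targets": 2, "multipath": True},
    "policy": {
        "performance": {"min_iops": 30_000},
        "protection":  {"scheme": "raid6", "offsite_copy": True},
        "placement":   {"preferred_tier": "flash", "allow_cloud": False},
    },
}
```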

Applications could be configured to utilize available storage with ease, and storage could grow, shrink or move to accommodate the required workload just as easily as VMs can be deployed today.

How we get there

Aside from the VMware, Citrix and Microsoft thrusts toward virtual storage, there are plenty of storage virtualization solutions that can control most backend enterprise SAN storage. However, the problem with these solutions is that, in general, they execute only on a specific vendor’s hardware and don’t necessarily talk to DAS or JBOD storage.

In addition, not all of the current generation of storage virtualization solutions are unified. That is, most of them today only talk FC, FCoE or iSCSI and don’t support NFS or CIFS/SMB.

These don’t appear to be insurmountable obstacles and with proper allocation of R&D funding, could all be solved.

The more problematic issue, however, is that none of these solutions operate on commodity hardware or commodity software.

The hardware is probably the easiest to deal with. Today many enterprise storage systems are built on top of x86-based storage controllers, albeit sometimes with specialized packaging for redundancy and high availability.

The harder problem may be commodity software. Although the genesis of a few storage virtualization systems may have been BSD or other “commodity” operating systems, they have been modified over the years to the point where they no longer resemble anything that can run on a standard, off-the-shelf operating system.

Then again some storage virtualization systems started out with special home grown hardware and software. As such, converting these over to something more commodity oriented would be a major transition.

But the challenge is how to get there from here, and whether anyone would want to take this on. The other problem is that the value add storage vendors currently supply would be somewhat eroded, not unlike what happened to proprietary Unix systems with the advent of VMware.

But this will not take place overnight, and the company that takes this on and makes a go of it could have a significant software monopoly that would be hard to crack.

Perhaps it will take a startup to do this but I believe the main enterprise storage vendors are best positioned to take this on.

Comments?