
Linux and object storage devices


By Jonathan Corbet
November 4, 2008
The btrfs filesystem is widely regarded as being the long-term future choice for Linux. But what if btrfs is taking the wrong direction, fighting an old war? If the nature of our storage devices changes significantly, our filesystems will have to change as well. A lot of attention has been paid to the increasing prevalence of flash-based devices, but there is another upcoming technology which should be planned for: object storage devices (OSDs). The recent posting of a new filesystem called osdfs provides a good opportunity to look at OSDs and how they might be supported under Linux.

The developers of OSDs were driven by the idea that traditional, block-based disk drives offer an overly low-level interface. With contemporary hardware, it should be possible to push more intelligence into storage devices, offloading work from the host while maintaining (or improving) performance and security. So the interface offered by an OSD does not deal in blocks; instead, the OSD provides "objects" to the host system. Most objects will simply be files, but a few other types of objects (partitions, for example) are supported as well. The host manipulates these objects, but need not (and cannot) concern itself with how those objects are implemented within the device.

A file object is identified by two 64-bit numbers. It contains whatever data the creator chooses to put in there; an OSD does not interpret the data in any way. Files also have a collection of attributes and metadata; this includes much of the information stored in an on-disk inode in a traditional filesystem - but without the block layout information, which the OSD hides from the rest of the world. All of the usual operations can be performed on files - reading, writing, appending, truncating, etc. - but, again, the implementation of those operations is handled by the OSD.
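
For illustration only, a host-side representation of that addressing might look like the sketch below; the structure and field names are invented for this article, not taken from the T10 specification:

    #include <stdint.h>

    /* Hypothetical host-side handle for an OSD file object.  The real T10
     * command descriptor blocks carry more information, but the addressing
     * is the same: a 64-bit partition ID plus a 64-bit object ID. */
    struct osd_object_id {
            uint64_t partition_id;   /* partition object holding the file */
            uint64_t object_id;      /* the file object within that partition */
    };

    /* The host asks for a byte range within an object; it never sees (or
     * names) the blocks the device uses to store that range. */
    struct osd_read_request {
            struct osd_object_id obj;
            uint64_t offset;         /* byte offset within the object */
            uint64_t length;         /* number of bytes to transfer */
            void *buffer;            /* where the data should land */
    };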

One thing that is not handled by the OSD, though, is the creation of a directory hierarchy or the naming of files. It is expected that the host filesystem will use file objects to store its directory structure, providing a suitable interface to the filesystem's users. One could, presumably, also use an OSD as a sort of hardware-implemented object database without a whole lot of high-level code, but that is not where the focus of work with OSDs is now.

The OSD protocol [PDF] is a T10-sanctioned extension to the SCSI protocol. It is thus expected that OSD devices will be directly attached to host systems; the protocol has been designed to perform well in that mode. It is also expected, though, that OSDs will be used in network-attached storage environments. For such deployments, the OSD designers decided to offload another task from the host systems: security. To that end, the OSD protocol includes an extensive set of security-related commands. Every operation on an object must be accompanied by a "capability," a cryptographically-signed ticket which names the object and the access rights possessed by the owner of the capability. In the absence of a suitable capability, the drive will deny access.

It is expected that capabilities will be handed out by a security policy daemon running somewhere on the network. That daemon may be in possession of the drive's root key, which allows unrestricted access to the drive, or it may have a separate, partition-level key instead. Either way, it can use that key to sign capabilities given out to processes elsewhere in the system. (Drives also have a "master" key, used primarily to change the root key. Loss of the master key is probably a restore-from-backup sort of event.)

Capabilities last for a while (they include an expiration time) and describe all of the allowed operations. So the act of actually obtaining a capability should be relatively rare; most OSD operations will be performed using a capability which the system already has in hand. That is an important design feature; adding "ask a daemon for a capability" to the filesystem I/O path would not be a performance-enhancing move.
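
As a rough sketch of the idea (greatly simplified from the descriptors the standard actually defines, and with invented field names), a capability might carry something like the following; the drive recomputes the signature with its own copy of the key and refuses the command if it does not match or the capability has expired:

    #include <stdint.h>

    /* Simplified, illustrative capability; the real OSD security descriptor
     * is larger, but the essentials are an expiration time, a statement of
     * what may be done to which object, and a keyed signature generated by
     * the security/policy daemon. */
    struct osd_capability_sketch {
            uint64_t expiration_time;   /* useless after this point */
            uint64_t partition_id;      /* object the capability applies to */
            uint64_t object_id;
            uint32_t permissions;       /* e.g. READ | WRITE | CREATE | REMOVE */
            uint8_t  signature[20];     /* HMAC over the fields above, computed
                                           with the root or partition key */
    };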

In theory, it should be relatively easy to make a standard Linux filesystem support an OSD. It's mostly a matter of hacking out much of the low-level block layout and inode management code, replacing it with the appropriate object operations. The osdfs filesystem was created in this way; the developers started with ext2. After taking out all the code they no longer needed, the osdfs developers simply added code translating VFS-level requests into operations understood by the OSD. Those requests are then executed by way of the low-level osd-initiator code (which was also recently submitted for consideration). Directories are implemented as simple files containing names and associated object IDs. There is no separate on-disk inode; all of that information is stored as attributes to the file itself. The end result is that the osdfs code is relatively small; it is mostly concerned with remapping VFS operations into OSD operations.
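
As an illustration of that scheme (the actual osdfs record format may well differ; the names below are made up), a directory file could simply hold a sequence of records like these, and lookup becomes a scan for a matching name:

    #include <stdint.h>
    #include <string.h>

    /* Invented on-media layout for an osdfs-style directory entry, stored
     * in the data of an ordinary file object.  No block numbers appear
     * anywhere; the name maps directly to another object ID. */
    struct osdfs_dirent_sketch {
            uint64_t object_id;      /* file object this name refers to */
            uint16_t name_len;
            char     name[256];      /* NUL-terminated component name */
    };

    /* Lookup is then just a linear scan of the directory object's data. */
    static int osdfs_lookup_sketch(const struct osdfs_dirent_sketch *entries,
                                   size_t count, const char *name,
                                   uint64_t *object_id)
    {
            size_t i;

            for (i = 0; i < count; i++) {
                    if (strcmp(entries[i].name, name) == 0) {
                            *object_id = entries[i].object_id;
                            return 0;
                    }
            }
            return -1;   /* not found */
    }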

Anybody wanting to test this code may run into one small problem: there are few OSDs to be found in the neighborhood computer store. It would appear that most of the development work so far has been done using OSD simulators. The OSC software OSD is, like osdfs, part of the open-osd project; it implements the OSD protocol over an SQLite database. There is also an OSD simulator hosted at IBM, but it would not appear to be under current development. Simulator-based development and testing may not be as rewarding as having a shiny new device implementing OSD in hardware, but it will help to ensure that both the software and the protocol are in good shape by the time such hardware is available.

It should be noted that the success of OSDs is not entirely assured. An OSD takes much of the work normally done in an operating system kernel and shoves it into a hardware firmware blob where it cannot be inspected or fixed. A poor implementation will, at best, not perform well; at worst, the chances of losing data could increase considerably. It may yet prove best to insist that storage devices just concentrate on placing bits where the operating system tells them to and leave the higher-level decisions to higher-level code. Or it may turn out that OSDs are the next step forward in smarter, more capable hardware. Either way, it is an interesting experiment.

See this article at Sun for more information on how OSD works.

Index entries for this article
Kernel: Block layer/Object storage devices
Kernel: Filesystems/osdfs
Kernel: Object storage devices



Linux and object storage devices

Posted Nov 4, 2008 21:00 UTC (Tue) by vblum (guest, #1151) [Link]

Sounds like we need the open source OSD to make this idea sit well with Linux. A good part of OS development went into making more efficient file systems. Should this type of work now become hidden again?

Linux and object storage devices

Posted Nov 4, 2008 21:22 UTC (Tue) by zdzichu (subscriber, #17118) [Link]

It seems that there is an open-source OSD. Sun sells the StorageTek 5800, the source for which can be found at http://opensolaris.org/os/project/honeycomb/ . If I understand the issue correctly :)

Offloading security maketh me nervouth

Posted Nov 4, 2008 22:18 UTC (Tue) by felixfix (subscriber, #242) [Link]

Security software has such a record of flimflam, security through obscurity, and just plain shoddiness that I wonder how wise it would be to move it to drives where it can't be easily reviewed or modified.

Are there any disk drives with GPL software on them which can be reviewed and modified? Is there any provision for such in this scheme?

Linux and object storage devices

Posted Nov 4, 2008 23:56 UTC (Tue) by dlang (guest, #313) [Link]

It sounds as if they need to go to the SCSI target code and implement an OSD mode for it that then stores the files on a 'normal' local filesystem.

This would let any Linux box with the appropriate interface act as an OSD device, allowing for solid experimentation on both sides of the interface.
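
A minimal sketch of the kind of object-to-file mapping such a target could use (the backing directory and names are invented, purely to illustrate the idea):

    #include <stdio.h>
    #include <stdint.h>

    /* Map an incoming (partition, object) pair onto a file path in an
     * ordinary local filesystem; a software OSD target could then serve
     * object reads and writes with plain file I/O on that path. */
    static int osd_object_path(uint64_t partition_id, uint64_t object_id,
                               char *buf, size_t len)
    {
            int n = snprintf(buf, len, "/var/lib/osd-target/%016llx/%016llx",
                             (unsigned long long)partition_id,
                             (unsigned long long)object_id);
            return (n < 0 || (size_t)n >= len) ? -1 : 0;
    }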

Putting the filesystem layout code on the storage device raises a _lot_ of questions about reliability and performance. It also prevents trying different layouts.

It will also make it very hard for RAID to be used on such devices.

I really don't expect this to be the wave of the future, but anything can happen.

Linux and object storage devices

Posted Nov 5, 2008 2:01 UTC (Wed) by gdt (subscriber, #6284) [Link]

OSD isn't really aimed at drives; it's aimed at managed storage devices. At the moment these either need to handle disk blocks, which is pretty low level, or implement a network file system, which is pretty high level. You really want a protocol which allows the managed storage to understand which blocks comprise a file without buying into all of the semantics of file handling. That allows the storage to migrate the file to nearline storage, duplicate the entire file for redundancy, etc.

The protocol is an extension to SCSI not because OSD drives are expected to be common, but because that's the transport above Fiber Channel and iSCSI -- the dominant storage attachment protocols.

Linux and object storage devices

Posted Nov 5, 2008 5:36 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

So it doesn't seem likely to you that OSDs will supersede block devices for "commodity" mass storage? That's good... I don't really like the idea of the hardware itself mandating a particular organizational scheme.

Linux and object storage devices

Posted Nov 8, 2008 22:51 UTC (Sat) by vonbrand (guest, #4458) [Link]

So you don't like the layout in blocks of your current disk drive either...

Nice exaggeration

Posted Apr 9, 2009 7:24 UTC (Thu) by felixfix (subscriber, #242) [Link]

I want each byte to include a length specifier so I can have variable size and capability tickets with timeouts per byte. I resent the rigid 8 bit byte paradigm enforced upon me by hardware manufacturers.

The difference between a hardware-enforced block size and hardware-enforced security, object layout, capability tickets, and all the other cruft that goes with it is obvious to most people. If you don't see it, you shouldn't be wasting your time reading news from a Linux web site, which places a premium on cognitive capability.

Linux and object storage devices

Posted Nov 5, 2008 6:50 UTC (Wed) by dlang (guest, #313) [Link]

So why would you use OSD instead of one of the many network/cluster file systems? They all abstract the block device away and let the server at the far side do whatever it wants for the data storage.

The only 'advantage' this would have is that, instead of using cheap, common Ethernet, it requires a SCSI layer (parallel SCSI or Fibre Channel). You could mix and match normal storage with OSD storage, but they really are so different that I would not expect them to be mixed in practice.

Linux and object storage devices

Posted Nov 5, 2008 8:36 UTC (Wed) by jamesh (guest, #1159) [Link]

They probably want something that isn't quite as high-level as a NAS box: abstraction of the actual location of the file data (flash, disk, tape, etc.), while not imposing a particular high-level file system interface as you'd get with NFS or SMB.

It'd probably also make it easy to do things like run multiple file system hierarchies off the same object store, rather than having to worry about growing or shrinking partitions.

Lastly, this doesn't look incompatible with using Ethernet to connect to the storage device: people have been running SCSI over Ethernet for years with iSCSI.

Linux and object storage devices

Posted Nov 6, 2008 12:01 UTC (Thu) by csamuel (✭ supporter ✭, #2624) [Link]

This *has* come from the cluster filesystems; I first came across OSDs in 2004 as part of the Lustre file system design. The idea is that you have a whole bunch of OSDs and you parallelise your I/O across them.

Lustre is now owned by Sun...

Linux and object storage devices

Posted Nov 5, 2008 4:22 UTC (Wed) by pabs (subscriber, #43278) [Link]

OSD and osdfs might be an interesting way to export directories to virtual machines.

Wrong direction!

Posted Nov 5, 2008 13:07 UTC (Wed) by kev009 (guest, #43906) [Link]

I think this is certainly the wrong direction. Manufacturers can't even get hard disk firmware, let alone RAID, right much of the time. Expecting them to do even more complex tasks is asking for trouble... not to mention obsolescence: they likely won't release updates after two or so years, if they are feeling generous.

The future of mainstream storage is squarely in solid state technology such as flash memory, holographic storage, memristors, etc. It only makes sense to treat these as low as possible: arrays of raw memory addresses just like RAM.

With many-core processors, it seems foolish to even think about this these days.

Wrong direction!

Posted Nov 5, 2008 13:12 UTC (Wed) by kev009 (guest, #43906) [Link]

Also, input from experts at EMC, NetApp, Seagate, IBM and Sun/StorageTek would be critical because they have been doing things like this for a long time.

In the end I still think a software layer is the only way to go because it can be developed and improved in the open as FOSS.

Wrong direction!

Posted Nov 5, 2008 15:22 UTC (Wed) by zlynx (guest, #2285) [Link]

Except that even RAM isn't just RAM anymore. No one who cares about performance treats memory like a big random access array. With multi-level caches and sequential prefetch, RAM is more like disk blocks. Flash memory is even more so, with its relatively slow read start and very slow write.

When programmers don't understand and account for the underlying nature of the system, it results in awful code, like most Java software. So don't hide too much of it.

Wrong direction!

Posted Nov 6, 2008 3:32 UTC (Thu) by gdt (subscriber, #6284) [Link]

It only makes sense to treat these as low as possible: arrays of raw memory addresses just like RAM.

That's hardly as low an abstraction as possible. For rotating storage it hides bad block remapping. For flash storage it hides wear levelling and delete-before-write.

The search isn't for the lowest possible abstraction to present to the computer, the search is for the abstraction which best mediates between the needs of the computer and the needs of the storage. Storage is increasingly remote and managed, and the current block-based abstraction and your offset-based proposal don't give enough information to the storage's management software.

It's the diversity of storage media that's currently driving object-based storage. It's a lot simpler to build complex storage (with features like migration between flash, disk and tape) if the storage is told what blocks are in a file rather than being left to guess.

Someone asked, why not use a filesystem such as CIFS or NFS? The answer is that this leads to user-based authentication, which leads to a lot of unnecessary complexity for the storage. One of the aims of OBS is to allow storage to be leased out, and integrating with customers' authentication systems would have introduced a big hurdle.

Please note that I'm not an OBS defender, I'm only seeking to explain it. Conversely, I'm also not saying that OBS is such a poor idea that it shouldn't be in Linux. My own view is that the SCSI protocol itself is now inadequate for enterprise storage, as it is a poor fit for the link, network and transport protocols used in corporate networks. I don't see much sense in using a disk protocol to communicate between a computer and a storage manager (i.e., another computer). There's a lot of pretence happening there which could be stripped away for better performance and robustness. My view may be overly coloured by experience as a participant in the iSCSI working group.

It was wrong back then and it's wrong now

Posted Nov 6, 2008 16:20 UTC (Thu) by khim (subscriber, #9252) [Link]

For rotating storage it hides bad block remapping. For flash storage it hides wear levelling and delete-before-write.

Yes - and I've certainly had problems with the first, and I'm sure I'll have problems with the second. The only sane way to resolve this problem is to offer as low-level access as possible - but not lower. I don't think checksum calculation for blocks on the HDD belongs in the OS kernel (it can be calculated in the HDD more or less for free, but a general-purpose CPU will spend significant power doing it), but bad block handling certainly should belong to the kernel - it has more resources to cope.

The search isn't for the lowest possible abstraction to present to the computer, the search is for the abstraction which best mediates between the needs of the computer and the needs of the storage.

Puhlease. What has this search offered us now? Predictable and mostly unreliable HDDs and SSDs? I'd prefer raw flash, thank you.

It's a lot simpler to build complex storage (with features like migration between flash, disk and tape) if the storage is told what blocks are in a file rather than being left to guess.

What is the goal: a great storage subsystem or a great system? If the latter, then all these things must be done at the system level (and may be offered via NFS/CIFS).

Please note that I'm not an OBS defender, I'm only seeking to explain it. Conversely, I'm also not saying that OBS is such a poor idea that it shouldn't be in Linux.

OBS is quite a bad idea, but Linux will need some support for it anyway. And it's useful in some strange places (for example in KVM/VMWare/Xen).

It was wrong back then and it's wrong now

Posted Nov 8, 2008 3:15 UTC (Sat) by Ze (guest, #54182) [Link]

Yes - and I've certainly had problems with the first, and I'm sure I'll have problems with the second. The only sane way to resolve this problem is to offer as low-level access as possible - but not lower. I don't think checksum calculation for blocks on the HDD belongs in the OS kernel (it can be calculated in the HDD more or less for free, but a general-purpose CPU will spend significant power doing it), but bad block handling certainly should belong to the kernel - it has more resources to cope.

Yes, and what about the bandwidth requirements of sending the checksum over? Or if someone wishes to use that space for an error-correcting code or a more secure hash? Ultimately the time spent transferring files from the disk should be a small percentage compared to the time spent waiting for them. However, there are always going to be trade-offs with flexibility, speed and other things.

What is the goal: a great storage subsystem or a great system? If the latter, then all these things must be done at the system level (and may be offered via NFS/CIFS). OBS is quite a bad idea, but Linux will need some support for it anyway. And it's useful in some strange places (for example in KVM/VMWare/Xen).

The whole point though is that NFS and CIFS are unsuitable for some uses. This is where object-based file systems come in; they are like a stripped-down form of NFS/CIFS. Ideally it'd be nice to have a layer setup that allows people to see the layers they want and not have to put up with the layers they don't need. One downside I see for object-based file systems is with versioning file systems, which use the layout of the block layer to make having multiple versions cheap in space. That's a trade-off, though, that may be worth it to some and not to others.

It was wrong back then and it's wrong now

Posted Nov 9, 2008 0:37 UTC (Sun) by giraffedata (guest, #1954) [Link]

For rotating storage it hides bad block remapping.

Bad block remapping? It hides all block mapping. In a truly low-level disk interface, Linux would address the disk by cylinder/head/sector. Indeed, there are things Linux could do more effectively if it controlled the storage at that level. And that's nowhere near the lowest conceivable abstraction, either.

I don't think checksum calculation for blocks on HDD belongs to OS kernel (it can be calculated in HDD more or less for free, but general-purpose CPU will spend significant power doing it)

I see no reason for it to be a cheaper computation in the HDD than in the main computer. If there are special purpose processors in the HDD to do it cheaply, it's because that's where we've decided to do it; not vice versa.

the search is for the abstraction which best mediates between the needs of the computer and the needs of the storage.

I'd like to put it differently, because it's not what I can do for the computer and the storage, but what they can do for me. So: Which abstraction best leverages the abilities of the computer and those of the storage, to provide the most efficient storage service?

There was a time when the best dividing line was such that the main computer watched the bits stream off the head until it detected the start of a record, etc. That let us consolidate expensive CPUs. Today, we can squeeze more storage service out cheaper by moving a great deal of that function to the other end of the cable. More recent technological changes might mean it's most efficient for file layout to move out there too.

I can think of a few reasons to stick function inside the storage product and the storage box instead of the main computer products and box:

  • The implementation must change more when the storage layer below it changes than when the application layer above it does.
  • Multiple main computers share the storage box.
  • It's expensive to move information over the cable.

Linux and object storage devices

Posted Nov 6, 2008 4:45 UTC (Thu) by palapa (guest, #612) [Link]

Sounds like a perfect way to embed DRM into linux systems. The DRM could reside in the OSD, and not run afoul of the (gnu) operating system.

Linux and object storage devices

Posted Nov 6, 2008 12:38 UTC (Thu) by ricwheeler (subscriber, #4980) [Link]

Object-based storage systems have been shipping for several years and have done relatively well (Panasas, EMC Centera, and its archival storage competitors all have object-based guts).

Since Linux is usually the base operating system used to implement the object-based back end, it is interesting to track how well our components support people trying to build systems from our existing stack - target mode SCSI, scalable file systems, etc.

Linux and object storage devices

Posted Nov 19, 2008 14:58 UTC (Wed) by nanolinux (guest, #55256) [Link]

Object storage devices are a trojan horse unless they are under the control of Linux, with the work offloaded onto a "storage processor" on the mainboard where the main CPU resides, and are completely open source, entirely under Linux's control.


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds