Fundamentals of Storage Systems, Understanding Reliability and Performance of Solid State Storage
Solid state storage has come on strong in the last year. With that explosion of new products it can be hard to look at all the vendor information and decide which device is best for you. Between the different manufacturers using different methods to benchmark their products showing two different numbers for reads and writes using different methodologies it can be extremely confusing. If you haven’t read Solid State Storage Basics you may not understand all the terms used in this article.
SLC and MLC Characteristics and Differences
Right now there are two main flavors of NAND Flash that are in use. Single Level Cell(SLC) and Multi Level Cell(MLC). SLC stores a single bit cell while MLC can store two bits. There are flavors of MLC that can store three and four bits but are unsuitable at this time for mass storage like hard drives. They have very low endurance and wear out quickly.
SLC has several desirable characteristics that have made it the choice for enterprise applications for quite a while. It is more durable in every way over MLC. Where it loses out is on capacity and price.
|Read Speed||25~ nanoseconds||50~ nanoseconds|
|Write Speed||220~ nanoseconds||900~ nanoseconds|
|P/E Cycles||100k to 300k||3k to 30k|
|Minimum ECC Bits required||1 bit per 512 bytes||12 bits per 512 bytes|
SLC can cost as much as five times as MLC. This alone is enough for many manufacturers to look at MLC over SLC. Couple that with the increased capacity makes MLC a compelling alternative for mass storage. The problem has been how to make MLC reliable in the enterprise.
As you can see, SLC is more robust requiring less error correcting code to fix data issues. Just a few years ago, MLC wasn’t considered good enough to be in even consumer grade drives. Over the last three years several manufacturers have focused on building NAND Flash controllers that could compensate for this using large amounts of error correction. In some cases several times the 12 bits per 512 bytes. This combined with better garbage collection and wear-leveling algorithms have finally extended MLC into the enterprise. This comes with a price though. ECC has to be stored somewhere, usually sacrificing storage space, and you need a much more powerful controller to handle the calculations without hurting performance. Another one of the techniques to extend the performance and endurance used is to put as many chips in a parallel arrangement with multiple channels. Think of it as RAID on a chip level instead of a hard disk level. This allows them to spread the IO load as wide as possible. The larger the capacity of the storage device the more area it has to use things like TRIM and it’s own internal garbage collection across multiple NAND chips keeping IO from stalling out due to write amplification. It also increases the life of the device as well since you can spread the wear-leveling out. There are standards bodies like JEDEC that help define endurance and longevity but you must still read the fine print. A good example is the Intel product manual for the X-25M SSD. If you look at page 6 you see the minimum useful life rated at 3 years. But, if you look at the write endurance you see that the 80 gigabyte drive is rated at 7.5 terabytes. That is 7.5 terabytes period, for the life of the drive. That means you shouldn’t write more than 21 gigabytes a day in changed data to the drive. For SQL Server that can be quite a low number. I’ve seen data warehousing processes load multiple terabytes over a 8 hour load window. Again, capacity equals endurance the 160 gigabyte drive can sustain 15 terabytes worth of data change. Intel will tell you that the X25-M is meant for enterprise workloads, they are wrong. In contrast, the X-25E SSD has a much longer life due to the SLC it uses instead of MLC. the 32 gigabyte version supports 1 petabyte of random writes and the 64 gigabyte drive supports 2 petabytes of random writes over the life of the drive. This makes the X-25E a better candidate for server work loads. Fusion-io rates their MLC based ioDrive at 5 terabytes a day. They also claim a life expectancy of 16 years. That is 28 petabytes of P/E cycles. This is to just show you that with enough engineering you can have an MLC based device still be very reliable.
SATA, SAS or Neither?
The interface for your solid state disk is also critical to the performance of the drive. We are quickly hitting a wall with SATA II and solid state where a single SSD can saturate a single SATA channel. SAS and SATA both have released the new third generation standard allowing up to 600 megabytes a second of through put but even that doesn’t offer much head room for growth. Several manufacturers are calling their SSD offerings enterprise even though they are on a SATA interface. If you are building a high performance IO subsystem SATA isn’t the best option. With SATA II and the addition of Native Command Queuing it did get a lot better but still falls short of SAS in several areas.
SATA Vs. SAS
|Command Queuing||TCQ supports queue depths up to 216 usually capped at 64||NCQ supports queue depths up to 32|
|Error recovery and detection||Uses the SCSI command is more robust||SMART Proven to be in adequate. see Google Paper|
|Duplex||Full Duplex dual port per drive||Half Duplex single port|
|Multi-path IO||fully supported at drive level||supported in SATA II via expanders|
Some of these features were nice but if you were choosing between a 7200 RPM SATA drive and a 7200 RPM SAS drive there wasn’t a huge difference. Add in flash though and SATA very quickly shows its short comings. I cannot stress how important command queuing is to flash storage. If the drive you have picked supports NCQ make sure your HBA supports NCQ and ACHI mode to get the most out of it, PC Perspective has a nice write up on this. Lastly, most SATA drives don’t honor the OS request to disable write caching on the drive. This is a big deal for SQL Server where protecting the data is very important. That alone usually keeps me from putting critical databases on SATA based storage. Most RAID HBA’s may let you toggle the drives write cache on or off on a per drive basis but there is still no guarantee that the drive will honor that request ether.
PCIe add in cards
If you aren’t limited to the standard 3.5” or 2.5” form factor and can choose a PCIe based flash device I would recommend starting with Fusion-io. I haven’t had any experience with the Texas Memory System PCIe card though. OCZ, Super Talent and others like them use a combination of bridge chips, RAID controller chips and flash controller chips to build up their SATA PCIe offerings. The form factor may be more convenient but they are ultimately the same as multiple SATA drives plugged into a RAID HBA.
The last thing to remember is TRIM doesn’t work through RAID HBAs SAS or SATA doesn’t matter.
By the numbers
I see people quote performance numbers from different manufactures about just how fast their particular solid state storage is. The problem is, there is no real standard for measuring performance and it can be almost impossible to do an apples to apples comparison between two different devices. If you start at the product specification for the X25-M you see the what you expect. 4K read IOPS 35,000 at 100 percent span(using the entire drive). Write IOPS however are a little different. Using 100 percent span the IO/Sec drop to 350. If you only use one tenth of the drive it shoots up to 3300. The difference is startling. Using an old technique called short stroking, they are able to show the drive in a better light. Using this technique on hard disks yields higher IO’s per second at the cost of capacity and throughput. Applying this technique to a solid state disk limits the amount of data space used for writes and gives the maximum amount of free space for wear-leveling and garbage collection greatly reducing the write amplification effect. Rarely do you see the lower number quoted. On the X-25E all numbers are quoted at full span, showing again the higher performance of SLC. Also, if you look at the footnotes all write tests were done with drive caches enabled. For SQL Server this is a bad idea, if you have a power outage any data in the drive cache is lost. They perform these tests at the maximum queue depth for Native Command Queuing (NCQ) can handle. Again, this pushes the device to its peak throughput. This isn’t a bad thing for SSD’s, but most SQL Server setups have been engineered to keep queue depths low to decrease latencies from the IO system which is usually made up of spinning disks. If you don’t have latency issues now, you may not see a huge improvement by replacing your spinning disks with solid state ones. Size of the IO request is also very important Usually for number of IO’s they will use a sector sized request. On SSD’s that is normally 4 kilobytes. For throughput megabytes per second they use a 128 kilobyte request to get higher numbers. So, when you read the specifications you get the impression that a drive will do say 260 MB/sec at 35,000 IOs/Sec which just isn’t true. This isn’t a new game, hard drive benchmarks also do something similar. As you look at the 4k numbers you can effectively cut them in half since SQL Server works on an 8k page request size. SSDs also perform differently on random and sequential IO loads just like hard disks do. When you look at the specification make sure and note the IO mix, if they don’t give those numbers assume that you will have to do your own testing!
Previous Writes Effect Future Writes
Another issue with the performance numbers quoted has to do with the state of the drive. When a solid state disk is new, i.e. never been written to, it is at it’s peak. Performance will be the best it is ever going to be. When you test your solid state devices doing short duration tests can be very misleading. As I have already pointed out, if you only use a small section of the drives for writes you get inflated numbers. If you only do a short test on the entire drive you are effectively doing the same thing. You must test the entire drive. You must also understand your workload. If you don’t know what the workload will be don’t be afraid to test a wide range of IO sizes and types. Sequential writes tend to leave large contiguous blocks of free space making garbage collection faster. In contrast random writes typically leave lots of small blocks of free space forcing garbage collection to work overtime slowing writes down. As you move from one IO type to another you should add in extra time for the drive to settle into a new steady state before resuming valid samples. Your goal is to get the drive to perform in a predictable manor for your IO load. Realize you may need to discard a range of samples that cover the transition from one steady state to the other. It can lower or inflate your averages and cause you to under or over provision your storage to meet your IO requirements.
Performance over Time
Unlike a hard drive, as you use a solid state disks performance degrades over time for several reasons. In the case of the X-25M the first firmware suffered from poor garbage collection and IO pattern recognition on large volumes of small IO’s causing the drive to suffer as much as a five fold decrease in write performance. We aren’t just talking small files but small changes to large files, like SQL Server data files. This particular problem was partially fixed with a firmware update. In general, all solid state devices suffer As you use your drive over a longer period it will lose performance as part of the normal wear on the NAND Flash chips themselves. They develop more errors cause more write retries. These issues are corrected using ECC and bad block management, but it still leads to poorer performance. SLC has an advantage over MLC again due to it’s much higher endurance but isn’t 100 percent immune to this. If you replace your hardware on a three or five year cycle this may not be a huge issue for you, but it still pays to monitor the performance over time.
There is a lot to learn when it comes to solid state storage. Making sure you do your own testing and research can keep you from suffering from premature failure and poor performance down the road. Remember, NAND Flash has been around for a while but this new wave of solid state storage is only a few years old. Not having a large pool of these devices in the field for longer than their rated life span makes it hard to predict if they are truly as reliable as we all hope they are.