"It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so." - Mark Twain
I try to keep this quote in mind whenever I’m teaching about new technologies. You often hear the same things parroted over and over again long after they quit being true. This problem is compounded by fast-moving technologies like NAND Flash.
If you have read my previous posts about Flash memory you are already aware of NAND flash endurance and reliability. Just as with CPU manufacturing processes, flash gets a boost as you decrease the size of the transistors/gates used on the device. In CPUs you get increases in speed; in flash you get increases in capacity. The current generation of flash is manufactured on a 32nm process, which nets four gigabytes per die. Die size isn’t the same as chip, or package, size. Flash dies are stacked inside the chip package, giving us sixteen gigabytes per package. With the new die shrink to 25nm we double that to eight gigabytes per die and thirty-two gigabytes per package. That sounds great, but there is a dark side to the ever-shrinking die. As the gate gets smaller it becomes more unreliable and has less endurance than the previous generation. MLC flash suffers the brunt of this but SLC isn’t completely immune.
Cycles And Errors
One of the things that always comes up when talking about flash is the fact that it wears out over time. The numbers that always get bandied about are that SLC is good for 100,000 writes to a single cell and MLC dies at 10,000 cycles. This is one of those things that just ain’t so anymore. For current mainstream MLC flash on the 32nm process, write cycles are down to 5,000 or so. 25nm cuts that even further to 3,000, with higher error rates to boot.
Several manufacturers have announced the transition to 25nm on their desktop drives, Intel and OCZ being two of the biggest. Intel is a partner with Micron, and together they are directly responsible for developing and manufacturing quite a bit of the NAND flash on the market. OCZ is a very large consumer of that product. So, what do you do to offset the issues with 25nm? Well, the same thing you did to offset the problem with 32nm: more spare area and more ECC. At 32nm it wasn’t unusual to see 24 bits of ECC per 512 bytes. Now, I’ve seen numbers as high as 55 bits per 512 bytes to give 25nm the same protection.
To give you an example here is OCZ’s lineup with raw and usable space listed.
|Drive Model||Production Process||Raw Capacity (in GB)||Affected Capacity (in GB)|
As you can clearly see, the usable space is significantly decreased. There is a second problem specific to the OCZ drives as well. Since they are now using higher density modules, they are only using half as many of them. Since most SSDs get their performance from multiple read/write channels, cutting that in half isn’t a good thing.
SLC is less susceptible to this issue but it is happening. At 32nm SLC was still in the 80,000 to 100,000 range for write cycles but the error rate was getting higher. At 25nm that trend continues and we are starting to see some of the same techniques used in MLC coming to SLC as ECC creeps up from 1 bit per 512 bytes to 8 bits or more per 512 bytes. Of course the downside to SLC is that it has half the capacity of MLC. As dies keep shrinking, SLC may be the only viable option in the enterprise space.
It’s Non-Volatile… Mostly
Another side effect of shrinking the floating gate size is the loss of charge due to voltage bleed off over time. When I say “over time” I’m talking weeks or months and not years or decades anymore. The data on these smaller and smaller chips will have to be refreshed every few weeks. We aren’t seeing this severe an issue at the 25nm level but it will be coming unless they figure out a way to change the floating gate to prevent it.
Smaller Faster Cheaper
If you look at trends in memory and CPU you see that every generation the die gets smaller, capacity or speed increases, and the parts become cheaper as you can fit double the chips on a single wafer. There are always technical issues to overcome with every technology. But NAND flash is the only one that gets so inherently unreliable at smaller and smaller die sizes. So, does this mean the end of flash? In the short term I don’t think so. The fact is we will have to come up with new ways to reduce writes and add new kinds of protection and more advanced ECC. On the pricing front we are still in a position where demand is outstripping supply. That may change somewhat as 25nm manufacturing ramps up and more factories come online but, as of today, I wouldn’t expect a huge drop in price for flash in the near future. If it was just a case of SSDs consuming the supply of flash it would be a different matter. The fact is your cell phone, tablet and every other small portable device uses the exact same flash chips. Guess who is shipping more, SSDs or iPhones?
So, What Do I Do?
The easiest thing you can do is read the label. Check what manufacturing process the SSD is using. In some cases, like OCZ, that wasn’t a straightforward proposition. In most cases, though, the manufacturer prints raw and formatted capacities on the label. Check the life cycle/warranty of the drive. Is it rated for 50 gigabytes of writes or 5 terabytes of writes a day? Does it have a one year warranty or 5 years? These are indicators of how long the manufacturer expects the drive to last. Check the error rate! Usually the error rate will be expressed as unrecoverable write or read errors per bits transferred. Modern hard drives are in the range of one error per 10^15 to 10^17 bits. Some enterprise SSDs are in the 10^30 range. This tells me they are doing more ECC than the flash manufacturer “recommends” to keep your data as safe as possible.
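To get a feel for what those error rate exponents mean, it can help to turn them into a volume of data. The sketch below is only an illustration: it assumes the rating means one unrecoverable error per that many bits transferred, and the function name is made up for this example.

```python
def terabytes_per_error(bits_per_error: float) -> float:
    """Average amount of data (decimal terabytes) moved per unrecoverable error.

    Assumes the spec sheet figure means "one error per N bits transferred".
    """
    bytes_per_error = bits_per_error / 8
    return bytes_per_error / 1e12


for exponent in (15, 17):
    tb = terabytes_per_error(10 ** exponent)
    print(f"1 error per 10^{exponent} bits is about one error per {tb:,.0f} TB")
```

A drive rated at one error per 10^15 bits works out to roughly one unrecoverable error for every 125 terabytes moved, which is why the exponent matters more than it first appears.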
Solid state storage has come on strong in the last year. With that explosion of new products it can be hard to look at all the vendor information and decide which device is best for you. Between different manufacturers using different methods to benchmark their products, and quoting separate numbers for reads and writes measured with different methodologies, it can be extremely confusing. If you haven’t read Solid State Storage Basics you may not understand all the terms used in this article.
SLC and MLC Characteristics and Differences
Right now there are two main flavors of NAND Flash in use: Single Level Cell (SLC) and Multi Level Cell (MLC). SLC stores a single bit per cell while MLC can store two bits. There are flavors of MLC that can store three and four bits per cell, but they are unsuitable at this time for mass storage like hard drives. They have very low endurance and wear out quickly.
SLC has several desirable characteristics that have made it the choice for enterprise applications for quite a while. It is more durable in every way than MLC. Where it loses out is on capacity and price.
|Characteristic||SLC||MLC|
|Read Speed||~25 microseconds||~50 microseconds|
|Write Speed||~220 microseconds||~900 microseconds|
|P/E Cycles||100k to 300k||3k to 30k|
|Minimum ECC Bits required||1 bit per 512 bytes||12 bits per 512 bytes|
SLC can cost as much as five times as much as MLC. This alone is enough for many manufacturers to look at MLC over SLC. Couple that with the increased capacity and MLC becomes a compelling alternative for mass storage. The problem has been how to make MLC reliable in the enterprise.
As you can see, SLC is more robust, requiring less error correcting code to fix data issues. Just a few years ago, MLC wasn’t considered good enough to be in even consumer grade drives. Over the last three years several manufacturers have focused on building NAND Flash controllers that could compensate for this using large amounts of error correction, in some cases several times the 12 bits per 512 bytes. This, combined with better garbage collection and wear-leveling algorithms, has finally extended MLC into the enterprise. This comes with a price though. ECC has to be stored somewhere, usually sacrificing storage space, and you need a much more powerful controller to handle the calculations without hurting performance.
Another technique used to extend performance and endurance is to put as many chips as possible in a parallel arrangement with multiple channels. Think of it as RAID at the chip level instead of the hard disk level. This allows the controller to spread the IO load as wide as possible. The larger the capacity of the storage device, the more area it has for things like TRIM and its own internal garbage collection across multiple NAND chips, keeping IO from stalling out due to write amplification. It also increases the life of the device, since the wear-leveling is spread out further.
There are standards bodies like JEDEC that help define endurance and longevity but you must still read the fine print. A good example is the Intel product manual for the X25-M SSD. If you look at page 6 you see the minimum useful life rated at 3 years. But if you look at the write endurance you see that the 80 gigabyte drive is rated at 7.5 terabytes. That is 7.5 terabytes, period, for the life of the drive. Spread over three years, that means you shouldn’t write more than about 7 gigabytes a day in changed data to the drive; even if you only needed the drive to last a single year, the budget would be about 21 gigabytes a day. For SQL Server that can be quite a low number. I’ve seen data warehousing processes load multiple terabytes over an 8 hour load window. Again, capacity equals endurance: the 160 gigabyte drive can sustain 15 terabytes worth of data change. Intel will tell you that the X25-M is meant for enterprise workloads; they are wrong.
In contrast, the X25-E SSD has a much longer life due to the SLC it uses instead of MLC. The 32 gigabyte version supports 1 petabyte of random writes and the 64 gigabyte drive supports 2 petabytes of random writes over the life of the drive. This makes the X25-E a better candidate for server workloads. Fusion-io rates their MLC based ioDrive at 5 terabytes a day. They also claim a life expectancy of 16 years. That is roughly 28 petabytes of writes. This just shows that with enough engineering an MLC based device can still be very reliable.
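The endurance arithmetic is worth doing yourself for any drive you are considering. Here is a minimal sketch that turns a lifetime write rating into a daily write budget, using the 7.5 and 15 terabyte ratings quoted above. The function name is mine, and decimal units (1 TB = 1000 GB) are an assumption.

```python
def daily_write_budget_gb(endurance_tb: float, years: float) -> float:
    """Gigabytes per day you can write and still stay inside the lifetime rating.

    Assumes decimal units (1 TB = 1000 GB) and 365-day years.
    """
    return endurance_tb * 1000 / (years * 365)


# 80 GB drive rated at 7.5 TB, spread over the 3 year rated life
print(round(daily_write_budget_gb(7.5, 3), 1))   # 6.8
# Same drive if you only need it to survive one year
print(round(daily_write_budget_gb(7.5, 1), 1))   # 20.5
# 160 GB drive rated at 15 TB over 3 years
print(round(daily_write_budget_gb(15, 3), 1))    # 13.7
```

Compare the result to the volume of changed data your database actually generates per day, not to the size of the database.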
SATA, SAS or Neither?
The interface for your solid state disk is also critical to the performance of the drive. We are quickly hitting a wall with SATA II and solid state, where a single SSD can saturate a single SATA channel. SAS and SATA have both released a new third generation standard allowing up to 600 megabytes a second of throughput, but even that doesn’t offer much headroom for growth. Several manufacturers are calling their SSD offerings enterprise even though they are on a SATA interface. If you are building a high performance IO subsystem SATA isn’t the best option. With SATA II and the addition of Native Command Queuing it did get a lot better, but it still falls short of SAS in several areas.
SATA Vs. SAS
|Feature||SAS||SATA|
|Command Queuing||TCQ supports queue depths up to 2^16, usually capped at 64||NCQ supports queue depths up to 32|
|Error recovery and detection||Uses the SCSI command set, which is more robust||SMART, proven to be inadequate (see the Google paper)|
|Duplex||Full duplex, dual port per drive||Half duplex, single port|
|Multi-path IO||Fully supported at drive level||Supported in SATA II via expanders|
Some of these features were nice, but if you were choosing between a 7200 RPM SATA drive and a 7200 RPM SAS drive there wasn’t a huge difference. Add in flash, though, and SATA very quickly shows its shortcomings. I cannot stress enough how important command queuing is to flash storage. If the drive you have picked supports NCQ, make sure your HBA supports NCQ and AHCI mode to get the most out of it; PC Perspective has a nice write up on this. Lastly, most SATA drives don’t honor the OS request to disable write caching on the drive. This is a big deal for SQL Server, where protecting the data is very important. That alone usually keeps me from putting critical databases on SATA based storage. Most RAID HBAs may let you toggle the drive’s write cache on or off on a per drive basis, but there is still no guarantee that the drive will honor that request either.
PCIe add in cards
If you aren’t limited to the standard 3.5” or 2.5” form factor and can choose a PCIe based flash device, I would recommend starting with Fusion-io. I haven’t had any experience with the Texas Memory Systems PCIe card though. OCZ, Super Talent and others like them use a combination of bridge chips, RAID controller chips and flash controller chips to build up their SATA PCIe offerings. The form factor may be more convenient but they are ultimately the same as multiple SATA drives plugged into a RAID HBA.
The last thing to remember is that TRIM doesn’t work through RAID HBAs. SAS or SATA, it doesn’t matter.
By the numbers
I see people quote performance numbers from different manufacturers about just how fast their particular solid state storage is. The problem is, there is no real standard for measuring performance and it can be almost impossible to do an apples to apples comparison between two different devices. If you start at the product specification for the X25-M you see what you expect: 35,000 4K read IOPS at 100 percent span (using the entire drive). Write IOPS, however, are a little different. Using 100 percent span the IO/sec drop to 350. If you only use one tenth of the drive it shoots up to 3,300. The difference is startling. Using an old technique called short stroking, they are able to show the drive in a better light. Using this technique on hard disks yields higher IOs per second at the cost of capacity and throughput. Applying this technique to a solid state disk limits the amount of data space used for writes and gives the maximum amount of free space for wear-leveling and garbage collection, greatly reducing the write amplification effect. Rarely do you see the lower number quoted. On the X25-E all numbers are quoted at full span, showing again the higher performance of SLC.
Also, if you look at the footnotes, all write tests were done with drive caches enabled. For SQL Server this is a bad idea; if you have a power outage any data in the drive cache is lost. They also perform these tests at the maximum queue depth Native Command Queuing (NCQ) can handle. Again, this pushes the device to its peak throughput. This isn’t a bad thing for SSDs, but most SQL Server setups have been engineered to keep queue depths low to decrease latencies from the IO system, which is usually made up of spinning disks. If you don’t have latency issues now, you may not see a huge improvement by replacing your spinning disks with solid state ones.
Size of the IO request is also very important. Usually, for the number of IOs, they will use a sector sized request. On SSDs that is normally 4 kilobytes. For throughput in megabytes per second they use a 128 kilobyte request to get higher numbers. So, when you read the specifications you get the impression that a drive will do, say, 260 MB/sec at 35,000 IOs/sec, which just isn’t true. This isn’t a new game; hard drive benchmarks do something similar. As you look at the 4k numbers you can effectively cut them in half since SQL Server works on an 8k page request size. SSDs also perform differently on random and sequential IO loads just like hard disks do. When you look at the specification make sure to note the IO mix. If they don’t give those numbers, assume that you will have to do your own testing!
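A quick sanity check shows why the headline IOPS and MB/sec figures can’t both describe the same test. This is a rough sketch using the example numbers above; the function names are made up, and I am assuming binary kilobytes and decimal megabytes.

```python
def throughput_mb_per_sec(iops: float, io_size_kb: float) -> float:
    """Throughput implied by an IOPS figure at a given request size.

    Assumes 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes.
    """
    return iops * io_size_kb * 1024 / 1e6


def iops_at_throughput(mb_per_sec: float, io_size_kb: float) -> float:
    """IOPS implied by a throughput figure at a given request size."""
    return mb_per_sec * 1e6 / (io_size_kb * 1024)


# 35,000 IOPS at 4 KB requests moves far less than 260 MB/sec
print(round(throughput_mb_per_sec(35_000, 4), 2))   # 143.36
# 260 MB/sec at 128 KB requests is only about two thousand IOs a second
print(round(iops_at_throughput(260, 128)))          # 1984
```

The two headline numbers come from two different request sizes, so always ask which size was used before comparing drives.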
Previous Writes Affect Future Writes
Another issue with the performance numbers quoted has to do with the state of the drive. When a solid state disk is new, i.e. never been written to, it is at its peak. Performance will be the best it is ever going to be. When you test your solid state devices, short duration tests can be very misleading. As I have already pointed out, if you only use a small section of the drive for writes you get inflated numbers. If you only do a short test on the entire drive you are effectively doing the same thing. You must test the entire drive. You must also understand your workload. If you don’t know what the workload will be, don’t be afraid to test a wide range of IO sizes and types. Sequential writes tend to leave large contiguous blocks of free space, making garbage collection faster. In contrast, random writes typically leave lots of small blocks of free space, forcing garbage collection to work overtime and slowing writes down. As you move from one IO type to another you should add in extra time for the drive to settle into a new steady state before resuming valid samples. Your goal is to get the drive to perform in a predictable manner for your IO load. Realize you may need to discard a range of samples that cover the transition from one steady state to the other. They can lower or inflate your averages and cause you to under or over provision your storage to meet your IO requirements.
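One way to decide which samples to throw away is to look for the first stretch of measurements that stays within a narrow band. The sketch below is just one possible approach, not a standard: the window size, the 10 percent tolerance, the function name and the sample data are all made up for illustration.

```python
from statistics import mean


def steady_state_start(samples, window=5, tolerance=0.10):
    """Index of the first window whose spread is within `tolerance` of its mean.

    Returns None if the samples never settle. Window size and tolerance are
    arbitrary choices for this sketch, not values from any benchmark standard.
    """
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        average = mean(chunk)
        if average > 0 and (max(chunk) - min(chunk)) / average <= tolerance:
            return start
    return None


# Made-up write IOPS samples from a fresh drive, one per test interval
write_iops = [35000, 21000, 12000, 6000, 3600, 3400, 3300, 3350, 3320, 3310]

start = steady_state_start(write_iops)
print("naive average:", mean(write_iops))
print("steady state begins at sample", start)
print("steady state average:", mean(write_iops[start:]))
```

With this made-up data the naive average is nearly three times the steady state figure, which is exactly the kind of inflation that leads to under provisioning.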
Performance over Time
Unlike a hard drive, a solid state disk’s performance degrades over time as you use it, for several reasons. In the case of the X25-M the first firmware suffered from poor garbage collection and IO pattern recognition on large volumes of small IOs, causing the drive to suffer as much as a fivefold decrease in write performance. We aren’t just talking small files but small changes to large files, like SQL Server data files. This particular problem was partially fixed with a firmware update. In general, all solid state devices suffer: as you use your drive over a longer period it will lose performance as part of the normal wear on the NAND Flash chips themselves. They develop more errors, which cause more write retries. These issues are corrected using ECC and bad block management, but they still lead to poorer performance. SLC has an advantage over MLC again due to its much higher endurance but isn’t 100 percent immune to this. If you replace your hardware on a three or five year cycle this may not be a huge issue for you, but it still pays to monitor the performance over time.
There is a lot to learn when it comes to solid state storage. Making sure you do your own testing and research can keep you from suffering from premature failure and poor performance down the road. Remember, NAND Flash has been around for a while but this new wave of solid state storage is only a few years old. Not having a large pool of these devices in the field for longer than their rated life span makes it hard to predict if they are truly as reliable as we all hope they are.
Solid state storage is the new kid on the block. We see new press releases every day about just how awesome this new technology is. Like with any technology, you need a solid foundation in how it works before you can decide if it is right for you. Let’s review what solid state storage is and where it differs from traditional hard disks. I will cover solid state storage in a general manner, not favoring any specific flash manufacturer or specific type of Flash.
Types of Flash
NAND Flash Structure
NAND Flash Read Properties
NAND Flash Write Properties
Error Detection and Correction
Flash is a type of memory like the RAM in your computer. There are several key differences though. First, NAND is non-volatile, meaning it doesn’t require electricity to maintain the data stored in it. It also has very fast access times, not quite RAM’s access times but in between RAM and a spinning hard disk. It does wear out as you write to it over time. There are several types of Flash memory. The two most common types are NAND and NOR. Each has its benefits. NOR has the ability to write in place and has consistent and very fast read access times but very slow write access times. NAND has a slower read access time but is much faster to write to. This makes NAND more attractive for mass storage devices.
The Structure of NAND Flash
NAND stores data in a large serial array of transistors. Each transistor can store data. NAND Flash arrays are grouped first into pages. A page consists of data space and spare space. Spare space is physically the same as data space but is used for things like ECC and wear-leveling, which we will cover shortly. Usually, a page is 4096 bytes for data and 1 to 4 bits of spare for each 512 bytes of data space. Pages are grouped again into blocks of 64 to 128 pages, which is the smallest erasable unit. There can be quite a few blocks per actual chip, as many as 16 thousand blocks or 8 gigabytes worth, on a single chip. From there manufacturers group chips together usually in a parallel arrangement using controllers to make them look like one large solid state disk. They can vary in form factor. Most common are a standard 2.5” or 3.5” drive or a PCIe device.
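The page, block and chip sizes above multiply out neatly. This small sketch just does that arithmetic with the figures from the paragraph; the constant and function names are mine, and I am assuming the "16 thousand blocks" figure means 16 × 1024.

```python
PAGE_DATA_BYTES = 4096          # data area of one page, spare area not counted
BLOCKS_PER_CHIP = 16 * 1024     # assumption: "16 thousand" read as 16 * 1024


def block_bytes(pages_per_block: int) -> int:
    """Size of the smallest erasable unit, in bytes of data space."""
    return PAGE_DATA_BYTES * pages_per_block


def chip_bytes(pages_per_block: int, blocks: int = BLOCKS_PER_CHIP) -> int:
    """Data capacity of one chip, in bytes."""
    return block_bytes(pages_per_block) * blocks


for pages in (64, 128):
    print(pages, "pages per block:",
          block_bytes(pages) // 1024, "KB per block,",
          chip_bytes(pages) // 1024**3, "GB per chip")
```

With 128 pages per block you get 512 KB erase blocks and the 8 gigabyte chip mentioned above; with 64 pages per block the erase block is 256 KB.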
NAND Read Properties
NAND Flash operates differently than RAM or a hard disk. Even though NAND is structured in pages, somewhat like a hard disk, that is where the similarities end. NAND is structured to be accessed serially. As a type of memory, NAND flash is a poor choice for random write access patterns. A 15,000 RPM hard disk may have a random access seek time of 5.5 milliseconds. It has to spin a disk and position the read/write head. NAND, on the other hand, doesn’t actually seek. It does a look up and reads the memory area. That takes between 25 and 50 microseconds. It has the same read time no matter the type of operation, random or sequential. A single NAND chip may be able to read between 25 and 40 megabytes a second. So, even though it is considered a poor performer for random IO, it is still orders of magnitude faster than a hard disk.
NAND Write Properties
NAND Flash has a much faster read speed than write speed. The same NAND chip that reads at 40 megabytes a second may only sustain 7 megabytes a second in write speed. Average write time is about 250 microseconds. This figure only includes programming a page. Writing to flash can be much more complicated if there is already data in the page.
Program Erase Cycle
NAND does writes based on a program erase (P/E) cycle. When a NAND block is erased, all bits are set to 1. As you program a bit you set it to 0. The program cycle writes a page at a time and can be pretty quick. NAND doesn’t support an overwrite mode where a bit, page or even block can be overwritten without first being reset to a cleared state. The P/E cycle is very different from what happens on a hard disk, which can overwrite data without first having to clear a sector. Erasing a block takes between 500 microseconds and 2 milliseconds. Each P/E cycle wears on the NAND block. After so many cycles the block becomes unreliable and will fail to program or erase.
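If the program/erase rule feels abstract, a toy model makes it concrete. The sketch below is a deliberately simplified illustration and not how any real controller is written: the class name, the tiny page sizes and the error handling are all invented. It only enforces the two rules described above: programming can flip bits from 1 to 0, and only a block erase can set them back to 1.

```python
class ToyNandBlock:
    """Tiny model of one NAND block: program flips 1 -> 0, erase resets to 1."""

    def __init__(self, pages: int = 4, page_bytes: int = 2):
        self.page_bytes = page_bytes
        # An erased block reads back as all 1 bits (0xFF bytes)
        self.pages = [bytearray(b"\xFF" * page_bytes) for _ in range(pages)]
        self.erase_count = 0

    def program(self, page: int, data: bytes) -> None:
        current = self.pages[page]
        if len(data) != self.page_bytes:
            raise ValueError("data must be exactly one page")
        for old, new in zip(current, data):
            # Any bit that is 0 now but 1 in the new data would need 0 -> 1
            if new & ~old & 0xFF:
                raise ValueError("page must be erased before it can hold this data")
        self.pages[page] = bytearray(old & new for old, new in zip(current, data))

    def erase(self) -> None:
        """Erase works on the whole block, never on a single page."""
        self.pages = [bytearray(b"\xFF" * self.page_bytes) for _ in self.pages]
        self.erase_count += 1


block = ToyNandBlock()
block.program(0, b"\x0F\xF0")        # fine: only clears bits
try:
    block.program(0, b"\xFF\xFF")    # would need to set bits back to 1
except ValueError as error:
    print("rejected:", error)
block.erase()                        # the whole block goes back to 0xFF
block.program(0, b"\xAA\x55")
print(block.pages[0].hex(), "after", block.erase_count, "erase")
```

Notice that changing one page forced an erase of every page in the block, which is the cost the rest of this article is about avoiding.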
To mitigate the finite number of P/E cycles a NAND chip has, we use two different techniques to keep blocks alive or to make sure we don’t use a possibly bad block again. Let’s take a single NAND MLC chip. It may have 16 thousand blocks on it. Each block may be rated for between 3,000 and 10,000 P/E cycles. If you executed one P/E cycle per second, rotating across all of the blocks, it would take you over five years to reach the wear out rating of 10,000 cycles. If, on the other hand, you executed a P/E cycle on the same single block every second, you could hit the 10,000 rating in about 3 hours! This is why wear-leveling is so important. In the early days of NAND flash wearing out a block was a legitimate concern as applications would just rewrite the same block over and over. Modern devices spread that wear over not just a single chip but every available chip in the system, extending the life of your solid state disk for a very, very long time. Ideally, you want to write to every block once before writing to any block a second time. That isn’t always possible due to data access patterns.
It sounds simple enough to cycle through all available blocks before triggering a P/E cycle but in the real world it just isn’t that easy. As you fill the drive, data generally falls into two different categories, static data and dynamic data. Static data is something that is written once, or infrequently, and read multiple times. Something like a music file is a good example of this. Dynamic data covers things like log files that are written to frequently or, in our case, database files. If you only wear-level the dynamic data you shorten the life of the flash significantly. Alternatively, if you also include the static data you are now incurring extra write and read IO in the background that can affect performance of the device.
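The wear-out numbers above are easy to check. This sketch simply redoes the arithmetic under the same assumptions: one P/E cycle per second, 16,000 blocks and a 10,000 cycle rating. The variable names are mine.

```python
BLOCKS = 16_000                  # "16 thousand" blocks on the chip
RATED_CYCLES = 10_000            # P/E cycles each block is rated for
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365 * 24 * 3600

# Worst case: every P/E cycle lands on the same block, one per second
hours_single_block = RATED_CYCLES / SECONDS_PER_HOUR

# Best case: perfect wear-leveling, one P/E cycle per second spread over all blocks
years_leveled = BLOCKS * RATED_CYCLES / SECONDS_PER_YEAR

print(f"same block every second: worn out in {hours_single_block:.2f} hours")
print(f"perfectly wear-leveled:  worn out in {years_leveled:.2f} years")
```

The same write rate gives you under three hours or just over five years, and the only difference is how evenly the erases are spread.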
Background Garbage Collection
To defer the P/E cycle and mitigate the penalty of a block erase we rely on garbage collection running in the background of the device. When a file is altered it may be completely moved to clean pages and blocks, and the old blocks are then marked as dirty. This tells the garbage collector that it can perform a block erasure on them at any time. This works just fine as long as the drive has enough spare area allocated and the number of write requests is low enough for the garbage collector to keep up. Keep in mind, this spare area isn’t visible to the operating system or the file system and is independent of them. If you run out of free pages to program you start forcing a P/E cycle for each write, slowing down writes dramatically. Some manufacturers offset this with a large DRAM buffer and may also allow you to change the size of the over provisioned space.
Another technology that has started to gain momentum is the TRIM command. Fundamentally, this allows the operating system and the storage device to communicate about how much free space the file system has, and allows the device to use that space like the reserve space or the over provisioned space used for garbage collection. The downside is that it is really only available in Windows 7 and Windows Server 2008 R2. Some manufacturers are including a separate TRIM service for those OSes that don’t support it natively. Also, TRIM can only be effective if there is enough free space on the file system. If you fill the drive to capacity then TRIM is completely useless. Another thing to consider is that an erasable block may be 256 KB and we generally format our file system for SQL Server at 64 KB, several times smaller than the erasable block. The last thing to remember, and it is good advice for any device, not just solid state storage, is to grow your files in large chunks to keep file fragmentation down to a minimum. Heavy file fragmentation also cuts down on TRIM’s performance and can’t be easily fixed, since running a defragment may actually make the problem worse as it forces wholesale garbage collection and wears out the flash that much faster.
Another pitfall of wear-leveling and garbage collection is the phenomenon of write amplification: the device ends up writing more data to the flash than the host actually asked it to, because valid pages have to be moved before a block can be erased. As the device tries to keep up with write requests and garbage collection it can effectively bring everything to a standstill. Again, writing serially and deleting serially in large blocks can mitigate some of this. Unfortunately, SQL Server access patterns for OLTP style databases mean lots of little inserts, updates and deletes. This adds to the problem. There may be enough free space to accommodate the write but it is severely fragmented by the write pattern and a large amount of garbage collection is needed. TRIM can help with this if you leave enough free space available. This also means factoring free space into your capacity planning ahead of time. A full solid state device is a poor performing one when it comes to writes.
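To see why a full or fragmented drive writes so poorly, here is a back-of-the-envelope model of write amplification. It is a simplification I am assuming for illustration, not how any particular controller behaves: every block the garbage collector reclaims holds the same number of still-valid pages, which must be copied elsewhere before the block can be erased. The function name is made up.

```python
def write_amplification(pages_per_block: int, valid_pages_in_victim: int) -> float:
    """Flash page writes per host page write in a simple steady-state model.

    Reclaiming one block costs `valid_pages_in_victim` copy writes and frees
    room for (pages_per_block - valid_pages_in_victim) new host pages.
    """
    freed = pages_per_block - valid_pages_in_victim
    if freed <= 0:
        raise ValueError("a completely valid block frees no space when collected")
    # Total flash writes = copied pages + new host pages = pages_per_block
    return pages_per_block / freed


for valid in (0, 64, 96, 112):
    factor = write_amplification(128, valid)
    print(f"{valid:3d} of 128 pages still valid -> {factor:.1f}x flash writes per host write")
```

Large sequential deletes leave whole blocks empty (the 1.0x case), while scattered small updates leave blocks mostly valid, and each host write then costs several flash writes plus the wear that goes with them.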
Error Detection and Correction
The nature of NAND Flash makes it susceptible to several types of data corruption. Just as hard drives and floppy disks are at risk near magnetic sources, NAND has several vulnerabilities of its own, some of which occur even when just reading data.
Data in cells that aren’t being written to can be corrupted by writing to adjacent cells or even pages. This is called Program Disturb. Cells not being programmed receive elevated voltage, causing them to appear weakly programmed. There isn’t any damage to the physical structure and the condition can be cleared with a normal erase.
Reading repeatedly from the same block can have a similar effect called Read Disturb. Cells not being read collect a charge that causes them to appear weakly programmed. The main difference from Program Disturb is that it is always on the block being read and always on pages not being read. Again, the physical cell isn’t damaged and an erase of the affected block clears the issue.
Lastly, there is an issue with data retention in cells over time. A floating gate may gain or lose charge over time, making the cell appear to be weakly programmed or in another invalid state. The block is undamaged and can still be reliably erased and written to.
All of this sounds just as catastrophic as it gets. Fortunately, error correcting code (ECC) techniques effectively deal with these issues.
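To show the basic idea of how ECC repairs a flipped bit, here is a classic Hamming(7,4) code: four data bits protected by three parity bits, able to correct any single bit error. This is only a teaching sketch. Real flash controllers work on 512 byte or larger chunks with much stronger codes, and nothing here is taken from any vendor’s implementation.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4    # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4    # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4    # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)   # work on a copy so the caller's list is untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 means clean, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


stored = encode([1, 0, 1, 1])
damaged = stored.copy()
damaged[4] ^= 1                       # simulate a disturbed cell
print("stored: ", stored)
print("damaged:", damaged)
print("read back:", decode(damaged))
```

The price is visible even in this toy: three extra bits stored for every four bits of data, which is why stronger ECC eats into usable capacity.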
Bad Block Management
NAND chips aren’t always perfect. Every chip may have defects of some sort and will ship from the factory with bad blocks already on the device. Bad block management is integrated into the NAND chip. When a cell fails a P/E cycle the data is written to a different block and that block is marked bad making it unavailable to the system.
As you can see, Flash in some respects is much more complicated than your traditional hard disk. There are many things you must consider when implementing solid state storage.
Flash read performance is great, sequential or random.
Flash write performance is complicated, and can be a problem if you don’t manage it.
Flash wears out over time. Not nearly the issue it used to be, but you must understand your write patterns.
Plan for over provisioning and TRIM support. It can have a huge impact on how much storage you actually buy.
Flash can be error prone. Be aware that writes and reads can cause data corruption.
We will talk about MLC vs. SLC, what makes a device enterprise ready, and how to effectively benchmark your solid state storage so you are not caught off guard when you move into production.