Fundamentals of Storage Systems, Solid State Storage Basics
Solid state storage is the new kid on the block. We see new press releases every day about just how awesome this new technology is. As with any technology, you need a solid foundation in how it works before you can decide if it is right for you. Let’s review what solid state storage is and where it differs from traditional hard disks. I will cover solid state storage in a general manner, not favoring any specific flash manufacturer or specific type of Flash.
Types of Flash
NAND Flash Structure
NAND Flash Read Properties
NAND Flash Write Properties
Error Detection and Correction
Flash is a type of memory like the RAM in your computer. There are several key differences though. First, Flash is non-volatile, meaning it doesn’t require electricity to maintain the data stored in it. It also has very fast access times, not quite as fast as RAM but somewhere between RAM and a spinning hard disk. It does wear out as you write to it over time. There are several types of Flash memory. The two most common types are NAND and NOR. Each has its benefits. NOR has the ability to write in place and has consistent and very fast read access times but very slow write access times. NAND has a slower read access time but is much faster to write to. This makes NAND more attractive for mass storage devices.
The Structure of NAND Flash
NAND stores data in a large serial array of transistors. Each transistor can store data. NAND Flash arrays are grouped first into pages. A page consists of data space and spare space. Spare space is physically the same as data space but is used for things like ECC and wear-leveling, which we will cover shortly. Usually, a page is 4096 bytes for data plus a spare area large enough to hold ECC that can correct 1 to 4 bits for each 512 bytes of data space. Pages are grouped again into blocks of 64 to 128 pages. A block is the smallest erasable unit. There can be quite a few blocks per actual chip, as many as 16 thousand blocks or 8 gigabytes worth, on a single chip. From there, manufacturers group chips together, usually in a parallel arrangement, using controllers to make them look like one large solid state disk. They can vary in form factor. The most common are a standard 2.5” or 3.5” drive or a PCIe device.
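To make the geometry concrete, here is a small Python sketch that multiplies out the upper-end figures from this section. The constant names are my own, and the values are the ones quoted above rather than numbers from any specific chip's datasheet.

```python
# Illustrative NAND geometry using the upper-end figures from the text.
# All names are hypothetical; real parts vary by manufacturer.
PAGE_DATA_BYTES = 4096         # data space per page (spare area not counted)
PAGES_PER_BLOCK = 128          # a block is the smallest erasable unit
BLOCKS_PER_CHIP = 16 * 1024    # "16 thousand" blocks

block_bytes = PAGE_DATA_BYTES * PAGES_PER_BLOCK
chip_bytes = block_bytes * BLOCKS_PER_CHIP

print(f"block size: {block_bytes // 1024} KB")
print(f"chip size:  {chip_bytes // 1024**3} GB")
```

With 64 pages per block instead of 128, the same arithmetic gives a 256 KB erasable block.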
NAND Read Properties
NAND Flash operates differently than RAM or a hard disk. Even though NAND is structured in pages like a hard disk, that is where the similarities end. NAND is structured to be accessed serially. As a type of memory, NAND flash is a poor choice for random access patterns. A 15,000 RPM hard disk may have a random access seek time of 5.5 milliseconds. It has to spin a disk and position the read/write head. NAND, on the other hand, doesn’t actually seek. It does a lookup and reads the memory area. Reading a page takes between 25 and 50 microseconds. The read time is the same no matter the type of operation, random or sequential. A single NAND chip may be able to read between 25 and 40 megabytes a second. So, even though it is considered a poor performer for random IO, it is still orders of magnitude faster than a hard disk.
NAND Write Properties
NAND Flash has a much faster read speed than write speed. The same NAND chip that reads at 40 megabytes a second may only sustain 7 megabytes a second in write speed. The average time to program a page is around 250 microseconds. This figure only includes programming a page. Writing to flash can be much more complicated if there is already data in the page.
Program Erase Cycle
NAND does writes based on a program/erase (P/E) cycle. When a NAND block is erased, all of its bits are set to 1. As you program a bit you set it to 0. The program cycle writes a page at a time and can be pretty quick. NAND doesn’t support an overwrite mode where a bit, page or even block can be overwritten without first being reset to a cleared state. The P/E cycle is very different from what happens on a hard disk, which can overwrite data without first having to clear a sector. Erasing a block takes between 500 microseconds and 2 milliseconds. Each P/E cycle wears on the NAND block. After so many cycles the block becomes unreliable and will fail to program or erase.
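The "no overwrite without an erase" rule is easier to see in a toy model. The sketch below is only an illustration: the NandBlock class and its tiny page sizes are made up for this example, and a real chip enforces the rule in hardware. The key assumption is that programming can only clear bits, so a write lands as the bitwise AND of the old and new contents.

```python
class NandBlock:
    """Toy model of one NAND block: bits can only be programmed from 1 to 0."""

    def __init__(self, pages=4, page_bytes=4):
        self.pages = pages
        self.page_bytes = page_bytes
        self.erase_count = 0
        # A fresh block starts out erased: every bit is 1.
        self.data = [bytes([0xFF] * page_bytes) for _ in range(pages)]

    def erase(self):
        # Erase works on the whole block, never on a single page.
        self.data = [bytes([0xFF] * self.page_bytes) for _ in range(self.pages)]
        self.erase_count += 1

    def program(self, page, new):
        # Programming can only clear bits, so the stored result is old AND new.
        old = self.data[page]
        self.data[page] = bytes(o & n for o, n in zip(old, new))
        return self.data[page] == new  # False: the page needed an erase first


block = NandBlock()
ok_first = block.program(0, b"\x0f\x0f\x0f\x0f")   # erased page: data lands intact
ok_again = block.program(0, b"\xf0\xf0\xf0\xf0")   # overwrite: 0 bits cannot return to 1
block.erase()                                      # costs one P/E cycle for the block
ok_after = block.program(0, b"\xf0\xf0\xf0\xf0")   # works again after the erase
```

The second program call corrupts the page instead of replacing it, which is why a device has to find an erased page elsewhere or erase the whole block first.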
To mitigate the finite number of P/E cycles a NAND chip has, we use two different techniques: wear-leveling to keep blocks alive, and bad block management to make sure we don’t use a possibly bad block again. Let’s take a single NAND MLC chip. It may have 16 thousand blocks on it. Each block may be rated for between 3,000 and 10,000 P/E cycles. If you execute one P/E cycle per second and rotate through every block on the chip, it would take you over five years to reach the wear-out rating of 10,000 cycles. If, on the other hand, you executed a P/E cycle every second on the same single block, you could hit the 10,000 rating in about 3 hours! This is why wear-leveling is so important. In the early days of NAND flash, wearing out a block was a legitimate concern, as applications would just rewrite the same block over and over. Modern devices spread that wear over not just a single chip but every available chip in the system, extending the life of your solid state disk for a very, very long time. Ideally, you want to write to every block once before writing to any block a second time. That isn’t always possible due to data access patterns.
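The arithmetic above is worth checking for yourself. This short Python sketch uses the figures from the paragraph (16 thousand blocks, a 10,000 cycle rating, one P/E cycle per second). The variable names are my own, and the one-cycle-per-second rate is purely illustrative.

```python
BLOCKS = 16_000            # blocks on the example MLC chip
RATED_CYCLES = 10_000      # upper end of the P/E rating per block
CYCLES_PER_SECOND = 1      # illustrative rate, not a real workload

SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365 * 24 * SECONDS_PER_HOUR

# Hammering one block: every cycle lands on the same place.
single_block_hours = RATED_CYCLES / CYCLES_PER_SECOND / SECONDS_PER_HOUR

# Perfect wear-leveling: cycles are spread evenly across all blocks.
leveled_years = BLOCKS * RATED_CYCLES / CYCLES_PER_SECOND / SECONDS_PER_YEAR

print(f"single block worn out in {single_block_hours:.1f} hours")
print(f"whole chip worn out in   {leveled_years:.1f} years")
```

The gap between a few hours and a few years comes entirely from how the writes are distributed, not from any change in the flash itself.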
It sounds simple enough to cycle through all available blocks before triggering a P/E cycle, but in the real world it just isn’t that easy. As you fill the drive, the data generally falls into two different categories, static data and dynamic data. Static data is something that is written once, or infrequently, and read multiple times. Something like a music file is a good example of this. Dynamic data covers things like log files that are written to frequently, or in our case database files. If you only wear-level the dynamic data you shorten the life of the flash significantly. Alternatively, if you also include the static data, you are now incurring extra write and read IO in the background that can affect the performance of the device.
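As a rough illustration of the dynamic case only, here is a minimal wear-leveling sketch. It assumes a made-up policy: always place the next write on the free block with the fewest erases. The function names are hypothetical, and real controllers use their own, more involved algorithms that also deal with static data.

```python
def pick_block(erase_counts, free_blocks):
    """Return the free block with the fewest erases (ties go to the lowest index)."""
    return min(free_blocks, key=lambda b: (erase_counts[b], b))


def rewrite_many(num_blocks, rewrites):
    """Rewrite one logical block `rewrites` times; return per-block erase counts."""
    erase_counts = [0] * num_blocks
    free_blocks = set(range(num_blocks))
    current = None
    for _ in range(rewrites):
        target = pick_block(erase_counts, free_blocks)
        free_blocks.remove(target)
        if current is not None:
            # The old copy is stale: erase it and return it to the free pool.
            erase_counts[current] += 1
            free_blocks.add(current)
        current = target
    return erase_counts


counts = rewrite_many(num_blocks=4, rewrites=8)
print(counts)
```

Even though the application rewrites the same logical block every time, the erases end up spread almost evenly across the physical blocks. Blocks holding static data never enter the free pool in this sketch, which is exactly the gap described above.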
Background Garbage Collection
To defer the P/E cycle and mitigate the penalty of a block erase, we rely on garbage collection running in the background of the device. When a file is altered it may be completely moved to clean pages and blocks, and the old blocks are then marked as dirty. This tells the garbage collector that it can perform a block erasure on them at any time. This works just fine as long as the drive has enough spare area allocated and the number of write requests is low enough for the garbage collector to keep up. Keep in mind, this spare area isn’t visible to the operating system or the file system and is independent of them. If you run out of free pages to program, you start forcing a P/E cycle for each write, slowing down writes dramatically. Some manufacturers offset this with a large DRAM buffer and may also allow you to change the size of the over-provisioned space.
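A garbage collection pass can be sketched in a few lines. This is a simplified model under my own assumptions: pages are tracked as "valid", "dirty" or "free", and the collector greedily picks the block with the most dirty pages. The collect_one function is hypothetical and does not reflect any particular vendor's firmware.

```python
def collect_one(blocks):
    """Run one greedy garbage collection pass.

    `blocks` maps block id -> list of page states: "valid", "dirty" or "free".
    Returns (victim_id, pages_relocated), or None if no block has dirty pages.
    """
    victim = max(blocks, key=lambda b: blocks[b].count("dirty"))
    if blocks[victim].count("dirty") == 0:
        return None
    moved = blocks[victim].count("valid")
    spare = sum(p.count("free") for b, p in blocks.items() if b != victim)
    if spare < moved:
        raise RuntimeError("not enough free pages to relocate valid data")
    # Copy the still-valid pages into free pages of other blocks.
    to_move = moved
    for bid, pages in blocks.items():
        if bid == victim:
            continue
        for i, state in enumerate(pages):
            if to_move and state == "free":
                pages[i] = "valid"
                to_move -= 1
    # Erase the victim: every page in it becomes free again.
    blocks[victim] = ["free"] * len(blocks[victim])
    return victim, moved


blocks = {
    0: ["valid", "dirty", "dirty", "dirty"],
    1: ["valid", "valid", "dirty", "free"],
    2: ["free", "free", "free", "free"],
}
result = collect_one(blocks)
```

The RuntimeError branch is the situation described above: with no free pages left to relocate into, the device can no longer hide the erase behind background work.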
Another technology that has started to gain momentum is the TRIM command. Fundamentally, this allows the operating system and the storage device to communicate about how much free space the file system has, and allows the device to use that space like the reserve space or the over-provisioned space used for garbage collection. The downside is that it is really only available in Windows 7 and Windows Server 2008 R2. Some manufacturers are including a separate TRIM service for those operating systems that don’t support it natively. Also, TRIM can only be effective if there is enough free space on the file system. If you fill the drive to capacity then TRIM is completely useless. Another thing to consider is that an erasable block may be 256 KB, and we generally format our file system for SQL Server with a 64 KB allocation unit, several times smaller than the erasable block. The last thing to remember, and it is good advice for any device, not just solid state storage, is to grow your files in large chunks to keep file fragmentation down to a minimum. Heavy file fragmentation also cuts down on TRIM’s performance and can’t be easily fixed, since running a defragment may actually make the problem worse: it forces wholesale garbage collection and wears out the flash that much faster.
Another pitfall of wear-leveling and garbage collection is the phenomenon of write amplification, where the device ends up writing more data to flash than the host actually asked it to write. As the device tries to keep up with write requests and garbage collection, it can effectively bring everything to a standstill. Again, writing serially and deleting serially in large blocks can mitigate some of this. Unfortunately, SQL Server access patterns for OLTP style databases mean lots of little inserts, updates and deletes. This adds to the problem. There may be enough free space to accommodate the write, but it is severely fragmented by the write pattern and a large amount of garbage collection is needed. TRIM can help with this if you leave enough free space available. This also means factoring free space into your capacity planning ahead of time. A full solid state device is a poor performing one when it comes to writes.
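Write amplification is usually expressed as a ratio: pages physically written to flash divided by pages the host asked to write. The helper below is a simplified sketch of that ratio. The function name is my own, and it only counts pages relocated by garbage collection, ignoring any other internal writes a real device may perform.

```python
def write_amplification(host_pages, relocated_pages):
    """Pages physically written to flash per page the host asked to write."""
    if host_pages <= 0:
        raise ValueError("host_pages must be positive")
    return (host_pages + relocated_pages) / host_pages


# Plenty of clean blocks: nothing has to be moved out of the way.
best_case = write_amplification(host_pages=100, relocated_pages=0)

# Fragmented drive: garbage collection relocates 150 valid pages along the way.
fragmented = write_amplification(host_pages=100, relocated_pages=150)

# Rewriting one page in a 64-page block whose other 63 pages are still valid.
worst_case = write_amplification(host_pages=1, relocated_pages=63)
```

A ratio of 1.0 is the ideal. The last case shows how a single small update on a full, fragmented device can cost a whole block of writes.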
Error Detection and Correction
The nature of NAND Flash makes it susceptible to several types of data corruption, just as hard drives and floppy disks are at risk near magnetic sources. NAND has several vulnerabilities, and some of them occur even when just reading data.
Data in cells that aren’t being written to can be corrupted by writing to adjacent cells or even pages. This is called Program Disturb. Cells not being programmed receive elevated voltage, causing them to appear weakly programmed. There isn’t any damage to the physical structure, and the condition can be cleared with a normal erase.
Reading repeatedly from the same block can have a similar effect called Read Disturb. Cells not being read collect a charge that causes them to appear weakly programmed. The main difference from Program Disturb is that it always occurs on the block being read, and always on pages not being read. Again, the physical cell isn’t damaged, and an erase on the affected block clears the issue.
Lastly, there is an issue with data retention on cells over time. A floating gate may gain or lose charge over time, making the cell appear to be weakly programmed or in another invalid state. The block is undamaged and can still be reliably erased and written to.
All of this sounds just as catastrophic as it gets. Fortunately, error correcting code (ECC) techniques effectively deal with these issues.
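To show the idea behind ECC, here is a sketch of a classic Hamming(7,4) code, which stores 3 parity bits alongside 4 data bits and can correct any single flipped bit. I picked it because it is small enough to read in one sitting. It is an illustration of the principle, not the specific code any given flash controller uses, and the function names are my own.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct up to one flipped bit; return (data_bits, error_position)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3       # 0 means no error was detected
    if pos:
        c[pos - 1] ^= 1              # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], pos


data = [1, 0, 1, 1]
stored = hamming74_encode(data)
disturbed = list(stored)
disturbed[4] ^= 1                    # simulate one cell reading back wrong
recovered, error_pos = hamming74_decode(disturbed)
```

The spare space in each page is where parity information like this lives, which is why a weakly programmed cell here and there does not turn into lost data.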
Bad Block Management
NAND chips aren’t always perfect. Every chip may have defects of some sort and will ship from the factory with bad blocks already on the device. Bad block management is integrated into the NAND chip. When a cell fails a P/E cycle, the data is written to a different block and the block containing the failed cell is marked bad, making it unavailable to the system.
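Bad block handling boils down to a remap table plus a pool of spares. The sketch below is a toy version under my own assumptions: the BadBlockTable class, its parameters and the spare pool layout are all hypothetical, and it assumes enough good blocks exist to cover the usable range.

```python
class BadBlockTable:
    """Toy remap table: logical block -> physical block, with a pool of spares."""

    def __init__(self, usable_blocks, spare_blocks, factory_bad=()):
        # Blocks marked bad at the factory are never handed out.
        self.bad = set(factory_bad)
        total = usable_blocks + spare_blocks
        physical = [b for b in range(total) if b not in self.bad]
        self.map = dict(enumerate(physical[:usable_blocks]))
        self.spares = physical[usable_blocks:]

    def retire(self, logical):
        """Mark the block behind `logical` bad and remap it to a spare."""
        if not self.spares:
            raise RuntimeError("no spare blocks left")
        self.bad.add(self.map[logical])
        self.map[logical] = self.spares.pop(0)
        return self.map[logical]


table = BadBlockTable(usable_blocks=4, spare_blocks=2, factory_bad=[1])
before = dict(table.map)          # physical block 1 is skipped from the start
replacement = table.retire(2)     # logical block 2 failed a P/E cycle
```

The system keeps addressing logical block 2 as before. Only the table knows the data now lives somewhere else, and once the spares run out the device has nowhere left to remap to.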
As you can see, Flash in some respects is much more complicated than your traditional hard disk. There are many things you must consider when implementing solid state storage.
Flash read performance is great, sequential or random.
Flash write performance is complicated, and can be a problem if you don’t manage it.
Flash wears out over time. It is not nearly the issue it used to be, but you must understand your write patterns.
Plan for over-provisioning and TRIM support. They can have a huge impact on how much storage you actually buy.
Flash can be error prone. Be aware that writes and reads can cause data corruption.
We will talk about MLC vs SLC, what makes a device enterprise ready, and how to effectively benchmark your solid state storage so you are not caught off guard when you move into production.