
Solid State Storage: Enterprise State Of Affairs

Here In A Flash!

It’s been a crazy last few years in the flash storage space. Things really started taking off around 2006 when NAND flash and Moore’s Law got together. In 2010 it was clear that flash storage was going to be a major part of your storage makeup in the future. It may not be NAND flash specifically, though. It will be some kind of memory and not spinning disks.

Breaking The Cost Barrier.

For the last few years, I’ve always told people to price storage out on the cost of IO, not the cost of capacity. Flash storage was mainly a niche product solving a niche problem, like speeding up random IO heavy tasks. With the cost of flash storage now at or below standard disk based SAN storage, with all the same connectivity features and the same software features, I think it’s time to put flash storage on the same playing field as our old stalwart SAN solutions.

Right now at the end of 2012, you can get a large amount of flash storage. There is still this perception that it is too expensive and too risky to build out all flash storage arrays. I am here to prove at least cost isn’t as limiting a factor as you may believe. Traditional SAN storage can run you from 5 dollars a Gigabyte to 30 dollars a Gigabyte for spinning disks. You can easily get into an all flash array in that same range.

Here’s Looking At You Flash.

This is a short list of flash vendors currently on the market. I’ve thrown in a couple of non-SAN types and a couple of traditional SANs that have integrated flash storage in them. Please, don’t email me complaining that X vendor didn’t make this list or that Y vendor has different pricing. All the pricing numbers were gathered from published sources on the internet. These sources include the vendors’ own websites, published costs from TPC executive summaries and official third party price listings. If you are a vendor and don’t like the prices listed here then publicly publish your price list.

There are always two cost metrics I look at: dollars per Gigabyte in raw capacity and dollars per Gigabyte in usable capacity. The first number is pretty straightforward. The second metric can get tricky in a hurry. On a disk based SAN that pretty much comes down to what RAID or protection scheme you use. Flash storage almost always introduces deduplication and compression, which can muddy the waters a bit.
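To make the raw versus usable distinction concrete, here is a minimal Python sketch of the arithmetic. The function names and the 50 percent mirroring efficiency are my own assumptions for illustration only; the drive count and total price are the DS3524 figures quoted further down this list.

```python
def dollars_per_gb(total_cost, capacity_gb):
    """Cost in dollars for each gigabyte of the given capacity."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return total_cost / capacity_gb


def usable_gb(raw_gb, protection_efficiency=1.0, data_reduction_ratio=1.0):
    """Estimate usable capacity from raw capacity.

    protection_efficiency: fraction of raw space left after RAID or other
        protection (0.5 is an assumed figure for simple mirroring).
    data_reduction_ratio: assumed gain from deduplication and compression
        (1.0 means incompressible data, the worst case).
    """
    return raw_gb * protection_efficiency * data_reduction_ratio


# DS3524 figures from this list: 112 drives of 200 GB, 702,908.00 dollars total.
raw_gb = 112 * 200
total_cost = 702_908.00

raw_price = dollars_per_gb(total_cost, raw_gb)
# Hypothetical: the same frame fully mirrored, with no data reduction.
mirrored_price = dollars_per_gb(total_cost, usable_gb(raw_gb, protection_efficiency=0.5))

print(f"raw:      {raw_price:.2f} dollars per Gigabyte")
print(f"mirrored: {mirrored_price:.2f} dollars per Gigabyte")
```

Setting data_reduction_ratio to 1.0 is the incompressible worst case; anything a vendor claims above that should be tested against your own data before it goes into the price comparison.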

Fibre Channel/iSCSI vendor list

Nimbus Data

Appearing on the scene in 2006, they have two products currently on the market: the S-Class storage array and the E-Class storage array.

The S-Class seems to be their lower end entry but does come with an impressive software suite. It provides 10GbE and Fibre Channel connectivity. Looking around at the cost for the S-Class I found a 2.5TB model for 25,000 dollars. That comes out to 9.70 dollars per Gigabyte in raw space. The E-Class is their super scalable and totally redundant unit. I found a couple of quotes that put it at 10.00 dollars a Gigabyte of raw storage. Already we have a contender!

Pure Storage

In 2009 Pure Storage started selling their flash only storage solutions. They include deduplication and compression in all their arrays and include that in the cost per Gigabyte. I personally find this a bit fishy since I always like to test with incompressible data as a worst case for any array. This would also drive up their cost. They claim between 5.00 and 10.00 dollars per usable Gigabyte and I haven’t found any solid source for public pricing on their array yet to dispute or confirm this number. They also have a generic “compare us” page on their website that at best is misleading and at worst plain lies. Since they don’t call out any specific vendor in their comparison page it’s hard to pin them for falsehoods, but you can read between the lines.

Violin Memory

Violin Memory started in earnest around 2005 selling not just flash based but memory based arrays. Very quickly they transitioned to all flash arrays. They have two solutions on the market today. The 3000 series allows some basic SAN style setups but also has direct attachments via external PCIe channels. It comes in at 10.50 dollars a Gigabyte raw and 12.00 dollars a Gigabyte usable. The 6000 series is their flagship product and the pricing reflects it. At 18.00 dollars per Gigabyte raw it is getting up there on the price scale. Again, not the cheapest, but they are well established and are used and resold by HP.

Texas Memory Systems/IBM

If you haven’t heard, TMS was recently purchased by IBM. They are based in Houston, TX and I’ve always had a soft spot for them. They were also the first non-disk based storage solution I ever used. The first time I put a RamSan in and got 200,000 IOs out of the little box I was sold. Of course it was only 64 Gigabytes of space and cost a small fortune. Today they have a solid flash based, Fibre Channel attached and iSCSI attached lineup. I couldn’t find any pricing on the current flagship RamSan 820 but the 620 has been used in TPC benchmarks and is still in circulation. It is a heavyweight at 33.30 dollars a Gigabyte of raw storage.


A new entrant into this space, they are boasting some serious cost savings. They claim 3.00 dollars per usable Gigabyte on their currently shipping product. The unit also includes options for deduplication and compression, which can drive the cost down even further. It is also a half depth 1U solution with a built-in 10GbE switch. They are working on a fault tolerant unit due out in the second half of next year that will up the price a bit but add Fibre Channel connectivity. They have a solid pedigree as they are made up of the guys that brought the SandForce controllers to market. They aren’t a proven company yet, and I haven’t seen a unit or been granted access to one either. Still, I’d keep an eye on them. At those price points and the crazy small footprint it may be worth taking a risk on them.


I’m putting the DS3524 in as a separate entry to give you some contrast. This is a traditional SAN frame that has been populated with all SSD drives. With 112 200 Gigabyte drives and a total cost of 702,908.00 dollars it comes in at 31.00 dollars a Gigabyte of raw storage. That is on the higher end but still in the price range I generally look to stay in.


I couldn’t resist putting a Sun F5100 in the mix. At 3,099,000.00 dollars it is the most expensive array I found listed. It has 38.4 Terabytes of raw capacity, giving us an 80.00 dollars per Gigabyte price tag. Yikes!

Dell EqualLogic

Dell gobbled up EqualLogic, a SAN manufacturer that focused on iSCSI solutions, back in 2008, a couple of years before it lost out on the 3Par deal. This isn’t a flash array. I wanted to add it as contrast to the rest of the list. I found a 5.4 Terabyte array with a 7.00 dollar per Gigabyte raw storage price tag. Not horrible, but still more expensive than some of our all flash solutions.


What list would be complete without including the current king of the PCIe flash hill, Fusion-io? I found a retail price listing for their 640 Gigabyte Duo card at 19,000 dollars, giving us 29.00 dollars per usable Gigabyte. Looking at the next lowest card, the 320 Gigabyte Duo at 7495.00 dollars ups the price to 32.20 per usable Gigabyte. They are wicked fast though :)

So Now What?

Armed with a bit of knowledge you can go forth and convince your boss and storage team that a SAN array fully based on flash is totally doable from a cost perspective. It may mean taking a bit of a risk but the rewards can be huge.


SQLSaturday #63, Great Event!


I actually had an early morning session and gave my Solid State Storage talk and had a great time. The audience was awesome, asked very smart questions and I didn’t run over time. The guys and gals here in Dallas have put on another great event and it isn’t even lunch time yet!

As promised, here is the slide deck from today’s session. As always, if you have any questions please drop me a line.

Solid State Storage Deep Dive

Changing Directions

I See Dead Tech….

Knowing when a technology is dying is always a good skill to have. Like most of my generation, I wasn’t the first on the computer scene but lived through several of its more painful transitions. As a college student I was forced to learn antiquated technologies and languages. I had to take a semester of COBOL. I also had to take two years of assembler for the IBM 390 mainframe and another year of assembler for the x86 focused on the i386 when the Pentium was already on the market. Again and again I’ve been forced to invest time in dying technologies. Well not any more!

Hard drives are dead LONG LIVE SOLID STATE!

I set the data on a delicate rinse cycle

I’m done with spinning disks. Since IBM invented them in nineteen and fifty seven they haven’t improved much over the years. They got smaller and faster yes but they never got sexier than the original. I mean, my mom was born in the fifties, I don’t want to be associated with something that old and way uncool. Wouldn’t you much rather have something at least invented in the modern age in your state of the art server?

Don’t you want the new hotness?

I mean seriously, isn’t this much cooler? I’m not building any new servers or desktop systems unless they are sporting flash drives. But don’t think this will last. You must stay vigilant, NAND flash won’t age like a fine wine either. There will be something new in a few years and you must be willing to spend whatever it takes to deploy the “solid state killer” when it comes out.

Tell Grandpa Relational is Soooo last century

The relational model was developed by Dr. E.F. Codd while at IBM in 1970, two years before I was born. Using some fancy math called tuple calculus he proved that the relational model was better at seeking data on these new “hard drives” that IBM had lying around. That later turned into the relational algebra that is used today. Holy cow! I hated algebra AND calculus in high school, why would I want to work with that crap now?

NoSQL Is The Future!

PhD’s, all neck ties and crazy gray hair.

Internet Scale, web 2.0 has a much better haircut.

In this new fast paced world of web 2.0 and databases that have to go all the way to Internet scale, the old crusty relational databases just can’t hang. Enter, NoSQL! I know that NoSQL covers a lot of different technologies, but one of the core things they do very well is scale up to millions of users and I need to scale that high. They do this by side stepping things like relationships, transactions and verified writes to disk. This makes them blazingly fast! Plus, I don’t have to learn any SQL languages, I can stay with what I love best, JavaScript and JSON. Personally, I think MongoDB is the best of the bunch. They don’t have a ton of fancy PhD’s, they are getting it done in the real world! Hey, they have a Success Engineer for crying out loud!!! Plus if you are using Ruby, Python, Erlang or any other real Web 2.0 language it just works out of the box. Don’t flame me about your NoSQL solution and why it is better, I just don’t care. I’m gearing up to hit all the major NoSQL conferences this year and canceling all my SQL Server related stuff. So long PASS Summit, no more hanging out with people obsessed with outdated skills.

Head in the CLOUD

Racks and Racks of Spaghetti photo by: Andrew McKaskill

Do you want this to manage?

Or this?

With all that said, I probably won’t be building too many more servers anyway. There is a new way of getting your data and servers without the hassle of buying hardware and securing it, THE CLOUD!

“Cloud computing is computation, software, data access, and storage services that do not require end-user knowledge of the physical location and configuration of the system that delivers the services. Parallels to this concept can be drawn with the electricity grid where end-users consume power resources without any necessary understanding of the component devices in the grid required to provide the service.”

Now that’s what I’m talking about! I just plug in my code and out comes money. I don’t need to know how it all works on the back end. I’m all about convenient, on-demand network access to a shared pool of configurable computing resources. You know, kind of like when I was at college and sent my program to a sysadmin to get a time slice on the mainframe. I don’t need to know the details just run my program. Heck, I can even have a private cloud connected to other public and private clouds to make up The Intercloud(tm). Now that is sexy!

To my new ends I will be closing this blog and starting up a new one to document my new journey. I’ll only be posting once a year though, on April 1st.

See you next year!

Moore’s Law May Be The Death of NAND Flash

"It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so." -  Mark Twain

I try and keep this quote in my mind whenever I’m teaching about new technologies. You often hear the same things parroted over and over again long after they quit being true. This problem is compounded by fast moving technologies like NAND Flash.

If you have read my previous posts about Flash memory you are already aware of NAND flash endurance and reliability. Just like CPU manufacturing processes, flash receives a boost in capacity as you decrease the size of the transistors/gates used on the device. In CPUs you get increases in speed; on flash you get increases in size. The current generation of flash is manufactured on a 32nm process. This nets four gigabytes per die. Die size isn’t the same as chip, or package, size. Flash dies are stacked inside the chip package, giving us sixteen gigabytes per package. With the new die shrink to 25nm we double that to eight gigabytes and thirty two gigabytes respectively. That sounds great, but there is a dark side to the ever shrinking die. As the size of the gate gets smaller it becomes more unreliable and has less endurance than the previous generation. MLC flash suffers the brunt of this but SLC isn’t completely immune.

Cycles And Errors

One of the things that always comes up when talking about flash is the fact it wears out over time. The numbers that always get bandied about are that SLC is good for 100,000 writes to a single cell and MLC dies at 10,000 cycles. This is one of those things that just ain’t so any more. Right now, write cycles for current mainstream MLC flash based on the 32nm process are down to 5000 or so. 25nm cuts that even further to 3000, with higher error rates to boot.

Several manufacturers have announced the transition to 25nm on their desktop drives, Intel and OCZ being two of the biggest. Intel is a partner with Micron. They are directly responsible for developing and manufacturing quite a bit of the NAND flash on the market. OCZ is a very large consumer of that product. So, what do you do to offset the issues with 25nm? Well, the same thing you did to offset that problem with 32nm: more spare area and more ECC. At 32nm it wasn’t unusual to see 24 bits of ECC per 512 bytes. Now, I’ve seen numbers as high as 55 bits per 512 bytes to give 25nm the same protection.

To give you an example here is OCZ’s lineup with raw and usable space listed.

| Drive Model | Production Process | Raw Capacity (in GB) | Usable Capacity (in GB) |
|---|---|---|---|
| OCZSSD2‐2VTXE60G | 25nm | 64 | 55 |
| OCZSSD2‐2VTX60G | 32nm | 64 | 60 |
| OCZSSD2‐2VTXE120G | 25nm | 128 | 118 |
| OCZSSD2‐2VTX120G | 32nm | 128 | 120 |

As you can clearly see, the usable space on the 25nm drives is significantly decreased. There is a second problem specific to the OCZ drives as well. Since they are now using higher density modules they are only using half as many of them. Since most SSDs get their performance from multiple read/write channels, cutting that in half isn’t a good thing.
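If you want to see how much space each drive holds back, the arithmetic is quick to sketch in Python. The numbers are the ones from the OCZ table above; the function name is just my own choice for this illustration.

```python
# (model, process, raw GB, usable GB) taken from the OCZ table above.
drives = [
    ("OCZSSD2-2VTXE60G", "25nm", 64, 55),
    ("OCZSSD2-2VTX60G", "32nm", 64, 60),
    ("OCZSSD2-2VTXE120G", "25nm", 128, 118),
    ("OCZSSD2-2VTX120G", "32nm", 128, 120),
]


def reserved_fraction(raw_gb, usable_gb):
    """Fraction of raw flash that is not exposed as usable space."""
    return (raw_gb - usable_gb) / raw_gb


for model, process, raw_gb, usable in drives:
    pct = reserved_fraction(raw_gb, usable) * 100
    print(f"{model:20s} {process}  {pct:5.2f} percent held back")
```

On these figures the 32nm drives hold back 6.25 percent of the raw flash, while the 25nm drives hold back roughly 14 percent on the 60G model and just under 8 percent on the 120G model.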

SLC is less susceptible to this issue but it is happening. At 32nm SLC was still in the 80,000 to 100,000 range for write cycles but the error rate was getting higher. At 25nm that trend continues and we are starting to see some of the same techniques used in MLC coming to SLC as ECC creeps up from 1 bit per 512 bytes to 8 bits or more per 512 bytes. Of course the down side to SLC is it is half the capacity of MLC. As die shrinks get smaller SLC may be the only viable option in the enterprise space.

It’s Non-Volatile… Mostly

Another side effect of shrinking the floating gate size is the loss of charge due to voltage bleed off over time. When I say “over time” I’m talking weeks or months and not years or decades anymore. The data on these smaller and smaller chips will have to be refreshed every few weeks. We aren’t seeing this severe an issue at the 25nm level but it will be coming unless they figure out a way to change the floating gate to prevent it.

Smaller Faster Cheaper

If you look at trends in memory and CPU you see that every generation the die gets smaller, capacity or speed increases and they become cheaper as you can fit double the chips on a single wafer. There are always technical issues to overcome with every technology. But NAND flash is the only one that gets so inherently unreliable at smaller and smaller die sizes. So, does this mean the end of flash? In the short term I don’t think so. The fact is we will have to come up with new ways to reduce writes and add new kinds of protection and more advanced ECC. On the pricing front we are still in a position where demand is outstripping supply. That may change somewhat as 25nm manufacturing ramps up and more factories come online but as of today, I wouldn’t expect a huge drop in price for flash in the near future. If it was just a case of SSDs consuming the supply of flash it would be a different matter. The fact is your cell phone, tablet and every other small portable device uses the exact same flash chips. Guess who is shipping more, SSDs or iPhones?

So, What Do I Do?

The easiest thing you can do is read the label. Check what manufacturing process the SSD is using. In some cases, like OCZ, that wasn’t a straightforward proposition. In most cases though the manufacturer prints raw and formatted capacities on the label. Check the life cycle/warranty of the drive. Is it rated for 50 gigabytes of writes or 5 terabytes of writes a day? Does it have a one year warranty or 5 years? These are indicators of how long the manufacturer expects the drive to last. Check the error rate! Usually the error rate will be expressed as one unrecoverable write or read error per some number of bits. Modern hard drives are in the 10^15 to 10^17 range. Some enterprise SSDs are in the 10^30 range. This tells me they are doing more ECC than the flash manufacturer “recommends” to keep your data as safe as possible.
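To put those error rates in perspective, here is a small Python sketch that turns a rating into an expected number of unrecoverable errors for a given amount of IO. It assumes the rating is read as "one error per N bits", that errors are independent, and it uses a hypothetical workload of one terabyte a day; the function name is mine.

```python
def expected_unrecoverable_errors(bytes_transferred, bits_per_error):
    """Expected error count, assuming one unrecoverable error per bits_per_error bits."""
    return bytes_transferred * 8 / bits_per_error


# Hypothetical workload: one terabyte (10**12 bytes) a day for a year.
yearly_bytes = 365 * 10**12

for label, bits_per_error in [("10^15", 10**15), ("10^17", 10**17), ("10^30", 10**30)]:
    errors = expected_unrecoverable_errors(yearly_bytes, bits_per_error)
    print(f"rated at {label} bits: {errors:.3g} expected errors a year")
```

At the low end of the hard drive range that workload expects a few unrecoverable errors a year, while at 10^30 the expected count is effectively zero.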

Fundamentals of Storage Systems, Understanding Reliability and Performance of Solid State Storage

Solid state storage has come on strong in the last year. With that explosion of new products it can be hard to look at all the vendor information and decide which device is best for you. Between manufacturers using different methods to benchmark their products and quoting separate read and write numbers gathered under different methodologies, it can be extremely confusing. If you haven’t read Solid State Storage Basics you may not understand all the terms used in this article.

SLC and MLC Characteristics and Differences

Right now there are two main flavors of NAND Flash in use: Single Level Cell (SLC) and Multi Level Cell (MLC). SLC stores a single bit per cell while MLC can store two bits. There are flavors of MLC that can store three and four bits per cell but they are unsuitable at this time for mass storage like hard drives. They have very low endurance and wear out quickly.

SLC has several desirable characteristics that have made it the choice for enterprise applications for quite a while. It is more durable in every way over MLC. Where it loses out is on capacity and price.

| Measure | SLC | MLC |
|---|---|---|
| Read Speed | ~25 microseconds | ~50 microseconds |
| Write Speed | ~220 microseconds | ~900 microseconds |
| P/E Cycles | 100k to 300k | 3k to 30k |
| Minimum ECC Bits required | 1 bit per 512 bytes | 12 bits per 512 bytes |
| Block Size | 64KB | 128KB |


SLC can cost as much as five times what MLC does. This alone is enough for many manufacturers to look at MLC over SLC. Couple that with the increased capacity and MLC becomes a compelling alternative for mass storage. The problem has been how to make MLC reliable in the enterprise.

Enterprise Reliability

As you can see, SLC is more robust, requiring less error correcting code to fix data issues. Just a few years ago, MLC wasn’t considered good enough to be in even consumer grade drives. Over the last three years several manufacturers have focused on building NAND Flash controllers that could compensate for this using large amounts of error correction, in some cases several times the 12 bits per 512 bytes. This, combined with better garbage collection and wear-leveling algorithms, has finally extended MLC into the enterprise. This comes with a price though. ECC has to be stored somewhere, usually sacrificing storage space, and you need a much more powerful controller to handle the calculations without hurting performance.

Another one of the techniques used to extend performance and endurance is to put as many chips as possible in a parallel arrangement with multiple channels. Think of it as RAID on a chip level instead of a hard disk level. This allows them to spread the IO load as wide as possible. The larger the capacity of the storage device, the more area it has to use things like TRIM and its own internal garbage collection across multiple NAND chips, keeping IO from stalling out due to write amplification. It also increases the life of the device since you can spread the wear-leveling out.

There are standards bodies like JEDEC that help define endurance and longevity but you must still read the fine print. A good example is the Intel product manual for the X25-M SSD. If you look at page 6 you see the minimum useful life rated at 3 years. But, if you look at the write endurance you see that the 80 gigabyte drive is rated at 7.5 terabytes. That is 7.5 terabytes period, for the life of the drive. That means you shouldn’t write more than 21 gigabytes a day in changed data to the drive, and even that pace uses up the full 7.5 terabytes in about a year; spread across the 3 year life it is closer to 7 gigabytes a day. For SQL Server that can be quite a low number. I’ve seen data warehousing processes load multiple terabytes over an 8 hour load window. Again, capacity equals endurance: the 160 gigabyte drive can sustain 15 terabytes worth of data change. Intel will tell you that the X25-M is meant for enterprise workloads; they are wrong.

In contrast, the X25-E SSD has a much longer life due to the SLC it uses instead of MLC. The 32 gigabyte version supports 1 petabyte of random writes and the 64 gigabyte drive supports 2 petabytes of random writes over the life of the drive. This makes the X25-E a better candidate for server workloads. Fusion-io rates their MLC based ioDrive at 5 terabytes a day. They also claim a life expectancy of 16 years. That is 28 petabytes of writes. This is just to show you that with enough engineering you can have an MLC based device still be very reliable.
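The endurance ratings are easier to compare once you turn them into a daily write budget. This is a rough Python sketch of that arithmetic, assuming decimal units (1 terabyte = 1000 gigabytes) and 365 day years; the function names are my own, and the inputs are the ratings quoted in this section.

```python
def daily_write_budget_gb(rated_endurance_tb, life_years):
    """Gigabytes you can write per day so the rating lasts life_years."""
    return rated_endurance_tb * 1000 / (life_years * 365)


def lifetime_writes_pb(tb_per_day, life_years):
    """Total petabytes written at tb_per_day over life_years."""
    return tb_per_day * 365 * life_years / 1000


# 80 gigabyte MLC drive: 7.5 terabytes of total write endurance.
print(f"over 3 years: {daily_write_budget_gb(7.5, 3):.1f} GB a day")
print(f"over 1 year:  {daily_write_budget_gb(7.5, 1):.1f} GB a day")

# Fusion-io claim: 5 terabytes a day for 16 years.
print(f"Fusion-io:    {lifetime_writes_pb(5, 16):.1f} PB over the life of the card")
```

With decimal units the Fusion-io figure comes out at a little over 29 petabytes; counting in units of 1024 brings it closer to the 28 quoted. Either way it is several thousand times the total endurance of the 80 gigabyte MLC drive.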

SATA, SAS or Neither?

The interface for your solid state disk is also critical to the performance of the drive. We are quickly hitting a wall with SATA II and solid state, where a single SSD can saturate a single SATA channel. SAS and SATA have both released new third generation standards allowing up to 600 megabytes a second of throughput, but even that doesn’t offer much headroom for growth. Several manufacturers are calling their SSD offerings enterprise even though they are on a SATA interface. If you are building a high performance IO subsystem SATA isn’t the best option. With SATA II and the addition of Native Command Queuing it did get a lot better but it still falls short of SAS in several areas.


| Feature | SAS | SATA |
|---|---|---|
| Command Queuing | TCQ supports queue depths up to 2^16, usually capped at 64 | NCQ supports queue depths up to 32 |
| Error recovery and detection | Uses the SCSI command set, which is more robust | SMART, proven to be inadequate (see Google paper) |
| Duplex | Full duplex, dual port per drive | Half duplex, single port |
| Multi-path IO | Fully supported at drive level | Supported in SATA II via expanders |

Some of these features were nice, but if you were choosing between a 7200 RPM SATA drive and a 7200 RPM SAS drive there wasn’t a huge difference. Add in flash though and SATA very quickly shows its shortcomings. I cannot stress enough how important command queuing is to flash storage. If the drive you have picked supports NCQ, make sure your HBA supports NCQ and AHCI mode to get the most out of it. PC Perspective has a nice write up on this. Lastly, most SATA drives don’t honor the OS request to disable write caching on the drive. This is a big deal for SQL Server, where protecting the data is very important. That alone usually keeps me from putting critical databases on SATA based storage. Most RAID HBAs may let you toggle the drive’s write cache on or off on a per drive basis but there is still no guarantee that the drive will honor that request either.

PCIe add in cards
If you aren’t limited to the standard 3.5” or 2.5” form factor and can choose a PCIe based flash device I would recommend starting with Fusion-io. I haven’t had any experience with the Texas Memory Systems PCIe card though. OCZ, Super Talent and others like them use a combination of bridge chips, RAID controller chips and flash controller chips to build up their SATA PCIe offerings. The form factor may be more convenient but they are ultimately the same as multiple SATA drives plugged into a RAID HBA.

The last thing to remember is that TRIM doesn’t work through RAID HBAs. SAS or SATA, it doesn’t matter.

Performance Characteristics

By the numbers
I see people quote performance numbers from different manufacturers about just how fast their particular solid state storage is. The problem is, there is no real standard for measuring performance and it can be almost impossible to do an apples to apples comparison between two different devices. If you start at the product specification for the X25-M you see what you expect: 4K read IOPS of 35,000 at 100 percent span (using the entire drive). Write IOPS, however, are a little different. Using 100 percent span the IO/Sec drop to 350. If you only use one tenth of the drive it shoots up to 3300. The difference is startling. Using an old technique called short stroking, they are able to show the drive in a better light. Using this technique on hard disks yields higher IOs per second at the cost of capacity and throughput. Applying this technique to a solid state disk limits the amount of data space used for writes and gives the maximum amount of free space for wear-leveling and garbage collection, greatly reducing the write amplification effect. Rarely do you see the lower number quoted. On the X25-E all numbers are quoted at full span, showing again the higher performance of SLC.

Also, if you look at the footnotes, all write tests were done with drive caches enabled. For SQL Server this is a bad idea; if you have a power outage any data in the drive cache is lost. They also perform these tests at the maximum queue depth that Native Command Queuing (NCQ) can handle. Again, this pushes the device to its peak throughput. This isn’t a bad thing for SSDs, but most SQL Server setups have been engineered to keep queue depths low to decrease latencies from the IO system, which is usually made up of spinning disks. If you don’t have latency issues now, you may not see a huge improvement by replacing your spinning disks with solid state ones.

Size of the IO request is also very important. Usually for number of IOs they will use a sector sized request. On SSDs that is normally 4 kilobytes. For throughput in megabytes per second they use a 128 kilobyte request to get higher numbers. So, when you read the specifications you get the impression that a drive will do, say, 260 MB/sec at 35,000 IOs/sec, which just isn’t true. This isn’t a new game; hard drive benchmarks also do something similar. As you look at the 4k numbers you can effectively cut them in half since SQL Server works on an 8k page request size. SSDs also perform differently on random and sequential IO loads, just like hard disks do. When you look at the specification make sure to note the IO mix. If they don’t give those numbers, assume that you will have to do your own testing!
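The mismatch between the IOPS number and the throughput number is easy to check for yourself. Here is a quick Python sketch, assuming 1 megabyte = 1024 kilobytes; the function names are mine, and the 35,000 IOPS and 260 MB/sec figures are the ones discussed above.

```python
def throughput_mb_per_sec(iops, request_kb):
    """Throughput implied by a given IOPS figure at one request size."""
    return iops * request_kb / 1024


def iops_for_throughput(mb_per_sec, request_kb):
    """IOPS implied by a given throughput figure at one request size."""
    return mb_per_sec * 1024 / request_kb


# 35,000 IOPS is measured with 4 kilobyte requests.
print(f"35,000 IOPS at 4 KB is   {throughput_mb_per_sec(35_000, 4):.1f} MB/sec")

# 260 MB/sec is measured with 128 kilobyte requests.
print(f"260 MB/sec at 128 KB is  {iops_for_throughput(260, 128):.0f} IOPS")

# Rough rule from the text: an 8 KB SQL Server page halves the 4 KB number.
print(f"8 KB pages, best case:   {35_000 * 4 // 8} IOPS")
```

The two headline numbers come from two different tests, so the drive never delivers both at once. The last line is only the simple halving rule from the text, not a measured result.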

Previous Writes Affect Future Writes
Another issue with the performance numbers quoted has to do with the state of the drive. When a solid state disk is new, i.e. never been written to, it is at its peak. Performance will be the best it is ever going to be. When you test your solid state devices, doing short duration tests can be very misleading. As I have already pointed out, if you only use a small section of the drive for writes you get inflated numbers. If you only do a short test on the entire drive you are effectively doing the same thing. You must test the entire drive. You must also understand your workload. If you don’t know what the workload will be, don’t be afraid to test a wide range of IO sizes and types. Sequential writes tend to leave large contiguous blocks of free space, making garbage collection faster. In contrast, random writes typically leave lots of small blocks of free space, forcing garbage collection to work overtime and slowing writes down. As you move from one IO type to another you should add in extra time for the drive to settle into a new steady state before resuming valid samples. Your goal is to get the drive to perform in a predictable manner for your IO load. Realize you may need to discard a range of samples that cover the transition from one steady state to the other. Leaving them in can lower or inflate your averages and cause you to under or over provision your storage to meet your IO requirements.
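Discarding the transition samples can be as simple as the Python sketch below. The sample values are made up for illustration, and the window size and 10 percent tolerance are assumptions you would tune for your own tests, not any kind of standard.

```python
from statistics import mean


def is_steady(samples, window=5, tolerance=0.10):
    """True when the last `window` samples all sit within `tolerance` of their mean."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    avg = mean(recent)
    return all(abs(s - avg) <= tolerance * avg for s in recent)


def steady_state_mean(samples, warmup):
    """Average after dropping the first `warmup` samples taken during the transition."""
    if warmup >= len(samples):
        raise ValueError("warmup discards every sample")
    return mean(samples[warmup:])


# Hypothetical write IOPS per interval on a fresh drive settling under random writes.
write_iops = [35000, 30000, 20000, 9000, 4000, 3400, 3300, 3350, 3300, 3250]

print("average with transition samples:", mean(write_iops))
print("average at steady state:        ", steady_state_mean(write_iops, warmup=5))
print("settled:", is_steady(write_iops))
```

In this made up run, leaving the fresh-drive samples in more than triples the average, which is exactly the kind of inflated number that leads to under provisioning.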

Performance over Time
Unlike a hard drive, a solid state disk’s performance degrades over time as you use it, for several reasons. In the case of the X25-M the first firmware suffered from poor garbage collection and IO pattern recognition on large volumes of small IOs, causing the drive to suffer as much as a fivefold decrease in write performance. We aren’t just talking small files but small changes to large files, like SQL Server data files. This particular problem was partially fixed with a firmware update. In general, all solid state devices suffer from this to some degree. As you use your drive over a longer period it will also lose performance as part of the normal wear on the NAND Flash chips themselves. They develop more errors, which cause more write retries. These issues are corrected using ECC and bad block management, but it still leads to poorer performance. SLC has an advantage over MLC again due to its much higher endurance but isn’t 100 percent immune to this. If you replace your hardware on a three or five year cycle this may not be a huge issue for you, but it still pays to monitor the performance over time.


There is a lot to learn when it comes to solid state storage. Making sure you do your own testing and research can keep you from suffering from premature failure and poor performance down the road. Remember, NAND Flash has been around for a while but this new wave of solid state storage is only a few years old. Not having a large pool of these devices in the field for longer than their rated life span makes it hard to predict if they are truly as reliable as we all hope they are.

Fundamentals of Storage Systems, Solid State Storage Basics

Solid state storage is the new kid on the block. We see new press releases every day about just how awesome this new technology is. Like with any technology, you need a solid foundation in how it works before you can decide if it is right for you. Let’s review what solid state storage is and where it differs from traditional hard disks. I will cover solid state storage in a general manner, not favoring any specific flash manufacturer or specific type of Flash.

Types of Flash
NAND Flash Structure
NAND Flash Read Properties
NAND Flash Write Properties
Garbage Collection
Write Amplification
Error Detection and Correction

Flash Memory

Flash is a type of memory like the RAM in your computer. There are several key differences though. First, Flash is non-volatile, meaning it doesn’t require electricity to maintain the data stored in it. It also has very fast access times, not quite RAM’s access times but in between RAM and a spinning hard disk. It does wear out as you write to it over time. There are several types of Flash memory. The two most common types are NAND and NOR. Each has its benefits. NOR has the ability to write in place and has consistent and very fast read access times but very slow write access times. NAND has a slower read access time but is much faster to write to. This makes NAND more attractive for mass storage devices.

The Structure of NAND Flash

NAND stores data in a large serial array of transistors. Each transistor can store data. NAND Flash arrays are grouped first into pages. A page consists of data space and spare space. Spare space is physically the same as data space but is used for things like ECC and wear-leveling, which we will cover shortly. Usually, a page is 4096 bytes for data and 1 to 4 bits of spare for each 512 bytes of data space. Pages are grouped again into blocks of 64 to 128 pages, which is the smallest erasable unit. There can be quite a few blocks per actual chip, as many as 16 thousand blocks or 8 gigabytes worth, on a single chip. From there manufacturers group chips together usually in a parallel arrangement using controllers to make them look like one large solid state disk. They can vary in form factor. Most common are a standard 2.5” or 3.5” drive or a PCIe device.
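
The capacity figures above can be checked with a little arithmetic. This is only a sketch using the numbers quoted in this section; it assumes that "16 thousand" means 16,384 blocks and that the block holds 128 pages, and it counts data space only, not spare space. The names are made up for illustration.

```python
# Illustrative NAND geometry taken from the figures quoted in the text.
PAGE_DATA_BYTES = 4096        # data area of one page
PAGES_PER_BLOCK = 128         # upper end of the 64 to 128 page range
BLOCKS_PER_CHIP = 16 * 1024   # assumption: "16 thousand" means 16,384


def block_bytes(page_bytes=PAGE_DATA_BYTES, pages=PAGES_PER_BLOCK):
    """Size of one erasable block, counting data space only."""
    return page_bytes * pages


def chip_bytes(blocks=BLOCKS_PER_CHIP):
    """Data capacity of a single chip built from full-size blocks."""
    return block_bytes() * blocks


print(block_bytes() // 1024, "KiB per erasable block")
print(chip_bytes() // 1024**3, "GiB per chip")
```

With 64 pages per block instead of 128, the same function gives a 256 KiB erasable block.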

NAND Read Properties

NAND Flash operates differently than RAM or a hard disk. Even though NAND is structured in pages like a hard disk, that is where the similarities end. NAND is structured to be accessed serially. As a type of memory, NAND flash is a poor choice for random access patterns. A 15,000 RPM hard disk may have a random access seek time of 5.5 milliseconds. It has to spin a disk and position the read/write head. NAND on the other hand doesn’t actually seek. It does a look up and reads the memory area. That takes between 25 and 50 microseconds. It has the same read time no matter the type of operation, random or sequential. A single NAND chip may be able to read between 25 and 40 megabytes a second. So, even though it is considered a poor performer for random IO, it is still orders of magnitude faster than a hard disk.

NAND Write Properties

NAND Flash has a much faster read speed than write speed. The same NAND chip that reads at 40 megabytes a second may only sustain 7 megabytes a second in write speed. Programming a page takes about 250 microseconds on average. This figure only includes programming a page. Writing to flash can be much more complicated if there is already data in the page.

Program Erase Cycle

NAND does writes based on a program erase (P/E) cycle. When a NAND block is considered erased, all bits are set to 1. As you program a bit you set it to 0. The program cycle writes a page at a time and can be pretty quick. NAND doesn’t support an overwrite mode where a bit, page or even block can be overwritten without first being reset to a cleared state. The P/E cycle is very different from what happens on a hard disk, which can overwrite data without first having to clear a sector. Erasing a block takes between 500 microseconds and 2 milliseconds. Each P/E cycle wears on the NAND block. After so many cycles the block becomes unreliable and will fail to program or erase.
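
The program and erase rules can be sketched as a toy model. This is not how any real controller is written; `NandBlock` and its methods are hypothetical names, and raising an error on an illegal overwrite is a modeling choice made here to make the rule visible. The only things taken from the text are that an erased block is all 1s, programming happens a page at a time and can only move bits to 0, and erasing happens a whole block at a time.

```python
class NandBlock:
    """Toy model of one NAND block: program clears bits, erase sets them all."""

    def __init__(self, pages=64, page_bytes=4096):
        self.page_bytes = page_bytes
        # An erased block reads back as all 1 bits (0xFF in every byte).
        self.pages = [bytearray(b"\xff" * page_bytes) for _ in range(pages)]
        self.erase_count = 0

    def program(self, page_no, data):
        """Program one page. Bits can only move from 1 to 0."""
        if len(data) != self.page_bytes:
            raise ValueError("data must fill exactly one page")
        page = self.pages[page_no]
        for i, new in enumerate(data):
            # A bit already at 0 cannot be raised back to 1 without an erase.
            if new & ~page[i] & 0xFF:
                raise RuntimeError("page holds data; erase the block first")
        for i, new in enumerate(data):
            page[i] &= new

    def erase(self):
        """Erase the whole block, the smallest erasable unit."""
        for page in self.pages:
            page[:] = b"\xff" * self.page_bytes
        self.erase_count += 1
```

Programming a freshly erased page succeeds, a second program of different data into the same page is refused, and after `erase()` every page in the block is writable again while `erase_count` goes up by one.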


To mitigate the finite number of P/E cycles a NAND chip has, we use two different techniques: wear-leveling to keep blocks alive, and bad block management to make sure we don’t use a possibly bad block again. Let’s take a single NAND MLC chip. It may have 16 thousand blocks on it. Each block may be rated for between 3,000 and 10,000 P/E cycles. If you execute one P/E cycle per second and rotate evenly through every block, it would take you over five years to reach the wear out rating of 10,000 cycles. If on the other hand you executed a P/E cycle every second on the same single block, you could hit the 10,000 rating in about 3 hours! This is why wear-leveling is so important. In the early days of NAND flash, wearing out a block was a legitimate concern as applications would just rewrite the same block over and over. Modern devices spread that over not just a single chip but every available chip in the system, extending the life of your solid state disk for a very, very long time. Ideally, you want to write to each block once before writing to any block a second time. That isn’t always possible due to data access patterns.
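
The five years versus three hours comparison is easy to reproduce. This is a back-of-the-envelope sketch only, assuming perfectly even wear across the blocks in the rotation and one P/E cycle per second; the function name is made up for illustration.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * SECONDS_PER_HOUR


def seconds_to_wear_out(blocks_in_rotation, rated_cycles, cycles_per_second=1.0):
    """Seconds until every block in the rotation reaches its rated P/E count,
    assuming the cycles are spread perfectly evenly across those blocks."""
    return blocks_in_rotation * rated_cycles / cycles_per_second


leveled = seconds_to_wear_out(16_000, 10_000)   # rotate across the whole chip
hot_spot = seconds_to_wear_out(1, 10_000)       # hammer one block

print(f"wear-leveled: {leveled / SECONDS_PER_YEAR:.1f} years")
print(f"single block: {hot_spot / SECONDS_PER_HOUR:.1f} hours")
```

Plugging in the low end of the rating, 3,000 cycles, shortens both figures proportionally.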

It sounds simple enough to cycle through all available blocks before triggering a P/E cycle, but in the real world it just isn’t that easy. As you fill the drive with data it is generally broken into two different categories, static data and dynamic data. Static data is something that is written once, or infrequently, and read multiple times. Something like a music file is a good example of this. Dynamic data covers things like log files that are written to frequently, or in our case database files. If you only wear-level the dynamic data you shorten the life of the flash significantly. Alternatively, if you also include the static data you are now incurring extra write and read IO in the background that can affect performance of the device.
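
One way to picture the two approaches is as a pair of small decisions the controller makes. The sketch below is an assumption about how such a policy could look, not any vendor’s algorithm; the function names and the `max_spread` threshold are hypothetical.

```python
def pick_free_block(erase_counts, free_blocks):
    """Dynamic wear-leveling: write next to the free block erased the fewest times."""
    return min(free_blocks, key=lambda b: erase_counts[b])


def static_move_candidate(erase_counts, in_use_blocks, max_spread=100):
    """Static wear-leveling: if wear has drifted too far apart, return the
    least-worn in-use block so its (likely static) data can be relocated and
    the block put back into rotation. Returns None when no move is needed."""
    spread = max(erase_counts.values()) - min(erase_counts.values())
    if spread <= max_spread or not in_use_blocks:
        return None
    return min(in_use_blocks, key=lambda b: erase_counts[b])


# Block 1 holds static data and has barely been erased; the rest are busy.
counts = {0: 500, 1: 3, 2: 480, 3: 490}
print(pick_free_block(counts, free_blocks=[0, 2, 3]))
print(static_move_candidate(counts, in_use_blocks=[1]))
```

The second function is where the extra background IO comes from: relocating the static data costs a read and a write that the host never asked for.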

Background Garbage Collection
To defer the P/E cycle and mitigate the penalty of a block erase we rely on garbage collection running in the background of the device. When a file is altered it may be completely moved to clean pages and blocks, and the old blocks are then marked as dirty. This tells the garbage collector that it can perform a block erasure on them at any time. This works just fine as long as the drive has enough spare area allocated and the number of write requests is low enough for the garbage collector to keep up. Keep in mind, this spare area isn’t visible to the operating system or the file system and is independent of them. If you run out of free pages to program you start forcing a P/E cycle for each write, slowing down writes dramatically. Some manufacturers offset this with a large DRAM buffer and also may allow you to change the size of the over-provisioned space.

Another technology that has started to gain momentum is the TRIM command. Fundamentally, this allows the operating system and the storage device to communicate about how much free space the file system has and allows the device to use that space like the reserve space or the over-provisioned space used for garbage collection. The downside is that it is really only available in Windows 7 and Windows Server 2008 R2. Some manufacturers are including a separate TRIM service for those OSes that don’t support it natively. Also, TRIM can only be effective if there is enough free space on the file system. If you fill the drive to capacity then TRIM is completely useless. Another thing to consider is that an erasable block may be 256 KB and we generally format our file system for SQL Server at 64 KB, several times smaller than the erasable block. Last thing to remember, and it is good advice for any device not just solid state storage, is to grow your files in large chunks to keep file fragmentation down to a minimum. Heavy file fragmentation also cuts down on TRIM’s performance and can’t be easily fixed, since running a defragment may actually make the problem worse as it forces wholesale garbage collection and wears out the flash that much faster.

Write Amplification
Another pitfall of wear-leveling and garbage collection is the phenomenon of write amplification, where the device ends up writing more data to the flash than the host actually asked it to write. As the device tries to keep up with write requests and garbage collection it can effectively bring everything to a standstill. Again, writing serially and deleting serially in large blocks can mitigate some of this. Unfortunately, SQL Server access patterns for OLTP style databases mean lots of little inserts, updates and deletes. This adds to the problem. There may be enough free space to accommodate the write but it is severely fragmented by the write pattern and a large amount of garbage collection is needed. TRIM can help with this if you leave enough free space available. This also means factoring free space into your capacity planning ahead of time. A full solid state device is a poor performing one when it comes to writes.
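
To get a feel for why the write pattern matters, here is a toy simulation. It is a sketch under heavy assumptions, not a model of any real device: `TinyFTL` is a made-up name, the garbage collector is a simple greedy one that picks the full block with the fewest valid pages, and the geometry (8 blocks of 4 pages, with 16 logical pages visible to the host) is chosen only to keep the example small. Write amplification is computed the usual way, as pages written to flash divided by pages written by the host.

```python
import random


class TinyFTL:
    """Toy flash translation layer with greedy garbage collection."""

    def __init__(self, blocks, pages_per_block, logical_pages):
        # Over-provisioning: keep the logical space smaller than the physical
        # space so garbage collection always finds a block with stale pages.
        if logical_pages >= (blocks - 2) * pages_per_block:
            raise ValueError("not enough spare area for garbage collection")
        self.ppb = pages_per_block
        # owner[b][p] is the logical page stored there, or None if free/stale.
        self.owner = [[None] * pages_per_block for _ in range(blocks)]
        self.fill = [0] * blocks          # pages programmed since last erase
        self.where = {}                   # logical page -> (block, page)
        self.active = 0                   # block currently being filled
        self.free = list(range(1, blocks))
        self.host_writes = self.nand_writes = self.erases = 0

    def _program(self, lpn):
        if self.fill[self.active] == self.ppb:
            self.active = self.free.pop()
        b, p = self.active, self.fill[self.active]
        self.owner[b][p] = lpn
        self.fill[b] += 1
        self.where[lpn] = (b, p)
        self.nand_writes += 1

    def _collect(self):
        full = [b for b in range(len(self.fill))
                if b != self.active and self.fill[b] == self.ppb]
        victim = min(full, key=lambda b: sum(x is not None for x in self.owner[b]))
        for lpn in [x for x in self.owner[victim] if x is not None]:
            self._program(lpn)            # relocate data that is still valid
        self.owner[victim] = [None] * self.ppb
        self.fill[victim] = 0
        self.free.append(victim)
        self.erases += 1

    def write(self, lpn):
        old = self.where.get(lpn)
        if old is not None:
            self.owner[old[0]][old[1]] = None   # the old copy is now stale
        if self.fill[self.active] == self.ppb and len(self.free) <= 1:
            self._collect()
        self._program(lpn)
        self.host_writes += 1

    def write_amplification(self):
        return self.nand_writes / self.host_writes


def run(pattern, writes=2000):
    ftl = TinyFTL(blocks=8, pages_per_block=4, logical_pages=16)
    for i in range(writes):
        ftl.write(pattern(i))
    return ftl


rng = random.Random(1)
print("sequential:", run(lambda i: i % 16).write_amplification())
print("random:    ", run(lambda i: rng.randrange(16)).write_amplification())
```

In this toy model the sequential pattern leaves whole blocks stale, so the collector has nothing to copy, while the random pattern scatters stale pages across blocks and forces valid data to be relocated before each erase. Shrinking the gap between `logical_pages` and the physical page count is the equivalent of running the drive closer to full.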

Error Detection and Correction

The nature of NAND Flash makes it susceptible to several types of data corruption, just as hard drives and floppy disks are at risk near magnetic sources. NAND has several vulnerabilities, and some of them occur even when just reading data.

Write Disturb
Data in cells that aren’t being written to can be corrupted by writing to adjacent cells or even pages. This is called Write Disturb, also known as Program Disturb. Cells not being programmed receive elevated voltage, causing them to appear weakly programmed. There isn’t any damage to the physical structure, and the condition can be cleared with a normal erase.

Read Disturb
Reading repeatedly from the same block can have a similar effect called Read Disturb. Cells not being read collect a charge that causes them to appear to be weakly programmed. The main difference from Write Disturb is that it is always on the block being read and always on pages not being read. Again, the physical cell isn’t damaged and an erase on the affected block clears the issue.

Charge Loss/Gain
Lastly, there is an issue with data retention on cells over time. A floating gate may gain or lose charge over time, making the cell appear to be weakly programmed or in another invalid state. The block is undamaged and can still be reliably erased and written to.

All of this sounds just as catastrophic as it gets. Fortunately, error correcting code (ECC) techniques effectively deal with these issues.
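
To show the idea behind ECC, here is the classic Hamming(7,4) code, which protects 4 data bits with 3 parity bits and can correct any single flipped bit. This is only an illustration of the principle: real NAND controllers work on much larger chunks of data and often use stronger codes, and the function names below are made up for this sketch.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit. Returns (data_bits, error_position),
    where error_position is 0 for a clean read or the 1-based bad bit."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


stored = hamming74_encode([1, 0, 1, 1])
disturbed = list(stored)
disturbed[4] ^= 1                     # simulate one bit disturbed in the cell
print(hamming74_decode(disturbed))
```

The decoder recomputes the three parity checks, and the pattern of failed checks points directly at the position of the bad bit. Two flipped bits in the same codeword would fool this particular code, which is one reason devices with higher raw error rates need stronger ECC.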

Bad Block Management

NAND chips aren’t always perfect. Any chip may have defects of some sort and may ship from the factory with bad blocks already on the device. Bad block management is integrated into the NAND chip. When a block fails a P/E cycle, the data is written to a different block and the failing block is marked bad, making it unavailable to the system.
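
The bookkeeping behind this can be sketched as a small remapping table. This is a hypothetical illustration, not the scheme any particular chip or controller uses; `BadBlockManager`, its methods, and the idea of holding back a fixed number of spare blocks are assumptions made for the example.

```python
class BadBlockManager:
    """Toy bad block table: factory-marked blocks are skipped up front and
    blocks that fail later are remapped to spares held in reserve."""

    def __init__(self, total_blocks, spare_blocks, factory_bad=()):
        self.bad = set(factory_bad)
        good = [b for b in range(total_blocks) if b not in self.bad]
        if len(good) <= spare_blocks:
            raise ValueError("not enough good blocks")
        usable = len(good) - spare_blocks
        self.spares = good[usable:]               # replacements held back
        self.map = dict(enumerate(good[:usable])) # logical -> physical block

    def physical(self, logical):
        """Physical block currently backing a logical block number."""
        return self.map[logical]

    def retire(self, logical):
        """Call after a failed program or erase: mark the physical block bad
        and point the logical block at a spare. Returns the new block."""
        if not self.spares:
            raise RuntimeError("no spare blocks left")
        self.bad.add(self.map[logical])
        self.map[logical] = self.spares.pop(0)
        return self.map[logical]


mgr = BadBlockManager(total_blocks=8, spare_blocks=2, factory_bad=[3])
print(mgr.physical(3))   # logical block 3 skips the factory-marked block
print(mgr.retire(1))     # physical block 1 failed; a spare takes over
```

Once the spares run out the device can no longer hide failures, which is another reason to monitor wear over time.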


As you can see, Flash in some respects is much more complicated than your traditional hard disk. There are many things you must consider when implementing solid state storage.

Flash read performance is great, sequential or random.
Flash write performance is complicated, and can be a problem if you don’t manage it.
Flash wears out over time. Not nearly the issue it used to be, but you must understand your write patterns.
Plan for over-provisioning and TRIM support. It can have a huge impact on how much storage you actually buy.
Flash can be error prone. Be aware that writes and reads can cause data corruption.

Up next
We will talk about MLC vs SLC, what makes a device enterprise ready, and how to effectively benchmark your solid state storage so you are not caught off guard when you move into production.