Monthly Archives: June 2010

Fundamentals of Storage Systems – RAID and Hard Disk Reliability, Under the Covers

In the last RAID article we covered the basics. This is a little deeper dive into the underlying mechanics of RAID. Exactly what it does, how it does it and what it doesn’t do that people assume it does. I sited David Patterson, Garth Gibson, and Randy Kats and their work at UC Berkley on RAID. They show something I’ve talked about before the “Pending I/O Crises”. Of course it isn’t pending anymore, its here. One of the concerns has to do with Amdah’s Law and speeding up execution with parallel operations. As processors and memory speed up hard disks are still an order of magnitude slower. Another aspect is Kryder’s Law, which like Moore’s Law, is a estimation of capacity growth of hard disks over time. Kryder’s Law is starting to slow down just as Moore’s law is. The problem with hard drives has never really been capacity, its speed. As areal density increases you do get an increase in data throughput, there is simply more data per square inch on the disk. You also get an improvement in I/O’s, tracks are closer together.  We haven’t broken past the 15k barrier yet. I’ve still got Seagate Cheetah 15k.3 drive from 2002. It has a max sequential throughput around 80 MB/sec. I doubt we will see spinning disks faster than 15k. This is a real problem for scaling I/O up. Enter RAID. It’s simple get a bunch of disks and then stripe data across them. One little problem creeps up.  Reliability goes down for each drive you add to the array. Using RAID 0 pretty much guarantees you will have an array failure. To overcome this We start adding some way to make the data more redundant.

Hard Disk Reliability

People make a lot of assumptions about hard drives and their reliability. Hard disks break down into two classes consumer grade, the drive you have in your desktop and enterprise, the kind usually in your servers. There are misconceptions around both. Recently, Google and others have written papers based on long term large batch sample failure rates and found the enterprise class drives don’t last any longer than consumer class. This study is perfectly valid from a physical reliability point of view. Most drives are manufactured the same way in the same plants. Not like the poor misunderstood lemming, hard disks do all jump off a cliff together. Studies have shown that there is a strong corollary to disk failure and a shared manufacturing batch. Simply put, if they are made around the same time if one has a failure there is a likelihood, around 30%, other drives in that batch will also suffer failures. So, what are we paying for with an enterprise drive besides speed? Data reliability. Enterprise level drives have more robust error correction than their consumer counterparts. On a normal hard drive the smallest piece of data that can be written is 512 bytes. This is the size of a sector. Enterprise drives usually have 520 byte sector 8 bytes are used to verify the data in that sector, this is the Data Integrity Field. DIF isn’t 100% ether. It is more reliable than a consumer drive without it. You can still have write corruption for several reasons. Misdirected writes occur when data is written to the wrong location on disk and reported as a successful write. When the system goes to access again you get a read fault. Torn pages, which we are familiar with, is when an 8k page write is requested but only part of the 8k is actually reported. Corruption outside the drive where the controller makes a bad request to write but it is a perfectly legitimate I/O request at the hard drive level. With larger drives the odds of hitting one of these errors becomes a real possibility. Enterprise drives add this extra layer of protection. Your RAID HBA may also have additional error correction. The last thing I would like to touch on is write catching. Without a battery backup, or if the cache non-volatile in nature, you will loose data on a power failure if a write is in progress.

RAID Host Bus Adapter Reliability

The adapter is as reliable as any other component in your system. Normally, the cache on the controller is ECC based. Also, you usually have the option of a battery module to supply the cache with power incase of an outage so the data in cache can be written to the array when everything comes back up. Most of the issues I have seen with RAID HBAs is almost always driver or firmware related. You may also see inconsistent performance due to write catching and the battery backup unit. The unit has to be taken off line and conditioned to keep it in top condition. The side effect is a temporary disabling of the write cache on the controller. You can override this setting on some controllers but it is dangerous proposition. I personal anecdote from my days at a large computer manufacturer, we started getting a larger volume of failed drive calls into support. We started doing failure analysis. It all pointed back to a particular batch of hard drives. That was when the drive manufacturer made a change in its drives removing very small component. It shaved a few cents off the cost but had a dramatic effect. All the drives were technically good and would pass validation. Under a enough load and attached to a particular RAID HBA they would randomly fall off line. It came down to the little component. It provided a little bit of electrical noise suppression on the SCSI bus. Some cards were effected and others chugged along just fine. This is also confirmed by the Google paper, they observed the same behavior. They also point out that 20% to 30% of all returned drives have no detectible problems. The point is validate your entire I/O stack. Any single component may be within specification but may not play well with others.

RAID Parity, Mirroring, and Recoverability

Not to belabor the point, RAID isn’t bullet proof. People rap RAID round themselves like Superman’s cape. There are several issues that all the RAID schemes in the world don’t protect against. With current hard disks in the two terabyte range it is possible to build even a small RAID 5 array and have potential for complete failure. The problem is the amount of data that has to be read for the rebuild process. Having a hot spare available reduces the time to replace a failed drive to zero but that is only part of the equation. The much larger part is rebuild time. Lets say you have a 14 drive RAID 5 array with the new two terabyte drives installed and suffer a failure. If you have no activity on the array and all the IO is detected to the rebuilt it could still take two or three days to rebuild the array. During that time you are effectively running on a RAID 0 array that is now under load. Your chance of total array failure is near 100%. RAID by its very nature assumes a failure is a hard failure. A drive goes off line and the redundant part of the system takes over. It also makes the assumption that if a write succeeds then, barring a hardware failure, the read will also be valid. Data is only validated on writes not on reads. If it was RAID 5 would be twice as slow on reads and four times as slow on writes as a single drive or RAID 0. With all the potential hidden write failures it is completely possible to have hidden corruption and not know it until it is way to late. RAID levels with striped parity are most susceptible to this kind of silent creeping corruption. It is possible that the corrupted data is in the parity stripe making it completely unusable for data reconstruction. If that particular piece of data doesn’t change you can go a very long time with a RAID 5 array with polluted parity. You know how to recover from a polluted parity stripe? Simple, copy all the data off the array, figure out which files are now corrupt and restore them. RAID 6 with its dual stripes makes it more likely to recover your data from a single parity stripe becoming corrupt. You do pay a price in write speed for that extra level of protection. RAID 1 and RAID 10 aren’t perfect ether. On a mirrored pair if the write is assumed good there is no way to validate that on read. Without a third piece of information, like a checksum, it would be a coin toss. If the read is successful there is no way to tell which drive has the bad data. It is possible to have a mirrored pair run just fine with one giving you corrupted data on reads all day long. It would manifest itself as file corruption or some other anomaly that could be difficult to track down. We are back to relying on the disk to tell us all is well. We often recommend RAID 10 over everything else for speed and reliability, and I still hold to that. RAID 10 can still suffer from a catastrophic failure due to a single mirrored pair failing at the same time. With the probability of correlated disk failures it can’t be ignored.

What Can We Do?

There are a few tools available to us that can help predict the failure of a drive or that something is wrong with the array. All modern drives support the SMART protocol. Even though Google found it wasn’t as useful and wasn’t 100% reliable, closer to 30%, some warning is better than none in my opinion. All modern RAID HBA’s also come with tools to detect parity errors. You do take a hit when you run these internal consistency checks. Just like you run maintenance on your databases via DBCC your RAID arrays need checkups too. They are a necessary evil if you don’t want any surprises one day when you have a failed drive in your RAID 5 array and can’t rebuild it. If you have intermittent problems with a drive, don’t mess around, replace it. The HBA almost always has the ability to send SNMP messages to something like nagios or HP Openview, Use it. If you aren’t running something like that usually you can configure email alerts on error to go out. Proactive is the name of the game.

Don’t take my word for it….

Short list of papers to get you started on your path to paranoia.

Google Disk Failure analysis

Original RAID Paper

NetApp disk failure analysis

CERN data corruption tests

Silent Data Corruption in SATA arrays

Series To Date:
  1. Introduction
  2. The Basics of Spinning Disks
  3. The System Bus
  4. Disk Controllers, Host Bus Adapters and Interfaces
  5. RAID, An Introduction
  6. RAID and Hard Disk Reliability, Under The Covers – You are here!
  7. Stripe Size, Block Size, and IO Patterns
  8. Capturing IO Patterns
  9. Testing IO Systems

SQL Saturday #35 Notes and Observations

First of all, I want to congratulate all the volunteers that made this happen. It was a very well organized event and ran smoothly. I had a great time. It was nice meeting people that I couldn’t have met any other way.


As A Vendor…

The Good:

Ryan Adams did a very good job keeping things coordinated up to the event. Making sure that everything we were entitled to we got. Always very responsive to emails and questions.

The day of I always had Ryan or one of the volunteers stop by between sessions and check that everything was good. I have always had a good experience with PASS events, but I’ve never had so many people checking on us before!

Needs Improvement:

Table placements. I just didn’t understand the flow and layout of the event until I saw the venue first hand. I would have picked a different table. I don’t think it hurt us, we had crazy foot traffic and lots of conversations.

It did bottleneck up sometimes around the vendor tables as sessions let out but I think over all the placement was OK. There isn’t much room to work with and I don’t know if I would have done much better in their shoes!


As a vendor I was very happy with the event and the amount of time I got to spend talking to folks about Nitrosphere and what we do. As a new company getting out and meeting people is very important.

Only having one or two big conferences a year is difficult and costs 5 to 10 times the amount of money that a SQL Saturday does.


As a Speaker…

The Good:

Again, well coordinated no scheduling issues or anything like that. I found the different tracks layer out well. The meet and greet the night before was nice.

Speaker room was big enough.Internet access seemed fine to me.

Again, I was checked on by the staff over and over to make sure things were OK.

We also had a handler feeding us time to help keep us on track.

Needs Improvement:

Recording sessions was spotty. It was a last minute thing and most of us could have used a little hand holding getting it right.


As a speaker I was happy again with the organization and attention to detail.


As an attendee…

The Good:

Lots of tracks an sessions for everyone. I enjoyed seeing so many local and new speakers making the break.

Plenty of interaction between people and speakers.

The food was great, I NEVER get the chicken salad, I ate two for lunch :). Oh the ice cream…. so evil.

Needs Improvement:

Bathroom Queue Length’s were a little long but did clear up.

Finding the stairs to the second floor was fun.


Yet again, no real complaints. Plenty of seating solid flow and awesome choices. I still can’t believe this was a free day of training!


I will be making room for other SQL Saturdays going forward.

What happens when Windows is the step-child? Adventures in Ruby on Rails.

Like many of you I’ve heard the developer community going on about Rails for quite a while now. It wasn’t until recently I had any reason to dip into that world. Over at Nitosphere the website is all run on Rails. We got a inexpensive web host and it was pretty easy to get it up and running. Like most shared web host, it is all linux/open source based. We have now grown to the point that hosting our own server would be cheap enough and give us complete control over the box. As a Microsoft ISV I thought it would be nice to have our new box be a Windows box. It would also be nice to hook in to SQL Server instead of MySQL as well. After a little digging I did find that Microsoft is sponsoring IronRuby, a Ruby clone that runs on the .net platform. Unfortunately, it isn’t completely compatible with one of the packages that we need to run the website on. So, back to Ruby. There is also a gem to run Rails apps against SQL Server, It isn’t compatible with some of the stuff on our website ether. Finally, I fell back to ODBC to connect to SQL Server. Everything wired up but there was still an incompatibility issue. I’ll keep trying to work it out but our fall back was MySQL.

You will need:

Source Control:

If you plan on getting the source for anything you will need ether GIT or Subversion.

GIT for windows:

Subversion clients: basic client 32bit or 64bit

some folks prefer tortoisesvn 


Ruby core:


Ruby 1.8.7 has the most compatibility with existing gems. Get the latest installer if the one linked isn’t it. There are installers for 1.8.6 and 1.9.1. Again, check to see if they will support the gems you will need to get your site up and running!

install the dev kit if you would like to compile some gems instead of manually downloading them. If you don’t install the devkit some gems will fail to install since they can’t compile to a native extension. To get around that you can also use –platform=mswin32 when you install a gem.

Example: gem install fastercsv –platform=mswin32 will fetch the precompiled windows gem if it exists. for more information on the devkit check out this link.




By default when you create an application Ruby on Rails defaults to sqlite3. If you want to use Sqlite3 you will need to download the dll’s and the command line executable.


Since I couldn’t get our site to talk to SQL Server we are staying on MySQL for now. You can use the latest installer 64 bit or 32 bit but you must have the 32 bit library for Ruby to work properly. The libmySQL.dll from the 5.0.15 install did the trick for me. 

SQL Server

If you would like to try the SQL Server adapter just do a gem install activerecord-sqlserver-adapter to get it.


HTTP Ruby Server:

I went with mongrel, it may not be the best but it was pretty easy to setup and get up and running. I am running version 1.1.5 right now since that seems to work best with mongrel_service which you will need if you don’t want to stay logged into your web server with a dos prompt running. I opted for the latest beta of mongrel_service since it cut out some dependencies and seems pretty stable at the moment. As always adding –pre gets the latest beta gem. Also, –include-dependencies will grab everything the gem will need to run, including mongrel.


Gems specific to my install:

Substruct uses rmagick for thumbnail generation, which requires image magic to do the actual work.

Redcloth is a textile markup language for Ruby. If you didn’t install the devkit don’t for get to add –platform=mswin32 to your gem install commands.


My installation steps:

Install Ruby

Make sure c:\ruby\bin (or where you installed it to) is in the path. I recommend a path with no spaces so no c:\program files.

extract the devkit to the c:\ruby directory.

extract the sqlite exe and dlls to c:\ruby\bin

extract libmySQL.dll from the 32 bit 5.0.15 archive

Open an command prompt with administrator privileges.

Issue these commands:

gem update –system

gem install rails –no-ri –no-rdoc

gem install sqlite3-ruby –no-ri –no-rdoc

gem install mysql –no-ri –no-rdoc

gem install mongrel_service –no-ri –no-rdoc –platform mswin32 –include-dependencies –pre


After that install any gems you need for your Rails app. Make sure and test that your app works in production mode with mongrel before anything else. There will be some kinks to work out I’m sure. Once you are happy that everything is running as expected you can install your mongrel service.

mongrel_rails service::install -N MyAppsServiceName -c c:\app\myapp -p 3000 -e production

The –N is the service name. –c is where the app will be served from. –p is the port number that it will listen on. –e is the mode it will run in like development or production. I chose a few high ports 3000 to 3008 for my services to run in.

You can always remove a service if something is wrong.

mongrel_rails service::remove -N MyAppsServiceName


Setting up IIS7:

Install the application request routing 2.0 and URL Rewrite plug-ins using the Web Platform Installer

Once that is done you will need to create a new web farm.

 create farm

Next you will need to add at least one server entry. You may want to edit your host file and add additional aliases to your IP Address so you can run multiple copies of mongrel to service all request. The recommendation is one per cpu/core.

add server

The last step the wizard ask if you want to add the routing rules. The answer is yes.

add rules

You can confirm the routing rules are in place.

edit inbound rule

Make sure you have a website in IIS running and listening on port 80. Without this there is nothing for IIS to route to your new server farm.


If you have any questions post them up. I’m not a Rails expert but I have just been through the pain of Rails on Windows!

What I’ve Read and Recommend to Others – General Database and Theory by C.J. Date

Date on Database: Writings 2000-2006
This is a collection of writings by C.J. Date, one of the fathers of the relational model. It has a nice tribute to E.F. Codd, inventor of the relational model.
If you are looking for tips and insights into relational databases on a higher level this is a solid read. 
SQL and Relational Theory: How to Write Accurate SQL Code
Another solid text from Date. This one helps you understand the theory so you can write effective real world code. It has lots of solid examples.
It covers topics like granting access directly to the data or through views. How NULLS effect the answers returned by your queries. Advanced coverage
of constraints. I’m currently reading this book again. I don’t think you could absorb all of the goodness this book has to offer in one pass.
Temporal Data & the Relational Model (The Morgan Kaufmann Series in Data Management Systems)
If you are working with dates and time this book will teach you a new modeling technique. Unlike Domain Key Normal Form this builds on the previous
Normalization rules and is a logical extension of them. It does use Tutorial D to explain its examples which is kind of a pain.
Database in Depth: Relational Theory for Practitioners
Another fundamentals book that transcends any particular product line and gets to the heart of the relational model. It is a short but concise read and one
I generally recommend to people wanting to expand their theoretical base.
Refactoring Databases: Evolutionary Database Design
If you have ever had to refractor a database this book is for you. I’ve read and re-read this book over the last few years. It isn’t the easiest read in the world
but it can help you apply what you have learned from the books above to your current database without having to start from scratch.


As always if you have any questions or want to suggest a book let me know!

Sometimes, you have to fix it yourself

The Problem

SQL Server is a huge product with lots of moving parts. Bugs happen. Microsoft has a place to voice your issues or problems. They allow you to vote on the issue and then decide when or if it will get fixed. I’ve used Connect when I hit a bug and I have voted on items that were important to me. Recently I hit a bug in sp_createstats. I use this system stored procedure generate statistics in an automated process I’ve got that manages statistics. I added a new vendor database to the system and on the first run hit “Column ‘DAYSOPEN’ in table ‘dbo.TBL_OPPORTUNITY’ cannot be used in an index or statistics or as a partition key because it is non-deterministic.”. Well, we all know you can’t create stats on a computed column! I quickly went to the connect site and someone else had already entered it. The down side was it had so few votes it was only slated to go into the next cumulative update/service pack. When I hit this issue they hadn’t yet announced service pack 4. I already had this procedure coded into my routines and really didn’t want to rewrite them to get past this one problem.

The Solution


By doing what I am about to describe could break at a later date or randomly kill baby kittens.

Since it is a system stored procedure I am loathe to make any changes to it directly. There are ways to modify some system stored procedures but they involve the installation CD and creativity. With that door closed there was only one avenue open to me. Create my own system stored procedure with the fix in it. There is a problem with this solution as well, if it gets dropped due to a service pack or an upgrade anything calling it will break. The first thing I did was to see if the procedure text was available by executing sp_helptext sp_createstats. Luckily it was! Now all I had to do was figure out where it was broken. The procedure is pretty simple and uses some cursors to loop through all the objects and create column statistics where they don’t exist.

declare ms_crs_cnames cursor local for select from sys.columns c  
     where c.object_id = @table_id  
     and (type_name(c.system_type_id) not in ('xml'))  
     and not in (select col_name from #colpostab where col_pos = 1)  
     and (( in (select col_name from #colpostab)) or (@indexonly <> 'INDEXONLY'))
    -- populate temporary table of all (column, index position) tuples for this table  

It was pretty easy to spot. The weren’t checking to see if the column was computed so I added a line to the where clause.

and c.is_computed = 0

That’s it. One little check to see if it is a computed column. Now that I had fixed it I created a new procedure named sp_createstats_fixed in the master database. Just creating it in master doesn’t make it act like the original procedure or make it a system stored procedure. For that I had to execute EXECUTE sp_MS_marksystemobject ‘sp_createstats_fix’. This is an undocumented stored procedure and could change or go way any time. The only way to unmark it in SQL Server 2005 is to drop the procedure and recreate it. Now it acts just like the old procedure. Next I had to replace all references to the old proc with the new one. I made an entry into our bug tracking system about the change so we would have a record of what I did and why.


This wasn’t the most elegant solution. It could break later. The upside is it only took me about 30 minutes to fix and deploy versus the hours of re-coding and then testing that I would have had to do before. Do I think you should go around creating your own system stored procedures? Not at all. I don’t recommend you put anything in the master database period. If the problem had been more complex I would have redone the original routines to exclude the broken procedure. This time it just happened to be a very quick fix to a non-critical part of our system.

Quick Tip Of The Day

Over and over again we are told that the DMV’s only hold data since your last reboot. So, how do you know when your server was last rebooted? Well, every time your SQL Server service restarts tempdb is recreated every time. With a quick query to sys.databases we can get the creation date of tempdb! Armed with that little nugget you can then analyze what is in the DMV’s relevant to the last system restart.


    create_date AS last_restart_time



   WHERE  name = 'tempdb'

Three Turns On The Road

Huh? My story? Okay. It was never easy for me. I was born a poor black child. I remember the days, sittin’ on the porch with my family, singin’ and dancin’ down in Mississippi. – The Jerk

This meme was started by Paul Randal (Blog | Twitter) who asks the question "What three things or events brought you to where you are today?" I was tagged by David Taylor (Blog | Twitter). I wasn’t born in Mississippi but we sure were poor. I grew up I a small West Texas town of farmers and ranchers. I grew up between two households. I am a single child but have a large extended family and was a middle child in that mix. I as never as tall or fast as my closest brothers but I have always had a very sharp wit and a quick thinker. With all of those factors competition for resources was pretty fierce.


Turn number one, being broke breeds invention!

imageMy aptitude for mechanical things and electronics was my first love. Being handy with a socket set has a great value when you have to fix everything you own yourself. It also allowed me to scavenge parts where ever I could to build stuff I wanted. It wasn’t ever pretty but it was mine. The bike I rode during my middle years was made up of at least four others. My radio was also made up of three or four other radios. One of the great things was one of my best friends had a great hookup into technology, his dad owned the Radio Shack in town. I spent a lot of time there and got to hone my soldering skills there on a real Weller station with adjustable temp!. My mom saved and saved. She new computers were the future. Having a Radio Shack meant I got a TRS-80! I scraped up the money to buy a tape drive and eventually a modem. I’ll never forget them telling my mom that this thing would take me through high school and college! I spent the better part of my junior high and high school life on this and later a CoCo3. My High school had Apple II’s and eventually some very coveted Mac’s. My best friend through high school was one of two people that had those “IBM PC” things. I fell in love with those 8088’s and his dads “high end” 286 and got to spend a lot of time on those as well. Those few of us that had modems would dial into each others machines. Needless to say my folks weren’t real happy when the picked up the phone and a computer was screaming at them. I wasn’t to happy ether since they usually killed me right at the end of a two hour download :)




Turn number two, the college yearsimage and dial tones.

Alas, it wasn’t my goal to be a computer guy. I loved speech and theatre and did quite well in both. I dabbled in physics but quickly realized it was hard and I wasn’t very fond of the math. The other thing I majored in was partying. I slowly quit playing with computers all together and for about a year was just a college bum. My first real friend in college had a 286 and a modem as well. He showed me the local BBS’es and I was 100% hooked again. He gave me a old hard drive and I went down to a local computer store to buy a used computer and start learning all over again. The shop owner was a friend of my friend and knew who I was. I walked in with enough money to buy a 286 he had for sale. I told him I needed a computer and was planning on starting a real BBS. He smiled, and flatly refused to sell it to me. I argued with him for the better part of an hour. I left his store pretty mad but none of the other computer stores in town had a computer that met my limited budget. I went back the next day and started in on him again. He stuck to his guns telling me I wouldn’t be happy with that old clunker. My friend had told him I was handy with the soldering iron so he made me a deal. I put a down payment on a cutting edge 486 and worked the rest off fixing computers. He took me under his wing and helped fill in the gaps in my computer hardware education. He had a crazy huge stockpile of old stuff that I could tinker with. That is where I found out I truly loved computers and connecting with others via my BBS. Looking back on what Duane did for me is one of the biggest reasons I give back to the community today. Spending hours working on my board I learned to love DOS like no other. It very quickly grew into a rekindling of my passion of programming as well. I changed majors(again) and started the computer science thing. Having people come to the house to play their turns on BRE or Trade Wars, posting up on fidonet groups or sending mail to friends around the world was a huge kick for me. I met some of my closest friends through the board. The Internet will never replace the BBS in my heart. I had a loyal local group of BBS’ers with the ability to communicate globally. Eventually, I dropped out of school to work full time again. As the Director of Technology for a small school outside of San Angelo, Texas I had a huge impact on the students and helped bring computers and the Internet, into every classroom. I also participated in the speech and theater stuff with the high school students as a coach.


Turn number three, the big city.

360 Bridge, Austin, TX by MoralesDirect.

this image is property of Dan Morales…/set-72157601003144486

After a shakeup at the school system I was working at occurred and the Superintendent of Schools left, I decided it was the right time to make a move to the big city. I had always loved Austin. I’d done a tour of service in the other major cities and found that Austin was the only one that felt like a real community. I did what lots of folks did in the late 90’s and interviewed at start ups. To be honest, they scared the crap out of me. Most of the time it was people younger than me, with more money than I’d ever seen, doing whatever the hell they wanted. It looked like a lot of fun, but not so much a solid way to build a business. So I did what everyone does, I went to work for Dell. Did it suck? Yeah, kind of. I met a great group of guys there though. It was also the first place I’d ever been that had what you might call a growth path. My boss and my team mates were great people to work with. I had been doing side work pretty much most of the 90’s small software projects or working in the new open source movement. I loved data, so I spend a lot of time working with databases or building custom data engines. When I had landed at Dell I’d already had several years of tinkering with SQL Server. We had a single server at the school and I worked with clients I met through the computer store as well. I also love a puzzle. I watched the calls that came in. Saw the patterns in them and started plugging them into a brand new SQL Server 7.0 database. Quickly I moved off the phones full time and started working on a special projects team analyzing call volumes, doing root cause analysis and coming up with solutions. Dell also opened my eyes to the possibilities and gave me the confidence boost to just strike out and blaze my own path. I did my time at Dell and moved on to another large company, Broadwing. Again, great group of guys, not so great working environment. But, for the first time I was a full on DBA. Not a DBA/developer/janitor/whateverthebossneeds. I LOVED IT. Lots of machines, lots of data. From there it has just been a natural progression always growing and learning but staying laser focused on SQL Server. Like many folks, the 90’s was an awesome decade for me.



Well, I’m still with SQL Server. I love community work. I’m a family man. None of the things I take for granted today wouldn’t have happened if I’d never been exposed to computers at a young age, or had a stranger take a chance on me, or worked up the nerve to just move to another city when I had a stable job where I was at. Give back when you can. You never know how you will shape someone else’s road for the better.

Out Of My Comfort Zone: Building a Web App

If you read Fundamentals of Storage Systems – Stripe Size, Block Size, and IO Patterns you know I built a little web tool to help you with sizing and estimating your RAID array’s performance. This is way out of my area of expertise. Luckily for me I like a challenge and had a guiding hand from some friends. I haven’t done any web programming since I wrote a photo album mod for Snitz! forum package in 2002, using classic ASP. I still get thank you emails from folks that have been running it for years. Needless to say my skills are a little rusty. My first instinct was to fire up Visual Studio 2008 and build an page tied to a back end SQL Server 2008 database. Well, is just different enough that I was struggling to do the most basic things. So, I thought it would be good to “get back to basics”. A very good friend of mine has been a professional web developer for the same company since about the time my photo album came out. He lives and breathes web technology. The problem was he doesn’t do at all. Everything he does is standards compliant HTML and JavaScript. I told him about my spread sheet of calculations and my desire to turn it into a web page. Joe quickly begged me to leave him alone. Once I calmed him down, he did a little sample page to get me on the right track. I took a look at it and thought it would be easy to goof around in this JavaScript stuff. I was wrong. What a convoluted world web developers live in.

I hacked on it the weekend before Christmas and was pretty happy with my handy work, right up until I showed it to some people. Comments like “Kill it with fire!” were common. I explained what it was and that took some of the hostility out of the feedback. The next Monday I asked the two lead web heads at work to look at it. Once Ben had washed his eyes out with bleach he told me under no certain terms could he fix it. My powers of persuasion, and a threat to never approve another schema change, helped bring him around. He explained several new technologies to me I wasn’t aware off, like CSS. Ben showed me quickly how to format the page so peoples heads wouldn’t burst into flames when they saw it. He also spoke of things like Ajax and JSON. Being the clueless data guy that I am I turned to my trusty friend yet again, Google.

Quick Definition List

JavaScript, the underpinning of the modern dynamic web 2.0 world.

Ajax (asynchronous JavaScript and XML, a group of interrelated web technologies that allow for dynamic web design.

JSON (JavaScript Notation), a lightweight data-interchange format, a way to structure data like XML implementing name/value pairs and ordered lists.

JQuery, a JavaScript library used to traverse HTML elements and build dynamic content easily.

JavaScript Implementation

What struck me was just how much had changed, yet stayed the same. One of the things I’ve always hated is no two browsers render the page the same way. If it worked in Firefox it didn’t work in IE. The other thing I didn’t realize for quite a while was any of the browsers would happily run the worlds worst code. I’m not talking poorly written code, I mean flat wrong. They will let any old JavaScript run. JavaScript is defined as a loosely typed language. The way it is implemented it goes from loosely to sloppy very quickly. Don’t get me wrong here. The language specification EMCAScript, is solid. The implementations are very poor in most browsers and that is where the problems developing for JavaScript really come in. I’m not saying JavaScript is slow, quite the opposite in fact. It is just difficult to make sure you are actually writing the correct code. Having worked with Visual Studio for quite some time, I’ve become spoiled with having my IDE tell me when the syntax is wrong before I hit the compile button. So, I was back to Google again looking for tools to help me out.

The Tools

I quickly settled on Aptana Studio 2.0 to do the actual code work. I also used Notepad ++, which is by far the best text editor in the world. with Aptana Studio allowing me to see errors in real time and use the preview function to look at page renders without having to have browser windows open. I did look at Microsoft’s new Expression Suite 3.0. It is nice. The designer is slick, but in typical Microsoft fashion it mangles some parts of the code that I then have to fix by hand. Aptana Studio also has a built in JavaScript minimizer that shrinks the size of your JavaScript files for quicker load times. Aptana Studio doesn’t enforce the use of line terminators and when I compacted my JavaScript it wouldn’t load. I did some additional digging and found an awesome Lint tool for validating JavaScript JSLint. My favorite thing about it is the quote and link on the front page “Warning: JSLint will hurt your feelings.” I knew I had found the right tool. After several passes I got my code to validate and minimize without any errors. I quickly looked for other Lint based tools and found one for JSON as well JSONLint.

The Work

Over the course of a few weeks I tinkered with the page, adding features, correcting mistakes and cleaning up the layout. The biggest thing that I learned is just how fast JavaScript can be. Even after I did the initial page I had no doubt a database would be involved at some point. I had compiled a small database of hard drives and their characteristics and wanted to add that to the page. The problem was I would basically have to redo the page again in or some other dynamic language to pull the data from the database. So, I cheated. As a proof of concept I built a JSON structure with a sample of the data to load the drop down list box and all the variables dynamically. I was stunned at just how fast the page was to respond. I hacked together some T-SQL that would generate the JSON object from my database, all 1200 entries. I cleaned it up a bit made sure that JSONLint validated it ran the minimizer on it and plugged it in. Holy cow, it worked! It was also still very vast. I just couldn’t believe it. The only drawback is having to update the JSON if I add new drives to the database.


What I’ve Learned

The main takeaways for me on this project are simple, web development stinks. Now I know why web folks have a hard time with T-SQL, they already have to be experts in five or six different technologies across two or more platforms. I use to introduce Joe as “My friend who does JavaScript”, now I know is is a hard core developer! What I build is simple and works. I think the layout is tolerable. If you have any suggestions or feed back just drop me a note.

Fundamentals of Storage Systems – Stripe Size, Block Size, and IO Patterns

If you have been following this series we have covered system buses, hard disks, host bus adapters and RAID. Along the way we also covered how to capture your IO patterns and the SQLIO tool. Now we will pull it all together.We move up the stack even further to the actual layout of the RAID stripe and the file system. How the stripe and file system are laid out on your disks has a huge impact on performance. One of the things that has really gotten some traction over the last few years is sector alignment. This one thing, if not done, could cost you 30% to 40% of your IO potential. Jimmy May has covered sector alignment in depth So I won’t hash it here again. Kendal Van Dyke also has a good series that covers offset, stripe size, and allocation units with different raid levels.

It Don’t Add Up…

Something I’ve seen, and been guilty of, is taking a drives base specifications and just multiplying out. Say the manufacturer says the drive will to 79MB/Sec minimum throughput, we have 10 drives so that is 790MB/Sec of throughput! We all know from experience that this isn’t so. What eats us up is how much slower it really can be. As we have seen throughout this series there is overhead associated to everything. Before we just throw a bunch of disks in an enclosure and press it into service it would be nice to have an idea of what the performance should be. It’s also recommended to do some of this work before you actually buy anything so you don’t have to go back to your boss and beg for more money and explain to him that your wild guess was wrong.

Always add a pinch of salt to whatever the disk manufacturer puts in the specifications. Most of the time they will be close enough. The problem lies in the fact they don’t always disclose the methods for archiving those numbers. For instance, when they report minimum and maximum throughput they are usually talking about a scan of the entire disk including all meta data stored between tracks, the best possible throughput possible. You won’t see those results in every day life. They also give you numbers that can be completely irrelevant like single sector read rates. very rarely do you read a single sector at a time. Personally, I would love if the drive makers gave the engineering specifications. I know that won’t happen, it would make my life easier though. The disk characteristics that are important are, sector size,spindle speed, seek times read and write, sequential times read and write. To a lesser extent sequential throughput in megabytes per second. With the single disk numbers we can move on to the RAID configuration.

Configuring your RAID Array

There are several factors that impact the RAID arrays ability to perform. The RAID level, size of the IO request, and stripe size. RAID level is the easy one, what kind of hits do you take on writes vs. capacity of the array. On the stripe size there is a direct corollary with the size of the IO request. If the IO request is bigger than the stripe size it will have to seek across another disk to satisfy the data request. If the IO request size is very small and random you may loose some IO performance if the requests pile up on one disk causing a hot spot. There are established calculations that you can perform to get an idea of how to configure you array. I’ve built a web page that you can use to do all the basic calculations, Disk Drive RAID Configuration Tool. These equations are base line estimates so you aren’t working completely in the dark. You can enter your own drive statistics or pick from one of 1100 hard drives in the database. This web calculator is based off of Peter Chen’s equations for estimating RAID performance and best stripe size. I’ll add more to it as I get time.

SQL Server IO Patterns and Array Performance

SQL Server works with two specific IO request size 8K and 64K in general. If you did your due diligence earlier you could also add any other request size that you saw come through. Focusing on the page size and extent size is a good place to start. Using the raid calculator tool I chose a Seagate Savvio 15K.2 drive as my base. One of the things my calculator can’t take into consideration is your system and RAID HBA. This is where testing is essential. You will find there are anomalies in every card, physical limits on throughput and IO’s. Since my RAID card won’t do a stripe bigger than 256k that is my cap for size. Reading through several IO white papers on SQL Server the general recommendation is for 2000/2005 a 64k or 128k stripe size and for SQL Server 2008 a 256k stripe size. I’ve found as general guidance, this is a good place to start as well. The calculator tells me for a RAID 10 array with 24 drives at a 256k stripe size and 8k IO request I should get 9825 IOs/Sec and 76.75 MB/Sec on average, across reads, writes, sequential and random IO requests. That’s right, 76 MB/Sec throughput for 24 drives rated at 122 MB/sec minimum. That is 2.5 MB/Sec per drive. The same array at a 64k IO request size yields 8102 IOs/Sec and 506 MB/Sec. A huge difference in throughput just based on the IO request size. Still, not anywhere near 122 MB/Sec. As an estimate, I find that these numbers are “good enough” to start sizing my arrays. If I needed to figure out how big the array needs to be to support say 150 MB/sec throughput or 10000 IOs/Sec you can do that with the calculator as well. Armed with our estimates it’s time to actually test our new RAID arrays. I use SQLIO to do synthetic benchmarking before running any actual data loads.

After doing a round of testing I found that in some cases the numbers were a little high or a little low. Other factors that are hard to calculate are cache hit ratios. Enterprise RAID HBA’s usually disable the write cache on the local disk controller and just use their own batter backed cache for all write operations. This is safer but with more and more disks on a single controller the amount of cache per disk can get pretty low. The HBA will also want you to split that between read and write operations. On my HP RAID HBA’s the default is 25% read and 75% write. In an older study I found on disk caches and cache size saw diminishing returns above 2 MB gaining between 1 and 2 percent additional cache hits per megabyte of cache. I expect that to flatten out even more as the caches get larger, you simply can’t get 100% cache ratios that would mean the whole drive fit in the ram cache or your IO request are the same over and over. Generally if that is the case you will find SQL Server won’t have to go to disk it will have what it needs in the buffer pool for reads. I find that if you have less than 20 percent write activity leaving the defaults is fine. If I do have a write heavy load I will set the cache to 100% writes.

The Results

Having completed my benchmarking I found that 128k or 256k stripe size was fine on average. Just realize that if you optimize for one IO pattern the others will suffer. Latency is also important and I have included it here as well. You find that the larger the IO request and the smaller the stripe size latency gets worse. Here are the results from my tests on a DL380 G5 with a P411 and 24 drives in a MSA 70 enclosure. I’ve included tests for an 8k to 256k stripe sizes.

As a footnote I’d like to thank Joe Handley, Ben Poliakoff, David Gosslin and Dale Davis for helping me get the Disk Drive RAID Configuration Tool together. I’m not a web guy!

WARNING! Lots of charts below!

Read 8K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads
Random Sequential
image image
image image
image image
Write 8K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads
Random Sequential
image image
image image
image image
Read 64K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads
Random Sequential
image image
image image
image image
Write 64K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads
Random Sequential
image image
image image
image image
Series To Date:
  1. Introduction
  2. The Basics of Spinning Disks
  3. The System Bus
  4. Disk Controllers, Host Bus Adapters and Interfaces
  5. RAID, An Introduction
  6. RAID and Hard Disk Reliability, Under The Covers
  7. Stripe Size, Block Size, and IO Patterns – You are here!
  8. Capturing IO Patterns
  9. Testing IO Systems

Fundamentals of Storage Systems – RAID, An Introduction

In previous articles, we have covered the system bus, host bus adapters, and disk drives. Now we will move up the food chain at take a look at getting several disks to operate as one.

In 1988 David A. Patterson, Garth Gibson, and Randy H. Katz authored a seminal paper, A Case for Redundant Arrays of Inexpensive Disks (RAID). The main concept was to use off the shelf commodity hardware to provide better performance and reliability and a much lower price point than the current generation of storage. Even in 1988, we already knew that CPUs and memory were outpacing disk drives. To try to solve these issues Dr. Patterson and his team laid out the fundamentals of our modern RAID structures almost completely RAID levels 1 through 5 all directly come from this paper. There have been improvements in the error checking but the principals are the same. In 1993, Dr. Patterson along with his team released a paper covering RAID 6.

RAID Level Disk Required Usable
RAID 0 2 N 325px-RAID_0.svg
RAID 0 is striping without parity. Technically, not a Redundant array of disks just an array of disks but lumped in since it uses some of the same technical aspects. Other hybrid raid solutions utilize RAID 0 to join other RAID arrays together. Each disk in the array holds data and no parity information. Without having to calculate parity, there are no penalties on reads or writes. This is the fastest of all the RAID configurations. It is also the most dangerous. One drive failure means you lose all your data. I don’t recommend using RAID 0 unless you are 100% sure losing all your data is completely OK.
RAID 1 2 N/2 325px-RAID_1.svg
RAID 1 is mirroring two disks. RAID 1 writes and reads to both disks simultaneously. You can lose one disk and still operate. Some controllers allow you to read data from both disks; others return only data from the disk that delivers it first. Since there are no parity calculations, it is generally the easiest RAID level to implement. Duplexing is another form of RAID 1 where each disk has its own controller.
RAID 5 3 N-1 675px-RAID_5.svg
RAID 5 is a striped array with distributed parity. This is similar to RAID 0 in that all data is striped across all available disks. Where it differs is one stripe holds parity information. If a drive fails, the data contained on that drive is recreated on the fly using the parity data from the other drives. More than one disk failure equals total data loss. The more drives you have in a RAID 5 array the greater the risk of having a second disk failure during the rebuild process from the first disk failure. The general recommendation at this time is 8 drives or less. In general, the larger the drive the fewer of them you should have in a RAID 5 configuration due to the rebuild time and the likely hood of a second drive failure.
RAID 6 4 N-2 800px-RAID_6.svg
RAID 6 is a striped array with dual distributed parity. Like RAID 5 it is a distributed block system with two parity stripes instead of one. This allows you to sustain a loss of two drives dramatically reducing the risk of a total stripe failure during a rebuild operation. Also known as, P+Q redundancy using Reed-Solomon isn’t practical to implement in software due to the math intensive calculations that have to take place to write parity data to two different stripes. The current recommendation is to use 8 drives or more.
RAID 10 4 N/2 180px-RAID_10
RAID 10 is a hybrid or nested striping scheme combining RAID 1 mirrors with a RAID 0 stripe. This is for high performing and fault tolerant systems. Like RAID 1, you lose half your available space. You could lose N/2 drives and still have a functioning array. Duplexing each mirror between two drive chassis is common. You could lose a drive chassis and still function. The absence of parity means write speeds are high. Along with excellent redundancy, this is probably the best option for speed and redundancy.
RAID 0+1 4 N/2 180px-RAID_0 1
RAID 0 + 1 is not interchangeable with RAID 10. There is one huge difference and that is reliability. You can lose only one drive and have a functioning array. With the more drives in a single RAID 0 stripe the greater the chance you take. Speed characteristics are identical to RAID 10. I have never implemented RAID 0 + 1 when RAID 10 was available.
RAID 50 6 (N-1)*R 320px-RAID_50
Since RAID 5 becomes more susceptible to failure with more drives in the array keeping the RAID 5 stripe small, usually under 8 drives and then striping them with RAID 0 increases the reliability while allowing you to expand capacity. You will lose a drive per RAID 5 stripe but that is a lot less than loosing half of them in a RAID 10. Before RAID 6, this was used to get higher reliability in very large arrays of disks.
RAID 60 8 (N-2)*R 400px-RAID_60
RAID 60 is the exact same concept as RAID 50. Generally, a RAID 6 array is much less susceptible to an array failure during a rebuild of a failed drive due to the nature of the dual striping that it uses. It still is not bullet proof though the RAID 6 array sizes can be much larger before hitting the probability of a dual drive failure and then a failure during rebuild than RAID 5. I do not see many RAID 60 configurations outside of SAN internal striping schemes. You do lose twice as many drives worth of capacity as you do in a RAID 50 array.
RAID 100 8 N/2 320px-RAID_100
RAID 100 is RAID 10 with and additional RAID 0 stripe. Bridging multiple drive enclosures is the most common use of RAID 10. It also reduces the number of logical drives you have to maintain at the OS level.

Speed, Fault Tolerance, or Capacity?

You can’t have your cake and eat it too. In the past, it was hard to justify the cost of RAID 10 unless you really needed speed and fault tolerance. RAID 5 was the default because in most situations it was good enough. Offering near raid 0 read speeds. If you had a heavy write workload, you took a penalty due to the parity stripe. RAID 6 suffers from this even more so with two parity stripes to deal with. Today, with the cost of drives coming down and the capacity going up RAID 10 should be the default configuration for everything.

Here is a breakdown of how each RAID level handles reads and writes in order of performance.

RAID Level Write Operations Notes Read Operations Notes
RAID 0 1 operation High throughput, low CPU utilization.
No data protection
1 operation High throughput, low CPU utilization.
RAID 1 2 IOP’s Only as fast as a single drive. 1 IOP Two read schemes available. Read data from both drives, or data from the drive that returns it first. One is higher throughput the other is faster seek times.
RAID 5 4 IOP’s Read-Modify-Write requires two reads and two writes per write request. Lower throughput higher CPU if the HBA doesn’t have a dedicated IO processor. 1 IOP High throughput low CPU utilization normally, in a failed state performance falls dramatically due to parity calculation and any rebuild operations that are going on.
RAID 6 6 IOP’s Read-Modify-Write requires three reads and three writes per write request. Do not use a software implementation if it is available. 1 IOP High throughput low CPU utilization normally, in a failed state performance falls dramatically due to parity calculation and any rebuild operations that are going on.

Choosing your RAID level

This is not as easy as it should be. Between budgets, different storage types, and your requirements, any of the RAID levels could meet your needs. Let us work of off some base assumptions. Reliability is necessary, that rules out RAID 0 and probably RAID 0+1. Is the workload read or write intensive? A good rule of thumb is more than 10% reads go RAID 10. In addition, if write latency is a factor RAID 10 is the best choice. For read workloads, RAID 5 or RAID 6 will probably meet your needs just fine. One of the other things to take into consideration if you need lots of space RAID 5 or RAID 6 may meet your IO needs just through sheer number of disks. Take the number of disks divide by 4 for RAID 5 or 6 for RAID 6 then do your per disk IO calculations you may find that they do meet your IO requirements.

Separate IO types!

The type of IO, random or sequential, greatly affects your throughput. SQL Server has some fairly well documented IO information. One of the big ones folks overlook is keeping their log separate from their data files. I am not talking about all logs on one drive and all data on another, which buys you nothing. If you are going to do that you might as well put them all on one large volume and use every disk available. You are guaranteeing that all IO’s will be random. If you want to avoid this, you must separate your log files from data files AND each other! If the log file of a busy database is sharing with other log files, you reduce its IO throughput 3 fold and its data through put 10 to 20 fold.

RAID Reliability and Failures

Correlated Disk Failures

Disks from the same batch can suffer similar fate. Correlated disk failures can be due to a manufacturing defect that can affect a large number of drives. It can be very difficult to get a vendor to give you disks from different batches. Your best bet is to hedge against that and plan to structure your RAID arrays accordingly.

Error rates and Mean Time Between Failures

As hard disks get larger the chance for an uncorrectable and undetected read or write failure. On a desktop drive, that rate is 10^14 bits read there will be an unrecoverable error. A good example is an array with the latest two-terabyte SATA drives would hit this error on just one full pass of a 6 drive RAID 5 array. When this happens, it will trigger a rebuild event. The probability of hitting another failure during the rebuild is extremely high. Bianca Schroeder and Garth A. Gibson of Carnegie Mellon University have written an excellent paper on the subject. Read it, it will keep you up at night worrying about your current arrays. Enterprise class drives are supposed to protect against this. No study so far proves that out. That does not mean I am swapping out my SAS for SATA. Performance is still king. They do boast a much better error rate 10^16 or 100 times better. Is this number accurate or not is another question all together. Google also did a study on disk failure rates, Failure Trends in a Large Disk Drive Population. Google also found correlated disk failures among other things. This is necessary read as well. Eventually, RAID 5 just will not be an option, and RAID 6 will be where RAID 5 is today.

What RAID Does Not Do

RAID Doesn’t back your data up. You heard me. It is not a replacement for a real backup system. Write errors do occur.As database people we are aware of atomic operations, the concept of an all or nothing operation, and recovering from a failed transaction. People assume the file system and disk is also atomic, it isn’t. NTFS does have a transaction system now TxF I doubt SQL Server is using it. Disk drives limit data transfer guarantees to the sector size of the disk, 512 bytes. If you have the write cache enabled and suffer a power failure, it is possible to write part of the 8k block. If this happens, SQL Server will read new and old data from that page, which is now in an inconsistent state. This is not a disk failure. It wrote every 512-byte block it could successfully. When the disk drive comes back on line, the data on the disk is not corrupted at the sector level at all. If you have turned off torn page detection or page checksum because you believe it is a huge performance hit, turn it back on. Add more disks if you need the extra performance don’t put your data at risk.

Final Thoughts

  1. Data files tend to be random reads and writes.
  2. Log files have zero random reads and writes normally.
  3. More than one active log on a drive equals random reads and writes.
  4. Use Raid 1 for logs or RAID 10 if you need the space.
  5. Use RAID 5 or RAID 6 for data files if capacity and read performance are more important than write speed.
  6. The more disks you add to an array the greater chance you have for data loss.
  7. Raid 5 offers very good reliability at small scale. Rule of thumb, more than 8 drives in a RAID 5 could be disastrous.
  8. Raid 6 offers very good reliability at large scales. Rule of thumb, less than 9 drives you should consider RAID 5 instead.
  9. Raid 10 offers excellent reliability at any scale but is susceptible to correlated disk failures.
  10. The larger the disk drive capacity should adjust your number of disks down per array.
  11. Turn on torn page for 2000 and checksum for 2005/08.
  12. Restore Backups regularly,
  13. RAID isn’t a backup solution.
Series To Date:
  1. Introduction
  2. The Basics of Spinning Disks
  3. The System Bus
  4. Disk Controllers, Host Bus Adapters and Interfaces
  5. RAID, An Introduction – You are here!
  6. RAID and Hard Disk Reliability, Under The Covers
  7. Stripe Size, Block Size, and IO Patterns
  8. Capturing IO Patterns
  9. Testing IO Systems