Fundamentals of Storage Systems – Stripe Size, Block Size, and IO Patterns
If you have been following this series we have covered system buses, hard disks, host bus adapters and RAID. Along the way we also covered how to capture your IO patterns and the SQLIO tool. Now we will pull it all together.We move up the stack even further to the actual layout of the RAID stripe and the file system. How the stripe and file system are laid out on your disks has a huge impact on performance. One of the things that has really gotten some traction over the last few years is sector alignment. This one thing, if not done, could cost you 30% to 40% of your IO potential. Jimmy May has covered sector alignment in depth So I won’t hash it here again. Kendal Van Dyke also has a good series that covers offset, stripe size, and allocation units with different raid levels.
It Don’t Add Up…
Something I’ve seen, and been guilty of, is taking a drives base specifications and just multiplying out. Say the manufacturer says the drive will to 79MB/Sec minimum throughput, we have 10 drives so that is 790MB/Sec of throughput! We all know from experience that this isn’t so. What eats us up is how much slower it really can be. As we have seen throughout this series there is overhead associated to everything. Before we just throw a bunch of disks in an enclosure and press it into service it would be nice to have an idea of what the performance should be. It’s also recommended to do some of this work before you actually buy anything so you don’t have to go back to your boss and beg for more money and explain to him that your wild guess was wrong.
Always add a pinch of salt to whatever the disk manufacturer puts in the specifications. Most of the time they will be close enough. The problem lies in the fact they don’t always disclose the methods for archiving those numbers. For instance, when they report minimum and maximum throughput they are usually talking about a scan of the entire disk including all meta data stored between tracks, the best possible throughput possible. You won’t see those results in every day life. They also give you numbers that can be completely irrelevant like single sector read rates. very rarely do you read a single sector at a time. Personally, I would love if the drive makers gave the engineering specifications. I know that won’t happen, it would make my life easier though. The disk characteristics that are important are, sector size,spindle speed, seek times read and write, sequential times read and write. To a lesser extent sequential throughput in megabytes per second. With the single disk numbers we can move on to the RAID configuration.
Configuring your RAID Array
There are several factors that impact the RAID arrays ability to perform. The RAID level, size of the IO request, and stripe size. RAID level is the easy one, what kind of hits do you take on writes vs. capacity of the array. On the stripe size there is a direct corollary with the size of the IO request. If the IO request is bigger than the stripe size it will have to seek across another disk to satisfy the data request. If the IO request size is very small and random you may loose some IO performance if the requests pile up on one disk causing a hot spot. There are established calculations that you can perform to get an idea of how to configure you array. I’ve built a web page that you can use to do all the basic calculations, Disk Drive RAID Configuration Tool. These equations are base line estimates so you aren’t working completely in the dark. You can enter your own drive statistics or pick from one of 1100 hard drives in the database. This web calculator is based off of Peter Chen’s equations for estimating RAID performance and best stripe size. I’ll add more to it as I get time.
SQL Server IO Patterns and Array Performance
SQL Server works with two specific IO request size 8K and 64K in general. If you did your due diligence earlier you could also add any other request size that you saw come through. Focusing on the page size and extent size is a good place to start. Using the raid calculator tool I chose a Seagate Savvio 15K.2 drive as my base. One of the things my calculator can’t take into consideration is your system and RAID HBA. This is where testing is essential. You will find there are anomalies in every card, physical limits on throughput and IO’s. Since my RAID card won’t do a stripe bigger than 256k that is my cap for size. Reading through several IO white papers on SQL Server the general recommendation is for 2000/2005 a 64k or 128k stripe size and for SQL Server 2008 a 256k stripe size. I’ve found as general guidance, this is a good place to start as well. The calculator tells me for a RAID 10 array with 24 drives at a 256k stripe size and 8k IO request I should get 9825 IOs/Sec and 76.75 MB/Sec on average, across reads, writes, sequential and random IO requests. That’s right, 76 MB/Sec throughput for 24 drives rated at 122 MB/sec minimum. That is 2.5 MB/Sec per drive. The same array at a 64k IO request size yields 8102 IOs/Sec and 506 MB/Sec. A huge difference in throughput just based on the IO request size. Still, not anywhere near 122 MB/Sec. As an estimate, I find that these numbers are “good enough” to start sizing my arrays. If I needed to figure out how big the array needs to be to support say 150 MB/sec throughput or 10000 IOs/Sec you can do that with the calculator as well. Armed with our estimates it’s time to actually test our new RAID arrays. I use SQLIO to do synthetic benchmarking before running any actual data loads.
After doing a round of testing I found that in some cases the numbers were a little high or a little low. Other factors that are hard to calculate are cache hit ratios. Enterprise RAID HBA’s usually disable the write cache on the local disk controller and just use their own batter backed cache for all write operations. This is safer but with more and more disks on a single controller the amount of cache per disk can get pretty low. The HBA will also want you to split that between read and write operations. On my HP RAID HBA’s the default is 25% read and 75% write. In an older study I found on disk caches and cache size saw diminishing returns above 2 MB gaining between 1 and 2 percent additional cache hits per megabyte of cache. I expect that to flatten out even more as the caches get larger, you simply can’t get 100% cache ratios that would mean the whole drive fit in the ram cache or your IO request are the same over and over. Generally if that is the case you will find SQL Server won’t have to go to disk it will have what it needs in the buffer pool for reads. I find that if you have less than 20 percent write activity leaving the defaults is fine. If I do have a write heavy load I will set the cache to 100% writes.
Having completed my benchmarking I found that 128k or 256k stripe size was fine on average. Just realize that if you optimize for one IO pattern the others will suffer. Latency is also important and I have included it here as well. You find that the larger the IO request and the smaller the stripe size latency gets worse. Here are the results from my tests on a DL380 G5 with a P411 and 24 drives in a MSA 70 enclosure. I’ve included tests for an 8k to 256k stripe sizes.
As a footnote I’d like to thank Joe Handley, Ben Poliakoff, David Gosslin and Dale Davis for helping me get the Disk Drive RAID Configuration Tool together. I’m not a web guy!
WARNING! Lots of charts below!
|Read 8K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads|
|Write 8K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads|
|Read 64K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads|
|Write 64K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads|