Category Archives: Fundamentals
It’s that time again, T-SQL Tuesday is here! This time Pat Wright (blog|twitter) is hosting and has put forth automating tasks using ether T-SQL or Powershell. I LOVE automating stuff. As a production DBA in some very large shops you can’t do your job unless you make your servers work for you. I’ve been using T-SQL and *GASP* xp_cmdshell, OSQL and file shares to gather stats and push configurations to servers for decades. Log before fancy things like C# and Powershell existed. These days I use a variety of home grown tools but doing things with just T-SQL can be just as powerful. I’m going to leverage this post to start a series on a pure T-SQL implementation of configuration management, data gathering and utility procedure deployment.
Where Is Your Management Database?
Every DBA should have two things, a utility, or management database on every server and a central repository where all the locally collected data in the management database is pulled back to. What surprises people when I talk about this methodology is I don’t always advocate using a licensed copy of SQL Server. By that I mean that SQL Express works just great as a central repository server. I usually put this on a virtual machine, Hyper-V or what ever flavor of virtual environment your company supports. This allows you to do things like enable CLR and xp_cmdshell on a server that technically is non-production and keep your security risk much lower. Every server that is deployed in my shop gets a local management database. From the central repository I can push out everything I need to manage a server and keep all my servers up to date as I make improvements or bug fixes to the management code. That’s all I’m really going to say about that in this post though I just wanted to give you an idea of just how deep the rabbit hole can go.
DMV’s give you the information, you have to use it.
Since SQL Sever 2005 Microsoft let of of the black box mentality and started providing crazy useful information via Dynamic Management Views. Virtual file statistics though have been around for quite a while. They got a touch up in the DMV but the basic information was available in SQL Server 2000 via function call. The DMV I’m after is sys.dm_io_virtual_file_stats. It has a ton of information in it. It’s main problem though is it is an aggregation over time and doesn’t really tell you what is different from yesterday to today. To get around that we have to build our own sampling routine.
The VirtualFileStats Table
We create a table in our local management database to collect the daily numbers. I try to keep things simple.
CREATE TABLE dbo.VirtualFileStats ( RecordID INT IDENTITY(1,1) NOT NULL, ServerName VARCHAR(255) NOT NULL, DBID INT NOT NULL, FileID INT NOT NULL, Reads BIGINT NULL, ReadsFromStart BIGINT NULL, Writes BIGINT NULL, WritesFromStart BIGINT NULL, BytesRead BIGINT NULL, BytesReadFromStart BIGINT NULL, BytesWritten BIGINT NULL, BytesWrittenFromStart BIGINT NULL, IostallInMilliseconds BIGINT NULL, IostallInMillisecondsFromStart BIGINT NULL, IostallReadsInMilliseconds BIGINT NULL, IostallReadsInMillisecondsFromStart BIGINT NULL, IostallWritesInMilliseconds BIGINT NULL, IostallWritesInMillisecondsFromStart BIGINT NULL, RecordedDateTime DATETIME NULL, IntervalInMilliseconds BIGINT NULL, FirstMeasureFromStart BIT NULL ) GO CREATE TABLE dbo.VirtualFileStatsHistory ( RecordID INT NOT NULL, ServerName VARCHAR(255) NOT NULL, DBID INT NOT NULL, FileID INT NOT NULL, Reads BIGINT NULL, ReadsFromStart BIGINT NULL, Writes BIGINT NULL, WritesFromStart BIGINT NULL, BytesRead BIGINT NULL, BytesReadFromStart BIGINT NULL, BytesWritten BIGINT NULL, BytesWrittenFromStart BIGINT NULL, IostallInMilliseconds BIGINT NULL, IostallInMillisecondsFromStart BIGINT NULL, IostallReadsInMilliseconds BIGINT NULL, IostallReadsInMillisecondsFromStart BIGINT NULL, IostallWritesInMilliseconds BIGINT NULL, IostallWritesInMillisecondsFromStart BIGINT NULL, RecordedDateTime DATETIME NULL, IntervalInMilliseconds BIGINT NULL, FirstMeasureFromStart BIT NULL )
This is what we need to gather, and later analyze the data. Since we are managing our samples we have to know when the sampling started and what the first sample is. FirstMeasureFromStart lets us know that it is the first base measurements the rest of the samples will delta off of.
GatherVirtualFileStats Stored Procedure
Next we need a stored procedure to do the sampling. One thing you will notice is the procedure executes continuously with a WAIT FOR DELAY built into it so you can get finer grained than the 1 minute limitation of the SQL Agent. Sometimes, I will do one off sampling for a short period, say 30 minutes at a 30 second interval but most often I just let it run and set the sample rate at 1 minute or larger depending on how busy the system is.
IF EXISTS (SELECT * FROM dbo.sysobjects WHERE id = Object_id(N'[dbo].[GatherVirtualFileStats]') AND Objectproperty(id, N'IsProcedure') = 1) DROP PROCEDURE [dbo].[GatherVirtualFileStats] GO -------------------------------------------------------------------------------------- -- GatherVirtualFileStats -- by: Wesley D. Brown -- date: 02/08/2011 -- mod: 04/14/2011 -- description: -- This stored procedure is used to sample sys.dm_io_virtual_file_stats to track -- performance at a database file level. This is useful for finding -- hotspots on SAN's or under performing IO systems. -- parameters: -- @Duration = '01:00:00' How long to run before exiting -- @IntervalInSeconds = 120 Number of seconds between samples --@DB = -1 DB_ID to monitor, -1 for all --@DBFile = -1 File_ID of file to monitor, -1 for all -- usage: -- DECLARE @RC INT, -- @StartTime DATETIME, -- @databaseID INT -- SELECT @StartTime = Getdate(), -- @databaseID = Db_id() -- EXEC @RC = Gathervirtualfilestats -- '00:45:30', -- 30, -- 10, -- -1 -- SELECT * -- FROM dbo.VirtualFileStats -- WHERE DBID = 10 -- ORDER BY RecordID -- platforms: -- SQL Server 2005 -- SQL Server 2008 -- SQL Server 2008 R2 -- tested: -- SQL Server 2005 SP2 -- SQL Server 2008 R2 --------------------------------------------------------------------------------------- -- *** change log *** -- Added history table and perge on start up if there is data in the main table -- *** end change log *** ------------------------------------------------------------------------------------- CREATE PROC dbo.Gathervirtualfilestats @Duration DATETIME = '01:00:00', @IntervalInSeconds INT = 120, @DB INT = -1, @DBFile INT = -1 AS SET nocount ON DECLARE @StopTime DATETIME, @LastRecordedDateTime DATETIME, @CurrentDateTime DATETIME, @ErrorNumber INT, @NumberOfRows INT, @ErrorMessageText NVARCHAR(4000), @CurrentServerName VARCHAR(255), @DifferenceInMilliSeconds BIGINT IF EXISTS (SELECT 1 FROM dbo.VirtualFileStats) BEGIN IF EXISTS (SELECT * FROM dbo.sysobjects WHERE id = Object_id(N'[dbo].[VirtualFileStats]') AND Objectproperty(id, N'IsTable') = 1) BEGIN INSERT INTO dbo.VirtualFileStatsHistory SELECT * FROM VirtualFileStats; TRUNCATE TABLE dbo.VirtualFileStats; END END SELECT @CurrentServerName = Cast(Serverproperty('servername') AS VARCHAR(255)) SET @DifferenceInMilliSeconds = Datediff(ms, CONVERT(DATETIME, '00:00:00', 8), @Duration) SELECT @StopTime = Dateadd(ms, @DifferenceInMilliSeconds, Getdate()) WHILE Getdate() <= @StopTime BEGIN SELECT @LastRecordedDateTime = @CurrentDateTime SELECT @CurrentDateTime = Getdate() INSERT INTO dbo.VirtualFileStats (ServerName, DBID, FileID, Reads, ReadsFromStart, Writes, WritesFromStart, BytesRead, BytesReadFromStart, BytesWritten, BytesWrittenFromStart, IostallInMilliseconds, IostallInMillisecondsFromStart, IostallReadsInMilliseconds, IostallReadsInMillisecondsFromStart, IostallWritesInMilliseconds, IostallWritesInMillisecondsFromStart, RecordedDateTime, IntervalinMilliseconds, FirstMeasureFromStart) SELECT @CurrentServerName, vfs.database_id, vfs.[file_id], vfs.num_of_reads - dbaf.ReadsFromStart AS Reads, vfs.num_of_reads AS ReadsFromStart, vfs.num_of_writes - dbaf.WritesFromStart AS Writes, vfs.num_of_writes AS WritesFromStart, vfs.num_of_bytes_read - dbaf.BytesReadFromStart AS BytesRead, vfs.num_of_bytes_read AS BytesReadFromStart, vfs.num_of_bytes_written - dbaf.BytesWrittenFromStart AS BytesWritten, vfs.num_of_bytes_written AS BytesWrittenFromStart, vfs.io_stall - dbaf.IostallInMillisecondsFromStart AS IostallInMilliseconds, vfs.io_stall AS IostallInMillisecondsFromStart, vfs.io_stall_read_ms - dbaf.IostallReadsInMillisecondsFromStart AS IostallReadsInMilliseconds, vfs.io_stall_read_ms AS IostallReadsInMillisecondsFromStart, vfs.io_stall_write_ms - dbaf.IostallWritesInMillisecondsFromStart AS IostallWritesInMilliseconds, vfs.io_stall_write_ms AS IostallWritesInMillisecondsFromStart, @CurrentDateTime, CASE WHEN @LastRecordedDateTime IS NULL THEN NULL ELSE Datediff(ms, dbaf.RecordedDateTime, @CurrentDateTime) END AS IntervalInMilliseconds, CASE WHEN @LastRecordedDateTime IS NULL THEN 1 ELSE 0 END AS FirstMeasureFromStart FROM sys.Dm_io_virtual_file_stats(@DB, @DBFile) vfs LEFT OUTER JOIN VirtualFileStats dbaf ON vfs.database_id = dbaf.dbid AND vfs.[file_id] = dbaf.fileid WHERE ( @LastRecordedDateTime IS NULL OR dbaf.RecordedDateTime = @LastRecordedDateTime ) SELECT @ErrorNumber = @@ERROR, @NumberOfRows = @@ROWCOUNT IF @ErrorNumber != 0 BEGIN SET @ErrorMessageText = 'Error ' + CONVERT(VARCHAR(10), @ErrorNumber) + ' failed to insert file stats data!' RAISERROR (@ErrorMessageText, 16, 1) WITH LOG RETURN @ErrorNumber END WAITFOR DELAY @IntervalInSeconds END
I Have Data, Now What?
This is where the fun begins! If you just query the table the data doesn’t make much sense.
We need to do some simple math and get the data into a metric that is meaningful.
SELECT TOP 10 Db_name(dbid) AS 'databasename', File_name(fileid) AS 'filename', Reads / ( IntervalInMilliSeconds / 1000 ) AS 'readspersecond', Writes / ( IntervalInMilliSeconds / 1000 ) AS 'writespersecond', ( Reads + Writes ) / ( IntervalInMilliSeconds / 1000 ) AS 'iopersecond', CASE WHEN ( Reads / ( IntervalInMilliSeconds / 1000 ) ) > 0 AND IostallReadsInMilliseconds > 0 THEN IostallReadsInMilliseconds / Reads ELSE 0 END AS 'iolatencyreads', CASE WHEN ( Reads / ( IntervalInMilliSeconds / 1000 ) ) > 0 AND IostallWritesInMilliseconds > 0 THEN IostallWritesInMilliseconds / Writes ELSE 0 END AS 'iolatencywrites', CASE WHEN ( ( Reads + Writes ) / ( IntervalInMilliSeconds / 1000 ) > 0 AND IostallInMilliseconds > 0 ) THEN IostallInMilliseconds / ( Reads + Writes ) ELSE 0 END AS 'iolatency', RecordedDateTime FROM management.dbo.VirtualFileStats WHERE DBID = 10 AND FirstMeasureFromStart = 0 ORDER BY RecordID
This gives us reads, writes and io latency per second results.
Now we are cooking! We can now see that on this database we are seeing some spikes in latency, the number of milliseconds it takes to complete a single IO request, and may warrant investigation. As a general rule of thumb if I see IO latency above 20 milliseconds consistently I start looking deeper into the IO system to see what is wrong. A single modern hard disk is capable of 130 random IO’s a second. Another thing to consider is how many databases are on the same disk. I will pull the database files together with the volume they are on to get a true aggregate number of IO’s and latencies. You may find that a single database is dominating the disks and causing other databases to slow down even when the number of IO’s for those databases is small. On a SAN being able to get to the file level can help you locate the physical disks for that LUN and help your SAN administrators look at a very specific spot instead of the overall health of the SAN and actually fix your performance issues.
Lastly, I run a nightly job that performs these aggregations for me and moves them into a table that I keep long term so I can see performance over time. This is a great way to see if you are getting near your IO capacity and if the steps you have done correct the issues.
So, That’s it?
Not by a long shot. Measuring IO latencies is just one piece of a much larger puzzle for troubleshooting performance problems. It is a valuable one though, and keeping this information over time is just priceless.
I added a history table an a check at the beginning to move all data to history before starting up. Mike Kaplan reported below that multiple runs caused issues if there was data in the table from a previous run.
Solid state storage has come on strong in the last year. With that explosion of new products it can be hard to look at all the vendor information and decide which device is best for you. Between the different manufacturers using different methods to benchmark their products showing two different numbers for reads and writes using different methodologies it can be extremely confusing. If you haven’t read Solid State Storage Basics you may not understand all the terms used in this article.
SLC and MLC Characteristics and Differences
Right now there are two main flavors of NAND Flash that are in use. Single Level Cell(SLC) and Multi Level Cell(MLC). SLC stores a single bit cell while MLC can store two bits. There are flavors of MLC that can store three and four bits but are unsuitable at this time for mass storage like hard drives. They have very low endurance and wear out quickly.
SLC has several desirable characteristics that have made it the choice for enterprise applications for quite a while. It is more durable in every way over MLC. Where it loses out is on capacity and price.
|Read Speed||25~ nanoseconds||50~ nanoseconds|
|Write Speed||220~ nanoseconds||900~ nanoseconds|
|P/E Cycles||100k to 300k||3k to 30k|
|Minimum ECC Bits required||1 bit per 512 bytes||12 bits per 512 bytes|
SLC can cost as much as five times as MLC. This alone is enough for many manufacturers to look at MLC over SLC. Couple that with the increased capacity makes MLC a compelling alternative for mass storage. The problem has been how to make MLC reliable in the enterprise.
As you can see, SLC is more robust requiring less error correcting code to fix data issues. Just a few years ago, MLC wasn’t considered good enough to be in even consumer grade drives. Over the last three years several manufacturers have focused on building NAND Flash controllers that could compensate for this using large amounts of error correction. In some cases several times the 12 bits per 512 bytes. This combined with better garbage collection and wear-leveling algorithms have finally extended MLC into the enterprise. This comes with a price though. ECC has to be stored somewhere, usually sacrificing storage space, and you need a much more powerful controller to handle the calculations without hurting performance. Another one of the techniques to extend the performance and endurance used is to put as many chips in a parallel arrangement with multiple channels. Think of it as RAID on a chip level instead of a hard disk level. This allows them to spread the IO load as wide as possible. The larger the capacity of the storage device the more area it has to use things like TRIM and it’s own internal garbage collection across multiple NAND chips keeping IO from stalling out due to write amplification. It also increases the life of the device as well since you can spread the wear-leveling out. There are standards bodies like JEDEC that help define endurance and longevity but you must still read the fine print. A good example is the Intel product manual for the X-25M SSD. If you look at page 6 you see the minimum useful life rated at 3 years. But, if you look at the write endurance you see that the 80 gigabyte drive is rated at 7.5 terabytes. That is 7.5 terabytes period, for the life of the drive. That means you shouldn’t write more than 21 gigabytes a day in changed data to the drive. For SQL Server that can be quite a low number. I’ve seen data warehousing processes load multiple terabytes over a 8 hour load window. Again, capacity equals endurance the 160 gigabyte drive can sustain 15 terabytes worth of data change. Intel will tell you that the X25-M is meant for enterprise workloads, they are wrong. In contrast, the X-25E SSD has a much longer life due to the SLC it uses instead of MLC. the 32 gigabyte version supports 1 petabyte of random writes and the 64 gigabyte drive supports 2 petabytes of random writes over the life of the drive. This makes the X-25E a better candidate for server work loads. Fusion-io rates their MLC based ioDrive at 5 terabytes a day. They also claim a life expectancy of 16 years. That is 28 petabytes of P/E cycles. This is to just show you that with enough engineering you can have an MLC based device still be very reliable.
SATA, SAS or Neither?
The interface for your solid state disk is also critical to the performance of the drive. We are quickly hitting a wall with SATA II and solid state where a single SSD can saturate a single SATA channel. SAS and SATA both have released the new third generation standard allowing up to 600 megabytes a second of through put but even that doesn’t offer much head room for growth. Several manufacturers are calling their SSD offerings enterprise even though they are on a SATA interface. If you are building a high performance IO subsystem SATA isn’t the best option. With SATA II and the addition of Native Command Queuing it did get a lot better but still falls short of SAS in several areas.
SATA Vs. SAS
|Command Queuing||TCQ supports queue depths up to 216 usually capped at 64||NCQ supports queue depths up to 32|
|Error recovery and detection||Uses the SCSI command is more robust||SMART Proven to be in adequate. see Google Paper|
|Duplex||Full Duplex dual port per drive||Half Duplex single port|
|Multi-path IO||fully supported at drive level||supported in SATA II via expanders|
Some of these features were nice but if you were choosing between a 7200 RPM SATA drive and a 7200 RPM SAS drive there wasn’t a huge difference. Add in flash though and SATA very quickly shows its short comings. I cannot stress how important command queuing is to flash storage. If the drive you have picked supports NCQ make sure your HBA supports NCQ and ACHI mode to get the most out of it, PC Perspective has a nice write up on this. Lastly, most SATA drives don’t honor the OS request to disable write caching on the drive. This is a big deal for SQL Server where protecting the data is very important. That alone usually keeps me from putting critical databases on SATA based storage. Most RAID HBA’s may let you toggle the drives write cache on or off on a per drive basis but there is still no guarantee that the drive will honor that request ether.
PCIe add in cards
If you aren’t limited to the standard 3.5” or 2.5” form factor and can choose a PCIe based flash device I would recommend starting with Fusion-io. I haven’t had any experience with the Texas Memory System PCIe card though. OCZ, Super Talent and others like them use a combination of bridge chips, RAID controller chips and flash controller chips to build up their SATA PCIe offerings. The form factor may be more convenient but they are ultimately the same as multiple SATA drives plugged into a RAID HBA.
The last thing to remember is TRIM doesn’t work through RAID HBAs SAS or SATA doesn’t matter.
By the numbers
I see people quote performance numbers from different manufactures about just how fast their particular solid state storage is. The problem is, there is no real standard for measuring performance and it can be almost impossible to do an apples to apples comparison between two different devices. If you start at the product specification for the X25-M you see the what you expect. 4K read IOPS 35,000 at 100 percent span(using the entire drive). Write IOPS however are a little different. Using 100 percent span the IO/Sec drop to 350. If you only use one tenth of the drive it shoots up to 3300. The difference is startling. Using an old technique called short stroking, they are able to show the drive in a better light. Using this technique on hard disks yields higher IO’s per second at the cost of capacity and throughput. Applying this technique to a solid state disk limits the amount of data space used for writes and gives the maximum amount of free space for wear-leveling and garbage collection greatly reducing the write amplification effect. Rarely do you see the lower number quoted. On the X-25E all numbers are quoted at full span, showing again the higher performance of SLC. Also, if you look at the footnotes all write tests were done with drive caches enabled. For SQL Server this is a bad idea, if you have a power outage any data in the drive cache is lost. They perform these tests at the maximum queue depth for Native Command Queuing (NCQ) can handle. Again, this pushes the device to its peak throughput. This isn’t a bad thing for SSD’s, but most SQL Server setups have been engineered to keep queue depths low to decrease latencies from the IO system which is usually made up of spinning disks. If you don’t have latency issues now, you may not see a huge improvement by replacing your spinning disks with solid state ones. Size of the IO request is also very important Usually for number of IO’s they will use a sector sized request. On SSD’s that is normally 4 kilobytes. For throughput megabytes per second they use a 128 kilobyte request to get higher numbers. So, when you read the specifications you get the impression that a drive will do say 260 MB/sec at 35,000 IOs/Sec which just isn’t true. This isn’t a new game, hard drive benchmarks also do something similar. As you look at the 4k numbers you can effectively cut them in half since SQL Server works on an 8k page request size. SSDs also perform differently on random and sequential IO loads just like hard disks do. When you look at the specification make sure and note the IO mix, if they don’t give those numbers assume that you will have to do your own testing!
Previous Writes Effect Future Writes
Another issue with the performance numbers quoted has to do with the state of the drive. When a solid state disk is new, i.e. never been written to, it is at it’s peak. Performance will be the best it is ever going to be. When you test your solid state devices doing short duration tests can be very misleading. As I have already pointed out, if you only use a small section of the drives for writes you get inflated numbers. If you only do a short test on the entire drive you are effectively doing the same thing. You must test the entire drive. You must also understand your workload. If you don’t know what the workload will be don’t be afraid to test a wide range of IO sizes and types. Sequential writes tend to leave large contiguous blocks of free space making garbage collection faster. In contrast random writes typically leave lots of small blocks of free space forcing garbage collection to work overtime slowing writes down. As you move from one IO type to another you should add in extra time for the drive to settle into a new steady state before resuming valid samples. Your goal is to get the drive to perform in a predictable manor for your IO load. Realize you may need to discard a range of samples that cover the transition from one steady state to the other. It can lower or inflate your averages and cause you to under or over provision your storage to meet your IO requirements.
Performance over Time
Unlike a hard drive, as you use a solid state disks performance degrades over time for several reasons. In the case of the X-25M the first firmware suffered from poor garbage collection and IO pattern recognition on large volumes of small IO’s causing the drive to suffer as much as a five fold decrease in write performance. We aren’t just talking small files but small changes to large files, like SQL Server data files. This particular problem was partially fixed with a firmware update. In general, all solid state devices suffer As you use your drive over a longer period it will lose performance as part of the normal wear on the NAND Flash chips themselves. They develop more errors cause more write retries. These issues are corrected using ECC and bad block management, but it still leads to poorer performance. SLC has an advantage over MLC again due to it’s much higher endurance but isn’t 100 percent immune to this. If you replace your hardware on a three or five year cycle this may not be a huge issue for you, but it still pays to monitor the performance over time.
There is a lot to learn when it comes to solid state storage. Making sure you do your own testing and research can keep you from suffering from premature failure and poor performance down the road. Remember, NAND Flash has been around for a while but this new wave of solid state storage is only a few years old. Not having a large pool of these devices in the field for longer than their rated life span makes it hard to predict if they are truly as reliable as we all hope they are.
In the last RAID article we covered the basics. This is a little deeper dive into the underlying mechanics of RAID. Exactly what it does, how it does it and what it doesn’t do that people assume it does. I sited David Patterson, Garth Gibson, and Randy Kats and their work at UC Berkley on RAID. They show something I’ve talked about before the “Pending I/O Crises”. Of course it isn’t pending anymore, its here. One of the concerns has to do with Amdah’s Law and speeding up execution with parallel operations. As processors and memory speed up hard disks are still an order of magnitude slower. Another aspect is Kryder’s Law, which like Moore’s Law, is a estimation of capacity growth of hard disks over time. Kryder’s Law is starting to slow down just as Moore’s law is. The problem with hard drives has never really been capacity, its speed. As areal density increases you do get an increase in data throughput, there is simply more data per square inch on the disk. You also get an improvement in I/O’s, tracks are closer together. We haven’t broken past the 15k barrier yet. I’ve still got Seagate Cheetah 15k.3 drive from 2002. It has a max sequential throughput around 80 MB/sec. I doubt we will see spinning disks faster than 15k. This is a real problem for scaling I/O up. Enter RAID. It’s simple get a bunch of disks and then stripe data across them. One little problem creeps up. Reliability goes down for each drive you add to the array. Using RAID 0 pretty much guarantees you will have an array failure. To overcome this We start adding some way to make the data more redundant.
Hard Disk Reliability
People make a lot of assumptions about hard drives and their reliability. Hard disks break down into two classes consumer grade, the drive you have in your desktop and enterprise, the kind usually in your servers. There are misconceptions around both. Recently, Google and others have written papers based on long term large batch sample failure rates and found the enterprise class drives don’t last any longer than consumer class. This study is perfectly valid from a physical reliability point of view. Most drives are manufactured the same way in the same plants. Not like the poor misunderstood lemming, hard disks do all jump off a cliff together. Studies have shown that there is a strong corollary to disk failure and a shared manufacturing batch. Simply put, if they are made around the same time if one has a failure there is a likelihood, around 30%, other drives in that batch will also suffer failures. So, what are we paying for with an enterprise drive besides speed? Data reliability. Enterprise level drives have more robust error correction than their consumer counterparts. On a normal hard drive the smallest piece of data that can be written is 512 bytes. This is the size of a sector. Enterprise drives usually have 520 byte sector 8 bytes are used to verify the data in that sector, this is the Data Integrity Field. DIF isn’t 100% ether. It is more reliable than a consumer drive without it. You can still have write corruption for several reasons. Misdirected writes occur when data is written to the wrong location on disk and reported as a successful write. When the system goes to access again you get a read fault. Torn pages, which we are familiar with, is when an 8k page write is requested but only part of the 8k is actually reported. Corruption outside the drive where the controller makes a bad request to write but it is a perfectly legitimate I/O request at the hard drive level. With larger drives the odds of hitting one of these errors becomes a real possibility. Enterprise drives add this extra layer of protection. Your RAID HBA may also have additional error correction. The last thing I would like to touch on is write catching. Without a battery backup, or if the cache non-volatile in nature, you will loose data on a power failure if a write is in progress.
RAID Host Bus Adapter Reliability
The adapter is as reliable as any other component in your system. Normally, the cache on the controller is ECC based. Also, you usually have the option of a battery module to supply the cache with power incase of an outage so the data in cache can be written to the array when everything comes back up. Most of the issues I have seen with RAID HBAs is almost always driver or firmware related. You may also see inconsistent performance due to write catching and the battery backup unit. The unit has to be taken off line and conditioned to keep it in top condition. The side effect is a temporary disabling of the write cache on the controller. You can override this setting on some controllers but it is dangerous proposition. I personal anecdote from my days at a large computer manufacturer, we started getting a larger volume of failed drive calls into support. We started doing failure analysis. It all pointed back to a particular batch of hard drives. That was when the drive manufacturer made a change in its drives removing very small component. It shaved a few cents off the cost but had a dramatic effect. All the drives were technically good and would pass validation. Under a enough load and attached to a particular RAID HBA they would randomly fall off line. It came down to the little component. It provided a little bit of electrical noise suppression on the SCSI bus. Some cards were effected and others chugged along just fine. This is also confirmed by the Google paper, they observed the same behavior. They also point out that 20% to 30% of all returned drives have no detectible problems. The point is validate your entire I/O stack. Any single component may be within specification but may not play well with others.
RAID Parity, Mirroring, and Recoverability
Not to belabor the point, RAID isn’t bullet proof. People rap RAID round themselves like Superman’s cape. There are several issues that all the RAID schemes in the world don’t protect against. With current hard disks in the two terabyte range it is possible to build even a small RAID 5 array and have potential for complete failure. The problem is the amount of data that has to be read for the rebuild process. Having a hot spare available reduces the time to replace a failed drive to zero but that is only part of the equation. The much larger part is rebuild time. Lets say you have a 14 drive RAID 5 array with the new two terabyte drives installed and suffer a failure. If you have no activity on the array and all the IO is detected to the rebuilt it could still take two or three days to rebuild the array. During that time you are effectively running on a RAID 0 array that is now under load. Your chance of total array failure is near 100%. RAID by its very nature assumes a failure is a hard failure. A drive goes off line and the redundant part of the system takes over. It also makes the assumption that if a write succeeds then, barring a hardware failure, the read will also be valid. Data is only validated on writes not on reads. If it was RAID 5 would be twice as slow on reads and four times as slow on writes as a single drive or RAID 0. With all the potential hidden write failures it is completely possible to have hidden corruption and not know it until it is way to late. RAID levels with striped parity are most susceptible to this kind of silent creeping corruption. It is possible that the corrupted data is in the parity stripe making it completely unusable for data reconstruction. If that particular piece of data doesn’t change you can go a very long time with a RAID 5 array with polluted parity. You know how to recover from a polluted parity stripe? Simple, copy all the data off the array, figure out which files are now corrupt and restore them. RAID 6 with its dual stripes makes it more likely to recover your data from a single parity stripe becoming corrupt. You do pay a price in write speed for that extra level of protection. RAID 1 and RAID 10 aren’t perfect ether. On a mirrored pair if the write is assumed good there is no way to validate that on read. Without a third piece of information, like a checksum, it would be a coin toss. If the read is successful there is no way to tell which drive has the bad data. It is possible to have a mirrored pair run just fine with one giving you corrupted data on reads all day long. It would manifest itself as file corruption or some other anomaly that could be difficult to track down. We are back to relying on the disk to tell us all is well. We often recommend RAID 10 over everything else for speed and reliability, and I still hold to that. RAID 10 can still suffer from a catastrophic failure due to a single mirrored pair failing at the same time. With the probability of correlated disk failures it can’t be ignored.
What Can We Do?
There are a few tools available to us that can help predict the failure of a drive or that something is wrong with the array. All modern drives support the SMART protocol. Even though Google found it wasn’t as useful and wasn’t 100% reliable, closer to 30%, some warning is better than none in my opinion. All modern RAID HBA’s also come with tools to detect parity errors. You do take a hit when you run these internal consistency checks. Just like you run maintenance on your databases via DBCC your RAID arrays need checkups too. They are a necessary evil if you don’t want any surprises one day when you have a failed drive in your RAID 5 array and can’t rebuild it. If you have intermittent problems with a drive, don’t mess around, replace it. The HBA almost always has the ability to send SNMP messages to something like nagios or HP Openview, Use it. If you aren’t running something like that usually you can configure email alerts on error to go out. Proactive is the name of the game.
Don’t take my word for it….
Short list of papers to get you started on your path to paranoia.
Series To Date:
If you have been following this series we have covered system buses, hard disks, host bus adapters and RAID. Along the way we also covered how to capture your IO patterns and the SQLIO tool. Now we will pull it all together.We move up the stack even further to the actual layout of the RAID stripe and the file system. How the stripe and file system are laid out on your disks has a huge impact on performance. One of the things that has really gotten some traction over the last few years is sector alignment. This one thing, if not done, could cost you 30% to 40% of your IO potential. Jimmy May has covered sector alignment in depth So I won’t hash it here again. Kendal Van Dyke also has a good series that covers offset, stripe size, and allocation units with different raid levels.
It Don’t Add Up…
Something I’ve seen, and been guilty of, is taking a drives base specifications and just multiplying out. Say the manufacturer says the drive will to 79MB/Sec minimum throughput, we have 10 drives so that is 790MB/Sec of throughput! We all know from experience that this isn’t so. What eats us up is how much slower it really can be. As we have seen throughout this series there is overhead associated to everything. Before we just throw a bunch of disks in an enclosure and press it into service it would be nice to have an idea of what the performance should be. It’s also recommended to do some of this work before you actually buy anything so you don’t have to go back to your boss and beg for more money and explain to him that your wild guess was wrong.
Always add a pinch of salt to whatever the disk manufacturer puts in the specifications. Most of the time they will be close enough. The problem lies in the fact they don’t always disclose the methods for archiving those numbers. For instance, when they report minimum and maximum throughput they are usually talking about a scan of the entire disk including all meta data stored between tracks, the best possible throughput possible. You won’t see those results in every day life. They also give you numbers that can be completely irrelevant like single sector read rates. very rarely do you read a single sector at a time. Personally, I would love if the drive makers gave the engineering specifications. I know that won’t happen, it would make my life easier though. The disk characteristics that are important are, sector size,spindle speed, seek times read and write, sequential times read and write. To a lesser extent sequential throughput in megabytes per second. With the single disk numbers we can move on to the RAID configuration.
Configuring your RAID Array
There are several factors that impact the RAID arrays ability to perform. The RAID level, size of the IO request, and stripe size. RAID level is the easy one, what kind of hits do you take on writes vs. capacity of the array. On the stripe size there is a direct corollary with the size of the IO request. If the IO request is bigger than the stripe size it will have to seek across another disk to satisfy the data request. If the IO request size is very small and random you may loose some IO performance if the requests pile up on one disk causing a hot spot. There are established calculations that you can perform to get an idea of how to configure you array. I’ve built a web page that you can use to do all the basic calculations, Disk Drive RAID Configuration Tool. These equations are base line estimates so you aren’t working completely in the dark. You can enter your own drive statistics or pick from one of 1100 hard drives in the database. This web calculator is based off of Peter Chen’s equations for estimating RAID performance and best stripe size. I’ll add more to it as I get time.
SQL Server IO Patterns and Array Performance
SQL Server works with two specific IO request size 8K and 64K in general. If you did your due diligence earlier you could also add any other request size that you saw come through. Focusing on the page size and extent size is a good place to start. Using the raid calculator tool I chose a Seagate Savvio 15K.2 drive as my base. One of the things my calculator can’t take into consideration is your system and RAID HBA. This is where testing is essential. You will find there are anomalies in every card, physical limits on throughput and IO’s. Since my RAID card won’t do a stripe bigger than 256k that is my cap for size. Reading through several IO white papers on SQL Server the general recommendation is for 2000/2005 a 64k or 128k stripe size and for SQL Server 2008 a 256k stripe size. I’ve found as general guidance, this is a good place to start as well. The calculator tells me for a RAID 10 array with 24 drives at a 256k stripe size and 8k IO request I should get 9825 IOs/Sec and 76.75 MB/Sec on average, across reads, writes, sequential and random IO requests. That’s right, 76 MB/Sec throughput for 24 drives rated at 122 MB/sec minimum. That is 2.5 MB/Sec per drive. The same array at a 64k IO request size yields 8102 IOs/Sec and 506 MB/Sec. A huge difference in throughput just based on the IO request size. Still, not anywhere near 122 MB/Sec. As an estimate, I find that these numbers are “good enough” to start sizing my arrays. If I needed to figure out how big the array needs to be to support say 150 MB/sec throughput or 10000 IOs/Sec you can do that with the calculator as well. Armed with our estimates it’s time to actually test our new RAID arrays. I use SQLIO to do synthetic benchmarking before running any actual data loads.
After doing a round of testing I found that in some cases the numbers were a little high or a little low. Other factors that are hard to calculate are cache hit ratios. Enterprise RAID HBA’s usually disable the write cache on the local disk controller and just use their own batter backed cache for all write operations. This is safer but with more and more disks on a single controller the amount of cache per disk can get pretty low. The HBA will also want you to split that between read and write operations. On my HP RAID HBA’s the default is 25% read and 75% write. In an older study I found on disk caches and cache size saw diminishing returns above 2 MB gaining between 1 and 2 percent additional cache hits per megabyte of cache. I expect that to flatten out even more as the caches get larger, you simply can’t get 100% cache ratios that would mean the whole drive fit in the ram cache or your IO request are the same over and over. Generally if that is the case you will find SQL Server won’t have to go to disk it will have what it needs in the buffer pool for reads. I find that if you have less than 20 percent write activity leaving the defaults is fine. If I do have a write heavy load I will set the cache to 100% writes.
Having completed my benchmarking I found that 128k or 256k stripe size was fine on average. Just realize that if you optimize for one IO pattern the others will suffer. Latency is also important and I have included it here as well. You find that the larger the IO request and the smaller the stripe size latency gets worse. Here are the results from my tests on a DL380 G5 with a P411 and 24 drives in a MSA 70 enclosure. I’ve included tests for an 8k to 256k stripe sizes.
As a footnote I’d like to thank Joe Handley, Ben Poliakoff, David Gosslin and Dale Davis for helping me get the Disk Drive RAID Configuration Tool together. I’m not a web guy!
WARNING! Lots of charts below!
|Read 8K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads|
|Write 8K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads|
|Read 64K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads|
|Write 64K IO Request 24 73GB 15K Drives RAID 10 64K File System Cluster Size 1 Outstanding IO’s 8 Threads|
Series To Date:
In previous articles, we have covered the system bus, host bus adapters, and disk drives. Now we will move up the food chain at take a look at getting several disks to operate as one.
In 1988 David A. Patterson, Garth Gibson, and Randy H. Katz authored a seminal paper, A Case for Redundant Arrays of Inexpensive Disks (RAID). The main concept was to use off the shelf commodity hardware to provide better performance and reliability and a much lower price point than the current generation of storage. Even in 1988, we already knew that CPUs and memory were outpacing disk drives. To try to solve these issues Dr. Patterson and his team laid out the fundamentals of our modern RAID structures almost completely RAID levels 1 through 5 all directly come from this paper. There have been improvements in the error checking but the principals are the same. In 1993, Dr. Patterson along with his team released a paper covering RAID 6.
Speed, Fault Tolerance, or Capacity?
You can’t have your cake and eat it too. In the past, it was hard to justify the cost of RAID 10 unless you really needed speed and fault tolerance. RAID 5 was the default because in most situations it was good enough. Offering near raid 0 read speeds. If you had a heavy write workload, you took a penalty due to the parity stripe. RAID 6 suffers from this even more so with two parity stripes to deal with. Today, with the cost of drives coming down and the capacity going up RAID 10 should be the default configuration for everything.
Here is a breakdown of how each RAID level handles reads and writes in order of performance.
|RAID Level||Write Operations||Notes||Read Operations||Notes|
|RAID 0||1 operation||High throughput, low CPU utilization.
No data protection
|1 operation||High throughput, low CPU utilization.|
|RAID 1||2 IOP’s||Only as fast as a single drive.||1 IOP||Two read schemes available. Read data from both drives, or data from the drive that returns it first. One is higher throughput the other is faster seek times.|
|RAID 5||4 IOP’s||Read-Modify-Write requires two reads and two writes per write request. Lower throughput higher CPU if the HBA doesn’t have a dedicated IO processor.||1 IOP||High throughput low CPU utilization normally, in a failed state performance falls dramatically due to parity calculation and any rebuild operations that are going on.|
|RAID 6||6 IOP’s||Read-Modify-Write requires three reads and three writes per write request. Do not use a software implementation if it is available.||1 IOP||High throughput low CPU utilization normally, in a failed state performance falls dramatically due to parity calculation and any rebuild operations that are going on.|
Choosing your RAID level
This is not as easy as it should be. Between budgets, different storage types, and your requirements, any of the RAID levels could meet your needs. Let us work of off some base assumptions. Reliability is necessary, that rules out RAID 0 and probably RAID 0+1. Is the workload read or write intensive? A good rule of thumb is more than 10% reads go RAID 10. In addition, if write latency is a factor RAID 10 is the best choice. For read workloads, RAID 5 or RAID 6 will probably meet your needs just fine. One of the other things to take into consideration if you need lots of space RAID 5 or RAID 6 may meet your IO needs just through sheer number of disks. Take the number of disks divide by 4 for RAID 5 or 6 for RAID 6 then do your per disk IO calculations you may find that they do meet your IO requirements.
Separate IO types!
The type of IO, random or sequential, greatly affects your throughput. SQL Server has some fairly well documented IO information. One of the big ones folks overlook is keeping their log separate from their data files. I am not talking about all logs on one drive and all data on another, which buys you nothing. If you are going to do that you might as well put them all on one large volume and use every disk available. You are guaranteeing that all IO’s will be random. If you want to avoid this, you must separate your log files from data files AND each other! If the log file of a busy database is sharing with other log files, you reduce its IO throughput 3 fold and its data through put 10 to 20 fold.
RAID Reliability and Failures
Correlated Disk Failures
Disks from the same batch can suffer similar fate. Correlated disk failures can be due to a manufacturing defect that can affect a large number of drives. It can be very difficult to get a vendor to give you disks from different batches. Your best bet is to hedge against that and plan to structure your RAID arrays accordingly.
Error rates and Mean Time Between Failures
As hard disks get larger the chance for an uncorrectable and undetected read or write failure. On a desktop drive, that rate is 10^14 bits read there will be an unrecoverable error. A good example is an array with the latest two-terabyte SATA drives would hit this error on just one full pass of a 6 drive RAID 5 array. When this happens, it will trigger a rebuild event. The probability of hitting another failure during the rebuild is extremely high. Bianca Schroeder and Garth A. Gibson of Carnegie Mellon University have written an excellent paper on the subject. Read it, it will keep you up at night worrying about your current arrays. Enterprise class drives are supposed to protect against this. No study so far proves that out. That does not mean I am swapping out my SAS for SATA. Performance is still king. They do boast a much better error rate 10^16 or 100 times better. Is this number accurate or not is another question all together. Google also did a study on disk failure rates, Failure Trends in a Large Disk Drive Population. Google also found correlated disk failures among other things. This is necessary read as well. Eventually, RAID 5 just will not be an option, and RAID 6 will be where RAID 5 is today.
What RAID Does Not Do
RAID Doesn’t back your data up. You heard me. It is not a replacement for a real backup system. Write errors do occur.As database people we are aware of atomic operations, the concept of an all or nothing operation, and recovering from a failed transaction. People assume the file system and disk is also atomic, it isn’t. NTFS does have a transaction system now TxF I doubt SQL Server is using it. Disk drives limit data transfer guarantees to the sector size of the disk, 512 bytes. If you have the write cache enabled and suffer a power failure, it is possible to write part of the 8k block. If this happens, SQL Server will read new and old data from that page, which is now in an inconsistent state. This is not a disk failure. It wrote every 512-byte block it could successfully. When the disk drive comes back on line, the data on the disk is not corrupted at the sector level at all. If you have turned off torn page detection or page checksum because you believe it is a huge performance hit, turn it back on. Add more disks if you need the extra performance don’t put your data at risk.
- Data files tend to be random reads and writes.
- Log files have zero random reads and writes normally.
- More than one active log on a drive equals random reads and writes.
- Use Raid 1 for logs or RAID 10 if you need the space.
- Use RAID 5 or RAID 6 for data files if capacity and read performance are more important than write speed.
- The more disks you add to an array the greater chance you have for data loss.
- Raid 5 offers very good reliability at small scale. Rule of thumb, more than 8 drives in a RAID 5 could be disastrous.
- Raid 6 offers very good reliability at large scales. Rule of thumb, less than 9 drives you should consider RAID 5 instead.
- Raid 10 offers excellent reliability at any scale but is susceptible to correlated disk failures.
- The larger the disk drive capacity should adjust your number of disks down per array.
- Turn on torn page for 2000 and checksum for 2005/08.
- Restore Backups regularly,
- RAID isn’t a backup solution.
Series To Date:
We have covered the Hard Disk and the System Bus. This time around we will cover disk controllers and host bus adapters.
In The Beginning…
There were three distinct components to your IO subsystem, the disk, controller, and the host bus adapter. Today there are still three distinct components but the arrangement has changed. The physical disk we have covered and you know about. What you may not realize is the disk controller is actually the circuit board on the back of the hard drive. In the past this board may have been an add-in card, a back plane that the drives plugged into or even an add-in card with the hard disk mounted on it! It took a little time for the configuration we take for granted today to settle out. Once the form factor for a hard drive and the controller was done there was still the issue of what a host bus adaptor was suppose to do. Some of you may remember the old days of MFM, RLL, and proprietary disk layouts. Having to do a low level format, setting the interleave, even having to park the drive when you were done with the computer. Those days are long gone. Now, low level formatting is done at the factory, there is no need for interleaving, and all drives auto-park. Whoa, what a time warp. All of these things were eliminated mostly due to the advancement in disk controllers and host bus adapters.
The Disk Controller
The card that slots into your system and is connected via cable to your hard drive isn’t the disk controller. The disk controller resides on the hard drive and handles all the low level operations. From spinning the disk, moving the heads and transferring the data the disk controller does most of the heaving lifting. Once the data has been read it finally makes its way down the wire to the host bus adapter.
There have been several data encoding and signaling schemes over the years. We have touched on MFM and RLL as the first wide spread standards used early on. The two standards that have stood the test of time are IDE/ATA and SCSI. These standards can be implemented on top of other protocols like IP and Fibre Channel both network protocols.There are ATA implementations on FC and IP but nether are as popular as SCSI. Fibre Channel is pretty much the domain of Storage Area Networks(SAN) which we will cover in a future article.
A breakdown of speeds.
|Bus Type||Speed MB/Sec|
Alternate SCSI/ATA implementations
|Fibre Channel 1GFC||106|
|Fibre Channel 2GFC||212|
|Fibre Channel 4GFC||425|
|Fibre Channel 8GFC||850|
|iSCSI Gigabit Ethernet||125|
|iSCSI 10 Gigabit Ethernet||1250|
A modern spinning disks would have a hard time using even ATA/133’s available bandwidth all by itself. The older parallel ATA (PATA) and SCSI standards are giving way to their newer serial counterparts SATA and SAS. The previous generation had several marked differences between them. ATA could only have two drives per channel while SCSI could have up to 15. ATA was unidirectional, only able to read or write, to the drive while SCSI was bidirectional. This has carried over to the new standards as well.
If you have a SAS HBA it will accept both SAS and SATA drives. Another great feature is the reliability of the connectors. Both ATA and SCSI relied on large ribbon cables and in the case of SCSI termination to the cable chain. I have been kept up at night troubleshooting faulty SCSI cabling running down the chain to try and figure out which drive was causing the problem or if it was a termination issue. The new cables are much smaller and are all point to point, no daisy chaining or termination issues to worry about. The last boon added was the idea of using expanders in the case of SAS or port multipliers for SATA only arrays. The old SCSI standard with 15 drives in a single chain was limiting. You also had the issue that 15 drives could easily saturate a single U320 channel. The biggest SCSI RAID HBA’s usually shipped with 4 channels. In contrast, the new SAS HBA’s may have 4 times that amount. With the SAS expanders you can aggregate SAS channels and have more drives in a single chain. With the SAS 300 standard you could have 4 drives saturate a single channel. With a single 4 drive expander you could have 4 drives on that single channel making the most use of the available bandwidth. You can also have up to 128 drives on a edge expander and up to an astounding 16,384 SAS devices in a single SAS domain. This gives you a lot of flexibility when it comes to configuring your storage and utilizing the bandwidth available.
As you plan your configuration you must be mindful of how many channels you have, what kind of bus the HBA uses and how much bandwidth is available through the entire stack. For example, If you have a PCIe RAID controller with 28 ports that is a theoretical throughput of 8.4 gigabytes a second of available bandwidth via the SAS 300 protocol. The drive may be able to deliver 80 megabytes a second if you only use one drive per port and no expanders that is 2.2 gigabytes a second. If the HBA isn’t plugged into a PCIe 2.0 x8 slot or PCIe 1.0 x16 slot you aren’t going to get that 2.2 GB/Sec of throughput. You should still get the IO’s available but sustained throughput will be limited. Just because an HBA says it can support 108 drives doesn’t mean you will get all the throughput of those drives. You may have an HBA that only supports PCIe 1.0 and only has 4 lanes for a total of 1GB/sec of throughput to the system. Again, you get the IO increase and for SQL Server sometimes that is exactly what you are after.
Host Bus Adapter
This is what most people think of as the disk controller or controller card. In its simplest form it transfers data to and from the system board to the hard disk controller. Of course there are other things that can happen on the HBA. It can have intergraded RAID functions, additional caching, or other things that are not appropriate to do at the disk controller level. There are several types of HBA’s from the ones built into your computers motherboard to high end SAS RAID controllers.
Cache, Disk Controllers, and HBA’s
Almost all enterprise class HBA’s usually have caching as an option or built into the card. This is a particular interest to us and SQL Server. Your data will be safe, SQL Server guarantees this over all else. In my post on capturing IO patterns I discuss why and how SQL Server does this and the concept of stable media. SQL Server assumes that it is talking to a single physical disk and opens the data files in such a way that write caching isn’t used even if it is available. SAS and SCSI drives honor this request normally. But, one of the options more advanced HBA’s offer you are the ability to use the cache and still have stable media. This is usually accomplished through a battery backup unit mounted on the card that keeps the cache memory active during a system failure. Some controllers will gladly let you shoot yourself in the foot by letting you turn the write cache on without a battery and also enable the local write cache on the disk drive as well. In this situation if you have a sudden power failure, data loss is going to happen if there are any writes at that time. Currently, there isn’t a disk drive on the market with a battery backed cache that I know of. There is a new possibility of using fast NAND flash instead of DRAM to act as the cache on drives and HBA’s. Since NAND is non-volatile it doesn’t need to have constant power. To make up for the slower speed of NAND over DRAM caches, they are making them two or more times the size.
Just in case you haven’t had a chance to peek into your servers here is an assortment of HBA’s from yesterday and today.
Until Next Time
I hope you know a little more about HBA’s now and have a better understanding what they are and what they do.
Series To Date:
12/03/2009 – UPDATE! There were a couple of bugs in the SQLIOCommandGenerator new SQLIOTools.zip has been updated.
I often tell people one of the greatest things about SQL Server is that anyone can install it. I also tell people what the worst things about SQL Server is that anyone can install it. Microsoft fostered a “black-box” approach to SQL Server in 7.0 and 2000. Thankfully, they are reversing this course. As a follow-on to my last article, capturing I/O patterns, we will take a quick look at building some synthetic tests based on those results. There are several tools on the market test I/O systems, some of them free some of the not. SQLIO has been around for several years. There are lots of good articles already on the web describing various uses for this tool.SQLIO was specifically designed to test the limits of your I/O system at different workloads. The problem is people tend to run this tool, will look at the best results, and assume that they will see the same results when the server goes live. But, without understanding your current workloads that is an unreasonable expectation at best. What ends up happening, is a misconfigured I/O system, lots of headaches, with no idea why the system performs so poorly.
I always advocate testing new systems before they go into production. I also understand that it always isn’t an option. Having found myself in that exact situation recently, I’ve decided to take my own advice and pull the new storage off-line to do the proper testing. I’m also taking this opportunity to refine my testing methodology and gather as many data points before the system goes live.
The Test Scripts
With my IO patterns in hand I set out to build a couple of little tools to help me generate all the test scripts and manage the data. As usual, I built these as command line tools since I have no skill at all with GUI’s. It is all in C# and I will be posting them up to Codeplex. You can download the tools here SQLIOTools.zip, this zip has the two tools, they are beta and don’t have a ton of error checking built into them yet. The first tool, SQLIOCommandGenerator does just that, generates the batch file that has all the commands. I does depend on the SQLIO.exe being in the same directory as well as having already defined a parameter file for it to use.
X: \ S Q L I O _testfile0.dat 8 0x0 150240
The first parameter is the test file name that SQLIO will create on start up or use if it already exists. Second is the number of threads that will access that file. Third is the affinity mask. Fourth is the file size in megabytes. Make sure and size the file large enough to be representative of a real database you would be housing on the system. If it is too small it will simply fit in the RAID controllers cache and give you inflated results. I also tend to use one thread per physical CPU core. Be careful though, if you are using a lot of files, having too many threads can cause SQLIO to run out of memory.
We assume -F<paramfile> -LS -d,-R,-f,-p,-a,-i,-m,-u,-S,-v, -t not implemented
Usage: SQLIOCommandGenerator [OPTIONS]
Generates the command line syntax for the SQLIO.exe program output into a batch file.
-f, –iopattern[=VALUE] Random, Sequential or Both
-k, –iotype[=VALUE] Read,Write or Both
-s, –seconds[=VALUE] Number of seconds to run each test 1(60) to 10(600) minutes is normal
-c, –cooldown[=VALUE] Number of seconds pause between tests suggested minimum is 5 seconds.
–os, –outstandingiostart[=VALUE] Starting number of outstanding IOs 1
–oi, –outstandingioincrament[=VALUE] Multiply Outstanding IO start by X i.e 2
–oe, –outstandingioend[=VALUE] Ending Number of outstanding IOs i.e. 64
–ol, –outstandingiolist[=VALUE] Specific Outstanding IO List i.e. 1,2,4,8,16,32,64,128,256,512,1024
–oss, –iosizestart[=VALUE] Starting Size of the IO request in kilobytes i.e. 1
–osi, –iosizeincrament[=VALUE] Multiply IO size by X in kilobytes i.e. 2
–ose, –iosizeend[=VALUE] Ending number of outstanding IOs in kilobytes – i.e. 1024
–osl, –iosizeList[=VALUE] Specific IO Sizes in kilobytes i.e. 1,2,4,8,16,32,64,128,256,512,1024
-b, –buffering[=VALUE] Set the type of buffering None, All, Hardware, Software. None is the default for SQL Server
–bat, –sqliobatchfilename[=VALUE] The name of the output batch file that will be created
-?, -h, –help show this message and exit
So I passed it this command:
SQLIOCommandGenerator.exe -k=Both -s=600 -c=5 –os=1 –oi=2 –oe=256 –oss=1 –osi=2
–se=1024 -b=all –bat=c:\wes_sqlio_bat.txt -f=both
That generates this sample:
:: Generated by SQLIOCommandGenerator
:: This relies on SQLIO.exe being in the same directory.
:: c:\wes_sqlio_bat.txt c:\paramfile.txt c:\outputfile.csv “description of the tests”
:: param1 sqlio parameter file, param2 output of each test to single csv file, param3 test description
ECHO ComputerName: %COMPUTERNAME% > %OUTFILE%
ECHO Date: %DATE% %TIME% >> %OUTFILE%
ECHO Runtime: %RUNTIME% >> %OUTFILE%
ECHO Cool Off: %COOLOFF% >> %OUTFILE%
ECHO Parameters File: %PARAMFILE% >> %OUTFILE%
ECHO Description: %DESC% >> %OUTFILE%
ECHO Test Start >> %OUTFILE%
ECHO Command Line: sqlio -kW -s%RUNTIME% -frandom -b1 -o1 -LS -BY -F%PARAMFILE% >> %OUTFILE%
sqlio -kW -s%RUNTIME% -frandom -b1 -o1 -LS -BY -F%PARAMFILE% >> %OUTFILE%
timeout /T %COOLOFF%
ECHO End Date: %DATE% %TIME% >> %OUTFILE%
:: This batch will take approximately 264.0014 Hours to Execute.
The batch file has the instructions for calling it and what parameters you can pass into it. You can omit seconds and cool down if you want to generate a more generic batch file.
This tool is flexible enough for my needs. I can generate specific targeted tests when I have data back that up, or I can generate more general tests to feel out the performance edges.
You may have noticed the estimate run time, that is pretty accurate. This is a worst case scenario where you have chosen pretty much every possible test to run. I wouldn’t recommend this. With the data we have already we can narrow down our testing to just a few IO sizes and queue depths to keep the test well within reason.
SQLIOCommandGenerator.exe -k=Both -s=600 -c=5 –ol=2 –osl=8,64 -b=None –bat=c:\wes_sqlio_bat.txt -f=both
This batch will take approximately 80.08334 Minutes to Execute.
Much better! by focusing on our IO targets we now have a test that is meaningful and repeatable.
Why would you want to repeat this test over and over? Simple, not all RAID controllers are created equal. You may need to adjust several options before you hit the optimal configuration.
Running The Tests
Now that I have my tests defined I need to start running them and gathering information. There are some constants I always stay with. One, use diskpart.exe to sector align your disks. Two, format NTFS with a 64k block size. Since I”m doing these tests over and over I wrote a little batch file for that too. Diskpart can take a command file to do its work. Once the RAID controller is in I create an array and look what disk number is assigned to it. As long as you don’t make multiple arrays you will always get the same disk number. After that I format the volume accordingly. WARNING, I do use the /Y so the format happens without prompting for permission!
select disk 2
create partition primary align = 64
assign letter = X
diskpart /S z:\diskpart.txt
format x: /q /FS:NTFS /V:TEMP /A:64K /Y
I I also use the RAID controllers command line interface if it has one to make it easier to construct the tests and just let them run using a batch file as a control file. If that isn’t possible don’t worry, the bulk of your time will be waiting for the test to complete anyway.
Gathering The Data
As you have guessed, I have a tool to parse the output of the tests and import them into SQL Server or export it as a CSV file for easy access in Excel. SQLIOParser is also pretty simple to use.
Usage: SQLIOParser [OPTIONS]
Process output of the SQLIO.exe program piped to a text file.
-c, –computername[=VALUE] The comptuer name that the test was executed on.
-s, –sqlserver[=VALUE] The SQL Server you want to import the data into.
-u, –sqluser[=VALUE] If using SQL Server authentication specify a user
-p, –sqlpass[=VALUE] If using SQL Server authentication specify a password
-t, –tablename[=VALUE] The table you want to import the data into.
-d, –databasename[=VALUE] The database you want to import the data into.
-f, –sqliofilename[=VALUE] The file name you want to import the data from.
-a, –sqliofiledirectory[=VALUE] The directory containing the files you want to import the data from.
-o, –csvoutputfilename[=VALUE] The file name you want to export the data to.
-?, -h, –help show this message and exit
It will work with a single file or import a set of files in a single directory. If you are importing to SQL Server you need to have the table already created.
CREATE TABLE [dbo].[SQLIOResults](
[ComputerName] [varchar](255) NULL,
[TestDescription] [varchar](255) NULL,
[SQLIOCommandLine] [varchar](255) NULL,
[SQLIOFileName] [varchar](255) NULL,
[ParameterFile] [varchar](255) NULL,
[TestDate] [datetime] NULL,
[RunTime] [int] NULL,
[CoolOff] [int] NULL,
[NumberOfFiles] [int] NULL,
[FileSize] [int] NULL,
[NumberOfThreads] [int] NULL,
[IOOperation] [varchar](255) NULL,
[IOSize] [varchar](255) NULL,
[IOOutstanding] [int] NULL,
[IOType] [varchar](255) NULL,
[IOSec] [decimal](18, 2) NULL,
[MBSec] [decimal](18, 2) NULL,
[MinLatency] [int] NULL,
[AvgLatency] [int] NULL,
[MaxLatency] [int] NULL
This is the same structure the CSV is in as well.
Analyzing The Results
I will warn you that the results you get will not match your performance 100% once the server is in production. This shows you the potential of the system. If you have horrible queries hitting your SQL Server those queries are still just as bad as before. Generally, I ignore max latency and min latency focusing on the average. That is what I am most worried about as the IO load changes or queue depth increases how will the system respond. Remember raw megabytes a second isn’t always king. Number of IO’s at a given IO block size is also very important. I will go into great detail in the next article as I walk you through analyzing the results from my own system so stay tuned for that.
These tests aren’t the end of your road. I still advocate playing back traces and seeing how the system responds with your exact workload whenever possible. If you can’t do that then using tools like SQLIO is better than nothing at all. We are also working under the assumption that we are upgrading or replacing an existing production server. If that isn’t the case and this is a brand new deployment using SQLIO will help you know what your I/O system is capable of before you have a problem with bad queries or other issues that always crop up on new systems.
You can always to more testing. It is almost a never ending process, my goal isn’t to give you the end solution just to give you another tool to pull out when you need it. As always, I look forward to your feedback!
Series To Date:
We often take the advice given to us on forums or in articles at face value. Even though the authors almost always say things like “your mileage may vary” or “may not apply to your situation” people still assume it is the gospel. Sometimes it is lack of experience. Other times it is just lack of knowledge on how to verify these things on your own. In this article I’m going to give you a tool to look at what SQL Server is doing at the disk level and allow you to make better decisions on how to configure your underlying disks.
There are several things you need to know about how SQL Server accesses the database files and the implications of that before you can construct a proper testing methodology.
http://technet.microsoft.com/en-us/library/cc966500.aspx covers the basics. There are a few things I will highlight here.
ACID and WAL
ACID (Atomicity, Consistency, Isolation, and Durability) is what makes our database reliable. The ability to recover from a catastrophic failure is key to protecting your data.
WAL (Write-Ahead Logging) is how ACID is achieved. Basically, the log record must be flushed to disk before the data file is modified.
Stable media isn’t just the disk drive. A controller with a battery backed cache is also considered stable. Since SQL Server can request blocks as big as 64KB make sure your controller can handle that block size request in cache. Some older controllers only do a 16KB block or smaller.
FUA (Forced Unit Access)
With the requirement of stable media SQL Server creates and opens all files with a specific set of flags. FILE_FLAG_WRITETHROUGH tells the underlying OS not to use write caching that isn’t considered stable media. So, the local disk cache is normally bypassed. Not all hard drives honor the flag though, Some SATA/IDE drives ignore it. Usually, the drive manufacturer provides a tool to turn off write caching. If you are using desktop drives in a mission critical situation be aware of the potential for data loss. FILE_FLAG_NO_BUFFERING tells the OS not to buffer the file ether. At this point the only cache available will be the battery backed or other durable cached on the controller.
SQL Server uses asynchronous access for data and log files. This allows IO request to queue up and use the IO system as efficiently as possible. The main difference between the two are SQL Server will try and gather writes to the data file into bigger blocks but the log is always written to sequentially.
All of these rules apply to everything but tempdb. Since tempdb is recreated at restart every time recoverability isn’t an issue.
SQL Server data access patterns
Searching around you will find these generalities about SQL Server’s IO patterns
Sequential 512 bytes to 64KB
Data File Read/Writes
Read ahead – more important to Enterprise Edition
8KB to 125KB
8KB to 128KB
512 byte – full initialize on log file only.
Backup Sequential Read/Write
Restore Sequential Read/Write
DBCC – CHECKDB
Sequential Read 8K – 64K
DBCC – DBREINDEX
(Read Phase) Sequential Read (see Read Ahead)
DBCC – DBREINDEX
(Write Phase) Sequential Write
Any multiple of 8K up to 128K
DBCC – SHOWCONTIG
Sequential Read 8K – 64K
Now that we have an idea of what SQL Server is suppose to be doing its time to verify our assumptions.
Capturing IO activity
There are a few tools that will allow you to capture the file activity at the system level. Process Monitor is a free tool from Microsoft that I will use to collect some base line information. In it’s standard configuration Process Monitor captures a ton of stuff and uses the page file to spool the info to. So, before we begin we need to change the default configuration.
Capturing IO data using process monitor.
Filter to apply
process is sqlservr.exe
Operation is Read
Operation is Write
Columns to choose.
Date & Time
Time of Day
Change Backing File.
The maximum number of events it will capture is 199 million. This is enough on my system to capture 12 hours of activity easily. Once we have a good sample you can save it off as an XML file or CSV. Choosing CSV it is pretty easy to import the data into SQL Server using SSIS or your tool of choice.
I import the CSV into a raw table first.
Raw table to import into.
CREATE TABLE [SQLIO].[dbo].[pm_imp] ( [Process Name] VARCHAR(12), [PID] SMALLINT, [Path] VARCHAR(255), [Detail] VARCHAR(255), [Date & Time] DATETIME, [Time of Day] VARCHAR(20), [Relative Time] VARCHAR(50), [Duration] REAL, [TID] SMALLINT, [Category] VARCHAR(6) )
Next I create a cleaner structure with some additional information separated from the detail provided.
SELECT [Process Name] AS ProcessName, PID AS ProcessID, PATH AS DatabaseFilePath, Detail, [Date & Time] AS EventTimeStamp, [Time of Day] AS TimeOfDay, [Relative Time] AS RelativeTime, [Duration], TID AS ThreadID, Category AS IOType, substring(detail,charindex('Length: ',detail,0) + 8
,(charindex(', I/O',detail,0) - charindex('Length:',detail,0) - 8)) AS IOLength, CASE reverse(left(reverse(PATH),3)) WHEN 'mdf' THEN 'Data' WHEN 'ndf' THEN 'Data' WHEN 'ldf' THEN 'Log' END AS FileType INTO SQLIOData FROM dbo.pm_imp WHERE reverse(left(reverse(PATH),3)) IN ('mdf','ndf','ldf')
Once we have the data cleaned up a bit we can now start doing some analysis on it.
Queries for interesting patterns.
This query gives us our read and write counts.
SELECT count(*) IOCount ,IOType FROM SQLIOData GROUP BY IOType ORDER BY count(*) DESC
This one shows us the size of the IO and what type of operation it is.
SELECT count(*) IOCount ,IOLength ,IOType FROM SQLIOData GROUP BY IOLength,IOType ORDER BY count(*) DESC
This is a look at activity by file type data or log.
SELECT count(*) IOCount, FileType FROM SQLIOData GROUP BY FileType ORDER BY count(*) DESC
Since we are capturing the thread id we can see how many IO’s by thread.
SELECT count(*) IOCount, ThreadID FROM SQLIOData GROUP BY ThreadID ORDER BY count(*) DESC
We can also look at IO types, sizes and count by file helping you see which ones are hot.
SELECT count(*) IOCount, databasefilepath, iotype, iolength FROM SQLIOData WHERE databasefilepath LIKE '%filename%' GROUP BY databasefilepath, iotype, iolength HAVING count(*) > 10000 ORDER BY databasefilepath, count(*) DESC
Now that we see exactly what our IO patterns are we can make adjustments to the disk subsystem to help scale it up or tune it for a particular pattern.
This is just another tool in your tool belt. This is a supplement to using fn_virtualfilestats to track file usage. I use it to get a better idea of the size of the IO’s being issued.Using these two tools I can see the size of the IO’s in a window of time that is reported by my fn_virtualfilestats capture routine.
Always verify your assumptions, or advice from others.