The Dangers of Sub-queries!!!!
Posted by Wes Brown
Ok, now that I have your attention this really should be titled the danger of not qualifying all objects, but that isn’t quite as sensational enough to make you click through to here :)
Imagine if you will, a developer expects ten records to be inserted into a table and the query has been running for almost an hour. A quick glance and sp_who2 shows that it is using a lot of IO, I mean a lot. The first thing that happens is a command decision from the boss to kill the query. I took a quick peek and found the plan still in the plan cache while we waited for this thing to roll back. As I look at the plan I see something really wrong. It looks like the table receiving the inserts isn’t getting the ten records we thought but 169 million records, every record from the table in the select clause.
I look at the query and on the surface everything looks good. When I hit the check mark it compiles without error. Looking at the plan in detail I notice a little icon that tipped me off on where to look next.
Oh that’s bad. Warnings: No join predicate. Ouch. But this is a sub-query, how can I have no join predicate? Lets take a look at an example.
DROP TABLE #t1 DROP TABLE #t2 GO CREATE TABLE #t1 ( t1id INT NOT NULL, t1name VARCHAR(50) NULL, t1place VARCHAR(50) NULL) ALTER TABLE #t1 ADD PRIMARY KEY ( t1id ) CREATE TABLE #t2 ( t2id INT NOT NULL, t1name VARCHAR(50) NULL, t1place VARCHAR(50) NULL) ALTER TABLE #t2 ADD PRIMARY KEY ( t2id ) INSERT INTO #t1 SELECT spid, loginame, hostname FROM MASTER.dbo.sysprocesses WHERE hostname <> '' INSERT INTO #t2 SELECT spid, loginame, hostname FROM MASTER.dbo.sysprocesses WHERE hostname <> ''
This gives us two tables with some data. We want to find all the records in #t1 that also exist in #t2 but instead of a join we will use an IN and a sub-query.
SELECT t1name, t1place FROM #t1 WHERE t1id IN (SELECT t1id FROM #t2)
Looks simple enough and will pass the compile test all day long even though t1id doesn’t exist in #t2. Since you can use columns from the top level query in the sub-query this is a perfectly valid piece of T-SQL.
And there you have it, a join without an ON clause as far as the optimizer is concerned. By not qualifying all the columns in the sub-query we opened ourselves up to this error. This isn’t the optimizers fault! If we re-write this as it was mean to be:
SELECT t1name, t1place FROM #t1 WHERE t1id IN (SELECT t2id FROM #t2)
We get a plan that is more to our liking.
You could also re-write this as a join instead of using the IN clause and would have avoided this problem as well. Scary way to learn that SQL Server will do exactly what you tell it to even if you are wrong!