One of the things i always look for when i am tuning queries is the usage of the IN operator. If this operator is not used correctly you can suffer from some severe performance issues. In this blog post i try to give some understanding about when to use the IN operator and when to use an EXISTS.
According to Books Online this operator “Determines whether a specified value matches any value in a subquery or a list.”. Hence the “any” part of that line. So in other words, the IN operator will return TRUE/FALSE/UNKNOWN when a value is IN the result of a (sub)query.
In Books Online we see that the EXISTS operator “Specifies a subquery to test for the existence of rows.”. And also “Returns TRUE if a subquery contains any rows.”. So existence of a row is key here. Unlike the “any” part in the “IN” operator the “EXISTS” only has to look if there is at least 1 value.
IN vs EXISTS Compilation resources
So, what does this all mean? Well, the IN operator is used to check if a value is available in some subquery, where it will be used on the complete result-set of this subquery. If you have a big result-set where you are checking for some values you might suffer performance issues. The EXISTS operator on the other hand is very useful for checking if a value exists because it uses something we call “short-circuiting”. If a value is found in the subquery it will immediately stop, because the operator can return TRUE. So on a very large result-set you will get your logical answer quicker when using the EXISTS in stead of the IN operator.
Let’s look at this with an example.
SELECT p.FirstName , p.LastName , e.JobTitle FROM Person.Person AS p JOIN HumanResources.Employee AS e ON e.BusinessEntityID = p.BusinessEntityID JOIN HumanResources.EmployeeDepartmentHistory AS edh ON e.BusinessEntityID = edh.BusinessEntityID WHERE edh.DepartmentID IN (SELECT DepartmentID FROM HumanResources.Department WHERE Name LIKE 'P%');
This will produce the following execution plan:
And then the same query, but this time optimized with the EXISTS operator
SELECT p.FirstName , p.LastName , e.JobTitle FROM Person.Person AS p JOIN HumanResources.Employee AS e ON e.BusinessEntityID = p.BusinessEntityID WHERE EXISTS (SELECT * FROM HumanResources.Department AS d JOIN HumanResources.EmployeeDepartmentHistory AS edh ON d.DepartmentID = edh.DepartmentID WHERE e.BusinessEntityID = edh.BusinessEntityID AND d.Name LIKE 'P%');
And this will produce the following execution plan:
That looks almost the same, doesn’t it. Well, ALMOST! When we look at the cost of the several operators we see that most of them are different. When we run both queries together we see that the query with the IN operator costs 66% of the complete execution, where the query with the EXISTS operator takes 34%. That is a significant difference in cost. Lets look at the SELECT part of both queries. The SELECT properties of the query with the IN operator is on the left, the SELECT properties of the query with the EXISTS operator is on the right.
We see several things. CompileCPU and CompileTime for instance is showing that the EXISTS operator use less CPU compile time. The CPU memory is even lower, together with memory grant. So with this information we can say that the query with the EXISTS operator takes less time to compile, so therefor it is faster.
Besides it takes less CPU Compile time when using an EXISTS operator, there is also a logical part that is better when using the EXISTS and IN operators. In the beginning of this post i said something about the possible values that the IN and EXISTS operators will return. TRUE/FALSE/UNKNOWN. The TRUE/FALSE part is self explaining. Just a normal Boolean operator. However, the UNKNOWN part is somewhat different. The UNKNOWN will be used for NULL columns for example. And NULL is equal to nothing, so comparing NULL with NULL becomes not true. Nothing isn’t equal to nothing.
For “normal” usage we will not see any real difference and/or damage done with the IN or EXISTS operator. But when we are using NOT IN or NOT EXISTS, things will get nasty! For example. When using the NOT statement the logical part will be NOT TRUE or NOT UNKNOWN. So with a column being queried that contains a NULL value, you will get no results, even when there are columns which are NOT TRUE. So NOT IN is a no go here!
And here comes the short-circuiting part of the EXISTS operator as being the solution for this problem. Even if a column contains a NULL value, if ANY column will return TRUE the operator will stop processing since there is an existing record.
Well, if there is any conclusion that you can take from this is to NOT USE THE IN OPERATOR at any time. It uses more compiletime, so more resources and logic wise it isn’t always giving the results that are expected.