Last week I was teaching a junior programmer about the differences between not in and not exists and the issue of nulls came up. This isn’t the most intuitive thing, so I thought I would post about it as well.
For a simple example, I will just create 2 simple data sets of one column each. One with 13 rows (3 being null) and the other with 15 rows all populated.
Now lets look at the difference between NOT IN and NOT EXISTS when comparing the 2 data sets:
NOT IN
with nulls as ( select rownum id from dual connect by level <= 10 union all select null id from dual connect by level <= 3 ), no_nulls as ( select rownum id from dual connect by level <= 15 ) select * from no_nulls where id not in ( select id from nulls );
no rows selected
NOT EXISTS
with nulls as ( select rownum id from dual connect by level <= 10 union all select null id from dual connect by level <= 3 ), no_nulls as ( select rownum id from dual connect by level <= 15 ) select * from no_nulls nn where not exists ( select null from nulls n where n.id = nn.id );
ID ---------- 13 11 12 14 15
And one last test, the NOT IN when there are no nulls present:
with nulls as ( select rownum id from dual connect by level <= 10 union all select null id from dual connect by level <= 3 ), no_nulls as ( select rownum id from dual connect by level <= 15 ) select * from no_nulls where id not in ( select id from nulls where id is not null );
ID ---------- 11 12 13 14 15
So why this behavior? Well, it comes down to how NOT IN and NOT EXISTS are evaluated. For NOT EXISTS, the subquery is run and the expression is either true or false depending on if rows are returned or not. For this example, the subquery will return no rows for IDs 11 - 15, so the NOT EXISTS will evaluate to TRUE and those rows are returned.
For NOT IN, the expression can be thought of as the following for each row in the dataset:
[current_value from "no_nulls"] != [first_value from "nulls"] AND [current_value from "no_nulls"] != [second_value from "nulls"] AND [current_value from "no_nulls"] != [third_value from "nulls"] ... AND [current_value from "no_nulls"] != [last_value from "nulls"]
Since the "nulls" query has at least one null value, the expression is trying to compare if a value is not equal to null, which Oracle can't know, so it returns null. This does not evaluate the expression to TRUE, so that row is skipped, as are all of the others by the same logic.
When you exclude the nulls out of the NOT IN query, all of the expressions evaluate to either TRUE or FALSE, so the rows are returned as expected.
Keep in mind that this behavior is intentional and does make sense, since a null means literally that the value is not known, even if that means it is a pain to deal with sometimes. 🙂