A: The most fundamental difference between hash and nested loop joins

Ok guys, thanks for waiting!

I ended up expanding the article quite a lot compared to what I had originally planned. In fact I only wrote 50% of what I plan to write, I’ll update the rest… um… later… Instead of just stating the difference between the joins I took a step back and elaborated something what I often see people doing (and talking about in newsgroups and lists too).

Basically the most fundamental (or biggest or most important) difference between nested loop and hash joins is that:

  • Hash joins can not look up rows from the inner (probed) row source based on values retrieved from the outer (driving) row source, nested loops can.

In other words, when joining table A and B (A is driving table, B is the probed table), then a nested loop join can take 1st row from A and perform a lookup to B using that value (of the column(s) you join by). Then nested loop takes the next row from A and performs another lookup to table B using the new value. And so on and so on and so on.

This opens up additional access paths to the table B, for example when joining ORDERS and ORDER_ITEMS by ORDER_ID (and ORDER_ID is leading column of PK in ORDER_ITEMS table), then for whatever orders are taken from ORDERS table, we can perform a focused, narrow index range scan on ORDER_ITEMS for every ORDER_ID retrieved from the driving ORDERS table. A hash join can’t do that.

Of course this doesn’t mean that hash joins can’t use any indexes for tables they read – index range scans and unique lookups can still be used under a hash join, but only if there are constant values in the query text (in form of literal or bind variables). If there are no such constant (filter) conditions under a hash join, then the other options to use that index would be to do an INDEX FULL SCAN (which is a range scan from end to end of the index) or INDEX FAST FULL SCAN (which is like a full table scan through the entire index segment). However none of these opportunities give the same benefits as nested loops looking up rows from row source B dynamically based on what was retrieved from A during runtime.

Note that this nested loops benefit isn’t limited to indexes only on table B, the difference is more fundamental than just a specific access path. For example, if table B happens to be a single table hash cluster or indexed X$ table, then the nested loop is also able to do “optimized” lookups from these row-sources, based on the values retrieved from table A.

So, my article with a lot of (loosely) related details is here:

In the comments section of my question, Tom, Bernard Polarski, Christian Antognini and Marc Musette got the closest to what I had in my mind when I asked the question. However, of course your mileage may vary somewhat depending on what kind of problems you have experienced the most over all the years. Also, Jonathan Lewis had a valid comment regarding that the answer depends on what exactly does one mean by “fundamental” and yeah this was open to interpretation.

Nevertheless, I wanted to emphasize that there’s a more important difference between NL and hash joins, than the usual stuff you see in training material which talk about implementation details like hash tables and memory allocation…

Some day I will complete that article, I plan to add some design advice in there, like denormalization opportunities for getting the best of the both worlds etc.┬áBut now I’m gonna get a beer instead.

Thanks for reading and answering my blog, I was quite impressed by the volume of comments & answers to my question. I must do this more often!

This entry was posted in Cool stuff, Oracle and tagged , , , . Bookmark the permalink.

11 Responses to A: The most fundamental difference between hash and nested loop joins

  1. Bix says:

    Hash Joins -’by definition’ can not do as you mentioned-
    “Hash joins can not look up rows from the inner (probed) row source based on values retrieved from the outer (driving) row source, nested loops can” .

    I understand you are simply ‘defining’ what is hash join in your article -we would be really expecting something new points from you about hash joins and nested loops:)

  2. Tanel Poder says:

    Yes, I am just *describing* how hash joins vs nested loops work. The 2nd part of the article (some day) will describe the practical implications (or opportunities) of that.

    So, there are no silver bullets or magic hints which make everything 10x faster here, but just some fundamentals which to use for designing stuff better in the future

  3. Hi Tanel,

    The Article is very much innovative and informative.. !!
    I am very much impressed that you kept the word on “fundamental” when I went through the article and found the below things

    1.Join order
    2.Join methods
    3.Access paths

    Yes, the above are basic fundamental where people are missing while “tuning” queries and “jonathan lewis’ – says in his seminar to know your data and how well and where it resides in database before procceeding to tune quries. Even you have suggeseted as they are fundamental in order to write queries,as people are missing now-a-days those things to follow.

    It’s very nice explanation Tanel.. impressed.

    Really you need ..you need ..you need have a break ..”Try for Kitkat” .. instead of “beer”… :-)

    we will be waiting for your next follow article on this concept.
    Some thing crossed in my mind.. looking up the test case and let’s try myself some thing different which might help to understand for others too..

    Regards
    Pavan kumar N

  4. Piqouskerberos says:

    Impressive and made us think more.

  5. Hey Tanel,

    I was thinking about that and realized that I know at least one exception to this fundamental difference.
    Now mind you it’s assisted by an additional step, but it does allow hash join to be used while rows are ‘filtered’ from the inner(probed) based on values from the outer (driving) table.

    I am speaking about the new 11g “PARTITION RANGE JOIN-FILTER” access path.

    You could argue this is not part of hash joins since it is outlined in a separate step in the explain plan display, but it is usually (only?) used in conjunction with hash joins, so it can be considered “part of”.

  6. Tanel Poder says:

    @Christo Kutrovsky
    Christo, yes, good point. I wanted to write about it as well, but run out of time. So I’ll include a comment about it in the next half of the article.

    However, it’s a different mechanism and provides different functionality if you look at it closer. A nested loop can take a single row from row source A and then immediately perform a lookup into B with the join column taken from A. And then take 2nd row from A and perform another, new lookup to B etc (that’s why you see the Starts value for accessing the row source B in a profiled SQL plan being > 1 when the A row source returned more than 1 row).

    And nested loops can use any compatible access path (index, hash cluster lookup) with that join column value. Hash join, even with the pushed join filter can’t do this. It can’t look up exact values from row source B using an index for example, it can only do partition elimination or early filtering in Exadata smart scan, but it still does brute force full segment scans and just throws away the non-matching things.

  7. Tanel.

    There’s an interesting, presumably unconscious, bias introduced in the way you describe the difference: “Hash joins CAN NOT …”, followed by “This opens up additional access paths …”.

    One could equally say something like: “Nested loop joins are required to visit the second table for every relevant row in the first table”, and “This stops them from taking advantage of alternative access paths.”

    In terms of HOW the two joins work there is a certain symmetry in how you can choose to describe the advantages or disadvantages of the mechanisms.

    If you want to highlight the fundamental difference in mechanism, though I think the most important point to make is that a hash join has to pre-process the first table in its entirety before it can access the second table.

  8. Bix says:

    Hi Tanel, Please correct if my understanding is wrong .

    Hash Join : this algorithm will select the small(in general) table 1 , access the full table1 (FTS or index full scan) ,and build the ‘hash table’ in the session PGA , this hash table is a sort of *data structure* and this contains all rows from that table 1, once the ‘hash table’ is built – the algorithm wont access the table 1 any more (physically or in any other way)…..

    Now it will take table 2 (big table), access first block (or multiple blocks) ,parse it in PGA to get the rows,for each row -look up *the hash table* to find the match , if there is a match – then output the record from (table2 +hash table). so the fundamental difference between the nested loops and hash table is – the look up is done on the ‘hash table’ – a data structure and the hash algorithm does not access the table1(small table) any more to find the match for the row from the outer table (table2) where as in case of nested loops – the second table (physically) directly (unlike hash table) is accessed through index/table access for each row from outer table.My point here is – as the hash algorithm never needs *physical access* of the small table (probed) table again after the hash table is built so It is not possible (not necessary) to ‘look up’ using ‘index access’ in case of hash algorithm – Please correct if my understanding is wrong .

    As usual – YOU ROCK IN THIS ARTICLE AS WELL..

    Many thanks
    Bix

  9. Amos says:

    Hi Tanel,
    Do you plan to finish this article: http://tech.e2sn.com/oracle/sql/the-fundamental-difference-between-nested-loops-and-hash-joins? It has been 2 years without update. :-).
    Amos

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>