The Algorithms behind SQL Server Join processing — and how to benefit from knowing them

Stefan Graf
4 min read · Mar 30, 2022


Joins are probably one of the most useful tools provided by SQL databases. I would even say that you can’t imagine doing any serious work with relational databases without them.

But how do they actually work in the background? I want to provide you with a deep dive into the different ways a join is processed inside SQL Server. I also want to show you how you can potentially benefit from this knowledge.


I guess everyone who clicks on an article like this has done a lot of querying with SQL, including plenty of joins. So I don’t want to waste any time on basic code examples, because I assume you already know how to write a join correctly. Instead, I mainly want to talk about the different strategies the engine behind SQL Server uses to perform joins as efficiently as possible.

I will therefore present the three join algorithms implemented in SQL Server:

  • Nested loops
  • Hash
  • Merge

Afterwards I will show you how to explicitly specify which join algorithm should be used (this choice is normally abstracted away from the developer) and how this can come in handy.

The three Join algorithms

Nested loops: The classic way

This algorithm was the first join implementation and still serves as the fallback when the other two cannot be applied. One example of this are non-equi joins (where the predicate is not an ‘=’ but, for example, a ‘<’ or ‘>’). For this kind of join, only nested loops can be used.

So how does it work? In a fairly simple way: it iterates through the top (outer) input only once and, using a nested loop, goes through the rows of the bottom (inner) input to find matches for each outer row.

Nested loops perform fine under the following circumstances:

  • The outer input is small
  • The inner input has an index on the join column
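To make the idea concrete, here is a minimal Python sketch of the nested loops strategy. It is purely illustrative — the `customers` and `orders` rows are made-up stand-ins for the outer and inner inputs, not a model of SQL Server internals:

```python
def nested_loops_join(outer, inner, key):
    """For every row of the outer (top) input, scan the inner (bottom) input."""
    result = []
    for o in outer:                # one pass over the outer input
        for i in inner:            # nested pass over the inner input
            if o[key] == i[key]:   # the predicate could also be <, > etc.
                result.append({**o, **i})
    return result

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Ben"}]
orders = [{"id": 1, "total": 50}, {"id": 1, "total": 20}, {"id": 3, "total": 9}]

print(nested_loops_join(customers, orders, "id"))
# two rows: both of Ann's orders; Ben has no match
```

You can see why a small outer input matters: the inner input is scanned once per outer row, which an index on the inner join column turns into a cheap lookup instead of a full scan.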

Hash

This method relies on building a hash table from one of the input tables. Normally the smaller one is used, as the method performs best when the hash table fits in memory. If it does not fit, the table can be partitioned, which unfortunately is not ideal for performance.

These hash values are arranged into hash buckets, depending on which hash function is used. For example, if the engine chooses modulo 40, 40 buckets are created, each holding the rows whose hash values produce the same result. A little side note for every Azure Data Engineer: this is very similar to the hash distribution of a Synapse Analytics table.

Afterwards the same hash function is applied to the join column of the other table. Based on the result, the algorithm only has to search the corresponding bucket for matches.
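The build-then-probe idea can be sketched in a few lines of Python. Again this is only an illustration of the principle (including the modulo bucketing mentioned above), with made-up sample rows:

```python
from collections import defaultdict

def hash_join(build, probe, key, n_buckets=40):
    """Build a hash table over the smaller input, then probe it with the other."""
    buckets = defaultdict(list)
    for row in build:                       # build phase: e.g. hash modulo 40
        buckets[hash(row[key]) % n_buckets].append(row)
    result = []
    for row in probe:                       # probe phase: same hash function
        for candidate in buckets[hash(row[key]) % n_buckets]:
            if candidate[key] == row[key]:  # re-check: a bucket may hold collisions
                result.append({**candidate, **row})
    return result

dims = [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]
facts = [{"id": 1, "amount": 100}, {"id": 2, "amount": 5}, {"id": 9, "amount": 1}]

print(hash_join(dims, facts, "id"))
```

Note that each probe row only touches one bucket instead of the whole build table — this is why the hash table acts like an on-the-fly index.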

This approach performs very well for data warehouse queries, where small dimension tables are frequently joined with big fact tables. Another great use case for this method is when the right indexes are not in place: the hash table can be viewed as an alternative index structure for the query.

One thing to watch out for is that enough memory is needed for the hash table to function properly. Only then does the hash method reward you with great performance.

Merge

Last but not least, we have the merge algorithm. This method requires both tables to be sorted by the join column. With this sorted data, the join works like a zipper bringing both sides together.

The sorting can be provided in two ways. In the best case, there is already an index in place that the optimizer can use to perform an index order scan. Otherwise, the SQL engine has to sort the tables before the join begins, which comes with the additional cost of the sorting step.

Because of this extra cost, merge performs best when an index is already available for both tables within the join. It is also an option when a big table has the index and only a small table needs to be sorted prior to the join.
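The zipper image translates directly into code. Here is a simplified Python sketch of a merge join over two sorted inputs (the explicit `sorted()` calls stand in for the extra sort step when no index provides the order; the sample rows are invented):

```python
def merge_join(left, right, key):
    """Zipper through two inputs ordered on the join column."""
    left = sorted(left, key=lambda r: r[key])    # extra sort step if no index
    right = sorted(right, key=lambda r: r[key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1                               # advance the smaller side
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # emit every right row that shares the current left row's key
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                result.append({**left[i], **right[k]})
                k += 1
            i += 1
    return result

products = [{"id": 3, "name": "bolt"}, {"id": 1, "name": "nut"}]
sales = [{"id": 1, "qty": 10}, {"id": 1, "qty": 20}, {"id": 2, "qty": 30}]

print(merge_join(products, sales, "id"))
```

Each input is walked once, which is what makes merge so cheap when the sort order comes for free from an index.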

Forcing the Join algorithm

Most of the time the optimizer does a great job choosing the right method for each join. But if you are optimizing a query and have the feeling that there is a better option available, you can actually force the method you think performs better. There are two ways to do this:

Using a join hint before the JOIN keyword, like so:

SELECT *
FROM dbo.table1 as t1
INNER LOOP|HASH|MERGE JOIN dbo.table2 as t2
ON t1.id = t2.id

Defining the join this way, you also specify that the left table is the outer input and the right table is the inner input.

Another way is to use the OPTION clause. Here you have the advantage that you can allow more than one algorithm, for example excluding only the third one:

SELECT *
FROM dbo.table1 as t1
INNER JOIN dbo.table2 as t2
ON t1.id = t2.id
OPTION(HASH JOIN, MERGE JOIN)

But be aware that these tweaks can also cause a lot of problems. Forcing the join method may be beneficial while developing, but can become a performance bottleneck once your data solution goes live or the underlying data changes drastically.¹

Conclusion

Working daily with SQL brings a lot of challenges, and often we do not even know why something works the way it does. I hope I could provide you with some useful knowledge for your next round of query optimization or your next deep dive into an execution plan.

[1] Itzik Ben-Gan et al., T-SQL Querying (Developer Reference)
