Monday 19 January 2009

Merge Join task not always matching

I’m not a big fan of the merge join task, mainly because you have to sort things and that comes with a performance cost. And to make things worse I stumbled on an issue with SQL 2005 where the merge join task doesn’t always perform as expected.



The issue occurs when you want to merge two data sources, but don’t want to use the SSIS sort task. Instead of using the sort task, you change the data sources to use a sql command, generate some SQL with an order by clause in it. Then in the advance editor of the source, set the IsSorted property to True.

Then you have to set the SortKeyPosition to match your order by clause in the SQL statement. Do this for both data source and then use them as inputs to the merge join task. As below.
Problematic approach

This in theory should work, although for some reason, which I have yet to discover doesn’t. For some reason the merge join doesn’t always match. Replacing the sorts with sort tasks seems to fix the problem, what this method does differently to using a order by clause, I’m not sure.
Working , with performance issues

Yet, if you are involving large volumes of data I would consider removing all of this with a simple lookup statement. Hopefully the problem will make itself apparent to me and I’ll update this blog with the reason.

2 comments:

Marco Russo said...

It seems a problem related to collation. Using the Sort inside the SSIS package is not a good idea... Check the collation you're using.

Matt said...

Still no idea what is going on here, Marco gave me some other things to check.
From collation, codepage on both inputs being the same to manually comparing the output to a file from a sort and a order by. But still the merge doesn't work.
Still to try SP3, never know that may fix some internal issue.