Implements the algorithm for MERGE.
It performs a right outer join of the target with the source on the specified merge condition:
MergeJoin = target right outer join source
Filter(MergeJoin, target.rowId != null) yields the matched rows for UPDATE/DELETE.
Filter(MergeJoin, target.rowId == null) yields the non-matched rows for INSERT.
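A minimal sketch of this join-and-filter step, assuming hypothetical target and source DataFrames and a mergeCondition Column; the real implementation operates on HiveAcidTable internals:

  import org.apache.spark.sql.{Column, DataFrame}

  def splitMergeJoin(target: DataFrame, source: DataFrame,
                     mergeCondition: Column): (DataFrame, DataFrame) = {
    // Right outer join: every source row survives; target columns are
    // null wherever no target row matched.
    val mergeJoin = target.join(source, mergeCondition, "right_outer")

    // Matched rows (target rowId present) feed the UPDATE/DELETE clauses.
    val matched = mergeJoin.filter(target("rowId").isNotNull)

    // Non-matched rows (target rowId null) feed the INSERT clause.
    val nonMatched = mergeJoin.filter(target("rowId").isNull)

    (matched, nonMatched)
  }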
DataFrames with the rows to UPDATE/DELETE/INSERT are created, and the corresponding operations
are performed on the HiveAcidTable. Within the same transaction, a distinct statementId
is assigned to each operation so that they don't collide while writing delta files.
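A hedged sketch of the per-operation statementId scheme; txnId, writeDeltaFiles, and the MergeOp type below are hypothetical stand-ins for the transaction internals, not the library's actual API:

  import org.apache.spark.sql.DataFrame

  sealed trait MergeOp
  case object UpdateOp extends MergeOp
  case object DeleteOp extends MergeOp
  case object InsertOp extends MergeOp

  def runMerge(txnId: Long, ops: Seq[(MergeOp, DataFrame)]): Unit = {
    // All operations share one transaction (txnId), but each gets its own
    // statementId, so their delta directories differ, e.g. Hive ACID names
    // them delta_<writeId>_<writeId>_0000 vs delta_<writeId>_<writeId>_0001.
    ops.zipWithIndex.foreach { case ((op, rows), statementId) =>
      writeDeltaFiles(txnId, statementId, op, rows) // hypothetical writer
    }
  }

  // Hypothetical writer; the real implementation writes ACID delta files.
  def writeDeltaFiles(txnId: Long, stmtId: Int, op: MergeOp, rows: DataFrame): Unit = ()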
Performance considerations:
Special handling applies when only an INSERT clause is provided.
In that case a left anti join of source against target yields the rows
to insert, avoiding the more expensive right outer join.
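A one-line sketch of this fast path, under the same hypothetical target, source, and mergeCondition names as above:

  import org.apache.spark.sql.{Column, DataFrame}

  def insertOnlyRows(target: DataFrame, source: DataFrame,
                     mergeCondition: Column): DataFrame =
    // A left anti join keeps exactly the source rows with no matching
    // target row, without carrying along the target side's columns.
    source.join(target, mergeCondition, "left_anti")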
We do not want the right outer join to be executed once per operation,
so the join DataFrame is converted to an RDD and back to a DataFrame.
This ensures that transformations on the converted DataFrame
do not recompute the underlying RDD, i.e., the join is executed just once.
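A sketch of that lineage cut, assuming an active SparkSession named spark and the joined DataFrame mergeJoin:

  import org.apache.spark.sql.{DataFrame, SparkSession}

  def materializeJoin(spark: SparkSession, mergeJoin: DataFrame): DataFrame = {
    // mergeJoin.rdd yields an RDD[Row]; rebuilding a DataFrame from it
    // produces a plan that is a plain scan of that RDD, so the filters for
    // UPDATE/DELETE/INSERT branch off the RDD rather than each carrying
    // the full join plan.
    spark.createDataFrame(mergeJoin.rdd, mergeJoin.schema)
  }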
According to the SQL standard, we must raise an error when multiple source
rows match the same target row. We reuse the join computed above for the
other operations to detect this, instead of running additional joins.
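A sketch of that check, reusing the matched rows from the join above; the rowId column name is a hypothetical stand-in for the target row identifier:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.count

  def failOnDuplicateMatches(matched: DataFrame): Unit = {
    // If any target rowId occurs more than once among the matched rows,
    // more than one source row matched the same target row.
    val violations = matched
      .groupBy("rowId")
      .agg(count("*").as("matches"))
      .filter("matches > 1")
    if (!violations.isEmpty) {
      throw new IllegalArgumentException(
        "MERGE failed: multiple source rows matched the same target row")
    }
  }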