SQL: Self Joins
Why is it sometimes we see SQL queries that list a table more than once in the same query? And this isn’t where you see a table in both the outer query, and also in a subquery. But when the same table is coded more than once in the same FROM clause. For example you might see:
SELECT …….
FROM EMP E1 INNER JOIN ….
PROJ P INNER JOIN …
EMP E2
WHERE ….
There are a number of reasons (logic wise) where this self-joining is needed in SQL, along with some performance gains. This article lists 3 of the more common reasons for seeing a table twice, along with descriptions and examples.
1. Self-Referencing Foreign Keys
2. Comparing rows in a table to other rows in the same table
3. When a lookup table is accessed from different values in the same query
The following descriptions and examples that follow are based on the set of Db2 sample tables (EMP, PROJ, DEPT).
1. Self-Referencing Foreign Keys:
In the DEPT table, there exists a department number, the department name, the manager of the department,
and the administrative department that each department is assigned to. So for example if you are the manager of department ‘D01’, you ultimately manage departments ‘D11’ and ‘D21’. If you want to create an output that contains: DEPT, DEPTNAME, MGRNO, ADMRDEPT, and ADMRDEPT NAME. How/Where do you get the name of the ADMRDEPT value?
With most foreign keys, the relationship is from a column in a table to a primary key column of another table. But in some case, and as we see here, the other table is the same table. So, you have to code a join back to the same table because that is where the name for those ADMRDEPT values exist.
Foreign Key Definition:
ALTER TABLE DEPT ADD FOREIGN KEY (ADMRDEPT) REFERENCES DEPT(DEPTNO).
Joining back to the same table requires the table being in the FROM clause twice, each with their own correlation ID. With each having their own correlation ID, it looks like 2 separate tables but in actuality only one table that gets referenced separately. Here is how the SQL will look.
SELECT D1.DEPTNO, D1.DEPTNAME, D1.MGRNO,
D1.ADMRDEPT, D2.DEPTNAME
FROM DEPT D1 INNER JOIN DEPT D2
ON D1.ADMRDEPT = D2.DEPTNO
;
Results:
If you code ‘WHERE ADMRDEPT = DEPTNO’ without the reference to a second DEPT table as in the following, you will only get the first row where the ADMRDEPT value equals the DEPTNO in the same row.
By coding the table a second time, the processing will take each ADMRDEPT value and go read the DEPT D2 based on the primary key DEPTNO.
2. Comparing rows in a table to other rows in the same table
When logic is needed to compare data in a table to other rows in the same table, the FROM clause must contain the table twice. This makes it look like you are comparing rows in a table to rows in another table. But again it is comparing the rows back to itself.
For example:
- Show employees that have the same birthdate as others
- Show employees that have higher education levels, yet lower salaries than others in the same table
EMP TABLE: EMPNO, LASTNAME, BIRTHDATE, EDLEVEL, SALARY, ….
Same Birthdate Example:
SELECT E1.EMPNO, E1.LASTNAME, E1.BIRTHDATE,
E2.EMPNO, E2.LASTNAME, E2.BIRTHDATE
FROM EMP E1 INNER JOIN EMP E2
ON E1.BIRTHDATE = E2.BIRTHDATE
WHERE E1.EMPNO < E2.EMPNO
;
Results:
NOTE: The reason for the E1.EMPNO < E2.EMPNO is because in the compare, the processing is looking at the tables individually (as if there are physically 2 tables). Without this added predicate, every employee will be returned with the same birthdate as itself because it will exist in both tables.
For example EMPNO 000010 exists in the E1 table and the E2 table (as do all the others). So in having this extra predicate on EMPNO, you are dropping the rows for the same employees.
NOTE: The key here is in the SQL. It must look like 2 separate tables, even though it is the same table. The processing however does not create a second table.
3. Lookup table is accessed from different values in output row.
There are times when a code/value is coming from two different places in each row of the output and needs to join to the same table to get a description.
For example: Query for the following:
DEPTNO, DEPTNAME, MGRNO, MGRNO LASTNAME, PROJNO, PROJNAME, RESPEMP, RESPEMP LASTNAME
In this query, you are looking for departments with their associated projects. You also need the lastname for the department manager (MGRNO column), and the lastname for the project manager (RESPEMP column) in row’s output.
The lastname values exist in the EMP table. But because the department manager could be different than the project manager (the MGRNO value is different than the RESPEMP value), they must be joined to separate EMP tables in order to get the different lastnames. The MGRNO and RESPEMP columns are both foreign keys to the same EMP primary key column EMPNO.
Here is the query: Notice the join to the EMP E1 table to get the department manager’s lastname and the join to the EMP E2 table to get the project manager’s lastname.
SELECT D.DEPTNO, D.DEPTNAME,
D.MGRNO, E1.LASTNAME,
P.PROJNO, P.PROJNAME,
P.RESPEMP, E2.LASTNAME
FROM DEPT D INNER JOIN PROJ P
ON D.DEPTNO = P.DEPTNO
INNER JOIN EMP E1
ON D.MGRNO = E1.EMPNO
INNER JOIN EMP E2
ON P.RESPEMP = E2.EMPNO
ORDER BY DEPTNO, PROJNO
;
Results:
Often times we see in cases like this the query gets broke out into separate queries, with an EMP query for retrieving lastname separate and then executed for each different value. Having them all in one query is much more efficient. It lessens the number of times a query is sent from your application to Db2 for execution. The lesser number of queries sent in a unit of work, the better.