Thursday, August 2, 2012

Hacking nested IN and NOT IN queries in hive

If you are not using nested select, then IN and NOT IN queries work fine.

But if you are looking for nested select, here is a alternate way to solve it :

For IN queries :

[SQL]    select id,name from user where id in (select id from students)
[HIVE]   select user.id,user.name from user JOIN (select id from students) tmp on (tmp.id = user.id)

For NOT IN queries :
[SQL]   select id,name from user where id not in (select id from students)
[HIVE]  select user.id,user.name from user  LEFT OUTER JOIN students ON (students.id = user.id) where students.id is null

Above will work only when students.id is supposed to be not null. Essentially idea is to have no rows from second table and all rows from first table in left outer join. So, accordingly one can modify where clause. Also in case of multiple joins, you might encounter duplicate rows in result. In that use,
[HIVE] select distinct user.id,user.name from user  LEFT OUTER JOIN students ON (students.id = user.id) where students.id is null

Distinct will involve one more step  of map reduce in HIVE.