2016 年,一互联网彩票行业的朋友说以下 SQL 要跑几十分钟(数据库环境 Oracle11gR2)。
select count(distinct a.user_name), count(distinct a.invest_id)from base_data_login_info@agent awhere a.str_day <= '20160304'and a.str_day >= '20160301'and a.channel_id in (select channel_rlatfrom tb_user_channel a, tb_channel_info bwhere a.channel_id = b.channel_idand a.user_id = 5002)and a.platform = a.platform;
Plan hash value: 2367445948----------------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|----------------------------------------------------------------------------------| 0 | SELECT STATEMENT | | 1 | 130 | 754 (2)|| 1 | SORT GROUP BY | | 1 | 130 | ||* 2 | HASH JOIN | | 4067K| 504M| 754 (2)||* 3 | HASH JOIN | | 11535 | 360K| 258 (1)||* 4 | TABLE ACCESS FULL| TB_USER_CHANNEL | 11535 | 157K| 19 (0)|| 5 | TABLE ACCESS FULL| TB_CHANNEL_INFO | 11767 | 206K| 238 (0)|| 6 | REMOTE | BASE_DATA_LOGIN_INFO | 190K| 17M| 486 (1)|----------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------2 - access("A"."CHANNEL_ID"="CHANNEL_RLAT")3 - access("A"."CHANNEL_ID"="B"."CHANNEL_ID")4 - filter("A"."USER_ID"=5002)Remote SQL Information (identified by operation id):----------------------------------------------------6 - SELECT "USER_NAME","INVEST_ID","STR_DAY","CHANNEL_ID","PLATFORM" FROM"BASE_DATA_LOGIN_INFO" "A" WHERE "STR_DAY"<='20160304' AND "STR_DAY">='20160301'AND "PLATFORM" IS NOT NULL (accessing 'AGENT' )
想要优化 SQL,必须要知道表大小。TBUSER_CHANNEL 有 1 万行数据,TB_CHANNEL INFO 有 1 万行左右,BASE_DATA_LOGIN_INFO 有 19 万行,过滤之后剩下 4 万行左右。执行计划走的是 HASH 连接,每个表都只扫描一次,虽然是全表扫描,但是最大表才 19 万行,按道理说不应该执行几十分钟,正常情况下应该可以 1 秒左右出结果。起初我们怀疑是 SQL 中 DBLINK 传输数据导致性能问题,于是在本地创建一个一模一样的表,但是该 SQL 还是执行缓慢。
我们只能一步一步排查 SQL 哪里出了问题,让朋友执行如下 SQL。
select count(*) ---改动了这里from base_data_login_info@agent awhere a.str_day <= '20160304'and a.str_day >= '20160301'and a.channel_id in (select channel_rlatfrom tb_user_channel a, tb_channel_info bwhere a.channel_id = b.channel_idand a.user_id = 5002)and a.platform = a.platform;
上面 SQL 可以秒出。于是朋友继续执行如下 SQL。
select count(a.user_name) ---改动了这里from base_data_login_info@agent awhere a.str_day <= '20160304'and a.str_day >= '20160301'and a.channel_id in (select channel_rlatfrom tb_user_channel a, tb_channel_info bwhere a.channel_id = b.channel_idand a.user_id = 5002)and a.platform = a.platform;
上面 SQL 也可以秒出。我们继续排查。
select count(a.user_name), count(a.invest_id) ---改动了这里from base_data_login_info@agent awhere a.str_day <= '20160304'and a.str_day >= '20160301'and a.channel_id in (select channel_rlatfrom tb_user_channel a, tb_channel_info bwhere a.channel_id = b.channel_idand a.user_id = 5002)and a.platform = a.platform;
以上 SQL 还是可以秒出,我们继续排查。
select count(distinct a.user_name), count(a.invest_id) ---改动了这里from base_data_login_info@agent awhere a.str_day <= 『20160304』and a.str_day >= 『20160301』and a.channel_id in (select channel_rlatfrom tb_user_channel a, tb_channel_info bwhere a.channel_id = b.channel_idand a.user_id = 5002)and a.platform = a.platform;
上面 SQL 依然可以秒出。现在我们找到引起 SQL 慢的原因了,select 中同时 count(distinct a.user_name),count(distinct a.invest_id)导致 SQL 查询缓慢。
在实际工作中,要优先解决问题,再去查找问题的根本原因。我们将 SQL 进行如下改写。
with t1 as(select /*+ materialize */a.user_name, a.invest_idfrom base_data_login_info@agent awhere a.str_day <= '20160304'and a.str_day >= '20160301'and a.channel_id in (select channel_rlatfrom tb_user_channel a, tb_channel_info bwhere a.channel_id = b.channel_idand a.user_id = 5002)and a.platform = a.platform)select count(distinct user_name) ,count(distinct invest_id) from t1;
为什么改写成以上 SQL 能解决性能问题呢?因为在排查问题的时候 count 不加 distinct 是可以秒出的,所以我们先将能秒出的 SQL 放到 with as 子句,通过添加 HINT::/+ materialize /生成临时表,再对临时表进行 **count(distinct...),count(distinct)**,这样就能解决问题。改写后的 SQL 执行计划如下。
Plan hash value: 901326807-------------------------------------------------------------------------------------| Id |Operation | Name | Rows |Bytes|Cost(%CPU)|-------------------------------------------------------------------------------------| 0 |SELECT STATEMENT | | 1 | 54| 1621 (1)|| 1 | TEMP TABLE TRANSFORMATION| | | | || 2 | LOAD AS SELECT | SYS_TEMP_0FD9D6720_EB8EA | | | ||* 3 | HASH JOIN RIGHT SEMI | | 190K| 22M| 744 (1)|| 4 | VIEW | VW_NSO_1 | 11535 | 304K| 258 (1)||* 5 | HASH JOIN | | 11535 | 360K| 258 (1)||* 6 | TABLE ACCESS FULL | TB_USER_CHANNEL | 11535 | 157K| 19 (0)|| 7 | TABLE ACCESS FULL | TB_CHANNEL_INFO | 11767 | 206K| 238 (0)|| 8 | REMOTE | BASE_DATA_LOGIN_INFO | 190K| 17M| 486 (1)|| 9 | SORT GROUP BY | | 1 | 54| || 10 | VIEW | | 190K| 9M| 878 (1)|| 11 | TABLE ACCESS FULL | SYS_TEMP_0FD9D6720_EB8EA | 190K| 9M| 878 (1)|-------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------3 - access("A"."CHANNEL_ID"="CHANNEL_RLAT")5 - access("A"."CHANNEL_ID"="B"."CHANNEL_ID")6 - filter("A"."USER_ID"=5002)Remote SQL Information (identified by operation id):----------------------------------------------------8 - SELECT "USER_NAME","INVEST_ID","STR_DAY","CHANNEL_ID","PLATFORM" FROM "BASE_DATA_LOGIN_INFO""A" WHERE "STR_DAY"<='20160304' AND "STR_DAY">='20160301'AND "PLATFORM" IS NOT NULL (accessing 'AGENT' )
解决问题之后,现在我们来查找 SQL 缓慢的根本原因。现在对比缓慢 SQL 的执行计划与秒出 SQL 的执行计划,缓慢 SQL 的执行计划如下。
----------------------------------------------------------------------------------| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|----------------------------------------------------------------------------------| 0 | SELECT STATEMENT | | 1 | 130 | 754 (2)|| 1 | SORT GROUP BY | | 1 | 130 | ||* 2 | HASH JOIN | | 4067K| 504M| 754 (2)||* 3 | HASH JOIN | | 11535 | 360K| 258 (1)||* 4 | TABLE ACCESS FULL| TB_USER_CHANNEL | 11535 | 157K| 19 (0)|| 5 | TABLE ACCESS FULL| TB_CHANNEL_INFO | 11767 | 206K| 238 (0)|| 6 | REMOTE | BASE_DATA_LOGIN_INFO | 190K| 17M| 486 (1)|----------------------------------------------------------------------------------
秒出 SQL 的执行计划如下。
-------------------------------------------------------------------------------------| Id |Operation |Name | Rows | Bytes |Cost(%CPU)|-------------------------------------------------------------------------------------| 0 |SELECT STATEMENT | | 1 | 54 | 1621 (1)|| 1 | TEMP TABLE TRANSFORMATION| | | | || 2 | LOAD AS SELECT |SYS_TEMP_0FD9D6720_EB8EA| | | ||* 3 | HASH JOIN RIGHT SEMI | | 190K| 22M| 744 (1)|| 4 | VIEW |VW_NSO_1 | 11535 | 304K| 258 (1)||* 5 | HASH JOIN | | 11535 | 360K| 258 (1)||* 6 | TABLE ACCESS FULL |TB_USER_CHANNEL | 11535 | 157K| 19 (0)|| 7 | TABLE ACCESS FULL |TB_CHANNEL_INFO | 11767 | 206K| 238 (0)|| 8 | REMOTE |BASE_DATA_LOGIN_INFO | 190K| 17M| 486 (1)|| 9 | SORT GROUP BY | | 1 | 54 | || 10 | VIEW | | 190K| 9M| 878 (1)|| 11 | TABLE ACCESS FULL |SYS_TEMP_0FD9D6720_EB8EA| 190K| 9M| 878 (1)|-------------------------------------------------------------------------------------
我们注意仔细对比执行计划,缓慢 SQL 执行计划中 Id=2 是 HASH JOIN,而秒出 SQL 的执行计划中 Id=3 是 HASH JOIN RIGHT SEMI。SEMI 是半连接特有关键字,缓慢 SQL 的执行计划中没有 SEMI 关键字,这说明 CBO 将半连接等价改写成了内连接;秒出 SQL 的执行计划有 SEMI 关键字,这说明 CBO 没有将半连接等价改写成内连接。现在我们得到结论,该 SQL 查询缓慢是因为 CBO 内部将半连接改写为内连接导致。
大家还记得半连接与内连接接区别吗?半连只返回一个表的数据,关联之后数据量不会翻番,内连接表关联之后数据量可能会翻番。该 SQL 查询缓慢是被改成内连接导致,现在我们有充分理由怀疑内连接关联之后返回的数据量太大,因为如果关联返回的数据量很少是不可能出性能问题的。于是检查两个表连接列的数据分布。
select channel_id, count(*)from base_data_login_infogroup by channel_idorder by 2;
CHANNEL_ID COUNT(*)-------------------------------------------------- ----------011a1 2003a1 3021a1 3006a1 12024h2 16013a1 19007a1 24012a1 25005a1 27EPT01 36028h2 109008a1 139029a1 841009a1 921014a1 1583000a1 1975a0001 2724004a1 5482001a1 16329026h2 160162
in 子查询关联列数据分布如下。
select channel_rlat, count(*)from tb_user_channel a, tb_channel_info bwhere a.channel_id = b.channel_idand a.user_id = 5002group by channel_rlatorder by 2 desc;
channel_rlat count(*)026h2 10984024h2 7002h2 6023a2 2007s001022001 1007s001022002 1007s001024007 1007s001024009 1007s001022009 1001s001006 1001s001008 1001s001001001 1001s001001003 1001s001001007 1001s001001014 1007s001018003 1007s001018007 1007s001019005 1007s001019008 1001s001002011 1007s001011003 1007s001034 1007s001023005 1
两表的数据分布果然有问题,其中 026h2 这条数据倾斜特别明显。如果让两表进行内连接,026h2 这条数据关联之后返回结果应该是 16016210984,现在我们终于发现该 SQL 执行缓慢的根本原因,是因为*两个表的连接列中有部分数据倾斜非常严重。
最初采用的是 with as 子句加/*+ materialize */临时解决 SQL 的性能问题,我们也可以使用 rownum 优化 SQL,rownum 可以让一个查询被当成一个整体。
with t1 as(selecta.user_name, a.invest_idfrom base_data_login_info@agent awhere a.str_day <= '20160304'and a.str_day >= '20160301'and a.channel_id in (select channel_rlatfrom tb_user_channel a, tb_channel_info bwhere a.channel_id = b.channel_idand a.user_id = 5002)and a.platform = a.platform and rownum>0)select count(distinct user_name) ,count(distinct invest_id) from t1;
如果大家想模拟本案例,可以跟着下面实验步骤执行(请在 11g 中模拟)。
我们先创建如下两个测试表。
create table a as select * from dba_objects;create table b as select * from dba_objects;
要执行的缓慢的 SQL 如下。
select count(distinct owner), count(distinct object_name)from awhere owner in (select owner from b);
优化改写之后的 SQL 如下。
with t as(select owner, object_namefrom awhere owner in (select owner from b)and rownum > 0)select count(distinct owner), count(distinct object_name)from t;
我们也可以对子查询先去重,将子查询变成 1 的关系,这样也能优化 SQL。
select count(distinct owner), count(distinct object_name)from awhere owner in (select owner from b group by owner);
请思考为什么 Oracle11g CBO 会将 SQL 改写为内连接?大家是否还记得半连接的内容?
**select ... from 1** 的表 **where owner in (select owner from n 的表)**改写为内连接,需要加 distinct。
**select ... from n**的表 **where owner in (select owner from 1 的表)**改写为内连接,不需要加 distinct。
我们的 SQL 是select count(distinct ),count(distinct),所以 CBO 直接将 SQL 改写为 select count(distinct a.owner),count(distinct object_name)from a,b where a.owner=b.owner;这个问题在 12c 中已得到纠正。最后我们想说的就是,不管以后优化器进步有多大,我们始终不能依赖优化器,唯一可以依靠的就是自己所掌握的知识。
