[一起学Hive]之十九-使用Hive API分析HQL的执行计划、Job数量和表的血缘关系

关键字：Hive HQL Job数量、Hive执行计划、Hive LineageInfo

本文介绍使用Hive的API获取一条HQL的最终执行计划，从而获取这条HQL的Job数量，另外，介绍使用API分析一条HQL中所包含的输入表和输出表。这些信息在做元数据管理和Hive表的血缘分析时候很有用。

Hive在执行一条HQL的时候，会经过以下步骤：

语法解析：Antlr定义SQL的语法规则，完成SQL词法，语法解析，将SQL转化为抽象语法树AST Tree；
语义解析：遍历AST Tree，抽象出查询的基本组成单元QueryBlock；
生成逻辑执行计划：遍历QueryBlock，翻译为执行操作树OperatorTree；
优化逻辑执行计划：逻辑层优化器进行OperatorTree变换，合并不必要的ReduceSinkOperator，减少shuffle数据量；

生成物理执行计划：遍历OperatorTree，翻译为MapReduce任务；
优化物理执行计划：物理层优化器进行MapReduce任务的变换，生成最终的执行计划；

关于这几个步骤，在美团的技术博客上有一篇文章介绍的非常好，可以参考：http://tech.meituan.com/hive-sql-to-mapreduce.html

一般情况下，HQL中的每一个表或者子查询都会生成一个job，这是逻辑执行计划中生成的，但后面Hive还会优化，比如：使用MapJoin，最终一条HQL语句生成的job数量很难通过HQL观察出来。

获取HQL的执行计划和Job数量

直接看代码吧：

package com.lxw1234.test;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Context;
import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseUtils;
import org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory;
import org.apache.hadoop.hive.ql.session.SessionState;

/**
 * lxw的大数据田地 -- lxw1234.com
 * @author lxw1234
 *
 */
public class HiveQueryPlan {
	public static void main(String[] args) throws Exception {
		HiveConf conf = new HiveConf();
		conf.addResource(new Path("file:///usr/local/apache-hive-0.13.1-bin/conf/hive-site.xml"));
		conf.addResource(new Path("file:///usr/local/apache-hive-0.13.1-bin/conf/hive-default.xml.template"));
		conf.set("javax.jdo.option.ConnectionURL",
				"jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf-8");
		conf.set("hive.metastore.local", "true");
		conf.set("javax.jdo.option.ConnectionDriverName","com.mysql.jdbc.Driver");
		conf.set("javax.jdo.option.ConnectionUserName", "hive");
		conf.set("javax.jdo.option.ConnectionPassword", "hive");
		conf.set("hive.stats.dbclass", "jdbc:mysql");
		conf.set("hive.stats.jdbcdriver", "com.mysql.jdbc.Driver");
		conf.set("hive.exec.dynamic.partition.mode", "nonstrict");
		
		String command = args[0];
		SessionState.start(conf);
		Context ctx = new Context(conf);
		ParseDriver pd = new ParseDriver();
		ASTNode tree = pd.parse(command, ctx);
		tree = ParseUtils.findRootNonNullToken(tree);
		BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, tree);
		sem.analyze(tree, ctx);
		sem.validate();
		QueryPlan queryPlan = new QueryPlan(command,sem,0l);
		int jobs = Utilities.getMRTasks(queryPlan.getRootTasks()).size();
		System.out.println("Total jobs = " + jobs);
		
	}
}

将上面的代码打包成testhive.jar，运行该类需要引入Hive的依赖包，在包含Hadoop和Hive客户端的机器上执行下面的命令：

for f in /usr/local/apache-hive-0.13.1-bin/lib/*.jar; do
  HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$f;
done
export HADOOP_CLASSPATH

分别解析下面三条HQL语句：

HQL1：SELECT COUNT(1) FROM liuxiaowen.lxw1;
HQL2：SELECT COUNT(1) FROM (SELECT url FROM liuxiaowen.lxw1 GROUP BY url) x;
HQL3：SELECT COUNT(1) FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain);

解析HQL1：

hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM liuxiaowen.lxw1"

结果如下：

Hive QueryPlan

解析HQL2：

hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM (SELECT url FROM liuxiaowen.lxw1 GROUP BY url) x"

结果如下：

Hive QueryPlan

解析HQL3：

hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain)"

结果如下：

Hive QueryPlan

在HQL3中，由于Hive自动优化使用了MapJoin，因此，两个表的join最终只用了一个job，在Hive中执行验证一下：

Hive QueryPlan

解析HQL中表的血缘关系

在元数据管理中，可能需要知道Hive中有哪些表，以及这些表之间的关联关系，比如：A表是由B表和C表统计汇总而来。

Hive中本身自带了一个工具，用来分析一条HQL中的源表和目标表，org.apache.hadoop.hive.ql.tools.LineageInfo

但该类中目标表只能是使用INSERT语句插入数据的目标表，对于使用CREATE TABLE AS语句创建的表分析不出来。

下面的代码只对org.apache.hadoop.hive.ql.tools.LineageInfo做了小小的修改：

package com.lxw1234.test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Stack;
import java.util.TreeSet;

import org.apache.hadoop.hive.ql.lib.DefaultGraphWalker;
import org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher;
import org.apache.hadoop.hive.ql.lib.Dispatcher;
import org.apache.hadoop.hive.ql.lib.GraphWalker;
import org.apache.hadoop.hive.ql.lib.Node;
import org.apache.hadoop.hive.ql.lib.NodeProcessor;
import org.apache.hadoop.hive.ql.lib.NodeProcessorCtx;
import org.apache.hadoop.hive.ql.lib.Rule;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
import org.apache.hadoop.hive.ql.parse.HiveParser;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseException;
import org.apache.hadoop.hive.ql.parse.SemanticException;

/**
 * lxw的大数据田地 -- lxw1234.com
 * @author lxw1234
 *
 */
public class HiveLineageInfo implements NodeProcessor {

  /**
   * Stores input tables in sql.
   */
  TreeSet inputTableList = new TreeSet();
  /**
   * Stores output tables in sql.
   */
  TreeSet OutputTableList = new TreeSet();

  /**
   *
   * @return java.util.TreeSet
   */
  public TreeSet getInputTableList() {
    return inputTableList;
  }

  /**
   * @return java.util.TreeSet
   */
  public TreeSet getOutputTableList() {
    return OutputTableList;
  }

  /**
   * Implements the process method for the NodeProcessor interface.
   */
  public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx,
      Object... nodeOutputs) throws SemanticException {
    ASTNode pt = (ASTNode) nd;

    switch (pt.getToken().getType()) {
    
    case HiveParser.TOK_CREATETABLE:
    	OutputTableList.add(BaseSemanticAnalyzer.getUnescapedName((ASTNode)pt.getChild(0)));
	    break;
    case HiveParser.TOK_TAB:
      OutputTableList.add(BaseSemanticAnalyzer.getUnescapedName((ASTNode)pt.getChild(0)));
      break;

    case HiveParser.TOK_TABREF:
      ASTNode tabTree = (ASTNode) pt.getChild(0);
      String table_name = (tabTree.getChildCount() == 1) ?
          BaseSemanticAnalyzer.getUnescapedName((ASTNode)tabTree.getChild(0)) :
            BaseSemanticAnalyzer.getUnescapedName((ASTNode)tabTree.getChild(0)) + "." + tabTree.getChild(1);
      inputTableList.add(table_name);
      break;
    }
    return null;
  }

  /**
   * parses given query and gets the lineage info.
   *
   * @param query
   * @throws ParseException
   */
  public void getLineageInfo(String query) throws ParseException,
      SemanticException {

    /*
     * Get the AST tree
     */
    ParseDriver pd = new ParseDriver();
    ASTNode tree = pd.parse(query);

    while ((tree.getToken() == null) && (tree.getChildCount() > 0)) {
      tree = (ASTNode) tree.getChild(0);
    }

    /*
     * initialize Event Processor and dispatcher.
     */
    inputTableList.clear();
    OutputTableList.clear();

    // create a walker which walks the tree in a DFS manner while maintaining
    // the operator stack. The dispatcher
    // generates the plan from the operator tree
    Map<Rule, NodeProcessor> rules = new LinkedHashMap<Rule, NodeProcessor>();

    // The dispatcher fires the processor corresponding to the closest matching
    // rule and passes the context along
    Dispatcher disp = new DefaultRuleDispatcher(this, rules, null);
    GraphWalker ogw = new DefaultGraphWalker(disp);

    // Create a list of topop nodes
    ArrayList topNodes = new ArrayList();
    topNodes.add(tree);
    ogw.startWalking(topNodes, null);
  }

  public static void main(String[] args) throws IOException, ParseException,
      SemanticException {
    String query = args[0];
    HiveLineageInfo lep = new HiveLineageInfo();
    lep.getLineageInfo(query);
    System.out.println("Input tables = " + lep.getInputTableList());
	System.out.println("Output tables = " + lep.getOutputTableList());
  }
}

将上面的程序打包成testhive.jar，同上面，执行时候需要引入Hive的依赖包：

分析下面两条HQL语句：

HQL1:CREATE TABLE liuxiaowen.lxw1234 AS SELECT * FROM liuxiaowen.lxw1;
HQL2:INSERT OVERWRITE TABLE liuxiaowen.lxw3 SELECT a.url FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain);

执行命令：

hadoop jar testhive.jar com.lxw1234.test.HiveLineageInfo "CREATE TABLE liuxiaowen.lxw1234 AS SELECT * FROM liuxiaowen.lxw1"
hadoop jar testhive.jar com.lxw1234.test.HiveLineageInfo "INSERT OVERWRITE TABLE liuxiaowen.lxw3 SELECT a.url FROM liuxiaowen.lxw1 a 
     join liuxiaowen.lxw2 b ON (a.url = b.domain)"

分析结果：

Hive LineageInfo

HQL中的Input table和Output table已经正确解析出来。

Hive相关文章（持续更新）：

一起学Hive系列

—-Hive概述，Hive是什么

—-Hive函数大全-完整版

—-Hive中的数据库(Database)和表(Table)

—-Hive整合HBase，操作HBase表

—-Hive的元数据表结构详解

—-分析Hive表和分区的统计信息(Statistics)

—-Hive的WEB页面接口-HWI

—-从Hive表中进行数据抽样-Sampling

—-Hive UDF开发

Hive分析函数系列

Hive索引

hive优化之——控制hive任务中的map数和reduce数

如果觉得本博客对您有帮助，请赞助作者。

转载请注明：lxw的大数据田地 » [一起学Hive]之十九-使用Hive API分析HQL的执行计划、Job数量和表的血缘关系

你好，楼主。我想问一下，如果我想进一步做到字段之间的血缘关系呢。比如想知道A表中的字段来自于哪个表的哪个字段？这个可以实现么？

sunnyhaizi2016-03-22 17:07 回复

请教你现在做了么
soso2018-07-05 22:02 回复

首先赞一个, 楼主好强大

!! 想尝试着学习hive中血缘关系解析,git下载了hive的源码, 其中缺少了org.apache.hadoop.hive.ql.parse.HiveParser这个关键类, 需要通过antlr编译得到, 但这个过程总是报错,楼主有什么好办法吗? 非常感谢!!

peijd2016-12-08 19:32 回复

看了您的博客真是收益不少～～～

xuechenxi2017-08-27 23:18 回复

ParseDriver的parse 方法无法解析from_unixtime 请问怎么解决

soso2018-07-05 22:02 回复

import org.apache.hadoop.hive.ql.tools.LineageInfo; public class TestLineageInfo { public static void main(String[] args) throws Exception { String query = "INSERT OVERWRITE TABLE cxy7_dw.tmp_zone_info PARTITION (dt='20171109') SELECT z.zoneid AS zone_id,z.zonename AS zone_name, c.cityid AS city_id, c.cityname AS city_name FROM dict_zoneinfo z LEFT JOIN dict_cityinfo c ON z.cityid = c.cityid AND z.dt='20171109' AND c.dt='20171109' WHERE z.dt='20171109' AND c.dt='20171109'"; LineageInfo.main(new String[] { query }); } } 以上代码执行报错： org.apache.hadoop.hive.ql.parse.ParseException: line 1:189 cannot recognize input near 'LEFT' 'JOIN' 'dict_cityinfo' in join type specifier

zzfang2018-10-12 16:25 回复

lxw的大数据田地 Hadoop/Hive/HBase/Spark/Java

[一起学Hive]之十九-使用Hive API分析HQL的执行计划、Job数量和表的血缘关系

获取HQL的执行计划和Job数量

解析HQL中表的血缘关系

Hive相关文章（持续更新）：

Hi，您需要填写昵称和邮箱！