Keywords: Hive Orc, reading and writing Hive OrcFile with the Java API
Following up on the earlier article "Java API 读取Hive Orc文件" (Reading Hive Orc Files with the Java API), this post shows how to write ORC-format files with the Java API.
The code below writes three rows of data:
张三,20
李四,22
王五,30
to the file /tmp/lxw1234/orcoutput/lxw1234.com.orc on HDFS.
package com.lxw1234.test;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

/**
 * lxw的大数据田地 -- http://lxw1234.com
 * @author lxw.com
 */
public class TestOrcWriter {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileSystem fs = FileSystem.get(conf);
        Path outputPath = new Path("/tmp/lxw1234/orcoutput/lxw1234.com.orc");

        // Build a struct ObjectInspector from MyRow via reflection;
        // it describes the ORC schema (name:string, age:int).
        StructObjectInspector inspector =
                (StructObjectInspector) ObjectInspectorFactory.getReflectionObjectInspector(
                        MyRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

        OrcSerde serde = new OrcSerde();
        OutputFormat outFormat = new OrcOutputFormat();
        RecordWriter writer = outFormat.getRecordWriter(fs, conf,
                outputPath.toString(), Reporter.NULL);

        // Each row is serialized by OrcSerde and handed to the ORC RecordWriter;
        // the key is ignored, so NullWritable is used.
        writer.write(NullWritable.get(), serde.serialize(new MyRow("张三", 20), inspector));
        writer.write(NullWritable.get(), serde.serialize(new MyRow("李四", 22), inspector));
        writer.write(NullWritable.get(), serde.serialize(new MyRow("王五", 30), inspector));

        writer.close(Reporter.NULL);
        fs.close();
        System.out.println("write success .");
    }

    static class MyRow implements Writable {
        String name;
        int age;

        MyRow(String name, int age) {
            this.name = name;
            this.age = age;
        }

        @Override
        public void readFields(DataInput arg0) throws IOException {
            throw new UnsupportedOperationException("no read");
        }

        @Override
        public void write(DataOutput arg0) throws IOException {
            throw new UnsupportedOperationException("no write");
        }
    }
}
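As a side note, hive-exec also exposes a lower-level Writer API that skips OrcOutputFormat and OrcSerde entirely. The following is only a minimal sketch, assuming the OrcFile.writerOptions builder available in hive-exec 0.13; the class name TestOrcFileWriter and the output path lxw1234.com.writer.orc are hypothetical names used just for this illustration, and MyRow is reused from the program above.

package com.lxw1234.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

/**
 * Hypothetical alternative to TestOrcWriter: writes the same rows
 * through the ORC Writer API instead of OrcOutputFormat/OrcSerde.
 */
public class TestOrcFileWriter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Same reflection-based inspector as in TestOrcWriter; MyRow is reused from that class.
        ObjectInspector inspector = ObjectInspectorFactory.getReflectionObjectInspector(
                TestOrcWriter.MyRow.class,
                ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

        // Assumed output path, used only for this illustration.
        Path outputPath = new Path("/tmp/lxw1234/orcoutput/lxw1234.com.writer.orc");

        // OrcFile.createWriter handles file creation and ORC encoding directly.
        Writer writer = OrcFile.createWriter(outputPath,
                OrcFile.writerOptions(conf).inspector(inspector));

        writer.addRow(new TestOrcWriter.MyRow("张三", 20));
        writer.addRow(new TestOrcWriter.MyRow("李四", 22));
        writer.addRow(new TestOrcWriter.MyRow("王五", 30));
        writer.close();

        System.out.println("write success .");
    }
}

This post sticks with the OrcOutputFormat route, since it maps more directly onto how a MapReduce job would emit ORC output.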
Package the program above as orc.jar, upload it to a Hadoop client machine, and run the following commands:
export HADOOP_CLASSPATH=/usr/local/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar:$HADOOP_CLASSPATH
hadoop jar orc.jar com.lxw1234.test.TestOrcWriter
After it runs successfully, check the file on HDFS:
[liuxiaowen@dev tmp]$ hadoop fs -ls /tmp/lxw1234/orcoutput/
Found 1 items
-rw-r--r--   2 liuxiaowen supergroup        312 2015-08-18 18:09 /tmp/lxw1234/orcoutput/lxw1234.com.orc
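If your Hive build includes the ORC file dump utility, you can also inspect the file's metadata (schema, stripe layout, row count) directly from the command line without creating a table; the command would look roughly like this:

hive --orcfiledump /tmp/lxw1234/orcoutput/lxw1234.com.orc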
Next, create an external table in Hive whose location points at that directory, with the file format set to ORC:
CREATE EXTERNAL TABLE lxw1234 (
  name STRING,
  age INT
)
STORED AS ORC
LOCATION '/tmp/lxw1234/orcoutput/';
Query the table in Hive:
hive> desc lxw1234;
OK
name                    string
age                     int
Time taken: 0.148 seconds, Fetched: 2 row(s)
hive> select * from lxw1234;
OK
张三    20
李四    22
王五    30
Time taken: 0.1 seconds, Fetched: 3 row(s)
hive>
OK, the data is displayed correctly.
Note: this program is only a feasibility test. If the volume of ORC data is large, the program needs to be improved, or MapReduce should be used instead; a follow-up post will cover reading and writing Hive ORC files with MapReduce.