微信搜索lxw1234bigdata | 邀请体验:数阅–数据管理、OLAP分析与可视化平台 | 赞助作者:赞助作者

Spark算子:RDD键值转换操作(3)–groupByKey、reduceByKey、reduceByKeyLocally

Spark lxw1234@qq.com 54611℃ 0评论

关键字:Spark算子、Spark RDD键值转换、groupByKey、reduceByKey、reduceByKeyLocally

groupByKey

def groupByKey(): RDD[(K, Iterable[V])]

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

该函数用于将RDD[K,V]中每个K对应的V值,合并到一个集合Iterable[V]中,

参数numPartitions用于指定分区数;

参数partitioner用于指定分区函数;

 

scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[89] at makeRDD at :21

scala> rdd1.groupByKey().collect
res81: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(2, 1)), (C,CompactBuffer(1)))

reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

该函数用于将RDD[K,V]中每个K对应的V值根据映射函数来运算。

参数numPartitions用于指定分区数;

参数partitioner用于指定分区函数;

scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at :21

scala> rdd1.partitions.size
res82: Int = 15

scala> var rdd2 = rdd1.reduceByKey((x,y) => x + y)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at reduceByKey at :23

scala> rdd2.collect
res85: Array[(String, Int)] = Array((A,2), (B,3), (C,1))

scala> rdd2.partitions.size
res86: Int = 15

scala> var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y) => x + y)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[95] at reduceByKey at :23

scala> rdd2.collect
res87: Array[(String, Int)] = Array((B,3), (A,2), (C,1))

scala> rdd2.partitions.size
res88: Int = 2

reduceByKeyLocally

def reduceByKeyLocally(func: (V, V) => V): Map[K, V]

该函数将RDD[K,V]中每个K对应的V值根据映射函数来运算,运算结果映射到一个Map[K,V]中,而不是RDD[K,V]。

scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at :21

scala> rdd1.reduceByKeyLocally((x,y) => x + y)
res90: scala.collection.Map[String,Int] = Map(B -> 3, A -> 2, C -> 1)


 

更多关于Spark算子的介绍,可参考 Spark算子

http://lxw1234.com/archives/tag/spark%E7%AE%97%E5%AD%90

 

 

如果觉得本博客对您有帮助,请 赞助作者

转载请注明:lxw的大数据田地 » Spark算子:RDD键值转换操作(3)–groupByKey、reduceByKey、reduceByKeyLocally

喜欢 (15)
分享 (0)
发表我的评论
取消评论
表情

Hi,您需要填写昵称和邮箱!

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址