Distributed Computing 5 | Paired RDDs

Series: Distributed Computing

  1. Paired RDDs

(1) Definition

Pair RDDs are RDDs whose elements are key-value pairs; they are commonly used for aggregations, joins, and ETL workloads in Spark.

  • Key: can be any type, but must be hashable (not a list or dict) before key-based transformations such as groupByKey() or reduceByKey() are applied
  • Value: can be any type; hashability only matters for operations that hash whole elements, such as distinct()
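The hashability constraint can be checked without Spark at all: key-based transformations rely on Python's built-in hash(), which rejects lists and dicts. A plain-Python illustration (not Spark code):

```python
# Key-based transformations hash the key, so the key must be hashable.
pairs_ok = [(5, 2), ((1, 2), 3)]   # int and tuple keys are hashable
pairs_bad = [([5], 2)]             # a list key is unhashable

for key, _ in pairs_ok:
    hash(key)                      # succeeds for hashable keys

try:
    hash(pairs_bad[0][0])          # hashing a list raises TypeError
except TypeError:
    print("a list key cannot be hashed")
```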

(2) Paired RDD Transformations

  • Return an RDD of just the keys.
rdd.keys()
  • Return an RDD of just the values.
rdd.values()
  • Return an RDD sorted by the key.
rdd.sortByKey()
  • Group values with the same key.
rdd.groupByKey()
  • Apply a function to each value of a pair RDD without changing the key.
rdd.mapValues(func)
  • Pass each value in the key-value pair RDD through a flatMap function without changing the keys.
rdd.flatMapValues(func)
  • Combine values with the same key.
rdd.reduceByKey(func)
  • Remove elements whose key is present in the other RDD.
rdd.subtractByKey(otherDataset)
  • Perform an inner join between two RDDs.
rdd.join(otherDataset)
  • Perform a join between two RDDs where the key must be present in the first RDD.
rdd.leftOuterJoin(otherDataset)
  • Perform a join between two RDDs where the key must be present in the other RDD.
rdd.rightOuterJoin(otherDataset)
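The per-key semantics of several of these transformations can be sketched in plain Python on a small list of pairs. This is only a local approximation of what Spark does across distributed partitions, not real Spark code:

```python
from collections import defaultdict
from functools import reduce

pairs = [(7, 2), (7, 4), (3, 6)]

# sortByKey: order pairs by their key
sorted_pairs = sorted(pairs, key=lambda kv: kv[0])   # [(3, 6), (7, 2), (7, 4)]

# groupByKey: collect all values sharing a key
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)                              # {7: [2, 4], 3: [6]}

# mapValues: transform each value, keys unchanged
doubled = [(k, v * 2) for k, v in pairs]             # [(7, 4), (7, 8), (3, 12)]

# reduceByKey: combine values sharing a key with a function
reduced = {k: reduce(lambda x, y: x + y, vs) for k, vs in groups.items()}
# {7: 6, 3: 6}
```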

(3) Paired RDD Actions

  • Only available on pair RDDs. Count the elements for each key and return the result to the driver as a dictionary of (key, count) entries.
rdd.countByKey()
  • Return all values associated with the provided key.
rdd.lookup(key)
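Both actions can likewise be mimicked in plain Python. This is an illustration of the semantics only; the lookup helper below is a hypothetical stand-in for RDD.lookup, not Spark code:

```python
from collections import Counter

pairs = [(7, 2), (7, 4), (3, 6), (1, 4), (3, 7)]

# countByKey: count occurrences of each key, returned as a dict-like object
counts = Counter(k for k, _ in pairs)   # {7: 2, 3: 2, 1: 1}

# lookup(key): list of all values associated with one key
def lookup(pairs, key):
    return [v for k, v in pairs if k == key]

lookup(pairs, 3)                        # [6, 7]
```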

  2. Self Testing Questions

(1) Suppose we have the following RDD.

rdd = sc.parallelize([[5,2],[7,4],[3,6],[1,4],[9,7]],3)
  • What is the result of the following code?
rdd_pair = rdd.map(lambda x: ([x[0]], x[1]))
rdd_pair.take(1)

Ans: [([5], 2)] because the paired key can be a list; map() does not hash the key.

  • What is the result of the following code?
rdd_pair = rdd.map(lambda x: ({x[0]:1}, x[1]))
rdd_pair.take(1)

Ans: [({5: 1}, 2)] because the paired key can be a dictionary; map() does not hash the key.

  • What is the result of the following code?
rdd_pair = rdd.map(lambda x: (x[0], {x[1]:1}))
rdd_pair.distinct().count()

Ans: ERROR because distinct() hashes each (key, value) element, and the dictionary value is unhashable.

  • What is the result of the following code?
rdd_pair = rdd.map(lambda x: (x[0], str(x[1])))
rdd_pair.distinct().count()

Ans: 5
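The pattern behind these four questions is plain Python hashability: distinct() must hash each (key, value) tuple, and a tuple is hashable only when all of its members are. This can be verified without Spark:

```python
hashable_pair = (5, "2")          # int and str members: hashable
hash(hashable_pair)               # succeeds, so distinct() could deduplicate it

unhashable_pair = (5, {2: 1})     # dict member: unhashable
try:
    hash(unhashable_pair)         # raises TypeError, as distinct() would
except TypeError:
    print("a pair containing a dict cannot be hashed")
```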

(2) Suppose we have the following paired RDD.

rdd = sc.parallelize([[7,2],[7,4],[3,6],[1,4],[3,7]],3)
rdd = rdd.map(lambda x: (x[0], x[1]))
  • What is the result of the following code?
rdd.flatMap(lambda x: x).collect()

Ans: [7, 2, 7, 4, 3, 6, 1, 4, 3, 7] because flatMap() flattens each pair, so the result contains no pairs.

  • What is the result of the following code?
rdd.flatMapValues(lambda x: x).collect()

Ans: ERROR because the value for each pair is not iterable.

  • What is the result of the following code?
rdd.sortByKey().groupByKey().take(1)[0][1] == [6, 7]

Ans: False because groupByKey() returns a ResultIterable object as each value, and a ResultIterable never compares equal to a list. The following code returns True:

rdd.sortByKey().groupByKey().mapValues(list).take(1)[0][1] == [6, 7]
  • What is the result of the following code?
rdd.groupByKey().mapValues(list).flatMapValues(lambda x: x).collect() == rdd.collect()

Ans: False. Although the two results contain the same elements, their order can differ after the shuffle. The following comparison returns True:

set(rdd.groupByKey().mapValues(list).flatMapValues(lambda x: x).collect()) == set(rdd.collect())
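The order-insensitive comparison works because each pair is a tuple, and tuples with hashable members can go into a set. Plain Python, no Spark needed:

```python
a = [(7, 2), (3, 6), (7, 4)]
b = [(3, 6), (7, 4), (7, 2)]      # same pairs, different order

print(a == b)                     # False: list comparison is order-sensitive
print(set(a) == set(b))           # True: set comparison ignores order
```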
  • What’s the result of the following code?
rdd.groupByKey().mapValues(list).reduceByKey(lambda x, y: x + y).take(1)

Ans: [(3, [6, 7])]. Here reduceByKey() has no effect because groupByKey() already leaves every key unique. To sum the values per key, apply reduceByKey() directly:

rdd.reduceByKey(lambda x, y: x + y).take(1)
  • What’s the result of the following code?
rdd.lookup(3)

Ans: [6, 7] because lookup() always returns a list of the values for the given key.

  • What’s the result of the following code?
rdd.countByKey()[3]

Ans: 2 because countByKey() returns a dictionary mapping each key to the number of pairs that have that key.

(3) Which of the following transformations don’t require data shuffle?

  • keys()
  • values()
  • sortByKey()
  • groupByKey()
  • mapValues(func)
  • flatMapValues(func)
  • reduceByKey(func)

Ans: keys(), values(), mapValues(func), flatMapValues(func)

(4) Which of the following code performs better?

Code 1: rdd.groupByKey().mapValues(sum).collect()
Code 2: rdd.reduceByKey(lambda x, y: x + y).collect()

Ans: Code 2, because groupByKey() shuffles every record across the network before aggregating, while reduceByKey() first combines values locally within each partition (a map-side combine), so much less data is shuffled.
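The difference can be sketched by counting how many records each approach would send over the network, using three hypothetical partitions. This is a plain-Python model of the shuffle, not Spark code:

```python
from collections import defaultdict

# Hypothetical partitions of a pair RDD
partitions = [[(7, 2), (7, 4)], [(3, 6), (1, 4)], [(3, 7)]]

# groupByKey-style: every record crosses the network
shuffled_group = sum(len(p) for p in partitions)

# reduceByKey-style: each partition pre-combines values per key locally,
# so at most one record per key per partition is shuffled
shuffled_reduce = 0
for p in partitions:
    local = defaultdict(int)
    for k, v in p:
        local[k] += v              # local (map-side) combine
    shuffled_reduce += len(local)

print(shuffled_group, shuffled_reduce)   # 5 4
```

Here the saving is small (5 records vs 4), but with many duplicate keys per partition the reduceByKey-style pre-combine shrinks the shuffle dramatically.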