Establish a SparkContext for the remaining tests
import pyspark
# Run locally, using all available cores
sc = pyspark.SparkContext('local[*]')
# Distribute a local range as an RDD, then sample five elements without replacement
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
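takeSample is an action, so it returns an ordinary Python list to the driver. For a reproducible sample it also accepts an optional seed; the value 42 below is an arbitrary choice:
# Same five elements on every run, thanks to the fixed seed
rdd.takeSample(False, 5, seed=42)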
# Test processing of a text file, based on the Spark quick start guide
textFile = sc.textFile("README.md")
Use a Spark action to count the lines in the file
textFile.count()
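Another quick action to try is first(), which returns the first line of the file:
# First line of README.md
textFile.first()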
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
print("Lines with spark: %s" % linesWithSpark.count())
# Find the number of words in the longest line
mostWordsInLine = textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
mostWordsInLine
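The reduce lambda above simply re-implements a maximum; Python's built-in max does the same job and reads more clearly:
# Equivalent, using the built-in max
textFile.map(lambda line: len(line.split())).reduce(max)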
Count the occurrences of each word in README.md
# Map each word to a count of one, then sum the counts per word
wordCounts = (textFile.flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b))
wordCounts.collect()
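collect() pulls the entire result set back to the driver, which is fine for a small README but risky for large data. A safer pattern for a quick look is takeOrdered, which returns only the top results; the sketch below takes the ten most frequent words (the cutoff of ten is an arbitrary choice, and the key sorts by descending count):
# Ten most frequent words, sorted by descending count
wordCounts.takeOrdered(10, key=lambda pair: -pair[1])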