
Thursday, September 5, 2019

PySpark


Download, Install, and Run PySpark


  1. For Mac Users: Enable "Remote Login"
System Preferences --> Sharing --> enable "Remote Login" service
  2. Make Sure Java Is Installed Properly
java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
  3. Download
Download the Spark binary distribution (spark-1.6.1-bin-hadoop2.6 in this walkthrough) from the following URL:
http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
  4. Open the Downloaded File
Assuming the file was downloaded to /Users/mparsian/spark-1.6.1-bin-hadoop2.6.tgz, extract it:
cd /Users/mparsian

tar zxvf spark-1.6.1-bin-hadoop2.6.tgz
x spark-1.6.1-bin-hadoop2.6/
x spark-1.6.1-bin-hadoop2.6/NOTICE
x spark-1.6.1-bin-hadoop2.6/CHANGES.txt
...
...
...
x spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar
x spark-1.6.1-bin-hadoop2.6/README.md

  5. Start the Spark Cluster
cd /Users/mparsian/spark-1.6.1-bin-hadoop2.6/

ls -l
total 2736
-rw-r--r--@  1 mparsian  897801646  1343562 Feb 26 21:02 CHANGES.txt
-rw-r--r--@  1 mparsian  897801646    17352 Feb 26 21:02 LICENSE
-rw-r--r--@  1 mparsian  897801646    23529 Feb 26 21:02 NOTICE
drwxr-xr-x@  3 mparsian  897801646      102 Feb 26 21:02 R
-rw-r--r--@  1 mparsian  897801646     3359 Feb 26 21:02 README.md
-rw-r--r--@  1 mparsian  897801646      120 Feb 26 21:02 RELEASE
drwxr-xr-x@ 25 mparsian  897801646      850 Feb 26 21:02 bin
drwxr-xr-x@  9 mparsian  897801646      306 Feb 26 21:02 conf
drwxr-xr-x@  3 mparsian  897801646      102 Feb 26 21:02 data
drwxr-xr-x@  6 mparsian  897801646      204 Feb 26 21:02 ec2
drwxr-xr-x@  3 mparsian  897801646      102 Feb 26 21:02 examples
drwxr-xr-x@  8 mparsian  897801646      272 Feb 26 21:02 lib
drwxr-xr-x@ 37 mparsian  897801646     1258 Feb 26 21:02 licenses
drwxr-xr-x@  9 mparsian  897801646      306 Feb 26 21:02 python
drwxr-xr-x@ 24 mparsian  897801646      816 Feb 26 21:02 sbin


./sbin/start-all.sh
  6. Check Master and Worker
Make sure that the Master and Worker processes are running:
jps
1347 Master
1390 Worker
  7. Check the Spark URL
Open the master web UI in a browser:
http://localhost:8080
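The pyspark shell started in the next steps creates its own SparkContext for you. If you instead want to build a SparkContext yourself against the standalone master behind this web UI, a minimal sketch looks like the following (assuming the default standalone master port 7077; adjust the host, port, and application name to your setup):

from pyspark import SparkConf, SparkContext

# assumes the standalone master started by start-all.sh listens on spark://localhost:7077
conf = SparkConf().setMaster("spark://localhost:7077").setAppName("my-test-app")
sc = SparkContext(conf=conf)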
  8. Define Two Very Basic Python Programs
  • Python program: test.py
cat /Users/mparsian/spark-1.6.1-bin-hadoop2.6/test.py
#!/usr/bin/python

import sys

for line in sys.stdin:
    # note: 'line' still ends with a newline; print adds another one
    print "hello " + line
  • Python program: test2.py
cat /Users/mparsian/spark-1.6.1-bin-hadoop2.6/test2.py
#!/usr/bin/python

def fun2(s):
    # append " zaza" to the given string
    return s + " zaza"
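test.py is meant to be run as an external process by rdd.pipe() (it reads lines from stdin and writes results to stdout), while test2.py is imported as an ordinary Python module so that fun2 can be called inside map(). As a quick sanity check, fun2 can be tried in a plain Python 2 shell before Spark is involved:

>>> import test2
>>> test2.fun2("john")
'john zaza'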
  9. Start and Run pyspark
cd /Users/mparsian/spark-1.6.1-bin-hadoop2.6/
./bin/pyspark
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/04/04 11:18:01 INFO spark.SparkContext: Running Spark version 1.6.1
...
...
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.10 (default, Oct 23 2015 19:19:21)
SparkContext available as sc, HiveContext available as sqlContext.

>>> data = ["john","paul","george","ringo"]
>>> data
['john', 'paul', 'george', 'ringo']

>>> rdd = sc.parallelize(data)
>>> rdd.collect()
['john', 'paul', 'george', 'ringo']


>>> test = "/Users/mparsian/spark-1.6.1-bin-hadoop2.6/test.py"   # path handed to pipe(); test.py must be executable
>>> import test2   # test2.py is importable because pyspark was started from its directory


>>> pipeRDD =  rdd.pipe(test)
>>> pipeRDD.collect()
[u'hello john', u'', u'hello paul', u'', u'hello george', u'', u'hello ringo', u'']
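The empty strings show up because each record piped into test.py still ends with a newline, and the Python 2 print statement adds another one, so every input record produces an extra blank output line. If those blanks are unwanted, one option is to strip the newline inside test.py, or simply filter them out on the Spark side:

>>> cleanRDD = pipeRDD.filter(lambda s: len(s) > 0)
>>> cleanRDD.collect()
[u'hello john', u'hello paul', u'hello george', u'hello ringo']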


>>> rdd.collect()
['john', 'paul', 'george', 'ringo']


>>> rdd2 = rdd.map(lambda x : test2.fun2(x))
>>> rdd2.collect()
['john zaza', 'paul zaza', 'george zaza', 'ringo zaza']
>>>


PySpark is the Spark Python API.

Start PySpark

First, make sure the Spark cluster is running. To start it, execute:
cd $SPARK_HOME
./sbin/start-all.sh
To start PySpark, execute the following:
cd $SPARK_HOME
./bin/pyspark
Successful execution will give you the PySpark prompt:
./bin/pyspark
Python 2.7.10 (default, Aug 22 2015, 20:33:39) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/01/20 10:26:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.10 (default, Aug 22 2015 20:33:39)
SparkContext available as sc, HiveContext available as sqlContext.
>>> 

Note that the shell has already created a SparkContext object named sc, which you can use to create RDDs.
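A few attributes of sc are handy for confirming what the shell actually connected to (a quick sketch; the exact values depend on how the shell was launched):

>>> sc.version      # Spark version string, e.g. u'1.6.0'
>>> sc.master       # master URL the shell is using
>>> sc.appName      # u'PySparkShell' for the interactive shell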

Creating RDDs

You may create RDDs from in-memory data structures (collections), from files on the local file system, from HDFS, and from other data sources.

Create RDD from a Data Structure (or Collection)

  • Example-1
>>> data = [1, 2, 3, 4, 5, 8, 9]
>>> data
[1, 2, 3, 4, 5, 8, 9]
>>> myRDD = sc.parallelize(data)
>>> myRDD.collect()
[1, 2, 3, 4, 5, 8, 9]
>>> myRDD.count()
7
>>> 
  • Example-2
>>> kv = [('a',7), ('a', 2), ('b', 2), ('b',4), ('c',1), ('c',2), ('c',3), ('c',4)]
>>> kv
[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
>>> rdd2 = sc.parallelize(kv)
>>> rdd2.collect()
[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
>>>
>>> rdd3 = rdd2.reduceByKey(lambda x, y : x+y)
>>> rdd3.collect()
[('a', 9), ('c', 10), ('b', 6)]
>>> 
  • Example-3
# ./bin/pyspark 
Python 2.7.10 (default, Aug 22 2015, 20:33:39) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/01/21 16:46:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.10 (default, Aug 22 2015 20:33:39)
SparkContext available as sc, HiveContext available as sqlContext.
>>> kv = [('a',7), ('a', 2), ('b', 2), ('b',4), ('c',1), ('c',2), ('c',3), ('c',4)]

>>> kv
[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
>>> rdd2 = sc.parallelize(kv)
>>> rdd2.collect()
[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]

>>> rdd3 = rdd2.groupByKey()
>>> rdd3.collect()
[
 ('a', <pyspark.resultiterable.ResultIterable object at 0x104ec4c50>), 
 ('c', <pyspark.resultiterable.ResultIterable object at 0x104ec4cd0>), 
 ('b', <pyspark.resultiterable.ResultIterable object at 0x104ce7290>)
]

>>> rdd3.map(lambda x : (x[0], list(x[1]))).collect()
[
 ('a', [7, 2]), 
 ('c', [1, 2, 3, 4]), 
 ('b', [2, 4])
]
>>> 
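The map over (x[0], x[1]) above just materializes each ResultIterable into a list; mapValues(list) does the same thing more directly and leaves the keys untouched:

>>> rdd2.groupByKey().mapValues(list).collect()
[('a', [7, 2]), ('c', [1, 2, 3, 4]), ('b', [2, 4])]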


Create RDD from a Local File System

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
...
JavaSparkContext context = new JavaSparkContext();
...
final String inputPath ="file:///dir1/dir2/myinputfile.txt";
JavaRDD<String> rdd = context.textFile(inputPath);
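The snippet above uses the Java API; from the PySpark shell the equivalent read of a local file is a single call to sc.textFile with a file:// URI (the path here is the same illustrative placeholder used in the Java example):

>>> rdd = sc.textFile("file:///dir1/dir2/myinputfile.txt")
>>> rdd.take(5)    # first five lines of the file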

Create RDD from HDFS

  • Example-1:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
...
JavaSparkContext context = new JavaSparkContext();
...
final String inputPath ="hdfs://myhadoopserver:9000/dir1/dir2/myinputfile.txt";
JavaRDD<String> rdd = context.textFile(inputPath);
...
  • Example-2:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
...
JavaSparkContext context = new JavaSparkContext();
...
final String inputPath ="/dir1/dir2/myinputfile.txt";
JavaRDD<String> rdd = context.textFile(inputPath);
...
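Both forms work from PySpark as well: a fully qualified hdfs:// URI, or a bare path that is resolved against the cluster's default file system (the server name and paths are placeholders carried over from the Java examples):

>>> rdd1 = sc.textFile("hdfs://myhadoopserver:9000/dir1/dir2/myinputfile.txt")
>>> rdd2 = sc.textFile("/dir1/dir2/myinputfile.txt")   # resolved against the default file system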

