Download, Install, and Run PySpark
- For MacBook Users: Enable "Remote Login"
System Preferences --> Sharing --> enable "Remote Login" service
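If you prefer the command line, the same service can usually be toggled with the systemsetup utility (requires sudo; shown here as a sketch):
sudo systemsetup -setremotelogin on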
- Make Sure Java is Installed Properly
java -version
java version "1.8.0_72"
Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
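If a JDK is installed but java is not found, on macOS you can usually locate it with the java_home helper and export it (a hedged sketch; adjust for your shell):
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$JAVA_HOME/bin:$PATH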
- Download
Download the latest Spark binary from the following URL:
http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
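The URL above goes through Apache's mirror chooser; released versions are also kept in the Apache archive, so the file can be fetched from the command line as well (a sketch; the archive URL is an assumption based on Apache's usual layout):
cd /Users/mparsian
curl -O https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz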
- Open the Downloaded File
Assuming that the file has been downloaded to /Users/mparsian/spark-1.6.1-bin-hadoop2.6.tgz:
cd /Users/mparsian
tar zxvf spark-1.6.1-bin-hadoop2.6.tgz
x spark-1.6.1-bin-hadoop2.6/
x spark-1.6.1-bin-hadoop2.6/NOTICE
x spark-1.6.1-bin-hadoop2.6/CHANGES.txt
...
...
...
x spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar
x spark-1.6.1-bin-hadoop2.6/README.md
- Start the Spark Cluster
cd /Users/mparsian/spark-1.6.1-bin-hadoop2.6/
ls -l
total 2736
-rw-r--r--@ 1 mparsian 897801646 1343562 Feb 26 21:02 CHANGES.txt
-rw-r--r--@ 1 mparsian 897801646 17352 Feb 26 21:02 LICENSE
-rw-r--r--@ 1 mparsian 897801646 23529 Feb 26 21:02 NOTICE
drwxr-xr-x@ 3 mparsian 897801646 102 Feb 26 21:02 R
-rw-r--r--@ 1 mparsian 897801646 3359 Feb 26 21:02 README.md
-rw-r--r--@ 1 mparsian 897801646 120 Feb 26 21:02 RELEASE
drwxr-xr-x@ 25 mparsian 897801646 850 Feb 26 21:02 bin
drwxr-xr-x@ 9 mparsian 897801646 306 Feb 26 21:02 conf
drwxr-xr-x@ 3 mparsian 897801646 102 Feb 26 21:02 data
drwxr-xr-x@ 6 mparsian 897801646 204 Feb 26 21:02 ec2
drwxr-xr-x@ 3 mparsian 897801646 102 Feb 26 21:02 examples
drwxr-xr-x@ 8 mparsian 897801646 272 Feb 26 21:02 lib
drwxr-xr-x@ 37 mparsian 897801646 1258 Feb 26 21:02 licenses
drwxr-xr-x@ 9 mparsian 897801646 306 Feb 26 21:02 python
drwxr-xr-x@ 24 mparsian 897801646 816 Feb 26 21:02 sbin
./sbin/start-all.sh
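To shut the cluster down later, use the matching stop script:
./sbin/stop-all.sh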
- Check Master and Worker
Make sure that Master and Worker processes are running:
jps
1347 Master
1390 Worker
- Check the Spark URL
Make sure the Spark master web UI is up at:
http://localhost:8080
- Define Two Very Basic Python Programs
- Python program: test.py
cat /Users/mparsian/spark-1.6.1-bin-hadoop2.6/test.py
#!/usr/bin/python
import sys
for line in sys.stdin:
print "hello " + line
- Python program: test2.py
cat /Users/mparsian/spark-1.6.1-bin-hadoop2.6/test2.py
#!/usr/bin/python
def fun2(str):
    str2 = str + " zaza"
    return str2
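Since rdd.pipe() (used below) runs the script as an external program, make sure test.py is executable (the shebang line is already in place):
chmod +x /Users/mparsian/spark-1.6.1-bin-hadoop2.6/test.py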
- Start and Run pyspark
cd /Users/mparsian/spark-1.6.1-bin-hadoop2.6/
./bin/pyspark
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/04/04 11:18:01 INFO spark.SparkContext: Running Spark version 1.6.1
...
...
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
Using Python version 2.7.10 (default, Oct 23 2015 19:19:21)
SparkContext available as sc, HiveContext available as sqlContext.
>>> data = ["john","paul","george","ringo"]
>>> data
['john', 'paul', 'george', 'ringo']
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
['john', 'paul', 'george', 'ringo']
>>> test = "/Users/mparsian/spark-1.6.1-bin-hadoop2.6/test.py"
>>> test2 = "/Users/mparsian/spark-1.6.1-bin-hadoop2.6/test2.py"
>>> import test
>>> import test2
>>> pipeRDD = rdd.pipe(test)
>>> pipeRDD.collect()
[u'hello john', u'', u'hello paul', u'', u'hello george', u'', u'hello ringo', u'']
>>> rdd.collect()
['john', 'paul', 'george', 'ringo']
>>> rdd2 = rdd.map(lambda x : test2.fun2(x))
>>> rdd2.collect()
['john zaza', 'paul zaza', 'george zaza', 'ringo zaza']
>>>
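Two details in the session above are worth noting. First, import test2 works because pyspark was launched from the directory that contains the scripts, so they are on Python's module search path. Second, the empty strings in the pipe() output appear because each element arrives with a trailing newline and print adds another one; stripping the line inside test.py avoids them (a minimal revision of the script shown earlier):
#!/usr/bin/python
import sys
# strip the trailing newline before printing, so pipe() does not emit empty elements
for line in sys.stdin:
    print "hello " + line.strip()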
PySpark is the Spark Python API.
Start PySpark
First make sure that you have started the Spark cluster. To start Spark, you execute:
cd $SPARK_HOME
./sbin/start-all.sh
To start PySpark, execute the following:
cd $SPARK_HOME
./bin/pyspark
Successful execution will give you the PySpark prompt:
./bin/pyspark
Python 2.7.10 (default, Aug 22 2015, 20:33:39)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/01/20 10:26:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.10 (default, Aug 22 2015 20:33:39)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
Note that the shell has already created a SparkContext object (sc), which you may use to create RDDs.
Creating RDDs
You may create RDDs from in-memory data structures (collections), from files on the local file system, from HDFS, and from other data sources.
Create RDD from a Data Structure (or Collection)
- Example-1
>>> data = [1, 2, 3, 4, 5, 8, 9]
>>> data
[1, 2, 3, 4, 5, 8, 9]
>>> myRDD = sc.parallelize(data)
>>> myRDD.collect()
[1, 2, 3, 4, 5, 8, 9]
>>> myRDD.count()
7
>>>
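parallelize() also accepts an optional number of partitions, and getNumPartitions() reports how the RDD is split; a small sketch continuing Example-1:
>>> myRDD = sc.parallelize(data, 3)
>>> myRDD.getNumPartitions()
3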
- Example-2
>>> kv = [('a',7), ('a', 2), ('b', 2), ('b',4), ('c',1), ('c',2), ('c',3), ('c',4)]
>>> kv
[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
>>> rdd2 = sc.parallelize(kv)
>>> rdd2.collect()
[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
>>>
>>> rdd3 = rdd2.reduceByKey(lambda x, y : x+y)
>>> rdd3.collect()
[('a', 9), ('c', 10), ('b', 6)]
>>>
- Example-3
# ./bin/pyspark
Python 2.7.10 (default, Aug 22 2015, 20:33:39)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
16/01/21 16:46:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Python version 2.7.10 (default, Aug 22 2015 20:33:39)
SparkContext available as sc, HiveContext available as sqlContext.
>>> kv = [('a',7), ('a', 2), ('b', 2), ('b',4), ('c',1), ('c',2), ('c',3), ('c',4)]
>>> kv
[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
>>> rdd2 = sc.parallelize(kv)
>>> rdd2.collect()
[('a', 7), ('a', 2), ('b', 2), ('b', 4), ('c', 1), ('c', 2), ('c', 3), ('c', 4)]
>>> rdd3 = rdd2.groupByKey()
>>> rdd3.collect()
[
('a', <pyspark.resultiterable.ResultIterable object at 0x104ec4c50>),
('c', <pyspark.resultiterable.ResultIterable object at 0x104ec4cd0>),
('b', <pyspark.resultiterable.ResultIterable object at 0x104ce7290>)
]
>>> rdd3.map(lambda x : (x[0], list(x[1]))).collect()
[
('a', [7, 2]),
('c', [1, 2, 3, 4]),
('b', [2, 4])
]
>>>
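An equivalent and often clearer way to materialize the grouped values is mapValues(), which leaves the keys untouched (a sketch; output order may vary):
>>> rdd3.mapValues(list).collect()
[('a', [7, 2]), ('c', [1, 2, 3, 4]), ('b', [2, 4])]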
Create RDD from a Local File System
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
...
JavaSparkContext context = new JavaSparkContext();
...
final String inputPath ="file:///dir1/dir2/myinputfile.txt";
JavaRDD<String> rdd = context.textFile(inputPath);
Create RDD from HDFS
- Example-1:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
...
JavaSparkContext context = new JavaSparkContext();
...
final String inputPath ="hdfs://myhadoopserver:9000/dir1/dir2/myinputfile.txt";
JavaRDD<String> rdd = context.textFile(inputPath);
...
- Example-2:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
...
JavaSparkContext context = new JavaSparkContext();
...
final String inputPath ="/dir1/dir2/myinputfile.txt";
JavaRDD<String> rdd = context.textFile(inputPath);
...
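Again, the Java snippets translate directly to PySpark (a sketch, with the same hypothetical HDFS server and path):
>>> rdd = sc.textFile("hdfs://myhadoopserver:9000/dir1/dir2/myinputfile.txt")
>>> # or, when HDFS is the configured default file system:
>>> rdd = sc.textFile("/dir1/dir2/myinputfile.txt")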