How do I run Spark remotely from Python...
from pyspark import SparkContext
sc = SparkContext('local', 'test')
textFile = sc.textFile("hdfs://master:9000/test/word.txt")
wordCount = textFile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCount.foreach(print)
The demo above ran fine on the Spark master node. However, running the version below from my local VM to connect to the Spark on the server throws an error. The only difference is the second line, sc = SparkContext('spark://master:7077', 'test'), where 'local' is replaced with 'spark://master:7077'. The version below also fails even when run on the master node itself.
from pyspark import SparkContext
sc = SparkContext('spark://master:7077', 'test')
textFile = sc.textFile("hdfs://master:9000/test/word.txt")
wordCount = textFile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCount.foreach(print)
The error is this:
19/10/16 18:42:11 WARN Utils: Your hostname, node001 resolves to a loopback address: 127.0.1.1; using 192.168.63.127 instead (on interface ens33)
19/10/16 18:42:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/10/16 18:42:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 0:> (0 + 0) / 2]19/10/16 18:42:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Am I missing some configuration here, or is it something else...
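In case it matters, I also noticed the hostname/loopback warning at the top, so I was thinking of trying an explicit SparkConf with spark.driver.host set to the address from that warning. This is pure guesswork on my part, not something I have confirmed works:

from pyspark import SparkConf, SparkContext

# Guess: maybe the executors can't reach back to the driver, since the first warning
# says node001 resolves to the loopback 127.0.1.1.
# 192.168.63.127 is just the address Spark picked in that warning, not something I've verified.
conf = (SparkConf()
        .setMaster("spark://master:7077")
        .setAppName("test")
        .set("spark.driver.host", "192.168.63.127"))

sc = SparkContext(conf=conf)
textFile = sc.textFile("hdfs://master:9000/test/word.txt")
wordCount = (textFile.flatMap(lambda line: line.split(" "))
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b))
print(wordCount.collect())  # collect to the driver so the output is visible locally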
This is conf/spark-defaults.conf; it has only one line:
spark.master spark://master:7077
And this is spark-env.sh:
SPARK_LOCAL_IP=master
SPARK_LOCAL_DIRS=/home/spark-2.4.4-bin-hadoop2.7/tmp
export JAVA_HOME=/usr/local/java/jdk1.8.0_221
export STANDALONE_SPARK_MASTER_HOST=master
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
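Based on the "sufficient resources" part of the warning, maybe it's simply a resource problem? This is the kind of thing I would try next; the memory/core numbers below are made up, not my actual worker specs:

from pyspark import SparkConf, SparkContext

# Just a guess from the "have sufficient resources" warning:
# ask for less than the workers might have, in case the default request is too large.
conf = (SparkConf()
        .setMaster("spark://master:7077")
        .setAppName("test")
        .set("spark.executor.memory", "512m")
        .set("spark.cores.max", "1"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # trivial job, just to see whether any executor accepts it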