PySpark crashes when I run the `first` or `take` method on Windows 7
I just ran these commands:
>>> lines = sc.textFile("C:\Users\elqstux\Desktop\dtop.txt")
>>> lines.count()  # this works fine
>>> lines.first()  # this crashes
Here is the error report:
>>> lines.first()
15/11/18 17:33:35 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
15/11/18 17:33:35 INFO DAGScheduler: Got job 21 (runJob at PythonRDD.scala:393) with 1 output partitions
15/11/18 17:33:35 INFO DAGScheduler: Final stage: ResultStage 21(runJob at PythonRDD.scala:393)
15/11/18 17:33:35 INFO DAGScheduler: Parents of final stage: List()
15/11/18 17:33:35 INFO DAGScheduler: Missing parents: List()
15/11/18 17:33:35 INFO DAGScheduler: Submitting ResultStage 21 (PythonRDD[28] at RDD at PythonRDD.scala:43), which has no missing parents
15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=619446, maxMem=555755765
15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 4.7 KB, free 529.4 MB)
15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(3067) called with curMem=624270, maxMem=555755765
15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24_piece0 stored as bytes in memory (estimated size 3.0 KB, free 529.4 MB)
15/11/18 17:33:35 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on localhost:55487 (size: 3.0 KB, free: 529.9 MB)
15/11/18 17:33:35 INFO SparkContext: Created broadcast 24 from broadcast at DAGScheduler.scala:861
15/11/18 17:33:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 21 (PythonRDD[28] at RDD at PythonRDD.scala:43)
15/11/18 17:33:35 INFO TaskSchedulerImpl: Adding task set 21.0 with 1 tasks
15/11/18 17:33:35 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 33, localhost, PROCESS_LOCAL, 2148 bytes)
15/11/18 17:33:35 INFO Executor: Running task 0.0 in stage 21.0 (TID 33)
15/11/18 17:33:35 INFO HadoopRDD: Input split: file:/C:/Users/elqstux/Desktop/dtop.txt:0+112852
15/11/18 17:33:36 INFO PythonRunner: Times: total = 629, boot = 626, init = 3, finish = 0
15/11/18 17:33:36 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
15/11/18 17:33:36 ERROR PythonRunner: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
15/11/18 17:33:36 ERROR Executor: Exception in task 0.0 in stage 21.0 (TID 33)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
15/11/18 17:33:36 WARN TaskSetManager: Lost task 0.0 in stage 21.0 (TID 33, localhost): java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
15/11/18 17:33:36 ERROR TaskSetManager: Task 0 in stage 21.0 failed 1 times; aborting job
15/11/18 17:33:36 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have all completed, from pool
15/11/18 17:33:36 INFO TaskSchedulerImpl: Cancelling stage 21
15/11/18 17:33:36 INFO DAGScheduler: ResultStage 21 (runJob at PythonRDD.scala:393) failed in 0.759 s
15/11/18 17:33:36 INFO DAGScheduler: Job 21 failed: runJob at PythonRDD.scala:393, took 0.810138 s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1317, in first
    rs = self.take(1)
  File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
    return f(*a, **kw)
  File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 33, localhost): java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
        at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
        at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
When I run `take`, it also crashes. I cannot find the reason; can anyone help me?
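This is probably unrelated to the crash itself, but one thing worth double-checking in the snippet above: in a plain Python string literal, backslashes in a Windows path can be swallowed as escape sequences (in Python 3, `\U` even starts a unicode escape and makes `"C:\Users\..."` a syntax error). A raw string or forward slashes sidestep this; a minimal sketch, reusing the same path:

```python
# Backslashes in Windows paths can collide with Python escape
# sequences, so prefer a raw string (r"...") or forward slashes.
path_raw = r"C:\Users\elqstux\Desktop\dtop.txt"  # raw string keeps backslashes literal
path_fwd = "C:/Users/elqstux/Desktop/dtop.txt"   # forward slashes also work on Windows

# Both spellings name the same file, e.g.:
#   lines = sc.textFile(path_raw)
```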
Does the count return the expected value? I find that importing text files as RDDs gives results with formatting I did not expect: generally large strings that I have to format myself. – Jesse
I have the same problem. However, this error seems random: running `take` or `first` sometimes succeeds, but mostly fails. I am running on Python 3.5. – Tim
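Since the error comes from the Python worker process rather than the driver, one generic thing to rule out (an assumption, not a confirmed fix for this trace) is a mismatch between the interpreter the driver uses and the one Spark launches for workers. PySpark reads the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, so you can pin both explicitly before starting the shell; the `C:\Python34` path below is only an example:

```shell
:: Windows cmd: pin the interpreter PySpark uses for driver and
:: worker processes (paths are illustrative; use your install).
set PYSPARK_PYTHON=C:\Python34\python.exe
set PYSPARK_DRIVER_PYTHON=C:\Python34\python.exe
bin\pyspark
```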