
pyspark crashes when I run the `first` or `take` method on Windows 7.

I just ran these commands:

>>> lines = sc.textFile("C:\Users\elqstux\Desktop\dtop.txt") 

>>> lines.count()  # this works fine 

>>> lines.first()  # this crashes 
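
For context, the same steps can be packaged as a standalone script. Below is a minimal sketch of that (the local master, app name, and forward-slash path are illustrative assumptions, not part of my original setup):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "take-first-repro")

    # Forward slashes sidestep any backslash-escape ambiguity in the Windows
    # path; since count() already works, the file itself is clearly readable.
    lines = sc.textFile("C:/Users/elqstux/Desktop/dtop.txt")

    print(lines.count())   # completes normally
    print(lines.first())   # crashes with "Connection reset by peer"
    print(lines.take(1))   # take() fails the same way

    sc.stop()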

Here is the error report:

>>> lines.first() 

    15/11/18 17:33:35 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393 

    15/11/18 17:33:35 INFO DAGScheduler: Got job 21 (runJob at PythonRDD.scala:393) 
    with 1 output partitions 
    15/11/18 17:33:35 INFO DAGScheduler: Final stage: ResultStage 21(runJob at Pytho 
    nRDD.scala:393) 
    15/11/18 17:33:35 INFO DAGScheduler: Parents of final stage: List() 
    15/11/18 17:33:35 INFO DAGScheduler: Missing parents: List() 
    15/11/18 17:33:35 INFO DAGScheduler: Submitting ResultStage 21 (PythonRDD[28] at 
    RDD at PythonRDD.scala:43), which has no missing parents 
    15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=619 
    446, maxMem=555755765 
    15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24 stored as values in memor 
    y (estimated size 4.7 KB, free 529.4 MB) 
    15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(3067) called with curMem=624 
    270, maxMem=555755765 
    15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24_piece0 stored as bytes in 
    memory (estimated size 3.0 KB, free 529.4 MB) 
    15/11/18 17:33:35 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on 
    localhost:55487 (size: 3.0 KB, free: 529.9 MB) 
    15/11/18 17:33:35 INFO SparkContext: Created broadcast 24 from broadcast at DAGS 
    cheduler.scala:861 
    15/11/18 17:33:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 
    21 (PythonRDD[28] at RDD at PythonRDD.scala:43) 
    15/11/18 17:33:35 INFO TaskSchedulerImpl: Adding task set 21.0 with 1 tasks 
    15/11/18 17:33:35 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 33, 
    localhost, PROCESS_LOCAL, 2148 bytes) 
    15/11/18 17:33:35 INFO Executor: Running task 0.0 in stage 21.0 (TID 33) 
    15/11/18 17:33:35 INFO HadoopRDD: Input split: file:/C:/Users/elqstux/Desktop/dt 
    op.txt:0+112852 
    15/11/18 17:33:36 INFO PythonRunner: Times: total = 629, boot = 626, init = 3, f 
    inish = 0 
    15/11/18 17:33:36 ERROR PythonRunner: Python worker exited unexpectedly (crashed 
    ) 
    java.net.SocketException: Connection reset by peer: socket write error 
      at java.net.SocketOutputStream.socketWrite0(Native Method) 
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) 
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153) 
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82 
    ) 
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) 
      at java.io.DataOutputStream.flush(DataOutputStream.java:123) 
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3. 
    apply(PythonRDD.scala:283) 
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) 
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.s 
    cala:239) 
    15/11/18 17:33:36 ERROR PythonRunner: This may have been caused by a prior excep 
    tion: 
    java.net.SocketException: Connection reset by peer: socket write error 
      at java.net.SocketOutputStream.socketWrite0(Native Method) 
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) 
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153) 
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82 
    ) 
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) 
      at java.io.DataOutputStream.flush(DataOutputStream.java:123) 
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3. 
    apply(PythonRDD.scala:283) 
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) 
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.s 
    cala:239) 
    15/11/18 17:33:36 ERROR Executor: Exception in task 0.0 in stage 21.0 (TID 33) 
    java.net.SocketException: Connection reset by peer: socket write error 
      at java.net.SocketOutputStream.socketWrite0(Native Method) 
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) 
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153) 
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82 
    ) 
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) 
      at java.io.DataOutputStream.flush(DataOutputStream.java:123) 
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3. 
    apply(PythonRDD.scala:283) 
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) 
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.s 
    cala:239) 
    15/11/18 17:33:36 WARN TaskSetManager: Lost task 0.0 in stage 21.0 (TID 33, loca 
    lhost): java.net.SocketException: Connection reset by peer: socket write error 
      at java.net.SocketOutputStream.socketWrite0(Native Method) 
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) 
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153) 
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82 
    ) 
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) 
      at java.io.DataOutputStream.flush(DataOutputStream.java:123) 
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3. 
    apply(PythonRDD.scala:283) 
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) 
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.s 
    cala:239) 

    15/11/18 17:33:36 ERROR TaskSetManager: Task 0 in stage 21.0 failed 1 times; abo 
    rting job 
    15/11/18 17:33:36 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have 
    all completed, from pool 
    15/11/18 17:33:36 INFO TaskSchedulerImpl: Cancelling stage 21 
    15/11/18 17:33:36 INFO DAGScheduler: ResultStage 21 (runJob at PythonRDD.scala:3 
    93) failed in 0.759 s 
    15/11/18 17:33:36 INFO DAGScheduler: Job 21 failed: runJob at PythonRDD.scala:39 
    3, took 0.810138 s 
    Traceback (most recent call last): 
     File "<stdin>", line 1, in <module> 
     File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1317, in first 

     rs = self.take(1) 
     File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take 
     res = self.context.runJob(self, takeUpToNumLeft, p) 
     File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in ru 
    nJob 
     port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partition 
    s) 
     File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_g 
    ateway.py", line 538, in __call__ 
     File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in d 
    eco 
     return f(*a, **kw) 
     File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protoc 
    ol.py", line 300, in get_return_value 
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark. 
    api.python.PythonRDD.runJob. 
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in s 
    tage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 
    33, localhost): java.net.SocketException: Connection reset by peer: socket write 
    error 
      at java.net.SocketOutputStream.socketWrite0(Native Method) 
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) 
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153) 
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82 
    ) 
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) 
      at java.io.DataOutputStream.flush(DataOutputStream.java:123) 
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3. 
    apply(PythonRDD.scala:283) 
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) 
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.s 
    cala:239) 

    Driver stacktrace: 
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DA 
    GScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283) 
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D 
    AGScheduler.scala:1271) 
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(D 
    AGScheduler.scala:1270) 
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray. 
    scala:59) 
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala 
    :1270) 
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$ 
    1.apply(DAGScheduler.scala:697) 
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$ 
    1.apply(DAGScheduler.scala:697) 
      at scala.Option.foreach(Option.scala:236) 
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGSchedu 
    ler.scala:697) 
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(D 
    AGScheduler.scala:1496) 
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG 
    Scheduler.scala:1458) 
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAG 
    Scheduler.scala:1447) 
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567 
    ) 
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) 
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) 
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850) 
      at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393) 
      at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) 
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl. 
    java:62) 
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces 
    sorImpl.java:43) 
      at java.lang.reflect.Method.invoke(Method.java:483) 
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) 
      at py4j.Gateway.invoke(Gateway.java:259) 
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
      at py4j.commands.CallCommand.execute(CallCommand.java:79) 
      at py4j.GatewayConnection.run(GatewayConnection.java:207) 
      at java.lang.Thread.run(Thread.java:745) 
    Caused by: java.net.SocketException: Connection reset by peer: socket write erro 
    r 
      at java.net.SocketOutputStream.socketWrite0(Native Method) 
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) 
      at java.net.SocketOutputStream.write(SocketOutputStream.java:153) 
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82 
    ) 
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) 
      at java.io.DataOutputStream.flush(DataOutputStream.java:123) 
      at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3. 
    apply(PythonRDD.scala:283) 
      at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) 
      at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.s 
    cala:239) 

When I run `take`, it also crashes. I can't find the reason; can anyone help me?


Does `count` return the expected value? I find that importing text files as RDDs gives results formatted in ways I didn't expect, generally large strings that I have to format myself. – Jesse


I have the same problem, but the error seems random: running `take` or `first` sometimes succeeds, but mostly fails. I'm running Python 3.5. – Tim

Answer


I was stuck for hours with the same problem, on Windows 7 with Spark 1.5.0 (Python 2.7.11). I only solved it by switching to Unix, using exactly the same build. It's not an elegant solution, but I haven't found any other way to fix it.
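
For reference, "the same build on Unix" simply means copying the file over and re-running the identical sequence on the Linux side. A minimal sketch of that check (the Unix path and local master are illustrative assumptions):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "take-first-check")

    # The same dtop.txt copied to an illustrative Unix path.
    lines = sc.textFile("/home/elqstux/dtop.txt")

    print(lines.count())
    print(lines.first())   # succeeds here, unlike on Windows 7
    print(lines.take(1))

    sc.stop()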
