
Using S3 (Frankfurt) with Spark

Is anyone using S3 in Frankfurt with Hadoop/Spark 1.6.0?

I am trying to store the result of a job on S3. My dependencies are declared as follows:

"org.apache.spark" %% "spark-core" % "1.6.0" exclude("org.apache.hadoop", "hadoop-client"), 
"org.apache.spark" %% "spark-sql" % "1.6.0", 
"org.apache.hadoop" % "hadoop-client" % "2.7.2", 
"org.apache.hadoop" % "hadoop-aws" % "2.7.2" 

I have set the following configuration:

System.setProperty("com.amazonaws.services.s3.enableV4", "true") 
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")

When I call saveAsTextFile on my RDD it starts fine, saving everything to S3. However, after some time, when it moves the data from _temporary to the final output, it produces this error:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXXXXXXXXX, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method., S3 Extended Request ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX= 
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798) 
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421) 
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232) 
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528) 
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1507) 
at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:143) 
at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:131) 
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.copy(CopyMonitor.java:189) 
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:134) 
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:46) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 

If I use the hadoop-client bundled with the Spark package, it does not even start the transfer. The error occurs randomly; sometimes it works and sometimes it does not.
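
For reference, the write itself is a plain saveAsTextFile against an s3a:// URI; a minimal sketch of the kind of call described above (the bucket name and output prefix are placeholders):

// Minimal sketch of the failing write; "my-bucket" and "output" are placeholders.
val result = sc.parallelize(Seq("a", "b", "c"))
// Spark first writes part files under .../_temporary and then copies them to the
// final prefix; the 403 above shows up during that copy step.
result.saveAsTextFile("s3a://my-bucket/output")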


Looks like a problem with your SSH key. Could you check that you are using the right key? – user1314742


The data starts saving to S3, and after some time the error occurs. – flaviotruzzi


@flaviotruzzi Did you solve this problem? – pangpang

Answers


Please try setting the values below:

System.setProperty("com.amazonaws.services.s3.enableV4", "true") 
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 
hadoopConf.set("com.amazonaws.services.s3.enableV4", "true") 
hadoopConf.set("fs.s3a.endpoint", "s3." + region + ".amazonaws.com") 

Set the region where that bucket is located; in my case it was eu-central-1.
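
If it helps, here is a minimal Scala sketch of where these settings go, assuming an existing SparkContext named sc (hadoopConf above is just its Hadoop configuration):

// Minimal sketch, assuming an existing SparkContext named sc.
System.setProperty("com.amazonaws.services.s3.enableV4", "true")

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("com.amazonaws.services.s3.enableV4", "true")
hadoopConf.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com") // Frankfurt (eu-central-1)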

Then add the dependency, in Gradle or in some other way:

dependencies { 
    compile 'org.apache.hadoop:hadoop-aws:2.7.2' 
} 
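
Since the question declares its dependencies with sbt, the equivalent there would presumably be:

// sbt equivalent of the Gradle dependency above (same artifact and version as in the question)
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.2"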

Hope it helps.


In case you are using pyspark, the following worked for me:

aws_profile = "your_profile" 
aws_region = "eu-central-1" 
s3_bucket = "your_bucket" 

import os

# see https://github.com/jupyter/docker-stacks/issues/127#issuecomment-214594895
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"

# If this doesn't work you might have to delete your ~/.ivy2 directory to reset your package cache. 
# (see https://github.com/databricks/spark-redshift/issues/244#issuecomment-239950148) 
import pyspark 
sc=pyspark.SparkContext() 
# see https://github.com/databricks/spark-redshift/issues/298#issuecomment-271834485 
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true") 

# see https://stackoverflow.com/questions/28844631/how-to-set-hadoop-configuration-values-from-pyspark 
hadoop_conf=sc._jsc.hadoopConfiguration() 
# see https://stackoverflow.com/questions/43454117/how-do-you-use-s3a-with-spark-2-1-0-on-aws-us-east-2 
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") 
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true") 
hadoop_conf.set("fs.s3a.access.key", access_id) 
hadoop_conf.set("fs.s3a.secret.key", access_key) 

# see https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region 
hadoop_conf.set("fs.s3a.endpoint", "s3." + aws_region + ".amazonaws.com") 

sql = pyspark.sql.SparkSession(sc)
path = s3_bucket + "/your_file_on_s3"
dataS3 = sql.read.parquet("s3a://" + path)