Jupyter & PySpark: how to run multiple notebooks

I'm using Spark 1.6.0 on three VMs: 1x master (standalone) and 2x workers with 8G RAM and 2 CPUs each.

I'm using the kernel configuration below:

{ 
"display_name": "PySpark ", 
"language": "python3", 
"argv": [ 
    "/usr/bin/python3", 
    "-m", 
    "IPython.kernel", 
    "-f", 
    "{connection_file}" 
], 
"env": { 
    "SPARK_HOME": "<mypath>/spark-1.6.0", 
    "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py", 
    "PYSPARK_SUBMIT_ARGS": "--master spark://<mymaster>:7077 --conf spark.executor.memory=2G pyspark-shell --driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar" 
} 
}

Currently, this works. I can use the Spark context sc & sqlContext without importing anything, as in the pyspark shell.

The problem arises when I use multiple notebooks: on my Spark master I see two 'pyspark-shell' apps, which kind of makes sense, but only one can run at a time. Here "running" does not mean executing anything: even if I don't run anything in a notebook, it is still shown as "running". As a result, I can't share my resources between notebooks, which is quite sad (at the moment I have to kill the first shell (= notebook kernel) to run the second).

If you have any idea how to do this, tell me! Also, I'm not sure the way I'm working with kernels is "best practice"; I already had trouble setting up Spark & Jupyter to work together.

Thx all


2016-03-30 pltrdy

@AlbertoBonsanto how will that be able to solve concurrency issues? :) – eliasah

@eliasah that's for sure. Still nice to get some advice :p – pltrdy

Are you trying to share the SparkContext? – eliasah

The problem is the database Spark uses to store the metastore (Derby). Derby is a lightweight database system and can serve only one Spark instance at a time. The solution is to set up another database system that can handle multiple instances (postgres, mysql...).

For example, you can use a postgres DB.

Add the postgres jar to spark/jars
Add a config file (hive-site.xml) to spark conf
Install postgres on your machine
Add a user, password and db for spark/hive in postgres (depends on the values in your hive-site.xml)

Example on a linux shell:

# download postgres jar 
wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar 

# install the postgres server on your machine
# (Debian/Ubuntu shown as an assumption; use your distro's package manager)
sudo apt-get install postgresql

# add user, pass and db to postgres 
psql -d postgres -c "create user hive" 
psql -d postgres -c "alter user hive with password 'pass'" 
psql -d postgres -c "create database hive_metastore" 
psql -d postgres -c "grant all privileges on database hive_metastore to hive"

hive-site.xml:

<configuration> 

<property> 
    <name>javax.jdo.option.ConnectionURL</name> 
    <value>jdbc:postgresql://localhost:5432/hive_metastore</value> 
</property> 

<property> 
    <name>javax.jdo.option.ConnectionDriverName</name> 
    <value>org.postgresql.Driver</value> 
</property> 

<property> 
    <name>javax.jdo.option.ConnectionUserName</name> 
    <value>hive</value> 
</property> 

<property> 
    <name>javax.jdo.option.ConnectionPassword</name> 
    <value>pass</value> 
</property> 

</configuration>
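
If you would rather generate this file from a script than edit it by hand, here is a minimal Python sketch. It only reuses the four property names and values from the hive-site.xml above; the helper names `build_hive_site` and `read_hive_site` are made up for illustration and are not part of Spark or Hive.

```python
import xml.etree.ElementTree as ET

# Values copied from the hive-site.xml above; adjust them to your own setup.
METASTORE_PROPERTIES = {
    "javax.jdo.option.ConnectionURL": "jdbc:postgresql://localhost:5432/hive_metastore",
    "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "pass",
}


def build_hive_site(properties):
    """Return hive-site.xml content as a string (hypothetical helper)."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def read_hive_site(xml_text):
    """Parse hive-site.xml content back into a dict, e.g. to double-check values."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


if __name__ == "__main__":
    print(build_hive_site(METASTORE_PROPERTIES))
```

The resulting string can then be saved as hive-site.xml in the Spark conf directory, and the user, password and database name should match what was created with psql.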


2017-11-21 10:41:32 pcc
