BeautifulSoup .text method returns text without separators (\n, \r, etc.)

I am trying to parse song lyrics from the biggest Russian lyrics site, http://amalgama-lab.com, and save the lyrics (translated and original) to the audio list in my Vkontakte account (unfortunately, Amalgama has no API).

import urllib
from BeautifulSoup import BeautifulSoup
import vkontakte

vk = vkontakte.API(token=<SECRET_TOKEN>)
audios = vk.getAudios(count='2')
# Example item:
#{u'artist': u'The Beatles', u'url': u'http://cs4519.vkontakte.ru/u4665445/audio/4241af71a888.mp3', u'title': u'Yesterday', u'lyrics_id': u'2365986', u'duration': 130, u'aid': 166194990, u'owner_id': 173505924}
base_url = 'http://amalgama.mobi/songs/'
for i in audios:
    print i['artist']
    # Build the URL from the base on every iteration; skip a leading 'The ' in the artist name
    if i['artist'].startswith('The '):
        url = base_url + i['artist'][4:5] + '/' + i['artist'][4:].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
    else:
        url = base_url + i['artist'][:1] + '/' + i['artist'].replace(' ', '_') + '/' + i['title'].replace(' ', '_') + '.html'
    url = url.lower()
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
    texts = soup.findAll('ol')
    if len(texts) != 0:
        en = texts[0].text #this!
        ru = texts[1].text #this!
        vk.get('audio.edit', aid=i['aid'], oid=i['owner_id'], artist=i['artist'], title=i['title'], text=ru, no_search=0)

but .text returns a string without separators:

"Yesterday, all my troubles seemed so far awayNow it looks as though they're here to stayOh, I believe in yesterdaySuddenly, ..."

That is the main problem. So, what is the best way to save the lyrics in this format:

Lyrics line 1 (original)

Lyrics line 1 (translated)

Lyrics line 2 (original)

Lyrics line 2 (translated)

Lyrics line 3 (original)

Lyrics line 3 (translated)

...

? All I get is messy code. Thanks.


2012-08-25 just so

Please provide a link to the actual page you are parsing. – BrenBarn

Example: http://amalgama.mobi/songs/b/beatles/yesterday.html –

Note that there are *no* newlines in the lyrics text, only '<br />' tags, which the OP is stripping. –

You can do this:

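from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question

soup = BeautifulSoup(html)  # html: the page source as a string
ols = soup.findAll('ol') # for the two languages

for ol in ols:
    ps = ol.findAll('p')
    for p in ps:
        for item in p.contents:
            # skip the <br /> tags and print each remaining piece on its own line
            if str(item) != '<br />':
                print str(item)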
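
To get the alternating layout asked for in the question, the lines of each language can be collected into two lists and merged pairwise. The following is only a sketch: the function name `interleave` and the sample lists are made up for illustration, and it assumes both lists hold the same number of lines (`zip` silently stops at the shorter one).

```python
def interleave(original_lines, translated_lines):
    """Alternate original and translated lines, one pair per lyric line."""
    merged = []
    for original, translated in zip(original_lines, translated_lines):
        merged.append(original)
        merged.append(translated)
    return "\n".join(merged)


# Hypothetical sample data standing in for the lines collected from each <ol>.
en_lines = ["Lyrics line 1 (original)", "Lyrics line 2 (original)"]
ru_lines = ["Lyrics line 1 (translated)", "Lyrics line 2 (translated)"]

result = interleave(en_lines, ru_lines)
print(result)
```

The resulting string could then be passed as the `text` argument in place of `ru`.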


2012-08-25 18:19:26 Nasir

I suggest you look into the .strings generator found in Beautiful Soup 4.
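
A minimal sketch of that idea, assuming Beautiful Soup 4 is installed and using made-up markup in place of the real page: `.stripped_strings` is the whitespace-trimming sibling of `.strings`, and it yields each text fragment separately, so the `<br>` boundaries are preserved as list boundaries.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a lyrics paragraph with <br/> line breaks.
html = "<ol><p>First line<br/>Second line<br/>Third line</p></ol>"
soup = BeautifulSoup(html, "html.parser")

# Each text node between tags comes out as its own string.
lines = list(soup.stripped_strings)
print(lines)
```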


2012-08-26 03:18:56

Try the separator parameter of the get_text method:

from bs4 import BeautifulSoup
html = '''<p> Hi. This is a simple example.<br>Yet powerful one. <p>'''
soup = BeautifulSoup(html, 'html.parser')
soup.get_text()
# Output: u' Hi. This is a simple example.Yet powerful one. '
soup.get_text(separator=' ')
# Output: u' Hi. This is a simple example. Yet powerful one. '
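
For lyrics, a newline is probably a more useful separator than a space. The sketch below uses made-up markup rather than the real Amalgama page, and assumes Beautiful Soup 4 with the built-in `html.parser`; the `strip=True` argument trims whitespace around each fragment before joining.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a lyrics paragraph with <br> line breaks.
html = "<p>First line<br>Second line<br>Third line</p>"
soup = BeautifulSoup(html, "html.parser")

# Join the text fragments with newlines instead of concatenating them.
text = soup.get_text(separator="\n", strip=True)
lines = text.split("\n")
print(text)
```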


2016-11-02 15:14:05
