Python, lxml and removing the outer tag when using lxml.html.tostring(el)

I am using the code below to get all the HTML content of a section so I can save it to a database:

el = doc.get_element_by_id('productDescription') 
lxml.html.tostring(el)

The product description is wrapped in a tag that looks like this:

<div id='productDescription'> 

    <THE HTML CODE I WANT> 

</div>

The code works great and gives me all the HTML, but how do I remove the outer layer, i.e. the <div id='productDescription'> and the closing </div> tag?


2012-02-14 Tampa

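You could convert each child to a string individually:

# el.text is None when there is no text before the first child
text = el.text or ''
# encoding='unicode' makes tostring return str instead of bytes
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())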

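Or, in an even hackier way:

el.attrib.clear()
el.tag = '|||'
# encoding='unicode' returns str; with_tail=False keeps trailing text out of the result
text = lxml.html.tostring(el, encoding='unicode', with_tail=False)
open_tag, close_tag = '<' + el.tag + '>', '</' + el.tag + '>'
assert text.startswith(open_tag) and text.endswith(close_tag)
text = text[len(open_tag):-len(close_tag)]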


2012-02-14 19:24:56 jfs

If the productDescription div contains mixed text/element content, for example

<div id='productDescription'> 
    the 
    <b> html code </b> 
    i want 
</div>

you can get the content (as a string) by traversing xpath('node()'):

s = ''
for node in el.xpath('node()'):
    # text nodes are str subclasses (basestring in Python 2, str in Python 3)
    if isinstance(node, str):
        s += node
    else:
        s += lxml.html.tostring(node, with_tail=False, encoding='unicode')


2012-02-15 14:07:32 mykhal

What is 'basestring'? – nHaskins

Here is a function that does what you want.

import lxml.etree

def strip_outer(xml):
    """ 
    >>> xml = '''<math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML   http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd"> 
    ... <mrow> 
    ...  <msup> 
    ...  <mi>x</mi> 
    ...  <mn>2</mn> 
    ...  </msup> 
    ...  <mo> + </mo> 
    ...  <mi>x</mi> 
    ... </mrow> 
    ... </math>''' 
    >>> so = strip_outer(xml) 
    >>> so.splitlines()[0]=='<mrow>' 
    True 

    """ 
    xml = xml.replace('xmlns=','xmlns:x=')#lxml fails with xmlns= attribute 
    xml = '<root>\n'+xml+'\n</root>'#...and it can't strip the root element 
    rx = lxml.etree.XML(xml) 
    lxml.etree.strip_tags(rx,'math')#strip <math with all attributes 
    uc=lxml.etree.tounicode(rx) 
    uc=u'\n'.join(uc.splitlines()[1:-1])#remove temporary <root> again 
    return uc.strip()
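The core of this answer is lxml.etree.strip_tags, which removes an element's start and end tags while keeping its text and children. A minimal sketch of that behavior follows, using a made-up input without namespaces; the temporary <root> wrapper is there because, as the answer notes, the root element itself cannot be stripped.

```python
from lxml import etree

# Hypothetical minimal input: a <math> wrapper inside a temporary <root>.
root = etree.XML('<root><math><mrow><mi>x</mi></mrow></math></root>')

# strip_tags() drops the <math> tags and attributes but keeps its
# children and text, merging them into the parent element.
etree.strip_tags(root, 'math')

result = etree.tostring(root, encoding='unicode')
print(result)  # <root><mrow><mi>x</mi></mrow></root>
```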


2013-04-20 16:22:12

Use a regexp.

import re
from lxml.html import tostring

def strip_outer_tag(html_fragment):
    outer_tag = re.compile(r'^<[^>]+>(.*?)</[^>]+>$', re.DOTALL)
    return outer_tag.search(html_fragment).group(1)

# encoding='unicode' makes tostring return str rather than bytes
html_fragment = strip_outer_tag(tostring(el, encoding='unicode'))


2017-04-02 00:52:57 bl79
