How to parse only the text from HTML

16

From the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/attributes-text-html

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"; 
Document doc = Jsoup.parse(html); 
String text = doc.body().text(); // "An example link."


2010-08-17 22:13:45

+0

How do you exclude invisible elements (e.g. display: none)? – Ehsan

0

Well, here is a quick method I threw together once. It uses regular expressions to get the job done. Most people will agree that this is not a good way to do it, so use it at your own risk.

public static String getPlainText(String html) { 
    String htmlBody = html.replaceAll("<hr>", ""); // one off for horizontal rule lines 
    String plainTextBody = htmlBody.replaceAll("<[^<>]+>([^<>]*)<[^<>]+>", "$1"); 
    plainTextBody = plainTextBody.replaceAll("<br ?/>", ""); 
    return decodeHtml(plainTextBody); // decodeHtml is the author's own helper, not shown in this answer
}

This was originally used in my API wrapper for the Stack Overflow API, so it has only been tested against a small subset of HTML tags.
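The method above calls decodeHtml, which the answer does not show. As a rough sketch only, assuming the helper's job is to turn HTML entities back into characters, something like the following could stand in for it. The class name HtmlEntities and the short list of named entities are my own assumptions, not the original implementation, and real HTML defines many more entities than this.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the decodeHtml helper; not the author's code.
class HtmlEntities {
    // Matches &#xHH; (hex), &#DD; (decimal) and &name; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[a-zA-Z]+);");

    // Deliberately small table; extend as needed.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
    }

    // Decodes each reference exactly once, in a single left-to-right pass,
    // so "&amp;lt;" becomes "&lt;" rather than "<".
    static String decodeHtml(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            out.append(decodeOne(m.group(1), m.group()));
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    // Returns the decoded character(s), or the original text if unknown or invalid.
    private static String decodeOne(String body, String original) {
        if (body.charAt(0) != '#') {
            String named = NAMED.get(body);
            return named != null ? named : original;
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint = hex ? Integer.parseInt(body.substring(2), 16)
                            : Integer.parseInt(body.substring(1));
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
    }
}
```

Unknown names such as &bogus; are left untouched rather than guessed at, which seemed the safer default for a sketch.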


2010-08-17 22:15:07 jjnguy

+0

Hmmm... why don't you use the simple regular expression: 'replaceAll("<[^>]+>", "")'? – Crozin

+0

@Crozin, well, I was teaching myself how to use back-references. It looks like yours would probably work too. – jjnguy

+0

This hurts! -> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – sleeplessnerd
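For reference, the single-pattern version suggested in the comments can be sketched as below. The class name TagStripper is only an illustration. The caveat from the linked question still applies: a regex does not understand HTML, so the contents of script and style elements are kept as text, and a '>' inside an attribute value will cut a tag short.

```java
// Illustrative wrapper around the regex from the comments; not a real HTML parser.
class TagStripper {
    // Deletes every run that starts with '<' and ends at the next '>'.
    // Text between tags is left exactly as it was (entities are not decoded).
    static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(stripTags(html)); // prints: An example link.
    }
}
```

For anything beyond small, well-formed snippets, a real parser such as the jsoup or JDK approaches in the other answers is probably the safer choice.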

1

Using classes that are part of the JDK:

import java.io.*; 
import java.net.*; 
import javax.swing.text.*; 
import javax.swing.text.html.*; 

class GetHTMLText
{
    public static void main(String[] args)
        throws Exception
    {
        EditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // The Document class does not yet handle charsets properly.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Create a reader on the HTML content.
        Reader rd = getReader(args[0]);

        // Parse the HTML.
        kit.read(rd, doc, 0);

        // The text of the HTML is now stored in the document.
        System.out.println(doc.getText(0, doc.getLength()));
    }

    // Returns a reader on the HTML data. If 'uri' begins
    // with "http:" or "https:", it's treated as a URL; otherwise,
    // it's assumed to be a local filename.
    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from the Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
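If the HTML is already in memory, as in the question, the same kit can read from a StringReader instead of a URL or file. This is a minimal sketch under that assumption. The class and method names (HtmlStringText, textOf) are mine, and the exact whitespace of the result depends on how the Swing parser lays out the document, so the result is trimmed here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

// Illustrative variant of the answer above that parses an in-memory string.
class HtmlStringText
{
    static String textOf(String html)
    {
        try
        {
            EditorKit kit = new HTMLEditorKit();
            Document doc = kit.createDefaultDocument();

            // Same workaround as in the answer: ignore any charset directive.
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

            // Parse the markup straight from the string.
            kit.read(new StringReader(html), doc, 0);

            // The document now holds only the text content.
            return doc.getText(0, doc.getLength()).trim();
        }
        catch (IOException | BadLocationException e)
        {
            // Wrapped so callers do not have to declare checked exceptions.
            throw new IllegalStateException("Could not extract text", e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(textOf("<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"));
    }
}
```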


2010-08-17 23:14:11 camickr
