Extracting text from HTML in Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.

I want to extract the information that sits between the paragraph tags, but I can only get one line of the paragraph. My code is as follows:

FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;

while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        try {
            out.write(s);
        } catch (IOException e) {
        }
    }
}

I tried adding another while loop that tells the program to keep writing to the file until the line contains the </p> tag:

while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        while (!s.contains("</p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
}

But this doesn't work. Could someone help me?

2009-09-06 Anonymous

Surely we're hitting a bug in SO's escaping of HTML tags. – Yishai

Are you quoting them as code with backticks? – pjp

HTML parsers exist, and there are many of them.

Try this (if you don't want to use an HTML parser library):


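FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;
int writeTo = 0; // 1 while we are inside a <p> ... </p> block

while ((s = buffRd.readLine()) != null)
{
    if (s.contains("<p>"))
    {
        // Opening tag: start copying and write this line.
        // Caveat: this line is written again by one of the branches below.
        writeTo = 1;

        try
        {
            out.write(s);
        }
        catch (IOException e)
        {
        }
    }
    if (s.contains("</p>"))
    {
        // Closing tag: write this line and stop copying.
        writeTo = 0;

        try
        {
            out.write(s);
        }
        catch (IOException e)
        {
        }
    }
    else if (writeTo == 1)
    {
        // No closing tag on this line, but we are inside a paragraph.
        try
        {
            out.write(s);
        }
        catch (IOException e)
        {
        }
    }
}
out.close(); // flush the buffered output to newFile.txt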

2009-09-06 17:02:04 Niall

What happens if the '<p>' and '</p>' are on the same line? In that case the string will be written twice. I guess it really depends on the input. – pjp

You could add some state to check whether the line has already been written before writing it again. – pjp
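
The state idea from the comments could look roughly like the sketch below. It is not the answerer's code: the class name ParagraphLines and the method extract are made up for illustration, and it only recognises the literal tags <p> and </p> (no attributes, no upper case), just like the snippets above. Each line gets a single write decision, so a line that holds both tags cannot be emitted twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not taken from the answer above.
class ParagraphLines {

    // Returns each line that belongs to a <p> ... </p> block, exactly once.
    static List<String> extract(List<String> lines) {
        List<String> kept = new ArrayList<>();
        boolean inParagraph = false;
        for (String line : lines) {
            int lastOpen = line.lastIndexOf("<p>");
            int lastClose = line.lastIndexOf("</p>");

            // One decision per line, so no line can be written twice.
            if (inParagraph || lastOpen >= 0) {
                kept.add(line);
            }

            // Whichever tag comes last on the line decides the state
            // that is carried over to the next line.
            if (lastOpen >= 0 || lastClose >= 0) {
                inParagraph = lastOpen > lastClose;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "<html>",
                "<p>one line</p>",
                "<div>skip me</div>",
                "<p>first",
                "second",
                "third</p>",
                "</html>");
        for (String line : extract(input)) {
            System.out.println(line);
        }
    }
}
```

In the program from the question, the lines would come from buffRd.readLine() instead of a list, and each kept line would go to out.write(line) followed by out.newLine().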

Jericho is one of several possible HTML parsers that could make this task both easy and safe.

2009-09-06 17:02:08

JTidy can represent an HTML document (even a malformed one) as a document model, which makes extracting the contents of a <p> tag rather more elegant than manually plodding through the raw text.

2009-09-06 17:08:15 skaffman

Yes, it's better to avoid trying to parse HTML by hand. – pjp

-2

You could just use the wrong tool for the job:

perl -ne "print if m|<p>| .. m|</p>|" infile.txt >outfile.txt

2009-09-06 17:14:50 brianary

-1: wrong answer for the question

That's a fair cop. A bit late, though. – brianary

Late hits go both ways :)

I've had success using TagSoup & XPath to parse HTML.

http://home.ccil.org/~cowan/XML/tagsoup/
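
The XPath half of that approach might look something like the sketch below. It is not the answerer's code. TagSoup is not part of the JDK, so this sketch swaps in the JDK's own XML parser, which means the input here has to be well-formed XHTML; the point of TagSoup is to supply the parse step for messy real-world HTML. The class name XPathParagraphs is made up, and if the parsed document is namespaced, the plain //p expression would probably need adjusting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration. The JDK XML parser stands in for
// TagSoup, so the input must be well-formed.
class XPathParagraphs {

    // Returns the text content of every <p> element in the document.
    static List<String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // "//p" selects every p element, at any depth.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//p", doc, XPathConstants.NODESET);

            List<String> paragraphs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() also includes the text of nested tags.
                paragraphs.add(nodes.item(i).getTextContent().trim());
            }
            return paragraphs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1>"
                + "<p>First <b>bold</b> text.</p><p>Second.</p></body></html>";
        for (String paragraph : extract(xhtml)) {
            System.out.println(paragraph);
        }
    }
}
```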

2009-09-06 17:32:18

Use a ParserCallback. It's a simple class that is included with the JDK. It notifies you every time a new tag is found, and you can then extract the tag's text. Simple example:

import java.io.*; 
import java.net.*; 
import javax.swing.text.*; 
import javax.swing.text.html.*; 
import javax.swing.text.html.parser.*; 

public class ParserCallbackTest extends HTMLEditorKit.ParserCallback 
{ 
    private int tabLevel = 1; 
    private int line = 1; 

    public void handleComment(char[] data, int pos) 
    { 
     displayData(new String(data)); 
    } 

    public void handleEndOfLineString(String eol) 
    { 
     System.out.println(line++); 
    } 

    public void handleEndTag(HTML.Tag tag, int pos) 
    { 
     tabLevel--; 
     displayData("/" + tag); 
    } 

    public void handleError(String errorMsg, int pos) 
    { 
     displayData(pos + ":" + errorMsg); 
    } 

    public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos) 
    { 
     displayData("mutable:" + tag + ": " + pos + ": " + a); 
    } 

    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos) 
    { 
     displayData(tag + "::" + a); 
//  tabLevel++; 
    } 

    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) 
    { 
     displayData(tag + ":" + a); 
     tabLevel++; 
    } 

    public void handleText(char[] data, int pos) 
    { 
     displayData(new String(data)); 
    } 

    private void displayData(String text) 
    { 
     for (int i = 0; i < tabLevel; i++) 
      System.out.print("\t"); 

     System.out.println(text); 
    } 

    public static void main(String[] args) 
    throws IOException 
    { 
     ParserCallbackTest parser = new ParserCallbackTest(); 

     // args[0] is the file to parse 

     Reader reader = new FileReader(args[0]); 
//  URLConnection conn = new URL(args[0]).openConnection(); 
//  Reader reader = new InputStreamReader(conn.getInputStream()); 

     try 
     { 
      new ParserDelegator().parse(reader, parser, true); 
     } 
     catch (IOException e) 
     { 
      System.out.println(e); 
     } 
    } 
}

So all you need to do is set a boolean flag when the paragraph tag is found. Then, in the handleText() method, extract the text.
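
A minimal sketch of that flag idea, for what it's worth. The class name ParagraphTextCallback and the method extract are made up, and instead of a bare boolean it keeps a StringBuilder that is non-null only while the parser is inside a <p> element, so the text of each paragraph can be collected in one go. It relies only on the JDK classes already used in the example above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; collects the text of <p> elements.
class ParagraphTextCallback extends HTMLEditorKit.ParserCallback {
    private final List<String> paragraphs = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <p>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        if (tag == HTML.Tag.P) {
            current = new StringBuilder(); // the "flag" is switched on
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (current != null) {
            current.append(data); // keep text only while inside a paragraph
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && current != null) {
            paragraphs.add(current.toString());
            current = null; // the "flag" is switched off
        }
    }

    // Parses the HTML from the reader and returns one string per paragraph.
    static List<String> extract(Reader reader) {
        ParagraphTextCallback callback = new ParagraphTextCallback();
        try {
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head><body>"
                + "<h1>Heading</h1><p>First paragraph.</p>"
                + "<p>Second paragraph.</p></body></html>";
        for (String paragraph : extract(new StringReader(html))) {
            System.out.println(paragraph);
        }
    }
}
```

To read from a file as in the question, pass a FileReader to extract() and write each returned string with out.write(...) and out.newLine().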

2009-09-06 22:04:22 camickr

jsoup

Another HTML parser I liked using was jsoup. You can get all the <p> elements in 2 lines of code.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); 
Elements ps = doc.select("p");

then write them to a file in one more line

out.write(ps.text()); //it will append all of the p elements together in one long string

or, if you want them on separate lines, you can iterate over the elements and write each one separately.

2012-04-23 14:04:39 Danny

If a document doesn't use 'p' tags (non-semantic markup), I assume this won't work.

@sinθ The question explicitly asked about 'p' elements. This answer is correct.

Thanks @Danny, I ♥ this soup!

Try this.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is arbitrary; the original answer showed only main().
public class ParagraphPrinter
{
    public static void main(String[] args)
    {
        String url = "http://en.wikipedia.org/wiki/Big_data";

        try {
            Document document = Jsoup.connect(url).get();
            Elements paragraphs = document.select("p");

            // Print the text of every <p> element, one per line.
            // Iterating directly also copes with a page that has no paragraphs.
            for (Element p : paragraphs) {
                System.out.println("* " + p.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

2013-06-20 05:33:12 Consultant

What are these 'Element' and 'Document'? Is this a third-party parser? Please also show the import lines. – James
