2013-07-02 4 views
12

Sto provando HTMLUnit per automatizzare il download dei dati da un'applicazione web. Tuttavia, sto ricevendo un sacco di avvertimenti su getPage() (la maggior parte dei quali sembra gestire script collegati che non credo nemmeno io abbia bisogno) e poi un fatale com.gargoylesoftware.htmlunit.ScriptException: Exception che invoca setOuterHTML quando Cerco di eseguire getByXPath per estrarre i dati che sto cercando. E dagli errori che ottengo, non posso per la vita di me capire cosa sta succedendo. Hai qualche idea?HTMLUnità: tonnellate di contenuto obsoleto e impossibile creare avvisi sugli oggetti su getPage() non riesce con eccezione invocando setOuterHTML su getByXPath()

Ecco il mio codice:

import java.util.List; 

import com.gargoylesoftware.htmlunit.WebClient; 
import com.gargoylesoftware.htmlunit.BrowserVersion; 
import com.gargoylesoftware.htmlunit.html.HtmlAnchor; 
import com.gargoylesoftware.htmlunit.html.HtmlPage; 

public class ScrapperApp { 

    private static void go() throws Exception { 
     HtmlPage nextPage; 
     String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate"; 

     final WebClient webclient = new WebClient(); 
     final HtmlPage page = webclient.getPage(url); 

     System.out.println("PULLING LINKS:"); 

     List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//div[@class='hform1']/a[@class='lblentrylink']"); 

     /*for(int x=0; x<articles.size(); x++) { 
      nextPage = articles.get(x).click(); 
      System.out.println(nextPage.getBody()); 
     }*/ 
    } 

    public static void main(String[] args) throws Exception { 
     go(); 
     System.out.println("COMPLETE"); 
    } 

} 

ed ecco la mia uscita della console:

Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'text/javascript'. 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'. 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.] sourceName=[http://www.google-analytics.com/urchin.js] line=[443] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'. 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'.] sourceName=[http://www.google-analytics.com/urchin.js] line=[448] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'. 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'.] sourceName=[http://www.google-analytics.com/urchin.js] line=[456] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'application/x-javascript'. 
Jul 2, 2013 6:19:52 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'application/x-javascript'. 
Jul 2, 2013 6:19:53 PM com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument execCommand 
WARNING: Nothing done for execCommand(BackgroundImageCache, ...) (feature not implemented) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/search/theethics.css' [1621:72] Error in style rule. (Invalid token ":". Was expecting one of: <EOF>, <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <COMMA>, <HASH>, <IMPORTANT_SYM>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <DIMENSION>, <PERCENTAGE>, <URI>, <FUNCTION>, "}", ";", "/", "-".) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/search/theethics.css' [1621:72] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/search/theethics.css' [1722:1] Error in style sheet. (Invalid token ".123". Was expecting one of: <EOF>, <S>, <IDENT>, "<!--", "-->", <HASH>, <IMPORT_SYM>, <PAGE_SYM>, <MEDIA_SYM>, ".", ":", "*", "[", <ATKEYWORD>.) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=12a7FOCbnwgUAwtiPjKWh6wDEhgkTfdV9_FCfkqzSp1sZ_YdcvnAj941ZFWBBPCjl5RQqmB3TVerNjIRqn-QyCUV4dFAyyOktFPBtLE-ETB9nE-rPiQp_RNPyuD-NYO58_ngCw2&t=634516122000000000' [4:1] Error in style rule. (Invalid token ".". Was expecting one of: <S>, <LBRACE>, <COMMA>.) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=12a7FOCbnwgUAwtiPjKWh6wDEhgkTfdV9_FCfkqzSp1sZ_YdcvnAj941ZFWBBPCjl5RQqmB3TVerNjIRqn-QyCUV4dFAyyOktFPBtLE-ETB9nE-rPiQp_RNPyuD-NYO58_ngCw2&t=634516122000000000' [4:1] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=12a7FOCbnwgUAwtiPjKWh6wDEhgkTfdV9_FCfkqzSp1sZ_YdcvnAj941ZFWBBPCjl5RQqmB3TVerNjIRqn-QyCUV4dFAyyOktFPBtLE-ETB9nE-rPiQp_RNPyuD-NYO58_ngCw2&t=634516122000000000' [538:16] Error in style rule. (Invalid token ":". Was expecting one of: <EOF>, <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <COMMA>, <HASH>, <IMPORTANT_SYM>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <DIMENSION>, <PERCENTAGE>, <URI>, <FUNCTION>, "}", ";", "/", "-".) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=12a7FOCbnwgUAwtiPjKWh6wDEhgkTfdV9_FCfkqzSp1sZ_YdcvnAj941ZFWBBPCjl5RQqmB3TVerNjIRqn-QyCUV4dFAyyOktFPBtLE-ETB9nE-rPiQp_RNPyuD-NYO58_ngCw2&t=634516122000000000' [538:16] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [6:1] Error in style rule. (Invalid token ".". Was expecting one of: <S>, <LBRACE>, <COMMA>.) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [6:1] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [105:17] Error in style rule. (Invalid token ":". Was expecting one of: <EOF>, <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <COMMA>, <HASH>, <IMPORTANT_SYM>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <DIMENSION>, <PERCENTAGE>, <URI>, <FUNCTION>, "}", ";", "/", "-".) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [105:17] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error 
WARNING: CSS error: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [160:16] Error in style rule. (Invalid token ":". Was expecting one of: <EOF>, <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <COMMA>, <HASH>, <IMPORTANT_SYM>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <DIMENSION>, <PERCENTAGE>, <URI>, <FUNCTION>, "}", ";", "/", "-".) 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning 
WARNING: CSS warning: 'http://media.ethics.ga.gov/Search/WebResource.axd?d=P_qivaU1jkjGS6yiS47lVyoi52Pqy5e8DnncH3bigK8349gyQVvRTapoSdHm45oIHlJhLQAhH3tEXp29b5hNLTwX4AdAh7qPU9_lVIhmQjWu1Kvx6RDeUrTdN4UrhhDIdOIrpOYk5RJGCyYDSr8ky9HSOiU1&t=634516122000000000' [160:16] Ignoring the following declarations in this rule. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://media.ethics.ga.gov/Search/Telerik.Web.UI.WebResource.axd?_TSM_HiddenField_=ctl00_ContentPlaceHolder1_RadScriptManager1_TSM&compress=1&_TSM_CombinedScripts_=%3b%3bSystem.Web.Extensions%2c+Version%3d3.5.0.0%2c+Culture%3dneutral%2c+PublicKeyToken%3d31bf3856ad364e35%3aen-US%3a7263e9c6-5962-41bc-b839-88b704bfcf0d%3aea597d4b%3ab25378d2%3bTelerik.Web.UI%2c+Version%3d2011.2.915.35%2c+Culture%3dneutral%2c+PublicKeyToken%3d121fae78165ba3d4%3aen-US%3a168ec6eb-791b-4159-8a0f-6c601196f873%3a16e4e7cd%3af7645509%3a24ee1bba%3af46195d3%3a19620875%3a874f8ea2%3a490a9d4e%3abd8f85e4%3bAjaxControlToolkit%2c+Version%3d3.0.20820.16598%2c+Culture%3dneutral%2c+PublicKeyToken%3d28f01b0e84b6d53e%3aen-US%3a707835dd-fa4b-41d1-89e7-6df5d518ffb5%3ab14bb7d5%3a13f47f54%3a369ef9d0%3a1d056c78%3adc2d6e36%3a5acd2e8e%3af8a45328] line=[997] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'application/x-javascript'. 
Jul 2, 2013 6:19:54 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'text/javascript'. 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'. 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.] sourceName=[http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'. 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'.] sourceName=[http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject jsConstructor 
WARNING: Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'. 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError 
SEVERE: runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'.] sourceName=[http://www.google-analytics.com/ga.js] line=[24] lineSource=[null] lineOffset=[0] 
Jul 2, 2013 6:19:55 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'application/x-javascript'. 
Jul 2, 2013 6:19:56 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify 
WARNING: Obsolete content type encountered: 'text/javascript'. 
PULLING LINKS: 
Jul 2, 2013 6:19:56 PM com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl runSingleJob 
SEVERE: Job run failed with unexpected RuntimeException: Exception invoking setOuterHTML 
======= EXCEPTION START ======== 
Exception class=[java.lang.RuntimeException] 
com.gargoylesoftware.htmlunit.ScriptException: Exception invoking setOuterHTML 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:663) 
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:559) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:525) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:594) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:569) 
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:996) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:53) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:101) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:328) 
    at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:161) 
    at java.lang.Thread.run(Thread.java:680) 
Caused by: java.lang.RuntimeException: Exception invoking setOuterHTML 
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:163) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$GetterSlot.setValue(ScriptableObject.java:287) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$RelinkedSlot.setValue(ScriptableObject.java:359) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.putImpl(ScriptableObject.java:2659) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.put(ScriptableObject.java:509) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.putProperty(ScriptableObject.java:2364) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.setObjectProp(ScriptRuntime.java:1601) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.setObjectProp(ScriptRuntime.java:1595) 
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1248) 
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:815) 
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:109) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:415) 
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:274) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3132) 
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:107) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:587) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:651) 
    ... 10 more 
Caused by: java.lang.IllegalStateException: Previous sibling for HtmlDivision[<div style="height: 0px; overflow: hidden; border-top: solid black; border-top-width: thick;">] is null. 
    at com.gargoylesoftware.htmlunit.html.DomNode.insertBefore(DomNode.java:1023) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement$ProxyDomNode.appendChild(HTMLElement.java:1091) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.handleCharacters(HTMLParser.java:710) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endDocument(HTMLParser.java:718) 
    at org.apache.xerces.parsers.AbstractSAXParser.endDocument(Unknown Source) 
    at org.cyberneko.html.HTMLTagBalancer.endDocument(HTMLTagBalancer.java:510) 
    at org.cyberneko.html.filters.DefaultFilter.endDocument(DefaultFilter.java:213) 
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2116) 
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) 
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:818) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseFragment(HTMLParser.java:162) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseFragment(HTMLParser.java:121) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.parseHtmlSnippet(HTMLElement.java:1048) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.setOuterHTML(HTMLElement.java:1035) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:137) 
    ... 26 more 
Enclosed exception: 
java.lang.RuntimeException: Exception invoking setOuterHTML 
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:163) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$GetterSlot.setValue(ScriptableObject.java:287) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$RelinkedSlot.setValue(ScriptableObject.java:359) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.putImpl(ScriptableObject.java:2659) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.put(ScriptableObject.java:509) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.putProperty(ScriptableObject.java:2364) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.setObjectProp(ScriptRuntime.java:1601) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.setObjectProp(ScriptRuntime.java:1595) 
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1248) 
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:815) 
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:109) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:415) 
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:274) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3132) 
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:107) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:587) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:651) 
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:559) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:525) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:594) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:569) 
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:996) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:53) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:101) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:328) 
    at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:161) 
    at java.lang.Thread.run(Thread.java:680) 
Caused by: java.lang.IllegalStateException: Previous sibling for HtmlDivision[<div style="height: 0px; overflow: hidden; border-top: solid black; border-top-width: thick;">] is null. 
    at com.gargoylesoftware.htmlunit.html.DomNode.insertBefore(DomNode.java:1023) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement$ProxyDomNode.appendChild(HTMLElement.java:1091) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.handleCharacters(HTMLParser.java:710) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endDocument(HTMLParser.java:718) 
    at org.apache.xerces.parsers.AbstractSAXParser.endDocument(Unknown Source) 
    at org.cyberneko.html.HTMLTagBalancer.endDocument(HTMLTagBalancer.java:510) 
    at org.cyberneko.html.filters.DefaultFilter.endDocument(DefaultFilter.java:213) 
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2116) 
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) 
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:818) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseFragment(HTMLParser.java:162) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseFragment(HTMLParser.java:121) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.parseHtmlSnippet(HTMLElement.java:1048) 
    at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.setOuterHTML(HTMLElement.java:1035) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
    at java.lang.reflect.Method.invoke(Method.java:597) 
    at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:137) 
    ... 26 more 
== CALLING JAVASCRIPT == 

    function() { 
     return b.apply(a, arguments); 
    } 

======= EXCEPTION END ======== 
COMPLETE 

risposta

21

L'errore deriva da un file MicrosoftAjax.js. Prova a simulare chrome:

final WebClient webclient = new WebClient(BrowserVersion.CHROME); 

Inoltre, è stato aggiunto un collegamento per eliminare gli avvisi di HtmlUnit.

Inoltre, il tuo XPath non trova nulla (ho provato in Chrome). Ne ho usato un altro a titolo di esempio:

import java.util.List; 

import com.gargoylesoftware.htmlunit.WebClient; 
import com.gargoylesoftware.htmlunit.BrowserVersion; 
import com.gargoylesoftware.htmlunit.html.HtmlAnchor; 
import com.gargoylesoftware.htmlunit.html.HtmlPage; 

public class ScrapperApp { 

    private static void go() throws Exception { 
     /* turn off annoying htmlunit warnings */ 
     java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF); 

     HtmlPage nextPage; 
     String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate"; 

     final WebClient webclient = new WebClient(BrowserVersion.CHROME); 
     final HtmlPage page = webclient.getPage(url); 

     System.out.println("PULLING LINKS:"); 

     List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//a[@class='lblentrylink']"); 
     //List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//div[@class='hform1']/a[@class='lblentrylink']"); 

     for(int x=0; x<articles.size(); x++) { 
      System.out.println("Clicking "+articles.get(x).asText()); 
      //nextPage = articles.get(x).click(); 
      //System.out.println(nextPage.getBody()); 
     } 
    } 
    public static void main(String[] args) throws Exception { 
     go(); 
     System.out.println("COMPLETE"); 
    } 
} 
+3

Che funziona. Grazie mille! – Jeff

Problemi correlati