2014-09-23 20 views

Python nltk.clean_html not implemented

I tried to use nltk.clean_html:

myNews=urlopen(url).read()  
myNews=nltk.clean_html(myNews) 

I get the following error:

File "/usr/local/lib/python2.7/dist-packages/nltk-3.0.0-py2.7.egg/nltk/util.py", line 346, in clean_html
    raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

When I look at the util.py file, I can see that it is not implemented:

def clean_html(html): 
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function") 

Shouldn't it be implemented?

Answer


As the other answers note, NLTK dropped this feature and recommends: "To remove HTML markup, use BeautifulSoup's get_text() function." Beautiful Soup is probably the way to go if you are extracting text from a particular element, but if you want the text of an entire page, IMHO you should go with the old NLTK function, copied into your own code. Here is a comparison of the two approaches:

import mechanize 
import nltk 
from bs4 import BeautifulSoup 
from html2text import html2text 
import re 


def clean_html(html): 
    """ 
    Copied from NLTK package. 
    Remove HTML markup from the given string. 

    :param html: the HTML string to be cleaned 
    :type html: str 
    :rtype: str 
    """ 

    # First we remove inline JavaScript/CSS: 
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip()) 
    # Then we remove html comments. This has to be done before removing regular 
    # tags since comments can contain '>' characters. 
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned) 
    # Next we can remove the remaining tags: 
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned) 
    # Finally, we deal with whitespace 
    cleaned = re.sub(r"&nbsp;", " ", cleaned) 
    # Collapse double spaces into single spaces (two passes, as in NLTK)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip() 

url = "http://www.nytimes.com/2015/08/31/business/challenged-on-left-and-right-the-fed-faces-a-decision-on-rates.html" 
br = mechanize.Browser() 
br.set_handle_robots(False) 
br.addheaders = [('User-agent', 'Firefox')] 
html = br.open(url).read().decode('utf-8') 
cleanhtml = clean_html(html) 
text = html2text(cleanhtml) 
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly; bs4 warns if it is omitted
text2 = soup.get_text() 

With the NLTK function I get a nice clean result (see here; the output went over the 30,000-character maximum, so I had to put it in a pastebin to be able to post). And with Beautiful Soup:

u'\n \n\n\n\n\nChallenged on Left and Right, the Fed Faces a Decision on Rates - The New York Times\nwindow.NREUM||(NREUM={}),__nr_require=function(n,e,t){function r(t){if(!e[t]){var o=e[t]={exports:{}};n[t][0].call(o.exports,function(e){var o=n[t][1][e];return r(o?o:e)},o,o.exports)}return e[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({QJf3ax:[function(n,e){function t(n){function e(e,t,a){n&&n(e,t,a),a||(a={});for(var u=c(e),f=u.length,s=i(a,o,r),p=0;f>p;p++)u[p].apply(s,t);return s}function a(n,e){f[n]=c(n).concat(e)}function c(n){return f[n]||[]}function u(){return t(e)}var f={};return{on:a,emit:e,create:u,listeners:c,_events:f}}function r(){return{}}var o="[email protected]",i=n("gos");e.exports=t()},{gos:"7eSDFh"}],ee:[function(n,e){e.exports=n("QJf3ax")},{}],3:[function(n,e){function t(n){return function(){r(n,[(new Date).getTime()].concat(i(arguments)))}}var r=n("handle"),o=n(1),i=n(2);"undefined"==typeof window.newrelic&&(newrelic=window.NREUM);var a=["setPageViewName","addPageAction","setCustomAttribute","finished","addToTrace","inlineHit","noticeError"];o(a,function(n,e){window.NREUM[e]=t("api-"+e)}),e.exports=window.NREUM},{1:12,2:13,handle:"D5DuLP"}],gos:[function(n,e){e.exports=n("7eSDFh")},{}],"7eSDFh":[function(n,e){function t(n,e,t){if(r.call(n,e))return n[e];var o=t();if(Object.defineProperty&&Object.keys)try{return Object.defineProperty(n,e,{value:o,writable:!0,enumerable:!1}),o}catch(i){}return n[e]=o,o}var r=Object.prototype.hasOwnProperty;e.exports=t},{}],D5DuLP:[function(n,e){function t(n,e,t){return r.listeners(n).length?r.emit(n,e,t):(o[n]||(o[n]=[]),void o[n].push(e))}var r=n("ee").create(),o={};e.exports=t,t.ee=r,r.q=o},{ee:"QJf3ax"}],handle:[function(n,e){e.exports=n("D5DuLP")},{}],XL7HBI:[function(n,e){function t(n){var e=typeof n;return!n||"object"!==e&&"function"!==e?-1:n===window?0:i(n,o,function(){return r++})}var r=1,o="[email 
protected]",i=n("gos");e.exports=t},{gos:"7eSDFh"}],id:[function(n,e){e.exports=n("XL7HBI")},{}],loader:[function(n,e){e.exports=n("G9z0Bl")},{}],G9z0Bl:[function(n,e){function t(){var n=h.info=NREUM.info;if(n&&n.licenseKey&&n.applicationID&&f&&f.body){c(l,function(e,t){e in n||(n[e]=t)}),h.proto="https"===d.split(":")[0]||n.sslForHttp?"https://":"http://",a("mark",["onload",i()]);var e=f.createElement("script");e.src=h.proto+n.agent,f.body.appendChild(e)}}function r(){"complete"===f.readyState&&o()}function o(){a("mark",["domContent",i()])}function i(){return(new Date).getTime()}var a=n("handle"),c=n(1),u=(n(2),window),f=u.document,s="addEventListener",p="attachEvent",d=(""+location).split("?")[0],l={beacon:"bam.nr-data.net",errorBeacon:"bam.nr-data.net",agent:"js-agent.newrelic.com/nr-593.min.js"},h=e.exports={offset:i(),origin:d,features:{}};f[s]?(f[s]("DOMContentLoaded",o,!1),u[s]("load",t,!1)):(f[p]("onreadystatechange",r),u[p]("onload",t)),a("mark",["firstbyte",i()])},{1:12,2:3,handle:"D5DuLP"}],12:[function(n,e){function t(n,e){var t=[],o="",i=0;for(o in n)r.call(n,o)&&(t[i]=e(o,n[o]),i+=1);return t}var r=Object.prototype.hasOwnProperty;e.exports=t},{}],13:[function(n,e){function t(n,e,t){e||(e=0),"undefined"==typeof t&&(t=n?n.length:0);for(var r=-1,o=t-e||0,i=Array(0>o?0:o);++r<o;)i[r]=n[e+r];return i}e.exports=t},{}]},{},["G9z0Bl"]);\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n{"pageconfig":{"ledeMediaSize":"large","keywords":["article-medium","has-embedded-interactive"]}}\n\n   [] \n\nvar googletag=googletag||{};googletag.cmd=googletag.cmd||[],function(){var t=document.createElement("script");t.async=!0,t.type="text/javascript";t.src="http://www.googletagservices.com/tag/js/gpt.js";var o=document.getElementsByTagName("script")[0];o.parentNode.insertBefore(t,o)}();\n\n\n[\n {\n  "testId": "0012",\n  
"testName": "tallWatchingModule",\n  "throttle": 1.0,\n  "allocation": 0.9,\n  "variants": 1,\n  "applications": ["homepage"]\n },\n {\n  "testId": "0033",\n  "testName": "recommendedLabelTest",\n  "throttle": 1,\n  "allocation": 0.833,\n  "variants": 5,\n  "applications": ["article"]\n },\n {\n  "testId": "0036",\n  "testName": "velcroSocialFollow",\n  "throttle": 0.1,\n  "allocation": 0.5,\n  "variants": 1,\n  "applications": ["article", "homepage"]\n },\n {\n  "testId": "0050",\n  "testName": "styledMostEmailed",\n  "throttle": 1,\n  "allocation": 0.667,\n  "variants": 2,\n  "applications": ["article"]\n },\n {\n  "testId": "0051",\n  "testName": "shuffleRecommendations",\n  "throttle": 1.0,\n  "allocation": 0.667,\n  "variants": 1,\n  "applications": ["article"]\n },\n {\n  "testId": "0052",\n  "testName": "paidPostDriver",\n  "throttle": 1.0,\n  "allocation": 0.875,\n  "variants": 7,\n  "applications": ["article"]\n },\n {\n  "testId": "0061",\n  "testName": "paidPostFivePack",\n  "throttle": 0,\n  "allocation": 0,\n  "variants": 1,\n  "applications": ["homepage"]\n }\n]\n\n\n\n{ "meta": {},\n "data": {\n "id": "0",\n "name": "",\n "subscription": ["","_RPV"],\n "demographics": {}\n }\n}\n\n\nvar require = {\n baseUrl: \'http://a1.nyt.com/assets/\',\n waitSeconds: 20,\n paths: {\n  \'foundation\': \'article/20150828-192044/js/foundation\',\n  \'shared\': \'article/20150828-192044/js/shared\',\n  \'article\': \'article/20150828-192044/js/article\',\n  \'application\': \'article/20150828-192044/js/article/article\',\n  \'videoFactory\': \'http://static01.nyt.com/js2/build/video/2.0/videofactoryrequire\',\n  \'videoPlaylist\': \'http://static01.nyt.com/js2/build/video/players/extended/2.0/appRequire\',\n  \'auth/mtr\': \'http://static01.nyt.com/js/mtr\',\n  \'auth/growl\': \'http://static01.nyt.com/js/auth/growl/default\',\n  \'vhs\': \'http://static01.nyt.com/video/vhs/build/vhs-2.x.min\'\n },\n map: {\n  \'*\': {\n   \'article/main\': \'article/article/main\'\n 
 }\n }\n};\n\n\n\n\n\n\nwindow.magnum.processFlags(["limitFabrikSave","moreFollowSuggestions","dfpAds","dfpWhitelist","criticsPickAdditionalInfo","restaurantAttributes","theaterAttributes","movieAttributes","followFeature","restaurantReviewAdditionalDetails","theaterReviewAdditionalDetails","restaurantReviewHideInfoBox","theaterReviewHideInfoBox","restaurantReviewShowRestaurantName","restaurantReviewShowGoogleMap","restaurantReviewShowNotes","restaurantReviewShowLastUpdated","styledMostEmailed","videoVHSCover","restaurantReviewShowMenuLink","allTheEmphases","androidDeepLinks","autoPlayVideos","restaurantOpenStatus","standaloneSlideshowPromo","showNewTMagLogo"]);\n\n\nrequire([\'foundation/main\'], function() {\n require([\'auth/mtr\', \'auth/growl\']);\n});\n\n\n\n\n .lt-ie10 .messenger.suggestions {\n  display: block !important;\n  height: 50px;\n }\n\n .lt-ie10 .messenger.suggestions .message-bed {\n  background-color: #f8e9d2;\n  border-bottom: 1px solid #ccc;\n }\n\n .lt-ie10 .messenger.suggestions .message-container {\n  padding: 11px 18px 11px 30px;\n }\n\n .lt-ie10 .messenger.suggestions .action-link {\n  font-family: "nyt-franklin", arial, helvetica, sans-serif;\n  font-size: 10px;\n  font-weight: bold;\n  color: #a81817;\n  text-transform: uppercase;\n }\n\n .lt-ie10 .messenger.suggestions .alert-icon {\n  background: url(\'http://i1.nyt.com/images/icons/icon-alert-12x12-a81817.png\') no-repeat;\n  width: 12px;\n  height: 12px;\n  display: inline-block;\n  margin-top: -2px;\n  float: none;\n }\n\n .lt-ie10 .masthead,\n .lt-ie10 .navigation,\n .lt-ie10 .comments-panel {\n  margin-top: 50px !important;\n }\n\n .lt-ie10 .ribbon {\n  margin-top: 97px !important;\n }\n\n\n\n\n\n\nNYTimes.com no longer supports Internet Explorer 9 or earlier. 
Please upgrade your browser.\nLEARN MORE \xbb\n\n\n\n\n\n\n\n\n\nSections\n\nHome\n\nSearch\nSkip to content\nSkip to navigation\nView mobile version\n\n\n\n\nThe New York Times\n\n\nwindow.magnum.writeLogo(\'small\', \'http://a1.nyt.com/assets/article/20150828-192044/images/foundation/logos/\', \'business\', \'masthead-theme-standard\', \'standard\', \'branding-heading-link\');\n\n\nEconomy|Challenged on Left and Right, the Fed Faces a Decision on Rates\n\n\n\nAdvertisement\n\n\n\n\n\n\n\nSearch\n\n\nLog In\n0\nSettings\n\n\n\n\nClose search\n\nsearch sponsored by\n\n\n\n\n\n\nSearch NYTimes.com\n\n\n\nClear this text input\n\n\n\nGo\n\n\n\n\n\n\nhttp://nyti.ms/1VpLa1D\n\n\n\n\nLoading...\n\n\n\n\nSee next articles\n\n\n\n\n\nSee previous articles\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\nAdvertisement\n\n\n\n\n\n\nEconomy \nChallenged on Left and Right, the Fed Faces a Decision on Rates\n\nBy BINYAMIN APPELBAUMAUG. 30, 2015\n\n\nInside\n\n\n\nSupported by\n\n\n\n\n\n\n\n\nPhoto\n\n\n\n\n\n\nJanet L. Yellen, the Federal Reserve chairwoman.\n\nCredit\n   Stephen Crowley/The New York Times  \n\n\n\n\nAdvertisement\n\nContinue reading the main story\n\n\n\n\n\nContinue reading the main story\nShare This Page\n\nContinue reading the main story\n\n\nContinue reading the main story\n\n\n\nJACKSON HOLE, Wyo. 
\u2014 Conservative activists who want the Federal Reserve to raise interest rates distributed chocolate coins in golden wrappers at the local airport last week as Fed officials arrived for their annual policy retreat.Liberal activists in green \u201cWhose Recovery?\u201d T-shirts formed a receiving line at the resort hotel in the heart of Grand Teton National Park where the meeting was held, to personalize their argument that the Fed should wait.Sometime soon \u2014 possibly as early as mid-September and probably no later than the end of the year \u2014 the Fed plans to raise its benchmark interest rate one-quarter of one percentage point, a mathematically minor move that has become a very big deal.Investors, who always pay attention to the Fed, are paying particular attention now. The central bank has held short-term rates near zero since December 2008; the impending end of that era is one cause of recent financial market turmoil. \n\nContinue reading the main story\n\n\n       Related Coverage\n     \n\n\n\n\n\n\n\n\n\n\nOptimistic About Inflation, Stanley Fischer Suggests That Fed Will Stick to Plan on RatesAUG. 29, 2015\n\n\n\n\n\n\n\nBut the Fed\u2019s plans have also become the latest point of contention in a broader debate about the government\u2019s management of the American economy, pitting liberals who see a need for more aggressive measures to bolster growth against conservatives concerned that Washington and the Fed are already doing much too much. \nContinue reading the main story\n\n\n\n    When Will the Fed Raise Rates?   \n\n    More than seven years ago the Federal Reserve put its benchmark interest rate close to zero, as a way to bolster the economy. But that policy is about to change.   \n\n\n\n\n\n\n\n\n\n\n\n\u201cThere shouldn\u2019t be this intense interest in a quarter-point increase, and there shouldn\u2019t be this intense interest in whether it comes in September or December,\u201d said Alan S. 
Blinder, a Princeton economist and the Fed\u2019s vice chairman in the mid-1990s. \u201cBut the Fed remains the center of the financial universe. People stare at it like they stare at the North Star.\u201dAnd so, as Fed officials conferred with other central bankers and academics, the liberal activists held two days of \u201cFed Up\u201d teach-ins in a room directly below the main conference, while the conservatives convened a \u201cJackson Hole Summit\u201d at a nearby dude ranch.In the decades before the financial crisis, policy makers generally agreed that central banks should focus on moderating inflation. Now, both that goal and the best way to achieve it are subjects of debate. Liberals argue that the Fed should aim more broadly to lower unemployment and encourage rising living standards. Conservatives want to strengthen the focus on inflation by requiring officials to follow rules in making policy.\nAdvertisement\n\nContinue reading the main story\nWith the critics lining up outside, central bankers found no escape inside the main conference, where a series of academics warned policy makers that their view of inflation was oversimplified, and that their policies were less effective as a consequence.\u201cThe conference was more about what we don\u2019t know, about a candid willingness to analyze what we don\u2019t know,\u201d said Lucrezia Reichlin, a professor at London Business School and former director general of research at the European Central Bank. \u201cIt did not really inspire confidence\u201d in monetary policy.The formal program, on \u201cInflation Dynamics and Monetary Policy,\u201d was devoted to the vexing reality that inflation in recent years has not behaved as economists predicted. The basic paradigm, known as the Phillips Curve, is that inflation falls as unemployment rises, and rises as unemployment falls. 
But inflation did not fall as much as expected during the Great Recession, and it has remained surprisingly weak during the recovery.\nAdvertisement\n\nContinue reading the main story\nOver the course of two days, the invited academics argued that the real story was more complicated. One study, for example, presented evidence that prices fall more slowly during recessions because cash-short firms actually tend to increase prices in the face of declining demand for their products.\u201cOnce you integrate all these dynamics, it may turn out that life is not that simple,\u201d said Eric M. Leeper, an economist at Indiana University and co-author of a paper arguing that central banks need better economic models.Central bankers, however, have shown little interest in paradigm shifts. Several said that the basic understanding of inflation, while obviously imperfect, remains more functional than any alternatives.\u201cI don\u2019t think the folks at the Fed are of a mind to redesign monetary policy just because of what happened during the crisis,\u201d said Jon Faust, a professor of economics at Johns Hopkins University and a former adviser to the Fed\u2019s chairwoman, Janet L. Yellen, and her predecessor, Ben S. Bernanke.Indeed, V\xedtor Const\xe2ncio, vice president of the European Central Bank, said the euro area was currently experiencing \u201ca renaissance of the Phillips Curve.\u201dStanley Fischer, vice chairman of the Federal Reserve, painted a somewhat more complicated picture of inflation, arguing that the role of labor market slack is easily overstated, and that exchange rates play an important role.\nContinue reading the main story\nVideo\n\nThe Fed\u2019s Button on the Economy\n\nWhen it comes to raising or lowering interest rates, what the Fed is really trying to do is balance growth and inflation. 
But they have a limited set of tools to accomplish their goal.\n\n     By Andrew Ross Sorkin, Aaron Byrd and Erica Berenstein on                Publish Date July 29, 2015.\n         \n\n           Photo by Aaron Byrd/The New York Times.\n         \nWatch in Times Video \xbb\n\n\n\nBut his bottom line, too, was that the Fed understands inflation well enough to predict its movements. While domestic inflation has been surprisingly sluggish for years now, Mr. Fischer said on Friday that his confidence in an eventual rebound remained \u201cpretty high.\u201dThe organizers of the fringe conferences acknowledged the odds against their more radical proposals.\u201cFed Up\u201d is mostly funded by the foundation of a Facebook co-founder, Dustin Moskovitz, which said: \u201cOur best guess is that the campaign is unlikely to have an impact on the Fed\u2019s monetary policy, but that if it does, the benefits would be very large.\u201dJim DeMint, president of the Heritage Foundation, spoke at the conservative conference of \u201ca long and difficult battle that we can and must win.\u201dThe Center for Public Democracy, which organized the \u201cFed Up\u201d campaign, wants the Fed to keep rates near zero even as overall unemployment falls, to spur wage gains and help members of minorities, in particular, find jobs. It brought about 50 people to Jackson Hole as part of an effort to engage community groups that generally focus on civil rights or local issues like minimum wage laws.Dawn O\u2019Neal, 48, makes $8.50 an hour as a day care worker in suburban Atlanta; her husband has not found regular construction work in a year. When Ms. 
O\u2019Neal needs a refill on her asthma medication, she cuts back on food, buying hot dogs instead of beef and canned vegetables instead of fresh vegetables.\u201cI don\u2019t feel like anyone at the Fed has ever had to make a decision about whether to eat or get medication, and so when I hear that they\u2019re going to raise interest rates in September, it angers me and it scares me,\u201d Ms. O\u2019Neal said.\nAdvertisement\n\nContinue reading the main story\n\n\nAdvertisement\n\nContinue reading the main story\nThe protesters struck a chord with some officials at the main meeting. Jason Furman, President Obama\u2019s chief economic adviser, went downstairs and delivered an impromptu speech. \u201cWe don\u2019t comment on monetary policy, but what I can say is that monetary policy matters,\u201d he told the activists. The prosperity of the late 1990s, he added, resulted in part from \u201ca set of decisions made by the Federal Reserve that allowed that to happen.\u201dOther officials, however, said the push for low rates was misguided.\u201cThe biggest risk for those that are less fortunate is that we would go back into recession,\u201d said James Bullard, president of the Federal Reserve Bank of St. Louis, who said he leaned toward raising rates in September. \u201cI\u2019m hoping my policy would lengthen out the expansion longer.\u201dThe conservative conference was aligned with efforts by congressional Republicans to impose new restrictions on the Fed\u2019s conduct of monetary policy. A leading proposal would require the Fed to choose a formula for setting rates and stick with it.This view has few fans among the central bankers, who see their own judgment as an essential part of policy making.Mr. Blinder said part of the disconnect between the officials and the activists may reflect that broader concerns motivate liberals and conservatives. 
Conservatives see the Fed as enabling the growth of the federal debt, while liberals see the Fed as contributing to the rise of inequality.Mr. Blinder said the central bank had little power to reverse either trend. \u201cThey overstate the importance and power of the Federal Reserve,\u201d he said. All it can do, he added, is \u201caddress these problems around the edges.\u201d\n\n\nA version of this article appears in print on August 31, 2015, on page A1 of the New York edition with the headline: Left and Right Work to Shift Fed\u2019s Direction. Order Reprints| Today\'s Paper|Subscribe\n\n\n\n\n\n\n\n\n\n\n\nLoading...\n\n\n\n\n\n\n\n\n\nGo to Home Page \xbb\n\nSite Index\n\nThe New York Times\n\n\nwindow.magnum.writeLogo(\'small\', \'http://a1.nyt.com/assets/article/20150828-192044/images/foundation/logos/\', \'\', \'\', \'standard\', \'site-index-branding-link\');\n\n\n\n\nNews\n\n\nWorld\n\n\nU.S.\n\n\nPolitics\n\n\nN.Y.\n\n\nBusiness\n\n\nTech\n\n\nScience\n\n\nHealth\n\n\nSports\n\n\nEducation\n\n\nObituaries\n\n\nToday\'s Paper\n\n\nCorrections\n\n\n\n\nOpinion\n\n\nToday\'s Opinion\n\n\nOp-Ed Columnists\n\n\nEditorials\n\n\nContributing Writers\n\n\nOp-Ed Contributors\n\n\nOpinionator\n\n\nLetters\n\n\nSunday Review\n\n\nTaking Note\n\n\nRoom for Debate\n\n\nPublic Editor\n\n\nVideo: Opinion\n\n\n\n\nArts\n\n\nToday\'s Arts\n\n\nArt & Design\n\n\nArtsBeat\n\n\nBooks\n\n\nDance\n\n\nMovies\n\n\nMusic\n\n\nN.Y.C. Events Guide\n\n\nTelevision\n\n\nTheater\n\n\nVideo Games\n\n\nVideo: Arts\n\n\n\n\nLiving\n\n\nAutomobiles\n\n\nCrossword\n\n\nFood\n\n\nEducation\n\n\nFashion & Style\n\n\nHealth\n\n\nJobs\n\n\nMagazine\n\n\nN.Y.C. Events Guide\n\n\nReal Estate\n\n\nT Magazine\n\n\nTravel\n\n\nWeddings & Celebrations\n\n\n\n\nListings & More\n\n\nClassifieds\n\n\nTools & Services\n\n\nTimes Topics\n\n\nPublic Editor\n\n\nN.Y.C. 
Events Guide\n\n\nTV Listings\n\n\nBlogs\n\n\nCartoons\n\n\nMultimedia\n\n\nPhotography\n\n\nVideo\n\n\nNYT Store\n\n\nTimes Journeys\n\n\nSubscribe\n\n\nManage My Account\n\n\n\n\nSubscribe\n\nSubscribe\n\n\nTimes Premier\n\n\n\nHome Delivery\n\n\n\nDigital Subscriptions\n\n\n\nNYT Opinion\n\n\n\nCrossword\n\n\n\n\nEmail Newsletters\n\n\nAlerts\n\n\nGift Subscriptions\n\n\nCorporate Subscriptions\n\n\nEducation Rate\n\n\n\n\nMobile Applications\n\n\nReplica Edition\n\n\nInternational New York Times\n\n\n\n\n\n\n\n\n\n\n\n     \xa9 2015 The New York Times Company\n\n\nHome\nSearch\nContact Us\nWork With Us\nAdvertise\nYour Ad Choices\nPrivacy\nTerms of Service\nTerms of Sale\n\n\n\n\nSite Map\nHelp\nSite Feedback\nSubscriptions\n\n\n\n\n\n\nrequire([\'foundation/main\'], function() {\n require([\'article/main\']);\n require([\'jquery/nyt\', \'foundation/views/page-manager\'], function ($, pageManager) {\n  if (window.location.search.indexOf(\'disable_tagx\') > 0) {\n   return;\n  }\n  $(document).ready(function() {\n   require([\'http://static01.nyt.com/bi/js/tagx/tagx.js\'], function() {\n    pageManager.trackingFireEventQueue();\n   });\n  });\n });\n});\n\n\n\n\n\n\n\n\n\n\nwindow.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","licenseKey":"b5bcf2eba4","applicationID":"4491457","transactionName":"YwFXZhRYVhAEVUZcX1pLYEAPFlkTFRhCXUA=","queueTime":0,"applicationTime":305,"ttGuid":"","agentToken":"","userAttributes":"","errorBeacon":"bam.nr-data.net","agent":"js-agent.newrelic.com\\/nr-593.min.js"}\n\n' 

As you can see if you scroll through it, the Beautiful Soup version includes a lot of non-visible text (inline JavaScript, JSON configuration, and CSS). Not very pretty.
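
The difference can be reproduced without fetching a page. Below is a minimal sketch, assuming Python 3 and using only the standard `re` module. The `SAMPLE` markup and both function names are made up for illustration, and whitespace is collapsed with `\s+` rather than NLTK's double-space replacement. Stripping tags alone leaves the script and style bodies in the text, while removing script/style elements first (as the copied NLTK function does) leaves only the readable text.

```python
import re


def strip_tags_only(html):
    """Remove tags but keep everything between them, including script/style bodies."""
    text = re.sub(r"(?s)<.*?>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def strip_scripts_then_tags(html):
    """Remove script/style elements and comments first, then the remaining tags."""
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical sample page, invented for this illustration.
SAMPLE = """<html><head><title>Fed Faces a Decision</title>
<script>var tracker = {id: 1};</script>
<style>.ribbon { margin-top: 97px; }</style></head>
<body><!-- nav --><p>Rates may rise.</p></body></html>"""

tags_only = strip_tags_only(SAMPLE)
scripts_removed = strip_scripts_then_tags(SAMPLE)

print(tags_only)        # still contains the JavaScript and CSS source
print(scripts_removed)  # only the title and paragraph text remain
```

Regex-based stripping like this is a rough tool: it works on typical pages but can be fooled by unusual markup, so treat it as a quick cleanup rather than a full HTML parser.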


But how do you go with the NLTK function if they dropped it?
