python - Scrapy: download HTML for external links
I need to crawl both internal and external URLs on a site. Once the first page of an external URL is crawled, I do not want to crawl any further into that external site: an external URL might be youtube.com, and following it would result in crawling youtube.com completely. I wrote the code below for this, but I am not able to add the item (the external URL) to the resulting set.
    import urlparse

    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class PropertiesItem(scrapy.Item):
        url = scrapy.Field()
        # html = scrapy.Field()


    class ExampleSpider(CrawlSpider):
        name = "examplespider"
        allowed_domains = ["9value.customer360.co"]
        start_urls = ["http://9value.customer360.co/"]

        rules = (
            Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a',)),
                 callback="parse_items", process_links="filter_links",
                 follow=True),
        )

        def filter_links(self, links):
            for link in links:
                if self.allowed_domains[0] not in link.url:
                    item = PropertiesItem()
                    print '----------------------------- ' + link.url
                    item["url"] = link.url
                    # item["html"] = response.body
            return links

        def parse_items(self, response):
            item = PropertiesItem()
            item["url"] = response.url
            # item["html"] = response.body
            return item
The filter_links function identifies the external links, but their values never make it into the output JSON file (items created inside process_links are simply discarded, since that hook is only expected to return a list of links). I also want the HTML of those same external links.