python - scrapy: Download html for external links -


I need to crawl both internal and external URLs on a site, but once the first external URL is crawled I do not want to crawl the external site any further. For example, the URL might be youtube.com, which would otherwise result in crawling youtube.com completely. I have written the code below to do what I want, but I am not able to add the item (the external URL) to the resultant set.

import urlparse
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PropertiesItem(scrapy.Item):
    url = scrapy.Field()
    #html = scrapy.Field()

class ExampleSpider(CrawlSpider):
    name = "ExampleSpider"
    allowed_domains = ["9value.customer360.co"]
    start_urls = ["http://9value.customer360.co/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a',)),
             callback="parse_items", process_links="filter_links", follow=True),
    )

    def filter_links(self, links):
        for link in links:
            if self.allowed_domains[0] not in link.url:
                item = PropertiesItem()
                print '----------------------------- ' + link.url
                item["url"] = link.url
                #item["html"] = response.body
        return links

    def parse_items(self, response):
        item = PropertiesItem()
        item["url"] = response.url
        #item["html"] = response.body
        return item

The filter_links function filters out the external links, but it does not add their values to the output JSON file. I also want the HTML of those same links.
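One likely reason nothing reaches the output: process_links can only return a (possibly filtered) list of links for Scrapy to schedule; any items created inside it are discarded, never passed to a callback or the feed exporter. A common pattern is instead to keep two rules, one that follows internal links and one that requests external links with follow=False so their callback can capture response.url and response.body. Separately from Scrapy, the internal/external split itself can be sketched with the standard library alone. This is a minimal Python 3 sketch (the question's code is Python 2, and the sample URLs below are hypothetical):

```python
from urllib.parse import urlparse

def split_links(urls, allowed_domain):
    """Split URLs into internal and external lists by comparing hosts.

    Comparing the parsed netloc is safer than substring matching on the
    whole URL, which would wrongly treat a URL like
    http://evil.example/?q=allowed.com as internal.
    """
    internal, external = [], []
    for url in urls:
        host = urlparse(url).netloc
        # Accept the domain itself and any of its subdomains.
        if host == allowed_domain or host.endswith("." + allowed_domain):
            internal.append(url)
        else:
            external.append(url)
    return internal, external

# Hypothetical links for illustration:
links = [
    "http://9value.customer360.co/about",
    "http://www.youtube.com/watch?v=abc",
]
internal, external = split_links(links, "9value.customer360.co")
```

Inside a spider, a helper like this (or LinkExtractor's own allow_domains/deny_domains arguments) would replace the substring test `self.allowed_domains[0] not in link.url`, which can misclassify URLs that merely contain the domain string in their path or query.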

