python - Scrapy: download HTML for external links
I need to crawl both internal and external URLs on a site. Once the first page of an external URL is crawled, I do not want to crawl any further into that external site: an external URL might be youtube.com, and following it would result in crawling youtube.com completely. I wrote the code below for this, but I am not able to add the item (the external URL) to the resulting set.
    import urlparse

    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class PropertiesItem(scrapy.Item):
        url = scrapy.Field()
        # html = scrapy.Field()


    class ExampleSpider(CrawlSpider):
        name = "examplespider"
        allowed_domains = ["9value.customer360.co"]
        start_urls = ["http://9value.customer360.co/"]

        rules = (
            Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a',)),
                 callback="parse_items", process_links="filter_links",
                 follow=True),
        )

        def filter_links(self, links):
            for link in links:
                if self.allowed_domains[0] not in link.url:
                    item = PropertiesItem()
                    print '----------------------------- ' + link.url
                    item["url"] = link.url
                    # item["html"] = response.body
            return links

        def parse_items(self, response):
            item = PropertiesItem()
            item["url"] = response.url
            # item["html"] = response.body
            return item
The filter_links function identifies the external links, but their values never make it into the output JSON file (items created inside process_links are simply discarded, since that hook is only expected to return a list of links). I also want the HTML of those same external links.