Scrapy: passing arguments and starting spiders
start_requests
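Scrapy lets you customize how a spider starts by defining a start_requests method, for example to start from specific links or to pass specific values along at startup.
If no customization is needed and the spider only has to start from a fixed set of links, you can leave start_requests undefined and list the links in start_urls instead.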
from typing import Any

import scrapy
from scrapy.http import Response


# Using the start_requests method
class BaiDuApi(scrapy.Spider):
    name = 'baiduapi'

    def start_requests(self):
        urls = [
            'https://httpbin.org/get?params=1',
            'https://httpbin.org/get?params=2'
        ]
        # Yield one request per URL; responses go to parse() by default
        for url in urls:
            yield scrapy.Request(url)

    def parse(self, response: Response, **kwargs: Any) -> Any:
        pass
from typing import Any

import scrapy
from scrapy.http import Response


# Using start_urls; the effect is the same as the code above
class BaiDuApi(scrapy.Spider):
    name = 'baiduapi'
    start_urls = [
        'https://httpbin.org/get?params=1',
        'https://httpbin.org/get?params=2'
    ]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        pass
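Passing arguments to Scrapy
Scrapy accepts the -a option when a crawl is launched. Each -a passes one argument to the spider being run, where it becomes available as an attribute on the spider instance.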
import scrapy


class WangYiNew(scrapy.Spider):
    name = 'wangyinews'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        # A value given as "-a article=..." is set as an attribute on the spider
        article = getattr(self, "article", None)
        if article is not None:
            base_url = "https://www.163.com/news/article/" + article + '.html'
            yield scrapy.Request(
                base_url,
                headers=self.headers
            )

    def parse(self, response):
        items = {
            'title': response.xpath("//h1[@class='post_title']/text()").get(),
            'content': ''.join(response.xpath("//div[@class='post_body']//p//text()").getall()),
            'pubtime': ''.join(response.xpath("//div[@class='post_info']/text()").getall())
        }
        self.log(items)
        yield items
scrapy crawl wangyinews -a article=IFJ1RHSS000189FH
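The -a option expects a single NAME=VALUE token, so there must be no spaces around the equals sign. The sketch below is a simplified stand-in for that parsing, not Scrapy's actual code; parse_spider_args is a hypothetical helper written only to show how the shell tokenizes the two spellings.

```python
import shlex


def parse_spider_args(tokens):
    """Collect the NAME=VALUE token after each -a flag into a dict (simplified model)."""
    args = {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "-a" and i + 1 < len(tokens):
            pair = tokens[i + 1]
            if "=" not in pair:
                raise ValueError(f"expected NAME=VALUE after -a, got {pair!r}")
            # Split on the first "=" only, so the value may itself contain "="
            key, value = pair.split("=", 1)
            args[key] = value
            i += 2
        else:
            i += 1
    return args


# shlex.split mimics how a POSIX shell breaks the command line into tokens
good = shlex.split("scrapy crawl wangyinews -a article=IFJ1RHSS000189FH")
bad = shlex.split("scrapy crawl wangyinews -a article = IFJ1RHSS000189FH")

print(parse_spider_args(good))

try:
    parse_spider_args(bad)
except ValueError as exc:
    # With spaces, "-a" is followed by the bare token "article"
    print("rejected:", exc)
```

Values passed this way always arrive as strings, so convert them inside the spider if a number or boolean is needed.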