Scrapy Web Crawler Basics

2025-05-05

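Extracting Data with a Spider

The core of Scrapy crawler programming is the Spider component. A spider is a class that inherits from scrapy.Spider. Its main job is to build the HTTP requests sent to the website's server, parse the pages the site returns, and extract data from them.

Execution steps

1. The Spider generates the initial page requests (wrapped in Request objects) and submits them to the engine.
2. The engine tells the downloader to fetch the page as the Request specifies; the page is then wrapped in a Response object and passed back to the Spider as an argument.
3. The Spider parses the page content in the Response and either produces structured data (an Item) or generates a new request (for example, to crawl the next page), which is sent to the engine again.
4. If the engine receives a new Request, the process continues from step 2. If it receives an Item, the engine notifies other components to process the data (for example, saving it to a file or a database).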
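The four steps above can be mimicked in plain Python without Scrapy. The following is only a toy sketch of the control flow: the Request class, the PAGES table, and the download and run functions are all hypothetical stand-ins, not Scrapy's real engine or downloader.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for a request: a URL plus the callback that parses it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


# Fake "website": maps a URL to (records on that page, next page URL or None).
PAGES = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c"], None),
}


def download(request):
    """Stand-in for the downloader: returns the 'response' for a request."""
    return PAGES[request.url]


def parse(response):
    """Stand-in for Spider.parse: yields items, then possibly a new request."""
    records, next_url = response
    for record in records:
        yield {"value": record}
    if next_url is not None:
        yield Request(next_url, callback=parse)


def run(start_url):
    """Stand-in for the engine: loops over steps 1 to 4."""
    queue = deque([Request(start_url, callback=parse)])  # step 1
    items = []
    while queue:
        request = queue.popleft()
        response = download(request)                     # step 2
        for result in request.callback(response):        # step 3
            if isinstance(result, Request):
                queue.append(result)                     # step 4: back to step 2
            else:
                items.append(result)                     # step 4: hand the item on
    return items


collected = run("page1")
print(collected)
```

The real engine is asynchronous and schedules many requests at once, but the division of labor is the same: the callback only yields items or requests, and the engine decides what happens to each.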

import scrapy


class DingdianXuanhuanSpider(scrapy.Spider):
    # Spider name
    name = "dingdian_xuanhuan"
    # Allowed domains
    allowed_domains = ["www.xiaoshuopu.com"]
    # List of start URLs
    start_urls = ["https://www.xiaoshuopu.com/class_1/"]

    def parse(self, response):
        # Rows of the novel list
        novel_list = response.xpath("//table/tr[@bgcolor='#FFFFFF']")
        print("Number of novels:", len(novel_list))
        # Loop over the rows: title, latest chapter, author, word count, update time, status
        for novel in novel_list:
            # Title
            name = novel.xpath("./td[1]/a[2]/text()").extract_first()
            # Latest chapter
            new_chapter = novel.xpath("./td[2]/a/text()").extract_first()
            # Author
            author = novel.xpath("./td[3]/text()").extract_first()
            # Word count
            word_count = novel.xpath("./td[4]/text()").extract_first()
            # Update time
            update_time = novel.xpath("./td[5]/text()").extract_first()
            # Status
            status = novel.xpath("./td[6]/text()").extract_first()
            # Store the novel's data in a dict
            novel_info = {
                "name": name,
                "new_chapter": new_chapter,
                "author": author,
                "word_count": word_count,
                "update_time": update_time,
                "status": status
            }
            print("Novel info:", novel_info)
            # Return the data with yield
            yield novel_info

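name: required; it distinguishes one spider from another. A Scrapy project can contain several spiders, and no two of them may share the same name.
start_urls: the list of URLs of the target pages to crawl.
start_requests(): called automatically by the engine when the spider starts, and only once; it generates the initial Request objects. It does not appear in the code above because the base class implementation is used as-is.
parse(): the core method of the Spider class. The engine passes each downloaded page to parse() as an argument, and parse() extracts the data from that page.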

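Overriding the start_requests method

How can a crawler avoid being recognized and blocked by a website? By overriding start_requests, you can build a more capable Request object by hand. Features such as disguising the crawler as a browser or logging in automatically are all configured on the Request object. Overriding the method lets you:

Disguise the crawler as a browser.
Set a different callback for parsing the data instead of the default parse().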

import scrapy


class QdYuepiaoSpider(scrapy.Spider):
    name = "qd_yuepiao"
    allowed_domains = ["www.qidian.com"]
    start_urls = ["https://www.qidian.com/rank/yuepiao/"]

    # Request headers: a browser User-Agent (this is not a proxy setting)
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0"
    }

    # Override the initial requests
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        print("Data:", response.xpath("//div"))

Note: setting headers this simply will still be detected by some sites with anti-crawling measures.

A better approach is to enable and set the user agent in settings.py, so that every spider in the project uses it:

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0"

The Request object

A Request object describes an HTTP request. It is usually created in a Spider and executed by the downloader.

class Request(
    url: str,
    callback: ((...) -> Any) | None = None,
    method: str = "GET",
    headers: dict | None = None,
    body: bytes | str | None = None,
    cookies: dict | List[dict] | None = None,
    meta: dict | None = None,
    encoding: str = "utf-8",
    priority: int = 0,
    dont_filter: bool = False,
    errback: ((...) -> Any) | None = None,
    flags: List[str] | None = None,
    cb_kwargs: dict | None = None
)

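url: the URL of the HTTP request.
callback: the callback function, that is, the function that parses the page; when omitted, the spider's parse() is used. If an exception is raised while the request is being processed, errback is called instead.
method: the HTTP method, GET by default; conventionally written in uppercase.
headers: the HTTP request headers.
body: the HTTP request body.
cookies: the cookies sent with the request; they can be used to achieve automatic login.
meta: a dict for passing data along. It can carry data to other components and is also available on the resulting Response as response.meta.
encoding: the encoding of the request, UTF-8 by default.
priority: the priority of the request; requests with a higher priority are downloaded first.
dont_filter: False by default, which means duplicate requests for the same URL are filtered out. Set it to True to force the download even when the request is a duplicate.
errback: the function called when any exception is raised while processing the request.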

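Crawling multiple pages

Most websites have a pagination bar. To crawl data across multiple pages:

In the parse function, after extracting the current page's data and submitting it to the engine, find the URL of the next page, use it to create a new Request object, and submit that to the engine as well.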

import scrapy


class DangaoSpider(scrapy.Spider):
    name = "dangao"
    allowed_domains = ["sc.chinaz.com"]
    start_urls = ["https://sc.chinaz.com/tupian/dangaotupian.html"]

    def parse(self, response):
        # Locate the image elements and collect them in a list
        img_list = response.xpath("//div[@class='item']/img")
        for img in img_list:
            name = img.xpath("./@alt").extract_first()
            src = img.xpath("./@data-original").extract_first()
            img_info = {"name": name, "src": src}
            yield img_info

        # Get the URL of the next page
        next_url = response.xpath("//a[@class='nextpage']/@href").extract_first()
        if next_url is not None:
            next_url = "https://sc.chinaz.com/tupian/" + next_url
            print("Next page URL:", next_url)
            # Create a new Request object and hand it to the engine
            yield scrapy.Request(url=next_url, callback=self.parse)
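Concatenating a fixed prefix works only while the next-page href is relative to that exact directory. A more general option is to resolve the href against the URL of the current page. The sketch below uses the standard library's urljoin; the helper name next_page_url and the sample href values are assumptions for illustration.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a next-page href against the current page URL (None if there is no link)."""
    if href is None:
        return None
    return urljoin(page_url, href)


page_url = "https://sc.chinaz.com/tupian/dangaotupian.html"

# A relative href (hypothetical value) is resolved against the page's directory
print(next_page_url(page_url, "dangaotupian_2.html"))
# An href that is already absolute is returned unchanged
print(next_page_url(page_url, "https://sc.chinaz.com/tupian/dangaotupian_3.html"))
# No next-page link on the last page
print(next_page_url(page_url, None))
```

Inside a spider, response.urljoin(href) performs the same resolution against response.url, so the hard-coded prefix can be dropped.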

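Encapsulating Data with Item

An Item object is a simple container for collecting scraped data. It provides a dict-like API and a simple syntax for declaring the available fields.

Defining Item and Field

Create the corresponding class in items.py: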

import scrapy


class DingdianItem(scrapy.Item):
    # Title, author, latest chapter, word count, update time, status
    name = scrapy.Field()
    author = scrapy.Field()
    new_chapter = scrapy.Field()
    word_count = scrapy.Field()
    update_time = scrapy.Field()
    status = scrapy.Field()

Use it in the corresponding spider:

import scrapy

from qidian_yuepiao.items import DingdianItem


class DingdianXuanhuanSpider(scrapy.Spider):
    # Spider name
    name = "dingdian_xuanhuan"
    # Allowed domains
    allowed_domains = ["www.xiaoshuopu.com"]
    # List of start URLs
    start_urls = ["https://www.xiaoshuopu.com/class_1/"]

    def parse(self, response):
        # Rows of the novel list
        novel_list = response.xpath("//table/tr[@bgcolor='#FFFFFF']")
        print("Number of novels:", len(novel_list))
        # Loop over the rows: title, latest chapter, author, word count, update time, status
        for novel in novel_list:
            # Title
            name = novel.xpath("./td[1]/a[2]/text()").extract_first()
            # Latest chapter
            new_chapter = novel.xpath("./td[2]/a/text()").extract_first()
            # Author
            author = novel.xpath("./td[3]/text()").extract_first()
            # Word count
            word_count = novel.xpath("./td[4]/text()").extract_first()
            # Update time
            update_time = novel.xpath("./td[5]/text()").extract_first()
            # Status
            status = novel.xpath("./td[6]/text()").extract_first()
            # Store the novel's data in an Item
            novel_info = DingdianItem()
            novel_info["name"] = name
            novel_info["new_chapter"] = new_chapter
            novel_info["author"] = author
            novel_info["word_count"] = word_count
            novel_info["update_time"] = update_time
            novel_info["status"] = status
            print("Novel info:", novel_info)
            # Return the data with yield
            yield novel_info

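Populating the container with ItemLoader

When a project is large and many fields are extracted, the extraction rules multiply. Add the conversions applied to the extracted data and the code becomes bloated and hard to maintain.

To address this, Scrapy provides the item loader (ItemLoader) as a mechanism for populating the container. With it, you can configure the extraction rule for each Item field, run the raw data through processing functions, and finally assign the result.

The difference between Item and ItemLoader:

Item provides the container that holds the data; you have to store the data in it by hand.
ItemLoader provides the mechanism for populating the container, through three methods:

add_xpath: extract data with an XPath selector
add_css: extract data with a CSS selector
add_value: pass a value directly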

import scrapy
from scrapy.loader import ItemLoader

from qidian_yuepiao.items import DingdianItem


class DingdianXuanhuanSpider(scrapy.Spider):
    # Spider name
    name = "dingdian_xuanhuan"
    # Allowed domains
    allowed_domains = ["www.xiaoshuopu.com"]
    # List of start URLs
    start_urls = ["https://www.xiaoshuopu.com/class_1/"]

    def parse(self, response):
        # Rows of the novel list
        novel_list = response.xpath("//table/tr[@bgcolor='#FFFFFF']")
        print("Number of novels:", len(novel_list))
        # Loop over the rows: title, latest chapter, author, word count, update time, status
        for novel in novel_list:
            # Create an ItemLoader bound to this row
            novel_info = ItemLoader(item=DingdianItem(), selector=novel)
            # Title
            novel_info.add_xpath("name", "./td[1]/a[2]/text()")
            # Latest chapter
            novel_info.add_xpath("new_chapter", "./td[2]/a/text()")
            # Author
            novel_info.add_xpath("author", "./td[3]/text()")
            # Word count
            novel_info.add_xpath("word_count", "./td[4]/text()")
            # Update time
            novel_info.add_xpath("update_time", "./td[5]/text()")
            # Status
            novel_info.add_xpath("status", "./td[6]/text()")
            # Build the Item from the collected values and return it with yield
            item = novel_info.load_item()
            print("Novel info:", item)
            yield item

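Processing the data

Data extracted through an ItemLoader is also stored in lists. Previously, extract_first() was used to take the first element of the list. With a loader, the same is done with an input processor (input_processor) and an output processor (output_processor):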

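import scrapy
# In current Scrapy releases the processors live in the itemloaders package
# (older releases exposed them as scrapy.loader.processors)
from itemloaders.processors import TakeFirst


# Conversion function used as an input processor; it receives the list of extracted values
def change_status(status):
    # "连载中" is the site's label for a novel that is still being serialized
    if status[0] == "连载中":
        return 1
    else:
        return 2


class DingdianItem(scrapy.Item):
    # Title, author, latest chapter, word count, update time, status
    # TakeFirst() is a built-in processor that returns the first non-empty value in the list
    name = scrapy.Field(output_processor=TakeFirst())
    author = scrapy.Field(output_processor=TakeFirst())
    new_chapter = scrapy.Field(output_processor=TakeFirst())
    word_count = scrapy.Field(output_processor=TakeFirst())
    update_time = scrapy.Field(output_processor=TakeFirst())
    status = scrapy.Field(input_processor=change_status, output_processor=TakeFirst())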

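Processing Data with an Item Pipeline

After a Spider wraps the collected data in an Item, the Item is passed to the Item Pipeline component for further processing. Scrapy works like a crawling assembly line, and the Item Pipeline is its last stage. It is optional and disabled by default, so it must be activated before use. Several Item Pipeline components can be defined if needed; the data passes through each one in turn, and each performs its own processing.

Typical uses

Cleaning data
Validating data
Checking for duplicates and dropping them
Storing data in a file in a custom format
Saving data to a database

When a project is created, a pipelines.py file is generated automatically; write your own Item Pipelines in it.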

# Generated by default
class QidianYuepiaoPipeline:
    # process_item must be implemented; it handles each Item
    # item is the Item to process, spider is the spider that scraped it
    def process_item(self, item, spider):
        # Processing logic: turn the numeric status code back into a label
        if item["status"] == 1:
            item["status"] = "连载"
        else:
            item["status"] = "完结"
        return item


# Custom pipeline
class DingdianPipeline:
    def __init__(self):
        # Class initializer
        pass

    def process_item(self, item, spider):
        # Processing logic: replace the status labels with codes
        # str() keeps this working when status is already a number
        item["status"] = str(item["status"]).replace("连载", "1").replace("完结", "2")
        return item

Enabling the Item Pipeline

In the settings.py configuration file, uncomment the ITEM_PIPELINES block and list the pipelines to enable:

ITEM_PIPELINES = {
    # The class names must match the classes defined in pipelines.py
    "qidian_yuepiao.pipelines.DingdianPipeline": 100,
    "qidian_yuepiao.pipelines.QidianYuepiaoPipeline": 300,
}

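Each entry has the form project_name.pipelines.ClassName: priority, and a smaller number means a higher priority (the pipeline runs earlier). Settings in settings.py apply to every spider. To use a particular pipeline for a single spider only, specify it inside that spider, for example: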

class ScrapyASpider(scrapy.Spider):
    name = 'scrapyA'

    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.MyCustomPipelineForScrapyA': 300,
            # Other pipelines as needed...
        }
    }

    # The spider's own logic...
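The effect of the priority numbers can be previewed with a plain sort. This is only an illustration of the "smaller number runs first" rule using the two pipeline paths from this project, not a call into Scrapy itself.

```python
# Same shape as the ITEM_PIPELINES setting: dotted path -> priority
ITEM_PIPELINES = {
    "qidian_yuepiao.pipelines.QidianYuepiaoPipeline": 300,
    "qidian_yuepiao.pipelines.DingdianPipeline": 100,
}

# Items visit the pipelines in ascending order of the number
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(run_order, start=1):
    print(position, path, ITEM_PIPELINES[path])
```

The order in which the entries are written in the dict does not matter; only the numbers do.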

Saving to a file

from datetime import datetime


# Generated by default
class QidianYuepiaoPipeline:
    # File name, built from the current timestamp
    file_name = datetime.now().strftime("%Y%m%d%H%M%S") + ".txt"
    # File object
    file = None

    # When the spider opens, open the file
    def open_spider(self, spider):
        # Open the file in append mode
        self.file = open(self.file_name, "a", encoding="utf-8")

    # process_item must be implemented; it handles each Item
    # item is the Item to process, spider is the spider that scraped it
    def process_item(self, item, spider):
        # Write to the file
        self.file.write("Name: " + item["name"] + "\n")
        return item

    # When the spider closes, close the file
    def close_spider(self, spider):
        self.file.close()
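The same open_spider, process_item, close_spider pattern can store every field instead of only the name. The sketch below writes one JSON object per line; the class name JsonLinesPipeline, the default file name items.jsonl, and the sample items are assumptions for illustration, and only the standard library is used.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path="items.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Manual run outside Scrapy, writing into a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "items.jsonl")
    pipeline = JsonLinesPipeline(out_path)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"name": "Novel A", "status": 1}, spider=None)
    pipeline.process_item({"name": "Novel B", "status": 2}, spider=None)
    pipeline.close_spider(spider=None)
    with open(out_path, encoding="utf-8") as handle:
        saved = [json.loads(line) for line in handle]

print(saved)
```

ensure_ascii=False keeps Chinese titles readable in the output file instead of escaping them.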

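Case study

This builds on the cake-image example above. There we collected each image's name and address; here we additionally fetch the image's description from its detail page.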

import scrapy

# DanGaoItem is assumed to declare the fields name, url and desc in items.py
from scarpy_study.items import DanGaoItem


class DangaoSpider(scrapy.Spider):
    name = "dangao"
    allowed_domains = ["sc.chinaz.com"]
    start_urls = ["https://sc.chinaz.com/tupian/dangaotupian.html"]

    def parse(self, response):
        # Locate the image containers and collect them in a list
        img_list = response.xpath("//div[@class='item']")
        for img_item in img_list:
            # Get the image name and image address
            name = img_item.xpath("./img/@alt").extract_first()
            src = img_item.xpath("./img/@data-original").extract_first()
            img_info = {
                "name": name,
                "url": src
            }
            # Get the address of the image's detail page
            detail_url = img_item.xpath(
                "./div[@class='bot-div']/a/@href").extract_first()
            # print("Detail URL:", detail_url)
            if detail_url is not None:
                detail_url = 'https://sc.chinaz.com' + detail_url
                # Create a new request and pass the data along through meta
                yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                     meta={"img_info": img_info})

        # Get the URL of the next page
        next_url = response.xpath(
            "//a[@class='nextpage']/@href").extract_first()
        if next_url is not None:
            next_url = "https://sc.chinaz.com/tupian/" + next_url
            print("Next page URL:", next_url)
            # Create a new Request object and hand it to the engine
            yield scrapy.Request(url=next_url, callback=self.parse)

    # Parses the detail page
    def parse_detail(self, response):
        img_info = response.meta["img_info"]
        print("img_info:", img_info)
        # Record the image name and address
        dangaoItem = DanGaoItem()
        dangaoItem["name"] = img_info["name"]
        dangaoItem["url"] = img_info["url"]
        # Get the description
        desc = response.xpath("//p[@class='all-c']/text()").extract_first()
        dangaoItem["desc"] = desc
        print("Cake image info:", dangaoItem)
        yield dangaoItem

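The key to fetching the details is that, when the new Request object is created, callback names the function that parses the detail page and meta carries the image name and address collected earlier: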

# Create a new request and pass the data along through meta
yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                     meta={"img_info": img_info})

Article link: http://gantiao.com.cn/post/19535237.html
