
Data Modeling and Pipelines in Scrapy

Flipkart印度出海王 | 2025-05-05


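1. Data Modeling

In a typical project, the data model is defined in items.py.

Why model the data

- Defining an item means planning in advance which fields to scrape. This guards against typos, because once the fields are declared they are checked automatically at runtime.
- Together with comments, the item definition makes it clear which fields are being scraped. A field that has not been declared cannot be set. When there are only a few target fields, a plain dict can be used instead.
- Some Scrapy components rely on Item support, such as the ImagesPipeline pipeline class.

How to model

Define the fields to extract in items.py: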

import scrapy


class MyspiderItem(scrapy.Item):
    name = scrapy.Field()   # lecturer's name
    title = scrapy.Field()  # lecturer's job title
    desc = scrapy.Field()   # lecturer's introduction
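To make the runtime check concrete, here is a minimal sketch that does not use Scrapy at all. `StrictItem` and `TeacherItem` are hypothetical stand-ins written for illustration; they only mimic the idea that assigning to an undeclared field raises a `KeyError`, and they are not how `scrapy.Item` is implemented.

```python
class StrictItem(dict):
    """Simplified stand-in for scrapy.Item: only declared fields may be set."""

    fields = ()  # subclasses list their allowed field names here

    def __setitem__(self, key, value):
        # Only item[key] = value is checked in this sketch.
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class TeacherItem(StrictItem):  # hypothetical name, mirrors MyspiderItem
    fields = ("name", "title", "desc")


item = TeacherItem()
item["name"] = "Alice"  # declared field: accepted

try:
    item["nmae"] = "Alice"  # typo: rejected instead of being stored silently
    error = None
except KeyError as exc:
    error = str(exc)
```

With a plain dict the misspelled key would simply be stored, and the mistake would only surface later in the exported data.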

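How to use the item class

Once the item class is defined, import it in the spider and instantiate it. After that it is used in the same way as a dict.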

from myspider.items import MyspiderItem  # import the Item; mind the path

...

    def parse(self, response):
        # node: a selector for one lecturer entry, obtained earlier in parse (not shown)
        item = MyspiderItem()  # ready to use once instantiated
        item['name'] = node.xpath('./h3/text()').extract_first()
        item['title'] = node.xpath('./h4/text()').extract_first()
        item['desc'] = node.xpath('./p/text()').extract_first()
        print(item)

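Note:

In the line from myspider.items import MyspiderItem, make sure the import path of the item is correct, and ignore the error that PyCharm flags. The rule of thumb for import paths in Python: import relative to the location the program is run from.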

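2. Using Pipelines

Commonly used pipeline methods:

- process_item(self, item, spider): required in every pipeline class; it processes the item data and must return the item
- open_spider(self, spider): runs once, when the spider is opened
- close_spider(self, spider): runs once, when the spider is closed

Modifying the pipeline file

Complete the code in pipelines.py: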

import json

from pymongo import MongoClient


class BaiduFilePipeline:

    def open_spider(self, spider):  # runs once, when the spider is opened
        if spider.name == 'baidu':
            self.f = open('json.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):  # runs once, when the spider is closed
        if spider.name == 'baidu':
            self.f.close()

    def process_item(self, item, spider):
        if spider.name == 'baidu':
            self.f.write(json.dumps(dict(item), ensure_ascii=False, indent=2) + ',\n')
        # Without this return, pipelines that run later never receive the item
        return item


class WangyiMongoPipeline:

    def open_spider(self, spider):  # runs once, when the spider is opened
        if spider.name == 'baidu':
            # isinstance() can also be used to tell spider classes apart
            self.client = MongoClient(host='127.0.0.1', port=27017)  # create the MongoClient
            # handle for the collection "teachers" in the database "baidu"
            self.collection = self.client.baidu.teachers

    def close_spider(self, spider):  # runs once, when the spider is closed
        if spider.name == 'baidu':
            self.client.close()

    def process_item(self, item, spider):
        if spider.name == 'baidu':
            # The document to insert must be a dict, so convert the Item first.
            # insert_one() replaces Collection.insert(), which was removed in PyMongo 4.
            self.collection.insert_one(dict(item))
        # Without this return, pipelines that run later never receive the item
        return item
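The file pipeline above appends a comma after every object, so json.txt as a whole is not a valid JSON document. One possible alternative, sketched below, is to write one JSON object per line (the JSON Lines layout). The class name `JsonLinesPipeline`, the default file name, and the sample items are assumptions made for illustration. Since a pipeline is an ordinary Python class, the sketch drives it by hand, with a `SimpleNamespace` standing in for a real spider.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path='items.jsonl'):
        self.path = path

    def open_spider(self, spider):
        self.f = open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # pass the item on to any later pipeline


# Drive the pipeline by hand; SimpleNamespace stands in for a real spider.
spider = SimpleNamespace(name='baidu')
path = os.path.join(tempfile.mkdtemp(), 'teachers.jsonl')

pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider)
pipeline.process_item({'name': '张三', 'title': 'lecturer'}, spider)
pipeline.process_item({'name': 'Li', 'title': 'professor'}, spider)
pipeline.close_spider(spider)

# Every line parses on its own, so the file can be read back record by record.
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]
```

Because each line is complete in itself, appending across several crawls keeps the file readable, which the comma-separated layout does not guarantee.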

Enabling the pipelines

Enable the pipelines in settings.py:

ITEM_PIPELINES = {
    'myspider.pipelines.BaiduFilePipeline': 400,    # 400 is the priority value
    'myspider.pipelines.WangyiMongoPipeline': 500,  # the smaller the value, the earlier the pipeline runs
}
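The effect of the priority values, and of forgetting to return the item, can be sketched without Scrapy. `run_chain` below is a simplified driver written for illustration, not Scrapy's engine code, and the three pipeline classes are hypothetical.

```python
class CleanPipeline:
    def process_item(self, item, spider):
        item['name'] = item['name'].strip()
        return item


class ForgetfulPipeline:
    def process_item(self, item, spider):
        item['seen'] = True
        # Deliberate bug: no return, so the next stage receives None


class RecordPipeline:
    def __init__(self):
        self.received = []

    def process_item(self, item, spider):
        self.received.append(item)
        return item


def run_chain(priorities, item, spider=None):
    """Call each pipeline's process_item in ascending order of its value."""
    for pipeline in sorted(priorities, key=priorities.get):
        item = pipeline.process_item(item, spider)
    return item


# The order of the dict entries does not matter; only the values do.
recorder = RecordPipeline()
run_chain({recorder: 500, CleanPipeline(): 400}, {'name': '  Alice '})

# A pipeline that forgets to return hands None to the one after it.
broken = RecordPipeline()
run_chain({broken: 500, ForgetfulPipeline(): 400}, {'name': 'Bob'})
```

In the first run the recorder sees the already cleaned item; in the second it sees only `None`.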

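Question: settings.py can enable several pipelines at once. Why would more than one be needed?

- Different pipelines can handle the data of different spiders, told apart by the spider.name attribute.
- Different pipelines can apply different processing steps to the data of one or more spiders, for example one pipeline cleans the data and another saves it.
- A single pipeline class can also handle the data of different spiders, again told apart by the spider.name attribute.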
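As a small illustration of one pipeline class serving several spiders, the sketch below branches on `spider.name`. The class name `NormalizePipeline`, the spider name `wangyi`, and the `salary` field are all assumptions made up for this example.

```python
from types import SimpleNamespace


class NormalizePipeline:
    """Hypothetical pipeline shared by two spiders, branching on spider.name."""

    def process_item(self, item, spider):
        if spider.name == 'baidu':
            # baidu items: trim stray whitespace around the name
            item['name'] = item['name'].strip()
        elif spider.name == 'wangyi':
            # wangyi items: store the salary as a number instead of text
            item['salary'] = int(item['salary'])
        return item


# SimpleNamespace stands in for real spider objects.
pipeline = NormalizePipeline()
baidu_item = pipeline.process_item({'name': '  Alice '}, SimpleNamespace(name='baidu'))
wangyi_item = pipeline.process_item({'salary': '12000'}, SimpleNamespace(name='wangyi'))
```

Items from any other spider pass through unchanged, because the method still returns the item when neither branch matches.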

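Points to note when using pipelines

- A pipeline must be enabled in settings.py before it is used.
- In the setting, each key is the import path of a pipeline class (so a pipeline can live anywhere in the project), and each value (by convention between 0 and 1000) represents its distance from the engine. The closer a pipeline is, the earlier the data passes through it, so pipelines with smaller values run first.
- When there are several pipelines, process_item must return the item; otherwise the next pipeline receives None.
- Every pipeline must define process_item; without it the item cannot be received or processed.
- process_item receives item and spider, where spider is the spider that passed the item in.
- open_spider(spider): runs once, when the spider is opened.
- close_spider(spider): runs once, when the spider is closed.
- These two methods are often used for the interaction between a spider and a database: the connection is opened when the spider starts and closed when the spider finishes.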
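The open/close pattern for database connections can be tried out with the standard library alone. The sketch below uses an in-memory SQLite database as a stand-in for a real database server; the class name `SqlitePipeline`, the table layout, and the sample rows are assumptions for illustration.

```python
import sqlite3
from types import SimpleNamespace


class SqlitePipeline:
    """Hypothetical pipeline: one connection per crawl, opened and closed once."""

    def open_spider(self, spider):
        # Connect once, when the spider starts
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute('CREATE TABLE teachers (name TEXT, title TEXT)')

    def process_item(self, item, spider):
        # Reuse the open connection for every item
        self.conn.execute(
            'INSERT INTO teachers (name, title) VALUES (?, ?)',
            (item['name'], item['title']),
        )
        return item

    def close_spider(self, spider):
        # Commit and disconnect once, when the spider finishes
        self.conn.commit()
        self.row_count = self.conn.execute('SELECT COUNT(*) FROM teachers').fetchone()[0]
        self.conn.close()


# Drive the lifecycle by hand; SimpleNamespace stands in for a real spider.
spider = SimpleNamespace(name='baidu')
pipeline = SqlitePipeline()
pipeline.open_spider(spider)
for row in [{'name': 'Alice', 'title': 'lecturer'}, {'name': 'Bob', 'title': 'professor'}]:
    pipeline.process_item(row, spider)
pipeline.close_spider(spider)
```

Opening the connection in `process_item` instead would reconnect for every single item, which is the cost this pattern avoids.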

3. Where to Write the Code
