Implementing Incremental Crawling with Scrapy
novel_ImageUrl = novel_ImageUrl,
_id = novel_ID,  # the novel ID serves as the unique identifier
novel_Writer = novel_Writer,
novel_Status = novel_Status,
novel_Words = novel_Words,
novel_UpdateTime = novel_UpdateTime,
novel_AllClick = novel_AllClick,
novel_MonthClick = novel_MonthClick,
novel_WeekClick = novel_WeekClick,
novel_AllComm = novel_AllComm,
novel_MonthComm = novel_MonthComm,
novel_WeekComm = novel_WeekComm,
novel_Url = novel_Url,
novel_Introduction = novel_Introduction,
)
return bookitem
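Because each item carries the novel's ID as its `_id`, the storage layer can skip books it has already saved, which is the core of incremental crawling. Below is a minimal, framework-free sketch of such a dedup pipeline; the class and field names (`IncrementalPipeline`, the in-memory `seen_ids` set) are illustrative assumptions, not the author's actual pipeline, and a real deployment would back `seen_ids` with a persistent store such as MongoDB or Redis.

```python
# Minimal sketch of an incremental-dedup pipeline (hypothetical names).
# Assumes each item dict carries a unique "_id" key, as in BookItem above.
class IncrementalPipeline:
    def __init__(self):
        # In-memory set for illustration only; use a persistent store
        # (MongoDB, Redis, a file) so dedup survives restarts.
        self.seen_ids = set()

    def process_item(self, item, spider=None):
        if item["_id"] in self.seen_ids:
            return None  # already stored: skip (incremental behaviour)
        self.seen_ids.add(item["_id"])
        return item      # new item: pass it on for storage
```

In a real Scrapy pipeline the skip branch would raise `scrapy.exceptions.DropItem` instead of returning `None`; returning `None` here keeps the sketch self-contained.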
2. Parsing chapter information
def parse_chapter_content(self, response):
    # An empty body means this URL was flagged as already crawled, so skip it.
    if not response.body:
        print(response.url + " has already been crawled, skipping")
        return
    ht = response.body.decode('utf-8')
    text = html.fromstring(ht)
    soup = BeautifulSoup(ht, "html.parser")
    novel_ID = response.url.split("/")[-2]
    novel_Name = text.xpath(".//p[@class='fr']/following-sibling::a[3]/text()")[0]
    chapter_Name = text.xpath(".//h1[1]/text()")[0]
    '''
    # Earlier attempts at extracting the chapter body, kept for reference:
    chapter_Content = "".join("".join(text.xpath(".//dd[@id='contents']/text()")).split())
    if len(chapter_Content) < 25:
        chapter_Content = "".join("".join(text.xpath(".//dd[@id='contents']//*/text()")))
    pattern = re.compile('dd id="contents".*?>(.*?)</dd>')
    match = pattern.search(ht)
    chapter_Content = "".join(match.group(1).replace("?", "").split()) if match else "crawl error"
    '''
    # Strip all HTML tags from the <dd id="contents"> block with a regex.
    result, number = re.subn("<.*?>", "", str(soup.find("dd", id='contents')))
    chapter_Content = "".join(result.split())
    print(len(chapter_Content))
    return ChapterItem(
        chapter_Url = response.url,
        _id = int(response.url.split("/")[-1].split(".")[0]),
        novel_Name = novel_Name,
        chapter_Name = chapter_Name,
        chapter_Content = chapter_Content,
    )
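The `if not response.body` check at the top of `parse_chapter_content` only works if something upstream returns an empty body for URLs that were crawled before. A common way to do that is a downloader middleware that consults a set of already-crawled URLs. The sketch below is framework-free and hypothetical (`IncrementalMiddleware`, the `fetch` callable, and the `crawled_urls` set are all assumptions standing in for Scrapy's `process_request(request, spider)` hook and a real download):

```python
# Sketch of the skip mechanism the spider's empty-body check relies on.
# Hypothetical names; in real Scrapy this would be a downloader middleware
# returning an empty HtmlResponse for URLs seen before.
class IncrementalMiddleware:
    def __init__(self, crawled_urls=None):
        # URLs crawled in previous runs; persist this set between runs.
        self.crawled_urls = crawled_urls or set()

    def process_request(self, request_url, fetch):
        """Return the page body, or b"" if the URL was already crawled.

        `fetch` is a callable standing in for the actual download.
        """
        if request_url in self.crawled_urls:
            return b""  # empty body -> the spider callback skips parsing
        self.crawled_urls.add(request_url)
        return fetch(request_url)
```

With this in place, re-running the crawl downloads only chapters that were not seen before, while previously crawled URLs short-circuit with an empty body.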