This article shows how to scrape Xiaohongshu brand data in Python using coroutines (gevent). The complete script follows.
from gevent import monkey  # monkey patch: must run before the other imports
monkey.patch_all()

from gevent.pool import Pool
from queue import Queue
import requests
import json
from lxml import etree


class RedBookSpider():
    """Xiaohongshu spider"""

    def __init__(self, pages):
        """Set up the list URL template, headers, URL queue, and coroutine pool."""
        self.url = 'https://www.xiaohongshu.com/web_api/sns/v2/trending/page/brand?page={}&page_size=20'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Mobile Safari/537.36"
        }
        self.url_queue = Queue()
        self.pool = Pool(5)
        self.pages = pages

    def get_url(self):
        """Build the list-page URLs and put them on the queue."""
        # range() excludes its upper bound, so add 1 to crawl exactly `pages` pages
        for page in range(1, self.pages + 1):
            url = self.url.format(page)
            self.url_queue.put(url)

    def save_data(self, items):
        """Append one record per line to redbook.txt."""
        with open('./redbook.txt', 'a', encoding='utf-8') as f:
            f.write(str(items) + '\n')

    def deal_detail(self, detail_url, items, data):
        """Extract fields from the brand detail page."""
        resp = requests.get(url=detail_url, headers=self.headers)
        eroot = etree.HTML(resp.text)
        items['fans'] = eroot.xpath('//div[@data-v-64bff0ce]/div[@class="extra"]/text()')
        items['articles'] = eroot.xpath('//div/span[@class="stats"]/text()')
        items['introduce'] = eroot.xpath('//div[@class="desc"]/div[@class="content"]/text()')
        items['detail_url'] = detail_url
        items['image'] = data['page_info']['banner']
        print(items)
        self.save_data(items)

    def deal_response(self, resp):
        """Extract each brand from the list-page JSON and follow its detail page."""
        dict_data = json.loads(resp.text)
        dict_data = dict_data['data']
        for data in dict_data:
            items = {}
            items['name'] = data['page_info']['name']
            detail_url = 'https://www.xiaohongshu.com/page/brands/' + data['page_id']
            self.deal_detail(detail_url, items, data)

    def execute_task(self):
        """Take one URL off the queue, request it, and process the response."""
        url = self.url_queue.get()
        resp = requests.get(url=url, headers=self.headers)
        # print(resp.text)
        self.deal_response(resp)
        self.url_queue.task_done()

    def execute_task_finished(self, result):
        """Task callback: schedule the next task as soon as one finishes."""
        self.pool.apply_async(self.execute_task, callback=self.execute_task_finished)

    def run(self):
        """Start the crawl."""
        self.get_url()
        # Start 3 tasks; each reschedules itself through the callback
        for i in range(3):
            self.pool.apply_async(self.execute_task, callback=self.execute_task_finished)
        # Block until every queued URL has been marked done
        self.url_queue.join()


if __name__ == '__main__':
    user = RedBookSpider(4)  # number of pages to crawl; change as needed
    user.run()
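The queue-draining pattern in the script above (put URLs on a `Queue`, let a few workers call `get()` and `task_done()`, then block on `join()`) can be tried without network access. The sketch below is not the original spider: it swaps gevent coroutines for standard-library threads so it runs with no third-party packages, and `fake_fetch`, `build_page_urls`, `parse_brands`, and `crawl` are hypothetical names introduced only for illustration. The payload returned by `fake_fetch` is made-up data shaped like the fields the script reads (`data`, `page_id`, `page_info.name`), and the page range is inclusive on the assumption that `pages` means the number of pages to crawl.

```python
import json
import threading
from queue import Queue
from urllib.parse import urlparse, parse_qs

# URL template and detail prefix copied from the script above
BASE_URL = "https://www.xiaohongshu.com/web_api/sns/v2/trending/page/brand?page={}&page_size=20"
DETAIL_PREFIX = "https://www.xiaohongshu.com/page/brands/"


def build_page_urls(pages):
    """Return list-page URLs for pages 1..pages inclusive."""
    return [BASE_URL.format(page) for page in range(1, pages + 1)]


def parse_brands(text):
    """Extract the brand name and detail URL from a list-page JSON body."""
    payload = json.loads(text)
    return [
        {
            "name": entry["page_info"]["name"],
            "detail_url": DETAIL_PREFIX + entry["page_id"],
        }
        for entry in payload["data"]
    ]


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url).text; returns made-up data."""
    page = parse_qs(urlparse(url).query)["page"][0]
    return json.dumps({
        "data": [{"page_id": "id-" + page, "page_info": {"name": "brand-" + page}}]
    })


def crawl(pages, fetch=fake_fetch, workers=3):
    """Drain a URL queue with a fixed number of daemon worker threads."""
    url_queue = Queue()
    for url in build_page_urls(pages):
        url_queue.put(url)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = url_queue.get()          # blocks until a URL is available
            try:
                items = parse_brands(fetch(url))
                with lock:                 # results is shared between threads
                    results.extend(items)
            finally:
                url_queue.task_done()      # balance every get() so join() can return

    for _ in range(workers):
        threading.Thread(target=worker, daemon=True).start()

    url_queue.join()                       # returns once every URL is processed
    return sorted(results, key=lambda item: item["name"])


if __name__ == "__main__":
    for item in crawl(4):
        print(item)
```

Passing a different `fetch` callable (for example one that wraps `requests.get`) would point the same loop at the real endpoint. A real fetch function would also need its own error handling, since this sketch assumes `fetch` never raises.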