This article walks through using multiple threads in a Python 3 crawler to scrape design materials (article text, images, and downloadable packages). It covers the key points first, then gives the full source code at the end.
A few key points:
1. Fetching the material images
A material page may contain no images, a single image, or several, so all of these cases have to be handled.
Collect the image src attributes with XPath:
imgs = req.xpath('//div[@class="contentinfo"]/table//@src')
Check whether any images exist:
if imgs:
Iterate over the images:
for img in imgs:
Get each image's file extension:
suffix = os.path.splitext(img)[1]
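Put together, the snippets above can be sketched as follows. The `imgs` list here is a hypothetical stand-in for what the XPath query returns on a real page:

```python
import os

# Hypothetical src values, standing in for the result of
# req.xpath('//div[@class="contentinfo"]/table//@src')
imgs = ["/upload/pic/banner.jpg", "/upload/pic/icon.png"]

names = []
if imgs:  # a page may have no images at all
    for i, img in enumerate(imgs, start=1):
        suffix = os.path.splitext(img)[1]  # ".jpg", ".png", ...
        names.append(f"{i}{suffix}")       # number the files: 1.jpg, 2.png, ...
```

Numbering the files by position keeps the names unique even when several images share the same original filename.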
2. Checking whether the material can be downloaded
If download is permitted, grab the download data (the download URL and the package name); otherwise return an empty value:
if int(req.xpath('//div[@class="download"]/dl[@class="downlink"]/dd[1]/b/text()')[0]) == 0:
    down_url = req.xpath('//div[@class="download"]/dl[@class="downlink"]/dt/li/a/@href')[0]
    down_name = f'{h3}/{h3}.rar'
    down_data = down_url, down_name
    print(down_data)
else:
    down_data = []
3. Using queues, the usual pattern
A queue can only pass a single object per put()!
data = text_data, img_data, down_data
self.down_queue.put(data)
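A minimal, runnable illustration of the pattern, with placeholder values standing in for the scraped data:

```python
from queue import Queue

down_queue = Queue()

# Placeholder values standing in for the scraped page data
text_data = ("title", "title\narticle body")
img_data = ("http://www.uimaker.com/1.jpg", "1.jpg")
down_data = ("http://www.uimaker.com/pkg.rar", "pkg/pkg.rar")

# put() accepts exactly one object, so bundle the three pieces into a tuple
data = text_data, img_data, down_data
down_queue.put(data)

# The consumer unpacks the same tuple in a single step
text_out, img_out, down_out = down_queue.get()
```

Packing and unpacking a tuple this way keeps the producer and consumer agreeing on one record layout instead of juggling three separate queues.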
4. A few settings that keep the program running smoothly
a. Throttle requests with sleep
time.sleep(1)
b. Sizing the queues
As a rule, err on the large side: whichever queue carries more data gets the larger capacity.
page_queue = Queue(1000)
down_queue = Queue(1500)
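As a quick sketch with the sizes used in this article (the exact numbers are a judgment call, not a requirement):

```python
from queue import Queue

# down_queue sees more traffic (one parsed record per detail page, versus
# one URL per list page), so it gets the larger capacity
page_queue = Queue(1000)   # list-page URLs to crawl
down_queue = Queue(1500)   # parsed records waiting to be saved

page_queue.put("http://www.uimaker.com/uimakerdown/list_36_1.html")
```

A bounded queue also applies backpressure: once it is full, put() blocks until a consumer drains it, which keeps memory use in check.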
Scraping results: (screenshot omitted)
Full source code:
# -*- coding: UTF-8 -*-
# 20200516 by 微信公眾號:二爺記
import os
import time
import threading
from queue import Queue

import requests
from lxml import etree
from fake_useragent import UserAgent


# Producer: pulls list-page URLs off page_queue, parses each detail page,
# and pushes the parsed records onto down_queue
class Procuder(threading.Thread):
    def __init__(self, page_queue, down_queue, *args, **kwargs):
        super(Procuder, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.down_queue = down_queue
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}

    def run(self):
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse(url)

    def parse(self, url):
        print(f'>>> Scraping list page {url} ...')
        response = requests.get(url, headers=self.headers, timeout=6).content.decode("gbk")
        time.sleep(1)
        req = etree.HTML(response)
        urllist = req.xpath('//dl[@class="imglist"]/dt/ul[@class="listimg"]/li/span[@class="listpic"]/a/@href')
        print(len(urllist))
        print(urllist)
        for href in urllist:
            try:
                self.parse_page(href)
            except Exception as e:
                print(f'Failed to fetch detail-page data, error: {e}')

    def parse_page(self, url):
        print(f'>>> Scraping detail page {url} ...')
        response = requests.get(url, headers=self.headers, timeout=6).content.decode("gbk")
        time.sleep(1)
        req = etree.HTML(response)
        h3 = req.xpath('//div[@class="arcinfo"]/h3/text()')[0]
        print(h3)
        article = req.xpath('//div[@class="contentinfo"]/table//text()')
        article = ''.join(article).strip()
        print(article)
        texts = f'{h3}\n{article}'
        text_data = h3, texts
        # Collect every image as a (url, name) pair; an empty list means the
        # page has no images (the original version kept only the last image
        # and raised a NameError on image-free pages)
        img_data = []
        imgs = req.xpath('//div[@class="contentinfo"]/table//@src')
        for i, img in enumerate(imgs, start=1):
            img_url = f'http://www.uimaker.com{img}'
            suffix = os.path.splitext(img)[1]
            img_name = f'{i}{suffix}'
            img_data.append((img_url, img_name))
            print((img_url, img_name))
        if int(req.xpath('//div[@class="download"]/dl[@class="downlink"]/dd[1]/b/text()')[0]) == 0:
            down_url = req.xpath('//div[@class="download"]/dl[@class="downlink"]/dt/li/a/@href')[0]
            down_name = f'{h3}/{h3}.rar'
            down_data = down_url, down_name
            print(down_data)
        else:
            down_data = []
        data = text_data, img_data, down_data
        self.down_queue.put(data)


# Consumer: pulls parsed records off down_queue and saves the text,
# images, and downloadable package to disk
class Consumer(threading.Thread):
    def __init__(self, page_queue, down_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.down_queue = down_queue
        self.ua = UserAgent()
        self.headers = {"User-Agent": self.ua.random}

    def run(self):
        while True:
            if self.page_queue.empty() and self.down_queue.empty():
                break
            text_data, img_data, down_data = self.down_queue.get()
            h3, texts = text_data
            os.makedirs(f'{h3}/', exist_ok=True)  # create the output directory
            self.get_text(h3, texts)
            for img_url, img_name in img_data:
                self.get_downimg(h3, img_url, img_name)
            if down_data != []:
                down_url, down_name = down_data
                self.down(down_url, down_name)

    # Save the article text
    def get_text(self, h3, texts):
        print("Saving text content...")
        with open(f'{h3}/{h3}.txt', 'w', encoding="utf-8") as f:
            f.write(texts)
        print(">>> Text content saved!")

    # Download an image
    def get_downimg(self, h3, img_url, img_name):
        print("Downloading image...")
        r = requests.get(img_url, headers=self.headers, timeout=6)
        time.sleep(1)
        with open(f'{h3}/{img_name}', 'wb') as f:
            f.write(r.content)
        print(">>> Image downloaded!")

    # Download the material package
    def down(self, down_url, down_name):
        print("Downloading package...")
        r = requests.get(down_url, headers=self.headers, timeout=6)
        time.sleep(1)
        with open(down_name, 'wb') as f:
            f.write(r.content)
        print(">>> Package downloaded!")


def main():
    page_queue = Queue(1000)
    down_queue = Queue(1500)
    for i in range(1, 71):
        url = f"http://www.uimaker.com/uimakerdown/list_36_{i}.html"
        print(f'>>> Queuing list page {i}: {url} ...')
        page_queue.put(url)
    for x in range(3):
        t = Procuder(page_queue, down_queue)
        t.start()
    for x in range(6):
        t = Consumer(page_queue, down_queue)
        t.start()


if __name__ == '__main__':
    main()