How to Implement a Multithreaded Crawler in Python

Published: 2021-07-02 15:22:08 Source: Yisu Cloud Views: 155 Author: Leah Category: Big Data

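This article walks through how to implement a multithreaded crawler in Python. The steps are covered in detail, so readers who are interested can use it as a reference.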

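Development environment:

Ubuntu 16.04, Python 3.6, bs4, virtualenv (virtual environment)

Creating the virtual environment:

Create the project folder, create a virtual environment for the project, and install the required packages with pip.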

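mkdir mutiThreadCrawier
cd mutiThreadCrawier
mkdir content  # stores the downloaded pages
virtualenv env --python=python3.6  # create the virtual environment
source env/bin/activate  # activate the virtual environment
pip install requests beautifulsoup4  # install the packages the crawler imports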

Imports:

import time
import re
import threading
import urllib.request  # urlopen lives in the urllib.request submodule
import requests
from bs4 import BeautifulSoup

Defining variables

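g_mutex = threading.Condition()  # provides acquire()/release() for locking
print(g_mutex)
print(type(g_mutex))
g_urls = []  # page source of each downloaded page, parsed later for links
g_queue_urls = []  # URLs waiting to be crawled
g_exist_urls = []  # URLs that have already been crawled
g_failed_urls = []  # URLs that failed to download
g_total_count = 0  # counter of downloaded pages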
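As a minimal sketch of how a lock protects a shared list, the snippet below has several threads record URLs into one list. The names `lock`, `seen_urls`, and `record`, and the example.com URLs, are illustrative assumptions, not part of the article's code. `threading.Condition` also supports the `with` statement, so the same pattern applies to `g_mutex`.

```python
import threading

lock = threading.Lock()  # hypothetical name; the article uses g_mutex
seen_urls = []           # shared list, analogous to g_exist_urls


def record(url):
    # "with" acquires the lock and always releases it, even if an
    # exception is raised inside the block.
    with lock:
        if url not in seen_urls:
            seen_urls.append(url)


# 20 threads report only 5 distinct (made-up) URLs.
threads = [
    threading.Thread(target=record, args=(f"http://example.com/{i % 5}",))
    for i in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(seen_urls))
```

Because the membership check and the append happen under the same lock, no URL can be added twice, whichever thread gets there first.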

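Defining the thread class:

Create a thread class that inherits from threading.Thread and define its constructor. The run method requests the page at the given URL and saves the HTML document to a local file. If the download fails, the exception is caught and the URL is recorded in g_failed_urls. Every URL that has been attempted is added to g_exist_urls.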

class CrawlerThread(threading.Thread):
    def __init__(self, url, filename, tid):
        threading.Thread.__init__(self)
        self.filename = filename
        self.url = url
        self.tid = tid

    def run(self):
        try:
            resp = urllib.request.urlopen(self.url)
            html = resp.read()
            with open('content/' + self.filename, 'wb') as f:
                f.write(html)
        except Exception as e:
            # Record the failure and stop; html is not defined on this path.
            with g_mutex:
                g_exist_urls.append(self.url)
                g_failed_urls.append(self.url)
            print(f'Failed to download page {self.url}: {e}')
            return
        # The shared lists are modified only while holding the lock.
        with g_mutex:
            g_urls.append(html)
            g_exist_urls.append(self.url)
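For comparison, the same download step can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor` instead of a `threading.Thread` subclass. This is an alternative technique, not the article's method. The names `download_batch`, `safe_fetch`, and `fake_fetch` and the `.test` URLs are assumptions made for illustration; `fake_fetch` stands in for a real network call so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def download_batch(urls, fetch, max_workers=10):
    """Fetch every URL in a thread pool; return (pages, failed)."""
    pages, failed = {}, []

    def safe_fetch(url):
        # Catch errors inside the worker so one bad URL does not abort the batch.
        try:
            return url, fetch(url), None
        except Exception as exc:
            return url, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order. Only the calling thread
        # touches pages/failed, so no explicit lock is needed here.
        for url, body, error in pool.map(safe_fetch, urls):
            if error is None:
                pages[url] = body
            else:
                failed.append(url)
    return pages, failed


def fake_fetch(url):
    # Stand-in for urllib.request.urlopen(url).read(); no network needed.
    if "bad" in url:
        raise OSError("simulated failure")
    return b"<html>" + url.encode() + b"</html>"


pages, failed = download_batch(["http://a.test/", "http://bad.test/"], fake_fetch)
print(sorted(pages), failed)
```

The pool caps the number of concurrent downloads at `max_workers`, which plays the same role as `thread_number` in the crawler class.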

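Defining the crawler class:

The constructor sets up the crawler and opens a log file. download() creates and starts a thread; update_queque_url() refreshes the list of URLs to crawl; get_Url() uses bs4 to extract links from a page; download_all() downloads in batches by calling download(). spider() is the entry point that drives the crawl.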

class Crawler:
    def __init__(self, name, domain, thread_number):
        self.name = name
        self.domain = domain
        self.thread_number = thread_number
        self.logfile = open('log.txt', 'w')
        self.thread_pool = []
        self.url = 'http://' + domain

    def spider(self):
        global g_queue_urls  # its contents change as the crawl proceeds
        g_queue_urls.append(self.url)  # initially the queue holds a single URL
        depth = 0  # crawl depth
        print(f'Crawler {self.name} starting........')
        while g_queue_urls:
            depth += 1
            print(f'Current crawl depth: {depth}')
            self.logfile.write(f'URL:{g_queue_urls[0]}\n')
            self.download_all()  # download everything in the queue
            self.update_queque_url()  # refresh the URL queue
            self.logfile.write(f'>>>Depth:{depth}\n')
            count = 0
            while count < len(g_queue_urls):
                self.logfile.write(f'Crawled {g_total_count} pages so far; queued URL {count}: {g_queue_urls[count]}\n')
                count += 1
        self.logfile.close()



    def download_all(self):
        global g_queue_urls
        global g_total_count
        i = 0
        while i < len(g_queue_urls):
            j = 0
            # Start at most thread_number threads per batch.
            while j < self.thread_number and i + j < len(g_queue_urls):
                g_total_count += 1
                print(g_queue_urls[i + j])
                thread_result = self.download(g_queue_urls[i + j], f"{g_total_count}.html", j)
                if thread_result is not None:
                    print(f'Thread {i + j} started')
                j += 1
            i = i + j
            # Wait up to 25 seconds for each thread in the batch, then reset the pool.
            for thread in self.thread_pool:
                thread.join(25)
            self.thread_pool = []
        g_queue_urls = []

    def download(self, url, filename, tid):
        print(url, filename, tid)
        crawler_thread = CrawlerThread(url, filename, tid)
        self.thread_pool.append(crawler_thread)
        crawler_thread.start()
        return crawler_thread  # lets download_all confirm the thread was started

    def update_queque_url(self):
        global g_queue_urls
        global g_exist_urls  # URLs already crawled
        new_urls = []  # newly discovered URLs
        for url_content in g_urls:
            new_urls += self.get_Url(url_content)  # extract new URLs from the page
        g_queue_urls = list(set(new_urls) - set(g_exist_urls))  # drop duplicates and URLs already crawled

    def get_Url(self, content):
        '''
        Extract URLs from the page source.
        '''
        links = []  # collected href values
        try:
            soup = BeautifulSoup(content, 'html.parser')
            for link in soup.find_all('a'):
                if link is not None and link.get('href') is not None:
                    if self.domain in link['href']:
                        # the link is an absolute URL on this site
                        links.append(link['href'])
                    elif len(link['href']) > 10 and 'http://' not in link['href']:
                        # the link is a relative URL
                        links.append(self.url + link['href'])
        except Exception as e:
            print("fail to get url", e)
        return links
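The relative-link handling above concatenates strings and relies on a length check. As a hedged alternative, the standard library's `urllib.parse.urljoin` can resolve each href against the page URL, and `urlparse` can filter by host. The function name `resolve_links` and the sample hrefs and page path are assumptions made for illustration; only the domain comes from the article.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(hrefs, page_url, domain):
    """Turn raw href values into absolute same-site URLs."""
    links = []
    for href in hrefs:
        # urljoin handles "/path", "page.html" and already-absolute URLs.
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        # Keep only http(s) links whose host is exactly the target domain.
        if parsed.scheme in ("http", "https") and parsed.netloc == domain:
            links.append(absolute)
    return links


found = resolve_links(
    ["/a/", "list_1.html", "http://other.example/x", "mailto:a@b.c"],
    "http://www.geyanw.com/html/index.html",  # hypothetical page path
    "www.geyanw.com",
)
print(found)
```

Comparing `netloc` exactly is stricter than the substring test `self.domain in link['href']`, which would also accept an off-site URL that merely contains the domain somewhere in its text.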

Main function

The main block calls the spider() method of the crawler class.

if __name__=="__main__":
    domain ="www.geyanw.com"
    thread_number=10
    name="geyan"
    crawler =Crawler(name,domain,thread_number)
    crawler.spider()
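The spider() loop keeps going until no new URLs are found, which on a large site can take a very long time. The sketch below shows the same level-by-level traversal with a depth limit, run against an in-memory link graph so that it needs no network. The `max_depth` parameter, the function `crawl_levels`, and the `graph` dictionary are all assumptions made for illustration.

```python
def crawl_levels(start, get_links, max_depth=3):
    """Breadth-first crawl; returns one list of URLs per depth level."""
    visited = set()
    frontier = [start]
    levels = []
    while frontier and len(levels) < max_depth:
        levels.append(frontier)
        visited.update(frontier)
        discovered = set()
        for url in frontier:
            discovered.update(get_links(url))
        # Next level: newly found URLs minus everything already crawled.
        frontier = sorted(discovered - visited)
    return levels


# Hypothetical link graph standing in for real pages.
graph = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c"],
    "/c": ["/d"],
    "/d": [],
}
levels = crawl_levels("/", lambda url: graph.get(url, []), max_depth=3)
print(levels)
```

The set difference `discovered - visited` does the same job as `set(new_urls) - set(g_exist_urls)` in update_queque_url(), while the depth limit stops the crawl before it reaches "/d".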

That concludes this walkthrough of implementing a multithreaded crawler in Python. Hopefully the material above is helpful.
