Scrapy's crawling throughput can be increased by running several crawls in parallel with multiple threads or multiple processes. Two ways to implement this are shown below.
Using threads (the threading module). Be aware that CrawlerProcess drives the Twisted reactor, which can only run in the main thread and only once per process, so starting crawls from several threads like this will usually fail at runtime; a more reliable single-process alternative is sketched after this example.

import threading

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider  # hypothetical import path; use your own spider class


def start_crawl(url):
    # Build a CrawlerProcess and run one crawl for the given start URL.
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider, start_urls=[url])
    process.start()


urls = ['http://example.com/page1', 'http://example.com/page2', 'http://example.com/page3']

threads = []
for url in urls:
    thread = threading.Thread(target=start_crawl, args=(url,))
    thread.start()
    threads.append(thread)

# Wait for every thread to finish
for thread in threads:
    thread.join()
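If the goal is simply to crawl several start URLs concurrently inside one process, the pattern Scrapy itself documents is to schedule every crawl on a single CrawlerProcess and let the Twisted reactor run them concurrently. A minimal sketch, assuming the same MySpider class (the myproject import path is a placeholder):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider  # hypothetical import path; use your own spider class

urls = ['http://example.com/page1', 'http://example.com/page2', 'http://example.com/page3']

process = CrawlerProcess(get_project_settings())
for url in urls:
    # Each call schedules one crawl; all of them run concurrently on the same reactor.
    process.crawl(MySpider, start_urls=[url])
process.start()  # blocks until every scheduled crawl has finished

Because this relies on Scrapy's built-in asynchronous concurrency rather than OS threads, it avoids the reactor and signal-handling problems of the threaded version.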
Using processes (the multiprocessing module). Each child process gets its own Python interpreter and its own Twisted reactor, so this pattern works reliably.

import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider  # hypothetical import path; use your own spider class


def start_crawl(url):
    # Run one complete crawl inside the child process.
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider, start_urls=[url])
    process.start()


if __name__ == '__main__':  # required on platforms that spawn child processes (e.g. Windows, macOS)
    urls = ['http://example.com/page1', 'http://example.com/page2', 'http://example.com/page3']
    processes = []
    for url in urls:
        process = multiprocessing.Process(target=start_crawl, args=(url,))
        process.start()
        processes.append(process)

    # Wait for every child process to finish
    for process in processes:
        process.join()
Note that both multithreaded and multiprocess crawling increase system resource consumption, especially memory and CPU usage, so choose the approach and the degree of parallelism that fit your actual workload and hardware.
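One rough way to keep that resource usage bounded is to cap how many crawls run at once with a process pool. A minimal sketch, again assuming the hypothetical MySpider import path; the pool size of 2 is an arbitrary example value:

import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider  # hypothetical import path; use your own spider class


def start_crawl(url):
    # One crawl per worker process, each with its own Twisted reactor.
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider, start_urls=[url])
    process.start()


if __name__ == '__main__':
    urls = ['http://example.com/page1', 'http://example.com/page2', 'http://example.com/page3']
    # processes=2 caps how many crawls run at the same time; maxtasksperchild=1
    # gives every crawl a fresh process so the reactor is never restarted in a reused worker.
    with multiprocessing.Pool(processes=2, maxtasksperchild=1) as pool:
        pool.map(start_crawl, urls)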