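This article walks through how to use BeautifulSoup4 in Python web scrapers. Many people are not familiar with the library, so the sections below summarize the basics.
Web scraping: the BeautifulSoup4 parser
BeautifulSoup makes parsing HTML fairly simple, and its API is very user-friendly. It supports CSS selectors, the HTML parser in the Python standard library, and lxml's HTML and XML parsers.
Compared with regular expressions, it is simpler to use.
Example:
First, import the bs4 library: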
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Pretty-print the contents of the soup object
    print(soup.prettify())
Output
    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ; and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
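The example above passes "lxml" as the parser name, which requires the third-party lxml package. As a minimal sketch, assuming you want the script to keep working when lxml is not installed, the parser name can be chosen at run time. The helper name make_soup and the shortened HTML string are illustrative and not part of the original example.

```python
from bs4 import BeautifulSoup

# Prefer lxml when it is installed; otherwise fall back to the parser
# that ships with the Python standard library.
try:
    import lxml  # noqa: F401  (imported only to check availability)
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"


def make_soup(markup):
    """Hypothetical helper: parse markup with whichever parser is available."""
    return BeautifulSoup(markup, PARSER)


html = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = make_soup(html)
print(PARSER)
print(soup.b.string)
```

The two parsers can build slightly different trees for incomplete markup (lxml wraps a fragment in html and body tags, html.parser does not), so it is worth sticking to one parser within a project.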
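The four object types
BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:
(1) Tag
(2) NavigableString
(3) BeautifulSoup
(4) Comment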
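To see the four types in one place, the sketch below walks a small tree and counts the class of each node. It assumes the standard-library "html.parser" and a made-up one-line HTML string, so the counts apply only to this snippet.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Illustrative markup: two tags, one text node, one comment.
html = '<p class="title"><b>The Dormouse\'s story</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

# soup.descendants yields every node below the root, depth first.
counts = Counter(type(node).__name__ for node in soup.descendants)

# The root of the tree is the BeautifulSoup object itself.
counts[type(soup).__name__] += 1

for name, count in sorted(counts.items()):
    print(name, count)
```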
1. Tag
Put simply, a Tag corresponds to an individual tag in the HTML, for example:
    <head><title>The Dormouse's story</title></head>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The HTML tags above, such as title, head, a, and p, together with the content they enclose, are Tags. Let's try fetching Tags with BeautifulSoup:
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Print the title tag
    print(soup.title)

    # Print the head tag
    print(soup.head)

    # Print the a tag
    print(soup.a)

    # Print the p tag
    print(soup.p)

    # Print the type of soup.p
    print(type(soup.p))
Output
    <title>The Dormouse's story</title>
    <head><title>The Dormouse's story</title></head>
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <class 'bs4.element.Tag'>
We can easily get these tags by writing soup followed by the tag name; the resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document. Retrieving every matching tag is covered later.
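As a brief preview of retrieving every match, the sketch below contrasts attribute-style access with find_all. It assumes the standard-library "html.parser" and a shortened version of the sample HTML in which the first link holds plain text instead of a comment.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first <a> in the document.
first_link = soup.a

# find_all returns every <a>, in document order.
all_links = soup.find_all("a")

print(first_link["id"])                 # link1
print([a["id"] for a in all_links])     # ['link1', 'link2', 'link3']
```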
A Tag has two important attributes: name and attrs.
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # The soup object is a special case: its name is [document]
    print(soup.name)

    # For any other tag, the value is the name of the tag itself
    print(soup.head.name)

    # Print all attributes of the p tag; the result is a dictionary
    print(soup.p.attrs)

    # Print the class attribute of the p tag
    print(soup.p['class'])

    # The get method, given the attribute name, is equivalent to the line above
    print(soup.p.get('class'))

    print(soup.p)

    # Modify an attribute
    soup.p['class'] = "newClass"
    print(soup.p)

    # Delete an attribute
    del soup.p['class']
    print(soup.p)
Output
    [document]
    head
    {'class': ['title'], 'name': 'dromouse'}
    ['title']
    ['title']
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
    <p name="dromouse"><b>The Dormouse's story</b></p>
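The output above shows class coming back as a list while name is a plain string. The sketch below, which assumes the standard-library "html.parser" and an illustrative p tag with two classes, shows that difference along with how square-bracket access and get behave when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Illustrative tag with a two-valued class attribute and no id.
html = '<p class="title main" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# class is a multi-valued attribute, so it is returned as a list.
print(p["class"])            # ['title', 'main']

# Most other attributes are returned as a single string.
print(p["name"])             # dromouse

# get returns None (or a supplied default) when the attribute is missing.
print(p.get("id"))           # None
print(p.get("id", "n/a"))    # n/a

# has_attr checks for presence without fetching the value.
print(p.has_attr("name"))    # True

# Square-bracket access raises KeyError for a missing attribute.
try:
    p["id"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)        # True
```

Using get is the safer choice when scraping pages where an attribute may not always be present.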
2. NavigableString
Now that we have the tag itself, how do we get the text inside it? Simple: use .string. For example:
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Print the text inside the p tag
    print(soup.p.string)

    # Print the type of soup.p.string
    print(type(soup.p.string))
Output
    The Dormouse's story
    <class 'bs4.element.NavigableString'>
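One caveat worth knowing: .string works above because the first p tag wraps a single chain of children. When a tag has several children, .string returns None, and get_text() is the usual way to collect all the text. The sketch below assumes the standard-library "html.parser" and a shortened, illustrative sentence.

```python
from bs4 import BeautifulSoup

# Illustrative paragraph with three children: text, an <a> tag, more text.
html = (
    '<p class="story">Once upon a time there were '
    '<a href="http://example.com/lacie">Lacie</a> and others.</p>'
)
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# With more than one child, .string cannot pick a single string.
print(p.string)          # None

# get_text() concatenates every text node inside the tag.
print(p.get_text())      # Once upon a time there were Lacie and others.

# The <a> tag has exactly one child, so .string works there.
print(p.a.string)        # Lacie

# NavigableString is a subclass of str, so string methods apply.
print(isinstance(p.a.string, str))   # True
print(p.a.string.upper())            # LACIE
```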
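3. BeautifulSoup
A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a special kind of Tag object, and we can get its type, name, and attributes: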
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Type of the name attribute
    print(type(soup.name))

    # Name
    print(soup.name)

    # Attributes
    print(soup.attrs)
Output
    <class 'str'>
    [document]
    {}
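The "special Tag" description can be checked directly. The sketch below, assuming the standard-library "html.parser" and a trimmed-down document, shows that the soup object is an instance of Tag, supports the same search methods, and sits at the root of the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup is a subclass of Tag, so the usual Tag API is available.
print(isinstance(soup, Tag))          # True

# Searching from the soup object works just like searching from a tag.
print(soup.find("title").string)      # The Dormouse's story

# The soup object is the parent of the top-level html tag.
print(soup.html.parent is soup)       # True
print(soup.title.parent.name)         # head
```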
4. Comment
A Comment object is a special type of NavigableString; when printed, its content does not include the comment markers.
    #!/usr/bin/python3
    # -*- coding:utf-8 -*-

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    # Create a BeautifulSoup object, specifying the lxml parser
    soup = BeautifulSoup(html, "lxml")

    # Print the first a tag, its string, and the type of that string
    print(soup.a)
    print(soup.a.string)
    print(type(soup.a.string))
Output
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
     Elsie 
    <class 'bs4.element.Comment'>
The content of the a tag is actually a comment, but when we print it with .string, the comment markers have already been removed.
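Because the markers are stripped, a comment can be mistaken for ordinary text. One way to guard against that, sketched below, is to check the type before using the value. The helper name visible_text is hypothetical, and the snippet assumes the standard-library "html.parser" and a two-link excerpt of the sample HTML.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = (
    '<a href="http://example.com/elsie" id="link1"><!-- Elsie --></a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")


def visible_text(tag):
    """Hypothetical helper: return tag.string unless it is a comment."""
    text = tag.string
    if isinstance(text, Comment):
        return None
    return text


for a in soup.find_all("a"):
    print(a["id"], repr(visible_text(a)))
```

Here the first link yields None because its only child is a comment, while the second yields its text.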