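This article covers the ways to traverse the document tree when writing a Python crawler. Hopefully you will get something useful out of it, so let's take a look.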
Traversing the document tree
1. Direct children: the .contents and .children attributes
.contents
A Tag's .contents attribute returns the Tag's child nodes as a list.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, using the lxml parser
soup = BeautifulSoup(html, "lxml")

# .contents is a list, so it can be printed whole or indexed
print(soup.head.contents)
print(soup.head.contents[0])
Output
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
.children
Unlike .contents, it does not return a list but an iterator, so we get the child nodes by looping over it.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, using the lxml parser
soup = BeautifulSoup(html, "lxml")

# Printing .children shows an iterator object, not the nodes themselves
print(soup.head.children)

# Loop over the iterator to get each direct child
for child in soup.head.children:
    print(child)
Output
<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>
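The head tag above has only one child, which hides a detail: direct children include text nodes as well as tags. The following is a minimal sketch of that case, not part of the original example. It assumes a shortened HTML snippet and the built-in "html.parser" parser instead of lxml, and the names p, child_tags and link_ids are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical snippet: text and <a> tags mixed inside one <p>
html = (
    '<p class="story">Once <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>; the end.</p>'
)

# Assumption: the built-in html.parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .contents is a list holding both text nodes and tags
print(len(p.contents))

# Keep only the Tag children while iterating over .children
child_tags = [child for child in p.children if isinstance(child, Tag)]
link_ids = [tag["id"] for tag in child_tags]
print(link_ids)
```

Filtering with isinstance is one simple option; if only certain tags are wanted, find_all with recursive=False is another way to restrict a search to direct children.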
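2. All descendants: the .descendants attribute
The .contents and .children attributes described above cover only a Tag's direct children. The .descendants attribute iterates recursively over all of a Tag's descendants. As with .children, we have to loop over it to get the nodes.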
#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, using the lxml parser
soup = BeautifulSoup(html, "lxml")

# Printing .descendants shows a generator object, not the nodes themselves
print(soup.head.descendants)

# Loop over the generator to get every descendant node
for child in soup.head.descendants:
    print(child)
Output
<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story
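To make the difference between direct children and descendants concrete, here is a small sketch that counts both on the same tag. It is an illustration under assumptions: the HTML snippet is invented, the built-in "html.parser" is used instead of lxml, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical snippet with one level of nesting inside the first <p>
html = "<div><p>One <b>bold</b> word</p><p>Two</p></div>"

# Assumption: the built-in html.parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Direct children: only the two <p> tags
direct = list(div.children)

# Descendants: every tag and text node below <div>, in document order
everything = list(div.descendants)

tag_names = [node.name for node in everything if isinstance(node, Tag)]
texts = [str(node) for node in everything if isinstance(node, NavigableString)]

print(len(direct), len(everything))
print(tag_names)
print(texts)
```

The descendants come out depth-first, so each tag is followed by its own contents before the next sibling appears.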
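3. Node content: the .string attribute
If a Tag has exactly one child and that child is a NavigableString, .string returns that child. If a Tag has exactly one child and that child is another Tag, .string also works, and the result is the same as the .string of that only child.
Put simply: if a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one other tag, .string returns the text inside that one as well. For example: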
#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, using the lxml parser
soup = BeautifulSoup(html, "lxml")

# <head> has a single child (<title>), so both lines print the same text
print(soup.head.string)
print(soup.head.title.string)
Output
The Dormouse's story
The Dormouse's story
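The rule above also implies a limit: when a tag has more than one child, .string cannot pick a single node and returns None. The sketch below shows that case along with two alternatives Beautiful Soup offers, .stripped_strings and get_text(). The HTML snippet and variable names are invented for illustration, and the built-in "html.parser" is assumed instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> has three children (text, <b>, text)
html = "<p>Hello <b>big</b> world</p>"

# Assumption: the built-in html.parser is used here instead of lxml
soup = BeautifulSoup(html, "html.parser")

# <b> has a single text child, so .string returns it
print(soup.b.string)

# <p> has several children, so .string is None
print(soup.p.string)

# .stripped_strings yields every text node with surrounding whitespace removed
pieces = list(soup.p.stripped_strings)
print(pieces)

# get_text() joins all text below the tag into one string
text = soup.p.get_text()
print(text)
```

Which alternative fits depends on whether the pieces are needed separately or as one block of text.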
After reading this article, you should have a working understanding of how to traverse the document tree in a Python crawler. Thanks for reading!