Methods for Traversing the Document Tree in a Python Crawler

Published: 2020-08-07 13:46:16  Source: 億速云  Views: 151  Author: 小新  Column: Programming Languages

In this article I'll share the methods for traversing the document tree in a Python crawler. I hope you get something useful out of it, so let's dig in.

Traversing the Document Tree

1. Direct children: the .contents and .children attributes

.contents

A Tag's .contents attribute returns the tag's direct children as a list.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# .contents is a list, so it prints as one and can be indexed
print(soup.head.contents)
 
print(soup.head.contents[0])

Output

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
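
The head tag above has only one child, so it does not show that .contents mixes text nodes and tags. The following is a minimal sketch of that case. The snippet string and the variable names are made up for illustration, and it uses Python's built-in "html.parser" instead of lxml so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical snippet for illustration only (not the article's document).
snippet = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'

# Assumption: the built-in html.parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# .contents is a plain list holding text nodes and Tag nodes in document order.
kinds = [type(node).__name__ for node in p.contents]
print(kinds)

# Keep only the child tags and skip the text nodes between them.
child_tags = [node for node in p.contents if isinstance(node, Tag)]
child_ids = [tag["id"] for tag in child_tags]
print(child_ids)
```

Filtering with isinstance(node, Tag) is one simple way to ignore the text and whitespace nodes that sit between sibling tags.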

.children

Unlike .contents, .children does not return a list. It returns an iterator, so we loop over it to get each direct child.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Printing .children shows an iterator object, not a list
print(soup.head.children)
 
# Iterate to get every direct child node
for child in soup.head.children:
    print(child)

Output

<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>
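
To make the relationship between .children and .contents concrete, here is a small sketch. The list markup and variable names are hypothetical, and the built-in "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only.
snippet = "<ul><li>one</li><li>two</li><li>three</li></ul>"

# Assumption: the built-in html.parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")
ul = soup.ul

# Materializing the iterator yields the same nodes that .contents stores.
children_as_list = list(ul.children)
same_nodes = children_as_list == ul.contents
print(same_nodes)

# Because .children is iterable, it works with enumerate() and comprehensions.
labels = [f"{index}: {child.get_text()}" for index, child in enumerate(ul.children)]
print(labels)
```

If you need indexing or len(), use .contents. If you only need to loop once, .children is enough.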

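2. All descendants: the .descendants attribute

The .contents and .children attributes covered above include only a Tag's direct children. The .descendants attribute recursively walks all of a Tag's descendants: its children, their children, and so on. As with .children, we need to iterate over it to get at the contents.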

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
# Printing .descendants shows a generator object, not a list
print(soup.head.descendants)
 
# Iterate to get every descendant node (tags and text nodes alike)
for child in soup.head.descendants:
    print(child)

Output

<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story
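
The difference between direct children and descendants is easier to see on a slightly deeper tree. The sketch below uses a made-up snippet and hypothetical variable names, with the built-in "html.parser" assumed in place of lxml.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical snippet for illustration only.
snippet = "<div><p><b>bold</b> text</p><span>end</span></div>"

# Assumption: the built-in html.parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")
div = soup.div

# Direct children only: the <p> and the <span>.
direct_tags = [child.name for child in div.children if isinstance(child, Tag)]
print(direct_tags)

# Descendants are visited depth-first in document order, so <b> shows up too.
all_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
print(all_tags)

# Text nodes are descendants as well; convert them to plain str for display.
text_nodes = [str(node) for node in div.descendants if not isinstance(node, Tag)]
print(text_nodes)
```

Note that the text inside each tag is itself a descendant, which is why the article's output prints the title text on its own line after the title tag.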

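3. Node content: the .string attribute

If a Tag has exactly one child and that child is a NavigableString, .string returns that child. If a Tag has exactly one child and that child is another Tag, .string still works on the outer Tag: the result is the same as the .string of its only child.

Put simply: if a tag contains no other tags, .string returns the text inside it, and if a tag contains exactly one nested tag, .string returns that nested tag's text. In any other case, where the tag has more than one child, .string is None. The first two cases look like this: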

#!/usr/bin/python3
# -*- coding:utf-8 -*-
 
from bs4 import BeautifulSoup
 
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
 
# Create a Beautiful Soup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")
 
print(soup.head.string)
 
print(soup.head.title.string)

Output

The Dormouse's story
The Dormouse's story
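
The case where .string cannot pick a single child is worth seeing as well. This is a rough sketch on a made-up snippet with hypothetical variable names, assuming the built-in "html.parser" in place of lxml. It also shows .stripped_strings and get_text(), which Beautiful Soup provides for tags that hold several pieces of text.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only.
snippet = "<div><p>first</p><p>second</p></div>"

# Assumption: the built-in html.parser is used here instead of lxml.
soup = BeautifulSoup(snippet, "html.parser")
div = soup.div

# Two children, so .string has no single text node to return and gives None.
outer_string = div.string
print(outer_string)

# The first <p> has exactly one text child, so .string returns it.
inner_string = div.p.string
print(inner_string)

# .stripped_strings yields every text node under the tag, whitespace-trimmed.
texts = list(div.stripped_strings)
print(texts)

# get_text() joins all text under the tag, here with a space as the separator.
joined = div.get_text(" ")
print(joined)
```

Checking for None before using .string avoids surprises on tags that turn out to have more than one child.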

That wraps up this look at traversing the document tree in a Python crawler. For more on related topics, follow the 億速云 industry news channel. Thanks for reading!
