Python Natural Language Processing: An Introductory Tutorial on Using the NLTK Library [Classic]

Published: 2020-10-10 13:25:35 | Source: 腳本之家 | Views: 245 | Author: hzp666 | Category: Development

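This article uses examples to show how to use the NLTK library for natural language processing in Python. It is shared here for your reference; the details are as follows:

In this article, we discuss natural language processing (NLP) with Python. The tutorial uses the Python NLTK library, a popular Python library for natural language processing.

So what exactly is NLP, and what are the benefits of learning it?

Put simply, natural language processing (NLP) is about developing applications and services that can understand human language.

Applications of NLP that we meet in everyday life include speech recognition, speech translation, understanding the meaning of a sentence, finding synonyms for a given word, and writing grammatically correct, fluent sentences and paragraphs.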

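What NLP is for

As everyone knows, blogs, social networks, and web pages generate a huge amount of data every day, on the order of hundreds of millions of bytes.

Many companies are keen to collect all of this data so that they can better understand their users and how those users feel about their products, and adjust their products or services accordingly.

This data can reveal many things. For example, people in Brazil may be satisfied with product A, while people in the United States are more interested in product B. With NLP, this kind of information can be obtained immediately (that is, as real-time results). A search engine, for example, is a form of NLP that gives the right results to the right people at the right time.

But search engines are not the only application of natural language processing (NLP). There are other, even more impressive ones.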

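Applications of NLP

The following are some successful applications of natural language processing (NLP):

Search engines, such as Google and Yahoo. A search engine like Google uses NLP to learn that you are a technology enthusiast, so it returns technology-related results.
Social network feeds, such as the Facebook news feed. The feed algorithm uses natural language processing to learn your interests and shows you relevant ads and posts rather than unrelated ones.
Voice assistants, such as Apple's Siri.
Spam filters, such as Google's spam filter. This goes beyond ordinary spam filtering: today's spam filters analyze the content of an email to decide whether it is spam.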

NLP libraries

There are many open-source natural language processing (NLP) libraries today, for example:

Natural language toolkit (NLTK)
Apache OpenNLP
Stanford NLP suite
Gate NLP library

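The Natural Language Toolkit (NLTK) is among the most popular natural language processing (NLP) libraries. It is written in Python and backed by a strong community.

NLTK is also easy to get started with; in fact, it may well be the simplest NLP library you will use.

In this NLP tutorial, we will use the Python NLTK library. Before installing NLTK, I assume you already know some Python basics.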

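Installing NLTK

If you are on Windows, Linux, or Mac, you can install NLTK with pip: pip install nltk.

At the time of writing, NLTK could be used on Python 2.7, 3.4, and 3.5. Alternatively, you can install it from source by downloading the tar archive.

To check whether NLTK was installed correctly, open a Python terminal and type: import nltk. If the import runs without an error, NLTK has been installed successfully.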
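The same check can also be scripted. The following is a minimal sketch that uses only the standard library; the helper name is_installed is hypothetical and is not part of NLTK:

```python
import importlib.util

def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None

# Report whether NLTK is available without actually importing it.
print("nltk installed:", is_installed("nltk"))
```

If this prints False, run the pip command above in the same Python environment and try again.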

Once NLTK is installed, you can run the following code to install the NLTK data packages:

import nltk
nltk.download()

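This opens the NLTK downloader, where you can choose which packages to install.

You can choose to install all of them; they are small, so this is not a problem. Now, let's get started!

Tokenizing text with plain Python

First, we will fetch some web page content. Then we will analyze the page text to see what the page is about. We will use the urllib module to fetch the page: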

import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print (html)

As the printed output shows, the result contains a lot of HTML markup that needs to be cleaned up. We can use the BeautifulSoup library to process the fetched text:

from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
print (text)

Now we can turn the fetched page into clean text. Pretty nice, isn't it?

Finally, let's split the text into tokens as follows:

from bs4 import BeautifulSoup
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
print (tokens)
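One limitation of str.split() is easier to see on a small input. The sketch below uses a hypothetical sample sentence rather than the downloaded page, so it runs without network access:

```python
# A hypothetical sample sentence (not taken from the page fetched above).
sample = "Hello Adam, how are you?"

# str.split() with no arguments splits on runs of whitespace only.
tokens = sample.split()
print(tokens)
# ['Hello', 'Adam,', 'how', 'are', 'you?']
```

Punctuation stays attached to the neighbouring word, so 'Adam,' and 'Adam' would be counted as two different tokens.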

Counting word frequency

The text is now in much better shape than the raw HTML. Next, we use Python NLTK to count how often each word occurs. The FreqDist() class in NLTK does this:

from bs4 import BeautifulSoup
import urllib.request
import nltk
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
freq = nltk.FreqDist(tokens)
for key,val in freq.items():
  print (str(key) + ':' + str(val))

If you look at the output, you will find that the most frequent word is PHP.
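To see the idea without fetching a page, here is a minimal sketch that counts tokens with collections.Counter from the standard library, which FreqDist builds on. The token list is made up for illustration:

```python
from collections import Counter

# Hypothetical tokens standing in for the words scraped from the page.
tokens = ["PHP", "is", "a", "language", "PHP", "the", "PHP", "a"]

freq = Counter(tokens)

# most_common(n) returns the n most frequent tokens with their counts.
for word, count in freq.most_common(3):
    print(word + ":" + str(count))
```

Because FreqDist is a Counter subclass, the same most_common() call should also work on the freq object built from the page, which saves scanning the full printout for the top words.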

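You can plot these frequencies with the plot function: freq.plot(20, cumulative=False).

From the plot, you can be sure that this page is about PHP. Great! Some of the words, however, are ones such as "the," "of," "a," and "an." These are stop words. In general, stop words should be removed so that they do not skew the results.

Removing stop words with NLTK

NLTK ships with stop-word lists for many languages. To get the English stop words, you can use the following code: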

from nltk.corpus import stopwords
stopwords.words('english')

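Now let's modify our code and clean up the tokens before plotting. First, we make a copy of the list. Then we iterate over the tokens and remove the stop words from the copy: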

clean_tokens = tokens[:]
sr = set(stopwords.words('english'))  # load the stop-word list once, as a set
for token in tokens:
  if token in sr:
    clean_tokens.remove(token)

See the Python list functions for more on working with lists.
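The loop above can also be written as a list comprehension. This is a sketch under two assumptions: the stop-word set here is a tiny hand-written sample standing in for stopwords.words('english'), and tokens are lowercased before the lookup, which the code above does not do:

```python
# A tiny, hand-written stop-word set used for illustration only;
# in the tutorial this would be set(stopwords.words('english')).
stop_words = {"the", "of", "a", "an", "is"}

# Hypothetical tokens standing in for the scraped page text.
tokens = ["The", "history", "of", "PHP", "is", "a", "long", "story"]

# Keep a token only if its lowercase form is not a stop word.
clean_tokens = [t for t in tokens if t.lower() not in stop_words]
print(clean_tokens)
# ['history', 'PHP', 'long', 'story']
```

Lowercasing matters because the NLTK English list is all lowercase, so a capitalized "The" would otherwise be kept.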

The final code should look like this:

from bs4 import BeautifulSoup
import urllib.request
import nltk
from nltk.corpus import stopwords
response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
clean_tokens = tokens[:]
sr = set(stopwords.words('english'))  # load the stop-word list once, as a set
for token in tokens:
  if token in sr:
    clean_tokens.remove(token)
freq = nltk.FreqDist(clean_tokens)
for key,val in freq.items():
  print (str(key) + ':' + str(val))

If you check the plot now, it looks clearer than the previous one, because the stop words no longer get in the way.

freq.plot(20,cumulative=False)

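Tokenizing text with NLTK

We just saw how to split text into tokens with the split() function. Now we will see how to tokenize text with NLTK. Tokenization matters because text cannot be processed further until it has been tokenized. Tokenizing means breaking a larger piece of text into smaller units.

You can split a paragraph into sentences and, depending on your needs, split sentences into words. NLTK has a built-in sentence tokenizer and a built-in word tokenizer.

Suppose we have the following sample text: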

Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude.

To split this text into sentences, we can use the sentence tokenizer:

from nltk.tokenize import sent_tokenize
mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

The output is:

['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

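You might say this is easy: I don't need the NLTK tokenizer, and I can split sentences with a regular expression, because every sentence is preceded and followed by punctuation or a space.

Well, take a look at the following text: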

Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude.
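To make the problem concrete, here is a sketch of the naive approach: a regular expression that splits after a period, question mark, or exclamation mark followed by whitespace. The pattern is an illustration only, not something NLTK uses:

```python
import re

mytext = ("Hello Mr. Adam, how are you? I hope everything is going well. "
          "Today is a good day, see you dude.")

# Split wherever sentence-ending punctuation is followed by whitespace.
sentences = re.split(r"(?<=[.?!])\s+", mytext)
print(sentences)
# ['Hello Mr.', 'Adam, how are you?', 'I hope everything is going well.',
#  'Today is a good day, see you dude.']
```

The pattern cannot tell an abbreviation from the end of a sentence, so the first sentence is cut in two after "Mr.".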

Uh-oh! Mr. is a single word, even though it contains a period. Let's try tokenizing with NLTK:

from nltk.tokenize import sent_tokenize
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))

The output looks like this:

['Hello Mr. Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

Great! The result is exactly what we want. Next, let's try the word tokenizer to see how it works:

from nltk.tokenize import word_tokenize
mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))

The output is:

['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']

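As expected, Mr. is one word, and NLTK does treat it as one word. For sentence splitting, NLTK uses the PunktSentenceTokenizer from the nltk.tokenize.punkt module. This tokenizer is well trained and can tokenize text in many languages.

Tokenizing non-English text

To tokenize text in another language, specify the language like this: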

from nltk.tokenize import sent_tokenize
mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext,"french"))

The result will look like this:

['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]

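NLTK's support for languages other than English is also very good!

Getting synonyms from WordNet

You may remember that we installed NLTK data packages with nltk.download(). One of those packages is WordNet. WordNet is a database built for natural language processing. It includes groups of synonyms and a short definition for many words.

With NLTK, you can get the definition and example sentences for a given word: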

from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())

The result is:

a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']

WordNet contains definitions for a great many words:

from nltk.corpus import wordnet
syn = wordnet.synsets("NLP")
print(syn[0].definition())
syn = wordnet.synsets("Python")
print(syn[0].definition())

The result is:

the branch of information science that deals with natural language information
large Old World boas

You can use WordNet to get synonyms:

from nltk.corpus import wordnet
synonyms = []
for syn in wordnet.synsets('Computer'):
  for lemma in syn.lemmas():
    synonyms.append(lemma.name())
print(synonyms)

The output is:

['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']

Cool!
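The output above lists 'computer' twice, because the same lemma name can appear in more than one synset. A minimal sketch of tidying such a list follows; the list is copied by hand from the output above so that the example runs without WordNet, and the variable names are illustrative:

```python
# The list printed above, copied by hand so the sketch runs without WordNet.
synonyms = ['computer', 'computing_machine', 'computing_device',
            'data_processor', 'electronic_computer',
            'information_processing_system', 'calculator', 'reckoner',
            'figurer', 'estimator', 'computer']

# dict.fromkeys keeps the first occurrence of each name, in order.
unique = list(dict.fromkeys(synonyms))

# WordNet joins multi-word lemma names with underscores.
readable = [name.replace('_', ' ') for name in unique]
print(readable)
```

Using a set would also remove the duplicate, but it would not preserve the order in which WordNet returned the names.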

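Getting antonyms from WordNet

You can get antonyms of a word in the same way. The only extra step is to check, before adding anything to the list, whether the lemma actually has an antonym.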

from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets("small"):
  for l in syn.lemmas():
    if l.antonyms():
      antonyms.append(l.antonyms()[0].name())
print(antonyms)

The output is:

['large', 'big', 'big']

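That is the power of NLTK in natural language processing.

Stemming with NLTK

Stemming removes the affixes from a word and returns its root (for example, the stem of working is work). Search engines use this technique when indexing pages, so people who search with different forms of the same word all get the same pages related to that stem.

There are many stemming algorithms, but the most widely used is the Porter stemming algorithm. NLTK has a PorterStemmer class that implements it.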

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('working'))

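The result is:

work

The result is clear.

There are other stemming algorithms, such as the Lancaster stemming algorithm. Its output differs from that of the Porter algorithm on some words. You can try both to see where the results differ.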
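A side-by-side comparison might be sketched as follows, assuming NLTK is installed. The word list is just a handful of words used elsewhere in this tutorial, and no output is shown here because the point is to run it and compare the two columns yourself:

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# A few words taken from other examples in this tutorial.
words = ['working', 'increases', 'stones', 'speaking', 'purple']

for word in words:
    # Print the word followed by the Porter stem and the Lancaster stem.
    print(word, porter.stem(word), lancaster.stem(word))
```

Both stemmers are rule-based, so this comparison should not require any of the data packages from nltk.download().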

Stemming non-English words

In addition to English, the SnowballStemmer class supports 13 other languages. The supported languages are:

from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish'

You can use the stem() function of the SnowballStemmer class to stem non-English words, as shown below:

from nltk.stem import SnowballStemmer
french_stemmer = SnowballStemmer('french')
print(french_stemmer.stem("French word"))

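French-speaking readers are welcome to post their test results in the comments!

Lemmatizing words with WordNet

Lemmatization is similar to stemming, but the difference is that the result of lemmatization is a real word. With stemming, by contrast, you can end up with results like this: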

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('increases'))

The result is:

increas

Now, if we lemmatize the same word with the NLTK WordNet lemmatizer, the result is correct:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))

The result is:

increase

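The result may be a synonym or a different word with the same meaning. Sometimes, if you lemmatize a word such as playing, the result is still playing. This is because the lemmatizer treats the word as a noun by default. If you want the verb form, you can specify the part of speech as follows.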

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))

The result is:

play

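In practice, this gives a very good level of text compression, ending up at roughly 50% to 60% of the original text. The pos argument can be a verb, noun, adjective, or adverb: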

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))

The result is:

play
playing
playing
playing

The difference between stemming and lemmatization

OK, let's try stemming and lemmatizing a few words:

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('stones'))
print(stemmer.stem('speaking'))
print(stemmer.stem('bedroom'))
print(stemmer.stem('jokes'))
print(stemmer.stem('lisa'))
print(stemmer.stem('purple'))
print('----------------------')
print(lemmatizer.lemmatize('stones'))
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('bedroom'))
print(lemmatizer.lemmatize('jokes'))
print(lemmatizer.lemmatize('lisa'))
print(lemmatizer.lemmatize('purple'))

The result is:

stone
speak
bedroom
joke
lisa
purpl
----------------------
stone
speaking
bedroom
joke
lisa
purple

Stemming works on a word without knowing its context, which is why it is faster but less accurate than lemmatization.
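To illustrate why context-free suffix stripping is fast but can produce non-words, here is a deliberately crude sketch. The function toy_stem and its three rules are made up for this example; this is not the Porter algorithm, which uses many more rules and conditions:

```python
def toy_stem(word):
    """Strip one common English suffix; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["working", "increases", "stones", "lisa"]:
    print(word, "->", toy_stem(word))
# working -> work
# increases -> increas
# stones -> ston
# lisa -> lisa
```

Each word is handled with a few string checks and no dictionary lookup, which is cheap, but nothing guarantees that results such as increas or ston are real words.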

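In my opinion, lemmatization is better than stemming. Even when lemmatization cannot return the base form of a word, it still returns a real word; that word may be a synonym, but it is a real word either way. Sometimes, though, you do not care about accuracy and all you need is speed. In that case, stemming is the better choice.

All of the steps discussed in this NLP tutorial are text preprocessing. In future articles, we will discuss text analysis with Python NLTK.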

I hope this article helps you with your Python programming.
