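Although I have been writing Python for a year or two, I am still very much a beginner. Since overly sophisticated code is often hard for beginners like me to follow,
I decided to share a version that is hopefully less advanced. Maybe other beginners can use it as a reference, and it also serves as a record of my own learning.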
-------------------------------Code below--------------------------------------
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
# The first time you use stopwords, you may need to run the download line below
# nltk.download('stopwords')
stemmer = PorterStemmer()
# Lowercase + tokenization
text = "And Yugoslav authorities are planning the arrest of eleven coal miners and two opposition politicians on suspicion of sabotage, that's in connection with strike action against President Slobodan Milosevic. You are listening to BBC news for The World."
lower_text = text.lower()
words = lower_text.split(' ')
# porter's stemmer
por_stem_result = [stemmer.stem(word) for word in words]
result = ' '.join(por_stem_result)
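# Filter out stop words
words = result.split(' ')
# Build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]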
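# Save to a txt file, one word per line
with open("Output.txt", "w") as text_file:
    for item in filtered_words:
        # Strip the periods and commas left attached by splitting on spaces.
        # (A check like `if "." or "," in item` is always true, so it is dropped.)
        item = item.replace(".", "").replace(",", "")
        text_file.write(item + "\n")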
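
One thing to note about the script above: because the text is split on spaces, tokens such as "sabotage," and "milosevic." reach the stemmer with punctuation still attached, and the punctuation is only removed when the file is written. Below is a rough sketch of an alternative, assuming it is acceptable to strip punctuation during tokenization instead. The function name `preprocess` and its parameters `stem` and `stop_words` are my own names for illustration, not part of NLTK. The toy stemmer and the short stop word set are stand-ins so the snippet runs without NLTK installed; they are not the Porter algorithm or the NLTK stop word list.

```python
import re

def preprocess(text, stem, stop_words):
    """Lowercase, tokenize, stem, then drop stop words (same order as the script above)."""
    # Keep runs of letters, digits and apostrophes; "," and "." are dropped at this step.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    stemmed = [stem(token) for token in tokens]
    return [word for word in stemmed if word not in stop_words]

# Hypothetical stand-in for a real stemmer: only strips a trailing "s".
def toy_stem(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Hypothetical stand-in for a real stop word list.
toy_stop_words = {"and", "the", "of", "on", "are"}

print(preprocess("Eleven coal miners, and two politicians.", toy_stem, toy_stop_words))
```

With NLTK available, the same function could presumably be called as `preprocess(text, PorterStemmer().stem, set(stopwords.words('english')))`, and the returned list written to the file without any further punctuation cleanup.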