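This article shows how to use .NET Core to scrape a website's articles on a schedule and send them to a mailbox.
A tool that runs continuously needs logging. I use NLog for this, and its log archiving feature is very handy. HTTP requests can fail because of network problems, so I use Polly to retry them. HtmlAgilityPack parses the web pages, which requires some familiarity with XPath. The components are listed below: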
Component | Purpose | GitHub |
---|---|---|
NLog | Logging | https://github.com/NLog/NLog |
Polly | Retrying failed HTTP requests | https://github.com/App-vNext/Polly |
HtmlAgilityPack | Parsing web pages | https://github.com/zzzprojects/html-agility-pack |
MailKit | Sending email | https://github.com/jstedfast/MailKit |
If you are unfamiliar with any of these components, see its GitHub repository for documentation.
Reference article
https://www.jb51.net/article/112595.htm
Fetching and parsing the cnblogs home page
I use HttpWebRequest for the HTTP requests. Here is the simple wrapper class I wrote:
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    namespace CnBlogSubscribeTool
    {
        /// <summary>
        /// Simple Http Request Class
        /// .NET Framework >= 4.0
        /// Author:stulzq
        /// CreatedTime:2017-12-12 15:54:47
        /// </summary>
        public class HttpUtil
        {
            static HttpUtil()
            {
                // Set connection limit, default limit is 2
                ServicePointManager.DefaultConnectionLimit = 1024;
            }

            /// <summary>
            /// Default Timeout 20s
            /// </summary>
            public static int DefaultTimeout = 20000;

            /// <summary>
            /// Is Auto Redirect
            /// </summary>
            public static bool DefalutAllowAutoRedirect = true;

            /// <summary>
            /// Default Encoding
            /// </summary>
            public static Encoding DefaultEncoding = Encoding.UTF8;

            /// <summary>
            /// Default UserAgent
            /// </summary>
            public static string DefaultUserAgent =
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36";

            /// <summary>
            /// Default Referer
            /// </summary>
            public static string DefaultReferer = "";

            /// <summary>
            /// http get request
            /// </summary>
            /// <param name="url">Internet Address</param>
            /// <returns>string</returns>
            public static string GetString(string url)
            {
                var stream = GetStream(url);
                string result;
                // Disposing the reader also disposes the response stream
                using (StreamReader sr = new StreamReader(stream, DefaultEncoding))
                {
                    result = sr.ReadToEnd();
                }
                return result;
            }

            /// <summary>
            /// http post request
            /// </summary>
            /// <param name="url">Internet Address</param>
            /// <param name="postData">Post request data</param>
            /// <returns>string</returns>
            public static string PostString(string url, string postData)
            {
                var stream = PostStream(url, postData);
                string result;
                using (StreamReader sr = new StreamReader(stream, DefaultEncoding))
                {
                    result = sr.ReadToEnd();
                }
                return result;
            }

            /// <summary>
            /// Create Response
            /// </summary>
            /// <param name="url"></param>
            /// <param name="post">Is post Request</param>
            /// <param name="postData">Post request data</param>
            /// <returns></returns>
            public static WebResponse CreateResponse(string url, bool post, string postData = "")
            {
                var httpWebRequest = WebRequest.CreateHttp(url);
                httpWebRequest.Timeout = DefaultTimeout;
                httpWebRequest.AllowAutoRedirect = DefalutAllowAutoRedirect;
                httpWebRequest.UserAgent = DefaultUserAgent;
                httpWebRequest.Referer = DefaultReferer;
                if (post)
                {
                    var data = DefaultEncoding.GetBytes(postData);
                    httpWebRequest.Method = "POST";
                    httpWebRequest.ContentType = "application/x-www-form-urlencoded;charset=utf-8";
                    httpWebRequest.ContentLength = data.Length;
                    using (var stream = httpWebRequest.GetRequestStream())
                    {
                        stream.Write(data, 0, data.Length);
                    }
                }

                try
                {
                    var response = httpWebRequest.GetResponse();
                    return response;
                }
                catch (Exception e)
                {
                    // Wrap the failure with the request details, keeping the original as InnerException
                    throw new Exception(string.Format("Request error,url:{0},IsPost:{1},Data:{2},Message:{3}", url, post, postData, e.Message), e);
                }
            }

            /// <summary>
            /// http get request
            /// </summary>
            /// <param name="url"></param>
            /// <returns>Response Stream</returns>
            public static Stream GetStream(string url)
            {
                var stream = CreateResponse(url, false).GetResponseStream();
                if (stream == null)
                {
                    throw new Exception("Response error,the response stream is null");
                }
                else
                {
                    return stream;
                }
            }

            /// <summary>
            /// http post request
            /// </summary>
            /// <param name="url"></param>
            /// <param name="postData">post data</param>
            /// <returns>Response Stream</returns>
            public static Stream PostStream(string url, string postData)
            {
                var stream = CreateResponse(url, true, postData).GetResponseStream();
                if (stream == null)
                {
                    throw new Exception("Response error,the response stream is null");
                }
                else
                {
                    return stream;
                }
            }
        }
    }
Fetching the home page
    string html = HttpUtil.GetString("https://www.cnblogs.com");
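Parsing the data
We now have the HTML, but how do we extract the information we need (article title, URL, summary, author, publish time)? This is where HtmlAgilityPack comes in: it is a component that parses web pages using XPath.
Load the HTML we fetched earlier: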
    // requires: using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
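Looking at the page markup (the screenshot is not reproduced here), all the information for each article sits inside a div with class post_item, and the fields we want are in the div with class post_item_body. We start by selecting all of those divs: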
    // Get all article items (SelectNodes returns null when nothing matches)
    var itemBodys = doc.DocumentNode.SelectNodes("//div[@class='post_item_body']");
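Looking further, the article title is in the a tag under the h4 tag inside the post_item_body div, the summary is in the p tag with class post_item_summary, and the publish time and author are in the div with class post_item_foot. With that, we can extract the data we want: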
    // requires: using System; using System.Text.RegularExpressions;
    foreach (var itemBody in itemBodys)
    {
        // Title element
        var titleElem = itemBody.SelectSingleNode("h4/a");
        // Title text
        var title = titleElem?.InnerText;
        // Article URL
        var url = titleElem?.Attributes["href"]?.Value;

        // Summary element
        var summaryElem = itemBody.SelectSingleNode("p[@class='post_item_summary']");
        // Summary text, with line breaks removed
        var summary = summaryElem?.InnerText.Replace("\r\n", "").Trim();

        // Footer element of the item
        var footElem = itemBody.SelectSingleNode("div[@class='post_item_foot']");
        // Author
        var author = footElem?.SelectSingleNode("a")?.InnerText;
        // Publish time; fall back to "" so Regex.Match never receives null
        var publishTime = Regex.Match(footElem?.InnerText ?? "", "\\d+-\\d+-\\d+ \\d+:\\d+").Value;

        Console.WriteLine($"Title: {title}");
        Console.WriteLine($"URL: {url}");
        Console.WriteLine($"Summary: {summary}");
        Console.WriteLine($"Author: {author}");
        Console.WriteLine($"Publish time: {publishTime}");
        Console.WriteLine("----------------------------------------");
    }
Running this prints the fields we wanted for every article. Next we define a Blog class to hold them.
    public class Blog
    {
        /// <summary>
        /// Title
        /// </summary>
        public string Title { get; set; }

        /// <summary>
        /// Article URL
        /// </summary>
        public string Url { get; set; }

        /// <summary>
        /// Summary
        /// </summary>
        public string Summary { get; set; }

        /// <summary>
        /// Author
        /// </summary>
        public string Author { get; set; }

        /// <summary>
        /// Publish time
        /// </summary>
        public DateTime PublishTime { get; set; }
    }
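Blog.PublishTime is a DateTime, while the parsing loop extracts the publish time as a string. The article does not show the conversion, so the following is only a minimal sketch of one way to do it. PublishTimeParser is a hypothetical helper, and the "yyyy-M-d H:mm" format is an assumption about how the site prints its timestamps.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original source.
public static class PublishTimeParser
{
    // Same pattern the article uses to find the timestamp in the footer text.
    private static readonly Regex TimePattern = new Regex(@"\d+-\d+-\d+ \d+:\d+");

    // Tries to extract a timestamp from the footer text.
    // Returns false (and leaves publishTime at its default) when none is found.
    public static bool TryParse(string footText, out DateTime publishTime)
    {
        publishTime = default(DateTime);
        if (string.IsNullOrEmpty(footText))
        {
            return false;
        }

        var match = TimePattern.Match(footText);
        if (!match.Success)
        {
            return false;
        }

        // Assumption: four-digit year, one- or two-digit month, day and hour.
        return DateTime.TryParseExact(
            match.Value,
            "yyyy-M-d H:mm",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out publishTime);
    }
}
```

Inside the loop this could be called as `PublishTimeParser.TryParse(footElem?.InnerText, out var time)`, skipping or logging the item when it returns false.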
Retrying failed HTTP requests
We use Polly to retry when an HTTP request fails, configured for 3 retries.
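    // Initialize the retry policy (the field is named "TwoTimes" in the source, but it is configured for 3 retries)
    _retryTwoTimesPolicy = Policy
        .Handle<Exception>()
        .Retry(3, (ex, count) =>
        {
            _logger.Error("Execution failed! Retry {0}", count);
            _logger.Error("Exception from {0}", ex.GetType().Name);
        });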
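In a test run, when an exception occurs Polly retries for us three times. If all three retries fail, it gives up and the last exception propagates to the caller.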
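The article shows how the policy is built but not how a request is run through it. The sketch below wraps a fetch delegate in the same kind of policy and executes it. RetryDemo and FetchWithRetry are hypothetical names for illustration, and the code assumes the classic synchronous Polly API (Policy.Handle, Retry, Execute) that the article itself uses.

```csharp
using System;
using Polly;

// Hypothetical wrapper, not part of the original source.
public static class RetryDemo
{
    // Runs fetch; on any exception retries up to 3 more times,
    // calling onRetry(exception, retryNumber) before each retry.
    // If the final attempt also fails, the exception is rethrown.
    public static string FetchWithRetry(Func<string> fetch, Action<Exception, int> onRetry)
    {
        var policy = Policy
            .Handle<Exception>()
            .Retry(3, onRetry);

        return policy.Execute(fetch);
    }
}
```

With the article's classes this might be called as `RetryDemo.FetchWithRetry(() => HttpUtil.GetString("https://www.cnblogs.com"), (ex, n) => Console.WriteLine($"Retry {n}: {ex.Message}"))`. Note that one initial attempt plus 3 retries means the delegate can run up to 4 times.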
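Sending email
MailKit handles sending the email. It supports IMAP, POP3, and SMTP, and it is cross-platform. Below is a wrapper class I wrote based on what a fellow cnblogs user shared earlier: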
    using System.Collections.Generic;
    using CnBlogSubscribeTool.Config;
    using MailKit.Net.Smtp;
    using MimeKit;

    namespace CnBlogSubscribeTool
    {
        /// <summary>
        /// send email
        /// </summary>
        public class MailUtil
        {
            private static bool SendMail(MimeMessage mailMessage, MailConfig config)
            {
                // Dispose the client even when connecting or sending throws
                using (var smtpClient = new SmtpClient())
                {
                    smtpClient.Timeout = 10 * 1000; // timeout in milliseconds
                    // Connect to the remote SMTP server (no TLS, as in the original)
                    smtpClient.Connect(config.Host, config.Port, MailKit.Security.SecureSocketOptions.None);
                    smtpClient.Authenticate(config.Address, config.Password);
                    smtpClient.Send(mailMessage); // send the message
                    smtpClient.Disconnect(true);
                    return true;
                }
            }

            /// <summary>
            /// Send an email
            /// </summary>
            /// <param name="config">Mail configuration</param>
            /// <param name="receives">Recipients</param>
            /// <param name="sender">Reply-to address</param>
            /// <param name="subject">Subject</param>
            /// <param name="body">HTML body</param>
            /// <param name="attachments">Attachment content</param>
            /// <param name="fileName">Attachment file name</param>
            /// <returns></returns>
            public static bool SendMail(MailConfig config, List<string> receives, string sender, string subject, string body, byte[] attachments = null, string fileName = "")
            {
                var fromMailAddress = new MailboxAddress(config.Name, config.Address);
                var mailMessage = new MimeMessage();
                mailMessage.From.Add(fromMailAddress);

                foreach (var add in receives)
                {
                    // Parse works across MimeKit versions; the single-argument constructor was removed in newer releases
                    var toMailAddress = MailboxAddress.Parse(add);
                    mailMessage.To.Add(toMailAddress);
                }

                if (!string.IsNullOrEmpty(sender))
                {
                    var replyTo = new MailboxAddress(config.Name, sender);
                    mailMessage.ReplyTo.Add(replyTo);
                }

                var bodyBuilder = new BodyBuilder() { HtmlBody = body };

                // Attachment
                if (attachments != null)
                {
                    if (string.IsNullOrEmpty(fileName))
                    {
                        fileName = "untitled.txt";
                    }

                    var attachment = bodyBuilder.Attachments.Add(fileName, attachments);

                    // Avoid garbled Chinese file names
                    var charset = "GB18030";
                    attachment.ContentType.Parameters.Clear();
                    attachment.ContentDisposition.Parameters.Clear();
                    attachment.ContentType.Parameters.Add(charset, "name", fileName);
                    attachment.ContentDisposition.Parameters.Add(charset, "filename", fileName);

                    // Work around file names being limited to 41 characters
                    foreach (var param in attachment.ContentDisposition.Parameters)
                        param.EncodingMethod = ParameterEncodingMethod.Rfc2047;
                    foreach (var param in attachment.ContentType.Parameters)
                        param.EncodingMethod = ParameterEncodingMethod.Rfc2047;
                }

                mailMessage.Body = bodyBuilder.ToMessageBody();
                mailMessage.Subject = subject;

                return SendMail(mailMessage, config);
            }
        }
    }
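MailUtil depends on a MailConfig class from CnBlogSubscribeTool.Config that the article does not show. Judging only from the members MailUtil reads (Host, Port, Name, Address, Password), it needs at least the shape below. This is an inferred sketch; the real class in the source code may have more members or different types.

```csharp
namespace CnBlogSubscribeTool.Config
{
    // Inferred from how MailUtil uses it; not copied from the original source.
    public class MailConfig
    {
        // SMTP server host name
        public string Host { get; set; }

        // SMTP server port
        public int Port { get; set; }

        // Display name used for the From and Reply-To addresses
        public string Name { get; set; }

        // Sender address, also used as the SMTP login name
        public string Address { get; set; }

        // SMTP password
        public string Password { get; set; }
    }
}
```

A call would then look something like `MailUtil.SendMail(config, new List<string> { "reader@example.com" }, "", "Daily digest", "<p>...</p>")`, where the address, subject, and body are placeholder values.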
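Notes
I won't go into detail here about scheduling the scraping and email sending, or about handling data when the program exits abnormally. If you are interested, see the source code (the GitHub address is at the end of the article).
Scraped data is updated incrementally. I don't use the RSS feed because it updates rather slowly.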
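The article does not say how the incremental update is implemented. One simple approach, shown here purely as an illustration, is to remember the URLs already collected and keep only the ones not seen before. SeenFilter is a hypothetical class and is not taken from the source code.

```csharp
using System.Collections.Generic;

// Hypothetical helper illustrating incremental collection by URL.
public class SeenFilter
{
    private readonly HashSet<string> _seenUrls = new HashSet<string>();

    // Returns only the URLs that no earlier call has returned,
    // and remembers them for subsequent calls.
    public List<string> TakeNew(IEnumerable<string> urls)
    {
        var fresh = new List<string>();
        foreach (var url in urls)
        {
            // HashSet.Add returns false when the URL is already present
            if (!string.IsNullOrEmpty(url) && _seenUrls.Add(url))
            {
                fresh.Add(url);
            }
        }
        return fresh;
    }
}
```

Because the home page is polled repeatedly, most items in each poll have already been seen. Keying on the URL keeps the check cheap, though a long-running process would also need to persist or prune the set.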
[Screenshot: the complete program running]
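Each time an email is sent, the program sets its recorded time to 9:00 today. After every scrape it checks whether the current time minus the recorded time is at least 24 hours. If so, it sends the email and updates the recorded time.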
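The rule above can be sketched as a small class. SendSchedule and its member names are hypothetical; only the 9:00 reset and the 24-hour check come from the article.

```csharp
using System;

// Hypothetical sketch of the send-time rule described above.
public class SendSchedule
{
    // The recorded time that the 24-hour check is measured against
    public DateTime RecordTime { get; private set; }

    public SendSchedule(DateTime recordTime)
    {
        RecordTime = recordTime;
    }

    // True when at least 24 hours have passed since the recorded time
    public bool IsDue(DateTime now)
    {
        return now - RecordTime >= TimeSpan.FromHours(24);
    }

    // After sending, snap the recorded time to 9:00 on the day of sending
    public void MarkSent(DateTime now)
    {
        RecordTime = now.Date.AddHours(9);
    }
}
```

Snapping to 9:00 rather than storing the actual send time keeps the daily email from drifting later each day: even if a send happens at 9:05, the next one becomes due at 9:00 the following day.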
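[Screenshot: the received email]
In the screenshot the email subject says the 13th while the content is from the 14th. That is because, for the demo, I copied today's data (the 14th) into the data for the 13th, so don't let it mislead you.
An attachment is also included to make the articles easier to collect and organize.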