Web scraping is fun and very useful. There is a lot of information on the internet, and building applications that use it is easy with C# and the NuGet package Html Agility Pack.
Html Agility Pack is an HTML parser written in C# that reads and writes the DOM and supports plain XPath or XSLT.
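To illustrate the "read and write" part, here is a minimal sketch that parses a string, changes the first h1 through XPath, and serializes the document again. The class name DomReadWriteDemo and the method RenameHeading are made up for this illustration; only the HtmlAgilityPack calls (LoadHtml, SelectSingleNode, InnerHtml, OuterHtml) come from the library.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Hypothetical demo class, not part of Html Agility Pack.
public static class DomReadWriteDemo
{
    // Replaces the text of the first <h1> and returns the modified HTML.
    public static string RenameHeading(string html, string newText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode h1 = doc.DocumentNode.SelectSingleNode("//h1");
        if (h1 != null)
        {
            // Encode the text so characters like & stay valid HTML.
            h1.InnerHtml = WebUtility.HtmlEncode(newText);
        }

        // OuterHtml of the document node serializes the whole document.
        return doc.DocumentNode.OuterHtml;
    }
}
```

If the input has no h1 at all, the method simply returns the document unchanged.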
Code example
Scrape Title, H1, H2 and all Paragraphs
-First add Html Agility Pack to your project through NuGet.
-In the code-behind add: using HtmlAgilityPack;
string scrapeData = TextBoxScrapeData.Text;
try
{
    var doc1 = new HtmlDocument();
    doc1.LoadHtml(scrapeData);

    //DOM elements to scrape
    //SelectSingleNode returns null when nothing matches, so fall back to ""
    //title
    string title = doc1.DocumentNode.SelectSingleNode("//head/title")?.InnerText ?? "";
    //h1
    string h1 = doc1.DocumentNode.SelectSingleNode("//h1")?.InnerText ?? "";
    //h2
    string h2 = doc1.DocumentNode.SelectSingleNode("//h2")?.InnerText ?? "";

    //paragraphs
    string p = "";
    var paragraphs = doc1.DocumentNode.SelectNodes("//p");
    if (paragraphs != null) //SelectNodes also returns null when nothing matches
    {
        foreach (HtmlNode r in paragraphs)
        {
            //stripHTML is a helper that is not shown in this post
            p += Environment.NewLine + stripHTML(r.InnerHtml) + Environment.NewLine;
        }
    }

    //Now you can use the scraped values on the page
    //(title is scraped too and can be used the same way)
    TextBoxPageH1.Text = h1;
    TextBoxPageH2.Text = h2;
    TextBoxPageContent.Text = p;
}
catch (Exception ex)
{
    //error
    messageScrape.Text = "Error scraping: " + ex;
}
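The example calls a stripHTML helper that is not shown in this post. Below is a minimal sketch of what such a helper might look like, assuming its job is to turn a fragment of inner HTML into plain text. The class ScrapeHelpers and the methods StripHtml and TextOrEmpty are hypothetical names for illustration, not the original helper; rename StripHtml to stripHTML if you want to drop it into the code above.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helpers; the post's real stripHTML may work differently.
public static class ScrapeHelpers
{
    // Removes tags from an HTML fragment and decodes entities such as &amp;.
    public static string StripHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        var fragment = new HtmlDocument();
        fragment.LoadHtml(html);

        // InnerText keeps only the text nodes; DeEntitize decodes entities.
        return HtmlEntity.DeEntitize(fragment.DocumentNode.InnerText).Trim();
    }

    // Returns the text of the first node matching xpath, or "" if none matches.
    public static string TextOrEmpty(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? string.Empty : node.InnerText.Trim();
    }
}
```

TextOrEmpty could be used for the title, h1 and h2 lookups, for example ScrapeHelpers.TextOrEmpty(doc1, "//h1"), so that a missing element gives an empty string instead of an exception.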