If you have ever had to parse HTML, you know that doing it with regular expressions gets complicated quickly and produces code that is very hard to read. .NET only includes classes for parsing XML, and since most pages are not valid XHTML, you won't get very far with those classes. Fortunately, there is the Html Agility Pack for parsing HTML. It is easy to use, the resulting code is easy to read, and the HTML does not even have to be valid for it to work. The Html Agility Pack is already included in Sitecore, so you can start using it right away.
The Html Agility Pack (HAP) can be found on CodePlex at this URL:
http://htmlagilitypack.codeplex.com/
The project describes itself as follows:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Using the Html Agility Pack
So let us take a look at how we can use HAP. First you need to add a reference to HtmlAgilityPack.dll to your project. The DLL is located in the bin folder of your Sitecore website.
Then we create a new HtmlDocument (from the HtmlAgilityPack namespace) like this:
HtmlDocument doc = new HtmlDocument();
So now we have an empty HtmlDocument. Let's get some HTML into it.
HAP provides several methods for loading HTML; I will only mention three.
If we already have the HTML in a string, we just use the LoadHtml(string html) method.
If the HTML is stored in a file on our server, we can use the Load(string path) method.
Finally, we can load the HTML from a stream like this:
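// Requires: using System.IO; using System.Net;
WebRequest webRequest = WebRequest.Create("http://learnsitecore.cmsuniverse.net");
// The using blocks release the response and the stream even if Load throws.
using (WebResponse response = webRequest.GetResponse())
using (Stream stream = response.GetResponseStream())
{
    doc.Load(stream);
}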
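For the string case, a minimal sketch could look like the following. The class and method names (LoadFromStringExample, FirstHeadingText) are made up for this illustration and are not part of HAP; only HtmlDocument, LoadHtml, SelectSingleNode and InnerText come from the library.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadFromStringExample
{
    // Hypothetical helper: parses an HTML string and returns the text of
    // the first <h1> element, or null when the markup has no <h1>.
    public static string FirstHeadingText(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h1");
        return heading == null ? null : heading.InnerText;
    }
}
```

Calling LoadFromStringExample.FirstHeadingText with the content of a rich text field would give you the first heading of that field, if there is one.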
Parsing the HTML
Now that the document has been loaded, we can start working with the HTML.
We might want to build a table of contents from the HTML. We could do that by finding all the h2 elements in the document. This is done by the following code.
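HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h2");
if (headings != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode node in headings)
    {
        Response.Write(node.InnerText);
    }
}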
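If you would rather collect the headings than write them straight to the response, the lookup can be wrapped in a small helper. This is only a sketch: TableOfContents and GetHeadings are names invented for this example, while SelectNodes and HtmlEntity.DeEntitize are existing HAP members. DeEntitize turns entities such as &amp;amp; back into plain characters.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableOfContents
{
    // Hypothetical helper: returns the text of every <h2> in document order.
    public static List<string> GetHeadings(string html)
    {
        List<string> headings = new List<string>();

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h2");
        if (nodes == null)
        {
            // No <h2> elements: return an empty list instead of null.
            return headings;
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities and trim surrounding whitespace.
            headings.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return headings;
    }
}
```

Returning an empty list for pages without h2 elements means the caller can loop over the result without a null check.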
We could also make our own little crawler by reading the href attribute of every link on the page.
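HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null) // null when the page contains no links with an href
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        Response.Write(att.Value + "<br/>");
    }
}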
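A crawler usually needs absolute addresses, while href values are often relative. One possible way to handle that, sketched here under the assumption that you know the address the page was loaded from, is to resolve each href against that base with Uri.TryCreate. LinkExtractor and GetAbsoluteLinks are hypothetical names; the filter on http and https is a choice made for this example to skip mailto: and javascript: links.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns every http/https link on the page,
    // resolved against the address the page was loaded from.
    public static List<Uri> GetAbsoluteLinks(string html, Uri baseUri)
    {
        List<Uri> result = new List<Uri>();

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return result;
        }

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);

            // TryCreate resolves relative hrefs and leaves absolute ones as they are.
            Uri absolute;
            if (Uri.TryCreate(baseUri, href, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```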
Changing the HTML
HAP also makes it possible to change an HTML document.
Maybe you need to set a CSS class on all link elements in a rich text field when saving an item. This can be done like this:
HtmlNodeCollection aTags = doc.DocumentNode.SelectNodes("//a");
if (aTags != null)
{
    foreach (HtmlNode aTag in aTags)
    {
        aTag.SetAttributeValue("class", "ClassName");
    }
}
Then save doc.DocumentNode.OuterHtml back to the rich text field. Keep in mind that SetAttributeValue replaces any class value the link already has.
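If editors may already have put classes on some links, you probably want to append the new class instead of overwriting what is there. The following is a rough sketch of that idea, not the only way to do it. LinkClassHelper and AddClassToLinks are invented names; GetAttributeValue and SetAttributeValue are the HAP calls doing the work.

```csharp
using System;
using HtmlAgilityPack;

public static class LinkClassHelper
{
    // Hypothetical helper: adds className to every <a> element while
    // keeping the classes a link already has. Returns the changed HTML.
    public static string AddClassToLinks(string html, string className)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection aTags = doc.DocumentNode.SelectNodes("//a");
        if (aTags == null)
        {
            // Nothing to change, hand back the input untouched.
            return html;
        }

        foreach (HtmlNode aTag in aTags)
        {
            string existing = aTag.GetAttributeValue("class", string.Empty);
            string[] classes = existing.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

            // Skip links that already carry the class.
            if (Array.IndexOf(classes, className) >= 0)
            {
                continue;
            }

            string combined = classes.Length == 0
                ? className
                : string.Join(" ", classes) + " " + className;
            aTag.SetAttributeValue("class", combined);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

The returned string is what you would write back to the rich text field.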
Conclusion
There is a lot more functionality in HAP than I have described here, but I hope you have seen how easy it is to work with, and that you will use it the next time you have to manipulate HTML.