Level: 200
 

Working programmatically with HTML in Sitecore

If you have ever had to parse HTML, you know that it can be rather complicated to do with regular expressions, and that the resulting code can be very hard to read. .NET only includes classes for parsing XML, and as most pages are not valid XHTML you won't get very far with those classes. Fortunately there is the Html Agility Pack for parsing HTML. It is really easy to use, the code is easy to read, and the HTML does not even have to be valid for it to work. The Html Agility Pack is already included in Sitecore, so you can start using it right away.

Written by: Jimmi Lyhne Andersen
Thu, Aug 5 2010

The Html Agility Pack (HAP) is available on CodePlex at this URL:
http://htmlagilitypack.codeplex.com/

 

The project describes itself as follows:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Using the Html Agility Pack


So let us take a look at how we can use the HAP. First you need to add a reference to HtmlAgilityPack.dll in your project. The DLL is located in the bin folder of your Sitecore website.

Then we create a new HtmlDocument like this:

// Requires "using HtmlAgilityPack;" at the top of the file.
HtmlDocument doc = new HtmlDocument();

So now we have an empty HtmlDocument. Let's get some HTML into it.

The HAP provides several methods for loading HTML; I will only mention three.


If we already have the HTML in a string, we just use the LoadHtml(string html) method.

If the HTML is stored as a file on our server, we can use the Load(string path) method.


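Finally, we can load the HTML from a stream, like this:

// Requires "using System.IO;" and "using System.Net;".
WebRequest webRequest = WebRequest.Create("http://learnsitecore.cmsuniverse.net");
// The using blocks close the response and the stream even if Load throws.
using (WebResponse response = webRequest.GetResponse())
using (Stream stream = response.GetResponseStream())
{
 doc.Load(stream);
}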
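
The string-based loading can be sketched as a small console program. This is only a minimal sketch: the class name LoadFromStringExample, the helper Parse and the sample markup are made up for the illustration, and the markup is deliberately left invalid to show that the parser tolerates it.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadFromStringExample
{
    // Hypothetical helper: parses HTML held in a string and returns the document.
    public static HtmlDocument Parse(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    public static void Main()
    {
        // Deliberately sloppy sample markup: the <p> element is never closed.
        HtmlDocument doc = Parse("<div><h2>First heading</h2><p>Some text</div>");

        // The heading can still be found, even though the markup is not valid.
        HtmlNode heading = doc.DocumentNode.SelectSingleNode("//h2");
        Console.WriteLine(heading.InnerText); // prints "First heading"
    }
}
```

For a file on disk the only change would be to call doc.Load(path) instead of doc.LoadHtml(html).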

Parsing the html

 

Now that the document has been loaded, we can start working with the HTML.
We might want to build a table of contents from the HTML. We could do that by finding all the h2 elements in the document, which is done by the following code.

HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h2");
if (headings != null) // SelectNodes returns null when nothing matches
{
 foreach (HtmlNode node in headings)
 {
  Response.Write(node.InnerText);
 }
}
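
Outside a page, where there is no Response object, the same idea can be wrapped in a method that returns the heading texts. The following is a minimal sketch under my own assumptions: the names TableOfContentsExample and GetHeadings and the sample markup are hypothetical. It also decodes entities, since InnerText may still contain things like &amp;amp;.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableOfContentsExample
{
    // Hypothetical helper: returns the text of every h2 element, in document order.
    public static List<string> GetHeadings(string html)
    {
        List<string> headings = new List<string>();

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h2");
        if (nodes == null)
        {
            return headings; // no h2 elements, so the table of contents is empty
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            headings.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return headings;
    }

    public static void Main()
    {
        // Made-up sample markup for the illustration.
        string html = "<h1>Title</h1><h2>Loading</h2><p>Text</p><h2>Parsing &amp; changing</h2>";

        foreach (string heading in GetHeadings(html))
        {
            Console.WriteLine("- " + heading);
        }
    }
}
```

Returning a list rather than writing directly to the response keeps the parsing separate from the rendering, so the same method could feed a sublayout, an XSLT extension or a unit test.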

We could also make our own little crawler by reading the href attribute of every link on the page.

HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null) // SelectNodes returns null when the page has no links
{
 foreach (HtmlNode link in links)
 {
  HtmlAttribute att = link.Attributes["href"];
  Response.Write(att.Value + "<br/>");
 }
}
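
A crawler usually needs absolute addresses, while href values are often relative. As a rough sketch of one way to handle that, the hypothetical helper GetLinks below resolves each href against a base address with Uri.TryCreate and keeps only http and https links. The class name, the sample markup and the example.com address are all assumptions made for the illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractorExample
{
    // Hypothetical helper: returns the absolute http/https address of every link in the HTML.
    public static List<Uri> GetLinks(string html, Uri baseUri)
    {
        List<Uri> result = new List<Uri>();

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return result; // no links on the page
        }

        foreach (HtmlNode link in links)
        {
            // The raw attribute value may contain entities such as &amp;, so decode it first.
            string href = HtmlEntity.DeEntitize(link.GetAttributeValue("href", ""));

            // Resolve relative addresses against the address of the page.
            Uri absolute;
            if (!Uri.TryCreate(baseUri, href, out absolute))
            {
                continue; // not a usable address
            }

            // Skip mailto:, javascript: and similar links.
            if (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
            {
                result.Add(absolute);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Made-up page content and page address.
        string html = "<a href=\"page2.html\">Next</a> <a href=\"/about\">About</a> <a href=\"mailto:someone@example.com\">Mail</a>";
        Uri pageUri = new Uri("http://example.com/articles/");

        foreach (Uri uri in GetLinks(html, pageUri))
        {
            Console.WriteLine(uri.AbsoluteUri);
        }
    }
}
```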

 

Changing the html

 

The HAP also makes it possible to change an HTML document.
Maybe you need to add a CSS class to all link elements in a rich text field when an item is saved. This can be done like this.

 
HtmlNodeCollection aTags = doc.DocumentNode.SelectNodes("//a");
if (aTags != null)
{
 foreach (HtmlNode aTag in aTags)
 {
  aTag.SetAttributeValue("class", "ClassName");
 }
}

 

Then save doc.DocumentNode.OuterHtml back to the rich text field.
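
One thing to be aware of is that SetAttributeValue replaces whatever the class attribute already contains. If the editors may have put their own classes on some links, a variation like the following could be used instead. It is only a sketch: the class name CssClassExample, the helper AddCssClass and the sample markup are hypothetical, and reading from and writing to the Sitecore field is left out.

```csharp
using System;
using HtmlAgilityPack;

public static class CssClassExample
{
    // Hypothetical helper: adds className to every a element without
    // removing the classes that are already there, and returns the new HTML.
    public static string AddCssClass(string html, string className)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection aTags = doc.DocumentNode.SelectNodes("//a");
        if (aTags == null)
        {
            return html; // no links, so nothing to change
        }

        foreach (HtmlNode aTag in aTags)
        {
            string existing = aTag.GetAttributeValue("class", "").Trim();
            string[] classes = existing.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

            if (Array.IndexOf(classes, className) >= 0)
            {
                continue; // the link already has the class
            }

            string combined = existing.Length == 0 ? className : existing + " " + className;
            aTag.SetAttributeValue("class", combined);
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Made-up rich text content: one link with a class, one without.
        string html = "<p><a class=\"intro\" href=\"/a\">A</a> and <a href=\"/b\">B</a></p>";
        Console.WriteLine(AddCssClass(html, "external"));
    }
}
```

The returned string plays the role of doc.DocumentNode.OuterHtml above, so it is the value that would be written back to the rich text field.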

 

Conclusion

 

There is a lot more functionality in the HAP than I have described here, but I hope you have seen how easy it is to work with, and that you will use it the next time you have to manipulate HTML.

 


    About the author:

    Jimmi Lyhne Andersen

    Jimmi Lyhne Andersen is the co-founder of LearnSitecore, and is a partner at Inmento Solutions.
    Jimmi has been lead developer and architect on several large public websites with 200+ editors, and is therefore focused on usability and accessibility.
    Jimmi is a certified Sitecore developer and a Microsoft Certified Professional Developer (MCPD).
    He has a bachelor's degree in Computer Science and in Mathematics from the University of Copenhagen.

     

2 responses to "Working programmatically with HTML in Sitecore"

At YouSee we are using HAP when rendering items such as article-items, news-items eg. That way we can parse html and manipulate it as needed primarily through xslt-extensions. We are also using HAP in Sitecore ECM to parse html-emails so we can add custom tracking to all links such as GA or Omniture. HAP is not Sitecore specific, but it is a nice little tool :)
Posted: Friday, April 01, 2011 7:33 PM

Hi Morten,

You are absolutely right! It is not a Sitecore specific feature, but shipped with Sitecore.

If you have extensive experience with it, you should consider writing an article for LearnSitecore about it. We would appreciate that.

Cheers
Jens Mikkelsen
Posted: Saturday, April 02, 2011 5:11 PM
