Friday, March 30, 2012

Parsing HTML into an XElement

I have been working with TinyMCE in my ASP.NET project. Mostly I really like it, but I ran into a problem today. I am doing some work where I parse the HTML on the server side and insert dynamic content before passing it up to TinyMCE. This was working until TinyMCE started inserting &nbsp; entities into my markup. It turns out that HTML is not, strictly speaking, compliant with XML: XML predefines only five named entities (&lt;, &gt;, &amp;, &apos; and &quot;), and everything else, including &nbsp;, has to be declared in a DTD. I kept getting the exception: System.Xml.XmlException : Reference to undeclared entity 'nbsp'. Line 6, position 393.
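To make the failure concrete, here is a minimal sketch that should reproduce it outside of TinyMCE. The class name and the sample markup are made up for illustration; the only point is that a predefined XML entity parses while an HTML-only entity does not.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class UndeclaredEntityDemo
{
    // Returns true when the markup parses as XML, false when the parser rejects it.
    public static bool CanParse(string markup)
    {
        try
        {
            XElement.Parse(markup);
            return true;
        }
        catch (XmlException ex)
        {
            // For &nbsp; the message names the undeclared entity.
            Console.WriteLine(ex.Message);
            return false;
        }
    }

    public static void Main()
    {
        // &amp; is one of XML's predefined entities, so this is fine.
        Console.WriteLine(CanParse("<p>Fish &amp; chips</p>"));

        // &nbsp; is declared by the (X)HTML DTDs, not by XML itself.
        Console.WriteLine(CanParse("<p>Fish&nbsp;chips</p>"));
    }
}
```

No DOCTYPE is supplied here, so the parser has no DTD from which to learn what &nbsp; means.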

So, it turns out that this will work if the HTML starts with the proper DOCTYPE declaration, because the DTD it names tells the parser how to treat these entities. I went through several references on my way to a solution.

Ultimately I did not want to manually insert the DOCTYPE at the top of my HTML, so I had to find a way to force .NET to load those DTDs for me. The solution came from this post on Stack Overflow. One of its comments links to a blog post that the commenter said explains the fix, but that post no longer exists. Using the Wayback Machine I found the archived post from 2005, which has some decent explanations of what's going on, how to fix it, and why the fix works. Unfortunately, it uses some obsolete concepts that don't really apply in .NET 4.0, so I made my own version of the code, which I am sharing here for posterity:

    // Namespaces this method needs at the top of the file.
    using System.IO;
    using System.Xml;
    using System.Xml.Linq;
    using System.Xml.Resolvers;

    private XElement ParseXhtml(string html)
    {
        // Set up the proper XHTML parsing context. The reader behaves as if the
        // input began with an XHTML 1.0 Strict DOCTYPE declaration.
        XmlNameTable nt = new NameTable();
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
        XmlParserContext context = new XmlParserContext(null, nsmgr, null, XmlSpace.None);
        context.DocTypeName = "html";
        context.PublicId = "-//W3C//DTD XHTML 1.0 Strict//EN";
        context.SystemId = "xhtml1-strict.dtd";

        // Create XmlReaderSettings that will process the HTML using the DTD we
        // specified in the context above. The preloaded resolver serves the DTD
        // from copies built into the framework.
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse;
        settings.ValidationType = ValidationType.DTD;
        settings.XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.All);

        // Create the XmlReader over the HTML string with the settings and context,
        // and dispose of both readers once the element has been loaded.
        using (StringReader stringReader = new StringReader(html))
        using (XmlReader reader = XmlReader.Create(stringReader, settings, context))
        {
            // Load the XML into an XElement.
            return XElement.Load(reader);
        }
    }

I know the best blog posts are the ones that do a good job of explaining all the parts and why they work but, to my shame, all I really know is that we're forcing the XmlReader to assume a certain XML DOCTYPE, and that it works.
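For anyone who wants to try it, here is a rough sketch of calling the method from a console program. The class name `XhtmlDemo` and the sample document are assumptions for illustration, and the method is repeated in condensed form only so the sketch compiles on its own. The input is a complete, valid XHTML document because DTD validation is switched on; I have not checked how the settings behave with bare fragments.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Resolvers;

public static class XhtmlDemo
{
    // Condensed copy of the method from this post.
    public static XElement ParseXhtml(string html)
    {
        var context = new XmlParserContext(
            null, new XmlNamespaceManager(new NameTable()), null, XmlSpace.None)
        {
            DocTypeName = "html",
            PublicId = "-//W3C//DTD XHTML 1.0 Strict//EN",
            SystemId = "xhtml1-strict.dtd"
        };
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            ValidationType = ValidationType.DTD,
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.All)
        };
        using (var reader = XmlReader.Create(new StringReader(html), settings, context))
        {
            return XElement.Load(reader);
        }
    }

    public static void Main()
    {
        // Hypothetical input: a small but complete XHTML document.
        string html =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
            "<head><title>Demo</title></head>" +
            "<body><p>Hello&nbsp;world</p></body></html>";

        XElement root = ParseXhtml(html);

        // XHTML elements live in a namespace, so queries must include it.
        XNamespace xhtml = "http://www.w3.org/1999/xhtml";
        XElement paragraph = root.Descendants(xhtml + "p").First();

        // The entity should arrive as the no-break space character U+00A0.
        Console.WriteLine(paragraph.Value.Contains("\u00A0"));
    }
}
```

Note that the parsed elements carry the XHTML namespace, so a plain `root.Descendants("p")` would come back empty.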
