Friday, March 30, 2012

Parsing HTML into an XElement

I have been working with TinyMCE in my ASP.NET project. Mostly I really like it, but I ran into a problem today. I am doing some work where I parse the HTML on the server side and insert dynamic content before passing it up to TinyMCE. This was working until TinyMCE started inserting &nbsp; entities into my markup. It turns out that HTML is not, strictly speaking, valid XML: XML predefines only five named entities (&lt;, &gt;, &amp;, &apos; and &quot;), and &nbsp; is not one of them. I kept getting the exception: System.Xml.XmlException : Reference to undeclared entity 'nbsp'. Line 6, position 393.
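To see the failure in isolation, here is a minimal sketch. The class name NbspRepro and the method TryParse are made up for illustration; the only framework call involved is XElement.Parse.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the project code.
public static class NbspRepro
{
    // Returns the XmlException message, or null when the markup parses cleanly.
    public static string TryParse(string markup)
    {
        try
        {
            XElement.Parse(markup);
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }
}
```

NbspRepro.TryParse("<p>a&nbsp;b</p>") returns an error message, because nothing has declared the nbsp entity. NbspRepro.TryParse("<p>a&#160;b</p>") returns null, since a numeric character reference needs no declaration.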

So, it turns out that this will work if the document declares the proper document type (a DOCTYPE declaration ahead of the html element), because the DTD it points to defines these entities for the parser. Here particularly are some references I went through on my way to a solution:

Ultimately I did not want to manually insert the DOCTYPE at the top of my HTML, so I had to find a way to force .NET to load those DTDs for me. The solution came from this post on StackOverflow. One of the comments links to a blog post that no longer exists but which, according to the commenter, explained how to fix it. Using the Wayback Machine I found the archived post from 2005, which has some decent explanations of what's going on, how to fix it and why it works. Unfortunately, it uses some obsolete concepts that don't really apply in .NET 4.0, so I made my own version of the code, which I am sharing here for posterity:

// Requires: using System.IO; using System.Xml; using System.Xml.Linq; using System.Xml.Resolvers;
private XElement ParseXhtml(string html)
{
    // Set up a parser context that behaves as if the input declared the XHTML 1.0 Strict DOCTYPE.
    XmlNameTable nt = new NameTable();
    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
    XmlParserContext xhtmlContext = new XmlParserContext(null, nsmgr, null, XmlSpace.None);
    xhtmlContext.DocTypeName = "html";
    xhtmlContext.PublicId = "-//W3C//DTD XHTML 1.0 Strict//EN";
    xhtmlContext.SystemId = "xhtml1-strict.dtd";

    // Create XmlReaderSettings that will properly process the HTML using the DTD we specified
    // in the context above. XmlPreloadedResolver supplies the well-known DTDs from built-in
    // copies, so nothing is fetched over the network.
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.DtdProcessing = DtdProcessing.Parse;
    settings.ValidationType = ValidationType.DTD;
    settings.XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.All);

    // Create the readers with the appropriate settings and context, and dispose them when done.
    using (StringReader stringReader = new StringReader(html))
    using (XmlReader reader = XmlReader.Create(stringReader, settings, xhtmlContext))
    {
        // Load the XML into an XElement.
        return XElement.Load(reader);
    }
}

I know the best blog posts are the ones that do a good job explaining all the parts and why they work, but to my shame, all I really know is that we're forcing the XmlReader to assume the XHTML 1.0 Strict DOCTYPE, that the XmlPreloadedResolver supplies the matching DTD so entities like nbsp are declared, and that it works.

Tuesday, March 20, 2012

Troubleshooting the insidious InvalidInput error

I've been plagued since the day I started working on Azure with occasional errors of type "InvalidInput" (such as this one). The detail of this error says "One of the request inputs is not valid" and gives no other information. Evidently this means one of the properties on the entity is invalid for Azure Table Storage, but it does not tell you which one or why. A quick Google search will turn up lots of common mistakes that can cause this, but if none of those apply you're out of luck.

I have come up with a function that will at the very least tell you which property is causing the error. It should be a method of a TableServiceContext subclass, although it could easily be modified to work some other way. The return value is a list of the names of the properties that failed. Essentially it works by trying to update the entity one property at a time. For each property it removes all properties from the request XML except the one being tested. If the request fails, it assumes that the current property (being the only one we sent) was one of the properties that caused the failure. Start by pasting the following code into your TableServiceContext subclass, then call TroubleshootInvalidInputError(entity) with the entity that is failing.

// Requires: using System; using System.Collections.Generic; using System.Data.Services.Client;
// using System.Linq; using System.Reflection; using System.Xml.Linq;
// using Microsoft.WindowsAzure.StorageClient;

// Namespaces used in the request payload. Skip these two fields if your context already defines them.
private static readonly XNamespace Meta = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata";
private static readonly XNamespace Data = "http://schemas.microsoft.com/ado/2007/08/dataservices";

private string mTroubleshootingCurrentProperty = null;

public List<string> TroubleshootInvalidInputError(TableServiceEntity e)
{
    List<string> failedProperties = new List<string>();
    WritingEntity += TroubleshootInvalidInputErrorBeforeWrite;
    try
    {
        foreach (PropertyInfo property in e.GetType().GetProperties())
        {
            mTroubleshootingCurrentProperty = property.Name;
            try
            {
                UpdateObject(e);
                SaveChanges();
            }
            catch (Exception)
            {
                // Only this property was sent, so treat it as one of the culprits.
                failedProperties.Add(property.Name);
            }
        }
    }
    finally
    {
        // Restore normal behavior for later saves, even if something unexpected was thrown.
        mTroubleshootingCurrentProperty = null;
        WritingEntity -= TroubleshootInvalidInputErrorBeforeWrite;
    }

    return failedProperties;
}

void TroubleshootInvalidInputErrorBeforeWrite(object sender, ReadingWritingEntityEventArgs e)
{
    if (string.IsNullOrEmpty(mTroubleshootingCurrentProperty))
        return;

    // The XML of the properties about to be sent to Azure.
    XElement properties = e.Data.Descendants(Meta + "properties").First();
    XElement keepNode = properties.Elements(Data + mTroubleshootingCurrentProperty).FirstOrDefault();

    // Strip every property except the one under test.
    if (keepNode != null)
    {
        properties.RemoveNodes();
        properties.Add(keepNode);
    }
}

It does not tell you why it failed, but knowing which property is half the battle. In my case, the problem was an entity property whose type was a custom class marked as Serializable. I had been used to the Azure serializer ignoring any public properties of unsupported types (such as List, custom classes, etc). This time, because the custom class was marked as Serializable, Azure attempted to serialize it and the data service returned an error.
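For a quick check before sending anything, a reflection pass can flag properties whose types fall outside the primitive types Table Storage stores (byte[], bool, DateTime, double, Guid, int, long and string). The sketch below is only that, a sketch: TableTypeCheck, Address and CustomerEntity are hypothetical names made up for illustration, and the entity is a plain class rather than a TableServiceEntity so the snippet compiles without the Azure SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the Azure SDK.
public static class TableTypeCheck
{
    // CLR types that map to the primitive types Table Storage can store.
    private static readonly HashSet<Type> Supported = new HashSet<Type>
    {
        typeof(byte[]), typeof(bool), typeof(DateTime), typeof(double),
        typeof(Guid), typeof(int), typeof(long), typeof(string)
    };

    // Returns the names of public properties whose type is not in the supported set.
    // Nullable<T> is unwrapped first, so int? counts as int.
    public static List<string> FindUnsupportedProperties(Type entityType)
    {
        return entityType.GetProperties()
            .Where(p => !Supported.Contains(
                Nullable.GetUnderlyingType(p.PropertyType) ?? p.PropertyType))
            .Select(p => p.Name)
            .ToList();
    }
}

// Hypothetical custom class, marked Serializable like the one that bit me.
[Serializable]
public class Address
{
    public string City { get; set; }
}

// Hypothetical entity shape; a real one would derive from TableServiceEntity.
public class CustomerEntity
{
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public DateTime Timestamp { get; set; }
    public int? Age { get; set; }
    public Address HomeAddress { get; set; }
}
```

Calling TableTypeCheck.FindUnsupportedProperties(typeof(CustomerEntity)) returns a list containing only "HomeAddress". A static check like this will not catch invalid values (an out-of-range DateTime, for example), so it complements the one-property-at-a-time approach rather than replacing it.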