Friday, March 30, 2012

Parsing HTML into an XElement

I have been working with TinyMCE in my ASP.NET project. Mostly I really like it, but I ran into a problem today. I am doing some work where I parse the HTML on the server side and insert dynamic content before passing it up to TinyMCE. This was working until TinyMCE started inserting &nbsp; entities into my markup. It turns out that HTML is not, strictly speaking, compliant with XML: XML predefines only five entities (lt, gt, amp, apos and quot), and nbsp is not one of them. I kept getting the exception: System.Xml.XmlException : Reference to undeclared entity 'nbsp'. Line 6, position 393.

So, it turns out that this will work if the document declares the proper DOCTYPE, so the parser knows how to resolve these entities. Here are some references I went through on my way to a solution:

Ultimately I did not want to manually insert the doctype at the top of my HTML, so I had to find a way to force .NET to load those DTDs for me. The solution ultimately came from this post on StackOverflow. One of the comments links to a blog post that no longer exists but which, according to the commenter, explained how to fix it. Using the Wayback Machine I found the archived post from 2005, which has some decent explanations of what's going on, how to fix it and why it works. Unfortunately, it uses some obsolete concepts that don't really apply in .NET 4.0, so I made my own version of the code, which I am sharing here for posterity:

// Requires: using System.IO; using System.Xml; using System.Xml.Linq; using System.Xml.Resolvers;
private XElement ParseXhtml(string html)
{
    // Set up the proper XHTML parsing context. The DOCTYPE information given
    // here stands in for a <!DOCTYPE> declaration in the document itself.
    XmlNameTable nt = new NameTable();
    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
    XmlParserContext context = new XmlParserContext(null, nsmgr, null, XmlSpace.None);
    context.DocTypeName = "html";
    context.PublicId = "-//W3C//DTD XHTML 1.0 Strict//EN";
    context.SystemId = "xhtml1-strict.dtd";

    // Create XmlReaderSettings that will properly process the HTML using the DTD we specified
    // in the context above. XmlPreloadedResolver serves the XHTML DTDs from copies
    // that ship with the framework.
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.DtdProcessing = DtdProcessing.Parse;
    settings.ValidationType = ValidationType.DTD;
    settings.XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.All);

    // Create the XmlReader with the appropriate settings and context,
    // then load the XML into an XElement.
    using (StringReader stringReader = new StringReader(html))
    using (XmlReader reader = XmlReader.Create(stringReader, settings, context))
    {
        return XElement.Load(reader);
    }
}

I know the best blog posts are the ones that do a good job explaining all the parts and why they work, but to my shame, all I really know is that we're forcing the XmlReader to assume a certain Xml DOCTYPE and that it works.
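
For anyone who wants to reproduce the problem and the fix outside a web project, here is a minimal console sketch. The class name NbspDemo and the sample document are made up for illustration. The sketch also leaves out ValidationType.DTD, on the assumption that resolving the entity only requires the DTD to be parsed, not the document to be validated against it.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Resolvers;

public static class NbspDemo
{
    // Hypothetical sample: a small XHTML document containing the troublesome entity.
    public const string Sample =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>" +
        "<body><p>a&nbsp;b</p></body></html>";

    // Returns true when a plain XElement.Parse rejects the markup.
    public static bool PlainParseFails(string html)
    {
        try
        {
            XElement.Parse(html);
            return false;
        }
        catch (XmlException)
        {
            return true;
        }
    }

    // Condensed variant of ParseXhtml: supply the XHTML DOCTYPE through the
    // parser context and let XmlPreloadedResolver provide the DTD.
    public static XElement ParseWithXhtmlDtd(string html)
    {
        var nsmgr = new XmlNamespaceManager(new NameTable());
        var context = new XmlParserContext(null, nsmgr, null, XmlSpace.None)
        {
            DocTypeName = "html",
            PublicId = "-//W3C//DTD XHTML 1.0 Strict//EN",
            SystemId = "xhtml1-strict.dtd"
        };
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.All)
        };

        using (var stringReader = new StringReader(html))
        using (XmlReader reader = XmlReader.Create(stringReader, settings, context))
        {
            return XElement.Load(reader);
        }
    }

    public static void Main()
    {
        Console.WriteLine("Plain parse fails: " + PlainParseFails(Sample));
        XElement root = ParseWithXhtmlDtd(Sample);
        Console.WriteLine("Contains U+00A0: " + root.Value.Contains("\u00A0"));
    }
}
```

The point of the second check is that once the DTD is loaded, the reader expands the entity to the non-breaking space character rather than throwing.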

Tuesday, March 20, 2012

Troubleshooting the insidious InvalidInput error

I've been plagued since the day I started working on Azure with occasional errors of type "InvalidInput" (such as this one). The detail of this error says "One of the request inputs is not valid" and gives no other information. Obviously this means one of the properties on the entity is invalid for Azure Table Storage, but it does not tell you which one or why. A quick Google search will turn up lots of common mistakes that can cause this, but if those don't help you're out of luck.

I have come up with a function that will at the very least tell you which property is causing the error. It should be a method of a TableServiceContext subclass, although it could easily be modified to work some other way. The return value is a list of properties that failed. Essentially it works by trying to update the entity one property at a time. For each property it removes all properties from the request XML except the one being tested. If the request fails, it assumes that the current property (being the only one we sent) was one of the properties that caused the failure. Start by pasting the following code into your TableServiceContext, then call TroubleshootInvalidInputError(entity) with the entity that is failing.

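// Requires: using System; using System.Collections.Generic; using System.Linq;
// using System.Reflection; using System.Xml.Linq; using System.Data.Services.Client;
// using Microsoft.WindowsAzure.StorageClient;

// Atom payload namespaces used by the data service client. They are defined
// here in case your context does not already declare them.
private static readonly XNamespace Meta = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata";
private static readonly XNamespace Data = "http://schemas.microsoft.com/ado/2007/08/dataservices";

private string mTroubleshootingCurrentProperty = null;

public List<string> TroubleshootInvalidInputError(TableServiceEntity e)
{
    List<string> failedProperties = new List<string>();
    WritingEntity += TroubleshootInvalidInputErrorBeforeWrite;
    try
    {
        foreach (PropertyInfo property in e.GetType().GetProperties())
        {
            mTroubleshootingCurrentProperty = property.Name;
            try
            {
                UpdateObject(e);
                SaveChanges();
            }
            catch (Exception)
            {
                // Only this property was sent, so assume it is one of the culprits.
                failedProperties.Add(property.Name);
            }
        }
    }
    finally
    {
        // Detach the handler so repeated calls do not stack up subscriptions.
        mTroubleshootingCurrentProperty = null;
        WritingEntity -= TroubleshootInvalidInputErrorBeforeWrite;
    }

    return failedProperties;
}

void TroubleshootInvalidInputErrorBeforeWrite(object sender, ReadingWritingEntityEventArgs e)
{
    if (string.IsNullOrEmpty(mTroubleshootingCurrentProperty))
        return;

    // The XML of the properties already being sent to Azure
    XElement properties = e.Data.Descendants(Meta + "properties").First();
    XName keepName = Data + mTroubleshootingCurrentProperty;

    // Find the element for the property under test, if it was serialized at all.
    XElement keepNode = properties.Elements().FirstOrDefault(p => p.Name == keepName);

    if (keepNode != null)
    {
        // Strip everything, then put back only the property being tested.
        properties.RemoveNodes();
        properties.Add(keepNode);
    }
}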

It does not tell you why it failed, but knowing which property is half the battle. In my case, the problem was that I was trying to upload an entity with a property of a custom class that was Serializable. Normally, I'd been used to the Azure serializer ignoring any public properties that were not supported (such as List, custom classes, etc). However, I added a property based on a custom class I made which was marked as Serializable, so Azure attempted to serialize it and the data service triggered an error.
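
The property-isolation step can also be tried on its own, without a storage account. The following is a minimal sketch: the class name PropertyFilterDemo and the hand-built entry are assumptions for illustration, and the entry only approximates the Atom payload the data service client produces.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class PropertyFilterDemo
{
    static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";
    static readonly XNamespace Meta = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata";
    static readonly XNamespace Data = "http://schemas.microsoft.com/ado/2007/08/dataservices";

    // Removes every child of m:properties except the named property.
    // If the property is not present, the payload is left untouched.
    public static void KeepOnly(XElement entry, string propertyName)
    {
        XElement properties = entry.Descendants(Meta + "properties").First();
        XElement keep = properties.Elements(Data + propertyName).FirstOrDefault();
        if (keep != null)
        {
            properties.RemoveNodes();
            properties.Add(keep);
        }
    }

    // Hypothetical stand-in for the entry passed to the WritingEntity handler.
    public static XElement SampleEntry()
    {
        return new XElement(Atom + "entry",
            new XElement(Atom + "content",
                new XAttribute("type", "application/xml"),
                new XElement(Meta + "properties",
                    new XElement(Data + "PartitionKey", "p1"),
                    new XElement(Data + "RowKey", "r1"),
                    new XElement(Data + "Title", "hello"))));
    }

    public static void Main()
    {
        XElement entry = SampleEntry();
        KeepOnly(entry, "Title");
        Console.WriteLine(entry);
    }
}
```

Printing the entry after the call makes it easy to confirm that only the property under test would be sent.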

Monday, February 20, 2012

Fixing Default Date issues in Azure Table Storage

In newer versions of Azure, the minimum acceptable date in Table Storage (January 1, 1601 UTC) is far later than the minimum date in .NET. This means that if you have a DateTime property and you don't explicitly initialize it, you will get a "One of the request inputs is out of range" error. Personally, I'd rather not have to explicitly initialize every value. I liked it better in previous versions of Azure, when it would just default to DateTime.MinValue and everything was fine.

An easy way to manage this is to reset the default date for any DateTime properties before writing to Azure Table Storage. The way I chose to do this was to add a function to the WritingEntity event in the table context. See the code below:


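// Requires: using System; using System.Linq; using System.Xml; using System.Xml.Linq;
// using System.Data.Services.Client; using Microsoft.WindowsAzure;
// using Microsoft.WindowsAzure.StorageClient;
public class SafeDateContext : TableServiceContext
{
    // Namespace of the m:properties element and the m:type / m:null attributes.
    private static readonly XNamespace Meta = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata";

    // The earliest date Table Storage accepts.
    private static readonly DateTime MinAzureUtcDate = new DateTime(1601, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    public SafeDateContext(CloudStorageAccount account)
        : base(account.TableEndpoint.ToString(), account.Credentials)
    {
        WritingEntity += new EventHandler<ReadingWritingEntityEventArgs>(FixDates);
    }

    private void FixDates(object sender, ReadingWritingEntityEventArgs e)
    {
        // The XML of the properties already being sent to Azure
        XElement properties = e.Data.Descendants(Meta + "properties").First();

        foreach (var p in properties.Elements())
        {
            XAttribute typeAttribute = p.Attribute(Meta + "type");
            XAttribute nullAttribute = p.Attribute(Meta + "null");
            string type = typeAttribute == null ? null : typeAttribute.Value;
            bool isNull = nullAttribute != null
                && string.Equals("true", nullAttribute.Value, StringComparison.OrdinalIgnoreCase);

            if (!isNull && type == "Edm.DateTime")
            {
                // XmlConvert stands in for the ConvertType helper used in my
                // original code, which is not shown in this post.
                DateTime dateValue = XmlConvert.ToDateTime(p.Value, XmlDateTimeSerializationMode.RoundtripKind);
                if (dateValue < MinAzureUtcDate)
                {
                    p.SetValue(MinAzureUtcDate);
                }
            }
        }
    }
}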
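
The clamping rule itself can be checked without Azure. This is a minimal sketch under stated assumptions: DateClampDemo and the Created element are hypothetical names, and the element is built by hand rather than taken from a real request.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class DateClampDemo
{
    // The earliest date Table Storage accepts.
    public static readonly DateTime MinAzureUtcDate =
        new DateTime(1601, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Rewrites the element's value to the minimum date if it is earlier than that.
    public static void Clamp(XElement dateElement)
    {
        DateTime value = XmlConvert.ToDateTime(
            dateElement.Value, XmlDateTimeSerializationMode.RoundtripKind);
        if (value < MinAzureUtcDate)
        {
            dateElement.SetValue(MinAzureUtcDate);
        }
    }

    // Parses the element's value back into a DateTime.
    public static DateTime Read(XElement dateElement)
    {
        return XmlConvert.ToDateTime(
            dateElement.Value, XmlDateTimeSerializationMode.RoundtripKind);
    }

    public static void Main()
    {
        // An uninitialized DateTime property serializes as DateTime.MinValue.
        var created = new XElement("Created", DateTime.MinValue);
        Clamp(created);
        Console.WriteLine(created.Value);
    }
}
```

Dates that are already on or after the minimum pass through unchanged, so the handler only touches values Table Storage would reject.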

Wednesday, February 1, 2012

Azure Plugin Modules

UPDATE: A response from Microsoft was received at my post on the MSDN forums. Essentially the answer was: You shouldn't do this. I leave this here for you to review at your own risk. I submitted a feature request, but I'm not holding my breath.

I've been working with Azure for over a year now, and I've always had the question: How can I create my own modules along the lines of the Diagnostics or RemoteAccess modules? There is one major reason I want this. I have created 3 Azure services, with a total of 7 roles. Many of these roles have things in common. All of them use a setting called ConnectionString. Many of them have other connection strings in common if they are using a particular library that I created myself. What I would like to be able to do is import the connection strings for a library as simply as possible whenever I am using that library.

This is where the "Import" node in the ServiceDefinition.csdef file SHOULD come in handy. For example, if you add the Diagnostics module in ServiceDefinition.csdef it automatically pulls the necessary configuration settings into ServiceConfiguration.cscfg. I would like similar functionality to create my own modules. Periodic Google searches over the past year have come up empty every time. Today I tried a more direct search and made some discoveries.

Azure Role Modules are defined in the Windows Azure SDK folder under the bin/plugins directory. There is a folder for each plugin, and inside each is an XML file with the extension .csplugin. For example, here is the XML for the Diagnostics.csplugin file:


<?xml version="1.0" ?>
<RoleModule
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
    namespace="Microsoft.WindowsAzure.Plugins.Diagnostics">
  <Startup priority="-2">
    <Task commandLine="DiagnosticsAgent.exe" executionContext="limited" taskType="background" />
    <Task commandLine="DiagnosticsAgent.exe /blockStartup" executionContext="limited" taskType="simple" />
  </Startup>
  <ConfigurationSettings>
    <Setting name="ConnectionString" />
  </ConfigurationSettings>
</RoleModule>

It turns out you can use this same format to create your own plugins. For example, I created a plugin that looked like the following:


<?xml version="1.0" ?>
<RoleModule
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
    namespace="MyLib.Azure.Plugins.MyModule">
  <ConfigurationSettings>
    <Setting name="ConnectionString" />
    <Setting name="OtherSetting" />
  </ConfigurationSettings>
</RoleModule>


To do this:
  1. Create a new folder in the Windows Azure SDK folder under bin/plugins called PluginName
  2. Create a new file called PluginName.csplugin
  3. Paste in the XML above.
  4. Rename the namespace to whatever you would like it to be.
  5. Set the ConfigurationSettings section to what you would like it to be.
Now, when you import PluginName in your ServiceDefinition.csdef file, it will automatically pull up the configuration settings you added into the ServiceConfiguration.cscfg file.
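
As a rough sketch of what the import looks like, assuming the custom plugin is picked up the same way as the built-in modules, the role definition in ServiceDefinition.csdef would reference it by folder name. The service and role names below are placeholders.

```xml
<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="MyService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WorkerRole name="MyWorkerRole">
    <Imports>
      <!-- Built-in module, for comparison -->
      <Import moduleName="Diagnostics" />
      <!-- Custom module: assumed to match the PluginName folder and .csplugin file -->
      <Import moduleName="PluginName" />
    </Imports>
  </WorkerRole>
</ServiceDefinition>
```

If it behaves like the Diagnostics module, the settings should show up in ServiceConfiguration.cscfg prefixed with the plugin's namespace, for example MyLib.Azure.Plugins.MyModule.ConnectionString.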

Here are some problems I can foresee with this:
  1. It is likely not supported by Microsoft, since I could not find documentation on it anywhere.
  2. When you update Azure, it will likely overwrite or delete the plugins directory and you will lose this.
I'm not sure whether I'm going to put this into practice at work or not, but I thought I'd post it in any case so that if anyone else searches for an answer to this they'll get more results than I did.