Anatomy of a Platform Bug

Update 5/20/13 – See “Death of a Platform Bug”

Platforms and frameworks have bugs.

Nobody really likes to discuss it – especially platform and framework vendors. But it’s like Murphy’s law of computer programming: Every non-trivial program has at least one bug. In fact, one of the signs that you have become an “expert” on a platform or framework is that a high percentage of the problems that you run into and can’t solve are, in fact, platform bugs rather than your own code.

I’ve found bugs in Windows, MFC, ATL and the .NET Framework. Nowadays I find them in Force.com. The experience is pretty similar on all of the platforms. First you have to be very sure that it’s really not your bug – this can be harder than you might think. There’s a lot of detective work involved – unlike your own code, you can’t necessarily know what is going on with the platform – I once found a VB bug where I actually had to disassemble a part of the VB control interface code in order to demonstrate to the developers where their mistake was. Which brings us to one of the biggest challenges – getting past the first-line support team to someone who can actually solve the problem (or convince them that you really know what you’re talking about and that they should forward the information).

I thought it might be interesting to walk through what the process is like with an example that I am currently dealing with. This is a story in-progress – I will add more information as it becomes available.

It began with our latest release – where on some systems we started seeting many of our unit tests fail with the following error:

FATAL_ERROR|System.DmlException: Insert failed.
First exception on row 0; first error:
UNKNOWN_EXCEPTION, INVALID_TYPE:
sObject type 'DataDotComEntitySetting' is not supported.: []

This was perplexing. After all, we don’t access an object called DataDotComEntitySetting. In fact, we don’t reference anything related to Data.com.

As a software vendor, you really don’t want to see most of your unit tests start failing. So this became a top priority issue.

Our first concern was whether we could install the software at all. The answer is, of course – yes. If you’ve read “Advanced Apex Programming”, you’ve seen unit test design patterns that allow you to dynamically enable or disable individual unit tests before or after deployment – so we’re not dead in the water. However, not being able to run unit tests means we can’t validate the operation of the application on those systems – which is definitely not good.

Because we could disable tests for installation and then reenable them after the software was installed, we were able to eliminate one theory – that the problem was purely related to software installation – perhaps some security issue related to the user context used during unit tests on installation.

Another early step was, of course, to search for other instances of this problem. Unfortunately, this was one of those cases where we clearly were innovators. There was only one reference to a similar problem, and our scenario did not match the one described.

This left us with a number of questions.

Was this really related to Data.com?

Yes, the error message referenced an object called ‘DataDotComEntitySetting’, but I’ve seen cases where an error message has nothing even remotely related to do with real source of the error. This is especially true in a complex framework, where internal error handling attempts to internally recover from a problem and only after a cascade of errors do you finally see an unrecoverable error – that has nothing to do with the original problem. In this case, there are a number of factors that suggested it really related to data.com aside from the object name. First, both systems on which we saw the problem did have Data.com enabled – too small a sample for a firm conclusion, but an indicator nonetheless. Second, the StackExchange issue was seemingly related to a jigsaw package, that later seems to have been integrated into Data.com. Later in this article you’ll see how we obtained further proof.

What changed?

Our new software release had dozens of unit test errors – most of them on code that had not changed from the previous version (as a reasonably agile organization, we have frequent releases). But there was one change that impacted the entire codebase – we upgraded from API 25 to API 27, mostly in order to take advantage the new string library and some other new Apex features. When code breaks from one API version to another, that can be an indicator of a platform bug as compared to a bug in your own code.

Looking for a Workaround

At this point we had already submitted an initial case. But when dealing with potential platform bugs, you can’t just sit around and wait for support. You need information – the more the better. Fortunately, we have some great customers who are ok with us using the license management system to log in to their sandboxes – when you do so, you can see detailed debug logs for your managed packages. The push upgrades system also provides better information than a regular package install. This allowed us to see where the failure was occurring.

The code, in a nutshell, was like this.

// Code that creates some test lead
// objects but doesn’t insert them
List<Lead> newleads = initTestLeads();
InsertTestObjects(newleads);

The InsertTestObjects function is a public method that we use to insert test objects and perform some additional tasks. In this case, it sets a static variable so that our trigger framework will know to ignore these test objects.

public static void InsertTestObjects(List<SObject> objs)
{
   DisableExternalUpdates = true;
   insert objs;
   DisableExternalUpdates = false;
}

The error was occurring during the insert. We saw it occur on Leads, Contacts and Accounts – a fact that again pointed towards Data.com as the culprit, as it uses those objects.

One thing we found in the debug logs was that when the problem occurred, no object triggers were being called (at least in our application, or in user code). This provided additional evidence that the problem was not in our code or other user code, though it theoretically could have been in a different managed package.

This code is extremely simple. So we looked for ways to reproduce the problem.

We built some unit test classes in the sandbox that contained similar code. They worked perfectly.
We created another test package that contained similar code and tried to install it. It worked perfectly.

Things are so much easier when you can reproduce a problem. When you can’t….

What this did tell us however, is that whatever it took to cause this problem, it was not obvious. We had some test functions that failed, and others with almost identical code that succeeded. The problem was not intermittent – tests that failed did so consistently, those that passed also did so consistently. But there was no clear pattern.

So our next step was to create some patch versions of the application and see if we could change things to get the test to pass.

And we found something. If instead of calling the InsertTestObjects function we called a new strongly typed InsertTestLeads function, most of the tests passed.

public static void InsertTestLeads(List<Lead> objs)
{
   DisableExternalUpdates = true;
   insert objs;
   DisableExternalUpdates = false;
}

This would suggest that it was perhaps a language issue, except for one problem: there were other places in the code where a direct strongly types insertion would fail. For example:

Account act = new Account(…..);
insert act;

This would fail with the same error. Not everywhere, just in some test functions.

Presenting the Case

We were very fortunate to be assigned a really good support person, but we’d also done our homework. While the original case was filed as a “application won’t install” problem, by the time we were on a GotoMeeting with support we could demonstrate failing tests, had log files showing the problem, and could demonstrate code changes that could in some cases resolve the problem. In short, we had overwhelming evidence that we were dealing with a platform bug.

The support person, who was familiar with data.com, then walked us through some experiments. One of them involved turning off the “Clean” feature in data.com. That did it – the tests stopped failing.

So now we were in as ideal a situation as one could ask for under the circumstances. Salesforce support agreed that it was a platform bug, and we knew for sure that it related to data.com.

You may think I’m glad it’s a platform issue, and while in some sense there is relief that it’s not our code, the truth is that it would be much better if it were our code – we can fix our code. Now we have to hope that Salesforce will commit the resources to resolve the issue, and be able to figure it out – the inconsistent nature of the problem suggests that it may be hard to track down.

This is the “dark side” of modern software development – where we build applications based on packages, platforms, frameworks and services, many of which are outside of our control. It’s certainly not unique to Force.com. The best thing you can do is to be proactive – work with the platform and framework vendors to resolve issues, but be prepared to work with them on solving the issues, and where possible, develop workarounds.

I’ll add updates to this post as new information becomes available.

Meanwhile, if you have any insight to share, feel free to leave a comment (note, comments are moderated to limit spam so you won’t see them immediately)

Submit a Comment Cancel reply

Recent Posts

Links

sfdc99.com