Death of a Platform Bug

In my previous post, I walked through the process of discovering, diagnosing and reporting a legitimate platform bug. As I mentioned previously, on any platform as large and complex as Force.com, bugs are inevitable. Every OS has them. Every framework has them.

One of the biggest considerations when evaluating a platform bug is when it appears. For example: if a bug appears on a new API version, and the platform is versioned – you can avoid the bug by either working around it, or by staying with the old API version until it is fixed.

If a bug is just there – and has been there for a while, you can either come up with a workaround, or just not use that particular feature – because the bug has always existed, there’s little or no risk the bug will impact code that you ship.

But, if a bug appears on the platform and breaks existing code – that’s a big problem. That’s why Salesforce puts in such a huge effort to test new releases, running every unit test (including customer unit tests and package tests) on the new version to detect any possible breaking change. Unfortunately, the DataDotComEntitySetting bug was this type of bug.

As it turns out, the problem related to a security setting on that particular object – one that I presume is used by Data.com Clean when enabled. It’s also not a common problem – it impacted our application and that of one other ISV (who started seeing sudden errors appearing with customers who enabled Data.com Clean).

The good news is that once we were able to reach the right people at Data.com to convey the impact that the problem was causing, they were phenomenal. They provided us with access to a sandbox with Data.com to both verify the error and confirm the fix, they kept us updated as to the progress, and today – confirmed that a patch has been pushed out to production.

So – it’s a happy ending.

But, happy ending notwithstanding, it did point out one area that I hope Salesforce will work to improve. You see, the application I’m working on is large and complex – and makes use of many platform features. So I’ve probably run into (and helped discover) more than my share of platform issues. Over the past few years I’ve noticed a dramatic improvement in the ability of the Salesforce frontline support to confirm, prioritize and address platform bugs. I’ve noticed a marked improvement in the Known Issues site – and the quick identification of workarounds where possible (and remember, for a developer, a workaround is usually almost as good as a fix). I’ve seen rapid and accurate responses on StackExchange.

I don’t know how Salesforce is organized internally, but from where I sit, the Data.com support group hasn’t quite gotten the message yet. Yes, they were great at confirming that a platform bug existed, but after that – things got… difficult. I won’t go into details, but it took some pretty extraordinary efforts on our part to finally reach the right people where we were able to have a good discussion and get real feedback that we could work with and convey to our customers. Anyway, I’m confident that they’ve learned as much from the experience as we have, and I am thrilled to see this particular platform bug dead and buried.


Anatomy of a Platform Bug

Update 5/20/13 – See “Death of a Platform Bug”

Platforms and frameworks have bugs.

Nobody really likes to discuss it – especially platform and framework vendors. But it’s like Murphy’s law of computer programming: Every non-trivial program has at least one bug. In fact, one of the signs that you have become an “expert” on a platform or framework is that a high percentage of the problems that you run into and can’t solve are, in fact, platform bugs rather than your own code.

I’ve found bugs in Windows, MFC, ATL and the .NET Framework. Nowadays I find them in Force.com. The experience is pretty similar on all of the platforms. First you have to be very sure that it’s really not your bug – this can be harder than you might think. There’s a lot of detective work involved – unlike your own code, you can’t necessarily know what is going on with the platform – I once found a VB bug where I actually had to disassemble a part of the VB control interface code in order to demonstrate to the developers where their mistake was. Which brings us to one of the biggest challenges – getting past the first-line support team to someone who can actually solve the problem (or convince them that you really know what you’re talking about and that they should forward the information).

I thought it might be interesting to walk through what the process is like with an example that I am currently dealing with. This is a story in-progress – I will add more information as it becomes available.

It began with our latest release – where on some systems we started seeing many of our unit tests fail with the following error:

FATAL_ERROR|System.DmlException: Insert failed.
First exception on row 0; first error:
UNKNOWN_EXCEPTION, INVALID_TYPE:
sObject type 'DataDotComEntitySetting' is not supported.: []

This was perplexing. After all, we don’t access an object called DataDotComEntitySetting. In fact, we don’t reference anything related to Data.com.

As a software vendor, you really don’t want to see most of your unit tests start failing. So this became a top priority issue.

Our first concern was whether we could install the software at all. The answer is, of course – yes. If you’ve read “Advanced Apex Programming”, you’ve seen unit test design patterns that allow you to dynamically enable or disable individual unit tests before or after deployment – so we’re not dead in the water. However, not being able to run unit tests means we can’t validate the operation of the application on those systems – which is definitely not good.

Because we could disable tests for installation and then reenable them after the software was installed, we were able to eliminate one theory – that the problem was purely related to software installation – perhaps some security issue related to the user context used during unit tests on installation.
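As a rough illustration of what such a switch might look like (a minimal sketch, not the pattern from the book itself), the helper below assumes a hypothetical static resource named DisabledTests that holds a comma-separated list of test method names. The class name TestSwitches and the resource name are both invented for this example.

```apex
public class TestSwitches {
    // Cache of disabled test names, loaded at most once per transaction.
    private static Set<String> disabledTests;

    // Returns true if the named test is listed in the (hypothetical)
    // DisabledTests static resource. If the resource does not exist,
    // no tests are considered disabled.
    public static Boolean isDisabled(String testName) {
        if (disabledTests == null) {
            disabledTests = new Set<String>();
            List<StaticResource> resources =
                [SELECT Body FROM StaticResource
                 WHERE Name = 'DisabledTests' LIMIT 1];
            if (!resources.isEmpty()) {
                for (String entry : resources[0].Body.toString().split(',')) {
                    disabledTests.add(entry.trim());
                }
            }
        }
        return disabledTests.contains(testName);
    }
}
```

A test method would then begin with a line such as `if (TestSwitches.isDisabled('insertsLeads')) return;`, so a test can be switched off or back on by editing the resource rather than the packaged code.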

Another early step was, of course, to search for other instances of this problem. Unfortunately, this was one of those cases where we clearly were innovators. There was only one reference to a similar problem, and our scenario did not match the one described.

This left us with a number of questions.

Was this really related to Data.com?

Yes, the error message referenced an object called ‘DataDotComEntitySetting’, but I’ve seen cases where an error message has nothing even remotely to do with the real source of the error. This is especially true in a complex framework, where internal error handling attempts to recover from a problem and only after a cascade of errors do you finally see an unrecoverable error – one that has nothing to do with the original problem. In this case, there were a number of factors, aside from the object name, that suggested it really was related to Data.com. First, both systems on which we saw the problem did have Data.com enabled – too small a sample for a firm conclusion, but an indicator nonetheless. Second, the StackExchange issue was seemingly related to a Jigsaw package, which later seems to have been integrated into Data.com. Later in this article you’ll see how we obtained further proof.

What changed?

Our new software release had dozens of unit test errors – most of them on code that had not changed from the previous version (as a reasonably agile organization, we have frequent releases). But there was one change that impacted the entire codebase – we upgraded from API 25 to API 27, mostly in order to take advantage of the new string library and some other new Apex features. When code breaks from one API version to another, that can be an indicator of a platform bug as compared to a bug in your own code.

Looking for a Workaround

At this point we had already submitted an initial case. But when dealing with potential platform bugs, you can’t just sit around and wait for support. You need information – the more the better. Fortunately, we have some great customers who are ok with us using the license management system to log in to their sandboxes – when you do so, you can see detailed debug logs for your managed packages. The push upgrades system also provides better information than a regular package install. This allowed us to see where the failure was occurring.

The code, in a nutshell, was like this.

// Code that creates some test lead
// objects but doesn’t insert them
List<Lead> newleads = initTestLeads();
InsertTestObjects(newleads);

The InsertTestObjects function is a public method that we use to insert test objects and perform some additional tasks. In this case, it sets a static variable so that our trigger framework will know to ignore these test objects.

public static void InsertTestObjects(List<SObject> objs)
{
   DisableExternalUpdates = true;
   insert objs;
   DisableExternalUpdates = false;
}
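For readers who want to see the flag in context, here is a minimal sketch of how the pieces could fit together. The enclosing class name TestSupport is hypothetical (the original class is not shown), and the try/finally is a variation on the code above, not the original: it resets the flag even when the insert throws.

```apex
public class TestSupport {
    // Read by the trigger framework; true only while test objects
    // are being inserted through InsertTestObjects.
    public static Boolean DisableExternalUpdates = false;

    public static void InsertTestObjects(List<SObject> objs) {
        DisableExternalUpdates = true;
        try {
            insert objs;
        } finally {
            // Reset even if the insert throws, so later DML in the same
            // transaction is not silently ignored by the triggers.
            DisableExternalUpdates = false;
        }
    }
}
```

Under this assumption, a trigger handler would start with a check such as `if (TestSupport.DisableExternalUpdates) return;` before doing any work.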

The error was occurring during the insert. We saw it occur on Leads, Contacts and Accounts – a fact that again pointed towards Data.com as the culprit, as it uses those objects.

One thing we found in the debug logs was that when the problem occurred, no object triggers were being called (at least in our application, or in user code). This provided additional evidence that the problem was not in our code or other user code, though it theoretically could have been in a different managed package.

This code is extremely simple. So we looked for ways to reproduce the problem.

  • We built some unit test classes in the sandbox that contained similar code. They worked perfectly.
  • We created another test package that contained similar code and tried to install it. It worked perfectly.

Things are so much easier when you can reproduce a problem. When you can’t….

What this did tell us however, is that whatever it took to cause this problem, it was not obvious. We had some test functions that failed, and others with almost identical code that succeeded. The problem was not intermittent – tests that failed did so consistently, those that passed also did so consistently. But there was no clear pattern.

So our next step was to create some patch versions of the application and see if we could change things to get the test to pass.

And we found something. If instead of calling the InsertTestObjects function we called a new strongly typed InsertTestLeads function, most of the tests passed.

public static void InsertTestLeads(List<Lead> objs)
{
   DisableExternalUpdates = true;
   insert objs;
   DisableExternalUpdates = false;
}

This would suggest that it was perhaps a language issue, except for one problem: there were other places in the code where a direct strongly typed insertion would fail. For example:

Account act = new Account(…..);
insert act;

This would fail with the same error. Not everywhere, just in some test functions.

Presenting the Case

We were very fortunate to be assigned a really good support person, but we’d also done our homework. While the original case was filed as an “application won’t install” problem, by the time we were on a GoToMeeting with support we could demonstrate failing tests, had log files showing the problem, and could demonstrate code changes that could in some cases resolve the problem. In short, we had overwhelming evidence that we were dealing with a platform bug.

The support person, who was familiar with Data.com, then walked us through some experiments. One of them involved turning off the “Clean” feature in Data.com. That did it – the tests stopped failing.

So now we were in as ideal a situation as one could ask for under the circumstances. Salesforce support agreed that it was a platform bug, and we knew for sure that it related to Data.com.

You may think I’m glad it’s a platform issue, and while in some sense there is relief that it’s not our code, the truth is that it would be much better if it were our code – we can fix our code. Now we have to hope that Salesforce will commit the resources to resolve the issue, and be able to figure it out – the inconsistent nature of the problem suggests that it may be hard to track down.

This is the “dark side” of modern software development – where we build applications based on packages, platforms, frameworks and services, many of which are outside of our control. It’s certainly not unique to Force.com. The best thing you can do is to be proactive – work with the platform and framework vendors to resolve issues, but be prepared to work with them on solving the issues, and where possible, develop workarounds.

I’ll add updates to this post as new information becomes available.

Meanwhile, if you have any insight to share, feel free to leave a comment (note: comments are moderated to limit spam, so you won’t see them immediately).

New course: Force.com for .NET Developers

I’m pleased to announce my latest Pluralsight course “Force.com for .NET Developers”. This course is a prequel to my course “Force.com and Apex Fundamentals for Developers”, intended specifically for .NET developers who are curious about Force.com.

Here’s how I describe the course:

Force.com is a unique cloud development platform that is in many ways different from traditional software development platforms – even those based on cloud technologies. This short course is designed specifically for .NET developers to understand the nature of Force.com by comparing and translating .NET concepts to their Force.com equivalents.

If you are a .NET developer, I encourage you to check it out – if you’re not already a Pluralsight subscriber, they have a free trial available (see right sidebar for link).

Intriguing Design Pattern for Scheduled APEX

One of the disadvantages of Scheduled APEX is that a scheduled class can’t be updated. Force.com creates an instance of the APEX class when it is scheduled, preventing it from being updated. You can’t edit a scheduled class, or update it via a ChangeSet, the Force.com IDE or a package update.

What’s more, Force.com prevents updates to any dependent classes as well. Thus it is quite easy for a scheduled class to “poison” an application – preventing many, if not all of its components, from being updated. As a result, updates to applications that use Scheduled Apex often require any scheduled jobs to be manually aborted before an update can take place.
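The manual abort step can also be scripted. The snippet below is a sketch for anonymous Apex, assuming an API version that exposes CronJobDetail on CronTrigger and a hypothetical job name of 'ScheduleTest nightly'; substitute whatever name the job was scheduled under.

```apex
// Abort every scheduled job with the given name, releasing the
// scheduled class (and its dependent classes) so they can be updated.
// 'ScheduleTest nightly' is a hypothetical job name.
for (CronTrigger job : [SELECT Id FROM CronTrigger
                        WHERE CronJobDetail.Name = 'ScheduleTest nightly']) {
    System.abortJob(job.Id);
}
```

The job then has to be rescheduled once the update is complete, which is exactly the kind of upgrade friction described above.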

It turns out, however, that a recent API update allows use of a design pattern that can help you avoid most of these problems.

Here’s how it works.

You still create a class that implements the Schedulable interface, but this class will be a simple wrapper that defines its own interface – call it IScheduleTest. This interface is identical to Schedulable. The execute method of the Schedulable global class creates an instance of a second class that implements the new IScheduleTest interface using the new Type class instantiation method. It then calls the execute method on that interface. It looks like this:

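global class ScheduleTest implements Schedulable
{
  // Mirrors Schedulable so the real work can live in another class
  public interface IScheduleTest
  {
    void execute(SchedulableContext sc);
  }

  global void execute(SchedulableContext sc)
  {
    // Resolve the worker class by name at run time, so the platform
    // does not treat it as a compile-time dependency of this class
    Type targettype = Type.forName('CalledByScheduleTest');
    if(targettype!=null)
    {
      IScheduleTest obj = 
      (IScheduleTest)targettype.newInstance();
      obj.execute(sc);
    }
  }
}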

The second class looks something like this:

public class CalledByScheduleTest 
  implements ScheduleTest.IScheduleTest
{
  public void Execute(SchedulableContext sc)
  {
    System.debug('called in schedule');
  }
}

What does this accomplish?

You still can’t modify the ScheduleTest class once it’s scheduled, but this class is so simple, you may never need to update it. You can update the CalledByScheduleTest class. Using the Type.NewInstance method to create the class dynamically prevents the platform from seeing it as a dependent class.
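To use the pattern, it is the wrapper that gets scheduled, never the worker class. A minimal sketch from anonymous Apex follows; the job name and the 2:00 AM daily schedule are arbitrary choices for illustration.

```apex
// Schedule the wrapper class; CalledByScheduleTest is only looked up
// by name when the job actually fires.
// Cron fields: seconds minutes hours day-of-month month day-of-week.
String jobId = System.schedule(
    'ScheduleTest nightly',   // hypothetical job name
    '0 0 2 * * ?',            // every day at 2:00 AM
    new ScheduleTest());
System.debug('Scheduled job id: ' + jobId);
```

Because Type.forName returns null when the named class cannot be found, the wrapper's null check means a missing worker class results in a job that does nothing rather than one that throws.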

I’ve been able to successfully update the CalledByScheduleTest class even during a managed package update, as long as the scheduled ScheduleTest class remains unchanged. Though this design pattern is not officially documented (to my knowledge), I see no reason why it should not work reliably going forward.

This design pattern eliminates one of the major impediments to using Scheduled Apex and is worth not only considering for new designs, but as a possible retrofit to existing applications.

New Course on Migrating to Apex

I’m pleased to announce the immediate availability of my first ever online course – Force.com and Apex Fundamentals for Developers on Pluralsight.com.

Think of this course as a prequel to “Advanced Apex Programming”. The book was designed for intermediate and experienced Apex developers, but isn’t a perfect fit for experienced software developers who are moving from other languages to Apex and Force.com. This course fills that gap.

It’s designed for experienced programmers who are beginning or intermediate Force.com developers. It does not teach computer programming – I assume that those viewing the course know how to program in a modern block structured language, and that they know how to read documentation.

Here’s how I describe the course:

Apex is the native language of the Force.com platform, and there is a huge demand for skilled developers in this space. The Java/C# like Apex language looks familiar enough that experienced developers often expect a short learning curve, but the platform is actually radically different, and requires use of a unique set of design patterns. In this course, you’ll learn the core concepts that are essential for every Apex programmer to learn, and a roadmap to further resources to help you quickly become an expert in this rapidly growing space.

This course, like Advanced Apex Programming, is not an overview or comprehensive introduction to the Force.com platform. It is a course on developing software, and the emphasis is on programming, design patterns and best practices.

I invite everyone to check it out – if you’re not already a Pluralsight subscriber, they have free trials available (see right sidebar). You’ll probably find other content there that you’ll like as well – I chose Pluralsight because I know quite a few of their course authors (both personally and by reputation), and they are the best (honestly, I feel honored to be counted in their company).