I was investigating an OutOfMemoryException today that occurred in a production intranet system. Fortunately, by leveraging smart people like Joel Pobar, the cause didn’t stay a mystery for very long (and we didn’t have to resort to the usual vadump and ADPlus route), and luckily the fix was as simple as changing a single Boolean parameter from true to false on one framework method call. We had a good repro in one of our test environments, but because of the vagaries of the build and deployment process, it looked like it was going to be a bigger-than-expected deal to patch and redeploy to verify the fix in the test environment. “But it’s just ONE assembly,” I protested. “Surely we could just ILDASM, modify and then ILASM it.” So that is just what I did. It took a little longer than expected because I was a little disconcerted by the fact that the ILASM’d assembly was a few KB different in size from the original one. After reading this post by Kenny Kerr I felt relieved and was ready to deploy my patched DLL for testing, which went off smoothly.
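For anyone curious what the "modify" step between ILDASM and ILASM amounts to, here is a minimal sketch in Python. It is not the actual patch: the method name `Demo.Widget::Load` and the IL listing are made up for illustration. It relies on the fact that a literal `true` is loaded in IL as `ldc.i4.1` and `false` as `ldc.i4.0`, and it assumes the Boolean is the last argument, so its load sits on the line directly before the call. The round trip itself would be something like `ildasm My.dll /out=My.il`, edit the text, then `ilasm /dll My.il /output=My.dll`.

```python
import re

def flip_bool_argument(il_text: str, callee: str) -> str:
    """Turn the `ldc.i4.1` (true) directly before a call to `callee`
    into `ldc.i4.0` (false). Assumes the Boolean is the last argument
    pushed and that the call instruction and callee name share a line."""
    lines = il_text.splitlines()
    for i, line in enumerate(lines):
        # "call" also matches "callvirt"
        if i > 0 and "call" in line and callee in line:
            lines[i - 1] = re.sub(r"\bldc\.i4\.1\b", "ldc.i4.0", lines[i - 1])
    return "\n".join(lines)

# Hypothetical disassembly fragment, roughly in the shape ILDASM emits.
SAMPLE_IL = "\n".join([
    "IL_0000:  ldc.i4.1",                                   # unrelated, must stay
    "IL_0001:  stloc.0",
    'IL_0002:  ldstr      "data"',
    "IL_0007:  ldc.i4.1",                                   # the true we want false
    "IL_0008:  call       void Demo.Widget::Load(string, bool)",
    "IL_000d:  ret",
])

if __name__ == "__main__":
    print(flip_bool_argument(SAMPLE_IL, "Demo.Widget::Load"))
```

In practice a one-line change like this is easy enough to make by hand in a text editor; the sketch is only there to show how small the edit is.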
Next came discussions of production: we didn’t have a scheduled outage for a few days in which a properly built and patched version could be deployed. Joel, possibly mildly impressed with my ILASM bravado, cooked up this proof-of-concept “zero downtime” approach involving WinDBG and modifying the JIT’d x86 code on the fly to show me how a REAL programmer does it. Yup: attach WinDBG, trace through a few memory addresses, modify one memory location and you're good to go. Just to be clear, we never actually DID this (the WinDBG stuff), not even in testing, but I think Joel has shown us the way forward next time one of our managers asks how much downtime is required to patch a system. Thank goodness for clusters and NLB, otherwise we all might have to actually know how to do this.