Friday, June 09, 2017

Gotcha #7 - AppDynamics and the Mysterious Case of the Vanishing Cookie

Not much of a mystery today in this article (the answer is alluded to in the title) but the process of identifying the problem, discovering why it was a problem, and then fixing it was quite a challenge.

Today's Gotcha is about Rational Performance Tester (v8.5.1.2 for anyone interested) and a vexing problem with successful test execution.

The problem originally arose when we discovered that our tests weren't always succeeding. After executing a schedule we would consistently have a large number of pages that wouldn't have 100% success status codes (located on the Server Health Detail view). There would be a few with 100% success rate, but consistently nearly every page was hovering between 94 and 98%. In some larger tests this would drop further, into the 80's or high 70's, but even a simple low volume test would have most pages at less than 100%.

To get a better handle on why and when these page errors were happening, I added a new counter to the default Page Throughput / Page Hit Rate graph (Add/Remove Performance Counters --> Page Performance Counter... --> Status Code Successes --> Percent Page Status Code Success [for Interval]).

Suddenly we could see this wavering line appear on the graph that would start off at 100%, dip sharply at the beginning of the test, before returning back up near 100% and stay fluctuating in the high 90% range.

Baffling when the very same test scripts in the past against the same application would run at 100% without trouble. Doubly-baffling when running the script as a standalone would work perfectly fine.

Following the Clues

The first clue came when we decided to rerun one of our test schedules and monitor the Test Execution of a user thread. Since the failures were reasonably consistent across all pages, chances are we might be able to catch it in the act.

After a couple of hours of watching successful page responses, we caught one. Right from the beginning of a new test script - a page failure. And then another, and another, and so on until every single page in that script failed in order. And the next script. And the next... but then the fourth script, for no apparent reason, began returning successful responses again.

Digging into the error, the only message we could find was a "Failed substitution ^csrfToken=.*$" message followed by a cascade of 500 - Server Error responses. The csrfToken appears in the HTTP Response header as a "Set-Cookie" value from the server (you can find an explanation of its purpose here). And for some reason, it wasn't being found.

Except... inspecting the response of that very first page - the csrfToken was right where it should be. Further digging into the application logs showed the same thing - on occasion our test would be sending an invalid csrfToken - the recorded token - which happens when the substitution couldn't be made.

But it was right there in the Response Header - right where we expected it to be. Except, not quite...

Smoking Gun

It looked like it was right where it was supposed to be - in among a half-dozen other cookies being set by the server. But a careful comparison with the original recording showed that it was 1 line down. There was a NEW Set-Cookie value that was inserted in the list just ahead of the csrfToken.

Inspecting and comparing successful test executions with failed ones showed that this new mysterious cookie only appeared in the failed tests, and wasn't to be found in the successful ones.

The server was randomly inserting a new cookie in its response... some of the time. And it was enough, by shifting the order of the Set-Cookie headers for RPT fail when trying to identify the csrfToken.

The culprit was a cookie called ADRUM_BTs and it looks like the following:
Set-Cookie: ADRUM_BTs="R:0|s:f"; Version=1; Max-Age=30; Expires=...date...; Path=/; Secure

There was no consistency in when or where it would appear. It would show up for a script execution or three, then disappear once more.

Solution

It turns out that we weren't the only ones to run the Mysterious Case of the Vanishing Cookie - other people had identified ADRUM_BTs as playing havoc with their scripting efforts.

The cookie in question belongs to AppDynamics End User Monitoring javascript package, a tool that can be used to monitor user behaviour and web application performance from the user's perspective. It was a new feature in the AppDynamics package that had been turned on in our Performance Testing environment recently to try and improve our measurement capabilities.

In the end, there was no way to remove this new cookie or adjust the system to consistently distribute it. In order to get our Performance Testing back up and running we had to remove all End-user Monitoring components - and suddenly our vanishing cookie was gone for the last time.

No comments:

Post a Comment