I ran into a new and exciting issue with ColdFusion today. I was called in to figure out a bizarre problem a coworker of mine was having. The navigation bar on a website was occasionally (about 1 out of 25 page loads or so) not showing all the data, showing too much data, or showing just plain wrong information. If you were to go to the site and reload the page, every once in a while you’d see data that was just wrong. For example:
Then, randomly, you’d see this:
In general, the code that outputs the navigation menu was looping over a query in the application scope. When the application was initially loaded, application.nav_menu would be set to the results of a query. Then, on all page loads, the nav menu would be generated by looping over the query in application.nav_menu.

In addition to that, the problem could be duplicated by creating some code as simple as this and then reloading the page over and over:
<cfoutput query="application.nav_menu">
    #CurrentRow#.) #Name#
</cfoutput>
When you ran this code most of the time you would see rows 1 through 30. However, occasionally, it would skip rows! You might see 1, 2, 4, 5, 7…. Notice that 3 and 6 are missing. The next time you’d reload the page you’d see 1, 2, 3, 4, 5 with no missing data.
Incorrect (note the missing data):
(and so on)

After beating my head against this problem for a few hours I began to have an idea what was going on. Here are a few more hints:
- The problem only occurred on the live site.
- The problem began occurring after upgrading from ColdFusion 5 to ColdFusion MX 6.1 with the 6.1 updater released in August.
- The ColdFusion server had been restarted at least once.
- The site gets about 125,000 hits per day. (I assume that’s on more than just .cfm files)
After some research I was able to reproduce the problem outside of production. It seems that when you have a query in the application scope, all of the query’s metadata is in the application scope too, including the currentrow value. That means that if two requests loop over the same application-scoped query at the same time, they share one currentrow. When the first request reaches the end of a loop iteration and increments currentrow, it is also incremented for the other request looping over the same application variable. When the second request reaches the end of its own iteration, it increments currentrow again and begins the next pass. From its point of view, the loop appears to have jumped two rows, not one.

To test this theory, I created a new folder in my application and wrote a few test files. One was an application.cfm designed to isolate my tests from the application.
<CFAPPLICATION name="test6" clientmanagement="No" sessionmanagement="Yes" applicationtimeout="#CreateTimespan(2,0,0,0)#" sessiontimeout="#CreateTimespan(0,2,0,0)#" setdomaincookies="true">
I then created a simple file, test1.cfm, which would create and cache the query into the application scope if it didn’t exist and loop over the cached query:
<cfif not isdefined("application.NavMenuTEST")>
    <cfquery name="application.NavMenuTEST" datasource="beta">
        EXEC amsp_NavMenuSetup 0, 'http://www.website.org/', 'http://www.website.org/', 4.1
    </cfquery>
    <h2>Application Query Var Set</h2>
</cfif>
<cfoutput query="application.NavMenuTEST" maxrows="30">
    #CurrentRow#.) #Name#
</cfoutput>
At this point I could load test1.cfm all day and never see the problem. This is good. These tests were not under any load and it was the expected behavior.

So, I created another file, tDoug.cfm. This file simulated thousands of users looping over the cached query at the same time.
<cfloop from="1" to="50000" index="x">
    <cfoutput query="application.NavMenuTEST">
        <cfset t = "#CurrentRow#.) #Name#" />
    </cfoutput>
</cfloop>
This file looped over the entire query 50000 times. Each time it would perform some unimportant mumbo-jumbo.

Running tDoug.cfm took 15 or so seconds. If, while this file was running, I hopped to another tab and ran test1.cfm, the data would be all out of order with lots of “missing” rows. It was the same problem I was having, just taken to a larger extreme.

Interestingly enough, I was not able to reproduce this on ColdFusion 5 or on ColdFusion MX 6.1 (on Linux), but I was able to reproduce it on a different ColdFusion MX 6.1 server with the 6.1 updater, pointing to a different database and server altogether. It sure sounds to me like a problem with the 6.1 updater.

For those of you who are having this problem, my solution was to duplicate the query into the local variables scope before looping over it. The major drawback is that I’ve got to have a minimum of two copies of the query in memory at any given point in time for this to work (because I’m duplicating the query on each request). In other words, if I updated my test1.cfm as follows I no longer ran into the problem.
<cfset navmenu = duplicate(application.NavMenuTEST) />
<cfoutput query="navmenu" maxrows="30">
    #CurrentRow#.) #Name#
</cfoutput>
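One way to check whether a particular loop is affected while a load script such as tDoug.cfm is running is to compare CurrentRow against a counter kept in the request’s own variables scope. The following is only a minimal sketch and was not part of the tests described here; the names expectedRow and skipped are made up for illustration, and it assumes the cached query from test1.cfm already exists.

```cfml
<!--- Hypothetical diagnostic: expectedRow and skipped are illustrative names. --->
<cfset expectedRow = 0>
<cfset skipped = 0>
<cfoutput query="application.NavMenuTEST">
    <!--- This counter is private to the current request. --->
    <cfset expectedRow = expectedRow + 1>
    <cfif CurrentRow neq expectedRow>
        <!--- The query's row pointer moved without us; count it and resync. --->
        <cfset skipped = skipped + 1>
        <cfset expectedRow = CurrentRow>
    </cfif>
</cfoutput>
<cfoutput><p>Rows that arrived out of sequence: #skipped#</p></cfoutput>
```

If the theory about the shared currentrow holds, the count should stay at zero when nothing else is touching the query and climb when another request is looping over it at the same time.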
This is all I can say at the moment. I let some people at Macromedia know about the problem and the ball is in their court now.
Update 9/23/2004: So, my coworker did send the results of our findings off to Macromedia yesterday. Apparently Macromedia has now confirmed that this is a bug going all the way back to MX 6.0! And no one ever caught this before?! Anyhow, they hadn’t heard of this problem at all before yesterday and then, all of a sudden, five companies call in with the same problem. Very strange.
The Macromedia support engineer is apparently going to suggest a hot fix for this issue. In the meantime, the official Macromedia line appears to be that you should use an exclusive lock around any loops over application queries. The other suggestion we’ve come up with is to use the duplicate method to copy the query into the variables scope before looping over it. Both will hurt application performance in their own ways. If you have this problem, I would suggest trying both solutions under load before choosing one or the other.

I’ll keep this post updated as I learn more.
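For reference, the exclusive-lock workaround mentioned in the update might look something like the sketch below, applied to the loop from test1.cfm. The 10-second timeout and the throwontimeout setting are assumptions picked for illustration, not values recommended by Macromedia.

```cfml
<!--- Sketch of the exclusive-lock workaround; timeout value is an arbitrary assumption. --->
<cflock scope="application" type="exclusive" timeout="10" throwontimeout="yes">
    <!--- Only one request at a time can be inside this block. --->
    <cfoutput query="application.NavMenuTEST" maxrows="30">
        #CurrentRow#.) #Name#
    </cfoutput>
</cflock>
```

The trade-off is that every request has to wait its turn to render the menu, where the duplicate approach pays in memory instead. A read-only lock would presumably not help here, since read-only locks let several requests in at once, and concurrent looping is exactly what triggers the problem.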