Best Bugs Yet

This week turned out to be the culmination of a year's worth of experience at my current employer, and it came through some very special bugs that had been haunting us for months.

1. Ajax Timing

First of all, our automated UI tests have been failing due to Ajax and JavaScript timing issues.  The failures are intermittent and actually quite rare, say 1-5% of runs.  That's still a problem because, including branches, we frequently do several hundred CI runs a day, which means that sometimes dozens of runs fail.  This leads to the dreaded lack of trust in test runs, which has very negative consequences: ultimately you can no longer trust them to reflect whether the product is actually working.

The mystery was why.  The page loads in 2-3 seconds on our Macs, and we had a 15-second maximum wait time, which seemed ample.

The clue finally came when I ran the UI tests at home on my Ubuntu machine, which happens to run the same OS as the CI machines and is also less powerful than my Mac.  The specs brought up the browser locally, but instead of 2-3 seconds, the page and all the JS took 30-40 seconds to load!

Finally I had the cause of our issues, and I was able to raise the 15 in this wait helper to 60 so the page has time to load fully.  Boom.  Five months of a mysterious Ajax loading issue finally nailed!

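# The max_wait_time default of 15 shown here is the value that was changed to 60.
# Note: Integer#seconds comes from ActiveSupport.
def wait_for(condition_name, max_wait_time: 15, polling_interval: 0.01)
  wait_until = Time.now + max_wait_time.seconds
  loop do
    # Stop as soon as the block reports that the condition holds.
    return if yield
    raise "Condition not met: #{condition_name}" if Time.now > wait_until

    sleep(polling_interval)
  end
end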
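To make the polling behaviour concrete, here is a minimal sketch of the same idea that runs in plain Ruby.  It is a variant, not the helper we actually ship: the name `wait_for_condition`, the monotonic clock, and the simulated "page" are all assumptions made for illustration.

```ruby
# Variant of the wait helper that needs no ActiveSupport and uses a
# monotonic clock, so system clock adjustments cannot distort the deadline.
# (Hypothetical name; the real helper is wait_for.)
def wait_for_condition(condition_name, max_wait_time: 60, polling_interval: 0.01)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + max_wait_time
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      raise "Condition not met: #{condition_name}"
    end

    sleep(polling_interval)
  end
end

# Simulated slow page: it only counts as "ready" after 0.2 seconds.
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
page_ready = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) - started > 0.2 }

# A generous limit costs nothing when the page is fast, because the
# helper returns as soon as the condition holds.
wait_for_condition("page loaded", max_wait_time: 2) { page_ready.call }
puts "page loaded"

# A limit that is too tight fails, just like our 15 seconds did on slow machines.
begin
  wait_for_condition("never true", max_wait_time: 0.1) { false }
rescue RuntimeError => e
  puts e.message
end
```

The point of the sketch is that the maximum wait is only an upper bound: raising it slows down failing runs, not passing ones.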

2. Slow Automated Mobile Tests

Finally, tests are passing for mobile devices through the BrowserStack automated service.  But they are soooo slow.
One of the simplest 'fixes' for this was to reduce the tests being run to those workflows actually used on mobile devices.  Two of our workflows are desktop only.  This reduced the mobile test specs from 24 to 9 and the run time from 42 minutes to 18 minutes.
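The selection itself can be sketched in a few lines of plain Ruby.  Everything named here is hypothetical: the spec file names, the workflow labels, and the `specs_for` helper are made up to show the idea, not taken from our suite.

```ruby
# Hypothetical spec inventory: spec file => workflow it exercises.
# These names are illustrative only.
SPECS = {
  "spec/ui/checkout_spec.rb"  => :checkout,
  "spec/ui/search_spec.rb"    => :search,
  "spec/ui/reporting_spec.rb" => :reporting,
  "spec/ui/admin_spec.rb"     => :admin
}.freeze

# Workflows assumed to exist on desktop only.
DESKTOP_ONLY_WORKFLOWS = %i[reporting admin].freeze

# Returns the spec files worth running for the given platform.
def specs_for(platform, specs: SPECS, desktop_only: DESKTOP_ONLY_WORKFLOWS)
  return specs.keys if platform == :desktop

  # On mobile, drop every spec that belongs to a desktop-only workflow.
  specs.reject { |_file, workflow| desktop_only.include?(workflow) }.keys
end

puts specs_for(:mobile)
```

If the suite is RSpec, tagging the desktop-only examples and excluding that tag on mobile runs would probably achieve the same thing with less bookkeeping.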

3. Passing, Failing, Passing and Stopping Randomly

The mobile tests have been behaving very erratically in other ways.  At various points it seems like the device suddenly stops working or responding.  Then we would get 3 test suites in a row passing.  With that good sign we ran a bunch more.  But most failed.  Except the last one.  After a long day of runs, I finally spotted the pattern.  Each test run on CircleCI (we can have 3 running simultaneously) has 4 slots (machines).  They create **and destroy** a tunnel to the virtual devices at BrowserStack, so different runs are killing each other's connections.  This was not obvious until we ran multiple branches (or the same branch) at the same time as other CI runs.  We proved it by running one run at a time.  The fix is to start and stop the tunnel once.
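The "start and stop once" idea can be sketched like this.  The `FakeTunnel` class is a stand-in that only counts calls, and `with_shared_tunnel` and the suite names are hypothetical; none of this is the real BrowserStack client or our actual CI setup.

```ruby
# Stand-in for a real tunnel client; it only records how often it is used.
class FakeTunnel
  attr_reader :starts, :stops

  def initialize
    @starts = 0
    @stops = 0
    @running = false
  end

  def start
    @starts += 1
    @running = true
  end

  def stop
    @stops += 1
    @running = false
  end

  def running?
    @running
  end
end

# Opens the tunnel once, runs everything inside it, and closes it once,
# even if one of the suites raises.
def with_shared_tunnel(tunnel)
  tunnel.start unless tunnel.running?
  yield tunnel
ensure
  tunnel.stop if tunnel.running?
end

tunnel = FakeTunnel.new
with_shared_tunnel(tunnel) do
  # Hypothetical suite names; each would previously open and close its own tunnel.
  %w[login checkout search].each { |suite| puts "running #{suite} through the tunnel" }
end
puts "starts=#{tunnel.starts} stops=#{tunnel.stops}"
```

Putting the teardown in an `ensure` means a failing suite still closes the tunnel, but only the one owner ever does, so no suite can pull the connection out from under another.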