We recently swapped our prepay billing system – for some reason, I prefer the term “prepay billing system” to “Intelligent Network”, but that is a topic for another day. Replacing a system is fraught with risks: some are foreseeable, others simply are not (and that is the fun bit, I suppose). Having witnessed a similar system swap that went horribly wrong in 2008, I expected that things would be even worse this time round. Back in 2008, I could hardly name any functionality that was working for six hours after cut-over. Services that had passed UAT failed. The RA tool failed to process the feeds. There was surprise after surprise. Over the past three years, complexity has only increased, and I was not sure that human intelligence had gone up by the same measure. This time, I expected nothing to work.
The swap was scheduled for the last weekend of February. My incredulity when I heard of the timing was great, to say the least. In a predominantly prepay market, most people load airtime onto their phones around this time, and revenue registers a perceptible increase. I have no statistics, but I daresay even pubs see increased traffic and the number of cabs heading to the red-light district also goes up. Why would we ever want to swap the system at this time? Nevertheless, the decision had been made. Management giveth, management taketh.
I was surprised that when we did the swap, things worked. From a review of tweets around that midnight, some customers actually thought the upgrade had been postponed: until the critical five-minute window when the actual swap took place, they experienced no service interruption; voice calls were OK, internet was OK, recharges were OK. Data dumps had been taken and uploaded before the actual service-affecting procedures were run, hence everything looked normal. A trending topic quickly developed on Twitter, titled “reasons why Safaricom upgrade is late” or something to that effect. The suggestions were creative:
They are waiting for Bob Collymore [CEO] to tweet one last time…
The Engineers are drunk…
They cannot find the power button…
They disconnected themselves first…
What is wrong with these guys – they cannot even go down on time…
It’s nice (in a disturbing way) to know that your customers expect you to mess up by default – if you don’t, you shine. Although there were several reconciliations to be done, revenue assurance participated in this project largely in a risk advisory capacity. When the whole exercise was over, I think we all learned a few lessons. Were there problems? Yes. Do we still have problems? Yes. Was it better than last time? I will leave that for the customers to decide. Contrasting our experience in 2008 and the relatively smoother run of 2011, some things come to mind:
- Dry run, dry run, dry run. The mock migrations teach the project team a lot. We performed a total of seven dry runs and also obtained authorization to route live traffic to the new system during one of them. Dry runs help pinpoint which resources are needed, how much time each migration task can reasonably take, what to do and what not to do. Even if practice does not make perfect, it certainly makes possible, with less of a bloodbath. The dry runs also reveal which characters in the project team are a vexation to each other, so the team can be realigned accordingly. You certainly don’t want two of your bright engineers going at each other hammer and tongs on that cold, lonely night when the new system must go live. During dry runs, some light-bulb moments come up and some dumb ideas also come to the table. As a team, listen to both before deciding. Even better, test both the “stuff of genius” and the “dumb ideas” at the next dry run. The results may surprise you: what sounded so dumb may be the perfect idea, and what sounded so great may be useless!
- Proper company-wide pre-launch testing is equally important. The user acceptance testers need to be competent resources who have a sizeable part of their job, at least during the testing period, dedicated to the tests. In 2008, we treated testing as an additional task: the testers had their normal daily jobs and tested the system when time allowed. This time, we had dedicated teams from Revenue Assurance, Technical, Retail, the Call Centre and Finance. Each of these teams is expert in one area or another, and if the system has a problem, you can depend on them to discover it. In total, four rounds of testing were done, so we had a fairly good idea of what was working and what had issues.
- Keep an RA person at a customer-facing point during the swap and for some time after it. Because we had team members at our retail centres and the call centre, we got first-hand feedback on the billing and service functionality issues that customers were complaining about. Even more important, this prevents the real issues from being lost as they are forwarded via email.
- During a swap, sometimes revenue is not the critical part. Your team may be called revenue assurance, but it needs to go beyond ensuring that the new system is billing correctly. Services will be affected; how long they remain affected is even more important. Some processes will change, and new risks will open up with them. How ready is the business for this? Things will go wrong. That is not the important part. How fast your team can react is even more critical and will determine how long customers suffer. If some correction processes have unnecessary red tape, now is a good time to get rid of the bureaucracy.
- Is there anything that you require? Make sure it is in place. Do not depend on promises that it shall be availed on D-day. It does not matter how small or big it is (recharge vouchers for testing top-ups, disk space on the analytics tool that you will use, sitting space…chewing gum, Red Bull, a coffee maker). If you have an automated RA tool, put the vendor on standby. Even if you tested integration of the new system’s feeds into the RA tool ahead of the swap, you can be sure something will come up.
- Separate symptoms from the real problem. Some complaints are simply symptoms of a deeper problem. For example, shortly after the swap, customers complained that the money transfer service was not working AND top-ups were not working AND data bundle purchases were not working. These all seemed unrelated, but the common factor was that they rely on SMS: we had an issue with SMS. If the dots are not connected fast enough, the customer suffers even more.
- Never trust a CDR definition document. Dive deep and find out whether what the document says is what you actually received. Three years ago, we trusted, and we suffered.
- A system swap is as much about people and processes as it is about systems. When changing systems, be aware that politics will come in. People who have built their reputations around mastery of the old system may be understandably nervous. When somebody tells you, “That system feature will never work!”, do listen to the statement and the underlying meaning as well. Perhaps the person has no idea how it works and may require some more training.
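The “symptoms versus the real problem” point above can be made mechanical. As a minimal sketch (the service names and the dependency map are hypothetical, not our actual architecture), you can record which underlying platforms each customer-facing service depends on, then intersect the dependencies of everything that is failing and subtract the dependencies of anything that still works:

```python
# Sketch: connect the dots between "unrelated" complaints by looking at
# what each failing service depends on. The dependency map is hypothetical.

DEPENDS_ON = {
    "money_transfer":  {"SMS", "billing"},
    "top_up":          {"SMS", "billing"},
    "bundle_purchase": {"SMS", "billing", "data"},
    "voice_call":      {"billing"},
}

def shared_root_cause(failing, working):
    """Platforms every failing service needs, minus those proven healthy."""
    suspects = set.intersection(*(DEPENDS_ON[s] for s in failing))
    for s in working:
        suspects -= DEPENDS_ON[s]
    return suspects

# Money transfer, top-ups and bundle purchases are all down; voice is fine.
# The one surviving suspect is the SMS platform.
assert shared_root_cause(
    ["money_transfer", "top_up", "bundle_purchase"], ["voice_call"]
) == {"SMS"}
```

Even done on a whiteboard rather than in code, the discipline is the same: list the failures, list what they share, and rule out whatever the still-working services have in common with them.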
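On the CDR point above: “dive deep” can be as simple as running sample records from the new system against the documented layout before you wire them into the RA tool. A minimal sketch, assuming a pipe-delimited feed (the field names, positions and formats here are invented for illustration, not our actual CDR specification):

```python
# Sketch: check sample CDRs against the documented field spec before
# trusting the definition document. Field names, order and formats are
# hypothetical placeholders for whatever your document claims.

SPEC = {
    "msisdn":   lambda v: v.isdigit() and len(v) == 12,
    "event_ts": lambda v: v.isdigit() and len(v) == 14,  # YYYYMMDDhhmmss
    "duration": lambda v: v.isdigit(),
    "charge":   lambda v: v.replace(".", "", 1).isdigit(),
}

def check_cdr(line, sep="|"):
    """Return a list of mismatches between one raw CDR and the spec."""
    fields = line.rstrip("\n").split(sep)
    problems = []
    if len(fields) != len(SPEC):
        problems.append(f"expected {len(SPEC)} fields, got {len(fields)}")
        return problems
    for (name, ok), value in zip(SPEC.items(), fields):
        if not ok(value):
            problems.append(f"{name}: unexpected value {value!r}")
    return problems

# A record that matches the documented layout passes cleanly...
assert check_cdr("254722000001|20110226235959|60|4.00") == []
# ...while an extra, undocumented field is flagged immediately.
assert check_cdr("254722000001|20110226235959|60|4.00|ROAM") != []
```

Run something like this over a few thousand records from every CDR type the new system emits; any mismatch list that comes back non-empty is a conversation to have with the vendor before cut-over, not after.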
If you forget all the lessons I have highlighted above, remember this one, as it is very important: ensure that your team has a place where they can take a quick shower and brush their teeth. The swap stretched longer than 24 hours. Trust me, you do not want to be in a room with some 50-odd individuals who have not showered or brushed their teeth. Next time, we shall keep a garden hosepipe handy.
Looking at these lessons, they are not the stuff of great discoveries, really. They are basic stuff. Stuff that easily gets overlooked. Stuff that we overlooked in 2008, and our customers consequently paid a heavy price.