2019-05-24

Taco Bell Programming and The Time I Summoned The Ghost of Doug McIlroy (Who Is Still Alive at the Time of This Writing)

Don't overthink things.

If I had one piece of advice for software developers, engineers, and other folks working in careers that value attention to details, it would be this. Don't overthink things.

Think about things, sure. I'm not advocating an "ignorance is bliss" policy or saying that the simplest solution is the only correct one. Far from it; if you have a problem, you should consider all the variables before you implement a solution if you have the luxury of time to do so. But too much time is wasted on trying to forge a perfect bespoke widget that is the exact size and shape of the problem when a much more generic option exists that can get the job done in a fraction of the time.

A recent BSD Now episode (#291) discussed an article about Taco Bell Programming. The idea is that Taco Bell is one of the most successful fast food chains in the US and its entire menu is composed of only a few basic ingredients: meat, cheese, lettuce, tomato, onion, flour tortilla, corn flour tortilla, spices. All basic stuff, but they can be reorganized and arranged into a dozen or more permutations.

It reminded me of the 1986 Jon Bentley challenge. The challenge was simple enough by today's standards: given a text file, do a word frequency analysis on it and print the n most common words in the file. He enlisted Don Knuth to solve the problem, which he did, with a custom language he'd designed called WEB. Knuth's program was mathematically elegant and used a bespoke data structure perfectly tuned to solve the problem. It also had bugs that made it incorrect.

Doug McIlroy, when presented with Knuth's WEB program, complimented its design and cleverness, then produced a new implementation of it in a six-part UNIX shell script:

tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q

Not only is McIlroy's solution more correct than Knuth's custom WEB app (har har), it's short and sweet. In fact, it's trivial to understand if you use UNIX regularly and still easy to digest in simple terms if you don't. It uses four different programs found in the base system, and it only calls sed because head hadn't been written yet. It doesn't require you to download and install the WEB runtime and compile "knuth-1.wb" into a binary that you will only use once in your lifetime.
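
For anyone who doesn't live in a shell, here is roughly how the six stages fit together. Treat it as a sketch, not McIlroy's original text: the wordfreq function name and the default of 10 words are made up for illustration, and the last stage uses head in place of sed, since head exists now.

```shell
#!/bin/sh
# wordfreq N: read text on stdin, print the N most frequent words.
# "wordfreq" and the default N of 10 are illustrative, not from the original.
wordfreq() {
    tr -cs 'A-Za-z' '\n' |   # squeeze every run of non-letters into one newline: one word per line
    tr 'A-Z' 'a-z' |         # fold everything to lower case
    sort |                   # bring identical words next to each other
    uniq -c |                # collapse each run into "count word"
    sort -rn |               # biggest counts first
    head -n "${1:-10}"       # keep the first N lines (the original used sed ${1}q)
}
```

Sourced into a shell, it would be called as something like "wordfreq 5 < some-text-file", with whatever text file you have lying around.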

McIlroy's solution was Taco Bell Programming at its finest, and he taught a valuable lesson that is all too often forgotten, especially now in this era of cloud computing/machine learning/blockchain/distributed buzzword soup. Companies push their complex software as "platforms". The Taco Bell Programming philosophy says that new systems mean new problems. I remember the time I had to pull a McIlroy move of my own several years ago.

I was just a junior sysadmin sitting at my desk, wondering what TV I was going to watch that evening when my boss and his boss walked into my office. When that happens, it's never a good thing.

"John's on vacation and there's a hotfix that the Green Team needs to deploy, can you do it?"

I was on the Blue Team, and we were almost entirely separated from the Green Team except by a few endpoints that relayed our data over to them sometimes, and they never invited us to see how their sausage was made. I say "them", but really the Green Team was one guy: John. John was a senior ops engineer and there was no junior to support him. He inherited the role when the last senior left, and he never backfilled a replacement due to never finding a suitable apprentice.

At one point after his rejection of an umpteenth potential candidate, a coworker asked him, "John, remember when you started working here? If that version of you showed up for an interview today... would you hire him?" John still never found a replacement. A few years later he'd be forced by management to take one of my Blue Team vendors and I hope that worked out for everyone.

But on this particular afternoon John was the only ops engineer the Green Team had and John was on vacation for at least a couple more days. They needed to deploy a hotfix and since John wasn't available, they just needed a keyboard monkey with the right credentials to hit the keys.

I agreed. How could I refuse?

I was assured that this was A Big Deal: customer-impacting, gotta get it fixed, now-now-now, and that the Green Team Developers were hard at work upstairs — developers were never on the same floor as the lowly operations guys — hammering out a fix for the problem. "OK!" I thought. I will just wait for the devs to finish writing their thing, testing it, confirming that it works in their test environment, and then I'll just run a script or swap out a binary or twelve, restart a service or two, and everyone will be happy.

I wanted to make everyone happy. I was young.

I sat in my office, fiddling with whatever I was supposed to be doing, but anxious about getting this hotfix deployed. I'd never touched the Green Team servers. I could screw it up. "You won't screw it up," I was assured by my boss. I was to be given the usernames and passwords I needed, all the server names were documented. All the repair instructions were going to be handed to me on a silver platter. The dev team was going to make absolutely certain that I could not mess up their system. I couldn't. They built it to be un-screw-uppable.

Throughout that afternoon, I'd get status updates from upstairs. "They're almost done coding." "They just handed it to the test team." "Test team is going to sign off on it as soon as the turboencabulator finishes reticulating its splines." I barely understood what any of the statuses that I was being given meant, but I understood the bottom line. I was going to be working late.

Not too too late. Test team was going to finish their testing by 5 PM, I was going to deploy the fix, verify it, let everybody shake hands on a job well done, send out an all-clear e-mail, and be on my way. Twenty minutes. Tops.

So at about 5 o'clock people I've never seen before or since started to file into my office one or two at a time. They said hi, introduced themselves as so-and-so from the Green Team upstairs, and then parked themselves somewhere to await the final blesséd bits to be delivered into my waiting embrace.

By 5:30, I'd say there were a dozen people crammed into my office. They were sitting on my guest chair, sitting on my love seat, resting on its arms, milling in the corners, and hanging out anywhere there was a spot to stand. I had so many people interested in this hotfix that they were standing outside in the hallway because my office was too crowded to get anyone else in it.

Then the word was passed that the fix was ready and some senior program manager with a look that just screamed "all business, all the time" came in, said hi, and sat next to me to get started. The sea of devs parted to let him get access to lowly me, the guy who was going to do the typing.

He had some notes and he handed them to me. I looked over what I needed to do and it didn't seem that crazy. Copy a directory from A to B, open a file, confirm it says C in it, restart a service. Basic stuff, and the guys upstairs had spent all damn day making sure this was going to fix everything automagically and the day would be saved.

So with all eyes on me, I got to work. Go to A, find the directory, log into B, credentials worked (whew!), make a backup on B (brownie points for that), copy the directory to a safe place on B, move it into position, open the file, do the thing. I made sure at each point that I was moving just slowly enough that if I started to go off the rails one of the more than 24 eyes would catch my mistake and stop me before I broke something. I confirmed with the crowd. "OK, directory is Foo.Bar.Baz-7." If I'd said "Foo.Bar.Baz-8", they'd have corrected me. Things were going smoothly, even though the tension in the air was palpable.

The temperature in my office was climbing just from body heat and hot breath. It was a little after 6. Everything was going according to plan. I was going to go home soon.

The old files had been backed up, the new files had been put into place. It was time to restart the service and split. I restarted it, confirmed it was up and running, no errors, great! I walked through the verification steps.

It wasn't working.

The program was running. It hadn't frozen, it hadn't crashed. But the hotfix was not actually fixing anything. There were sighs and groans, and the tension in the room ticked up another few notches.

Someone stepped out of the office. I'm not sure I ever saw them again. Maybe they went to pack up their belongings and leave the company.

The PM sitting next to me didn't lose the weird intensity he'd had when he arrived. It didn't even waver. The troubleshooting began immediately and from a dozen people. Backseat driving pales in comparison to backseat debugging, especially when so many people are doing it simultaneously they could all go form a soccer team. They even had to make their condescending opinions known about my text editor: my choice of WordPad was met with snide disapproval.

"Open this log file," said one. "Check the Event Viewer," said another, and we were off, trying to understand what this program was doing and why it wasn't doing what we wanted it to do. I say "we", but I honestly didn't know what this program was supposed to do in the first place and this wasn't my team. But I dutifully opened all the log files and checked all the tire pressures and patted my head and rubbed my stomach at the same time.

Around 7 PM or so, with tabs opened and events viewed and logs grepped for arcane bits of data only God could know the meaning of, I finally asked what the problem was. What was so terrible in the world of the Green Team that they'd dropped everything that day to write a fix for it and were now settling into hours of unpaid overtime to debug all its heaping helpings of failure in production over the shoulders of a borrowed Blue Team involunteer?

"Files are arriving on Machine One," the PM said, "and they're getting stuck there. They need to get dequeued and sent to Machine Two."

Without thinking, I asked. "Why not use robocopy?"

Tongues were clucked. Maybe someone laughed, the smug kind of chuckle parents don't even try to hide when their kid asks in all seriousness if clouds are made of candy. Their meaning was clear: robocopy isn't an enterprise-grade solution you silly boy! You are looking at a complex pipeline of data management services! A dinky built-in file utility that our test team hasn't signed off on is not a proper solution! The development team has spent all day carefully crafting expert algorithms to handle this issue and they've got a proven pipeline of tools to demonstrably create and test reliable software...

...idiot.

Hours passed. The devs couldn't figure out what was going wrong, the testers couldn't figure out why this all of a sudden wasn't working when they swore up and down it had in their lab. There was muttering and bickering amongst the throng, but it didn't result in any enlightenment. People started filing out. It was late. They had families to get to, cold dinners to reheat, and really, they were only there to see their hotfix work so they could get cake & ice cream from their management. Once it was obvious their update was a complete wreck and the cake party would be cancelled, it was time for them to quietly exit my office. The teeming mass slowly dwindled to just a few people over the next few hours as plans B, C, and D were thought up, attempted, and rejected.

We were heading into hour four. This isn't counting me staying late just waiting for the dev and test teams to finish coding up their stupendous little notfix. This was just four hours of pure deployment fail.

By 10 o'clock, only the program manager, one last dev, and I remained. We were tired, we were hungry, and we were out of ideas. So I mentioned it again.

"If you're just trying to move files from one machine to the other, and that's it? Right? Why. Not. Just. Use. Robocopy?"

I can't say I'd persuaded him. By this point in the evening he was already sunk, treading water, and had lost a lot of energy. I wasn't influencing him with a brilliant display of lateral thought. I was giving him a straw to grasp. He relented, and asked what that would entail.

I ran "robocopy /?" and pointed out that it was designed to copy files and it can do it between machines with UNC paths. And you could remove the copied files from the source machine if you wanted. And you could just have it run every few minutes. Which, it turns out, was exactly what their holy Foo.Bar.Baz-7 software was designed to do, poorly.

It took me all of about a minute to write up "fix.bat":

@echo on
robocopy.exe D:\Data\Files \\other-machine\D$\Data\New-files /MOV /MOT:5

The PM was shocked. Not in an "I'm impressed" way. Shocked like he'd just spent twelve hours getting screamed at that everything is on fire and now this Blue Team bozo was showing him a two-line batch script that was going to save the day. It was janky. It wasn't tested. It didn't comply with the Big Book of Secure Coding Practices. It didn't have a code spec document that had been reviewed and approved by the code committee. It hadn't been checked into source control or pair programmed or anything. And it was written by some guy in Ops. Ugh!

But it worked.

It worked well enough that we could go home. The files were moving from one machine to the other and that's all we needed to get things working again. The devs could come back in the morning and start rewriting their algorithms and they could try their homebrew solution again when John returned.

The PM could write his bosses and, I expect, understate the waste of a day's and night's work on a bogus fix that didn't do anything. He could report that the live-site issue was mitigated and that additional cleanup work remains to be done to ensure a permanent resolution to the problem.

He thanked me. He and the other guy left my office. I packed up, went home, and microwaved something for dinner.

Don't overthink things.
