Recipe for success

What does a team need to deliver a successful software project? I’m starting to think about what I’ll want in my next engagement.

There’s plenty left to do, but as I approach the end of my current main assignment as a Technical Architect, I’m starting to think about what my future engagements should have.

This is my starter for ten (well, five):

  1. Anything but waterfall
  2. Genuine Public Cloud, with a hint of lock-in
  3. Internal users matter just as much
  4. Partnership with your Product Owner
  5. Embedded QA, seen as a benefit, not a drag

Anything but waterfall

Scrum? Kanban? Scrumban? I don’t really care exactly what it is; more that it works for the project, and that everyone understands and supports it.

I hate designing things entirely upfront; it seems so conceited to believe you can genuinely design an entire system without trying to make any of it. While I know this doesn’t apply when you’re building a rocket1 or CERN, you’re not doing that, are you?

Yes, you absolutely need a sense of roughly where you’re heading, and ideally an end goal that you’re heading towards – but you also need the pragmatism to know if you try to build that from the start, you’re going to burn lots of rubber on the road, while making very little progress.

Show your dev teams that you can and do go back to make things better. Build the sense of trust that when you say “Just build the slightly-hacky ‘tactical’ thing, we will fix it later” that you do go back and fix it.

You’ll free everyone up from the performance anxiety of “Must get it right first time, because I can’t go back and fix it”.

Genuine Public Cloud, with a hint of lock-in

I would like to think that cloud is a given, but I still face people who say things like “It’s just someone else’s computer” (yes, but in general they have better capacity planning than you) or “I could do x for cheaper” (I’m sure you could, but you’re usually not factoring in the hidden costs).

The main system we built does have an on-premise element, but it’s controlled by the cloud, and deployed in a similar way.

We host the core of the system in the cloud, and that gives us an agility in scale and deployment we don’t have on-premise. Could we get that on-premise in time? I’m sure we could, but then we lose the benefits of the AWS value-add services…

“we use Amazon, but we only use EC2 and we don’t use any of their special services, so we’re not locked-in”

Speaking of which, when I hear that particular line, I want to congratulate the person on ensuring they’ve deployed their software in a way that will either cost them more, or be less reliable, or both.

At some level, to get the best value out of a cloud provider, you do need to be using their value-add services, running bits of your application serverless and other bits as more scalable stateless systems.

Yes, if you write a Lambda, you can’t instantly port it to Google Cloud Functions; but given they both run Node, provided you put the thing that does the work in a scoped module, migrating should only mean writing the Google invocation code.
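That “scoped module” point can be sketched in a few lines. This is a hypothetical example in Python (both providers run Python as well as Node); `summarise` stands in for whatever your business logic is, and each handler is a thin, provider-specific shim:

```python
# core.py — the thing that does the work, with no provider-specific imports
def summarise(text: str, limit: int = 10) -> str:
    """Hypothetical business logic: truncate text to `limit` words."""
    words = text.split()
    return " ".join(words[:limit])


# aws_handler.py — thin AWS Lambda shim (event/response shapes are AWS's)
def lambda_handler(event, context):
    return {"statusCode": 200, "body": summarise(event["text"])}


# gcf_handler.py — thin Google Cloud Functions shim (request shape is Google's)
def gcf_handler(request):
    return summarise(request.get_json()["text"])
```

Migrating then means rewriting only the few lines of shim, not the logic itself.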

I’m not saying use every service, but starting from the position that you’re only going to use Infrastructure as a Service is too dogmatic.

Internal users matter just as much

Yes it’s an internal system. Yes it’s not public facing.

Yes it should still be as performant and usable as your public properties.

Facebook probably does more than your system. Facebook is generally fast to use, and yet nobody gets training in how to use it. If your system requires lots of training, are you doing things as well as you could?

Consumer technology and services are good. Very good. Your users expect your system to match that, and when you give people tools that work well, they’re freed from hating the system they are using, and allowed to actually focus on the tasks they’re doing.

Focussing on my current engagement, a partnership with our core users meant they took on some extra manual work while we ran the extended migration. They only agreed to that once we had earned their trust, and they realised that “could you do this for 3 months” meant just that (granted, it was more like 4 months).

Partnership with your Product Owner

Product Management is still a relatively new discipline, so there is no one true way, and I hope there never becomes one, because not all products are the same.

Regardless, partnership with your Product Owner is crucial, and if they’re technical you want to work hand-in-hand with them on key design decisions. If they’re less so, you need their trust and for them to delegate responsibility.

Embedded QA, seen as a benefit, not a drag

The embedded tester in the team is a key resource. They should ask questions, spot the things we didn’t, and are invariably the first call for “do we know what happens in situation x?”.

For all the frustration that Test Driven Development can cause when doing genuine micro-services, the testing framework it provides means that we never ship the same bug twice. Sometimes, when we’ve suspected bugs, modifying an existing test has helped us check our hypotheses quickly.

Easy regression testing makes you far more able to build and iterate quickly.
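As a minimal sketch of that “never ship the same bug twice” loop (the function and the bug here are entirely hypothetical): when a bug surfaces, you pin it with a test before fixing it, and that test stays in the suite forever.

```python
def parse_duration(value: str) -> int:
    """Parse a hypothetical duration string like '90s' or '2m' into seconds."""
    value = value.strip().lower()  # the fix: ' 90s ' used to crash on whitespace
    if value.endswith("m"):
        return int(value[:-1]) * 60
    return int(value.rstrip("s"))


# Regression test: pins the bug we shipped once, so it can never return.
def test_whitespace_does_not_crash():
    assert parse_duration(" 90s ") == 90


def test_minutes():
    assert parse_duration("2m") == 120
```

Checking a hypothesis about a suspected bug is then just tweaking one of these inputs and re-running the suite.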

In conclusion

You can’t make a project be a success, but there are things you can do that increase the chances…


  1. And talking of rockets, look at what SpaceX have done, which looks pretty much like rapid evolution of a rocket platform, adding more capabilities…

Re-use more than code?

“You can just re-use the code from x, can’t you..?” is a common call in organisations, but does it always make sense?

I’ve been working on a project recently, and when it started, we were “just going to use the components from <another project>”.

You’ve written many lines before, so why wouldn’t you re-use them? In the abstract it seems a pretty sensible thing, but it rarely works out so well in practice.

It’s unlikely your company is writing something as fundamental as a security library, where the domain is fixed, or as universal as the company Active Directory, where you only need one.

What you likely have is a series of tactical solutions that meet the needs of each silo, which isn’t a bad thing, because they’re probably bits of code that were actually delivered. How often have we waited for the ‘generic’ solution that didn’t really work for anyone?

Now, I’m not saying you should avoid code re-use where it’s genuinely re-usable. If the domain is simple and generic enough, converge on one library. But code isn’t the only thing you can re-use.

Going back to the specific example, I spoke to the architects from the project we were just going to lift-and-shift from, and we discussed how the new things AWS had launched made much of it moot, or far more heavyweight than you’d build if you were starting today. “You could re-use this, but why don’t you look at doing that” was the outcome.

Instead, the value came from talking about the things they couldn’t (feasibly) change now, but would want to: “We have too much data in this account, and we can’t ever move it”. We used those as a basis, so we didn’t end up in the same situation.

Experiences and things learned along the way are just as valuable as avoiding writing some code.

Some great new/newish podcasts

If you’re searching for a new podcast after Serial, there are loads of them to choose from right now.

Podcasting, after many of the UK newspapers pulled out of it, is going through a resurgence. Here are some suggestions for additions to your listening list if you’re feeling a bit lost without Serial.

(I’ve still not listened to Serial, please don’t hate me).

NPR’s Invisibilia is from the same stable as RadioLab, but isn’t quite as heavily produced. Delving into the mind, the first few episodes have been really enjoyable.

Alex Blumberg (formerly of This American Life and Planet Money) has a meta-podcast, Startup, about the launch of his podcasting empire (the episode about the mistake is great listening for everyone who’s ever made one in business). It has already stolen the hosts of internet show TL;DR to give us Reply All: basically the same format, quirky stories about people and the internet.

Meanwhile, back at WNYC, TL;DR has a new host, and is still worth a listen.

Finally, Helen Zaltzman from Answer Me This now hosts a show about words, The Allusionist. It’s much shorter than AMT, and the first episode describing her suffering at her family’s puns will be all too real to anyone who listens to The Bugle.

You’ll be literally drowning in MailChimp mentions and Squarespace promo codes. Did you know they’ve just launched Squarespace 7, which integrates Getty Images… THEY’VE GOT TO ME.

Blogging about your Cloud Tech is only interesting when it’s Novel

If you’re blogging about moving to the cloud, you have to write about the interesting things in your migration, and not just how you did Best-Practice.

So a while back I bitched about Why The Cloud Is Oversold, talking more generally about the supposed other-worldly experience that having Sensibly Flexible Virtualised IT is… well, I’ve a new pet-hate: organisations Overselling Their Adoption Of The Cloud.

I know transparency is good. It’s also pragmatic because if the information is on a computer that is even near another computer that’s on the internet, it’s going to be leaked.1

It’s genuinely interesting when people share the unique work they’ve done, especially when Public Bodies do stuff: look at how much gets open-sourced, and how much of that gets reused. We’ll not mention that Scottish Government developers can’t access the repo, as GitHub is a blocked “file-sharing” site.

The team I worked with at the BBC have spoken widely about how they turn ongoing streams of video into neatly segmented files, that are uploaded to S3 at more than 1 gigabit a second, and how these are made into the things you see on /iplayer.2

Alongside the stuff that’s of sufficient scale to be interesting, Video Factory also uses a load of standard enterprise patterns: micro-services, communication through queues, separation of concerns, etc… They’ve spoken about these, but very much in a “we’re just doing best-practice after a big monolithic system pissed us off too much” way.

Anyway, I just read a blog post, by another public body documenting their transition to the Cloud and a new Responsive Website.3

Turns out sometimes they get a lot of load, and this is a problem they’ve had to solve. I’ll give you a second to think about how you’d solve bursty load on AWS.

Have you guessed?

They’ve only cached the site behind varnish, and are running that in an auto-scale group behind an Elastic Load Balancer.

That’s a pretty standard best-practice. Perhaps the novelty is that they’re a Public Sector body doing a sensible thing.4

But best-practice, by its very definition, just isn’t interesting blog-fodder: “Hey, We Do The Thing That Everyone Else Is Doing”.5

This leaves me wondering what next from this organisation:

  • “Our Windows PC Estate uses Microsoft Update Server to ensure they’re patched”
  • “We make our endpoints run anti-virus and disable USB ports on front-line single-use machines”
  • “We use Active-Directory federation to provide single-signon across all of our desktop applications”

If we’re really lucky maybe they’ll tell us: “How We Use Chaos-Monkey to Simulate Cloud Error-Situations”

I can’t wait.

  1. That is an exaggeration, but not nearly as much as I’d like it to be
  2.  I helped make this bit and I’m still disproportionately proud of it
  3. The kind you hate on the desktop because of all the white-space, and where the custom fonts don’t look quite right
  4. I could link to numerous projects here, so here is a small selection of failure
  5. Netflix get to do it, because they’re one of the groups setting out best-practice in AWS

Perfect is indeed the enemy of good

The desire to do things well stops us doing them at all.

I re-connected with someone on LinkedIn the other week. (Yes, I actually use it like that.) And he sent a lovely, long, detailed reply. One that I was delighted to read. One that I want to reply to.

But I haven’t.

Anytime someone sends me a nice, long, structured message, on pretty much any medium, it falls into the awful silo of “well, I need to sit down and write a nice reply”.

And it stays in that silo, along with all the other things like that.

So instead, I’ll write a little blog post about not being able to write, using up some of my daily word-quota in the process, and making the writing of the reply, even less likely.


Secret Cinema’s PR Car-crash

Secret Cinema showed how not to communicate after the opening night of their latest event was cancelled.

Lots of modern knowledge-based skills are like Search Engine Optimisation: the first 80% of SEO is “build a decent website” and the last 20% is the ever-changing dark-magic that few people really understand.

I’m adding “communications in a crisis” to this list.

Secret Cinema have cancelled their opening shows of Back To The Future, the first show cancelled about 2 hours before it was due to start. The comments on that post are just about as awful for the company as you’d expect.

The company is replying, but with a statement usually along the lines of “please address your concerns to us at this email”. Unsurprisingly, this isn’t meeting with much understanding from their customers.

As I type this on Friday evening, they’ve just cancelled the weekend shows, and the “situations beyond their control” appear to be that the council isn’t satisfied the venue is safe.

Predictably, their Facebook wall has been carnage. People explaining how they’ve travelled far for this event, and are feeling let down. Now if you travel to a faraway place for a pop-up event, by a company who have cancelled opening nights before1, caveat emptor comes to mind. I’m not saying I don’t have sympathy, but I doubt I’d travel myself in the circumstances…

Crisis comms are hard

There are companies who charge you an awful lot of money for just this. The ones you call when things are really bad: like when your product kills people.  But much like SEO, companies can do the simple things to get the first part themselves.

4 Basic Steps to Delivery, You’ll Never Guess What Happens When You Don’t Do Them:

  1. Project Management is your friend: if they didn’t know until the first day that they had these problems, they don’t have a decent project/production management team. This isn’t a hobby; this is a company that takes a lot of money from people, and they need a decent delivery function that could warn ahead of time.
  2. Honesty within the company: can your delivery team tell you that there are possible problems, or are you stuck in an organisation where the status report has to be green? Or worse, are you in an organisation that denies possible problems until they’ve actually happened?
  3. Run Pilot events: this is the kind of thing where you probably want a few preview nights; beyond rehearsing with the cast, rehearse with audiences there so you can check things work. You can set expectations better for these nights, with lower ticket prices, framed as a community rather than a customer experience. Scratch that: apparently their preview on Wednesday was also cancelled.
  4. Prioritise: there will have been things here key to the experience, and things that were icing. Build and get approval for the main stuff first. If you can’t do the other bits that’s a shame, and the pilot/early nights might be impaired. But at least they can run.

The 5 Secrets to Basic Crisis Comms Techniques They Don’t Want You To Know:

  1. Don’t Weasel Word: be very careful about the phrase “beyond our control”. I watched a documentary about Crossrail last week. The crane they needed one weekend didn’t turn up because only 2 of them are in the country, and the one they’d booked was delayed. That is “beyond their control”. I say this with no insider knowledge beyond the news articles, but Secret Cinema were in control of applying for and meeting council safety approvals. Saying it’s “beyond your control” makes an organisation look like it’s in denial.
  2. Appear Open:  They should have published their compensation policy and directed people to that. Telling people to “address concerns” privately makes it look like the organisation has something to hide.
  3. Appear Honest: This isn’t an outage of a complex system that takes time to diagnose. Saying you’ll post “more information later” just makes it look like an organisation in disarray.
  4. Take the Hits upfront: They could have cancelled more shows upfront, still disappointed people, but put them in control earlier. Drip-feeding cancellations just continues the uncertainty, again adding to the appearance of disarray.
  5. Finally, you’ve broken promises: Don’t make any other promises you can’t keep. It seems so minor, but saying you’ll update at 11am and failing to post anything until after 12 just continues the appearance of the organisation in crisis and denial.

I suspect this incident will be a case-study for crisis PR for years to come.

The Meaning of Silence

Marco’s new podcast app Overcast can remove the silences. Does our relentless demand to make everything more efficient sometimes remove more than is desired?

Marco Arment, formerly of Tumblr, Instapaper,  and The Magazine, has released his podcast app Overcast. It’s generally very nice, and already seems to annoy me less than Apple’s own app.

As well as the standard playback speed settings, Overcast offers the option to shorten silences. This speeds up your podcast playing without distorting the audio. It’s an optional setting, and one that, rightly, you can set per-podcast.

Now… I can.. appreciate… how this might.. erm… help if you’re listening to a podcast by someone who has awful delivery. Most of mine are from radio shows from members of Big Media: there isn’t a lot of silence to be culled.

Some of the tweets have been very, for want of a better-edited phrase, Techno-utopian-efficiency-fetishizing. Comments along the lines of “Already saved 30 minutes using SmartSpeed” and “Can you add up and display all the time I’ve saved?”

My issue is that well-meaning pauses are just as much part of good oratory as the words.

Take them away and things can go hilariously wrong:

This isn’t a criticism of the app, or the author. The feature has its place. I’d prefer to think of it being used to fix deficient audio, rather than to eke every possible minute out of listening.1

I just tire of the endless demand for evermore efficiency in everything.

Yes I want my banking to be easier.  Of course I’d rather type data into systems directly rather than sitting on the phone, as someone enters it for me…

But when the need for faster/cheaper/better detracts from the experience, that’s when it starts annoying me. When it’s the kind of mindset that thinks that chewing food is a chore.

Not everything needs to be efficient, not everything needs to be measured.2

  1. I’d prefer people to produce better audio in the first place, but it turns out producing decent audio takes time…who knew?
  2. And on that note my FuelBand is nagging me to get moving

Falsehoods Smart-Device people believe about Home Networks

A few years ago someone posted a great article about the bad assumptions programmers make about names; here’s a similar list about assumptions about home networks and smart devices.

We all remember the excellent Falsehoods people believe about names don’t we?

Having lived with a few smart devices sharing my network for a while, I thought we needed a similar one about smart devices and home networking.

Items marked with a * contributed or inspired by @davidmoss

  • The WiFi is always available
  • The WiFi is continuously connected to the internet
  • The WiFi network isn’t hidden
  • The WiFi network isn’t restricted by MAC address so they can be hidden from the user
  • The WiFi network doesn’t use strong authentication like WPA2
  • The WiFi network definitely doesn’t use authentication mentioning the word ‘Enterprise’
  • The user knows the exact authentication type in use for the WiFi, so no need to auto-detect it*
  • There is only a single WiFi network
  • The name of the WiFi network is ASCII*
  • There is only a single access point for the WiFi network
  • Any device connected to the home-network is trusted to control the smart devices on it
  • Smart devices and their controllers are on the same network
  • Devices on the network can connect directly to each other
  • The network is simple, and doesn’t use other technologies such as powerline1
  • All networks have a PC type device to install/configure/upgrade devices (and that device is running Windows)*
  • There is always a DHCP Server*
  • Devices will always get the same IP address on the internal network from the DHCP server
  • DHCP device names don’t have to be explanatory, because nobody ever sees them
  • Devices can have inbound connections from the internet 2
  • The network is reliable without packet loss
  • The connectivity is sufficient for all devices on the network
  • The performance characteristics of the network are constant and don’t change over time
  • The Internet connectivity isn’t metered, and there’s no problem downloading lots of data
  • Encryption of traffic is an overhead that isn’t needed on embedded devices
  • Predictable IDs like Serial-Numbers are good default security tokens
  • Unchangeable IDs like Serial-Numbers are acceptable security tokens
  • The device won’t be used as a platform for attacks, so doesn’t need to be hardened against threats internal and external to the network. 3
  • Devices can be shipped and abandoned; they won’t be used for years, so any future software vulnerabilities can be ignored
  • IPv6 is for the future, and doesn’t need to be supported4
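On the serial-number points above: a credential should be random and replaceable, not printed on the box. A hedged sketch of the difference, using Python’s standard library (the token length and helper names are my own choices, not from any particular device):

```python
import hmac
import secrets

# Falsehood: a predictable, unchangeable serial number used as the auth token.
serial_number = "SN-000042"  # guessable, enumerable, and can never be rotated

# Better: a random token issued at pairing time, stored server-side,
# and replaceable if it ever leaks.
def issue_device_token() -> str:
    return secrets.token_urlsafe(32)  # 32 bytes (~256 bits) of randomness


def token_matches(presented: str, stored: str) -> bool:
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(presented, stored)
```

The point isn’t the specific APIs; it’s that the token can be revoked and reissued, which a serial number never can.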

What have I missed?

  1. These should be layer 2 transparent, but they can disrupt Multicast which can break bonjour
  2. aside from security implications, ISPs are moving to a carrier-grade NAT to work around IPv4 address exhaustion, so inbound ports may not be possible
  3. many devices have a pretty complete Linux stack, at least complete enough for attackers to use
  4. Chicken and Egg this one

Security is hard, but the easy bits aren’t

The hard bits of security are hard, but the easy bits aren’t. As infrastructure gets more dynamic, we need to make sure it isn’t everyone else redefining it.

Another week, another story about security.

Actually multiple stories about security.

And what’s upsetting about these ones is that the fixes for them are already available.

I don’t cut code anymore. I’m not a particularly adept coder, and I think my code is a bit ugly. But I still know what bad practice smells like, and what upsets me is how often we repeat the mistakes of old. 1 2

Yes there are always deadlines, but if we’re working with advanced software defined infrastructures, then we have to restrict who can redefine those.

If you’re in a Product Manager role, don’t be afraid to ask what you’re doing for security, or what the response plans are if something is compromised. Be mindful of the risk to your reputation if you don’t give developers time to improve security instead of piling ever more features on. The mitigations for the most obvious attacks are documented, and usually relatively easy to implement.

And now to the details

Code Spaces had all their data wiped, we don’t know all the details but it sounds like:

  • They hadn’t enabled 2-factor auth on their AWS account
  • Their backups weren’t to a different AWS account, or better still to another provider.

If you’re running a production service, and you’re hosting data for anyone else, then your backups need to be rock solid. Backing up to the same provider, in the same account, is like copying all the files from your desktop into a folder called “backup”.  Sure you’ve two copies but when that disk goes bang they’re both gone.

And yes, 2 Factor is a pain when you’re logging into services, but if you’re hosting customer data that’s a pain you need to cope with. Providers usually let you set up many secondary accounts with reduced privileges, so use those tools to protect your services, and let people do just what they need in order to do their jobs.
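“Just what they need” is expressed in AWS as an IAM policy document. The helper and bucket name below are hypothetical, but the shape is the standard least-privilege pattern: allow a couple of actions on one resource, and implicitly deny everything else.

```python
import json


def read_only_bucket_policy(bucket: str) -> str:
    """Build a least-privilege IAM-style policy document: the holder can
    read one S3 bucket and do nothing else. (Bucket name is illustrative.)"""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }],
    })
```

A secondary account or role carrying a policy like this can service your app’s reads, while the credentials that can delete things stay locked away behind 2-factor.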

On a similar theme, people are leaving their AWS keys in Android apps. Amazon offers a ticket-granting service that’s ideal for this; it’s more work, but work that you should be doing.

Some people aren’t even using those permissioning tools to embed keys with limited access, which just to reiterate, you shouldn’t be doing anyway. Instead they are embedding their main access key pair, which means that attackers could access and delete all data, and spin up thousands of instances just for fun/profit.

Security is hard, the recent problems found in libraries like OpenSSL are hard for an individual coder to work around, but decent libraries are still better than going it alone.

The 80:20 rule is ever present. Will you ever make your app fully secure? Unlikely. Can you prevent the most obvious attacks by applying best practices, many of which your programming language can do for you? Yes.

Don’t leave keys lying around, give apps or services more permissions than they need, or use predictable IDs for sensitive data…

Do sanitise data you’re given, protect from XSS attacks, turn on 2-Factor Authentication for anything serious and always keep decent backups hosted on separate infrastructure…
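That sanitising step really is this cheap. A sketch using Python’s standard library (most web frameworks do this for you in their templating; the `render_comment` wrapper is my own illustration):

```python
from html import escape


def render_comment(user_input: str) -> str:
    # Escaping turns markup into inert text: an injected <script> tag
    # renders as literal characters instead of executing in the browser.
    return f"<p>{escape(user_input)}</p>"


print(render_comment('<script>alert("pwned")</script>'))
# → <p>&lt;script&gt;alert(&quot;pwned&quot;)&lt;/script&gt;</p>
```

One line of library code, and the most common XSS vector is gone.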

These lists go on, but they’re not new: best practice years ago is still best practice now.

  1. Don’t get me started on file-moving scripts that don’t use incoming and outgoing folders to avoid race-conditions
  2. Or when we tolerate software from vendors that can’t run as anything other than root or Administrator

Cloud, the cost and value of everything

Forget scalability, speed to change and flexibility, I think the single most important thing about cloud hosting is putting an explicit cost on everything…

Last night I gave a lightning talk at the newly tweaked #metabeertalks, these guys are great friends of mine, and their topic was “is realtime Fashion or Fad?”.

Modern hosting approaches, aka “the cloud” have many advantages: They scale trivially, encourage you to use best practices in how you architect and deploy, and are flexible to change as your application does.

The single most powerful thing though, is that it puts a cost on every element of your application. We can debate if it’s cheaper, more expensive, or about the same as hosting on tin: but you know what your components cost.

Your application isn’t being bundled up with a load of others on a server, with your IT team complaining they have to install a new one, with about 1 month’s lead time, every 3 months.

You host your application on an instance the right size for it, be it small or huge, a single instance or a fleet of 20. You use the storage you need, when you need it, without playing that impossible game of “how much storage will we need by the time the storage system actually arrives?”.

And all this comes with transparency: set your system up with the right tags, and all the costs of an application are known.
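Once everything is tagged, totting up per-application cost is a simple group-by. A sketch over the kind of rows you might pull from a billing export (the resource names, tags, and costs here are invented for illustration):

```python
from collections import defaultdict

# Hypothetical billing-export rows: (resource, application tag, monthly cost)
billing_rows = [
    ("ec2-web-1", "storefront", 310.0),
    ("ec2-web-2", "storefront", 310.0),
    ("rds-main",  "storefront", 480.0),
    ("ec2-batch", "analytics",  120.0),
]


def cost_by_app(rows):
    """Sum monthly cost per application tag."""
    totals = defaultdict(float)
    for _resource, app, cost in rows:
        totals[app] += cost
    return dict(totals)


print(cost_by_app(billing_rows))
# → {'storefront': 1100.0, 'analytics': 120.0}
```

That number per tag is the thing on-premise shared hosting almost never gives you: an explicit cost to weigh against the value of each application.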

Knowing those, you can start flexing: If you need 10 machines to keep up with realtime analysis, they’re yours. Or if you don’t want to pay that, bid for some cheaper instances and batch the work overnight.

Within reason, you can do anything, if you can afford it. So you can take a call about which bits of information are valuable enough to justify being realtime.

When Netflix launched House of Cards series two, you can hear them talking about the “Play Start” messages coming in. That kind of realtime information is amazingly helpful for debugging.

The deeper stats of how many people watched, and how many episodes they binged on, that information could probably wait a few hours to batch…

My take on realtime: do it where it’s valuable, and where you can justify the cost.

Which is exactly the same for all elements of cloud hosting.