AWS Launches MediaConnect and almost gives us multicast

It’s re:Invent time, and Amazon have launched a new service to make video routing to the cloud more reliable and easier to set up.

A few weeks back I was at the brilliant DPP Leaders Summit, which was held under the Chatham House Rule.[1] There were some great speakers, and I particularly loved the exec who said, to paraphrase, “If it doesn’t work without months of professional services, THEN IT ISN’T AN ACTUAL PRODUCT.”[2]

Anyway, one of the speakers was facing rebuilding their entire stack due to ownership changes, and wanted to do so in the cloud. They said “We need multicast and Precision Time Protocol”, which I can understand: for playout or production applications, the need for those two is pretty clear.

It’s now re:Invent season, which is the point in the year when AWS tend to release a lot of their good stuff. And yesterday they unveiled a new media ingest service, AWS Elemental MediaConnect.

It’s a managed service to get your video signals to/from/between your Amazon clouds.

This has historically been a pain: back when I was working on the Video Factory project we initially mooted a box in the cloud that we would send the signal to, and that would then fan out to both archiving and live streaming. This was hard to do, so we side-stepped the issue, and just rapidly uploaded the stream to S3 in consistently sized chunks instead. Later something was put in place to do the streaming, using something that I don’t think has been spoken about too much in public, so I shan’t detail it here.

Anyway, this new service allows you to send content to/from an endpoint using standard RTP (with/without Forward Error Correction) or the more reliable but commercial Zixi protocol. Each video flow has an Amazon Resource Name (ARN), which means that external accounts can be given permission to subscribe to the stream. The documentation says a ‘flow’ can have up to 20 outputs.
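To make the shape of that concrete, here is a minimal sketch in Python of one flow fanning out to several outputs. It is only a model of the idea and not the AWS API: the `Flow` and `Output` classes, the protocol labels, the addresses and the ARN are all made up for illustration, and the cap of 20 outputs is simply the figure quoted above.

```python
from dataclasses import dataclass, field

# Illustrative model only: these names are hypothetical and are NOT the AWS API.
MAX_OUTPUTS = 20                          # the figure quoted from the documentation
PROTOCOLS = {"rtp", "rtp-fec", "zixi"}    # labels chosen for this sketch


@dataclass
class Output:
    name: str
    protocol: str
    destination: str  # "host:port", placeholder values only


@dataclass
class Flow:
    arn: str              # placeholder identifier, not a real resource
    source_protocol: str
    outputs: list = field(default_factory=list)

    def __post_init__(self):
        if self.source_protocol not in PROTOCOLS:
            raise ValueError(f"unsupported source protocol: {self.source_protocol}")

    def add_output(self, output):
        """Attach one more destination to the flow, enforcing the output cap."""
        if output.protocol not in PROTOCOLS:
            raise ValueError(f"unsupported output protocol: {output.protocol}")
        if len(self.outputs) >= MAX_OUTPUTS:
            raise ValueError(f"a flow can have at most {MAX_OUTPUTS} outputs")
        self.outputs.append(output)
        return output


# One contribution feed, sent to an archive and to a live encoder.
feed = Flow(arn="arn:aws:mediaconnect:eu-west-1:111122223333:flow:example",
            source_protocol="rtp-fec")
feed.add_output(Output("archive", "rtp", "203.0.113.10:5000"))
feed.add_output(Output("live-encoder", "zixi", "203.0.113.20:2088"))
```

The point of the model is that the sender only ever talks to the flow; who is attached on the output side can change without the sender knowing.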

How are we going to use this?

  1. Contribution to streaming output: fire the video somewhere and you don’t have to know if/where it’s being used
  2. Contribution for programming: using a few Amazon regions, broadcasters could very easily build a global contribution network to backhaul outside broadcasts
  3. Contribution from a playout appliance: if your cloud playout outputs to a MediaConnect flow, you can then send that flow to your broader distribution chain, allowing re-routing of things downstream.

It isn’t multicast within a VPC, and it’s not PTP. I suspect the latency involved may be too great to allow it to be used to route between different stages in a virtual playout chain.[3]

MediaConnect does however simplify integrating cloud processing workflows by providing fixed points at the edges in and out of the cloud.

I’ll be interested to see how people use it.

  1. That it is a singular rule is one of those bits of pedantry I cannot let go of
  2. This is probably a topic for another time, but the fact that so many enterprise vendors expect you to pay for their ‘product’ then explain that ‘oh, no, you can’t just use it out of the box even in a basic manner’ is a bit of a joke
  3. I could be very wrong here; I don’t have one of those hanging around to test

Anatomy of Ticketing ‘Fail’

Having failed to get tickets for something in an annoying day of pressing reload, I try to write something constructive about scaling for big things

(Or what happens when a company that isn’t eventbrite tries to be eventbrite)

A friend wanted to book some tickets for an event. I had some time today, so I said I’d book them.

For reasons of politeness, I’m not going to name the company. The event was massively over-subscribed, so there were always going to be people who were annoyed (kinda like the Olympics). I’m just annoyed because I saw things done in a generally quite ad-hoc way, and specific technical bugs hit me.

Tickets were delivered in tranches. This is a sure sign there will be massive peaks in demand…

The hour arrives, and, in your all too typical scenario: www.example.com rapidly stopped responding.

A few minutes later everything 403’d as they killed all access on the server to get the load down. Not great, but it’s a sign somebody is looking at the problem.

example.com then starts redirecting to http://xxxxxxxxxxxxx.cloudfront.net/url1 with all the individual ticket pages iFramed through to an Amazon EC2 instance (http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com/blah)

The IDs for the events were sequential, some had already been released, and you started to think that people had been gaming the system and ordering tickets prior to their availability windows. This was later denied by the company (which I accept), but given the way the scaling was going, at that time it was all too easy to think they were using security by obscurity to prevent access to the events.

Later in the day, when tickets appeared, it was announced via a tweet. The tweet though didn’t link to the site, but to a mailing list post, which again didn’t reference the actual site.

The site had now changed: example.com was redirecting to http://xxxxxxxxxxxxx.cloudfront.net/url2, again passing through to an EC2 instance. Later many people complained on Facebook that they were looking at the old page and pressing reload.

Anyway, I tapped. I was at the gym. I was on my iPhone. But I know my credit card number, I know my paypal password, I can even use that tiny keyboard. I’ve topped my starbucks card up while in the queue. I can do this.

There was even a mobile site.

Only the mobile site was erroring because it was asking for a non-existent field/table. I had no way to change my user-agent (and wouldn’t have trusted Opera with my credentials), and in the 10 minutes it took me to get back to my laptop all the tickets had been sold.

No tickets for me+mate. Grumbly me having seen things done badly.

As many will say, this is not life and death, but example.com is primarily not a ticketing company, and that showed today.

If you’re going to compete with the likes of eventbrite, you’re going to have to be as good as eventbrite.

The Constructive “What can we learn” Bit

1. Believe it could happen, no matter how unbelievable.

Ask yourself “if we get off-the-scale load how will we fix it”. Working out volumetrics and scaling is hard, so alongside your “reasonable” load calculations of “we can turn off these bits of our site”, have your plans of “how you’ll move to something big, cloudy and scalable if the unbelievable happens”.

Are there components that you should move upfront? You have something like 15 to 30 minutes of goodwill. What do you need to do upfront, so that in that downtime you can come up fully scaled?

If you’re looking at a scalable elastic thing, look at how much it costs to start in that state anyway.
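As a back-of-envelope illustration of the “reasonable” versus “unbelievable” sums, here is a minimal Python sketch. Every number in it is invented for the example (the requests per second, the per-instance capacity, the 20x multiplier), and `instances_needed` is a hypothetical helper, not anything the company used.

```python
import math


def instances_needed(peak_rps, rps_per_instance, headroom_pct=20):
    """Instances required to serve peak_rps while keeping headroom_pct spare.

    All inputs are estimates you would have to measure or guess yourself.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be between 0 and 99")
    # Capacity we are willing to actually use on each instance.
    usable_rps = rps_per_instance * (100 - headroom_pct) / 100
    return max(1, math.ceil(peak_rps / usable_rps))


# Hypothetical figures: a "reasonable" forecast and an "unbelievable" one.
reasonable = instances_needed(peak_rps=200, rps_per_instance=50)
unbelievable = instances_needed(peak_rps=200 * 20, rps_per_instance=50)
```

Doing the second sum ahead of time tells you how big the gap is between what you have and what you would need, which is the gap your 15 to 30 minutes of goodwill has to cover.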

2. Architect things to give you agility

If you can’t host all your website on a scalable platform: Subdomains, DNS expiry times and proxy-passes give you room to move, but only if set up ahead of time.

Had tickets.example.com been available, example.com wouldn’t have had to disappear as it has done until tomorrow. You don’t want your website down for that length of time.

DNS changes can take time, much less if you dial down the expiry times, but again you have to do that prior to the event. Amazon’s Route 53 is cheap, so move the domains ahead of time, and set Times to Live appropriately.
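The timing matters because resolvers that already hold a record may keep it for up to the old Time to Live, so the TTL has to be lowered at least that long before the switch. A minimal sketch of that arithmetic in Python follows; the function names, dates and TTL values are all hypothetical, and it assumes resolvers honour TTLs, which not all of them do.

```python
from datetime import datetime, timedelta


def latest_time_to_lower_ttl(cutover, old_ttl_seconds, margin_seconds=0):
    """Latest moment to publish the shorter TTL.

    A resolver may have cached the record just before the change and will
    keep it for old_ttl_seconds, so the change must land that far ahead.
    """
    return cutover - timedelta(seconds=old_ttl_seconds + margin_seconds)


def worst_case_stale_until(change_time, ttl_in_effect_seconds):
    """Latest moment a well-behaved resolver could still serve the old answer."""
    return change_time + timedelta(seconds=ttl_in_effect_seconds)


# Hypothetical event: tickets go on sale at 09:00, records currently have a 24h TTL.
on_sale = datetime(2012, 1, 10, 9, 0)
deadline = latest_time_to_lower_ttl(on_sale, old_ttl_seconds=86400)

# If the TTL was dropped to 60s in time, a mid-event switch is seen within a minute.
switch = datetime(2012, 1, 10, 9, 20)
seen_by = worst_case_stale_until(switch, ttl_in_effect_seconds=60)
```

With a 24 hour TTL the deadline is the morning before; miss it and a switch made during the event could take up to a day to reach everyone.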

While you’re waiting for that propagation, proxy-passing can be a useful technique to bounce the traffic to the new server. Proxy passing also means that example.com/tickets could have been redirected, rather than an entire domain.
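In practice this would live in the web server or load balancer configuration (nginx’s `proxy_pass` directive is one example), but the decision it makes is simple enough to sketch in Python. The upstream hostnames and the `choose_upstream` helper below are hypothetical; the sketch only shows the path-based choice, not the proxying itself.

```python
from urllib.parse import urlsplit

# Hypothetical upstreams: the existing site and a separately scaled ticketing stack.
DEFAULT_UPSTREAM = "http://origin.internal.example"
ROUTES = [
    ("/tickets", "http://ticketing.internal.example"),
]


def choose_upstream(request_target):
    """Pick the backend for a request path, matching whole path segments only."""
    path = urlsplit(request_target).path  # drop any query string
    for prefix, upstream in ROUTES:
        if path == prefix or path.startswith(prefix + "/"):
            return upstream
    return DEFAULT_UPSTREAM
```

Only requests under /tickets go to the new stack; the rest of the site stays where it is, so the home page never has to disappear.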

Are you caching what you can at the HTTP level with Varnish, or at the service level with memcached?
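For the service-level case, the idea is just “compute once, reuse for a short while”. Here is a minimal in-process sketch in Python standing in for a real cache such as memcached; the `TTLCache` class and the 30 second lifetime are assumptions made for the example.

```python
import time


class TTLCache:
    """Tiny in-process cache with a fixed lifetime per entry (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._store = {}             # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


# Hypothetical use: render the event listing at most once every 30 seconds.
listing_cache = TTLCache(ttl_seconds=30)
page = listing_cache.get_or_compute("/events", lambda: "<html>event listing</html>")
```

Even a lifetime of a few seconds turns thousands of identical requests into one expensive one, which is most of the win during a spike.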

3. Be careful sending people onto new URLs that won’t update

Taking the ticketing system off their main website was a good move, but the static page should have remained there. The second users were redirected to CloudFront, they were looking at a page that would go stale.

Many people would have pressed reload, expecting the tickets to appear, but they didn’t because, as you can see from above, the URL changed. The company could have used the CloudFront invalidation API, but it wasn’t used.
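As a rough sketch of what an invalidation request looks like, the Python below builds the batch of paths to expire. The dictionary follows the `InvalidationBatch` shape that, to the best of my knowledge, boto3’s CloudFront `create_invalidation` call accepts; the helper name, the paths and the distribution ID are placeholders, and none of this is what the company actually ran.

```python
import time


def build_invalidation_batch(paths, caller_reference=None):
    """Build the batch of cached paths to expire (hypothetical helper)."""
    paths = list(paths)
    if not paths:
        raise ValueError("at least one path is required")
    for path in paths:
        if not path.startswith("/"):
            raise ValueError(f"paths must start with '/': {path!r}")
    return {
        "Paths": {"Quantity": len(paths), "Items": paths},
        # Any unique string; a timestamp is a common choice.
        "CallerReference": caller_reference or f"invalidate-{int(time.time())}",
    }


batch = build_invalidation_batch(["/url1"], caller_reference="tickets-page-moved")

# Assumption: with boto3 installed and credentials configured, it would be sent as
#   boto3.client("cloudfront").create_invalidation(
#       DistributionId="EXAMPLE_DISTRIBUTION_ID", InvalidationBatch=batch)
```

Invalidating the old path would have let the stale page be replaced by one that pointed people at the new location.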

4. Remember data protection issues

This company used the Virginia data centre (which I think is the AWS default). Without going into the whole world of pain that is data protection and EU borders, Dublin would have been lower latency and less problematic compliance-wise.

5. Testing is good, as is automatic deployment

There were not many tickets and the load was huge; those were not avoidable. I can’t say the same about the erroring mobile site, which should not have occurred.

6. Rehearse

It’s not fun doing disaster recovery, but if you’re receiving catastrophic load then that is what you’re doing.

Write the script. Have someone else test it.

It’s not a valid plan until you have shown it works.