Despite the documentation, blog posts and experiences, you’re just going to have to test it…
AWS offers a dazzling array of services for similar things. Dazzling is a great synonym for “confusing”. Feels like “Speed of Innovation”, “Range of Services” and “Coherence of Offerings” is a “pick 2 of 3” situation…
An often asked question on the AWS communities I hang out on is “can we use a Lambda for that?”
People asking this are often primed with some of the problems of serverless compute, and jump to the advanced features to mitigate them:
“I’m worried about cold starts, do I need to pay for provisioned concurrency?”
“I’m accessing a database, do I need to add an RDS Proxy?”
“I worry about response time, do I need to enable API Gateway caching?”
“I worry about out-of-order messages, do I need First-In First-Out queues?” [1]
These are valid questions, with valid solutions, but paraphrasing a global sportswear brand: Just Try It.
Deploy your code, and see how it performs/responds. Then apply the tricks to make it better.
Starting with the dirtiest hack of all: Make it bigger.
Lambda CPU is only controllable by memory, and I’ve been in multiple situations where there was a step-change in performance jumping from 128 to 256 megabytes. There’s tooling that can help you find the right size for your application.
It’s lazy, brute force, but takes literally seconds to change the memory and re-test.
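To see why “make it bigger” is cheap to try, here is a minimal sketch of the arithmetic, assuming Lambda bills compute per GB-second and that more memory brings proportionally more CPU. The price constant and both durations are illustrative assumptions, not measurements; check the current pricing for your region.

```python
# Assumed x86 price per GB-second; illustrative only, check current AWS pricing.
PRICE_PER_GB_SECOND_USD = 0.0000166667


def invocation_compute_cost(memory_mb: int, duration_seconds: float) -> float:
    """Compute cost of one invocation (ignores the per-request charge)."""
    gb_seconds = (memory_mb / 1024) * duration_seconds
    return gb_seconds * PRICE_PER_GB_SECOND_USD


# Hypothetical timings: doubling the memory halves the run time.
small = invocation_compute_cost(memory_mb=128, duration_seconds=2.0)
large = invocation_compute_cost(memory_mb=256, duration_seconds=1.0)

print(f"128 MB for 2.0s: ${small:.10f}")
print(f"256 MB for 1.0s: ${large:.10f}")
```

If the bigger size really does halve the duration, the compute cost per invocation is the same and the caller gets the answer sooner. Whether your workload scales like that is exactly what you have to test.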
I’m a big proponent of ‘innovation as Business as Usual’: you shouldn’t need the excuse of an innovation day to try the new things. That only works if you start simple.
But I’m an even bigger proponent of not stressing yourself/teams unnecessarily: If you don’t have time to test in your current project, do what you know works. Hit your shipping dates, have an easier sprint.
Hopefully you can have a play in the future and add something else to the list of what “works”. It’s nice to expand that list, but only when you’ve enough headroom to do so.
In conclusion, while sometimes I can tell you that something is unlikely to work, most of the time you’re just going to have to try it.
Sorry.
[1] I understand FIFO messaging has its place, but I work in content, not transactions. Including a ‘reliable’ timestamp in the event (e.g. the time an article was published, not the time the message was sent) is cheaper, and lets consumers decide whether they should process an event, which makes recovery easier if databases have to be re-crawled.
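A minimal sketch of that timestamp check, assuming each event carries the article’s own publish time and the consumer keeps the newest version it has seen. The field names (`article_id`, `published_at`) and the in-memory `store` are hypothetical stand-ins for whatever your events and database look like.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for the consumer's database.
store: dict[str, dict] = {}


def handle_event(event: dict) -> bool:
    """Apply the event unless we already hold the same or a newer version.

    Returns True if the event was applied, False if it was skipped.
    """
    current = store.get(event["article_id"])
    if current is not None and current["published_at"] >= event["published_at"]:
        return False  # stale or duplicate delivery, safe to ignore
    store[event["article_id"]] = event
    return True


newer = {
    "article_id": "a1",
    "published_at": datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc),
    "title": "Updated headline",
}
older = {
    "article_id": "a1",
    "published_at": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    "title": "Original headline",
}

# Delivered out of order: the newer event arrives first.
print(handle_event(newer))  # True
print(handle_event(older))  # False
```

Because the decision only depends on the event and what is already stored, replaying a whole backlog during a re-crawl is harmless: anything stale is simply skipped.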
I’ve been following a bit of Internet Of Things drama because a company Cease & Desisted a developer who was polling their API for an unofficial Home Assistant Integration.
Thankfully, it looks like the company is now engaging with the Developer and The Home Assistant team, so maybe it can be resolved.
But one of the recurring comments initially was “if they’re saying it costs too much money, they should use a cheaper host than AWS”
Disclosure, my career is helping companies use the right bits of AWS well, and for all my discomfort at the centralisation of the internet into the Hyperscalers – the ability to deploy a website that costs nothing if nobody uses it, yet scales to match demand, is pretty compelling.
“What Context do you mean Gareth?”
Backing up a second, this integration polls the endpoint pretty aggressively, and so could be causing a noticeable spike to API calls, and to the costs experienced by Haier.
If there were about 500 active installs, that could cost something between 500 and 1,500 USD per month, depending on the polling rates: probably immaterial in the grand scheme of things, but noticeable.
The context I’m talking about, is that they built this API knowing it wasn’t going to be called that much… many of the interactions customers saw would be driven by push notifications or other event driven things. It’s only called by the app/website when people visit… it was never costed for being called continuously.
They didn’t optimise that cost, because it didn’t make sense to.
But still, why aren’t they doing it cheaper?
Bluntly, because “total cost of ownership” (TCO) is more than just the compute.
Sure I could just “spin up a VM and do it all myself” but then I’m doing it all myself. I’m suddenly in the realm of patching. I’ve got to do my own auth.
If AWS are terminating your HTTPS endpoint, then AWS are on the hook for the latest HTTP desynchronisation exploit.
When you host stuff on managed services (of any provider) you’re really going hands off, you don’t need an ops team because ultimately, there isn’t generally anything that you can operate… If there’s an outage, often times, you just have to wait for the provider to fix it.
That might feel disempowering, but also, there’s no point having an on-call team and waking someone at 3am to go “Yup, it’s fucked” but be unable to actually fix it.
In summary
There’s usually a cheaper way to deploy things.
The question is “is it actually cheaper, everything else considered?”
Don’t try to second-guess other people’s architecture, and if you are going to poll an API, try to do it considerately…
Quick rant here, but I’m in the process of leaving a service that requires some careful presentation of the data/documents for regulatory type purposes.
Fine, I think, they tell me to head to the download page, and download my data.
Use this to get a complete export of all your XXXX data or restrict by date. The export includes all accompanying receipt attachments and PDF documents where relevant.
If you need to restrict the results further you can do this by using the filters on the relevant list and running the export from there.
That sounds pretty definitive, great, it’ll be a big zip file with everything.
“Complete export” is even in bold, that shows intent.
The zip file expands into a folder with the word “complete” in it.
BUT as you’ve no doubt guessed from all the signposting, it is not, however, complete. There are a bunch of things not included.
It’s not stated what things aren’t included, because you know, that would be too easy.
TL;DR: if you use the word ‘complete’ in your data extraction, include everything.
At the very least tell me what’s missing.
Good to have it confirmed why I’m leaving this service.
Just because a number is long, doesn’t mean that it’s secure…
For many years, despite repeated requests, a South African bank has been sending me bank statements.
The thing is that I don’t live in South Africa, have never visited, nor banked there. But I do have a particularly “good” email address that gets a lot of misdirected emails… I usually don’t read them, but this week I did, as I replied with my latest attempt at “please stop”.
The email included a PDF statement, password protected with the South African ID number of the customer. I suspect that protection is why the service centre seems so unbothered by the repeated requests to stop sending this information.
A few years back, I got very embedded in PII/GDPR. We were designing a data warehouse setup that allowed analysis of user activity, while protecting privacy, and enabling easier compliance with GDPR deletion requests. There was discussion about the feasibility of SHA2 hash reversal… and we took a lot of time to communicate the infeasibility with the legal team.
So this week, I started to wonder: If I were a bad actor (which I am not), could I feasibly crack this ID with some Python?
What’s the challenge?
There are three sets to consider in this:
The Absolute Keyspace: without any knowledge of the identifiers, the number of combinations?
The Available Keyspace: if the ID has constraints, how many valid combinations are there?
The Effective Keyspace: if we know anything more, how many combinations are applicable?
A good security identifier should make discerning the differences between these difficult: it should involve a reasonable amount of calculation before it becomes obvious that an ID is valid, and correct.
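To make the three keyspaces concrete, here is a small sketch using a hypothetical six-digit YYMMDD date field, the kind of structure many national IDs embed. The field layout, the 1900 to 2099 window and the example birth decade are assumptions for illustration, not a description of any particular ID scheme.

```python
from datetime import date, timedelta


def yymmdd_values(start: date, end: date) -> set[str]:
    """All distinct YYMMDD strings for the dates from start to end inclusive."""
    days = (end - start).days + 1
    return {(start + timedelta(days=i)).strftime("%y%m%d") for i in range(days)}


# Absolute: six free decimal digits, no knowledge of the structure.
absolute = 10 ** 6

# Available: only strings that are a real calendar date in some century.
available = len(yymmdd_values(date(1900, 1, 1), date(2099, 12, 31)))

# Effective: we also know (hypothetically) the person was born in the 1980s.
effective = len(yymmdd_values(date(1980, 1, 1), date(1989, 12, 31)))

print(f"Absolute:  {absolute:,}")
print(f"Available: {available:,}")
print(f"Effective: {effective:,}")
```

Each step throws away most of what was left, and none of it needed any real computation, which is the opposite of what you want from a secret.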
What do we know?
South African ID Numbers are 13 decimal digits.
A single decimal digit can be 0-9, ten values in total, and so each digit has a cardinality of 10.
(Cardinality being the maths word for ‘number of choices’ in a set)
We can use this information to calculate the absolute keyspace:
print(summary_to_brute_force(keyspace_absolute))
At slow rate: 18,518,518.52 hours, or: 771,604.94 days
At optimised: 617,283.95 hours, or: 25,720.165 days
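The `summary_to_brute_force` helper isn’t shown here, so this is a reconstruction rather than the original code. It assumes the two guess rates are about 150 and 4,500 attempts per second, which is what the figures above work out to; the constant names and the helper `hours_to_brute_force` are mine.

```python
# Assumed guess rates, inferred from the printed figures rather than measured.
SLOW_RATE_PER_SECOND = 150
OPTIMISED_RATE_PER_SECOND = 4_500


def hours_to_brute_force(keyspace: int, rate_per_second: float) -> float:
    """Hours needed to try every key in the keyspace at a fixed rate."""
    return keyspace / rate_per_second / 3600


def summary_to_brute_force(keyspace: int) -> str:
    """Two-line summary of the worst-case time at the slow and optimised rates."""
    lines = []
    for label, rate in (
        ("At slow rate", SLOW_RATE_PER_SECOND),
        ("At optimised", OPTIMISED_RATE_PER_SECOND),
    ):
        hours = hours_to_brute_force(keyspace, rate)
        lines.append(f"{label}: {hours:,.2f} hours, or: {hours / 24:,.2f} days")
    return "\n".join(lines)


# 13 decimal digits, each with a cardinality of 10.
keyspace_absolute = 10 ** 13
print(summary_to_brute_force(keyspace_absolute))
```

These are worst-case times for exhausting the whole keyspace; on average you would expect to hit the right value about halfway through.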
At this stage, it would not be worth the cost to brute-force this.
The information in the ID, or the document, is not valuable enough.
However, given how small the effective keyspace already is, the age reduction feels less useful in this scenario. If you had less information to start with, this reduction could matter more.
Number in valid age range: 949,000.0
At slow rate: 1.76 hours, or: 0.07 days
At optimised: 0.06 hours, or: 0.002 days
In summary
From an absolute keyspace of 10,000,000,000,000, we’ve excluded over 99.99999% of the possible numbers, and have only 949,000 to check against the file.
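The percentage is easy to check. A quick sketch, assuming the absolute keyspace is every 13-digit decimal string and taking the 949,000 figure from above:

```python
keyspace_absolute = 10 ** 13   # every possible 13-digit decimal string
keyspace_effective = 949_000   # candidates left after the reductions above

remaining_fraction = keyspace_effective / keyspace_absolute
excluded_percent = (1 - remaining_fraction) * 100

print(f"Remaining: {remaining_fraction:.2e} of the absolute keyspace")
print(f"Excluded:  {excluded_percent:.7f}%")
```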
Graphing this is really hard, differences in scale make it really difficult to communicate.
Since I’m not a data-vis genius, this uses a log-scale.
Python to generate Graph
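import matplotlib.pyplot as plt

# df is assumed to be the pandas DataFrame of keyspace sizes built earlier, with a 'Count' column
fig, ax = plt.subplots()
# Reverse the rows so the first keyspace is drawn at the top of the horizontal bars
data = df['Count'][::-1]
data.plot(kind='barh', logx=True, title="Keyspace Reduction\nnote the log-scale", ax=ax)
# Keep every other tick so the log-scale labels don't overlap
vals = ax.get_xticks()
ax.set_xticks(vals[::2])
ax.set_xlabel("Keys in Keyspace")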
Conclusions
This analysis shows that, in common with recent attacks on the Tetra encryption system, if you’re not using all of the absolute keyspace, your protection is far weaker than it may appear from a big number.
These national/structured IDs do not make good secrets: the structure inherently reduces the size of the effective keyspace, and makes it very easy to exclude ranges of people (by age or gender).
While phishing is a problem, and emails need to be/appear authentic – we need to use mechanisms to achieve this at the email level: SPF, DKIM, DMARC, BIMI. While imperfect, these are far better than including information directly related to the ID/information being protected.
In this scenario, even with a naive implementation, it would be entirely feasible to brute-force this particular email/PDF combination, which would expose customer information.
Now I don’t know how valuable the information in the statement is, but I wonder if it could be used as part of a social engineering attack?
A plea to companies: If I message asking you to stop sending me statements, maybe stop?
Getting past uncertainty or unwillingness to use IPv6 is actively costing you money.
I (mostly) have IPv6 deployed in my home network. Alas, a bug on my router currently prevents it working completely, but it’s been enabled for years, mostly as a novelty to feel like I was ready for the grand future, without my really noticing any differences.
Last week however, I had a chat that made me realise that using IPv6 is now a cost-saving measure… and it’s maybe time to get over our resistance to do it by default.
The Joy of NAT
Typically you deploy an AWS VPC using internal IPv4 addresses, and (if you’re as allergic to avoidable self-managed service as I am), a managed NAT gateway.
As with nearly everything AWS, there’s an hourly cost, and also charges for the volume of data transferred.
In an AWS group I’m a member of, someone asked “We’re paying a lot for NAT, what can we do” – and I said “I mean, you could try the IPv6 Egress Gateway, but I dunno if the APIs you’re using support IPv6”.
The Egress only IPv6 gateway only charges ‘standard’ egress rates and it has no rental cost. I’ve been aware of its existence for years, but have never deployed it or had reason to.
I was expecting to be told “Only one of my APIs supports IPv6” but the person reported back “Actually nearly all my APIs are on IPv6, but I can’t deploy the IPv6 easily because of <reasons>”.
This was not what I was expecting, despite remembering that for the last few years I have been deploying dual-stack endpoints wherever I can…
So why haven’t we
There are a few reasons why we haven’t been deploying IPv6 more routinely, and we should try to change these:
IPv6 didn’t have any advantages. In most cases the IPv4 setup works ok… So what’s the point of adding it to a working setup…
The deployment tooling doesn’t support it – this was the case here, the person was using the lovely AWS CDK to deploy their VPC, and the current VPC ‘construct’ doesn’t support IPv6 easily
We like NAT’s Security by default: With NAT, your compute resources aren’t exposed to the internet, all that endpoints see are 1 or 2 shared IPs… There’s no way to accidentally open inbound connections, and that security by default is pleasing. Even though the IPv6 Egress Only gateway doesn’t allow inbound connections, you do feel more exposed, and so much of security is an annoying vibes type thing
Some steps to take
It’s sometimes easy to forget that we have run out of IPv4 addresses, and that we’re muddling through with Carrier Grade NAT and other annoying technologies that work, but make everyone’s lives just that little bit worse… But we can improve this incrementally, and anyone who’s worked with me knows I love making things better gradually.
Enable IPv6 on any managed services you use by default. If you’re running CDN hosts on CloudFront, or an API on API Gateway, those can all run IPv6, and are running it outside of your VPC. There is no security risk to you, and by creating those, you give other people benefit
Enable IPv6 in your VPC, but just on your public subnets. If adding IPv6 makes you nervous, start with ‘just’ doing it on your public subnets, and use it to give your load-balancers IPv6 addresses. This makes life better for other people, and gives you learning for the next step
Enable IPv6 with Egress only gateway, for any subnets that have NAT: this is where you start saving money, as hopefully you’re NAT’ing less traffic and saving on those charges
Not every use-case will save money, I haven’t generally had the patterns of bandwidth usage that would benefit from this… but if you’re using a lot of NAT bandwidth, maybe it’s time to look at IPv6 as a potential cost-saving.
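As a rough way to size the opportunity, here is a sketch of the NAT arithmetic. The two prices are assumptions based on commonly quoted us-east-1 NAT gateway rates and will differ by region and over time; the traffic figures are made up. Standard egress charges apply on both paths, so they cancel out and are left out.

```python
# Assumed NAT gateway prices (USD); check current pricing for your region.
NAT_HOURLY_USD = 0.045
NAT_PER_GB_PROCESSED_USD = 0.045
HOURS_PER_MONTH = 730


def nat_monthly_cost(gb_processed: float, gateways: int = 1) -> float:
    """Hourly rental plus per-GB processing for managed NAT gateways."""
    rental = gateways * NAT_HOURLY_USD * HOURS_PER_MONTH
    return rental + gb_processed * NAT_PER_GB_PROCESSED_USD


def saving_from_ipv6(gb_processed: float, ipv6_share: float) -> float:
    """Processing charge avoided if ipv6_share of the traffic skips NAT.

    The rental stays for as long as any IPv4-only traffic needs the gateway.
    """
    return gb_processed * ipv6_share * NAT_PER_GB_PROCESSED_USD


# Hypothetical workload: 10 TB a month through one gateway, 80% IPv6-capable.
monthly_gb = 10_000
print(f"NAT today:       ${nat_monthly_cost(monthly_gb):,.2f}")
print(f"Possible saving: ${saving_from_ipv6(monthly_gb, 0.8):,.2f}")
```

With small volumes the hourly rental dominates and the saving is negligible, which matches my own experience; it only gets interesting once the per-GB line is the big one.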
But to those of us who have been building on AWS (and other providers) for many years, it’s not a surprise, and we all have stories where we’ve done similar.
When you’re solving a problem, you look at the managed services that you have available, considering factors like:
Your team’s experience with the service
Limitations on the service, and what it was intended for, against what you’re doing
What quotas may apply, and whether you might hit them
How the pricing model works
While pricing for managed-services is generally based on usage, sometimes specific costs will apply more to your workload, e.g. if you’re serving small images, you’ll be more impacted by the per-request cost than the bandwidth charges.
I would be surprised if an experienced architect hasn’t faced a situation where “Service X would be perfect for this, if only it didn’t have this restriction Y, or wasn’t priced for Z”.
My example
We’d built out a system that was performing 3 distinct processing steps on large files.
The system had built out incrementally, and we had the 3 steps on three different auto-scale groups, fed by queues.
While some of the requests could be processed from S3 as a stream, one task required downloading the file to a filesystem, and that download took time.
The users wanted to reduce the end-to-end processing time. Some of the tasks were predicated on passing prior steps, and so we didn’t want to make the steps parallel.
Attempt 1: “EFS looks good”
We used the ‘relatively’ new Elastic File System service from AWS… The first component downloaded the file, subsequent operations used this download.
This also had the advantage that, since the ‘smallest’ service was first, you paid for that download on the cheapest instance, and the more expensive instances didn’t have to download it.
We developed, deployed, and for the first morning it was really quick… until we discovered that we were using burst quota, and spent the afternoon rolling back.
Filesystem throughput was allocated based on the amount stored on the filesystem, but as this was a transient process, we didn’t replenish it quickly enough, and didn’t like the idea of just making large random files to earn it.
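To show why a transient workload struggles here, a sketch of the baseline arithmetic. It assumes the commonly documented bursting-mode figure of roughly 50 KiB/s of baseline throughput per GiB stored; treat that constant as an assumption and verify it against the current EFS documentation. The example sizes are made up.

```python
# Assumed bursting-mode baseline: 50 KiB/s per GiB stored (verify against EFS docs).
BASELINE_KIB_PER_SECOND_PER_GIB = 50


def baseline_mib_per_second(stored_gib: float) -> float:
    """Sustained throughput earned by the amount of data stored."""
    return stored_gib * BASELINE_KIB_PER_SECOND_PER_GIB / 1024


def gib_needed_for(target_mib_per_second: float) -> float:
    """How much data must sit on the filesystem to sustain a target rate."""
    return target_mib_per_second * 1024 / BASELINE_KIB_PER_SECOND_PER_GIB


# Hypothetical numbers: a near-empty transient filesystem vs a 100 MiB/s target.
print(f"20 GiB stored earns {baseline_mib_per_second(20):.2f} MiB/s")
print(f"100 MiB/s needs {gib_needed_for(100):,.0f} GiB stored")
```

A filesystem that only ever holds the files currently in flight never stores enough to earn a useful baseline, so once the burst credits are gone the throughput falls off a cliff.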
Now you can just pay for provisioned throughput, perhaps in a small part because of a conversation we had with the account managers.
Attempt 2: “Sod it, just combine them”
The processes varied in complexity, there was effectively a trivial, a medium, and a high complexity task… So the second solution we approached was combining all the tasks onto a single service… the computing power for the highest task would zoom through the other two tasks, and so we combined them into what I jokingly called “the microlith”.
We didn’t touch the other parts of the pipeline, or the database, they remained in other services, but combining the 3 steps worked.
What did we gain
The system was faster, and even more usefully to operators, more predictable.
Once processing had started you could tell, based on the file size, when the items would be ready…
Much like “higher p90 but lower maximum” feels better for user experience, this consistency was great.
What did we lose
Two of the three components had external dependencies, which meant this component was one of the less ‘safe’ to deploy. While tests built up to defend against that, the impact of a failed deploy was larger than you’d want.
In Conclusion
There are always times when breaking your patterns makes sense, the key is knowing what you’re both gaining and losing, and taking decisions for the right reasons at the right times.
Prime Video refining an architecture to better meet scaling and cost models, making it less “Pure”, isn’t the gotcha against these services that some people would have you believe.
Here’s the one wild secret they don’t want you to know.
Please forgive me the horrible SEO title and URL…
I recently ported a number out of Skype UK to a much cheaper SIP VOIP provider. Skype served me well for a number I really just needed for compliance reasons… and it seemed obvious that I should actually transfer it to a supplier that would cost me about 12GBP per year, rather than 57 EUR as I was paying.
Number porting in the UK is a magical mystery tour, for all the times it “just works” other times you’ll feel like you’re playing the worst possible game of DND, with a horrible enemy and a disinterested Game Master.
Skype initially said “you don’t need to do anything special, just ask the new provider to port in” but turns out there was this one secret trick they didn’t think I’d want to know…
Skype expect your ‘surname’ as provided by your new supplier, to actually be your Skype username.
I discovered this after two failed porting attempts (for which my new super cheap provider charges me, which I understand, since the ongoing rental is so cheap).
I had asked Skype multiple times what to do: before the first port, after the first failure, after the second failure.
Most of these interactions were pretty infuriating: “Porting is done by your new provider… it’s nothing to do with us” – despite the fact that My New Provider asks Skype to port, and Skype then says yes/no… They are pretty involved in the situation.
Skype wouldn’t tell the new provider what was wrong, beyond “the surname didn’t match” – This is entirely correct: since Skype UK don’t seem to implement any porting code/verification scheme, for security they can’t and shouldn’t tell the port destination…
However they wouldn’t then tell me, via an authenticated channel, what they were expecting. Merely “you did it wrong try again… porting is nothing to do with us.”
It was only when I went on another chat, talking about going to the ombudsman and generally being a pain, that the porting team authorised the port and revealed to the provider “yeah, the surname needs to be the username”.
This is not mentioned in their documentation, or provided in live-chat, because it’s seemingly a rare occurrence.
So if you’ve come to this post, having had problems porting out, try again telling your new provider that your ‘surname’ is in fact your username.
I hate being on the end of the Dangling Hello, and the 15 minutes of massively predicting what the person wants. But I still find it very hard to bundle it all up in the first message.
Equally, 4 notifications in quick succession can feel like literal torture.
You can still ask how people are doing, but you can just include that upfront, in a single message.
Hello X, hope you’re good, can you tell me what’s going on with TICKET-123
Me, Slack, This Year
Priority Tagging, ideally lower
Low Priority exists as well as High Priority on emails.
Flagging a gossipy/catchup IM as such in the opening.
Clarify & Summarise
The discomfort at being That Guy who pastes back the summary of what you agreed is less than the pain when you discover that you weren’t all sharing understanding.
When half the team thought “advance by 2 seconds” meant delete 2 seconds, and the other half thought add 2 seconds..
Always a default
When arranging things, I’m going to offer a default, always.
“I’m free all day” vs “I’m free all day, how about 11”
Make it easier to say “Great”, done.
Stick to Core Hours
I’m a freelancer, I work self-defined hours… but that’s not mine to share with others.
While it’s useful for me to get thoughts out of my head into an email, that doesn’t mean I need to get them forced onto other people…
If I’m sending an email, I’ll set it to send later
If it’s an IM, I’ll set Slackbot to remind me or maybe the person, during the next working day
In Conclusion
We’re all drowning in a sea of notifications. If you can make yours just a little better, you make it easier for people around you to help you.
As someone thoroughly locked into the Apple ecosystem, until recently I’ve not been able to easily ask for BBC Radio services through my HomePods. I had to follow some Reddit posts to install shortcuts: “hey Siri, play BBC 6 Music”, and suddenly I find myself listening to more BBC Radio as a consequence.
Previously I think you had to listen to stations ‘enough’ for Siri to recognise the activity, then it could be a suggested shortcut. It was a bit ugly, and down to how intents originally worked.
The Siri APIs have got more developed over time, and now it is possible to do this in a way that doesn’t require upfront declaration.
It’s funny then, that given the choice of “adding a play with Siri intent to BBC Sounds” or “delaying podcast release to open platforms”, that Auntie chose the latter…
This has the strange result:
The BBC’s first party actions have me listening to less BBC Podcast Audio – why listen to a topical podcast 4 weeks later, and having to remember to check BBC Sounds doesn’t match my workflow
Actions by a Third-Party, have me listening to more BBC Live audio
I know there are always backlogs, but Siri intents have been around for a while now…
Having installed some invasive ‘online proctoring’ software, I tried to ask the company how to uninstall it.
I was, finally, getting around to doing AWS certifications, and one of the ways you can do that is via the Pearson OnVue proctoring system. You run software that limits what you can use your machine for, and stops you using multiple windows to look up answers. It asks for a fair bit of access when you run it, unsurprisingly.
I installed it, I ran the test to see if it worked, and shortly afterwards was having problems with my computer and the clipboard – and wondered if that was a ‘feature’ of this software.
I have since resolved these separately, but I wanted to uninstall OnVue, but there is NO detail of how to do this on the website.
I have asked all their contact channels, “how do I uninstall it” – I get back a variety of canned responses:
“you can’t cancel an exam via email”
“if you run a test exam you can see if the software works on your system”
Now given it doesn’t look like it was an “installed” app and just an application that ran from a zip file, uninstalling it could be as simple as not running it ever again, and deleting the download.
But I don’t know whether it has installed a few system extensions or similar. I’ve had a quick look and I don’t think it has, but is it too much to expect a webpage owned by Pearson that says that? Right now the search results for “remove onvue macintosh” are swamped by advertising pages for Mac removal software.
A document that states what to do online, an answer in the knowledge base so agents respond – is really the bare minimum.
If you make me run/install it, give me a clear way to remove it.