State-Issued Identifiers aren’t generally good passwords

Just because a number is long doesn’t mean that it’s secure…

For many years, despite repeated requests, a South African bank has been sending me bank statements.

The thing is that I don’t live in South Africa, and have never visited or banked there. But I do have a particularly “good” email address that gets a lot of misdirected emails… I usually don’t read them, but this week I did, as I replied with my latest attempt at “please stop”.

The email included a PDF statement, password protected with the South African ID number of the customer. I suspect that protection is why the service centre seems so unbothered by the repeated requests to stop sending this information.

A few years back, I got very embedded in PII/GDPR. We were designing a data warehouse setup that allowed analysis of user activity, while protecting privacy, and enabling easier compliance with GDPR deletion requests. There was discussion about the feasibility of SHA2 hash reversal… and we took a lot of time to communicate the infeasibility with the legal team.

So this week, I started to wonder: If I were a bad actor (which I am not), could I feasibly crack this ID with some Python?

What’s the challenge?

There are three sets to consider in this:

The Absolute Keyspace: without any knowledge of the identifiers, how many combinations are there?
The Available Keyspace: if the ID has constraints, how many valid combinations are there?
The Effective Keyspace: if we know anything more, how many combinations are applicable?

A good security identifier should make discerning the differences between these difficult: it should involve a reasonable amount of calculation before it becomes obvious that an ID is valid, and correct.

What do we know?

South African ID Numbers are 13 decimal digits.

A single decimal digit can be 0-9, ten values in total, and so each digit has a cardinality of 10.

(Cardinality being the maths word for ‘number of choices’ in a set)

We can use this information to calculate the absolute keyspace:

$$\text{Absolute keyspace} = 10^{13} = 10{,}000{,}000{,}000{,}000$$

Since this work was derived in a Jupyter notebook, I’ll also include some python as we go along…

Python Code for absolute keyspace

number_of_digits = 13
keyspace_absolute = 10 ** number_of_digits
print(f"Number of combinations: {keyspace_absolute:,}")

Number of combinations: 10,000,000,000,000

Is this feasible?

So without knowing anything about the ID, there are 10 trillion combinations to check.

Python can attempt to open a PDF with a password around 150 times per second. This would be our basic implementation.

More specialised tools like John the Ripper raise that rate to 4,500 per second, which is 30 times faster.

We’ll put these into a summary function, as we’ll be calling this a few times.

Python for Summary Function

test_rate_slow = 150
test_rate_fast = 4500
def summary_to_brute_force(combinations):
    hours_to_test_slow = (combinations / test_rate_slow) / 60 / 60
    hours_to_test_fast = (combinations / test_rate_fast) / 60 / 60
    return f"At slow rate: {hours_to_test_slow:,.2f} hours, or: {hours_to_test_slow / 24:,.2f} days" + \
           f"\nAt optimised: {hours_to_test_fast:,.2f} hours, or: {hours_to_test_fast / 24:,.3f} days"

print(summary_to_brute_force(keyspace_absolute))
At slow rate: 18,518,518.52 hours, or: 771,604.94 days
At optimised: 617,283.95 hours, or: 25,720.165 days

At this stage, it would not be worth the cost to brute-force this.

The information in the ID, or the document, is not valuable enough.

Scoping the “Available Keyspace”

The South African ID format is described here and this OECD PDF.

The format is YYMMDDSSSSCAZ:

YYMMDD is the date of birth
SSSS is a sequence number separating people born on the same day
- Female entries start 0-4, male entries start 5-9
C represents citizenship status
A was previously used to represent race, but is now unspecified
Z is a checksum digit, using the Luhn algorithm
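To make the layout concrete, here is a minimal sketch that splits an ID into the fields listed above. The function name `split_id_number` and the sample number are my own illustrative choices (a dummy value, not taken from the statement), and it checks nothing beyond the length and the characters used.

```python
def split_id_number(id_number: str) -> dict:
    """Split a 13-digit string into the YYMMDD SSSS C A Z fields."""
    if len(id_number) != 13 or not (id_number.isascii() and id_number.isdigit()):
        raise ValueError("expected exactly 13 decimal digits")
    sequence = id_number[6:10]
    return {
        "date_of_birth": id_number[0:6],   # YYMMDD
        "sequence": sequence,              # SSSS
        # First sequence digit: 0-4 female, 5-9 male (per the format notes above)
        "sex": "female" if sequence[0] in "01234" else "male",
        "citizenship": id_number[10],      # C
        "a_digit": id_number[11],          # A
        "check_digit": id_number[12],      # Z
    }

# Dummy ID number, used purely for illustration
fields = split_id_number("8001015009087")
print(fields)
```

Every field is a fixed slice of the string, which is exactly what makes the structure easy to reason about, and easy to attack.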

How does this help us?

The Z check digit reduces the key space by a factor of 10: we “only” have to brute-force for the first 12 digits, and calculate the 13th.

Since A is unspecified, we will leave its cardinality unchanged at 10.

The C citizenship status can be 0, 1, or 2. This digit now has cardinality of 3 instead of 10.

Dates

The YYMMDD digits are dates, these have constraints:

Years are from 00-99
Months are from 01-12
Days are from 01-31

If we just consider those digits individually, we can calculate the cardinality like this:

$$\underbrace{10 \times 10}_{YY} \times \underbrace{2 \times 10}_{MM} \times \underbrace{4 \times 10}_{DD} = 80{,}000$$

But that’s going to consider many impossible days: month 19 doesn’t exist, nor does day 35.

So we could just consider these 6 digits as a combined date field, and get a more useful answer:

$$100 \text{ years} \times 365 \text{ days} = 36{,}500$$

(Yes, I am ignoring leap years here… they’re not material to this calculation)
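As a quick sanity check on that claim, this sketch counts the real calendar dates in one 100-year window using the standard library. The choice of 1900 to 1999 is an assumption for illustration; a different window would contain a slightly different number of leap days.

```python
from datetime import date

# Assumed window for illustration: every date from 1900-01-01 to 1999-12-31
first_day = date(1900, 1, 1)
last_day = date(1999, 12, 31)

real_dates = (last_day - first_day).days + 1  # inclusive of both ends
approximation = 100 * 365

print(f"Real dates in window: {real_dates:,}")
print(f"Approximation:        {approximation:,}")
print(f"Difference:           {real_dates - approximation} leap days")
```

The leap days add well under 0.1% to the date field, so leaving them out doesn't change any of the conclusions.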

Our new understanding of the ID number format comes together, and we can compare the Absolute with the Available keyspaces:

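$$\text{Available keyspace} = 36{,}500 \times 10{,}000 \times 3 \times 10 = 10{,}950{,}000{,}000$$

$$\frac{10{,}950{,}000{,}000}{10{,}000{,}000{,}000{,}000} = 0.1095\%$$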

So even without any knowledge of the target, only 0.1095% of the original key space needs to be searched.

Python for Available keyspace

cardinality_of_birthdays = 100 * 365 # we're ignoring leap years
cardinality_of_serial_numbers = 10000
cardinality_of_citizenship_states = 3
cardinality_of_the_a_digit = 10
keyspace_available = cardinality_of_birthdays * cardinality_of_serial_numbers * cardinality_of_citizenship_states * cardinality_of_the_a_digit
print(f"Number of valid ID numbers: {keyspace_available:,}")
print(summary_to_brute_force(keyspace_available))

Number of valid ID numbers: 10,950,000,000
At slow rate: 20,277.78 hours, or: 844.91 days
At optimised: 675.93 hours, or: 28.164 days

One month with optimised checking could be feasible, especially if you rented some machines… but can we do better?

What’s the Effective Keyspace?

[Image: a badly recreated email from a bank, with the banner “your electronic statement”. The intro reads: “Dear Mr G. Customer, please find attached your statement for May.” Two bullet points follow: “Information that only you will know is displayed in the eStatement verification block. This is done so you can be sure your statement is from BANKCO.” and “You will be required to enter the 13 digits of your identity number to view your statement.” An additional box, “Verification Info”, shows: account number *******5678, ID number *********1234.]

Caption: A recreation of the email sent by the bank

Revisiting the email, it contains some info I’ve ignored until now:

The last 4 digits of the ID as verification
The recipient is addressed as ‘Mr Customer’

Going back to the format YYMMDDSSSSCAZ.

The last four digits of the ID are S4, C, A and Z, so we know the values for C & A, and those now have a cardinality of 1.

C ‘only’ had a cardinality of 3, so that excludes 67% of possibilities
A had a cardinality of 10, so that excludes 90% of possibilities

These combine however, so the remaining amount from knowing C & A is: $\frac{1}{3} \times \frac{1}{10} = \frac{1}{30} \approx 3.3\%$

Let’s reconsider the SSSS block: which we’ll refer to as S1, S2, S3, S4.

Since our ID is male we know that the S1 must be 5/6/7/8/9, so cardinality of that digit is now 5
We know S4, so it has cardinality of 1

Again these combine, so the total remaining is: $\frac{5}{10} \times \frac{1}{10} = \frac{1}{20} = 5\%$

$$\text{Cardinality of SSSS} = 5 \times 10 \times 10 \times 1 = 500$$

Checking in the formula again:

$$36{,}500 \times 500 \times 1 \times 1 = 18{,}250{,}000$$

We’re now down to 18.25 million possible keys to check.

Python for effective keyspace

cardinality_of_birthdays = 100 * 365 # ignoring leap years for now
cardinality_of_serial_numbers = 500
cardinality_of_citizenship_states = 1
cardinality_of_the_a_digit = 1
total_number_of_email_matching_combinations = cardinality_of_birthdays * cardinality_of_serial_numbers * cardinality_of_citizenship_states * cardinality_of_the_a_digit
excluded_by_using_email_information = keyspace_available - total_number_of_email_matching_combinations
print(f"Number of numbers matching email: {total_number_of_email_matching_combinations:,}")
print(summary_to_brute_force(total_number_of_email_matching_combinations))

Number of numbers matching email: 18,250,000
At slow rate: 33.80 hours, or: 1.41 days
At optimised: 1.13 hours, or: 0.047 days

Even the naive 1.41 days is really starting to look feasible, and with John the Ripper we’re already doing it in a little over an hour.

But what about the check digit?

Earlier we ignored the check digit, since we can calculate it… but we were supplied it in the email.

We can use it to see if the ID is a potential match, and only check matching ones against the file.

Luhn format checkdigits use simple modulo 10 arithmetic.

This means only 10% of the generated IDs will be checked against the PDF password.
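Here is a minimal sketch of that filter. It implements the standard Luhn calculation, then counts how many candidates for a single birthday agree with a known check digit. The helper name and the dummy values (`800101` for the birthday, and `9`, `0`, `8`, `7` for the known last four digits) are assumptions made up for illustration, not values from the statement.

```python
from itertools import product

def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left; the digit nearest the check digit is doubled first
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10

# Dummy values for illustration only
birthday = "800101"
known_s4, known_c, known_a, known_z = "9", "0", "8", 7

candidates = 0
matches = 0
# S1 is 5-9 for a male ID; S2 and S3 are unknown
for s1, s2, s3 in product("56789", "0123456789", "0123456789"):
    payload = f"{birthday}{s1}{s2}{s3}{known_s4}{known_c}{known_a}"
    candidates += 1
    if luhn_check_digit(payload) == known_z:
        matches += 1

print(f"{matches} of {candidates} candidates match the check digit")
```

For these dummy values it prints `50 of 500 candidates match the check digit`: S3 sits in an undoubled position, so each of its ten values produces a different check digit and exactly one in ten survives.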

Python for checksum validated keyspace

using_checkdigit_exclusion = total_number_of_email_matching_combinations / 10
print(f"Entries matching email check digit: {using_checkdigit_exclusion:,.0f}")
print(summary_to_brute_force(using_checkdigit_exclusion))

Entries matching email check digit: 1,825,000
At slow rate: 3.38 hours, or: 0.14 days
At optimised: 0.11 hours, or: 0.005 days

Our ‘effective’ keyspace is now 1,825,000 entries.

So even with a naive implementation just in Python, we can do it in less than a day.

Age scoping

A friend pointed out that searching all ages between 0-100 is a bit pointless, so we could change that to be a range of 18-70?

Because the birthday field covers 100 years cleanly, we calculate the number of years we want to test.

$$\frac{70 - 18}{100} = \frac{52}{100} = 52\%$$

However, given the effective keyspace we’re already down to, the age reduction has less impact in this scenario. If you had less information to start with, it could be much more useful.
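One wrinkle with age scoping is that the two-digit year wraps around the century. This sketch lists the YY values for an age range; the function name and the reference year of 2023 are assumptions for illustration. It counts both endpoints, so it yields one more year than the simple subtraction used in this post.

```python
def birth_year_digits(min_age: int, max_age: int, reference_year: int) -> list:
    """Two-digit YY values for people aged min_age..max_age (inclusive).

    Approximates age as a difference in calendar years, ignoring whether
    the birthday has happened yet in the reference year.
    """
    oldest_birth_year = reference_year - max_age
    youngest_birth_year = reference_year - min_age
    return [
        f"{year % 100:02d}"
        for year in range(oldest_birth_year, youngest_birth_year + 1)
    ]

# Assumed reference year, for illustration only
yy_values = birth_year_digits(min_age=18, max_age=70, reference_year=2023)
print(f"{len(yy_values)} year values, from {yy_values[0]} to {yy_values[-1]}")
```

Because the list runs from `53` through `99` and then `00` to `05`, a candidate generator would need to iterate over these values rather than a single numeric range.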

Python for age reduced keyspace

min_age = 18
maximum_age = 70
age_range = maximum_age - min_age
percentage_of_people_in_age_range = age_range / 100
total_number_in_age_range = using_checkdigit_exclusion * percentage_of_people_in_age_range
print(f"Number in valid age range: {total_number_in_age_range:,}")
print(summary_to_brute_force(total_number_in_age_range))

Number in valid age range: 949,000.0
At slow rate: 1.76 hours, or: 0.07 days
At optimised: 0.06 hours, or: 0.002 days

In summary

From an absolute key space of 10,000,000,000,000, we’ve excluded over 99.99999% of the possible numbers, and have only 949,000 to check against the file.

Python to generate table

import pandas as pd
data = [
            [ keyspace_absolute, 100],
            [ keyspace_available, 100 * keyspace_available / keyspace_absolute],
            [ total_number_of_email_matching_combinations, 100 * total_number_of_email_matching_combinations / keyspace_absolute],
            [ using_checkdigit_exclusion, 100 * using_checkdigit_exclusion / keyspace_absolute],
            [ total_number_in_age_range, 100 * total_number_in_age_range / keyspace_absolute],
      ]
df = pd.DataFrame(data, columns=["Count", "Percentage of Total"], index=["Absolute Keyspace", "Available Keyspace", "Email Keyspace", "Using Email checkdigit", "Limiting by Age"])
brute_force_label = f"Hours @ {test_rate_slow}/s"
brute_force_faster_label = f"Hours @ {test_rate_fast:,}/s"
df[brute_force_label] = df["Count"] / test_rate_slow / 60 / 60
df[brute_force_faster_label] = df["Count"] / test_rate_fast / 60 / 60
df.style.set_properties(**{'font-family': "Menlo, Consolas, Monospace"}) \
  .format(subset=["Count"], thousands=",", precision=0) \
  .format('{:.6f} %', subset=["Percentage of Total"]) \
  .format(thousands=",", precision=1, subset=[brute_force_label, brute_force_faster_label])

Set of potential ID Numbers	Size of set	Percentage of Absolute	Hours @ 150/s	Hours @ 4,500/s
Absolute Keyspace	10,000,000,000,000	100.000000%	18,518,518.5	617,284.0
Available Keyspace	10,950,000,000	0.109500%	20,277.8	675.9
Email Keyspace	18,250,000	0.000182%	33.8	1.1
Using Email Checkdigit	1,825,000	0.000018%	3.4	0.1
Limiting by Age	949,000	0.000009%	1.8	0.1

Visualisation

Graphing this is really hard: the differences in scale make it difficult to communicate.

Since I’m not a data-vis genius, this uses a log-scale.

Python to generate Graph

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data = df['Count'][::-1]
data.plot(kind='barh', logx=True, title="Keyspace Reduction\nnote the log-scale",ax=ax)
ax.set_xticks(ax.get_xticks()[::2])  # keep every second tick to thin out the axis labels
ax.set_xlabel("Keys in Keyspace")

Conclusions

This analysis shows that, in common with recent attacks on the Tetra encryption system, if you’re not using all of the absolute keyspace, your protection is far weaker than a big number may make it appear.

These national/structured IDs do not make good secrets: the structure inherently reduces the size of the effective keyspace, and makes it very easy to exclude ranges of people (by age or gender).

While phishing is a problem, and emails need to be/appear authentic – we need to use mechanisms to achieve this at the email level: SPF, DKIM, DMARC, BIMI. While imperfect, these are far better than including information directly related to the ID/information being protected.

In this scenario, even with a naive implementation, it would be entirely feasible to brute-force this particular email/PDF combination, which would expose customer information.

Now I don’t know how valuable the information in the statement is, but I wonder if it could be used as part of a social engineering attack?

A plea to companies: If I message asking you to stop sending me statements, maybe stop?

The One Boring Reason Why People Use the AWS Service

One of my clients recently started using a relatively new AWS CI/CD Service, and I just stumbled on a defensive/marketing type post from one of the traditional providers. And it made me realise how much vendors can miss the reason people choose to go with the AWS/GCP/Azure service, even if it’s inferior.

Aside: I’m not going to link to the article because they don’t deserve the clicks.

Back to their post, it went through a familiar structure:

“But it doesn’t have all the features, our lovely features”
“You can’t self-host, you’re LOCKED-IN!”
“Why not buy into our broader platform?”

I’ll go through these in turn, before getting to the actual reasons.

“It doesn’t have the features…”

It doesn’t. It’s version 1 of an AWS product… they always launch very lean and gain new things.

And yes, it only supports 3 integrations while Vendor supports around 30. Turns out though those 3 are the most important ones. Others will be added I’m sure, but only where people will use them.

“You can’t self-host, you’re LOCKED-IN”

Good. I literally don’t want to.

I know that some Ops-Teams feel happier that they can touch a container or an instance, but this is a product that can be replaced quite easily, including by this Vendor should the need arise.

They do have a SaaS offering you can pay for, but it’s relatively expensive for small-teams. (And we’ll come onto legal things later)

“Why not buy into our broader platform?”

Lock-in to your cloud provider is bad, but if you use all of their products you can get a great unified experience… which sounds a little like, erm, lock-in.

The simple reason people choose the service on their Cloud… procurement

Companies generally make buying stuff difficult. Every new vendor is a new round of legal review, potentially procurement exercises. It’s a painful affair.

This Vendor does sell their SaaS platform on the AWS marketplace, but it’s another End User License Agreement (EULA) that needs to be accepted. And that means it has to be evaluated by a legal team: like most other EULAs, the lawyers will probably go “Yeah, it’s got a bunch of stuff in it that nobody could ever enforce, so proceed at a tiny risk”.

When you already have a cloud-provider, and the legal/finance agreements are in place, it’s just easier to use the provided service.

The ‘default’ product may well be inferior, have fewer features, and even be more expensive: but if I can click “use this” without involving legal – it’s the one I’ll likely choose.

Two Apps Better Than One?

Mixing Subscription and Transactional VOD in the same application can give a bit of a confusing user experience.

When Sky launched Sky Store, which lets you rent films, it felt unnecessary alongside their existing subscription services: Sky Go (for Sky TV subscribers) and Now TV (for everyone else). It seemed little more than an attempt to get extra space on Smart TV menus.

Since then though, LOVEFiLM has finally acknowledged its longtime parent Amazon. I churned from LOVEFiLM a while back¹, but I’ve a few months left on Amazon Prime. This now gives me access to “Prime Instant Video” via the “Amazon Instant Video” app².

It’s good because, thanks to all the exclusive content deals Amazon made ³, I’m able to catch up on the series that weren’t previously available to me ⁴.

But, the Amazon Instant Video app mixes stuff that’s ‘free’, with stuff I have to buy or rent. There are categories designed to help me filter; but if I search for a series directly, I’m back to the jeopardy of “free or not” after seeing a search result.

Netflix doesn’t have that: if I see it, I can play it. The logic of Sky Store becomes clear.

Yes, NowTV has three subscription tiers of Movies, Entertainment and Sport: but those are really clear facets. I know which of those I’ve paid for, so I search for an entertainment show, knowing I can watch it.

Multiple apps may be the online equivalent of grabbing extra shelf-space, but I can see the UX benefit in separating subscription from purchase & rental.

¹ And their come-back emails would not let me forget this
² Brand recognition since the rebrand is apparently poor
³ Alongside all the non-exclusive deals both Amazon and Netflix have
⁴ I’ve yet to figure out the rights-deal that’s made the BBC series Miranda appear with 4OD branding in Amazon

Definition of Slippery Slope

BT are being forced to block access to specific piracy websites. Lucky that they have the technology hanging around for the IWF watch-list then?

BT are being forced to block access to a piracy site.

This will no doubt use the BT Cleanfeed infrastructure used for the IWF. You either have something clever that proxies everything, or you redirect the blacklisted IPs to a filtering proxy. The former is expensive, the latter breaks Wikipedia anonymous updates.

Anyway, I wrote a while back about the point that the Aussie No Clean Feed were making. Give politicians and the judiciary a toolkit that can be applied generally, and they will.

This raises some depressing questions:

How long until this ruling applies to other ISPs?
How long until the IWF watch-list becomes broader to save content owners going after each ISP?
How long until refusing to use the IWF list, like some smaller ISPs, becomes illegal?
At what point is using VPN services outlawed: I use one when I’m on public WiFi but it would bypass any ISP provisions.

I’m sure none of us are really surprised, but it’s sad to be proven right.

Hashtags as plausible deniability from compliance

Rather than hosting discussions themselves, when broadcasters provide a hashtag, are they enabling people to have the conversation without the compliance implications of fully associating things with their brand name?

Various broadcasters have done “chat around content” applications: The Apprentice and The X-Factor being two examples. These are expensive to run, because the second the content is on bbc.co.uk or itv.com you face the wrath of compliance.

This is not a compliance rant: broadcasters need compliance, and at its best it helps programme makers get the most out of their rushes. It is an overhead though, and given we’ve yet to really see how we can “monetize” those people talking about shows, is it worth paying that penalty for a highly vocal but very small minority?

Enter the hashtag. Shows have been transmitting with cues of suggested hashtags for many years now. They’re really only used on Twitter, but you could argue that a tag is a platform-neutral way to flag your content.

You’re not hosting the discussion, which means you’re not paying for moderation, and you’re not liable for the compliance. The broadcaster is saying “you guys can go and huddle over there, but it’s not really anything to do with us, understood?”

This has implications for second-screening because I think it means that broadcasters are going to be loath to actually build wide-scale chat-around-content style applications: their name being associated with it just causes too much expectation. Can you imagine installing the “BBC Socialiser” app and getting the “THIS APPLICATION MAY CONTAIN BAD THINGS” pop-up from the iTunes store? It’s not what people would expect from the BBC.

Most people are still lazy, lean-back linear content consumers – the TV is a familiar friend at the end of the work-day who doesn’t expect much response, but over time people will want nicer ways to contextualise their tweets about content.

And so to the meaningless predictions…

Much as they will be loath to let go, broadcasters will realise that they can’t justify self-providing these services, and they will give more data away about schedules and items on-air in a form that can be better used to tag content by 3rd-party services
Services like Facebook, Twitter and Google+ will provide ways to embed this metadata in posts, so that a unique identifier of a show could be associated with a post, in a similar way to how geo-tagging appeared.

The combination of those two things means that you could create an app that was a specialised Twitter client. I don’t want a new social network for telly, but a client that embeds the magical codes to make everything more findable feels like a workable compromise.

Transition periods are the worst: technology, privacy and injunctions

Technology is disrupting privacy in a way that we can’t fight back from. Will it all be easier once we just accept it?

Transitional times are the worst. Much like the music industry trying to retain their existing business model based on recorded music, or broadcasters using DRM to maintain rights windows on content that is transmitted in-the-clear; it’s always difficult to move on. Once you’ve accepted change, it might not be as easy as it was before, but you’re at least not fighting the inevitable.

We’re currently fighting that battle with privacy. As people tag us in Facebook, and other people check us into insalubrious venues, we’re stuck in an ongoing battle to remove things that we don’t want stuck to our profile. We hide behind privacy settings on sites, only to watch a friend share a private RSS feed or one poorly-written API client leak all the information to Google. Our friends re-tweet from private accounts, disclosing partially-incriminating thoughts. Strangers can sometimes see one side of a conversation, not enough to know exactly what was said, but certainly enough for my mum to admonish me for some months ago.

Today we’ve had fun with super-injunctions, Twitter and parliamentary privilege. English courts trying to uphold rulings that Scotland and the Peoples’ Republic of Twitter are not subject to. And sure the identity of CTB is a nice bit of gossipy tittle-tattle, but what about when it’s the name of someone accused of a serious crime?

Our reporting restrictions are far more extensive than those of America, and while I don’t want to routinely have ‘perp-walks’ in the UK, I’d rather not have trials abandoned because our protections are unworkable in the modern world.

Away from the legal sphere, with the rise of computer vision and recognition projects, (look at the flurry of activity around the Kinect), and the availability of powerful on-demand computing resources (like GPU heavy instances from Amazon), privacy will soon be a problem that can be brute-forced away. Facebook is already rolling out photo recognition (this does seem to be taking longer than most of their phased roll-outs as I know a few people who had it months ago).

Embarrassing images we thought ‘anonymous’ because the face wasn’t shown will be tied down to people through bizarre combinations of EXIF tags, 3D room mapping, carpet recognition and a host of other recognition metrics that I can’t even imagine. That mole on your chest will no longer just be a minor cancer risk; it’s a data point that can be correlated.

Anyway, we’re in the transitional phase: we’re still trying to hold onto old models of privacy which in a few years won’t be possible to have without moving to the “Google Opt-Out town”.

The other side of this transition we’ll probably have less privacy, but nobody will really have privacy, and somehow that will make it alright – that or we’ll have to change our names after we leave university, and dispose of all of our electrical devices, have that mole removed, and if we want to run for political office be very careful what we get up-to at college.

The Months After Everyone Else Kindle Review

I got a Kindle for Christmas, and months after everyone else got one, I write about what I like and what I don’t.

I got a Kindle for Christmas from my lovely parents.

Why I like it:

Form factor, screen size, weight, battery life
I’m reading more long-form copy than I have in a while, I think because I can get the right amount of copy on screen to match my natural skimming style
The inherent task switch that you get by picking it up, and the single-tasking of it. As someone else said (not that I can find the link) the web-browser will be useful for emergencies, but that’s about it. No push alerts, growls, or games to distract
Instapaper’s integration is really lovely (and finally helps me address the “popping” of To-Read items instead of the “pushing” of them onto the stack)

Things I don’t like:

Limited choice of Newspapers – I would pay for the Guardian on it if I could. With the 3G variant, it’s a tablet thing always up-to-date with that.
Similar to that, I will not pay for The Economist again. I don’t have to pay for the online access, the iPhone app or the website over and above my subscription, so much as I would love to have The Economist on my Kindle – until Amazon/Publishers sort out a discount for subscribers I’m not
I can’t think how to do it efficiently, but the screensaver could be so much more than just a book image… (that said that completely breaks my previous statement that I like the mono-purpose of it)

I’m not going to pretend to know whose commercial teams are at fault for these, but they are the main gripes I’ve found so far. Given those are policy, rather than technology, I hope they’ll shake out in time.

A Hunt related suggestion for styleguide writers

A family friend provides a solution to spoonerisms.

Many years ago a friend of the family got divorced: “Alan and Julie” were longstanding friends, and when “Alan” got into another relationship, we found it difficult to say “Alan” without “And Julie”.

We flipped them, it was always “Sue and Alan”.

In the light of recent naughtiness, can I suggest it’s always “Culture Secretary, Jeremy Hunt”.

Is scientific tear-down fair use?

Ben Goldacre is being asked to take down an extract of a show illustrating woeful misunderstandings of the MMR vaccine, and the risks associated with it.

Ben Goldacre has been asked by the lovely lawyers at Global Radio to take down his 44-minute extract of the piece Jeni Barnett did on MMR. Jeni, who later admitted she was woefully ill-prepared and started off an emotive debate on her blog with the standard pathos-laden phrases like “as a mother…”, spouted a load of quasi-plausible pseudo-science about how awful vaccines were.

As Goldacre and others have pointed out many times, the Wakefield claims are totally refuted/withdrawn/dismissed now. There is no evidence that immune systems are overloaded by vaccination. There is a plethora of evidence that Measles is returning.

I hope he finds some legal representation, because at a time when we’re questioning the impact finance reporting can have on the real world economy, we should ask the same about science. But “as a mother…” people don’t tend to have opinions about the state of the credit default swaps market.