State-Issued Identifiers aren’t generally good passwords

Just because a number is long, doesn’t mean that it’s secure…

For many years, despite repeated requests, a South African bank has been sending me bank statements.

The thing is that I don’t live in South Africa, and have never visited or banked there. But I do have a particularly “good” email address that gets a lot of misdirected emails… I usually don’t read them, but this week I did, as I replied with my latest attempt at “please stop”.

The email included a PDF statement, password protected with the South African ID number of the customer. I suspect that protection is why the service centre seem so unbothered by the repeated requests to stop sending this information.

A few years back, I got very embedded in PII/GDPR. We were designing a data warehouse setup that allowed analysis of user activity, while protecting privacy, and enabling easier compliance with GDPR deletion requests. There was discussion about the feasibility of SHA2 hash reversal… and we took a lot of time to communicate the infeasibility with the legal team.

So this week, I started to wonder: If I were a bad actor (which I am not), could I feasibly crack this ID with some Python?

What’s the challenge?

There are three sets to consider in this:

  • The Absolute Keyspace: without any knowledge of the identifier, how many combinations are there?
  • The Available Keyspace: if the ID has constraints, how many valid combinations are there?
  • The Effective Keyspace: if we know anything more, how many combinations are applicable?

A good security identifier should make discerning the differences between these difficult: it should involve a reasonable amount of calculation before it becomes obvious that an ID is valid, and correct.

What do we know?

South African ID Numbers are 13 decimal digits.

A single decimal digit can be 0-9, ten values in total, and so each digit has a cardinality of 10.

(Cardinality being the maths word for ‘number of choices’ in a set)

We can use this information to calculate the absolute keyspace:

\begin{align*}
Keyspace_{Absolute} = &10 \times 10 \times 10 \times 10 \times 10 \\
& \times 10 \times 10 \times 10 \times 10 \\
& \times 10 \times 10 \times 10 \times 10  \\
 = & 10^{13}\\ 
 = & 10,000,000,000,000
\end{align*}

Since this work was derived in a Jupyter notebook, I’ll also include some python as we go along…

Python Code for absolute keyspace
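
A minimal sketch of this calculation. The constant names are illustrative; `keyspace_absolute` is the name used in the call further down.

```python
# Each of the 13 positions can hold any decimal digit.
DIGITS = 13
CARDINALITY = 10

keyspace_absolute = CARDINALITY ** DIGITS

print(f"Number of combinations: {keyspace_absolute:,}")
```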

Number of combinations: 10,000,000,000,000

Is this feasible?

So without knowing anything about the ID, there are 10 trillion combinations to check.

Python can attempt to open a PDF with a password around 150 times per second. This would be our basic implementation.

More specialised tools like John the Ripper raise that rate to 4,500 per second, which is 30 times faster.

We’ll put these into a summary function, as we’ll be calling this a few times.

Python for Summary Function
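
A sketch of how such a function could look, assuming the two rates quoted above and an output layout chosen to match the results shown in this post. The constant names are illustrative, not taken from the notebook.

```python
SLOW_RATE = 150         # attempts per second, plain Python
OPTIMISED_RATE = 4_500  # attempts per second, John the Ripper
SECONDS_PER_HOUR = 3_600


def summary_to_brute_force(keyspace):
    """Describe how long an exhaustive search of `keyspace` would take."""
    slow_hours = keyspace / SLOW_RATE / SECONDS_PER_HOUR
    fast_hours = keyspace / OPTIMISED_RATE / SECONDS_PER_HOUR
    return (
        f"At slow rate: {slow_hours:,.2f} hours, or: {slow_hours / 24:,.2f} days\n"
        f"At optimised: {fast_hours:,.2f} hours, or: {fast_hours / 24:,.3f} days"
    )
```

These are worst-case times for searching the whole set; on average a match would turn up about halfway through.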

print(summary_to_brute_force(keyspace_absolute))
At slow rate: 18,518,518.52 hours, or: 771,604.94 days
At optimised: 617,283.95 hours, or: 25,720.165 days

At this stage, it would not be worth the cost to brute-force this.

The information in the ID, or the document, is not valuable enough.

Scoping the “Available Keyspace”

The South African ID format is described here and in this OECD PDF.

The format is YYMMDDSSSSCAZ:

  • YYMMDD is the date of birth
  • SSSS is a sequence number separating people born on the same day
    • Female entries start 0-4, male entries start 5-9
  • C represents citizenship status
  • A was previously used to represent race, but is now unspecified
  • Z is a checksum digit, using the Luhn algorithm

How does this help us?

The Z check digit reduces the key space by a factor of 10: we “only” have to brute-force for the first 12 digits, and calculate the 13th.
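
To make "calculate the 13th" concrete, here is a small sketch of a standard Luhn check-digit calculation. The function name is illustrative, and the input in the usage line is the payload of the widely used Luhn test number 79927398713, not an ID number.

```python
def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left; the digit nearest the check digit is doubled,
    # then every second digit after that.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


print(luhn_check_digit("7992739871"))  # → 3
```

Because the last digit is fully determined by the first 12, it adds nothing to the number of combinations that need to be tried.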

Since A is unspecified, we will leave its cardinality unchanged at 10.

The C citizenship status can be 0, 1, or 2. This digit now has cardinality of 3 instead of 10.

Dates

The YYMMDD digits form a date, which has constraints:

  • Years are from 00-99
  • Months are from 01-12
  • Days are from 01-31

If we just consider those digits individually, we can calculate the cardinality like this:

\begin{align*}
dates &= 10 \times 10 \times 2 \times 10 \times 4 \times 10 \\
&= 80,000
\end{align*}

But that’s going to consider many impossible days: month 19 doesn’t exist, nor does day 35.

So we could just consider these 6 digits as a combined date field, and get a more useful answer:

\begin{align*}
dates &= years \times days \\
& = 100 \times 365 \\
&= 36,500
\end{align*}

(Yes, I am ignoring leap-years in this calculation… they’re not material to this calculation)
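
A quick sketch comparing the two estimates with an exact count. The 100-year window used for the exact count is an assumption picked purely for illustration.

```python
from datetime import date

# Digit-by-digit cardinalities for Y, Y, M, M, D, D.
digitwise = 10 * 10 * 2 * 10 * 4 * 10

# YYMMDD treated as a single date field, ignoring leap years.
combined = 100 * 365

# Exact number of days in one assumed 100-year window, leap days included.
window_start = date(1924, 1, 1)
window_end = date(2024, 1, 1)
exact = (window_end - window_start).days

print(f"Digit by digit: {digitwise:,}")
print(f"Combined field: {combined:,}")
print(f"Exact, with leap days: {exact:,}")
```

For this window the leap days add 25 dates, a difference of under 0.1%, which is why they can be ignored here.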

Our new understanding of the ID number format comes together, and we can compare the Absolute with the Available keyspaces:

\begin{align*}
Keyspace_{Absolute} = &10^{13}\\
= &10,000,000,000,000 \\
Keyspace_{Available} = &valid\_dates \times serial\_numbers \\
& \times citizenship\_status \times A\_column \\
= &36,500 \times 10,000 \times 3 \times 10 \\
= &10,950,000,000
\end{align*}

So even without any knowledge of the target, only 0.1095% of the original key space needs to be searched.

Python for Available keyspace
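
A minimal sketch of the calculation, using the factor names from the formula above. The variable name `keyspace_available` is illustrative.

```python
valid_dates = 100 * 365     # YYMMDD as a combined field, ignoring leap years
serial_numbers = 10 ** 4    # SSSS
citizenship_status = 3      # C can be 0, 1 or 2
a_column = 10               # A is unspecified, so all ten digits are allowed

keyspace_available = valid_dates * serial_numbers * citizenship_status * a_column
keyspace_absolute = 10 ** 13

print(f"Number of valid ID numbers: {keyspace_available:,}")
print(f"Share of absolute keyspace: {keyspace_available / keyspace_absolute:.4%}")
```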

Number of valid ID numbers: 10,950,000,000
At slow rate: 20,277.78 hours, or: 844.91 days
At optimised: 675.93 hours, or: 28.164 days

One month with optimised checking could be feasible, especially if you rented some machines… but can we do better?

What’s the Effective Keyspace?

[Image: a badly recreated email from a bank, with a banner reading "your electronic statement"]

Dear Mr G. Customer

Please find attached your statement for May

  • Information that only you will know is displayed in the eStatement verification block. This is done so you can be sure your statement is from BANKCO.
  • You will be required to enter the 13 digits of your identity number to view your statement.

Verification Info
account number: *******5678
ID Number: *********1234

A recreation of the email sent by the bank

Revisiting the email, it contains some info I’ve ignored until now:

  • The last 4 digits of the ID as verification
  • The recipient is addressed as ‘Mr Customer’

Going back to the format YYMMDDSSSSCAZ.

We know the values for C & A, so those now have cardinality of 1.

  • C ‘only’ had a cardinality of 3, so that excludes 67% of possibilities
  • A had a cardinality of 10, so that excludes 90% of possibilities

These combine however, so the remaining amount from knowing C & A is: \frac{1}{3} \times \frac{1}{10} = \frac{1}{30}

Let’s reconsider the SSSS block: which we’ll refer to as S1, S2, S3, S4.

  • Since the email addresses the customer as ‘Mr’, the ID is male, so S1 must be 5/6/7/8/9, and the cardinality of that digit is now 5
  • We know S4, so it has cardinality of 1

Again these combine, so the total remaining is: \frac{5}{10} \times \frac{1}{10} = \frac{1}{20}

\begin{align*}
SSSS_{possible} &= 10^{4} = 10,000 \\
SSSS_{from\_email} &= 5 \times 10 \times 10 \times 1 \\
&= 500
\end{align*}

Checking in the formula again:

\begin{align*}
Keyspace_{Absolute} = &10,000,000,000,000 \\
Keyspace_{Available} = &10,950,000,000 \\
Keyspace_{Effective} = &valid\_dates \times serial\_numbers \\
& \times citizenship\_status \times A\_column \\
= &36,500 \times 500 \times 1 \times 1 \\
= &18,250,000
\end{align*}

We’re now down to 18.25 million possible keys to check.

Python for effective keyspace
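
A minimal sketch of the calculation. The per-digit variable names are illustrative.

```python
valid_dates = 100 * 365   # nothing in the email narrows the birth date

# Cardinality of each digit in the SSSS block after reading the email.
s1 = 5    # 'Mr' means the first digit is 5-9
s2 = 10   # unknown
s3 = 10   # unknown
s4 = 1    # shown in the email
serial_numbers = s1 * s2 * s3 * s4

citizenship_status = 1    # C is shown in the email
a_column = 1              # A is shown in the email

keyspace_effective = valid_dates * serial_numbers * citizenship_status * a_column

print(f"Number of numbers matching email: {keyspace_effective:,}")
```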

Number of numbers matching email: 18,250,000
At slow rate: 33.80 hours, or: 1.41 days
At optimised: 1.13 hours, or: 0.047 days

Even the naive 1.41 days is really starting to look feasible, and with John the Ripper, we’re already doing it in a little over an hour.

But what about the check digit?

Earlier we ignored the check number, since we can calculate it… but we were supplied it in the email.

We can use it to see if the ID is a potential match, and only check matching ones against the file.

Luhn format checkdigits use simple modulo 10 arithmetic.

This means only 10% of the generated IDs will have a check digit that matches the one in the email, and only those need to be checked against the PDF password.

Python for checksum validated keyspace
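
A sketch of the filtering step for a single birth date, assuming a standard Luhn check digit. The function names and the date `800101` are made up for illustration, and the tail `1234` is the placeholder from the recreated email.

```python
from itertools import product

DIGITS = "0123456789"


def luhn_check_digit(payload):
    """Standard Luhn check digit for a string of decimal digits."""
    total = 0
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit = digit * 2 - 9 if digit > 4 else digit * 2
        total += digit
    return (10 - total % 10) % 10


def candidates_for_date(yymmdd, known_tail):
    """Yield 13-digit IDs for one date that agree with the known last four digits."""
    s4, c, a, z = known_tail
    # S1 is 5-9 for a male ID; S2 and S3 are unknown.
    for s1, s2, s3 in product("56789", DIGITS, DIGITS):
        payload = f"{yymmdd}{s1}{s2}{s3}{s4}{c}{a}"
        if luhn_check_digit(payload) == int(z):
            yield payload + z


matches = list(candidates_for_date("800101", "1234"))
print(f"Per date: {len(matches)} of 500 candidates survive")
print(f"Entries matching email check digit: {len(matches) * 36_500:,}")
```

Changing any one digit maps onto the ten possible check digits one-to-one, so exactly one value of S3 survives for each choice of S1 and S2. That is where the clean 10% comes from.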

Entries matching email check digit: 1,825,000
At slow rate: 3.38 hours, or: 0.14 days
At optimised: 0.11 hours, or: 0.005 days

Our ‘effective’ keyspace is now 1,825,000 entries.

So even with a naive implementation just in Python, we can do it in less than a day.

Age scoping

A friend pointed out that searching all ages between 0-100 is a bit pointless, so we could change that to be a range of 18-70?

Because the birthday field covers 100 years cleanly, we calculate the number of years we want to test.

keyspace_{Age Scoped} = keyspace_{effective} \times \frac{years\_to\_test}{100}

However, given how small the effective keyspace already is, the age reduction makes little difference in this scenario. If you had less information, this reduction could be far more useful.

Python for age reduced keyspace
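
A minimal sketch of the scaling, assuming the 18-70 range suggested above and starting from the check-digit-filtered count. The variable names are illustrative.

```python
keyspace_checked = 1_825_000   # entries matching the email check digit

youngest, oldest = 18, 70      # assumed age range of a bank customer
years_to_test = oldest - youngest

keyspace_age_scoped = keyspace_checked * years_to_test / 100

print(f"Number in valid age range: {keyspace_age_scoped:,}")
```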

Number in valid age range: 949,000.0
At slow rate: 1.76 hours, or: 0.07 days
At optimised: 0.06 hours, or: 0.002 days

In summary

From an absolute key space of 10,000,000,000,000, we’ve excluded over 99.99999% of the possible numbers, and have only 949,000 to check against the file.

Python to generate table
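
A sketch of how the rows could be produced with plain string formatting. The column widths and the helper name `table_row` are arbitrary choices for illustration.

```python
ABSOLUTE = 10 ** 13
SLOW_RATE, OPTIMISED_RATE = 150, 4_500   # attempts per second

keyspaces = [
    ("Absolute Keyspace", 10 ** 13),
    ("Available Keyspace", 10_950_000_000),
    ("Email Keyspace", 18_250_000),
    ("Using Email Checkdigit", 1_825_000),
    ("Limiting by Age", 949_000),
]


def table_row(name, size):
    """Format one row: size, share of absolute, hours at each rate."""
    slow_hours = size / SLOW_RATE / 3600
    fast_hours = size / OPTIMISED_RATE / 3600
    return (
        f"{name:<24}{size:>20,}{size / ABSOLUTE:>14.6%}"
        f"{slow_hours:>16,.1f}{fast_hours:>12,.1f}"
    )


for name, size in keyspaces:
    print(table_row(name, size))
```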

| Set of potential ID Numbers | Size of set | Percentage of Absolute | Hours @ 150/s | Hours @ 4,500/s |
|---|---|---|---|---|
| Absolute Keyspace | 10,000,000,000,000 | 100.000000% | 18,518,518.5 | 617,284.0 |
| Available Keyspace | 10,950,000,000 | 0.109500% | 20,277.8 | 675.9 |
| Email Keyspace | 18,250,000 | 0.000182% | 33.8 | 1.1 |
| Using Email Checkdigit | 1,825,000 | 0.000018% | 3.4 | 0.1 |
| Limiting by Age | 949,000 | 0.000009% | 1.8 | 0.1 |

Visualisation

Graphing this is really hard, differences in scale make it really difficult to communicate.

Since I’m not a data-vis genius, this uses a log-scale.

Python to generate Graph
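
The plotting code itself depends on a charting library, so as a stand-in here is a text-only sketch of the same idea: one bar per keyspace, with length proportional to log10 of its size. The scale factor of 4 characters per power of ten is an arbitrary choice.

```python
import math

keyspaces = {
    "Absolute Keyspace": 10 ** 13,
    "Available Keyspace": 10_950_000_000,
    "Email Keyspace": 18_250_000,
    "Using Email Checkdigit": 1_825_000,
    "Limiting by Age": 949_000,
}

CHARS_PER_DECADE = 4  # bar characters per power of ten

for name, size in keyspaces.items():
    exponent = math.log10(size)
    bar = "#" * round(exponent * CHARS_PER_DECADE)
    print(f"{name:<24}{bar} 10^{exponent:.1f}")
```

On a linear scale every bar after the first would be invisible; on the log scale each step down is still clearly visible.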

Conclusions

This analysis shows that, in common with recent attacks on the Tetra encryption system, if you’re not using all of the absolute keyspace, your protection is far weaker than a big number may make it appear.

These national/structured IDs do not make good secrets: the structure inherently reduces the size of the effective keyspace, and makes it very easy to exclude ranges of people (by age or gender).

While phishing is a problem, and emails need to be/appear authentic – we need to use mechanisms to achieve this at the email level: SPF, DKIM, DMARC, BIMI. While imperfect, these are far better than including information directly related to the ID/information being protected.

In this scenario, even with a naive implementation, it would be entirely feasible to brute-force this particular email/PDF combination, which would expose customer information.

Now I don’t know how valuable the information in the statement is, but I wonder if it could be used as part of a social engineering attack?

A plea to companies: If I message asking you to stop sending me statements, maybe stop?

2023 Comms Resolutions

What things can you do in 2023 to make your communications more efficient and considerate?

I don’t really like New Years resolutions for reasons beyond the scope of this post.

This year, however, I am going to try and make a few changes to how I communicate, in work and otherwise.

“No Hello”

No Hello on instant messaging.

I hate being on the end of the Dangling Hello, and the 15 minutes of massively predicting what the person is wanting. But I still find it very hard to bundle it all up in the first message.

Equally, 4 notifications in quick succession can feel like literal torture.

You can still ask how people are doing, but you can just include that upfront, in a single message.

Hello X, hope you’re good, can you tell me what’s going on with TICKET-123

Me, Slack, This Year

Priority Tagging, ideally lower

Low Priority exists as well as High Priority on emails.

Flagging a gossipy/catchup IM as such in the opening.

Clarify & Summarise

The discomfort at being That Guy who pastes back the summary of what you agreed is less than the pain when you discover that you weren’t all sharing understanding.

Like when half the team thought “advance by 2 seconds” meant delete 2 seconds, and the other half thought it meant add 2 seconds.

Always a default

When arranging things, I’m going to offer a default, always.

“I’m free all day” vs “I’m free all day, how about 11”

Make it easier to say “Great, done”.

Stick to Core Hours

I’m a freelancer, I work self-defined hours… but that’s not mine to share with others.

While it’s useful for me to get thoughts out of my head into an email, that doesn’t mean I need to get them forced onto other people…

  • If I’m sending an email, I’ll set it to send later
  • If it’s an IM, I’ll set Slackbot to remind me or maybe the person, during the next working day

In Conclusion

We’re all drowning in a sea of notifications. If you can make yours just a little better, you make it easier for people around you to help you.

Perfect is indeed the enemy of good

The desire to do things well stops us doing them at all.

I re-connected with someone on LinkedIn the other week. (Yes, I actually use it like that). And he sent a lovely, long, detailed reply. One that I was delighted to read. One that I want to reply to.

But I haven’t.

Anytime someone sends me a nice, long, structured message, on pretty much any medium, it falls into the awful silo of “well, I need to sit down and write a nice reply”.

And it stays in that silo, along with all the other things like that.

So instead, I’ll write a little blog post about not being able to write, using up some of my daily word-quota in the process, and making the writing of the reply, even less likely.

 

On email etiquette

Lovely seeing you recently by the way, how are the kids? Great, that’s lovely, can you do me a favour?

Is it better to skip past the faux-pleasantries and to save everyone some time?

Hi,

How are you doing, long no time no speak, how are the kids? That new house you bought? Your family, they’re doing well? The cat? Oh…run over, that’s really sad.

How’s that project whose name I can’t remember with the things and stuff? And the weather?

BTW CAN YOU HELP ME BECAUSE I NEED SOMETHING?

I’m looking for new opportunities at the moment (Technical Product Management: check-out my Linked-In). I’m speaking to people in my network, including the sleeper-cells I’ve not spoken to in some time.

I’m trying to avoid emails like the ones above. People are busy: even before you open a message from someone you’ve not spoken to in years, the subtext is pretty obvious.

Sure I’ll genuinely say “Hope you are well” but anything else seems insincere.

Am I wrong to skip the dance, get to the point quickly and save everyone some time? Or am I being rude by not playing the game?

paper saving

Until we really are paperless, a simple idea to save paper when printing out emails.

While we’re meant to be in the era of the paperless office, I still print more than I’d like.

Why don’t Outlook, and web browsers, have something in the pagination engine that detects when there are fewer than 5 lines of text on the final page of a printout? When it finds this, it would shrink the text/spacing (by a level most people wouldn’t notice on a multiple page document) and repaginate to avoid that overspill and save one sheet of paper.

This only really works for plain text and HTML, where there isn’t (usually) explicit pagination, but would be largely transparent to users. (Word did have a “shrink by 1 page” button; I’m unsure if this still exists, and using it requires user intervention.)

No longer would the phrase “please consider the environment before printing this email” languish alone on its own bit of paper.