State-Issued Identifiers aren’t generally good passwords

Just because a number is long, doesn’t mean that it’s secure…

For many years, despite repeated requests, a South African bank has been sending me bank statements.

The thing is that I don’t live in South Africa, have never visited, nor banked there. But I do have a particularly “good” email address that gets a lot of misdirected emails… I usually don’t read them, but this week I did, as I replied with the latest in a series of “please stop” requests.

The email included a PDF statement, password-protected with the South African ID number of the customer. I suspect that protection is why the service centre seems so unbothered by the repeated requests to stop sending this information.

A few years back, I got very embedded in PII/GDPR. We were designing a data warehouse setup that allowed analysis of user activity, while protecting privacy, and enabling easier compliance with GDPR deletion requests. There was discussion about the feasibility of SHA2 hash reversal… and we took a lot of time communicating its infeasibility to the legal team.

So this week, I started to wonder: If I were a bad actor (which I am not), could I feasibly crack this ID with some Python?

What’s the challenge?

There are three sets to consider in this:

  • The Absolute Keyspace: without any knowledge of the identifiers, the number of combinations?
  • The Available Keyspace: if the ID has constraints, how many valid combinations are there?
  • The Effective Keyspace: if we know anything more, how many combinations are applicable?

A good security identifier should make discerning the differences between these difficult: it should involve a reasonable amount of calculation before it becomes obvious that an ID is valid, and correct.

What do we know?

South African ID Numbers are 13 decimal digits.

A single decimal digit can be 0-9, ten in total, and so each digit has a cardinality of 10.

(Cardinality being the maths word for ‘number of choices’ in a set)

We can use this information to calculate the absolute keyspace:

Keyspace_{Absolute} &= 10 \times 10 \times 10 \times 10 \times 10 \\
&\quad \times 10 \times 10 \times 10 \times 10 \\
&\quad \times 10 \times 10 \times 10 \times 10 \\
&= 10^{13} \\
&= 10,000,000,000,000

Since this work was derived in a Jupyter notebook, I’ll also include some python as we go along…

Python Code for absolute keyspace
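The notebook cell itself didn’t survive this export; a minimal sketch of the calculation (variable names are mine) would be:

```python
# Each of the 13 digits can take any of 10 values (0-9)
DIGITS = 13
CARDINALITY = 10

absolute_keyspace = CARDINALITY ** DIGITS
print(f"Number of combinations: {absolute_keyspace:,}")
```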

Number of combinations: 10,000,000,000,000

Is this feasible?

So without knowing anything about the ID, there are 10 trillion combinations to check.

Python can attempt to open a PDF with a password around 150 times per second. This would be our basic implementation.

More specialised tools like John the Ripper raise that rate to around 4,500 per second – roughly 30 times faster.

We’ll put these into a summary function, as we’ll be calling this a few times.

Python for Summary Function
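Again the original cell is missing; a sketch of the summary function, assuming the 150/s and 4,500/s rates above:

```python
SECONDS_PER_HOUR = 3600
RATES = {"slow rate": 150, "optimised": 4_500}  # attempts per second

def summarise(keyspace: int) -> dict:
    """Print and return worst-case hours to try every key at each rate."""
    hours_by_rate = {}
    for label, rate in RATES.items():
        hours = keyspace / rate / SECONDS_PER_HOUR
        hours_by_rate[label] = hours
        print(f"At {label}: {hours:,.2f} hours, or: {hours / 24:,.2f} days")
    return hours_by_rate

summarise(10 ** 13)
```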

At slow rate: 18,518,518.52 hours, or: 771,604.94 days
At optimised: 617,283.95 hours, or: 25,720.16 days

At this stage, it would not be worth the cost to brute-force this.

The information in the ID, or the document, is not valuable enough.

Scoping the “Available Keyspace”

The South African ID format is described here and in this OECD PDF.

The format is YYMMDDSSSSCAZ:

  • YYMMDD is the date of birth
  • SSSS separating people born on the same day
    • Female entries start 0-4, male entries start 5-9
  • C represents citizenship status
  • A was previously used to represent race, but is now unspecified
  • Z is a checksum digit, using the Luhn algorithm

How does this help us?

The Z check digit reduces the key space by a factor of 10: we “only” have to brute-force for the first 12 digits, and calculate the 13th.

Since A is unspecified, we will leave its cardinality unchanged at 10.

The C citizenship status can be 0, 1, or 2. This digit now has cardinality of 3 instead of 10.


The YYMMDD digits are dates; these have constraints:

  • Years are from 00-99
  • Months are from 01-12
  • Days are from 01-31

If we just consider those digits individually, we can calculate the cardinality like this:

dates &= 10 \times 10 \times 2 \times 10 \times 4 \times 10 \\
&= 80,000

But that’s going to consider many impossible days: month 19 doesn’t exist, nor does day 35.

So we could just consider these 6 digits as a combined date field, and get a more useful answer:

dates &= years \times days \\
& = 100 \times 365 \\
&= 36,500

(Yes, I am ignoring leap-years in this calculation… they’re not material to the result)

Our new understanding of the ID number format comes together, and we can compare the Absolute with the Available keyspaces:

Keyspace_{Absolute} &= 10^{13} \\
&= 10,000,000,000,000 \\
Keyspace_{Available} &= valid\_dates \times serial\_numbers \\
&\quad \times citizenship\_status \times A\_column \\
&= 36,500 \times 10,000 \times 3 \times 10 \\
&= 10,950,000,000

So even without any knowledge of the target, only 0.1095% of the original key space needs to be searched.

Python for Available keyspace
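A sketch of the available-keyspace calculation, using the cardinalities derived above (names are mine):

```python
VALID_DATES = 100 * 365   # YYMMDD treated as a date field (leap years ignored)
SERIALS = 10 ** 4         # SSSS: four free digits
CITIZENSHIP = 3           # C can be 0, 1 or 2
A_COLUMN = 10             # A is unspecified, so full cardinality
# Z is derived with the Luhn algorithm, so it adds nothing to the search

available_keyspace = VALID_DATES * SERIALS * CITIZENSHIP * A_COLUMN
print(f"Number of valid ID numbers: {available_keyspace:,}")
```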

Number of valid ID numbers: 10,950,000,000
At slow rate: 20,277.78 hours, or: 844.91 days
At optimised: 675.93 hours, or: 28.16 days

One month with optimised checking could be feasible, especially if you rented some machines… but can we do better?

What’s the Effective Keyspace?

[Image: a recreation of the bank’s email, with a banner reading “your electronic statement”. The body reads: “Dear Mr G. Customer, Please find attached your statement for May. • Information that only you will know is displayed in the eStatement verification block. This is done so you can be sure your statement is from BANKCO. • You will be required to enter the 13 digits of your identity number to view your statement.” A verification box shows – account number: *******5678, ID Number: *********1234]
A recreation of the email sent by the bank

Revisiting the email, it contains some info I’ve ignored until now:

  • The last 4 digits of the ID as verification
  • The recipient is addressed as ‘Mr Customer’

Going back to the format YYMMDDSSSSCAZ.

We know the values for C & A, so those now have cardinality of 1.

  • C ‘only’ had a cardinality of 3, so that excludes 67% of possibilities
  • A had a cardinality of 10, so that excludes 90% of possibilities

These combine however, so the remaining amount from knowing C & A is: \frac{1}{3} \times \frac{1}{10} = \frac{1}{30}

Let’s reconsider the SSSS block: which we’ll refer to as S1, S2, S3, S4.

  • Since our ID is male, we know that S1 must be 5/6/7/8/9, so the cardinality of that digit is now 5
  • We know S4, so it has cardinality of 1

Again these combine, so the total remaining is: \frac{5}{10} \times \frac{1}{10} = \frac{1}{20}

SSSS_{possible} &= 10^{4} = 10,000 \\
SSSS_{from\_email} &= 5 \times 10 \times 10 \times 1 \\
&= 500

Checking in the formula again:

Keyspace_{Absolute} &= 10,000,000,000,000 \\
Keyspace_{Available} &= 10,950,000,000 \\
Keyspace_{Effective} &= valid\_dates \times serial\_numbers \\
&\quad \times citizenship\_status \times A\_column \\
&= 36,500 \times 500 \times 1 \times 1 \\
&= 18,250,000

We’re now down to 18.25 million possible keys to check.

Python for effective keyspace
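A sketch of the effective-keyspace calculation under the constraints read from the email:

```python
VALID_DATES = 100 * 365               # date field, as before
SERIALS_FROM_EMAIL = 5 * 10 * 10 * 1  # S1 male (5-9), S2/S3 free, S4 known
CITIZENSHIP = 1                       # C known from the email's last digits
A_COLUMN = 1                          # A known from the email's last digits

effective_keyspace = VALID_DATES * SERIALS_FROM_EMAIL * CITIZENSHIP * A_COLUMN
print(f"Number of numbers matching email: {effective_keyspace:,}")
```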

Number of numbers matching email: 18,250,000
At slow rate: 33.80 hours, or: 1.41 days
At optimised: 1.13 hours, or: 0.047 days

Even the naive 1.41 days is really starting to look feasible, and with John the Ripper we’re already doing it in little over an hour.

But what about the check digit?

Earlier we ignored the check digit, since we can calculate it… but we were supplied it in the email.

We can use it to see if the ID is a potential match, and only check matching ones against the file.

Luhn-format check digits use simple modulo-10 arithmetic.

This means only 10% of the generated IDs will be checked against the PDF password.

Python for checksum validated keyspace
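The original cell is missing; here is a sketch of the check-digit filter, assuming the standard Luhn algorithm over the full 13 digits (the `luhn_check_digit` helper is mine):

```python
def luhn_check_digit(partial: str) -> int:
    """Digit that makes partial + digit pass a standard Luhn check.

    With the check digit appended on the right, the partial's rightmost
    digit sits in a 'doubled' position.
    """
    total = 0
    for i, ch in enumerate(reversed(partial)):
        d = int(ch) * 2 if i % 2 == 0 else int(ch)  # double alternate digits
        total += d - 9 if d > 9 else d              # digit-sum of doubled values
    return (10 - total % 10) % 10

# Luhn is mod-10 arithmetic, so for random 12-digit prefixes each check
# digit is roughly equally likely: only ~1 in 10 candidates need the
# expensive PDF-open attempt.
effective_keyspace = 36_500 * 500
checksum_keyspace = effective_keyspace // 10
print(f"Entries matching email check digit: {checksum_keyspace:,}")
```

Filtering on the check digit is pure arithmetic, so it is vastly cheaper than attempting the PDF password.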

Entries matching email check digit: 1,825,000
At slow rate: 3.38 hours, or: 0.14 days
At optimised: 0.11 hours, or: 0.005 days

Our ‘effective’ keyspace is now 1,825,000 entries.

So even with a naive implementation just in Python, we can do it in less than a day.

Age scoping

A friend pointed out that searching all ages between 0-100 is a bit pointless, so we could narrow that to a range of 18-70.

Because the birthday field covers 100 years cleanly, we can simply scale by the number of years we want to test:

keyspace_{Age Scoped} = keyspace_{effective} \times \frac{years\_to\_test}{100}

However, given the effective keyspace we’re already down to, the impact of age reduction feels less useful in this scenario; if you had less information to begin with, this reduction could be more useful.

Python for age reduced keyspace
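A sketch of the age-scoped calculation, testing ages 18-70 (52 of the 100 possible YY years):

```python
MIN_AGE, MAX_AGE = 18, 70
years_to_test = MAX_AGE - MIN_AGE  # 52 of the 100 possible YY values

checksum_keyspace = 1_825_000      # from the check-digit step above
age_scoped_keyspace = checksum_keyspace * years_to_test // 100
print(f"Number in valid age range: {age_scoped_keyspace:,}")
```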

Number in valid age range: 949,000
At slow rate: 1.76 hours, or: 0.07 days
At optimised: 0.06 hours, or: 0.002 days

In summary

From an absolute keyspace of 10,000,000,000,000, we’ve excluded over 99.99999% of the possible numbers, and have only 949,000 to check against the file.

Python to generate table
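A sketch that rebuilds the summary table from the keyspace sizes above (the formatting is mine):

```python
ABSOLUTE = 10 ** 13
rows = [
    ("Absolute Keyspace", ABSOLUTE),
    ("Available Keyspace", 10_950_000_000),
    ("Email Keyspace", 18_250_000),
    ("Using Email Checkdigit", 1_825_000),
    ("Limiting by Age", 949_000),
]

print(f"{'Set of potential ID Numbers':<28}{'Size of set':>19}"
      f"{'% of Absolute':>15}{'Hrs @150/s':>13}{'Hrs @4,500/s':>14}")
for name, size in rows:
    pct = size / ABSOLUTE * 100
    slow_hours = size / 150 / 3600
    fast_hours = size / 4_500 / 3600
    print(f"{name:<28}{size:>19,}{pct:>14.6f}%"
          f"{slow_hours:>13,.1f}{fast_hours:>14,.1f}")
```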

Set of potential ID Numbers | Size of set | Percentage of Absolute | Hours @ 150/s | Hours @ 4,500/s
--- | --- | --- | --- | ---
Absolute Keyspace | 10,000,000,000,000 | 100.000000% | 18,518,518.5 | 617,284.0
Available Keyspace | 10,950,000,000 | 0.109500% | 20,277.8 | 675.9
Email Keyspace | 18,250,000 | 0.000182% | 33.8 | 1.1
Using Email Checkdigit | 1,825,000 | 0.000018% | 3.4 | 0.1
Limiting by Age | 949,000 | 0.000009% | 1.8 | 0.1


Graphing this is really hard: the differences in scale make it really difficult to communicate.

Since I’m not a data-vis genius, this uses a log-scale.

Python to generate Graph
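The chart itself didn’t survive this export; as a dependency-free stand-in, a quick log-scale text chart makes the same point (bar lengths are proportional to log10 of each keyspace):

```python
import math

keyspaces = {
    "Absolute": 10 ** 13,
    "Available": 10_950_000_000,
    "Email": 18_250_000,
    "Checkdigit": 1_825_000,
    "Age limited": 949_000,
}

# On a linear scale every bar except "Absolute" would be invisibly
# small, so scale bar length by log10 of the keyspace instead.
for name, size in keyspaces.items():
    bar = "#" * round(math.log10(size) * 3)
    print(f"{name:<12} {bar} {size:,}")
```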


This analysis shows that, in common with recent attacks on the TETRA encryption system, if you’re not using all of the absolute keyspace, your protection is far weaker than the big number may suggest.

These national/structured IDs do not make good secrets: the structure inherently reduces the size of the effective keyspace, and makes it very easy to exclude ranges of people (by age or gender).

While phishing is a problem, and emails need to be (and appear) authentic – we need mechanisms that achieve this at the email level: SPF, DKIM, DMARC, BIMI. While imperfect, these are far better than including information directly related to the ID/information being protected.

In this scenario, even with a naive implementation, it would be entirely feasible to brute-force this particular email/PDF combination, which would expose customer information.

Now I don’t know how valuable the information in the statement is, but I wonder if it could be used as part of a social engineering attack?

A plea to companies: if I message asking you to stop sending me statements, maybe stop?

Falsehoods Smart-Device people believe about Home Networks

A few years ago someone posted a great article about the bad assumptions programmers make about names; here’s a similar list about assumptions about home networks and smart devices.

We all remember the excellent Falsehoods people believe about names don’t we?

Having lived with a few smart devices sharing my network for a while, I thought we needed a similar one about smart devices and home networking.

Items marked with a * contributed or inspired by @davidmoss

  • The WiFi is always available
  • The WiFi is continuously connected to the internet
  • The WiFi network isn’t hidden
  • The WiFi network isn’t restricted by MAC address so they can be hidden from the user
  • The WiFi network doesn’t use strong authentication like WPA2
  • The WiFi network definitely doesn’t use authentication mentioning the word ‘Enterprise’
  • The user knows the exact authentication type in use for the WiFi, so no need to auto-detect it*
  • There is only a single WiFi network
  • The name of the WiFi network is ASCII*
  • There is only a single access point for the WiFi network
  • Any device connected to the home-network is trusted to control the smart devices on it
  • Smart devices and their controllers are on the same network
  • Devices on the network can connect directly to each other
  • The network is simple, and doesn’t use other technologies such as powerline1
  • All networks have a PC type device to install/configure/upgrade devices (and that device is running Windows)*
  • There is always a DHCP Server*
  • Devices will always get the same IP address on the internal network from the DHCP server
  • DHCP device names don’t have to be explanatory, because nobody ever sees them
  • Devices can have inbound connections from the internet 2
  • The network is reliable without packet loss
  • The connectivity is sufficient for all devices on the network
  • The performance characteristics of the network are constant and don’t change over time
  • The Internet connectivity isn’t metered, and there’s no problem downloading lots of data
  • Encryption of traffic is an overhead that isn’t needed on embedded devices
  • Predictable IDs like Serial-Numbers are good default security tokens
  • Unchangeable IDs like Serial-Numbers are acceptable security tokens
  • The device won’t be used as a platform for attacks, so doesn’t need to be hardened against threats internal and external to the network 3
  • Devices can be shipped and abandoned. They won’t be used for years, so any future software vulnerabilities can be ignored
  • IPv6 is for the future, and doesn’t need to be supported4

What have I missed?

  1. These should be layer-2 transparent, but they can disrupt multicast, which can break Bonjour
  2. aside from security implications, ISPs are moving to a carrier-grade NAT to work around IPv4 address exhaustion, so inbound ports may not be possible
  3. many devices have a pretty complete Linux stack, at least complete enough for attackers to use
  4. Chicken and Egg this one

Security is hard, but the easy bits aren’t

The hard bits of security are hard, but the easy bits aren’t. As infrastructure gets more dynamic, we need to make sure that not just anyone can redefine it.

Another week, another story about security.

Actually multiple stories about security.

And what’s upsetting about these ones is the fact that the fixes for them are already available.

I don’t cut code anymore. I’m not a particularly adept coder, and I think my code is a bit ugly. But I still know what bad practice smells like, and what upsets me is how often we repeat the mistakes of old. 1 2

Yes there are always deadlines, but if we’re working with advanced software defined infrastructures, then we have to restrict who can redefine those.

If you’re in a Product Manager role, don’t be afraid to ask what you’re doing for security, or what the response plans are if something is compromised. Be mindful of the risk to your reputation if you don’t give developers time to improve security instead of piling ever more features on. The mitigations for the most obvious attacks are documented, and usually relatively easy to implement.

And now to the details

Code Spaces had all their data wiped, we don’t know all the details but it sounds like:

  • They hadn’t enabled 2-factor auth on their AWS account
  • Their backups weren’t to a different AWS account, or better still to another provider.

If you’re running a production service, and you’re hosting data for anyone else, then your backups need to be rock solid. Backing up to the same provider, in the same account, is like copying all the files from your desktop into a folder called “backup”. Sure, you’ve got two copies, but when that disk goes bang, they’re both gone.

And yes, 2 Factor is a pain when you’re logging into services, but if you’re hosting customer data that’s a pain you need to cope with. Providers usually let you set up many secondary accounts with reduced privileges, so use those tools to protect your services, and let people do just what they need in order to do their jobs.

On a similar theme, people are leaving their AWS keys in Android apps. Amazon offers a ticket-granting service that’s ideal for this; it’s more work, but work that you should be doing.

Some people aren’t even using those permissioning tools to embed keys with limited access (which, to reiterate, you shouldn’t be doing anyway). Instead they are embedding their main access key pair, which means that attackers could access and delete all data, and spin up thousands of instances just for fun/profit.

Security is hard: the recent problems found in libraries like OpenSSL are hard for an individual coder to work around, but decent libraries are still better than going it alone.

The 80:20 rule is ever present. Will you ever make your app fully secure? Unlikely. Can you prevent the most obvious attacks with the application of best practices, which many programming languages can do for you? Yes.

Don’t leave keys lying around, don’t give apps or services any more permissions than they need, and don’t use predictable IDs for sensitive data…

Do sanitise data you’re given, protect from XSS attacks, turn on 2-Factor Authentication for anything serious and always keep decent backups hosted on separate infrastructure…

These lists go on, but they’re not new: best practice years ago is still best practice now.

  1. Don’t get me started on file-moving scripts that don’t use incoming and outgoing folders to avoid race-conditions
  2. Or when we tolerate software from vendors that can’t run as anything other than root or Administrator

Security becoming life and death

When medical devices are hacked, is it finally time to accept that security should be an implicit requirement?

(Given many of my posts are second rate Gruber posts on the mac, this one is a second rate Schneier)

I like Chip+PIN. I don’t think EMV is perfect: it has the complexity of a committee driven standard created by competing companies, and it has flaws and oversights. I’ll still wager it’s more secure than someone looking at a signature, and since skimming attacks get immediately moved abroad (when the cloned cards are created from the legacy mag-stripe) behavioural analysis makes spotting fraud a bit easier.

I do not feel the same way about Verified By Visa which I continue to curse every time I use it.

Anyway, I very much disliked the UK Cards Association’s response to the excellent Cambridge Computer Laboratory when they published flaws and potential attacks, demanding they take the papers down. They played the near-standard “oh, it’s very hard to do right now, we don’t think anyone could really do that” line – the researchers are very clever, and most people won’t be. The only problem is that with each new vulnerability, the Cambridge team appear to be producing more plausible attacks. UK Cards were rightly told to go away.

It would have been nicer to hear:

“We thank the CCL for their work in exposing potential attacks in the EMV system. At the moment we think these are peripheral threats, but we will work with EMV partners to take the findings onboard, and resolve these as the standard evolves”

This of course blows the “Chip+PIN is totally secure” line out of the water – which matters because they’re trying to move the liability onto the consumer, and admitting the system is even partially compromised lessens that.

At the end of the day, this is just money. There’s always been fraud, there always will be. Not life and death.

I used to work in Broadcast. Many of those systems were insecure, relying on being in a partitioned network. DNS and Active Directory were frowned on, being seen as potential points of failure rather than useful configuration and security tools. The result was a known, but brittle, system. Hardening of builds was an afterthought, and the armadillo model – crunchy perimeter, soft centre – meant that, much like the US Predator drone control pods, once inside, passage was easy.

Depressing, yes? Particularly because so many of these problems were solved before, and solved well. But it was just telly. Not life and death.

I mean, it’s not like you can remotely inject someone with a lethal dose of something.

Except it is: a few months back, someone reverse-engineered the protocol of their insulin pump, and was able to control it with the serial number. This was bad enough. Devices that inject things into humans shouldn’t be controllable without some form of authentication beyond a 6-digit number.

At the time the familiar: “it’s too difficult, you still need the number, you’ve got to be nearby” response was provided.

Two months later, another security person has now managed to decode the magical number, and used a long-distance aerial to send commands to the pump.

I’m sure it’s still “too hard to be viable”: because the death of someone isn’t something that has major consequences that could have the kind of support that makes hard things viable…

Security is hard to do well, and we need to start embedding it in everything – it is now a matter of life and death. But it’s hard, and the hard parts are psychological just as much as technical. You should really use an existing algorithm implementation, because the chances are it’s better than yours: but that means licensing and IPR, so instead you roll your own cipher, believing your application is too trivial to be a target for hacking. Besides, your proprietary wire-protocol is proprietary – it’s already secret. People aren’t going to bother to figure it out.

Security makes things harder: you can’t just wire-sniff your protocol anymore to debug stuff. Your test suites become more complicated because you can no longer play back the commands and expect the device to respond. That little embedded processor isn’t powerful enough to be doing crypto: it’s going to up the unit price, it’s going to increase power usage and latency.

Many programmers, still, belong to the “if I hit it and hit it until it works” school of coding. I don’t mean test-driven-development, I’m meaning those coders who think if it compiles, it ships. These people don’t really adapt well to working in a permissions based sandbox; it’s harder to split your processes up so that only the things that need the privileges have them (we’ve all done ‘chmod 777 *’ to get an application up and running).

Until everyone realises that every device with smarts is a vector – from batteries, to APIs, to websites – we’re increasingly at risk. I guess that massive solar flare could take things out for us.