Hunting for secrets with the DumpsterDiver

6 min read · Jul 7, 2018

Imagine that an account for your test database in a QA environment got compromised. ‘What’s the big deal?’ you may think. The attacker may dump some test data, annoy you by dropping a table, or put some trolling data into it. No serious risk at all.

Now imagine you’re using the same kind of database, but this time it is deployed in a cloud environment and the keys for this test account got compromised. Such test accounts usually have quite loose permissions to make developers’ lives easier. What can an attacker do with them? For example, they can run cryptocurrency miners on c4.8xlarge instances in every possible region, and you’re going to pay thousands of dollars for the attacker’s profit. An attacker can even go a step further and bankrupt your company by removing all of its resources, including backups.

Cloud keys may leak from your company in numerous ways; if you’re interested in this topic, check out my presentation. However, in this article I want to focus only on key leaks from a company’s shared storage. Recently I did a little research, during which I found lots of files with sensitive data. After spending many hours reviewing my findings, I concluded that there is no way to manually search for secrets in terabytes of various files. This blog post describes how I found a method to automate the process of finding sensitive information in big volumes of data.

How to find keys?

Here are examples of an AWS Secret Key and an Azure Shared Key:

lxRV/uiC4knZQxyIZxSSlQ2xNlZMjo4km+LnjNiF
M3mmbjOlIZr11OZoULqUWyFA1EpOdZAEcmaC64E/Ft9MRfDEYE7qDJm+9ezGQY15==

So, what characteristics do those keys have? First of all, they have a fixed length: an AWS Secret Key is always 40 bytes long, while an Azure Shared Key is 66 bytes long. What is more, the keys contain only characters from the Base64 charset. The last characteristic is their high randomness. Is it possible to measure randomness? Surprisingly, yes: you can measure it in bits using the Shannon entropy.

The entropy

If you want to understand what entropy is and how you can calculate it, there is no better explanation than this article. Using Claude Shannon’s formula, let’s compare the entropy of a single character (the average amount of information delivered by a single message from a source of information) in the following strings:

- 404e554d243c1a11d13c96b60129504a31b0abd has 3.57 entropy.

- ChuckNorriscountedtoinfinitytwentytwice has 3.81 entropy.

- 2r9pAuQxUFAstrWhEy4G4WiVx5iJ74Hja5AWgHq9 has 4.67 entropy.

You can calculate the entropy on your own using this simple script.
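For readers who just want to try it, here is a minimal sketch of such a calculation in Python. It is not necessarily identical to the linked script; the function name `shannon_entropy` is my own choice for illustration. It computes the per-character entropy from the relative frequency of each distinct character in the string.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the Shannon entropy of a string, in bits per character."""
    if not data:
        return 0.0
    counts = Counter(data)
    length = len(data)
    # H = -sum(p * log2(p)), where p is the relative frequency of each character
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


# The three strings compared in the list above
SAMPLES = (
    "404e554d243c1a11d13c96b60129504a31b0abd",
    "ChuckNorriscountedtoinfinitytwentytwice",
    "2r9pAuQxUFAstrWhEy4G4WiVx5iJ74Hja5AWgHq9",
)

for sample in SAMPLES:
    print(f"{sample}: {shannon_entropy(sample):.2f}")
```

Run against the three strings above, this sketch gives 3.57, 3.81 and 4.67 respectively. Note that the result is bounded by the string's length: a 40-character string can never score higher than log2(40) ≈ 5.32.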

If a tested string satisfies the conditions below, then we can say with quite high probability that we’re dealing with a key:

- The string contains only Base64 characters (in other words, only consider strings delimited by non-Base64 characters, e.g. between “” or ‘’).
- The string’s length is between the MIN_LEN and MAX_LEN values (e.g. if MIN_LEN is 40 and MAX_LEN is 66 bytes, then you can find both an AWS Secret Key and an Azure Shared Key).
- The entropy of a single character of the string is higher than the ENTROPY value (all AWS Secret Keys have an entropy higher than 4.2, while the entropy of an Azure Shared Key is always higher than 5.8).

Those conditions became the basis for automating the process of finding keys in any text file.
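To make the three conditions concrete, here is a rough sketch of how they could be combined. This is only an illustration under my own assumptions, not the actual DumpsterDiver code: the names `find_key_candidates`, `min_len`, `max_len` and `entropy` are hypothetical, and the regular expression simply treats every run of Base64 characters (including `=` padding) as one candidate.

```python
import math
import re
from collections import Counter

# Condition 1: a candidate is a run of Base64 characters, delimited by anything else
BASE64_RUN = re.compile(r"[A-Za-z0-9+/=]+")


def shannon_entropy(data):
    """Shannon entropy of a string, in bits per character."""
    if not data:
        return 0.0
    length = len(data)
    return -sum((n / length) * math.log2(n / length)
                for n in Counter(data).values())


def find_key_candidates(text, min_len=40, max_len=66, entropy=4.2):
    """Return Base64 runs that pass the length and entropy conditions."""
    hits = []
    for match in BASE64_RUN.finditer(text):
        token = match.group()
        # Condition 2: length between MIN_LEN and MAX_LEN (inclusive)
        if not min_len <= len(token) <= max_len:
            continue
        # Condition 3: per-character entropy above the ENTROPY threshold
        if shannon_entropy(token) > entropy:
            hits.append(token)
    return hits


SAMPLE = (
    '"aws_secret_access_key": "lxRV/uiC4knZQxyIZxSSlQ2xNlZMjo4km+LnjNiF", '
    '"key": "M3mmbjOlIZr11OZoULqUWyFA1EpOdZAEcmaC64E/Ft9MRfDEYE7qDJm+9ezGQY15=="'
)

for candidate in find_key_candidates(SAMPLE):
    print(len(candidate), candidate)
```

With the default values the sketch reports both sample keys shown earlier. Under this particular entropy calculation the sample Azure key scores roughly 5.2 bits per character, so treat the ENTROPY threshold as something to tune against your own keys rather than as a fixed constant.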

The DumpsterDiver

Inspired by the TruffleHog tool, I wanted to create an enhanced version of it, one which can scan not only git repositories but any text file, and which reports far fewer false positives. Ladies and gentlemen, let me introduce the DumpsterDiver:

Here are the main characteristics of this tool:

- it can analyze any text file
- it can analyze git repositories
- it can unpack and analyze the content of common archives (e.g. .zip or .tgz)
- it can be customized to your needs to limit false positives (you can define different criteria depending on whether you hunt for any key or only for certain keys)
- it can find hardcoded passwords
- it can search for multiple grep words, with wildcard support
- it is available as a Docker image
- it writes the output in JSON format

DumpsterDiver in action

Let’s create the following test file (located in ./source_folder/):

{
    "aws_auth": {
        "aws_access_key_id": "AKIAJIS5NP79GW2AYZHA",
        "aws_secret_access_key": "lxRV/uiC4kmZQryIZxSSlQ6xNlZMjo4kn+LnjNiF"
    },
    "azure_auth": {
        "account": "foobar",
        "key": "M3mmbjOlIZr11OZoULqUWyFA1EpOdZAEcmaC64E/Ft9MRfDEYE7qDJm+9ezGQY15==",
        "container": "assets"
    },
    "mail_auth": {
        "login": "john.doe@evilcorp.com",
        "password": "M5UWx/N-yjuZ"
    }
}

Let’s assume we’re searching for any key in this file. The command would then look like this:

$> python3 DumpsterDiver.py -p ./source_folder/ --level 3

If we’re hunting only for an AWS Secret Key, then we should use the following command:

$> python3 DumpsterDiver.py -p ./source_folder/ --min-key 40 --max-key 40

Now, let’s say we’d like to find files containing any email address in the evilcorp.com domain and don’t want to be notified about high entropy findings. To suppress those, we set an entropy value high enough that the condition is never satisfied (in our case, any value > 6). The command looks like this:

$> python3 DumpsterDiver.py -p ./source_folder/ -a --entropy 6 --grep-words '*@evilcorp.com*'

DumpsterDiver for finding passwords

Looking for high entropy is quite an effective method, but only when you’re dealing with long strings. For shorter strings, e.g. 8–12 characters, this method may generate a lot of false positives. So again, let’s analyze the characteristics of a typical complex password:

- It is 8–12 characters long.
- It contains upper and lower case letters, at least one digit and a special character.

For that purpose we can use a Python library for… measuring password complexity, for example passwordmeter. Here are a few examples of how it works in practice:

>>> import passwordmeter
>>> pass1 = "12345678"
>>> passwordmeter.test(pass1)
(0.09042979745209619, {'charmix': 'Use a good mix of numbers, letters, and symbols', 'casemix': 'Use a good mix of UPPER case and lower case letters', 'notword': 'Avoid using one of the ten thousand most common passwords', 'phrase': 'Passphrases (e.g. an obfuscated sentence) are better than passwords'})
>>> pass2 = "password123"
>>> passwordmeter.test(pass2)
(0.2650850122376166, {'charmix': 'Use a good mix of numbers, letters, and symbols', 'casemix': 'Use a good mix of UPPER case and lower case letters', 'phrase': 'Passphrases (e.g. an obfuscated sentence) are better than passwords'})
>>> pass3 = "M5UWx/N-yjuZ"
>>> passwordmeter.test(pass3)
(0.9399445538875157, {'phrase': 'Passphrases (e.g. an obfuscated sentence) are better than passwords'})

Using the passwordmeter.test() method we can easily filter out all words which don’t follow password best practices 😃 We shouldn’t worry about trivial passwords, as they can be easily brute-forced anyway.
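If you just want to see the two password characteristics listed earlier expressed as code, here is a minimal standard-library sketch. It is a simplified stand-in, not what passwordmeter does internally and not the DumpsterDiver implementation; the function name `looks_like_complex_password` and its default lengths are assumptions taken from the 8–12 character range above.

```python
import string


def looks_like_complex_password(candidate, min_len=8, max_len=12):
    """Check the two characteristics of a typical complex password."""
    # Characteristic 1: length within the expected range
    if not min_len <= len(candidate) <= max_len:
        return False
    # Characteristic 2: upper case, lower case, a digit and a special character
    has_upper = any(c.isupper() for c in candidate)
    has_lower = any(c.islower() for c in candidate)
    has_digit = any(c.isdigit() for c in candidate)
    has_special = any(c in string.punctuation for c in candidate)
    return has_upper and has_lower and has_digit and has_special


for word in ("12345678", "password123", "M5UWx/N-yjuZ"):
    print(word, looks_like_complex_password(word))
```

Of the three example passwords, only the last one passes all checks. A scoring library gives a finer-grained signal than this yes/no test, which is why a score threshold is more convenient in practice.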

Consuming results

The results of a DumpsterDiver scan are not only printed in the terminal window; all the findings are also written in JSON format (by default to the ‘results.json’ file).

JSON is pretty common and can be easily converted to any other format, e.g. to .csv, and then consumed by other systems or provided to your boss in a fancy Excel table 😉 The verbose output with any errors is written to the ‘errors.log’ file.
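As an illustration of the JSON to CSV step, here is a small sketch. It assumes the results file holds a top-level list of JSON objects, which is an assumption on my part about the layout, so adjust it to what your results file actually contains. The function name `json_findings_to_csv` is hypothetical, and nested values are kept as JSON text inside a single cell.

```python
import csv
import json


def json_findings_to_csv(json_path, csv_path):
    """Convert a JSON list of objects into a CSV file; return the row count."""
    with open(json_path, encoding="utf-8") as fh:
        findings = json.load(fh)  # assumed: a list of dicts

    # Build the header from every key seen, in first-seen order
    columns = []
    for item in findings:
        for key in item:
            if key not in columns:
                columns.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        for item in findings:
            # Nested structures are serialized so they fit into one cell
            row = {k: v if isinstance(v, str) else json.dumps(v)
                   for k, v in item.items()}
            writer.writerow(row)
    return len(findings)
```

Usage would be something like `json_findings_to_csv("results.json", "results.csv")`; records that lack a column simply get an empty cell.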

False positives

If your search is a general one (e.g. with the parameter --level 3), don’t be surprised if you see some false positives. I’ve added some basic rules to filter out the obvious ones, such as filtering out ordered alphabet characters or verifying that the finding contains digits (all keys always contain at least one digit). You don’t have to specify any flag to use those filters, as they work by default. Do you still see too many false positives? Let me know and I will try to add new filters.
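To give an idea of what such filters might look like, here is a rough sketch of the two rules mentioned above. It is my own interpretation for illustration, not the tool's actual code: the function names are hypothetical, and the run length of 6 used to detect "ordered alphabet characters" is an arbitrary assumption.

```python
import string

# Sequences whose consecutive slices count as "ordered alphabet characters"
ORDERED_ALPHABETS = (string.ascii_lowercase, string.ascii_uppercase, string.digits)


def has_ordered_run(candidate, run_len=6):
    """True if the candidate contains run_len characters in alphabet order."""
    for start in range(len(candidate) - run_len + 1):
        window = candidate[start:start + run_len]
        if any(window in alphabet for alphabet in ORDERED_ALPHABETS):
            return True
    return False


def is_obvious_false_positive(candidate):
    """Apply both rules: no digit at all, or an ordered alphabet run."""
    has_digit = any(c.isdigit() for c in candidate)
    return (not has_digit) or has_ordered_run(candidate)


for text in (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789",
    "ChuckNorriscountedtoinfinitytwentytwice",
    "lxRV/uiC4kmZQryIZxSSlQ6xNlZMjo4kn+LnjNiF",
):
    print(is_obvious_false_positive(text), text)
```

The first two strings are rejected (an ordered alphabet run, and no digit respectively), while the sample AWS key passes both rules and stays in the results.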

Awaiting your feedback

So how do you like the DumpsterDiver? Are you convinced enough to give it a shot? Are you missing any features, or maybe you have an idea how it could be done better? Any feedback is more than welcome!