Alexa to Dictionary (atod) and the robots.txt

We already know how to create pyhooks in Python, pulling information from the stack and understanding the structures handled by different Windows functions. We also know how to connect to a database from Python and store in it all the information coming from an e-mail accessed via IMAP. Now we are going to play with web access in Python using urllib2 and a bunch of other things.

The robots.txt file is sometimes misunderstood. You will sometimes find wildcards in it, and the spiders only consider these characters if they are in the user-agent field of the file. As a consequence, you can find really interesting things using some dorks in Google or Bing. We will talk about search engine hacking in another post but, in the meantime, here is a screenshot.


Using Dorks to locate juicy information in Google.

If you are managing a web platform or application, one of the cornerstones of your security should be strong access and authorization control. Even if you have a robots.txt, any “arachnid bot” will be able to index your information if it is accessible. Sometimes this basic point is neglected and, for different reasons, the robots.txt is used as a tool for defence by obscurity. Instead of properly configuring the permissions on sensitive locations, those locations are simply listed in the robots.txt, and that is the first place a cyber criminal will look.

Spiders may even decide not to obey the instructions given for their user-agent, or maybe whoever is reading your robots.txt is not a spider at all but an automated process looking for exactly the folders and files you don’t want indexed }:).
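To make the point concrete, here is a minimal sketch using Python 3's standard urllib.robotparser module. The rules, the bot name and the function name are made up for illustration. It shows that honouring robots.txt is a decision taken by the client: a polite crawler asks the parser before fetching, while nothing stops another client from reading the same Disallow lines as a list of targets.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
RULES = """User-agent: *
Disallow: /admin/
Disallow: /backup/
"""


def polite_crawler_may_fetch(rules_text, url, agent="ExampleBot"):
    """Return True if a well-behaved crawler would allow itself to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The check happens entirely on the client side; the server enforces nothing.
    print(polite_crawler_may_fetch(RULES, "https://example.com/admin/config"))
    print(polite_crawler_may_fetch(RULES, "https://example.com/index.html"))
```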

This is where atod comes to the rescue. In every web pentest, during the information gathering phase, we launch some spidering against the application, but it is good practice not to stop there. Directory brute forcing is highly recommended at this stage, since it can uncover vulnerabilities such as confidential information leakage and directory listing. Good dictionaries for this task are easy to find, bundled with tools like dirbuster, ZAP or Skipfish, but using them is not as cool as making one of your own :).

As part of the SPSE mock exam, the instructor asks you to access the Alexa top 1K and gather all the information on the disallowed folders, printing the top 40 at the end. I went a step further and coded atod.


Normal output of

Atod is a multiprocess Python script that analyses the disallowed folders found in the robots.txt files of the 1,000,000 most visited domains. It uses two different schemes, http and https, so it actually checks twice as many URLs as there are domains. As output it produces two files: one containing your brand-new dictionary to use with your preferred tool, and another containing not only the disallowed folders but also the number of times each was found. Both files are sorted so that the most relevant results are at the beginning of the file.

Specifically, this script automates the following sequence:
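As a rough illustration of the two-scheme idea, the sketch below builds both robots.txt URLs for a domain and fetches one of them while tolerating failures. It is not atod's actual code: it uses Python 3's urllib.request instead of urllib2, and the function names and the timeout value are assumptions.

```python
import http.client
import urllib.request


def candidate_urls(domain):
    """Build the robots.txt URL for a domain under both schemes."""
    return ["%s://%s/robots.txt" % (scheme, domain) for scheme in ("http", "https")]


def fetch_robots(url, timeout=5):
    """Return the body of `url` as text, or an empty string on any failure.

    Many domains will time out, refuse the connection or answer with an
    error page, so a crawler like this has to treat failure as normal.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        return ""
```

Returning an empty string on failure keeps the caller simple, since a robots.txt with no content yields no disallowed folders anyway.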

  1. Download the file from Alexa
  2. Uncompress the content
  3. Launch N parallel processes
  4. Each process will look for the robots.txt file for the different URLs
  5. Every disallowed folder is registered in a dictionary along with the times it has been found
  6. The dictionary is post-processed to avoid inaccessible routes in the server
  7. Finally all the folders are sorted and dumped to a file
  8. In parallel, the raw results are stored in a statistics file
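The non-network parts of this sequence can be sketched in a few functions. This is only an approximation under stated assumptions: the Alexa archive is assumed to contain a single CSV with rank,domain rows, the post-processing in step 6 is guessed to mean dropping wildcard patterns and entries that are not absolute paths, and the tab-separated statistics layout and all function names are invented for the example.

```python
import csv
import io
import zipfile
from collections import Counter


def read_domains(zip_bytes, limit):
    """Steps 1-2: read the first `limit` domains from a zipped 'rank,domain' CSV."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(archive.namelist()[0]) as raw:
            rows = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            return [row[1] for _, row in zip(range(limit), rows)]


def disallowed_paths(robots_text):
    """Steps 4-5: extract the value of every non-empty Disallow line."""
    paths = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()          # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value:                                  # empty Disallow allows everything
                paths.append(value)
    return paths


def clean(counts):
    """Step 6 (assumed meaning): keep only literal absolute paths."""
    kept = Counter()
    for path, seen in counts.items():
        if path.startswith("/") and "*" not in path and "$" not in path:
            kept[path] += seen
    return kept


def render_outputs(counts):
    """Steps 7-8: return (dictionary_text, statistics_text), most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    dictionary = "".join(path + "\n" for path, _ in ordered)
    statistics = "".join("%s\t%d\n" % (path, seen) for path, seen in ordered)
    return dictionary, statistics
```

Feeding each downloaded robots.txt through disallowed_paths and into one Counter, then calling clean and render_outputs, gives the two sorted files described above.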

Atod is initially configured to analyse only 1000 URLs, which takes just a couple of minutes. If you want to grab information from more than 100K, you will need to be patient, even with multiprocessing in place.

Trying to push the script to its limits, I crawled all 1,000,000 URLs and ended up with 804098 entries in my dictionary. Using the statistics file, I extracted the top 10 disallowed folders, and here are the results.
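One way to arrange the fan-out is sketched below with the standard multiprocessing module. NPROCESS and LIMIT are the constant names used by atod, but the value of NPROCESS, the helper functions and the round-robin split are assumptions made for this example, not a description of the real script.

```python
from collections import Counter
from multiprocessing import Pool

NPROCESS = 4    # constant name from atod; the value here is an assumption
LIMIT = 1000    # constant name from atod; 1000 matches the stated default


def split_work(urls, nprocess=NPROCESS):
    """Deal the URLs round-robin into `nprocess` roughly equal chunks."""
    return [urls[i::nprocess] for i in range(nprocess)]


def merge_counts(partials):
    """Add the per-process counters up into a single one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_parallel(worker, urls, nprocess=NPROCESS, limit=LIMIT):
    """Run `worker(chunk) -> Counter` over the first `limit` URLs in parallel.

    `worker` must be a top-level function so it can be pickled, and on
    platforms that spawn processes this should be called from under an
    `if __name__ == "__main__":` guard.
    """
    with Pool(nprocess) as pool:
        partials = pool.map(worker, split_work(urls[:limit], nprocess))
    return merge_counts(partials)
```

Since the work is dominated by waiting on the network, the number of processes can reasonably exceed the number of CPU cores; the right value is something to find by experiment.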


Top 10 disallowed folders in the top 1 million most visited domains.

A quick look at the results reveals really interesting things already in the top 10. Most of the sites don’t want the spiders to index anything. Whether they are properly protecting those locations is a different question… Rather odd technologies for modern applications show up, like cgi-bin, which is a possible entry point for really dangerous vulnerabilities like Shellshock. There is also a lot of WordPress underneath these sites, revealed by the presence of /wp-admin/ and /xmlrpc.php, the latter being itself a good place to look for vulnerabilities.

But if we go further and analyse the top 100 we will find really interesting locations like:

  • /user/password/
  • /?q=user/password/
  • /INSTALL.pgsql.txt
  • /INSTALL.mysql.txt
  • /installation/
  • /update.php
  • /profiles/
  • /?q=admin/
  • /api/
  • /test/

These are only a sample; I invite you to go through the results and make your own assessment. Atod takes no command-line parameters, but you can easily change the number of processes and the number of URLs to access once the file is downloaded. To do so, just change the constants NPROCESS and LIMIT in the source code. I didn’t want to add an argument parser, since the script is quite simple and it would add unnecessary complexity.


Atod constants to change.

Atod is multi-platform since it is written in Python, so feel free to run it by granting it execution permissions and typing ./ followed by the script name in Linux, or by invoking Python.exe on it in Windows.

I am not a Python expert, so the code can clearly be improved. If you find any error or problem running the script, please let me know and I will get back to you. Hopefully this is only the first version and there will be more in the future; it is just a matter of time. It would be really interesting to know whether those folders and files really return a 200 response code when requested, wouldn’t it?

You can download atod from my GitHub. Alternatively, you can find the link to my GitHub account at the bottom of this page. Note that the only file you need from the repository is the Python file; the .zip files in the same directory are my results from the top 1M analysis. I encourage you to run the script and retrieve your own results.
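The 200 check could look roughly like the sketch below. It is not part of atod: the function names are hypothetical, it uses Python 3's urllib.request, and it sends HEAD requests, which some servers answer differently from GET. Only run it against hosts you are authorised to test.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the final HTTP status code for `url`, or None if the request fails.

    urlopen follows redirects, so a 301 that ends in a 200 is reported as 200.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code            # 4xx and 5xx responses arrive as exceptions
    except (OSError, ValueError):
        return None                  # network failure or malformed URL


def reachable(base_url, paths, probe=http_status):
    """Return the paths that answer with a 200 under `base_url`.

    `probe` is injectable so the filtering can be exercised without network access.
    """
    base = base_url.rstrip("/")
    return [path for path in paths if probe(base + path) == 200]
```

Running reachable over the generated dictionary for a given target would separate the folders that really exist there from those that are merely common across the Alexa list.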

