An Algorithm a Day Keeps the SEGFAULT Away: 2012

Sunday, October 28, 2012

How I Cracked Reddit's CAPTCHA Pt. 2

Alright, so now that I went through the general background information, let's get into some specifics and actually interesting information!

To actually get a good sample set, I used a web scraping Chrome plugin in to go to the login page and download the page's CAPTCHA image. This method led me to having roughly 300 CAPTCHA images to aid me in cracking Reddit's security system.

From here, I decided to buy a textbook on Artificial Intelligence/Machine Vision. I ended up buying this book, which seems to be pretty popular and influential when it comes to AI. I read over algorithms related to machine vision such as edge detection, character recognition, ellipse detection, and more. With these algorithms in my back pocket, I decided to try my hand at the supposedly impossible.

Step 1: Remove Extra Colors and Background Noise

To give you an idea of how the program worked, I'll be providing an example here. The following image is a real CAPTCHA that came from Reddit.

QVHZUR

To begin, I decided to write an application that converted the image to one with 4 bpp color. I thought it would serve as a way to lower the amount of colors and hopefully make it easier to detect what was in the image. As luck would have it, the conversion actually really helped! The following is the image with 4 bpp color.

4 BPP CAPTCHA

You might not notice that much of a difference, but there's actually one very large difference between this and the previous image; the grid is now mostly a solid gray with a bit of white, rather than many different colors spanning white to gray. Also, this meant that the foreground letters were basically the only white objects in the image. So, with this knowledge in hand, I wrote a BFS-like algorithm that traverses the image's pixels looking for the white color (-1 in Java) and replaced it with black. Any other color was set to -1, or white. This lead to an amazing milestone; I had successfully removed the grid and created an image with only the letters and some artifacts!

Success!

Step 2: Removing Artifacts

Obviously, the next step was to remove those extraneous artifacts. To do this, I wrote a few algorithms to determine whether a black pixel was extraneous or not. One of these algorithms was to find every black pixel and check how many surrounding pixels were black. If I found many, then the pixel most likely belonged to a letter, and if there were very few, then it was most likely extraneous. Another of these algorithms was to count how many black pixels were in each row. If there were a lot, then the word was presumably there. If there were few, I would erase lines 0-i (i being the row with a few black pixels) or i-height, depending on where in the image the algorithm was.

Alright, getting there...

Alright, this ends part 2. Sorry about the crazy amount of posts, but I'm a student and I don't have a lot of time to write this all out immediately. The next part should cover separating the letters and possibly recognition of them. Thanks for reading and I hope you'll stick around for the next post!

Brandon

Thursday, October 25, 2012

How I Cracked Reddit's CAPTCHA Pt. 1

It was a brisk March day when I hurried into one of my favorite professor's offices to basically beg him to sign off on me having an independent research class so I could stay a full-time student. "Sure," he said, "but what exactly do you want to do?" My original idea was to research malware through NFC/RFID chips because I honestly believe that could be a big issue in the future (maybe more on that later). He responded that he had no idea about the subject, but allowed me the opportunity to research it for college credit.

"Maybe I'd learn something I could share with him," he questioned? "What's the harm in allowing a student to have college credit for researching security vulnerabilities?" Well, I decided to forego my original idea and instead crack the "homepage of the Internet."

NOTE: I will not be sharing any of my code for multiple reasons. One reason is that I do love reddit and would be saddened to see it get overrun by script kiddies and spambots. The second reason is that, if you're smart enough, you should be able to do this on your own. I'm not here to hold your hands to do something malicious; I'm here to inform people on how our current online security systems are unsafe.

A Little Background: What's a "Reddit?"

Reddit is the self-described "homepage of the internet." It's an extremely popular social media platform that involves users sharing, commenting on, and judging links to pictures, articles, music, videos, or self-typed posts. Basically, it's what your homepage should be. The site has millions of users and is currently ranked 136 on Alexa, though I imagine it should be much lower but Reddit's demographic probably doesn't install the Alexa Toolbar... You don't need an account to look at comments or links, but you do need one to comment, vote on links, "subscribe" to subreddits, and more. Creating an account is easy, as visible here:

All one needs to do is supply a username, a password, and a proper response to their CAPTCHA. Here lies our attack vector: to create an account on one of the most popular websites in the world, you only need to provide a name, password, and the response to a small vision-based puzzle.

A Little More Background: CAPTCHA, Eh?

A CAPTCHA is defined as a "Completely Automated PublicTuring test to tell Computers and Humans Apart" (thanks Wikipedia!). Basically, it's the funny-looking jumbled up letters you have to type in to "prove that you're human" on many sites. The point of these images isn't to annoy you; it's that they're supposed to be impossible (read: very hard) for computers to comprehend. If computers could understand how to decipher these images, then scripts could be written to create thousands or millions of spam accounts that would overrun a website. (See, this is why I'm not sharing my code)! Anyways, there are three types of obfuscation in a CAPTCHA:

Warped letters
Color-based tricks
Background noise

As you can see above, Reddit employs all of these. The laters are indeed warped and move in a way that is pretty abnormal. The purpose of this effect is that a machine would have a hard time thinking that slanted 'J' is a normal-form 'J.' They use a trick where the background is black, the intermediate level is white/gray, and the foreground letters are white/gray. This makes it extremely difficult for computers to understand what exactly is a letter and what's the background. Finally, the distorted grid in the back is obviously noise used to make it difficult for a machine to understand what's in the image. See, our sight allows us to view the image as a whole and know what we're looking at. Computer software isn't so lucky; it has to go through the image pixel-by-pixel and judge what it's looking at on a small scale rather than a holistic view.

This ends part one of my write up on how I cracked Reddit's CAPTCHA system. The next part should go over how I separated the letters from the background and then from each other. I hope this series is interesting to some, but if not, at least it's interesting to me!

Interview Material: What's the deal with Linked Lists?

Hello! I'm going to start a series on my blog (well, technically it will be all that's here as of now) where I go over a lot of the information one needs for a Software Engineering job interview. My first post in the series will be going over Linked Lists, their uses and how to implement them in Java. So, let's start then!

What exactly are Linked Lists, you say?

In computer science, there are many abstract data types that are used to store and access data. One of these numerous abstract data types (which will be referred to as ADT from now on) is a Linked List. A Linked List is a pretty fundamental concept that most applicants should understand. The structure is pretty simple and can be summarized in a bullet-point list:

A collection of "nodes" that point to each other, where the last node points to NULL
Usually, the structure comes in 2 types, a singly linked list or a doubly linked list

Singly linked lists only have one entry point, and it's usually the "head" (or first) element

Usually, each node only points to the next node

Doubly linked lists have a head and a tale element so the list can be accessed from either side

Each node can point to the previous and the next one, allowing you to traverse the list from the tail

Inserting takes O(1) time for doubly linked lists or O(n) time for singly linked lists, while inserting in place takes O(n) time.
Likewise, deleting the head (or tail in a doubly linked list) takes O(1) time while deleting in place is O(n)
Searching the structure for a node takes O(n) time
Memory-wise, the structure only takes up roughly O(n) space.

Alright, so let's discuss these runtimes. To have an example on hand, we'll use the following diagram as a representation of a singly linked list. Please not that the bolded integer is the head and the arrows are pointers to the next node.

5 -> 4 -> 11 -> 32 -> 2 -> 15 -> 4 -> NULL

First, I mentioned insertion. Insertion's runtime is all based on the linked list's implementation. If the implementation uses a tail node (doubly linked list), then insertion is just pointing the tail node to the new node and then swapping them. This should take constant time. To insert a node in a singly linked list, the structure has to go through all of the nodes until it gets to null and then appends the new node. This would obviously be in O(n) time.

Next, I mentioned deletion. Deletion's runtime is O(1) because it usually is implemented as removing the head or tail node, which would be O(1) time. To accomplish tail-end deletion, you would write a function that points the tail node's previous node's next node to NULL (say THAT 10x fast!). Though, deletion of a certain node would involve traversing to that node in the list, and then set the previous node's next node to its next node's next node. Yeah. Head hurting yet? Here's sample code on how to delete a specified value:

public Node deleteNode( Node head, int data) {
Node n = head;
if (n.data == data) {
return head.next; //Return LL without head node
}
while (n.next != null) {
if (n.next.data == data) {
n.next = n.next.next; //Delete n.next, which is the node we want
return head;
}
n = n.next;
}
return head;
}

Now, we went over a bit of searching in the code up there. If you remember, I noted that it's runtime is O(n). If you see how we traversed the linked list above, you would notice that we have to go to every node looking for a certain value. The worst case scenario in searching an n-long list for a value d is that d is at the end. That means we would have to go through the entire length of the list, which means search is dependent on the length of the list. Therefore, search is O(n).

Great, I understand how to implement one, but why?

Linked lists aren't inherently useful, themselves, but are useful in implementing other data structures. Both stacks and queues are often implemented using linked lists as their underlying structures. Also, a hash table can use a linked List to store the chain of items associated with the same hash value. Please note that these are just a few example usages of linked lists and are hardly the only use cases!

I realize I haven't implemented the entire class for a linked list, but I promise I'll upload the code soon! I'm trying to figure out how to upload code to Blogger without it looking insane, so hopefully there's a solution and it'll be up soon! Thanks for reading my first post, I hope to have many more soon!