Step Three: Privacy!
Status Update - pyptlib

Deliverable #1 for this summer’s project is this:

“A library for parsing pluggable transport configuration options

This will be a python library that authors of SOCKS proxies can use to integrate their proxies with Tor.”

A first pass at this is available from github and pypi.

I tried to follow the spec very closely in developing the API.

Next steps for the library are testing, improvement documentation including pydoc strings, and error checking for higher level protocol conformance, for instance the order in which proxy reply lines are output.

After that, it will be time to use the library in building the pluggable transport framework.

New GSoC 2012 Project! Pluggable Transports in Python

Here is my introductory post to tor-dev announcing the new project:

Some information about me:

I worked for EFF/Tor Project last year for GSoC 2011, my project was a blocking-resistant transport evaluation framework: https://gitweb.torproject.org/user/blanu/blocking-test.git

I am also the author of a pluggable transport written in python: https://github.com/blanu/Dust/tree/master/py/dust/services/socks2

I’ve been working on censorship resistance technology since 2001. Here are some of my projects:
http://blanu.net/Dust-FOCI.pdf
http://blanu.net/BayesianClassification.pdf
http://blanu.net/Arcadia.pdf
http://blanu.net/Freenet2001.pdf

Some information about the project:

The overall goal of the project is to make it easy for pluggable transports to be written in python. There has been a lot of interest in doing pluggable transports in python, but currently they are all written from scratch. For C transports, obfsproxy can be used to do a lot of the heavy lifting, making it relatively easy to write a new C-based transport. I’ve heard there is also a port of obfsproxy to C++. A the author of a python transport, I am of course an advocate of writing transports in python. Fortunately, so are some other Tor folks, so soon it will be easy to write python transports!

The deliverables for this project are as follows:

A library for parsing pluggable transport configuration options

This will be a python library that authors of SOCKS proxies can use to integrate their proxies with Tor.

A framework (both server and client-side) for writing pluggable transports in python

The framework will provide a SOCKS proxy server already integrated with the pluggable transport library. All the protocol author will need to do is provide the obfuscation and de-obfuscation functions and a main function to do command line parsing and call the framework.

A python implementation of the obfsproxy command line tool

This will be a command line program using the framework that will accept the same command line options as the existing obfsproxy tool. It will support the selection of an obfuscation function, although not all of the protocols currently supported by obfsproxy will initially be available in python.

A python implementation of the obfs2 protocol implemented as an obfsproxy module

The obfs2 protocol will be implemented as a plugin for the framework and made available to the command line tool.

Conversion of Dust to an obfsproxy module

The Dust protocol will be implemented as a plugin for the framework and made available to the command line tool.

py2exe packaging for obfsproxy

The command line tool will be packaged into a standalone executable for Windows.

Optional deliverables if there is sufficient time: obfsproxy modules for other protocols, experiment with other packaging systems


Current status:

I’m working on a spec of the API for the option parsing library. It should be available soon.

Google Summer of Code Final Status Update

Last week I said that I was going to work on finishing up some graphs this week, but I decided that was not the best use of my time. Graphs are cool and interesting, but I decided instead that I should work on transitioning to doing things useful to the general Tor developer community. So instead I decided to try to work on closing bug #2860 using the blocking-test framework. The bug asks for research to be done on connection patterns of Tor connections. In order to close this I need to add Tor (without obfsproxy) capturing and a SYN/FIN/RST packet counting detector, as well as some new traffic generation tasks. So this is my new goal for the first post-GSoC task for the project.  One fortunate thing is that we just received about 1GB of traffic data for Tor and SSL connections, so we’re currently working on parsing that into the blocking-test format and doing some analysis on it. Actually, with this data I may be able to close bug #2860 by just writing a SYN/FIN/RST detector, so that will save some time.

So Summer of Code is now over, but the project continues as a part of the greater Tor project. Hopefully it will be useful for the Tor community.

Google Summer of Code Penultimate Status Update

Scoring and reporting for detectors and encoders is pretty much done. The graphs aren’t quite ready yet, for aesthetic reasons, but I have some nice tables to report:

Length detector: 16% accuracy
Timing detector: 89% accuracy
Entropy detector: 94% accuracy

SSL was correctly identified 25% of the time, 70% of the time other protocols were misidentified as being SSL, and 5% of the time SSL was misidentified as another protocol.
obfsproxy was correctly identified 55% of the time, 10% of the time other protocols were misidentified as being obfsproxy, and 35% of the time obfsproxy was misidentified as another protocol.
Dust was correctly identified 48% of the time, 4% of the time other protocols were misidentified as being Dust, and 48% of the time Dust was misidentified as another protocol.

obfsproxy is distinguishable from SSL 96% of the time.
obfsproxy is distinguishable from Dust 98% of the time.
Dust is distinguishable from SSL 56% of the time.

I’m sure these numbers are confusing as several different sorts of things are being measured, all with similar numbers. It is probably quite unclear how all of these percentages add up. I think this is where the graphs will help. In particular I think it’s hard to visualize accuracy when there are both false positives and false negatives to take into account.

I think the most interesting result here is the entropy detector. It’s the simplest and also the most effective. Not only that, but the entropy detector only looks at the entropy of the first packet, so it’s inexpensive. The original reason that I implemented this detector was that there was an argument against Dust that it could be trivially filtered by setting an entropy threshold above which all traffic is filtered. Surprisingly, Dust has lower entropy than SSL, so this attack will not work to specifically target Dust. This is unexpected because Dust uses totally random bytes, whereas the SSL handshake does not. This is a result of a general issue with how Shannon entropy is defined. Shannon entropy is normally defined across a large statistical sampling, in which case all three protocols tested will converge to maximum since they are all encrypted and therefore apparently random. However, the attacker does not in this case get a large statistical sample. Since they don’t know a priori which connections are Dust connections, they only get however many packets are in each trace to use as samples. The current version of Dust that I’m testing has most of the advanced obfuscation stripped out as I haven’t had time to add it back in after completely reimplementing the protocol to use TCP instead of UDP and all of the modified assumptions that come with that change. So this version of Dust is basically just the ntor protocol with no packet length or timing hiding in use. The first packet is therefore 32 random bytes consisting of the curve25519 ephemeral ECDH key, as compared to the 1k first packet of SSL. This is the reason for the low entropy calculation, because given only 32 bytes to sample almost all possible bytes are going to have a probability of 0 or 1. There is an upper bound on entropy when sample sizes are small. Achieving maximum first-order entropy here would require at least 512 bytes so that each value 0 to 255 could occur exactly twice.

There are of course many ways to measure entropy and you could say that I’m just doing it wrong. Some alternative measurements have already been suggested that I might implement as well. I am not suggesting that Dust is in any way immune to entropy-based detection. In fact, Dust is currently detected very well by the current entropy detector (it passed 100% of the packet length and timing tests, so all detection of Dust was done by the entropy detector) precisely because it has strangely low entropy because of it’s small first packets. I bring this up only as an interesting example of how intuition can fail when trying to use Shannon entropy to measure individual messages instead of channels.

As far as the state of the project goes, I think the project can be considered done. Looking at the project documentation, everything didn’t work out exactly as originally planned, I think it all worked out for the best and the project is where I wanted it to be by the completion of Summer of Code. I have generation of HTTP and HTTPS traffic, both over Tor and plain. I have two encoders: obfs2 and Dust. I have three detectors: length, timing, and entropy. The string detector didn’t work out, but the entropy detector worked out better than planned, so on balance I think it was a win. I didn’t implement all of the detector modes because upon further reflection it doesn’t really make sense to just try all of these different modes. You want the mode that works best. So for packet lengths and timing that is full conversations and for entropy that’s the first packet (could be first N packets, but I tried N=1 and it worked great). Random sampling of packets just seems like a way to limit your success at this point when non-random sampling of just the first packet seems to be so effective. All of the utilities have been written, and more. I have the specified utilities for separating the pcap files into streams, and extracting the strings, lengths, and timings. Additionally, I have some utilities for extracting the entropy of streams, tagging the streams with what protocol they are using (by port), and extracting dictionary and corpus models for latent semantic analysis. I also have some utilities for graphing distributions of properties of different protocols which I hope to finish up next week during the wrap-up period.

Looking to the future, I have changed my goals somewhat from what I had in mind at the beginning of the project. After having attended the TorDev meetup in Waterloo and the PETS conference, I am much more interested in integrating my work with the Tor project instead of just testing the various protocols that have been proposed in order to see how my own protocol stacks up against the competition. As part of this transition, I have rewritten Dust from scratch, both in terms of implementation and in terms of the protocol, in order to be better suited for use as a pluggable transport for Tor. My goal for Dust is to have it shipped with both Tor clients and bridges so that it can be used in the field to contact bridges despite DPI filtering of Tor connections. My goal now for the blocking-test project is to focus on blocking resistance for Tor. So instead of adding all of the protocols under the sun, I’m going to concentrate on protocols that have been implemented or are under consideration to become pluggable transports. Additionally, the next protocol I’m going to add after GSoC is over is just plain Tor. There is at least one bug filed about a suspicion that Tor can currently be fingerprinted based on some packet characteristics and I’d like to be able to close that bug because we now have the capability to easily generate the necessary information.

Finally, I would like to offer some insight into what I think is the future of blocking-resistant transports now that I have done some work trying to break them. Fundamentally, the only thing an attacker can do is limit the bitrate of the connection. Given any sort of channel, we can encode arbitrary information over that channel. However, the smaller the channel the slower the bitrate. The goal of encodings should be to maximize the bitrate given the constraints of the censor. It’s easy to develop very slow encodings such as natural language encodings over HTTP. However, in order to maximize the throughput you need to have a good definition of the attack model. All encoding for blocking-resistant purposes lowers the efficiency above just transmitting the normal content (assumed it has already been compressed). Therefore all encoded conversations are going to be longer than unencoded conversations. I think this is the future of writing detectors because it’s a fundamental constraint on encoding. I think this should even be able to defeat something like Telex (although let’s not get into the details of attacks on Telex specifically, I’m talking about a general idea here). Given a (static, for the sake of argument simplicity) website, I can download every page on that website. Then I can compare the length of intercepted downloads and detect when they don’t match up. For dynamic websites you can do the same thing with a statistical model. I think, therefore, that the future of blocking-resisting protocols is to encode single logical connections over an ensemble of multiple actual connections. Ideally these connections would mimic normal traffic in terms of number of connections, duration, directional flow rates, etc.. It’s something to think about anyway. It could be an interesting problem.

Post-TorDev Status Update

With the TorDev meetup and everything it’s been a while since I sent out an update on the state of the project, although I think we more or less caught up on the status at the meetup. The bulk of the implementation work had already been completed, so what I’ve been working on since then is just cleaning up the code and rewriting some parts. In particular this week I rewrote the detectors to be mostly python. The core Bayesian model is still in BUGS/JAGS and there is still R code to interface with that. However, the output of the models after training is now converted to a CSV file and the actually detection code is a python program which reads the CSV file with the trained model’s prediction data and the compares it to the data of a stream to be classified. This means that you only need R for training the models. If we can put together a canonical model for each protocol then you can use it to classify new traffic just using python. I think this is a win because R was the bulkiest dependency. It also segments the process more nicely in that there can be some people gathering traces, other people building models, and then other people running detectors against new traces.

In terms of code cleanup, I’ve been adding comments and refactoring large files (in particular pavement.py) to be more readable. I also have the scoring tasks implemented so that, given the output of the detectors, the various detectors and encoders are scored based on their number of correct guesses, false positives, and false negatives. The only functionality left to implement is some sort of reporting, probably in the form of graphs of the scores so that detectors and encoders can be compared. The rest of the work is just going to be refining everything and getting both the workflow and the code structure to be smooth.

Also, I’ve stopped working on the string detector because I realized that it’s irrelevant for the two encoders I actually have support for right now: obfs2 and Dust. They both already use totally randomized packet contents and so will not be vulnerable to this kind of detector. Other encodings will be vulnerable, but the string detector can wait until we support these encodings. So instead I’ve implemented an entropy detector, which should work very well against both obfs2 and the current implementation of Dust. The entropy detector is actually the easiest model so far as the entropy of a trace is just a single number, so that’s convenient.

Week #7 Status Update

This week I worked on finishing up the detectors. The main change is that they now output CSV files with the results. I also discovered some very interesting things in my research this week:

I’d like to include the HTTP and PERSEUS transports, but I guess we’ll see if time permits.
The pluggable transports spec I thought I had read before, but I totally missed some key elements, mainly that full Dust (including invites) is possible with the existing spec. This has renewed my enthusiasm for getting Dust working as a transport now that I realize it can exist fully on par with obfsproxy without having to be rewritten in C. I’ve started work on a new TCP implementation as the old implementation was biased towards UDP and this made it difficult to get the SOCKS proxy working well.
I’d like to fulfill those trace requests by adding support for Chromium traffic generation. The hard part is actually finding appropriate websites, both plain and JS-intensive that allow access over HTTPS without requiring a login.

Next week I’m going to implement scoring of traces and I think also some sort of scoring of detectors. I’ve noticed, for instance, that for whatever reason my “bag of words” detector doesn’t seem to be working very well. I will attempt to fix it, but it would also be good to have some way to automate evaluation of how well different detectors work. I have a start for this in the validateTimings.r file, where I just run the detector on the training data and see how well it does then. I just need to tie this into a scoring mechanism so that it just gives one number for each detector that I can look at to judge overall goodness of fit. I think this will probably just be an accuracy number, perhaps with # of false positives and # of false negatives available upon request.
Week #6 Status Update

This week I finished up the first draft of the detectors. I finished up the Bayesian detectors with the addition of the timing detector. I did this one a little differently than the previous ones. With the length and string detectors I used a multinomial distribution where the distribution of all options were counted across a conversation. This didn’t feel right for timings, so instead I used a Poisson distribution and instead of modelling the whole conversation I treat the model as a just a generator of timings. This sort of model is not really what I think people normally do with timing information in order to fingerprint protocols. The reason I went with this model is because what I’m trying to build detectors for is statistical properties of protocols. Timing is usually used in semantic detection of protocols, for which you need to already suspect what the protocol might be. For instance, if you think that an SSL handshake might be Tor then you can look a the timings of specific packets in different phases of the handshake to fingerprint the Tor implementation or specific usage of SSL.

The reason I decided to go with statistical properties instead of semantic properties (where you need to know the details of the specific protocol) is that they are the most generally applicable to all protocols. Semantic detection only works for one protocol, and probably only one implementation and version of that protocol. Fortunately, this focus is not in any way a limitation of the blocking-test framework. I’m just trying to lay down a good base for protocol detection. There is an endless parade of semantic detection code that hopefully will eventually follow as people publish papers on new detection techniques tailored specifically to detect Tor.

In other news, I found the prospects gensim package irresistable and found the time to implement a “bag of words” string detector. I didn’t end up using LDA because LDA is for topic detection and while I can imagine abstractly how this could be used for protocol classification, I was unclear on how to actually implement this. However, I did manage to get a Latent Semantic Indexing (LSI) model running. So far the results are not as good as with packet lengths. This is somewhat to be expected as obfsproxy is designed to have no detectable strings and SSL will vary depending on the server. I hope it will produce good results for detecting Tor (without obfsproxy) once I get more protocols captured, although I’ve heard that newer versions of Tor already attempt to thwart this by looking like Apache, so maybe it won’t work after all. The use of LSI for protocol classification is, as far as I know, untested, so it will be interesting to see what kind of results it gets. For strings of 1 to 4 bytes, we can compare it to the Bayesian string detector, although I think it should be able to go up to very large strings, certainly as large as is useful.

Next week I’m going to work on cleaning up the detectors by polishing up their output and the tasks that run them. I’m also going to work on some slides summarizing this work for the Tor Developer Meetup and PETS.

Week #5 Status Update

This week I continued working on detectors. My main focus was the string detector. Unfortunately, the string detector ran up against the limits of my current detector methodology. Using a multinomial model for strings means that for a sequence of N bits the model has 2^N dimensions. This is represented in my current code as an array of size 2^N. It worked fine for 1 and 2 byte substrings, but 4 byte substrings crashed R, most likely because I had the 32-bit version of R. I spent some time adding a paver task to download and compile R so that I could get the 64-bit version working, but this is most likely a dead end as increasing N will simply cause me to run out of memory.

I have a number of ideas for how to solve this problem, but I’m not sure which will work best. The most obvious is to use some sort of sparse array data structure since the vast majority of entries are zeros. I’m not sure if this is possible to do since I need this data structure to work with both R and Jags. I think that internally they are just not designed to work this way and I don’t really want to start messing around with their internals. Another solution is to bin the actual strings into a workable number of bins, say 2^16. This will give an increasing number of false positives as N increases, but maybe 2^16 bins is enough to work. It’s also possible that 4 byte strings are plenty long to uniquely identify protocols and that 64-bit R will be an inefficient but workable solution.

Another alternative was mentioned to me by a friend I ran into last night that does data mining. He reminded me that the substring matching I’m doing is analogous to search engine ranking algorithms and document similarity analysis as they also use a “bag of words” model. In particular, since I’m using a Dirichlet prior, he suggested that I consider Latent Dirichlet Allocation (LDA). I did find a nice python package for LDA and related algorithms, gensim, so that’s promising. I’m hesitant to stray too far from Bayesian modelling because I don’t understand all of the implications of switching to LDA for classification. I did take a search engines class and so am familiar with the variety of search engine algorithms commonly in use. However, I’ve never considered them in this context for protocol classification and there may be some hidden assumptions that make them inappropriate for this task. On the other hand, one thing I learned in the search engines class is that you use whatever gives the best results.

I don’t want to wander too far into the forest of machine learning and classification algorithms, as my project is not to create the world’s greatest protocol classifier, but rather to have a framework to run protocol classifiers on traces. I think therefore that I’m going to leave my 1 and 2 byte string detectors as they are and move on to the timing detector. The in the following week I’ll try to implement something that can handle larger strings, perhaps as an additional detector. For instance, I might write an LDA detector, or something similar if I find a more appropriate algorithm between now and then.

Bayesian Classification of Internet Protocols

My focus on the project at the moment is writing detectors which will classify traffic traces by protocol using the characteristics of the packets. I thought it might be useful to write about how exactly I’m doing this classification.

A typical way to do classification is to write some rules manually that specify what packet characteristics should cause them to be classified as one protocol or another. However, in order to do this you must already know what the rules should be. This can also be done manually by examining traces and looking for patterns. Unfortunately, I haven’t already spent the time pouring over traces to know what the rules should be and I don’t have time to do it now. My job is to write code, not to look through traces. However, I need some detectors to plug into my testing framework in order to see if its working. So what I need is some automated classifiers which I can train on some traces that have already been tagged by protocol so that they can then automatically tag new traces.

This is essentially a machine learning problem, and this is a field that is both wide and deep. Using machine learning to classify traces by protocol could probably be the topic of a number of PhD dissertations, and it probably already has been. For my own purposes in building out the testing framework I’m going to stick to what I know, which is Bayesian simulation using MCMC methods.

The overall strategy is pretty simple. You construct a model of the protocol which includes some hidden variables of interest. You feed training data to the model to generate an estimate of the distribution of the hidden variables. Then you generate some new, predictive data from the model using the estimates for the hidden variables. You compare the predictive data to the new data that you want to classify and compare the fit. There is one model per protocol and whichever model fits best is the classification.

As an example, consider the packet length detector. Each packet stream has a distribution of packet lengths which follows a multinomial distribution. (For a particular packet, it would be categorical, but for a whole stream it is multinomial.) This distribution is easily visualized as just a histogram with a bar for each possible packet length from 0 to 1500 where the height is the count of the number of packets observed of that length. Once the training data is used to generate a probability distribution across packet lengths, we can simulate any number of conversations, obtaining specific counts for packets of different lengths. New conversations that need to be classified can be compared to simulated conversations from different protocols to see which protocol is the closest match.

Week #4 Status Update

The focus for last week was starting to write detectors. I had to pause for a while to rewrite the traffic capturing code to not use loopback/localhost as it was giving atypical (large) packet sizes in the results. I’ve now rewritten it so that obfsproxy server and the Tor bridge run on one machine and obfsproxy socks and the Tor client on another so that all obfsproxy communications will be over the Internet, resulting in typical packet sizes. All of the remote control of the other machine is done using a combination of Fabric and paver scripts, so it’s all still automated with the same tasks as before.

The first detector that I’ve written is the packet length detector. I’ve trained it on an initial set of traces and then used it to classify some new traces and the results are okay, but not yet ideal. It succesfully classifies SSL traffic as being SSL, but for obfsproxy traffic it is sometimes classified as obfsproxy, sometimes inconclusive, and sometimes misclassified as SSL. There are a number of refinements that I can do to increase accuracy. First, I might not have enough data as each run only generates two obfsproxy connections, as opposed to around 10 for SSL, because all traffic goes over a single obfsproxy connection. Second, it might help to bin the packet lengths, although I’ll need to experiment to find the optimal bin size. Additionally, I probably need greater diversity of traffic. With the current tagged traces I can only tell if a protocol is either SSL or obfsproxy (or inconclusive). It would help to have traffic from other protocols to get a good baseline to compare against. So I might add, at a minimum, non-obfsproxy Tor traffic and unencrypted HTTP traffic.

This week I’m going to be continuing the work on detectors. I might spend some time on these optimizations I’ve mentioned, but the main focus is going to be on completing the first draft of the detector set by implementing the string and timing detectors.