Twithinks Blog: A Brief Introduction


We are a team of students who have been playing with data we've gathered from Twitter for a while. Attracted by the hype around the 2012 Presidential Election, we wondered what it would be like if the President were elected by Twitter users. We tried to see what we could find with some simple analysis, and discovered lots of things that were deeply interesting to us. We hope you find them interesting, too. Some results are presented in this blog, and you can play with the interactives on the home page. Please leave any comments or feedback you might have on Hacker News.



Big Data DIY (for techies) Oct 8, 2012


In the era of cloud storage and computing, scaling big data analysis is really not that difficult. Hosting a few terabytes on AWS or Google is almost standard procedure, as long as you have the money to pay for the servers. But today we want to show that you can build a database system for big data even on a student budget.

Here are some rough numbers about the volume of Twitter data we have been dealing with:

Tweets: 2TB and growing
Users, relations and other data: ~80GB

In total, there are billions of rows. They are stored on a single machine that costs only about 3000 dollars. Here is what we learned along the way:

Hardware

Of course, we had to put together the hardware ourselves. The specs of the machine we use are: 32GB RAM, 2 x 128GB SSDs in RAID 0, and 4 x 3TB hard drives in RAID 10. That gives three layers of storage: memory, SSD and HDD, in decreasing order of performance and cost. Naturally, we wanted to keep the most frequently accessed data, such as database table indices and our client-facing data, in memory. We used our SSD array for most non-volatile storage except for tweets. Tweets, which are usually scanned sequentially, conveniently reside on the HDD array.

Software

There are many database systems to choose from, but on a single node, good old MySQL is probably the best choice. Putting a few billion rows in a table usually bogs it down, and the main bottleneck is that you can't load the entire index into memory. We decided to "divide and conquer" and deployed a variety of tricks: partitioning, bulk loading, logging and sorting by index. They are implemented in Java as an optimization layer on top of MySQL. The chart below shows how the system works as a whole.
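
A minimal sketch of two of these tricks, partitioning and sorting by index ahead of a bulk load, might look like the following. It is an illustration in Python, not the Java layer described above. The one-table-per-month scheme, the column names and the table names are all assumptions made up for the example.

```python
import csv
import io
from collections import defaultdict


def partition_key(tweet):
    # Assumed scheme: one partition (table) per calendar month, e.g. "2012_10".
    return tweet["created_at"][:7].replace("-", "_")


def build_bulk_files(tweets):
    """Group rows by partition, sort each group by primary key, and
    serialize each group as tab-separated text for a bulk load."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[partition_key(tweet)].append(tweet)

    files = {}
    for part, rows in groups.items():
        # Rows that arrive in primary-key order are appended to the index
        # instead of being inserted at random positions.
        rows.sort(key=lambda r: r["id"])
        buf = io.StringIO()
        writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
        for r in rows:
            writer.writerow([r["id"], r["user_id"], r["created_at"], r["text"]])
        files["tweets_" + part] = buf.getvalue()  # hypothetical table name
    return files


# Made-up rows, for illustration only.
sample = [
    {"id": 30, "user_id": 7, "created_at": "2012-10-02 09:15:00", "text": "debate night"},
    {"id": 10, "user_id": 3, "created_at": "2012-09-30 23:59:00", "text": "hello"},
    {"id": 20, "user_id": 5, "created_at": "2012-10-01 00:01:00", "text": "Romney in CO"},
]
files = build_bulk_files(sample)
for table, tsv in sorted(files.items()):
    print(table, len(tsv.splitlines()), "rows")
```

Each resulting file could then be handed to a bulk loader such as MySQL's LOAD DATA statement, one file per partition, so that no single index has to hold every row.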




Twitter Mentions & Republican Primaries Oct 9, 2012


To start with something simple, we look at the Twitter mentions of the election candidates. It is surprising that even such a simple statistic reveals lots of insights. The following graph counts the Twitter mentions of the word “Romney” on a daily basis during the Republican Primaries. A few patterns to note:



Now let's look at each candidate's share of Twitter mentions relative to the others. The next plot shows the Republican primaries, a race among Romney, Santorum, Gingrich and Paul.

Wow! The winner of each primary tended to be the candidate with the greatest number of Twitter mentions or the strongest upward momentum.
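
For the curious, a relative-mention statistic like this can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a tweet counts as a mention if it contains the candidate's last name as a whole word, and the input is a list of (day, text) pairs. The function and variable names are made up for the example, and naive matching on a name like "paul" will also catch unrelated people.

```python
import re
from collections import Counter, defaultdict

CANDIDATES = ["romney", "santorum", "gingrich", "paul"]


def daily_mention_share(tweets):
    """Return {day: {candidate: share of that day's candidate mentions}}."""
    counts = defaultdict(Counter)
    for day, text in tweets:
        words = set(re.findall(r"[a-z]+", text.lower()))
        for name in CANDIDATES:
            if name in words:
                counts[day][name] += 1

    shares = {}
    for day, counter in counts.items():
        total = sum(counter.values())  # always > 0: days without mentions never appear
        shares[day] = {name: counter[name] / total for name in CANDIDATES}
    return shares


# Made-up tweets, for illustration only.
sample = [
    ("2012-01-10", "Romney wins New Hampshire"),
    ("2012-01-10", "Ron Paul finishes second, Romney first"),
    ("2012-01-10", "pancakes for dinner"),
    ("2012-01-21", "Gingrich takes South Carolina"),
]
shares = daily_mention_share(sample)
for day in sorted(shares):
    print(day, shares[day])
```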

At first sight, it might seem crazy that the noisy micro-blogosphere, in which people complain about overbaking their pancakes or brag about their vodka melon hangovers, has any predictive value for any significant event. Moreover, more serious researchers will point out various biases, such as user demographics, negative tweets, etc. However, on second thought this might not be too bizarre an idea after all. Being mentioned a lot may be good or bad, depending on how favorable the tweets are. But not being mentioned is certainly bad; that means no one cares about the candidate. Certainly, this won't explain away all the biases, and the reason behind the correlation is definitely worth further investigation.




Twitter Mentions & the Presidential Race Oct 19, 2012


Now that we have tested our muscles, we are ready to move on to the big game, the Presidential Election. Here is a simple plot of the percentage of Twitter mentions for the two candidates since January.


Not surprisingly, Obama, as the incumbent President, receives way more mentions than Romney most of the time. Nonetheless, there are a few spikes in Romney’s mentions that are worth investigating. What are some events that got Romney more attention in the Twittersphere? Are they good for Romney? We wrote some code to help us identify some of the major events:
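
That code isn't reproduced here. As a rough illustration of how spikes in a daily series can be flagged, here is a minimal Python sketch that marks a day when its value sits far above the trailing week's average. The 7-day window, the 3-standard-deviation threshold and the function name are assumptions for the example, not the method actually used for the plot.

```python
from statistics import mean, pstdev


def find_spikes(series, window=7, threshold=3.0):
    """Return indices whose value exceeds the trailing-window mean
    by more than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma > 0:
            is_spike = (series[i] - mu) / sigma > threshold
        else:
            # Flat history: any increase at all stands out.
            is_spike = series[i] > mu
        if is_spike:
            spikes.append(i)
    return spikes


# Made-up daily shares of mentions, for illustration only.
daily_share = [0.30, 0.32, 0.29, 0.31, 0.30, 0.33, 0.31, 0.55, 0.32, 0.30]
print(find_spikes(daily_share))
```

Once a day is flagged, the most frequent words or hashtags in that day's tweets are a natural place to look for the event behind it.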

Well, Twitter is full of interesting information, blended with sarcasm, humor and irony. It reminds me of a quote:

With the absence of the false, you have the truth and noise.




Putting a GPS Tracker on Romney Oct 20, 2012


We'd like to test an interesting hypothesis: Twitter knows Romney's whereabouts. The idea is that if Romney is present in a state on a given day, there should be a jump in that state's share of Romney mentions compared to the nation-wide share. Examining the jump should therefore let us track Romney's campaign schedule. Based on the magnitude of the jump, we estimated the probability of Romney having an event in each state, or having no event at all. The following table shows the three most likely estimates for each day. Romney's actual location is highlighted in green.

Date      Most Likely      Second Likely    Third Likely
Oct 1     CO 62%           MI 13%           No event 11%
Oct 2     CO 57%           No event 15%     MI 10%
Oct 3     IN 37%           NC 26%           No event 14%
Oct 4     IN 32%           PA 29%           No event 20%
Oct 5     No event 31%     OH 26%           FL 19%
Oct 6     FL 37%           No event 21%     MI 18%
Oct 7     NC 42%           FL 27%           No event 16%
Oct 8     VA 31%           NC 25%           No event 21%
Oct 9     OH 45%           PA 17%           No event 16%
Oct 10    OH 33%           PA 22%           No event 21%
Oct 11    NC 43%           OH 20%           IN 18%
Oct 12    OH 32%           NC 27%           No event 21%
Oct 13    OH 38%           NC 29%           No event 16%
Oct 14    FL 31%           No event 22%     WI 21%

The results are surely not earth-shattering, but they're fairly accurate. This is an illustration of the power of big data in practice. Even though we keep only a tiny amount of information from each tweet, the accumulated information still allows us to distill fairly strong signals that reflect real-world events.
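
One simple way to turn "size of the jump" into probabilities is sketched below in Python. To be clear, this is an assumed stand-in and not the estimation behind the table: each state is scored by how far its share of Romney mentions sits above the national share, "No event" gets a fixed score of zero, and a softmax turns the scores into probabilities. The `scale` parameter and all names are hypothetical.

```python
import math


def location_probabilities(state_share, national_share, scale=20.0):
    """Score each state by its lift over the national share and
    normalize the scores (plus a 'No event' option) with a softmax."""
    scores = {"No event": 0.0}
    for state, share in state_share.items():
        scores[state] = scale * (share - national_share)

    top = max(scores.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def top_three(probs):
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:3]


# Made-up shares of Romney mentions on one day, for illustration only.
national = 0.35
by_state = {"CO": 0.45, "MI": 0.37, "OH": 0.34, "FL": 0.33}
probs = location_probabilities(by_state, national)
for place, p in top_three(probs):
    print(place, round(p, 2))
```

A larger `scale` makes the estimate more confident about the state with the biggest jump, while a smaller one spreads the probability more evenly, so in practice it would have to be tuned against known campaign stops.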



Obama, Economy, Huh? Oct 21, 2012


You don’t need to follow the election to know that it is about the ECONOMY, and this is the one area where Obama is clearly on the defensive. We want to see whether Twitter can tell us anything interesting on this front, so we look for tweets about Obama that also talk about the state of the economy. We then compare their volume with two important economic indicators. The first one is the S&P 500, which gives us the following plot.


You may be thinking that they don't correlate at all. You're right. Not only do they not correlate, they actually anti-correlate! Now let's see something that does correlate: Obama mentions and the unemployment rate.


Hmm, so a bad market and high unemployment remind people of Obama? We can learn two things from these plots: 1. People talk about the economy when it sucks. 2. Obama, as the incumbent president, is being held responsible for economic performance.
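
To put a number on "correlate" and "anti-correlate" rather than eyeballing a plot, the usual tool is the Pearson correlation coefficient, which runs from -1 (perfectly opposite) to +1 (perfectly in step). Here is a minimal Python sketch. The three series are deliberately artificial placeholders, not our data, chosen so the two extremes are easy to see.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Artificial series, for illustration only.
mentions = [1, 2, 3, 4, 5]            # stand-in for monthly mention counts
falling_index = [10, 8, 6, 4, 2]      # moves opposite to mentions
rising_rate = [2, 4, 6, 8, 10]        # moves together with mentions

print(pearson(mentions, falling_index))
print(pearson(mentions, rising_rate))
```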

Additionally, since unemployment data is always released a month late, the pattern in Twitter mentions is somewhat predictive of the unemployment rate. I'll bet that the October unemployment rate will be higher than September's!
