Twithinks Blog: A Brief Introduction
We are a team of students and we have been playing with data we've gathered from Twitter for a while.
Attracted by the hype around the 2012 Presidential Election, we wondered what it would be like if the President were elected by Twitter users!
We tried to see what we could find with some simple analysis, and discovered lots of things that were deeply interesting to us. We hope you find them interesting, too.
Some results are presented in this blog, and you can play with the interactives on the home page.
Please leave any comments or feedback you might have on Hacker News.
Big Data DIY (for techies)
Oct 8, 2012
In the era of cloud storage and computing, scaling Big Data analysis is really not that difficult. Hosting a few terabytes on AWS or Google is almost standard procedure, provided you have the money to pay for the servers. But today we want to show that you can build a database system for Big Data even on a student budget.
Here are some rough numbers about the volume of Twitter data we have been dealing with:
Tweets: 2TB and growing
Users, relations and other data: ~80GB
In total, there are billions of rows. They are stored on a single machine that cost only about 3000 dollars. Here is what we learned along the way:
Of course, we had to put together the hardware ourselves. The specs of the machine we use are:
32GB RAM, 2 x 128GB SSDs in RAID 0, 4 x 3TB hard drives in RAID 10.
There are three layers of storage: memory, SSD and HDD, in order of decreasing performance and cost. Naturally, we wanted to keep the most frequently accessed data, such as database table indices and our client-facing data, in memory. We use the SSD array for most non-volatile storage except for tweets. Tweets, which are usually scanned sequentially, conveniently reside on the HDD array.
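As a quick sanity check on the specs above, here is a small sketch of the usable capacity of each array. It relies only on the standard properties of the two RAID levels (RAID 0 stripes across all drives, RAID 10 mirrors and then stripes, so half the raw space is usable); the function name is ours, and the figures are nominal drive sizes rather than formatted capacity.

```python
def usable_capacity_gb(drive_gb, n_drives, raid_level):
    """Usable capacity in GB for equal-sized drives in RAID 0 or RAID 10."""
    if raid_level == 0:
        # Striping only: every drive contributes its full capacity.
        return drive_gb * n_drives
    if raid_level == 10:
        # Striped mirrors: half of the raw capacity holds mirror copies.
        if n_drives % 2 != 0:
            raise ValueError("RAID 10 needs an even number of drives")
        return drive_gb * n_drives // 2
    raise ValueError("only RAID 0 and RAID 10 are handled in this sketch")


ssd_gb = usable_capacity_gb(128, 2, raid_level=0)    # SSD array
hdd_gb = usable_capacity_gb(3000, 4, raid_level=10)  # HDD array
print(ssd_gb, hdd_gb)
```

So the SSD array offers roughly 256GB with no redundancy, while the HDD array offers roughly 6TB and can survive a drive failure, which leaves room for the 2TB of tweets to keep growing.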
There are many database systems to choose from, but on a single node, good old MySQL is probably the best choice. Putting a few billion rows in one table usually bogs it down, and the main bottleneck is that you can't load the entire index into memory. We decided to "divide and conquer" and deployed a variety of tricks: partitioning, bulk loading, logging and sorting by index. They are implemented in Java as an optimization layer on top of MySQL. The chart below shows how the system works as a whole.
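To make two of these tricks concrete, here is a minimal sketch of partitioning combined with sorting by index before a bulk load. It is not our actual Java layer: it is written in Python for brevity, and the monthly partitions, the `tweets_YYYY_MM` naming and the row layout are all assumptions made for illustration. The idea is that each partition stays small enough for its index to fit in memory, and rows arrive in key order so the index grows by appending rather than by random inserts.

```python
from collections import defaultdict
from datetime import datetime


def partition_name(created_at):
    """Hypothetical naming scheme: one table per calendar month."""
    return f"tweets_{created_at:%Y_%m}"


def plan_bulk_load(rows):
    """Group (tweet_id, created_at, text) rows by partition, sorted by id.

    Returns {partition_name: [rows in ascending tweet_id order]}, ready to
    be written out and bulk loaded one partition at a time.
    """
    batches = defaultdict(list)
    for row in rows:
        batches[partition_name(row[1])].append(row)
    for batch in batches.values():
        # Sort by the indexed column so the load touches the index in order.
        batch.sort(key=lambda r: r[0])
    return dict(batches)


rows = [
    (30, datetime(2012, 2, 1), "c"),
    (10, datetime(2012, 1, 5), "a"),
    (20, datetime(2012, 1, 9), "b"),
    (5, datetime(2012, 2, 3), "d"),
]
plan = plan_bulk_load(rows)
for name in sorted(plan):
    print(name, [r[0] for r in plan[name]])
```

Each sorted batch could then be dumped to a file and handed to the database's bulk loader, which is typically much faster than row-by-row inserts.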
Twitter Mentions & Republican Primaries
Oct 9, 2012
To start with something simple, we look at the Twitter mentions of the election candidates. It is surprising how much even such a simple statistic reveals.
The following graph counts the Twitter mentions of the word “Romney” on a daily basis during the Republican Primaries. A few patterns to note:
- The spikes match well with major events like the primary nights and televised debates. Interestingly, only CNN and FOX debates caused major spikes.
- Significant weekly periodicity of Twitter mentions. People tweet more about the election during weekdays (at work? in class? :D).
- If we ignore event-driven spikes, Romney is getting more mentions over time. Does this correlate with his odds of winning?
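For the curious, the counting behind a graph like this can be sketched in a few lines. This is a simplified illustration, not our production code: it assumes tweets arrive as (date, text) pairs, matches the keyword as a whole word regardless of case, and the sample tweets are made up.

```python
import re
from collections import Counter
from datetime import date


def daily_mentions(tweets, keyword):
    """Count, per day, the tweets that mention keyword (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    counts = Counter()
    for day, text in tweets:
        if pattern.search(text):
            counts[day] += 1
    return counts


def mean_by_weekday(counts):
    """Average daily count per weekday (0 = Monday ... 6 = Sunday).

    Days with no mentions are absent from counts and are not averaged in.
    """
    totals, days = Counter(), Counter()
    for day, n in counts.items():
        totals[day.weekday()] += n
        days[day.weekday()] += 1
    return {wd: totals[wd] / days[wd] for wd in totals}


tweets = [  # made-up sample data
    (date(2012, 1, 2), "Romney wins"),           # Monday
    (date(2012, 1, 2), "romney again"),
    (date(2012, 1, 2), "overbaked my pancakes"),
    (date(2012, 1, 7), "Go Romney!"),            # Saturday
    (date(2012, 1, 9), "Romney, Romney, Romney"),  # Monday, counted once
]
counts = daily_mentions(tweets, "Romney")
print(mean_by_weekday(counts))
```

Averaging by weekday is one simple way to make the weekly periodicity visible without eyeballing the plot.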
Now let's look at each candidate's Twitter mentions relative to the others. The next plot shows the Republican primaries, a race among Romney, Santorum, Gingrich and Paul.
Wow! The winner of each primary tended to be the candidate with the greatest number of Twitter mentions or the strongest upward momentum.
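Here is a rough sketch of the two quantities mentioned above: each candidate's share of the day's mentions, and momentum measured as the change in share over a window of days. The exact definition of momentum here is an assumption made for illustration, and all the numbers are made up.

```python
def mention_shares(counts):
    """Turn {candidate: mention_count} into {candidate: share of total}."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


def momentum(share_series, window):
    """Change in share between the latest day and `window` days earlier."""
    if window < 1 or len(share_series) <= window:
        raise ValueError("need more than `window` days of data")
    return share_series[-1] - share_series[-1 - window]


# Made-up counts for a single day.
day = {"Romney": 500, "Santorum": 250, "Gingrich": 150, "Paul": 100}
shares = mention_shares(day)
leader = max(shares, key=shares.get)

# Made-up daily shares for one candidate over four days.
trend = momentum([0.20, 0.25, 0.30, 0.40], window=3)
print(leader, shares, trend)
```

Working with shares rather than raw counts keeps the comparison fair on days when the overall volume of election tweets jumps.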
At first sight, it might seem crazy that the noisy micro-blogosphere, in which people complain about overbaking their pancakes or brag about their vodka melon hangovers, has any predictive value for any significant event. Moreover, more serious researchers will point out various biases, such as user demographics, negative tweets, etc.
However, on second thought, this might not be too bizarre an idea after all. Being mentioned a lot may be good or bad, depending on how favorable the tweets are. But not being mentioned is certainly bad; that means no one cares about the candidate.
Certainly, this won't explain away all the biases, and the reason behind the correlation is definitely worth further investigation.
Twitter Mentions & the Presidential Race
Oct 19, 2012
Now that we have tested our muscles, we are ready to move on to the big game, the Presidential Election. Here is a simple plot of the percentage of Twitter mentions for the two candidates since January.
Not surprisingly, Obama, as the incumbent President, receives way more mentions than Romney most of the time. Nonetheless, there are a few spikes in Romney’s mentions plot that are worth investigating.
What are some of the events that brought Romney more attention in the Twittersphere? Are they good for Romney? We wrote some code to help us identify some of the major events:
- On Apr 12, most of the discussion focused on the “stay-at-home-mom” comment by Hilary Rosen on Ann Romney. Here is a tweet supporting Ann:
"also disappointed in hilary rosen's comments about ann romney. they were inappropriate and offensive."
Of course, not all of them are defending Ann.
- On Jul 11, Romney spoke at NAACP and got booed, according to Twitter.
- On Jul 26, Romney attended the Olympic Opening Ceremony, sweating about his “Olympic gaffes”.
- On Aug 10, Romney picked Ryan as his running mate. The most retweeted post: “Really Romney?”
Well, Twitter is full of interesting information, blended with sarcasm, humor and irony. It reminds me of a quote:
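Spotting the dates of such events can be automated with a very simple rule. The sketch below is one possible approach, not necessarily the code we ran: it flags a day as a spike when its count exceeds a multiple of the median of the preceding days. The window length, the threshold factor and the sample series are all assumptions.

```python
from statistics import median


def find_spikes(counts, window=7, factor=2.0):
    """Return indices of days whose count exceeds factor x the trailing median.

    The baseline for day i is the median of the `window` days before it,
    which keeps an earlier spike from inflating the baseline too much.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = median(counts[i - window:i])
        if baseline > 0 and counts[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Made-up daily mention counts with one obvious event on day 7.
series = [100, 110, 95, 105, 90, 100, 108, 320, 115, 102]
spike_days = find_spikes(series)
print(spike_days)
```

Once a spike day is known, reading the most retweeted posts from that day is usually enough to tell what happened.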
With the absence of the false, you have the truth and noise.
Putting a GPS Tracker on Romney
Oct 20, 2012
We'd like to test an interesting hypothesis: Twitter knows Romney's whereabouts.
The idea is that if Romney is present in a state on a given day, there should be a jump in the share of Romney mentions in that state compared to nationwide mentions.
So examining the jump allows us to track Romney's campaign schedule. Based on the magnitude of the jump, we estimated the probability of Romney having an event in each state, or having no event at all.
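To give a feel for how such an estimate could work, here is a deliberately simplified sketch. It is an illustrative stand-in, not the statistical model behind our numbers. It assumes we know each state's usual share of nationwide Romney mentions (the baseline) and its share today, scores each state by how far today's share rises above the baseline, and adds a fixed weight for the "No event" outcome before normalizing. The weight and all figures are made up.

```python
def location_probabilities(today_share, baseline_share, no_event_weight=0.2):
    """Turn per-state jumps in mention share into rough probabilities.

    today_share / baseline_share: {state: share of nationwide mentions}.
    A state scores max(today / baseline - 1, 0); "No event" gets a fixed
    score; all scores are then normalized to sum to 1.
    """
    if no_event_weight <= 0:
        raise ValueError("no_event_weight must be positive")
    scores = {"No event": no_event_weight}
    for state, share in today_share.items():
        lift = share / baseline_share[state]
        scores[state] = max(lift - 1.0, 0.0)
    norm = sum(scores.values())
    return {place: s / norm for place, s in scores.items()}


baseline = {"OH": 0.04, "FL": 0.08, "VA": 0.04}  # made-up usual shares
today = {"OH": 0.06, "FL": 0.08, "VA": 0.05}     # made-up shares for one day
probs = location_probabilities(today, baseline)
top_three = sorted(probs, key=probs.get, reverse=True)[:3]
print(top_three)
```

With these made-up numbers, the state with the largest jump comes out on top, and the fixed "No event" weight keeps quiet days from being assigned to a state with only a tiny bump.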
The following table shows the top three most likely estimates. Romney's actual location is highlighted in green.
||No Event 11%
||No Event 15%
||No Event 14%
||No Event 20%
||No event 31%
||No event 21%
||No event 16%
||No event 21%
||No event 16%
||No event 21%
||No event 21%
||No event 16%
||No event 22%
The results are surely not earth-shattering, but they're fairly accurate. This is an illustration of the power of big data in practice. Even though we only keep a tiny amount of information from each tweet, the accumulated information still allows us to distill fairly strong signals that reflect real-world events.
Obama, Economy, Huh?
Oct 21, 2012
You don’t need to follow the election to know that this election is about the ECONOMY, and this is the one area where Obama is clearly on the defensive. We want to see whether Twitter can tell us anything interesting on this front. Therefore we look for tweets about Obama that also talk about the state of the economy. We then compare them with two important economic indicators. The first one is the S&P 500, which gives us the following plot.
You may be thinking that they don't correlate at all. You're right. Not only do they not correlate, they actually anti-correlate! Now let's see something that correlates: Obama and the unemployment rate.
Hmm, so a bad market and a high unemployment rate remind people of Obama? We can learn two things from these plots: 1. People talk about the economy when it sucks. 2. Obama, as the incumbent president, is being held responsible for economic performance.
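If you'd rather put a number on "correlate" and "anti-correlate" than trust your eyes, the Pearson correlation coefficient is the usual tool: it is +1 for series that move together perfectly and -1 for series that move in exactly opposite directions. Below is a small self-contained sketch; the series are made up purely to show the two extremes and are not our data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


mentions = [1, 2, 3, 4]        # made-up monthly mention levels
falling_index = [8, 6, 4, 2]   # made-up series moving the opposite way
rising_rate = [2, 4, 6, 8]     # made-up series moving the same way
r_down = pearson(mentions, falling_index)
r_up = pearson(mentions, rising_rate)
print(r_down, r_up)
```

Keep in mind that a strong coefficient over a handful of monthly data points is suggestive at best.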
Additionally, since unemployment data is always released one month later, the pattern in Twitter mentions is somewhat predictive of the unemployment rate. I'll bet that the October unemployment rate will be higher than September's!