Today Twitter announced that it purchased Backtype, a tiny company doing Big Data. The website says the company does “social media analytics” which is pretty much saying “Oprah sits on a couch.”
It’s the buzzy-est of buzz words, but if you dig into this you find that the company is doing things everybody wants, like analyzing hits per link.
When a content creator shares their work, the Backtype-created Storm product steps in, giving you a realtime conversation graph, the ability to search comments, and an influence score. Add to that comparison shopping through analysis of top sites and trending links.
But you may be asking, can’t I already do this?
Yes, you can, if you want it 24 hours later.
Stats on the web fall into two categories: instant low-tech stats and delayed high-tech stats. The market is saturated with the latter because the core innovations already exist. The reason is complicated, but it boils down to the fact that today’s top hardware was built for the pre-Twitter/Facebook world.
Building for realtime processing requires a whole new set of tools, where speed, data size, and query volume go supernova. For a sense of scale, see Backtype’s stats:
- 100-200 machines (operating as one)
- 100 million messages
- 300 queries/second
This is the world that Twitter lives in, millions of messages/second. If you remember the early days of Twitter with all the downtime and Fail Whale messages, that was due to the technological limitations of the time.
They proved that quick, short messages are beloved by us humans to the tune of billions. Since then they have been massively scaling, customizing, and driving the industry. Not only do they need a way to process billions of messages without Fail Whale-ing, but they need to offer (paid) services on top of it.
This is where Backtype comes in. The team built a fascinating service on top of Twitter that does stream processing, continuous computation, and distributed RPC (remote queries across hundreds of machines).
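To make “continuous computation” concrete, here is a minimal sketch of the idea in plain Python: counts are updated as each message arrives, so the answer is always current instead of recomputed in a nightly batch. This is an illustration of the concept only, not Storm’s actual API; the message stream is made up for the example.

```python
from collections import Counter

def process_stream(messages):
    """Continuously update hashtag counts as messages arrive,
    instead of batch-processing them a day later."""
    counts = Counter()
    for msg in messages:
        for word in msg.split():
            if word.startswith("#"):
                counts[word] += 1
        yield counts  # fresh totals are available after every single message

# Hypothetical stream of tweets for illustration
stream = ["big #data is here", "love #data and #realtime"]
for snapshot in process_stream(stream):
    pass
print(snapshot.most_common(2))  # [('#data', 2), ('#realtime', 1)]
```

A real system like Storm spreads this same loop across many machines and keeps it running forever, but the core shift is the one shown here: the computation never “finishes,” it just stays up to date.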
The simple translation of this is “live analytics”. The complicated version, in Twitter’s own words:
“Imagine you have a cluster of 100 computers. Hadoop’s distributed file system makes it so you can put data…in…and pretend that all the hard drives on your machines have coalesced into one gigantic drive….it breaks each file you give it into 64- or 128-MB chunks called blocks and sends them to different machines in the cluster, replicating each block three times along the way.
“…the second main component of Hadoop is its map-reduce framework, which provides a simple way to break analyses over large sets of data into small chunks which can be done in parallel across your 100 machines.”
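The map-reduce framework described in that quote can be sketched in a few lines of Python. This toy version counts words across “blocks” of a file: the blocks here are just strings in a list, where Hadoop would scatter 64–128 MB blocks across 100 machines, but the map-then-reduce shape is the same.

```python
from functools import reduce

# Toy map-reduce: count words across "blocks" of a large file.
# Hadoop would distribute these blocks over a cluster; here they
# are a plain list, to show the shape of the idea.
blocks = ["to be or not to be", "be quick", "not so quick"]

def map_block(block):
    # Map phase: each block is counted independently,
    # which is what makes the work parallelizable.
    counts = {}
    for word in block.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_counts(a, b):
    # Reduce phase: merge the partial counts from every block.
    for word, n in b.items():
        a[word] = a.get(word, 0) + n
    return a

partials = [map_block(b) for b in blocks]   # on a cluster, these run in parallel
totals = reduce(reduce_counts, partials, {})
print(totals["be"], totals["quick"])  # 3 2
```

Because each `map_block` call touches only its own block, adding machines adds throughput almost linearly, which is exactly the property the quote is describing.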
By buying this technology Twitter is pushing its edge with Big Data.
It’s an advantage they started building years ago to keep the product from failing all the time. The effort has taken years and millions of dollars, and it has transformed the company into a world-class, and in many ways unique, Big Data operation.
Now they have some freedom to play around, and Backtype provides the playground. Link stats, emerging trends, and viral memes are just the beginning.
We are about to see how realtime we can get…