Tag Archives: big data

Ecology goes Big Data – 15,000 sensors to measure everything in the soil

This project, called NEON, should revolutionize the study of ecology, and with it the study of global warming.

The Economist – NEON light

Once this network is completed, in 2016 if all goes well, 15,000 sensors will be collecting more than 500 types of data, including temperature, precipitation, air pressure, wind speed and direction, humidity, sunshine, levels of air pollutants such as ozone, the amount of various nutrients in soils and streams, and the state of an area’s vegetation and microbes.

 

I just can’t believe they were able to get $400+ million in funding for this project.

 


Nate Silver predicts our next President – by keeping a running forecast

If you haven’t heard of Nate Silver then you are in for a ride. Nate is very, very famous in two distinct areas, baseball and politics, for his ability to predict things.

For baseball he developed PECOTA, a system for predicting the future performance of baseball players, and sold it to Baseball Prospectus in 2003.

From there he moved into politics and went on a run, correctly predicting the winner in 49 out of 50 states for the 2008 presidential election, and all 35 of the Senate races.

That made him some enemies, specifically all those existing pollsters who were proved wrong time and time again.

They still don’t like him, but he is the reigning king of political predictions and now a blogger for the New York Times, where he maintains a running forecast for the 2012 presidential election.

This screenshot shows the forecasted winner in November:

 


A real Moneyball story – the reinvention of pitcher Brandon McCarthy

SEVEN PITCHES. That’s how long it took for the verdict to come in. On April 5, in the first inning of his first start in an A’s uniform, Brandon McCarthy went groundout, groundout, groundout. It was a one-inning sabermetric masterpiece. For the game, he lasted eight innings — the second-longest start of his career — and threw just 89 pitches.

McCarthy’s filthy stuff was no laughing matter. “He’s not trying to strike you out,” says Hunter, who had long dominated the lanky pitcher — until last season. “He’s trying to get a ground ball. He’s keeping guys off balance, and he’s hitting his spots. He’s learned how to pitch.” (“The first time I got him out last year,” says McCarthy, “I was like, ‘Oh my god, I really did something!’ That just wasn’t possible before.”) A’s manager Bob Melvin says McCarthy’s new pitching approach reminds him of Greg Maddux, the 300-game winner and surefire Hall of Famer. Says Melvin: “He takes great pride in being able to throw the ball where he wants.” And when he wants.

He learned about FIP, or fielding independent pitching, a statistical aggregate that combines what a pitcher can control (homers, walks, strikeouts), ignores what he can’t (luck, defense) and is a truer barometer than ERA. He also learned about BABIP, or batting average on balls in play, a stat that indicates whether a pitcher has been especially lucky (under .300) or unlucky (over .300). He learned about WAR, or wins above replacement, the all-inclusive, apples-to-apples metric that tells how valuable a player is to his team. He learned about ground ball rates, strikeout-to-walk ratios and more.

via Saviormetrics – ESPN The Magazine
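
For reference, here is a minimal sketch of how two of the stats mentioned in the excerpt are usually computed; the FIP constant shifts slightly from season to season, and the sample numbers below are made up, not McCarthy’s actual line.

```python
# Rough sketch of FIP and BABIP as described above.
# The FIP constant (~3.10) is a league-wide value that changes each year.

def fip(hr, bb, hbp, k, ip, fip_constant=3.10):
    """Fielding Independent Pitching: counts only what the pitcher controls."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant

def babip(hits, hr, at_bats, k, sac_flies):
    """Batting Average on Balls In Play: ~.300 is typical; far off suggests luck."""
    return (hits - hr) / (at_bats - k - hr + sac_flies)

# Illustrative numbers only:
print(round(fip(hr=9, bb=25, hbp=4, k=123, ip=170), 2))
print(round(babip(hits=168, hr=9, at_bats=650, k=123, sac_flies=5), 3))
```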

San Diego Padres go “big data” with dynamic ticket pricing

The Padres have become the latest Major League team to implement dynamic pricing for single-game ticket sales.

By using advanced computer programming, dynamic pricing will give the team the ability to adjust ticket costs higher or lower based on market demand and such factors as pitching matchups, the team’s performance, weather and potential milestones.

“Over time, as the game gets closer, ticket prices will normalize, but generally, fans who buy early will save money.”

Single-game tickets will go on sale at the Petco Park box office and online at www.padres.com/tickets on Feb. 11.

All told, nearly three-quarters of single-game tickets will go on sale at or below 2011 prices.

via MLB.com
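
The article doesn’t reveal how the Padres’ system actually works, but a toy sketch of the idea looks something like this; every factor name and weight below is hypothetical, made up purely to show demand signals nudging a base price.

```python
# Hypothetical illustration of dynamic ticket pricing.
# None of these factors or weights come from the Padres; they are invented
# to show how demand signals could move a base price up or down.

BASE_PRICE = 24.00  # assumed face value for one seat

def dynamic_price(base, demand_factors):
    """Scale a base price by a set of multiplicative demand factors."""
    price = base
    for factor in demand_factors.values():
        price *= factor
    return round(price, 2)

game = {
    "pitching_matchup": 1.15,   # ace vs. ace draws more fans
    "team_performance": 1.05,   # club on a winning streak
    "weather_forecast": 0.95,   # chance of rain
    "milestone_watch": 1.10,    # a player approaching a record
}

print(dynamic_price(BASE_PRICE, game))  # 30.28
```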

What Is a Data Scientist?

After doing some research on data visualization, I came across an explosion of articles on “data scientists”. The term seems to be the latest craze in social networking. All these web 2.0 companies have millions of users and their data, but no income. Why not find a way to turn that data into profits?

The trouble is that all this is very confusing…to everybody. So let’s start at the beginning.

What Is a Data Scientist?

According to Ryan Kim of GigaOm: “It’s someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. It’s a set of skills that go beyond many existing job titles and it’s increasingly in demand.”

That definition is pulled from Bit.ly’s data scientist, who breaks the work down into a taxonomy (a minimal code sketch of the first two steps follows the list):

  • Obtain: pointing and clicking does not scale.
    • Getting a list of numbers from a paper via PDF or from within your web browser via copy and paste rarely yields sufficient data to learn something `new’ by exploratory or predictive analytics. Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from multiple sources, and possibly from sites which require specific query syntax. At a minimum, a data scientist should know how to do this from the command line, e.g., in a UN*X environment. Shell scripting does suffice for many tasks, but we recommend learning a programming or scripting language which can support automating the retrieval of data and add the ability to make calls asynchronously and manage the resulting data. Python is a current favorite at time of writing (Fall 2010).
  • Scrub: the world is a messy place.
    • Whether provided by an experimentalist with missing data and inconsistent labels, or via a website with an awkward choice of data formatting, there will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of these data is possible. As with Obtaining data, herein a little command line fu and simple scripting can be of great utility. Scrubbing data is the least sexy part of the analysis process, but often one that yields the greatest benefits. A simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.
  • Explore: You can see a lot by looking.
    • Visualizing, clustering, performing dimensionality reduction: these are all part of `looking at data.’ These tasks are sometimes described as “exploratory” in that no hypothesis is being tested, no predictions are attempted. Wolfgang Pauli would call these techniques “not even wrong,” though they are hugely useful for getting to know your data. Often such methods inspire predictive analysis methods used later.
  • Models: always bad, sometimes ugly.
    • Whether in the natural sciences, in engineering, or in data-rich startups, often the ‘best’ model is the most predictive model. E.g., is it `better’ to fit one’s data to a straight line or a fifth-order polynomial? Should one combine a weighted sum of 10 rules or 10,000? One way of framing such questions of model selection is to remember why we build models in the first place: to predict and to interpret…
  • iNterpret: “The purpose of computing is insight, not numbers.”
    • Consider the task of automated digit recognition. The value of an algorithm which can predict ‘4’ and distinguish it from ‘5’ is assessed by its predictive power, not on theoretical elegance; the goal of machine learning for digit recognition is not to build a theory of ‘3’. However, in the natural sciences, the ability to predict complex phenomena is different from what most mean by ‘understanding’ or ‘interpreting’…
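
As a rough illustration of the Obtain and Scrub steps above, here is a minimal Python sketch; the URL and column names are placeholders I invented, not a real data source.

```python
# Minimal sketch of "Obtain" and "Scrub": fetch a CSV over HTTP, then clean it.
# The URL and the "label"/"value" column names are hypothetical placeholders.
import csv
import io
import urllib.request

URL = "https://example.com/measurements.csv"  # placeholder data source

def obtain(url):
    """Obtain: download raw CSV text from a URL."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def scrub(raw_text):
    """Scrub: drop rows with missing values and normalize inconsistent labels."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        if not row.get("value"):                     # skip missing measurements
            continue
        row["label"] = row["label"].strip().lower()  # normalize label spelling
        row["value"] = float(row["value"])
        rows.append(row)
    return rows

# rows = scrub(obtain(URL))  # point URL at real data to get clean records back
```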

The ability and the skills to perform all of these processes make a person highly valuable and, in my opinion, very interesting. The career pages of Facebook, Google, Twitter, Foursquare, LinkedIn and a host of other social sites are boiling over with postings for these folks. There is a real hiring race under way to land the best of them.

It appears that LinkedIn is leading in innovation right now with their Data Science Team, though I suspect that Facebook, Amazon, and Google are not far behind. Some of the features that the LinkedIn team has built are:

  • “People You May Know”
  • “Jobs You May be Interested In”
  • “Groups You May Like”
  • “Talent Match” (Post a job on LinkedIn and see the magic)
  • “Signal” (news feed feature)

Which is funny, because every site seems to have these features now. I can just imagine an arms race of sorts to build the next great “data science” feature so they can keep bringing us back to their sites (and hopefully get some ad money along the way).

 

Twitter’s edge is Big Data

Today Twitter announced that it purchased Backtype, a tiny company doing Big Data. The website says the company does “social media analytics”, which is pretty much like saying “Oprah sits on a couch.”

It’s the buzziest of buzzwords, but if you dig into this you find that the company is doing things that everybody wants, like analyzing hits per link.

When a content creator shares their work, the Backtype-created Storm product steps in, giving you a realtime conversational graph, the ability to search comments, and an influence score. Add to that comparison shopping through analysis of top sites and trending links.

But you may be asking, can’t I already do this?

Yes, you can, if you want it 24 hours later.

Stats on the web fall into two categories: instant low-tech stats and delayed high-tech stats. The market is saturated with the latter because the core innovations already exist. The reason is complicated, but it boils down to the fact that today’s top hardware was built for the pre-Twitter/Facebook world.

Building for realtime processing requires a whole new set of operators, where speed, size, and query volume go supernova. To give you an example, look at Backtype’s stats:

  • 100-200 machines (operating as one)
  • 100 million messages
  • 300 queries/second

This is the world that Twitter lives in, millions of messages/second. If you remember the early days of Twitter with all the downtime and Fail Whale messages, that was due to the technological limitations of the time.

They proved that quick, short messages are beloved by us humans to the tune of billions. Since then they have been massively scaling, customizing, and driving the industry forward. Not only do they need a way to process billions of messages without Fail Whale-ing, but they need to offer (paid) services on top of it.

This is where Backtype comes in. The team built a fascinating service on top of Twitter that does stream processing, continuous computation, and distributed RPC (remote queries of 100s of machines).
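
Storm’s real API isn’t shown here, but the core stream-processing idea, updating stats continuously as each message arrives instead of recomputing them in a nightly batch, can be sketched on a single machine in a few lines; Storm’s job is to run this same pattern across hundreds of machines.

```python
# Toy single-machine sketch of continuous computation: keep rolling link
# counts current as each message arrives, rather than waiting for a batch job.
from collections import Counter

link_counts = Counter()

def process(message):
    """Called once per incoming message; updates the live stats immediately."""
    for link in message.get("links", []):
        link_counts[link] += 1

def top_links(n=3):
    """Answer 'live analytics' queries from the always-current counts."""
    return link_counts.most_common(n)

# Simulate a small stream of messages mentioning links.
stream = [
    {"links": ["example.com/a"]},
    {"links": ["example.com/b", "example.com/a"]},
    {"links": ["example.com/a"]},
]
for msg in stream:
    process(msg)

print(top_links())  # [('example.com/a', 3), ('example.com/b', 1)]
```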

The simple translation of this is “live analytics”. The complicated version, pulled from Twitter:

“Imagine you have a cluster of 100 computers. Hadoop’s distributed file system makes it so you can put data…in…and pretend that all the hard drives on your machines have coalesced into one gigantic drive….it breaks each file you give it into 64- or 128-MB chunks called blocks and sends them to different machines in the cluster, replicating each block three times along the way.

“…the second main component of Hadoop is its map-reduce framework, which provides a simple way to break analyses over large sets of data into small chunks which can be done in parallel across your 100 machines.”
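
To make the map-reduce half of that quote concrete, here is a tiny single-machine sketch of the pattern; real Hadoop runs the map step on whichever machines hold each block and then shuffles the partial results, but the map and reduce functions look much the same.

```python
# Minimal map-reduce sketch: count words across "blocks" of a large file.
# Hadoop would map each block on the machine that stores it, then merge
# (reduce) the partial counts; here everything runs locally for clarity.
from collections import Counter
from functools import reduce

blocks = [                # stand-ins for the 64/128 MB HDFS blocks
    "big data big",
    "data everywhere",
]

def map_block(block):
    """Map: turn one block of text into partial word counts."""
    return Counter(block.split())

def reduce_counts(a, b):
    """Reduce: merge two partial counts into one."""
    return a + b

partials = [map_block(b) for b in blocks]   # on a cluster, this runs in parallel
totals = reduce(reduce_counts, partials, Counter())
print(totals)  # Counter({'big': 2, 'data': 2, 'everywhere': 1})
```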

By buying this technology, Twitter is pushing its edge with Big Data.

It’s an advantage they started building years ago to make sure the product stopped failing all the time. It has taken years and millions of dollars, and it has transformed the company into a professionally respectable “Big Data operation” that is world class and in many ways unique.

Now they have some freedom to play around, and Backtype provides the playground. Link stats, emerging trends, and viral memes are just the beginning.

We are about to see how realtime we can get…
