Take Away the ‘Big Data’ Hype, and Data is just Data. Data can be managed.

The rise of the data scientist

As I write this, a major annual data science conference (Strata 2015) has just concluded in San Jose. President Obama provided a keynote speech highlighting the release of 138,000 datasets of public information, and the appointment of the first Chief Data Scientist. Closer to home, we know that years ago, Gartner predicted Australia would experience a surge in big data-related jobs by 2015. A quick search of local job postings reveals there are plenty of data science roles on the go. In fact, The Australian reported on research that shows a 77 per cent increase in data scientist jobs in Australia in 2014. We have also read that James Cook University, Monash University, the University of Technology Sydney (UTS) and Victoria University plan to offer new data science-related courses in 2015.

Here at DiUS, one of the SIGs (Special Interest Groups) that we run out of our Sydney and Melbourne offices tracks our keen interest in the burgeoning field of data science. We have had trouble naming the thing. Options thrown around include: the Big Data SIG, the Data Science SIG, the Data Insights SIG and inevitably, the Big SIG. “Big data” is but a subset of the field, albeit with its own hype machine.

Don’t fall prey to Big Data hype

To successfully focus on mining, analysing and acting on data to achieve true business value, organisations must not allow the idea of big data to be a barrier. The “big” in “big data” means the volume of data is large enough to be difficult to manage. It is a sliding definition of size over time, growing with our computational capabilities. Hopefully the increase in organisations seeking to work with data scientists means more will cut through the big data hype and get onto the business of data analysis.

There’s no denying the predominance of the term in the mainstream media, however. As the industry matures, vendors have brought a battery of cutting-edge tools to the marketplace, brandishing both impressive capabilities and impressive prices. To an organisation – small or large – looking to lift its game by harnessing these resources, the first step may seem daunting, extensive and expensive.

To smaller businesses, it might even seem like “big data” is a concept for large organisations that can afford to invest in the big tools that can handle the required data processing and analysis. Not true! Data analysis is universal. It works for big companies and small, and it’s best to keep it simple when you’re getting started.

The secret is to start with small, simple, manageable data sets and a process that will help you get some wins on the board, which will motivate you and your organisation to continue using data to improve your business.

Step-by-Step Approach to Begin Harnessing Data for Your Business

What you need: measured steps, a very disciplined focus and a small cast of stakeholders

Step 1: The business focuses on one insight to pursue

This is a prioritisation exercise. Your key business stakeholders create a shortlist of instances / activities / processes where more information would improve their ability to add value to the business. From that list, the group selects one goal to tackle first.

Step 2: A data scientist accesses the data

The data scientist works with simple, open-source tools to support this need by first identifying the data sources and accessing them. He or she follows a process of data cleaning, exploratory analysis, model building and validation.

During this stage, the data scientist may provide feedback on whether the nominated insight is sufficiently focused. If the question is too high level, answering it will require a large data set. The point is to begin with a very manageable set of data, so you may need to revisit Step 1 and refine.
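As a rough illustration of the clean, explore, model and validate sequence in this step, here is a minimal Python sketch that uses only the standard library. Everything in it is an assumption made for the example: the "ad spend versus sales" records are randomly generated, the model is a plain straight-line fit, and the 80/20 split is just one common convention. A real engagement would start from your own data and whichever open-source tools your data scientist prefers.

```python
import math
import random
import statistics

# Hypothetical records for illustration only: (weekly ad spend, weekly sales).
rng = random.Random(42)
raw = []
for _ in range(200):
    spend = rng.uniform(1, 10)
    raw.append((spend, 3.0 * spend + 5.0 + rng.gauss(0, 1.0)))
raw += [(None, 12.0), (4.0, None)]  # simulated dirty rows with missing values

# 1. Data cleaning: drop rows with a missing value.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# 2. Exploratory analysis: a few summary statistics before any modelling.
summary = {
    "rows": len(clean),
    "mean_spend": statistics.mean([x for x, _ in clean]),
    "mean_sales": statistics.mean([y for _, y in clean]),
}

# 3. Model building: ordinary least squares for sales = slope * spend + intercept.
def fit_line(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# 4. Validation: fit on 80% of the rows, measure error on the unseen 20%.
rng.shuffle(clean)
cut = int(len(clean) * 0.8)
train, holdout = clean[:cut], clean[cut:]
slope, intercept = fit_line(train)
rmse = math.sqrt(
    statistics.mean([(y - (slope * x + intercept)) ** 2 for x, y in holdout])
)

print(summary)
print(f"slope={slope:.2f} intercept={intercept:.2f} holdout RMSE={rmse:.2f}")
```

The holdout error is the number to bring back to the stakeholders: it gives an honest indication of how well the model copes with data it has not seen.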

Step 3: The data product is presented

Stakeholders assess and critique

A word of advice: data visualisation is the best communicator, but complex, multi-dimensional charts are not required at the outset. They may look impressive but, at least in the early stages, it’s better to stick to simple, accessible presentation formats so that all business stakeholders can easily process the information.

The business stakeholders must assess and indicate how the data product could help with their identified area of enquiry, and what limitations remain. The data scientist rates the data quality and its implications on the results presented. Statistical diagnostics should be run and reviewed.

Tip: Look to pull in outside datasets. The Australian Bureau of Statistics website and various State and Federal open-data initiatives can be useful resources. The external perspective provided may either reinforce your existing data or highlight its limitations. Both are useful outcomes.
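To make the tip concrete, here is a small Python sketch of what pulling in an outside dataset can look like. The region codes and all of the figures are invented for the example (they are not real Australian Bureau of Statistics data), and `enrich` is a hypothetical helper name. The idea is simply to join your internal numbers to an external measure by a shared key, and to keep track of the records that fail to match.

```python
# Hypothetical internal data: annual sales per region code.
internal_sales = {"A": 120_000, "B": 95_000, "C": 40_000, "X": 5_000}

# Hypothetical external data: population per region code (invented figures).
external_population = {"A": 30_000, "B": 50_000, "C": 10_000}

def enrich(internal, external):
    """Join two datasets on their shared key and compute sales per head."""
    matched = {}
    unmatched = []
    for region, sales in internal.items():
        if region in external:
            matched[region] = sales / external[region]
        else:
            unmatched.append(region)  # a data limitation worth reporting
    return matched, unmatched

sales_per_head, missing_regions = enrich(internal_sales, external_population)
print(sales_per_head)
print(missing_regions)
```

The matched figures put your own data in context, while the unmatched list is exactly the kind of limitation the data scientist should raise when rating data quality.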

Step 4: Iterate

The familiar Agile methodology precepts of short iterations and regular review work just as well for data science as they do for software development. Having serviced the initial request to everyone’s satisfaction, the stakeholders return to Step 1, and pick a new business goal on which to gain insight. They should not reflexively pick the next item on the initial list. Rather, in light of the knowledge gained from the exercise just concluded, the list should be re-assessed in order to decide upon the next insight to pursue.

Don’t be afraid to begin again

If your enterprise has already gone big with its data analysis technology investment but it’s really not getting you anywhere, or if missteps were made and you now find your data analysis projects stalled or back at the drawing board, you can apply the above recommended methodology to begin again. Re-focus, start small and iterate.

The “bigness” threshold

Professors Trevor Hastie and Rob Tibshirani of Stanford are two of the most prominent names in the field of Statistical Learning, having produced a number of successful algorithms and one of the best books on the topic. As they put it, they’ve worked in big data all the way back to when it was called statistics and, in their lectures, there is some bemusement at the perception that we are only now in uncharted waters.

A key objective in their work has always been to identify the smallest subset of the data that can give an effective model. The techniques to do this are many and varied, and have been evolved by several smart people over the past few decades. Today’s data scientist has many tools at his or her disposal to execute what the professors have been pioneering for years – reducing the volume of data needed to get the job done to below the current “bigness” threshold.
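One simple way to see the idea of working with less data is to fit the same model on progressively larger samples and watch where the error on held-back data stops improving. The Python sketch below does that with randomly generated data and a straight-line fit. It is an illustration only, not the professors' own techniques, and the sample sizes and the 5 per cent tolerance are arbitrary assumptions chosen for the example.

```python
import math
import random

rng = random.Random(7)

# Hypothetical data: 20,000 rows of (x, y) with a linear signal plus noise.
def make_row():
    x = rng.uniform(0, 10)
    return x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)

rows = [make_row() for _ in range(20_000)]
holdout, pool = rows[:2_000], rows[2_000:]  # hold back rows for evaluation

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(model, points):
    """Root mean squared prediction error of a fitted line on the given rows."""
    slope, intercept = model
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))

# Fit on growing samples and record the error on the held-back rows.
sizes = [50, 200, 1_000, 5_000, len(pool)]
errors = {n: rmse(fit_line(pool[:n]), holdout) for n in sizes}

# Smallest sample whose error is within 5% of the full-data error.
full_error = errors[len(pool)]
smallest_adequate = min(n for n in sizes if errors[n] <= full_error * 1.05)

for n in sizes:
    print(f"{n:>6} rows -> holdout RMSE {errors[n]:.3f}")
print("smallest adequate sample:", smallest_adequate)
```

For a model this simple, a small fraction of the rows is typically enough, which is the point: the volume needed to get the job done can sit well below the volume you happen to have stored.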

At some point, however, an organisation may experience such growth that the volume of data being worked crosses that line. The factors driving this will be varied. The dimensions of the stored data may be large in terms of breadth (the number of fields or factors for each data record entry) or depth (the number of data records). A tight focus can help to sidestep this at the outset, but you may eventually find yourself working with a very large data set all the same. For some organisations it is inevitable that the number of insights being served, their timeliness and the evolved complexity of the predictive modelling will drive up the data throughput.

Hopefully though, by this point, the organisation is an experienced consumer and evaluator of data technologies, and its stakeholders, having partaken of the success of data insights, are invested in its future growth.