What will it take to make Big Data clean, standardized, and scalable for effective analysis and broad application?
When planning the program for the Strata Conference November 19-21 in Barcelona, Spain, O’Reilly spent a lot of time thinking about what’s important in data now, what’s next, what’s changing, and what data professionals and executives need to know to stay ahead of the curve. Here are three of the trends we’ve been discussing:
Note: Strata will tackle these trends and challenges in depth. And nextrends readers can get 20% off registration to Strata Barcelona using the promo code SWISSNEX or enter to win a free pass!
Trend: Democratization of Data
As data-driven organizations start to outpace their competitors, businesses are starting to devote more energy to building their analytics capabilities. This demand, combined with shortages of data scientists, has driven a growing number of tools that put the power of advanced analytics in the hands of non-experts.
These solutions combine sophisticated algorithms, rich data sets, and intuitive interfaces. These three elements, along with the rise of parallel/distributed computing systems, open up techniques previously confined to experienced data scientists.
Challenges: As access to data and analysis proliferates, there is an increased chance of errors, conflicting results, and cognitive bias—not to mention the fact that too much information can be difficult to digest. And although routine queries have become much easier to perform, data is still not a DIY project for most organizations.
Acquiring, cleaning, managing, and organizing data for analysis is still a large part of the job. Also, getting the most out of data analysis requires understanding the nature of the data, including the inevitable quirks, inconsistencies, and errors.
Building a solid data science and engineering team, and the culture that nourishes it, requires attracting scarce data talent, understanding the data science workflow better (many organizations are finding that long data pipelines require a wide variety of skills and tools), and balancing the tension between specialized and general-purpose data analysis tools.
Creating a data-driven culture also requires that non-expert business users develop a basic understanding of data best practices. As more non-experts manipulate data, the potential for misunderstanding what the data is telling you increases. (Hint: correlation is not causation.) Documenting the provenance of data sources is increasingly important. Making collaboration and reproducibility cultural norms can help non-experts avoid errors and create valuable learning opportunities.
Trend: The Internet of Things (IoT)
Cheap networks and sensors herald a future with an Internet that’s always connected, always on, and nothing like the world of today. IDC predicts over 30 billion connected “things” will exist in the year 2020—each one generating data.
Challenges: Right now, there are no widely accepted open standards for the IoT—which means devices may not “talk” to each other, and data streams may not be compatible.
Smart things will generate billions of concurrent records that don’t have common structures (including geospatial, semi-structured sensor data, and sometimes binary data). Extracting understandable, meaningful insights from the resulting torrent can be complex.
Security considerations span not only a single program, but a whole network of programs and devices. And an always-on society creates consumers who expect information instantaneously, 24/7, requiring near real-time analysis.
Trend: Advances in Algorithms
Advances in algorithms rarely make the news, but they have had a profound, if largely unrecognized, effect on our lives and on the ways data can be used. Deep neural networks (deep learning) excel at perception tasks and have been deployed in consumer products by companies such as Facebook and Google (through its Google Brain project).
Active learning uses algorithms for handling the “easier” or routine tasks, routing the difficult ones to humans. It’s been used for things like extracting unstructured data from web pages or transcribing with speech-to-text software.
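The routing idea described above can be sketched in a few lines of Python. This is a minimal illustration, not a full active-learning system: the `Prediction` type, the `route` function, the 0.9 confidence threshold, and the `toy_classifier` stand-in are all hypothetical names and values chosen for the example, and a real deployment would use a trained model and a tuned threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Prediction:
    label: str
    confidence: float  # assumed to lie in [0, 1]


def route(
    items: List[str],
    classify: Callable[[str], Prediction],
    threshold: float = 0.9,  # hypothetical cutoff; tune for the task
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split items into machine-handled results and a human review queue."""
    automated: List[Tuple[str, str]] = []
    needs_human: List[str] = []
    for item in items:
        pred = classify(item)
        if pred.confidence >= threshold:
            # "Easy" case: the model is confident, so accept its label.
            automated.append((item, pred.label))
        else:
            # "Difficult" case: send the item to a person.
            needs_human.append(item)
    return automated, needs_human


def toy_classifier(text: str) -> Prediction:
    """Hypothetical stand-in for a trained model."""
    if "invoice" in text.lower():
        return Prediction("invoice", 0.97)
    return Prediction("unknown", 0.40)


automated, needs_human = route(["Invoice #123", "hello there"], toy_classifier)
```

In a fuller system, the labels that people supply for the items in `needs_human` would typically be fed back to retrain the model, so the share of items it can handle on its own grows over time.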
Challenges: It is no longer enough that algorithms are accurate; in many cases they need to be scalable, fast, and interpretable as well. They also need to be secure, because algorithms can be used for evil as well as good.
Adversarial algorithms can be designed to attack intelligent systems. Secure machine learning (systems designed to withstand adversarial attacks) and other privacy and security measures will become increasingly important.
What trends and challenges are you concerned about with the rise of Big Data?