Big Data Processing 2.0
5th Aug 2013
As commercial, widespread adoption of Big Data processing platforms becomes the industry standard, it seems inevitable that the core business sectors currently benefiting most from this new form of derived business insight (civil, defense, medical, pharmaceutical and, of course, marketing) will drive technology innovation toward new standards in Big Data processing.
While the many vendors of Big Data platforms respond to that industry demand by maturing their own service offerings, we can already begin to identify, from current practice, how Big Data processing might evolve through future innovation.
Big Data processing is still largely implemented as a batch process and is poorly suited to time-sensitive data. Hadoop, the foremost open-source platform for running MapReduce jobs (parallel data processing), excels at processing massive amounts of data in a single operation on an extended schedule (hourly, nightly or less frequently), but not as a data-streaming or complex event processing (CEP) service.
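To make the batch model concrete, here is a minimal single-process sketch of the MapReduce pattern as a word count. The function names and toy corpus are invented for illustration; a real Hadoop job would distribute the same two phases across a cluster and run them over a full dataset at once, which is exactly why results only arrive when the whole batch completes.

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Toy corpus standing in for a massive dataset processed in one operation.
docs = ["big data processing", "big data platforms"]
print(reduce_phase(map_phase(docs)))
```

The key property is that nothing is reported until every record has passed through both phases, which is what makes the model a poor fit for time-sensitive data.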
Open-source packages such as Storm and Yahoo's S4 (now an Apache Incubator project) are early-stage frameworks for high-performance, real-time computation.
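The contrast with batch processing can be sketched as a rolling counter that updates state per event. This is only a loose, single-process stand-in for the kind of component a framework like Storm distributes (the class and method names are invented for illustration), but it shows the defining difference: results are available after every event rather than after the whole dataset.

```python
from collections import defaultdict

class RollingCounter:
    """Toy stand-in for a streaming computation: state is updated
    incrementally as each event arrives, never waiting for a batch."""
    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, word):
        self.counts[word] += 1
        # The running total is queryable immediately after each event.
        return self.counts[word]

stream = ["error", "ok", "error"]  # events arriving one at a time
counter = RollingCounter()
for word in stream:
    counter.on_event(word)
print(dict(counter.counts))
```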
Although initiatives from larger service providers like Amazon Web Services have been revolutionary in delivering cost-effective solutions for Big Data processing, organizations can still run up significant costs without the right strategy and diligent internal governance. Amazon does continue to lower prices regularly across its selection of services, so it may seem reasonable to wonder when medium- to large-scale (25+ node clusters) parallelized data processing platforms will be free.
Mature unstructured data warehousing schemas
Within the data management and software community there is currently a significant lack of experience in building scalable, extensible data warehousing schemas for unstructured data.
Core challenges in schema design for unstructured data include the trade-off between data granularity, processing time and readiness for analysis and reporting (whether to store data pre- or post-aggregation), as well as designing dimensionality for seemingly unknowable future data needs.
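The pre- versus post-aggregation trade-off can be shown with a small sketch. The event log and rollup below are invented for illustration: raw events keep every dimension available for future questions but make every report a full scan, while a rolled-up table answers its one question instantly and discards the rest.

```python
from collections import defaultdict

# Hypothetical raw event log: (day, page, visits). Fine-grained, so any
# future dimension (per-page, per-day, per-path) can still be derived,
# but every report must re-scan all rows.
raw_events = [
    ("2013-08-01", "/home", 1),
    ("2013-08-01", "/home", 1),
    ("2013-08-01", "/pricing", 1),
    ("2013-08-02", "/home", 1),
]

# Post-aggregation: roll up to daily totals. Reporting is now a single
# lookup, but the per-page dimension is gone; a new "visits by page"
# question can no longer be answered from this table.
daily_totals = defaultdict(int)
for day, page, visits in raw_events:
    daily_totals[day] += visits

print(dict(daily_totals))
```

Storing both forms doubles storage and processing; storing only one commits the schema to today's known questions. That is the granularity dilemma in miniature.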
Maturing security and rendering standards for real-time data streaming in Web browsers will greatly ease implementation, and will encourage businesses to adopt real-time eventing in their analytics.
Designing and running MapReduce jobs, or performing most data processing tasks, remains a specialist skillset. As adoption of MapReduce continues, its accessibility should extend to those outside data science and IT. Ideally, we should begin to see more publications and business-centric MapReduce examples as the paradigm is more widely adopted.
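A business-centric MapReduce example of the kind the paragraph calls for might look like the sketch below: revenue per region expressed as a map step and a reduce step. The record layout and function names are invented for illustration; the point is that a familiar business question maps cleanly onto the paradigm.

```python
from collections import defaultdict

def map_sale(sale):
    """Map step: express the business question as a (region, amount) pair."""
    return (sale["region"], sale["amount"])

def reduce_sales(pairs):
    """Reduce step: total the amounts per region."""
    totals = defaultdict(float)
    for region, amount in pairs:
        totals[region] += amount
    return dict(totals)

# Hypothetical sales records.
sales = [
    {"region": "EMEA", "amount": 120.0},
    {"region": "APAC", "amount": 80.0},
    {"region": "EMEA", "amount": 50.0},
]
print(reduce_sales(map(map_sale, sales)))
```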
As industry experience with paradigms such as MapReduce matures, and as expectations rise with the changing size, cost and frequency of data processing, frameworks like Storm will play a significant role in facilitating the delivery and broader application of Big Data processing. Complementary browser technologies and thin-client scripting libraries, hosting providers like Amazon, and in-memory persistence platforms such as Redis will also play significant roles in shaping what we may come to know as Big Data 2.0.