Hadoop has brought revolutionary change to the world of big data. However, the difficulties involved in moving and managing data within it are of no less magnitude. Handling the whole chain of data motion, process orchestration, data discovery and lifecycle management is no easy feat, despite the well-developed ecosystem that Hadoop possesses.
Enter Apache Falcon. Aimed at simplifying the onboarding of feed processing and feed management onto Hadoop clusters, this open source framework gives end users a thorough understanding of how, where and when their data is being managed at every point in its lifecycle.
What does Apache Falcon do?
The three important parts of data management are cleaning the data in Hadoop, making it ready for business intelligence tools, and retiring it to its rightful place once its utility is served. Falcon applies a higher layer of abstraction to simplify the development and management of the pipelines that process this data. With its data management services, it takes the complex coding out of the equation, making the process easier for the many users who operate on Hadoop's big data.
With Falcon, a single huge dataset stored in HDFS can be processed for batch, streaming and interactive applications alike, which means application developers on Hadoop have a much easier time building their products. To define, deploy and manage data pipelines, the Falcon framework leverages other HDP components such as Pig and Oozie, with Oozie coordinating the underlying workflows. Falcon also exposes open APIs that orchestrate the workflow templates used for data management, providing integration with data warehouse systems.
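To make this concrete, here is a minimal sketch, built around a hypothetical email-cleansing pipeline, of what a Falcon process entity can look like and how it might be submitted over Falcon's REST API from Python. The cluster, feed and process names, the Pig script path, the server host and port, and the pseudo-auth user are all illustrative assumptions rather than details from this article; check the entity schema and endpoint paths against your own Falcon release.

```python
# Illustrative sketch (not a drop-in config): a Falcon process entity that runs
# a Pig script hourly over an input feed, plus helpers that submit and schedule
# it through Falcon's REST API. Host, port, user, cluster and feed names, and
# the Pig script path are assumed placeholders.
import requests

CLEANSE_PROCESS = """<?xml version="1.0" encoding="UTF-8"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2024-01-01T00:00Z" end="2025-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
  </outputs>
  <workflow engine="pig" path="/apps/pig/cleanse-email.pig"/>
  <retry policy="periodic" delay="minutes(30)" attempts="3"/>
</process>
"""

FALCON = "http://falcon.example.com:15000"   # assumed Falcon server endpoint
AUTH = {"user.name": "falcon"}               # pseudo-auth query parameter

def submit(entity_type: str, entity_xml: str) -> str:
    """Submit a cluster/feed/process entity definition (XML) to Falcon."""
    r = requests.post(f"{FALCON}/api/entities/submit/{entity_type}",
                      params=AUTH, data=entity_xml,
                      headers={"Content-Type": "text/xml"})
    r.raise_for_status()
    return r.text

def schedule(entity_type: str, name: str) -> str:
    """Schedule a submitted entity; Falcon hands execution to Oozie coordinators."""
    r = requests.post(f"{FALCON}/api/entities/schedule/{entity_type}/{name}",
                      params=AUTH)
    r.raise_for_status()
    return r.text

# Clusters and feeds are submitted first, then the process that ties them together:
# submit("process", CLEANSE_PROCESS)
# schedule("process", "cleanseEmailProcess")
```

Notice that the definition is purely declarative: the developer writes no Oozie coordinator XML or scheduling code, because Falcon generates and manages those workflows from the entity definition.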
Here’s a point-by-point lowdown of how Falcon operates:
- It establishes relationships between the various data and processing elements in a Hadoop environment.
- It provides feed management services such as feed retention, archival and replication across clusters (a feed definition illustrating these services is sketched after this list).
- It makes it easy to onboard new pipelines/workflows and provides support for retry policies and late data handling.
- It allows integration with metastore/catalog like Hive/HCatalog.
- It notifies end customers based on the availability of feed groups.
- It enables use cases for local processing in colos and global aggregations.
- It captures lineage information for processes and feeds.
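To illustrate the feed-level services in the list above (retention, cross-cluster replication, late data handling), here is a hedged sketch of a feed entity definition that the submit helper from the earlier sketch could post. The cluster names, paths, dates and limits are placeholders, and the exact elements should be verified against the Falcon version you run.

```python
# Illustrative feed entity: retention on the source cluster, replication to a
# target cluster, a late-arrival window, and a dated HDFS location.
# All names, paths, dates and limits below are assumed placeholders.
RAW_EMAIL_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed name="rawEmailFeed" description="Raw email data" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <late-arrival cut-off="hours(4)"/>                    <!-- wait up to 4h for late data -->
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2024-01-01T00:00Z" end="2025-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>     <!-- evict instances after 90 days -->
    </cluster>
    <cluster name="backupCluster" type="target">        <!-- replicated copy -->
      <validity start="2024-01-01T00:00Z" end="2025-12-31T00:00Z"/>
      <retention limit="months(12)" action="delete"/>   <!-- keep replicas longer -->
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/email/raw/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
"""

# submit("feed", RAW_EMAIL_FEED)   # using the REST helper sketched earlier
```

Once such a feed is scheduled, Falcon itself evicts expired instances and replicates new ones to the target cluster, without the user writing any pipeline code for those chores.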
Basically, Falcon addresses data motion, data discovery, operability, process orchestration and scheduling, and policy-based lifecycle management: concerns that go beyond the scope of traditional ETL. Moreover, Falcon creates additional opportunities by building on components already present in the Hadoop ecosystem.
In a larger context, though, data management technologies in the Hadoop ecosystem are still at a very nascent stage. The lack of expertise, the many complexities in processing and distributing structured and unstructured data, and the high costs of collection and storage are some of the factors that impede the goal of better data management. With Apache Falcon, the first steps towards that goal have been taken.
What do you think about it? Share in the comments section.