Cloud Wars: A Deep Dive into Data, Analytics and Reporting Services for Medallion Architecture Lakehouse in AWS, Azure, and GCP
Introduction
Crafting a medallion architecture lakehouse demands precision and foresight. Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) emerge as juggernauts, each offering a rich tapestry of data and reporting services. This blog post delves into the intricacies of these offerings, unravelling the nuances that can influence your decision-making process for constructing a medallion architecture lakehouse that stands the test of time.
1. Understanding Medallion Architecture: Where Lakes and Warehouses Converge
A lakehouse combines the flexibility of a data lake with the analytical capabilities of a data warehouse, and the medallion architecture is a way of organising the data inside it. Together they let organisations store, process, and analyse large and varied datasets efficiently, setting the stage for data-driven decision-making.
The medallion architecture is a data design pattern for logically organising data in a lakehouse, with the goal of incrementally improving the structure and quality of data as it flows through each layer. The layers denote the quality of the data they hold: bronze (raw), silver (validated), and gold (enriched). Both Microsoft and Databricks recommend this multi-layered approach for building a single source of truth (golden source) for enterprise data products. Data passes through multiple layers of validation and transformation before being stored in a layout optimised for efficient analytics; when the layers are built on a transactional table format such as Delta Lake, those writes are atomic, consistent, isolated, and durable (ACID). The medallion architecture does not replace dimensional modelling techniques. Schemas and tables within each layer can take a variety of forms and degrees of normalisation, depending on the frequency and nature of data updates and the downstream use cases for the data.
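As a rough illustration of how data might progress through the three layers, the following Python sketch uses plain in-memory records. The field names (order_id, amount, country), the validation rules, and the helper functions to_silver and to_gold are all hypothetical; a real lakehouse would typically run this kind of logic on an engine such as Spark against managed tables rather than Python lists.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as ingested, including bad and duplicate rows.
bronze = [
    {"order_id": "1", "amount": "19.99", "country": "gb"},
    {"order_id": "2", "amount": "not-a-number", "country": "US"},
    {"order_id": "1", "amount": "19.99", "country": "gb"},  # duplicate of the first row
    {"order_id": "3", "amount": "5.00", "country": "us"},
]


def to_silver(rows):
    """Validate, type-cast, and de-duplicate bronze rows (hypothetical rules)."""
    seen = set()
    silver_rows = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount is not numeric
        if row["order_id"] in seen:
            continue  # drop repeated order ids
        seen.add(row["order_id"])
        silver_rows.append(
            {
                "order_id": int(row["order_id"]),
                "amount": amount,
                "country": row["country"].upper(),
            }
        )
    return silver_rows


def to_gold(rows):
    """Aggregate silver rows into a reporting-friendly shape: total amount per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

The bronze list is never modified, which mirrors the usual practice of keeping raw data available so that the silver and gold layers can be rebuilt if the rules change.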
2. Data Services
Amazon Web Services (AWS):
- Storage:
- Amazon S3: A scalable object storage service, ideal for storing and retrieving any amount of data.
- ETL/ELT:
- AWS Glue: An ETL service that automates the process of discovering, cataloguing, and transforming data.
- Data Warehousing:
- Amazon Redshift: A fully managed data warehousing service that makes it simple and cost-effective to analyse all your data using standard SQL and your existing Business Intelligence (BI) tools.
Microsoft Azure:
- Storage:
- Azure Blob Storage: A massively scalable object storage for unstructured data.
- ETL/ELT:
- Azure Data Factory: A cloud-based data integration service for orchestrating and automating data workflows.
- Data Warehousing:
- Azure Synapse Analytics (formerly Azure SQL Data Warehouse): Integrates big data and data warehousing. It allows you to analyse both relational and non-relational data at petabyte-scale.
Google Cloud Platform (GCP):
- Storage:
- Google Cloud Storage: A unified object storage service with strong consistency and global scalability.
- ETL/ELT:
- Cloud Dataflow: A fully managed service for stream and batch processing.
- Data Warehousing:
- BigQuery: A fully managed, serverless, and highly scalable data warehouse that runs fast SQL queries using the processing power of Google’s infrastructure.
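The services listed in this section can be held in a small lookup structure, which may help when comparing like with like. This is only a sketch: the SERVICES dictionary, its category keys, and the equivalents helper are names invented for this example, and the entries simply repeat the services named above.

```python
# Services from this section, grouped by provider and category.
SERVICES = {
    "AWS": {
        "storage": "Amazon S3",
        "etl": "AWS Glue",
        "warehouse": "Amazon Redshift",
    },
    "Azure": {
        "storage": "Azure Blob Storage",
        "etl": "Azure Data Factory",
        "warehouse": "Azure Synapse Analytics",
    },
    "GCP": {
        "storage": "Google Cloud Storage",
        "etl": "Cloud Dataflow",
        "warehouse": "BigQuery",
    },
}


def equivalents(category):
    """Return the service each provider offers for one category."""
    return {provider: services[category] for provider, services in SERVICES.items()}


for category in ("storage", "etl", "warehouse"):
    print(category, equivalents(category))
```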
3. Analytics
Google Cloud Platform (GCP):
- Dataproc: A fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters.
- Dataflow: A fully managed service for stream and batch processing.
- Bigtable: A NoSQL database service for large analytical and operational workloads.
- Pub/Sub: A messaging service for event-driven systems and real-time analytics.
Microsoft Azure:
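- Azure Data Lake Analytics: An on-demand big data analytics service that integrates with Azure Data Lake Storage. Note that Microsoft retired this service in February 2024, so it is relevant mainly to existing workloads.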
- Azure HDInsight: A cloud-based service that makes it easy to process big data using popular frameworks like Hadoop, Spark, Hive, and more.
- Azure Databricks: An Apache Spark-based analytics platform that provides a collaborative environment and tools for data scientists, engineers, and analysts.
- Azure Stream Analytics: Helps in processing and analysing real-time streaming data.
- Azure Synapse Analytics: An analytics service that brings together big data and data warehousing.
Amazon Web Services (AWS):
- Amazon EMR (Elastic MapReduce): A cloud-native big data platform, allowing processing of vast amounts of data quickly and cost-effectively across resizable clusters of Amazon EC2 instances.
- Amazon Kinesis: Helps in real-time processing of streaming data at scale.
- Amazon Athena: A serverless, interactive analytics service that provides a simplified and flexible way to analyse petabytes of data where it lives in Amazon S3 using standard SQL expressions.
4. Report Writing Services: Transforming Data into Insights
- Amazon QuickSight: A business intelligence service for creating interactive dashboards and reports.
- Microsoft Power BI: A suite of business analytics tools for analysing data and sharing insights.
- Looker Studio (formerly Google Data Studio): A free and collaborative tool for creating interactive reports and dashboards.
5. Comparison Summary
- Storage: All three providers offer reliable and scalable object storage. Amazon S3, Azure Blob Storage, and Google Cloud Storage provide similar functionality for storing structured and unstructured data.
- ETL/ELT: AWS Glue, Azure Data Factory, and Cloud Dataflow offer ETL/ELT capabilities, allowing you to transform and prepare data for analysis.
- Data Warehousing: Amazon Redshift, Azure Synapse Analytics, and BigQuery are powerful data warehousing solutions that can handle large-scale analytics workloads.
- Analytics: Each provider covers the main processing styles. For Hadoop and Spark workloads there are Amazon EMR, Azure HDInsight and Azure Databricks, and Dataproc; for streaming data there are Amazon Kinesis, Azure Stream Analytics, and Dataflow with Pub/Sub. The choice between them depends on specific project needs, existing infrastructure, and the level of expertise within the development team.
- Report Writing: QuickSight, Power BI, and Looker Studio (formerly Data Studio) offer intuitive interfaces for creating interactive reports and dashboards.
- Integration: AWS, Azure, and GCP services integrate well within their respective ecosystems, which simplifies connectivity and data flow between the components of the lakehouse architecture. Azure integrates well with other Microsoft services. AWS has a vast ecosystem and supports a wide variety of third-party integrations. GCP is known for its integration with other Google services and tools.
- Cost: Pricing models vary across providers and services. It’s essential to compare the costs based on your specific usage patterns and requirements. Each provider offers calculators to estimate costs.
- Ease of Use: All three platforms offer user-friendly interfaces and APIs. The choice often depends on the specific needs of the project and the familiarity of the development team.
- Scalability: All three platforms provide scalability options, allowing you to scale your resources up or down based on demand.
- Performance: Performance can vary based on the specific service and configuration. It’s recommended to run benchmarks or tests based on your use case to determine the best-performing platform for your needs.
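A benchmark of the kind suggested under Performance can start from a very small timing harness. The sketch below is one possible shape, not a definitive method: benchmark, fake_query, and the platform_a and platform_b labels are hypothetical, and fake_query is a local stand-in that would be replaced by a function submitting a representative query to each platform through its own client library.

```python
import statistics
import time


def benchmark(run_query, repeats=5):
    """Time a zero-argument callable and report median and worst wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return {"median_s": statistics.median(durations), "max_s": max(durations)}


def fake_query():
    """Stand-in workload; replace with a real query against the platform under test."""
    sum(i * i for i in range(100_000))


# Hypothetical candidates: each value should run the same query on a different platform.
candidates = {"platform_a": fake_query, "platform_b": fake_query}

results = {name: benchmark(fn) for name, fn in candidates.items()}
fastest = min(results, key=lambda name: results[name]["median_s"])

for name, stats in results.items():
    print(f"{name}: median {stats['median_s']:.4f}s, max {stats['max_s']:.4f}s")
print("fastest by median:", fastest)
```

Reporting the median rather than a single run reduces the influence of cold starts and caching, which can differ considerably between services.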
6. Decision-Making Factors: Integration, Cost, and Expertise
- Integration: Evaluate how well the services integrate within their respective ecosystems. Seamless integration ensures efficient data flow and interoperability.
- Cost Analysis: Conduct a detailed analysis of pricing structures based on storage, processing, and data transfer requirements. Consider potential scalability and growth factors in your evaluation.
- Team Expertise: Assess your team’s proficiency with specific tools. Adequate training resources and community support are crucial for leveraging the full potential of chosen services.
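The cost analysis described above can be roughed out in a few lines before turning to the providers' own calculators. In this sketch the Rates class, the monthly_cost and projected_costs functions, and every price and usage figure are hypothetical placeholders, not real prices from any provider. It assumes a simple model in which only stored data grows, at a fixed monthly rate.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical unit prices; replace with figures from a provider's pricing calculator."""

    storage_per_gb_month: float
    compute_per_hour: float
    egress_per_gb: float


def monthly_cost(rates, stored_gb, compute_hours, egress_gb):
    """Cost for one month as the sum of storage, processing, and data transfer."""
    return (
        stored_gb * rates.storage_per_gb_month
        + compute_hours * rates.compute_per_hour
        + egress_gb * rates.egress_per_gb
    )


def projected_costs(rates, stored_gb, compute_hours, egress_gb, monthly_growth, months):
    """Cost for each month, with stored data growing by a fixed rate every month."""
    costs = []
    for _ in range(months):
        costs.append(monthly_cost(rates, stored_gb, compute_hours, egress_gb))
        stored_gb *= 1 + monthly_growth
    return costs


# Placeholder numbers chosen only to make the arithmetic easy to follow.
example_rates = Rates(storage_per_gb_month=0.02, compute_per_hour=0.50, egress_per_gb=0.10)
costs = projected_costs(
    example_rates,
    stored_gb=1000,
    compute_hours=100,
    egress_gb=50,
    monthly_growth=0.10,
    months=3,
)
for month, cost in enumerate(costs, start=1):
    print(f"month {month}: {cost:.2f}")
```

Running the same projection with each provider's rates gives a like-for-like comparison under identical usage assumptions, which is usually more informative than comparing list prices alone.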
Conclusion: Navigating the Cloud Maze for Medallion Architecture Excellence
Selecting the right combination of data and reporting services for your medallion architecture lakehouse is not a decision to be taken lightly. AWS, Azure, and GCP offer powerful solutions, each tailored to different organisational needs. By comprehensively evaluating your unique requirements against the strengths of these platforms, you can embark on your data management journey with confidence. Stay vigilant, adapt to innovations, and let your data flourish in the cloud – ushering in a new era of data-driven excellence.

Comparison between Cloud Service Providers – GCP, AWS, Azure https://cloud.google.com/docs/get-started/aws-azure-gcp-service-comparison