Tag: AWS

Assuring Data Quality: How to Build a Serverless Data Quality Gate on AWS

Data is a vital element in business decision-making. Modern technologies and algorithms allow for processing and storage of huge amounts of data, converting it into useful predictions and insights. But they also require high-quality data to ensure prediction accuracy and insight value. In today’s world, the importance of data quality validation is hard to overestimate. …

Autonomous data observability and quality within AWS Glue Data Pipeline

Data operations and engineering teams spend 30-40% of their time firefighting data issues raised by business stakeholders. A large percentage of these data errors can be attributed to errors present in the source system, or to errors that occurred, or could have been detected, in the data pipeline. Current data validation approaches for the data …

Azure vs AWS: Which Cloud platform to choose for Big Data & Analytics solutions?

In an increasingly data-driven business environment, enterprises are focusing their strategy on deriving meaningful insights from their vast amounts of data. According to Gartner, as of 2017, 75% of enterprises had already invested in technology that facilitates data analysis. Among the many cloud vendors available, Microsoft Azure and Amazon Web Services (AWS) are the top cloud platforms that enterprises are …

Amazon EMR introduces EMR runtime for Apache Spark

Amazon EMR is happy to announce Amazon EMR runtime for Apache Spark, a performance-optimized runtime environment for Apache Spark that is active by default on Amazon EMR clusters. EMR runtime for Spark is up to 32 times faster than EMR 5.16, with 100% API compatibility with open-source Spark. This means that your workloads run faster, …

Introducing S3Guard: S3 Consistency for Apache Hadoop

Synopsis: This article introduces a new Apache Hadoop feature called S3Guard. S3Guard addresses one of the major challenges of running Hadoop on Amazon’s Simple Storage Service (S3): eventual consistency. We outline the problem of S3’s eventual consistency, describe how it affects Hadoop workloads, and explain how S3Guard works. Problem: Although Apache Hadoop has support for using …