What is Amazon Athena?
Amazon Athena is an interactive query service for analysing data in Amazon S3 using standard SQL. Athena is serverless, and you pay only for the queries that you run.
With Athena, you point to your data in Amazon S3, define the schema, and start querying using standard SQL. There’s no need for complex ETL jobs to prepare your data for analysis.
Athena integrates with the AWS Glue Data Catalog out of the box. This lets you create a unified metadata repository across various services, crawl data sources to discover schemas, populate your Catalog with new and modified table and partition definitions, and maintain schema versioning.
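As a rough illustration of the "point, define, query" flow, the sketch below builds a `CREATE EXTERNAL TABLE` statement for comma-delimited files and shows how a statement might be submitted through boto3's Athena client. The database, table, column names, and the `s3://example-bucket/...` paths are hypothetical, and the helper names `build_create_table_ddl` and `run_query` are introduced only for this example.

```python
def build_create_table_ddl(database, table, columns, s3_location):
    """Return a CREATE EXTERNAL TABLE statement for comma-delimited files in S3.

    `columns` is a list of (name, sql_type) pairs. All names here are
    hypothetical placeholders, not values from a real account.
    """
    column_defs = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_defs}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )


def run_query(sql, database, output_location):
    """Submit a statement to Athena and return its query execution ID.

    Requires boto3 and AWS credentials; imported lazily so the DDL builder
    above can be used without either.
    """
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical example: web request logs stored as CSV under one S3 prefix.
ddl = build_create_table_ddl(
    database="weblogs",
    table="requests",
    columns=[("request_time", "string"), ("status", "int")],
    s3_location="s3://example-bucket/logs/",
)
print(ddl)
```

Once the table is defined, an ordinary query such as `SELECT status, COUNT(*) FROM requests GROUP BY status` could be passed to `run_query` in the same way, with `output_location` pointing at an S3 prefix where Athena may write results.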
What is Redshift Spectrum?
Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Thus, Redshift Spectrum queries use much less of your cluster’s processing capacity than other queries. Redshift Spectrum also scales intelligently. Based on the demands of your queries, Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing.
You create Redshift Spectrum tables by defining the structure for your files and registering them as tables in an external data catalog. The external data catalog can be AWS Glue, the data catalog that comes with Amazon Athena, or your own Apache Hive metastore.
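To make the catalog registration step more concrete, here is a minimal sketch that builds a Redshift `CREATE EXTERNAL SCHEMA` statement pointing at either the AWS Glue/Athena data catalog or a Hive metastore. The schema name, database name, role ARN, and metastore host are hypothetical, and `build_external_schema_sql` is a helper invented for this example.

```python
def build_external_schema_sql(schema, database, iam_role_arn, hive_uri=None):
    """Return a CREATE EXTERNAL SCHEMA statement for Redshift Spectrum.

    With no `hive_uri`, the schema is backed by the AWS Glue / Athena data
    catalog; otherwise it is backed by the Hive metastore at that host.
    """
    if hive_uri is None:
        source = f"FROM DATA CATALOG\nDATABASE '{database}'"
    else:
        source = f"FROM HIVE METASTORE\nDATABASE '{database}'\nURI '{hive_uri}'"
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"{source}\n"
        f"IAM_ROLE '{iam_role_arn}'"
    )


# Hypothetical values for illustration only.
glue_sql = build_external_schema_sql(
    schema="spectrum",
    database="weblogs",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
hive_sql = build_external_schema_sql(
    schema="spectrum_hive",
    database="weblogs",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    hive_uri="metastore.example.internal",
)
print(glue_sql)
```

After the schema exists, external tables registered in that catalog can be referenced from the cluster as `spectrum.<table>` and joined with local Redshift tables.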
Athena vs Redshift Spectrum
Both are serverless and priced on a pay-as-you-go basis. The processing costs are the same, at around $5 per terabyte scanned. However, with Spectrum you also pay for the Redshift cluster, which can cost between $0.25 and $13.00 an hour depending on the vCPUs, memory and storage of the cluster.
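The cost difference can be sketched with some simple arithmetic. The example below uses the approximate $5 per terabyte figure quoted above; the workload (10 TB scanned in a month) and the choice of the cheapest $0.25 per hour cluster running for 720 hours are assumptions made purely for illustration.

```python
# Approximate scan price quoted in the text, in dollars per terabyte.
PRICE_PER_TB_SCANNED = 5.00


def athena_cost(tb_scanned):
    """Athena charges only for the data scanned by queries."""
    return tb_scanned * PRICE_PER_TB_SCANNED


def spectrum_cost(tb_scanned, cluster_hourly_rate, cluster_hours):
    """Spectrum charges for data scanned plus the running Redshift cluster."""
    scan_charge = tb_scanned * PRICE_PER_TB_SCANNED
    cluster_charge = cluster_hourly_rate * cluster_hours
    return scan_charge + cluster_charge


# Assumed workload: 10 TB scanned in a month, smallest cluster up all month.
monthly_athena = athena_cost(10)
monthly_spectrum = spectrum_cost(10, cluster_hourly_rate=0.25, cluster_hours=720)

print(f"Athena:   ${monthly_athena:.2f}")    # Athena:   $50.00
print(f"Spectrum: ${monthly_spectrum:.2f}")  # Spectrum: $230.00
```

For a team that already runs the cluster for other workloads, the cluster charge is a sunk cost and only the scan charge is incremental.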
If you are already a Redshift customer, then moving data out of Redshift into S3 and using Spectrum offers significant cost savings for storing large volumes of data. The processing is unchanged.
If you are already using Athena, then stick with it, as it offers much the same capabilities as Spectrum without the cost of running a Redshift cluster. Athena is also being enhanced with federated queries and with ML (the ability to call SageMaker).
It is also worth considering which analytic tool you plan to use. Not all tools are compatible with Athena and Spectrum.
The choice may boil down to the complexity of the task in hand:
- Amazon Athena provides the easiest way to run ad-hoc queries for data in S3 without the need to set up or manage any servers.
- Amazon Redshift provides the fastest query performance for enterprise reporting and business intelligence workloads, particularly those involving extremely complex SQL with multiple joins and sub-queries.
- Amazon EMR makes it simple and cost effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto incorporating custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements.
Athena Connectors
With Athena you can connect to data sources other than S3:
- Athena AWS CMDB Connector
- Amazon Athena CloudWatch Connector
- Amazon Athena CloudWatch Metrics Connector
- Amazon Athena DocumentDB Connector
- Amazon Athena DynamoDB Connector
- Amazon Athena Elasticsearch Connector
- Amazon Athena HBase Connector
- Amazon Athena Connector for JDBC-Compliant Data Sources
- Amazon Athena Redis Connector
- Amazon Athena TPC Benchmark DS (TPC-DS) Connector
Spectrum Connectors
With Spectrum you can connect to external tables:
- Amazon Redshift
- AWS Glue
- Amazon Athena
- Apache Hive metastore