How do I integrate DynamoDB into EMR?

Processing DynamoDB data with Apache Hive in Amazon EMR

Amazon DynamoDB is integrated with Apache Hive, a data warehousing application that runs on Amazon EMR. Hive can read and write data in DynamoDB tables, which lets you:

  • Query live DynamoDB data with a SQL-like language (HiveQL).

  • Copy data from a DynamoDB table to an Amazon S3 bucket and vice versa.

  • Copy data from a DynamoDB table to the Hadoop Distributed File System (HDFS) and vice versa.

  • Perform joins on DynamoDB tables.
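
As a rough illustration of how Hive reaches a DynamoDB table, the sketch below builds the HiveQL statement that maps a Hive external table onto a DynamoDB table through the DynamoDB storage handler. The Python helper and the table, column, attribute, and bucket names are hypothetical and chosen only for this example; the resulting statements would be run in Hive on the cluster.

```python
def dynamodb_external_table_ddl(hive_table, dynamodb_table, columns):
    """Build a HiveQL CREATE EXTERNAL TABLE statement backed by DynamoDB.

    columns is a list of (hive_column, hive_type, dynamodb_attribute) tuples.
    """
    column_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    # The column mapping pairs each Hive column with a DynamoDB attribute name.
    mapping = ",".join(f"{name}:{attr}" for name, _, attr in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        "TBLPROPERTIES (\n"
        f'  "dynamodb.table.name" = "{dynamodb_table}",\n'
        f'  "dynamodb.column.mapping" = "{mapping}"\n'
        ");"
    )


# Hypothetical table and attribute names, for illustration only.
ddl = dynamodb_external_table_ddl(
    "ddb_features",
    "Features",
    [("feature_id", "BIGINT", "Id"), ("feature_name", "STRING", "Name")],
)
print(ddl)

# Once the external table exists, the operations listed above are plain HiveQL.
# Query live data:
query = "SELECT feature_name FROM ddb_features WHERE feature_id > 1000;"
# Copy the table to Amazon S3 (the bucket name is a placeholder):
export = "INSERT OVERWRITE DIRECTORY 's3://amzn-s3-demo-bucket/features/' SELECT * FROM ddb_features;"
```

Generating the statement from a column list keeps the Hive column definitions and the DynamoDB attribute mapping in sync, which is the part that is easiest to get wrong by hand.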

Overview

Amazon EMR is a web service that makes it easy to process large amounts of data quickly and cost-effectively. To use Amazon EMR, you launch a managed cluster of Amazon EC2 instances running the open source Hadoop framework. Hadoop is a distributed application that implements the MapReduce algorithm, in which a task is mapped to multiple nodes in the cluster. Each node processes its assigned work in parallel with the other nodes. Finally, the outputs are reduced on a single node, which yields the final result.
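
The map and reduce phases described above can be sketched in a few lines of Python. This is a single-process word count meant only to show the shape of the algorithm, not how Hadoop is implemented; the function names and sample input are made up for the example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: turn one chunk of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group all values by key so each key can be reduced independently."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk stands in for the slice of data handed to one node.
chunks = ["hive reads dynamodb", "hive writes dynamodb", "emr runs hive"]

# On a real cluster the map calls run in parallel on separate nodes;
# here they simply run one after another.
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```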

You can launch your Amazon EMR cluster as either persistent or transient:

  • A persistent cluster runs until you shut it down. Persistent clusters are ideal for data analysis, data warehousing, and other interactive uses.

  • A transient cluster runs long enough to process a job flow and then shuts down automatically. Transient clusters are ideal for periodic processing tasks such as running scripts.

For more information about Amazon EMR architecture and administration, see the Amazon EMR Management Guide.

When you launch an Amazon EMR cluster, you specify the initial number and type of Amazon EC2 instances. You also specify other distributed applications (in addition to Hadoop) to run on the cluster. These applications include Hue, Mahout, Pig, and Spark, among others.

For more information about applications for Amazon EMR, see the Amazon EMR Release Guide.

Depending on the cluster configuration, a cluster has one or more of the following node types:

  • Master Node - Manages the cluster by coordinating the distribution of the MapReduce executables and subsets of the raw data to the core and task instance groups. The master node also tracks the status of each task performed and monitors the health of the instance groups. There is only one master node in a cluster.

  • Core Node - Runs MapReduce tasks and stores data using the Hadoop Distributed File System (HDFS).

  • Task Node (optional) - Performs MapReduce tasks.
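
To make the launch options concrete, the sketch below assembles the parameters that the boto3 EMR client's run_job_flow call accepts: the applications to install, one instance group per node type, and the KeepJobFlowAliveWhenNoSteps flag, which is one way to distinguish a persistent cluster from a transient one. The helper name, cluster name, instance type, release label, and the default role names are assumptions for illustration; the code only builds the request and does not contact AWS.

```python
def build_cluster_request(name, persistent, core_count, task_count=0,
                          instance_type="m5.xlarge",
                          applications=("Hadoop", "Hive")):
    """Assemble keyword arguments for the boto3 EMR run_job_flow call."""
    groups = [
        # Exactly one master node coordinates the cluster.
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and store data in HDFS.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes are optional and only run tasks.
        groups.append({"Name": "Task", "InstanceRole": "TASK",
                       "InstanceType": instance_type, "InstanceCount": task_count})
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one that suits you
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": groups,
            # True keeps the cluster running after its steps finish (persistent);
            # False lets it shut down automatically (transient).
            "KeepJobFlowAliveWhenNoSteps": persistent,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("hive-dynamodb-demo", persistent=True,
                                core_count=2, task_count=1)
print(request["Instances"]["InstanceGroups"])

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

A transient cluster built this way would normally also carry a Steps list describing the job flow to run before it shuts down.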
