I want to use pyspark and the emr-dynamodb-connector to read an entire DynamoDB table into an RDD, or preferably into a DataFrame. My code is below.
dynamodb.py
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "user_state_test",
    "dynamodb.output.tableName": "user_state_test",
    "dynamodb.endpoint": "https://dynamodb.us-west-2.amazonaws.com",
    "dynamodb.regionid": "us-west-2",
    "mapred.output.format.class": "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat",
    "mapred.input.format.class": "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat"
}

dynamoRDD = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf
)
count = dynamoRDD.count()
print(count)
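
For context, this is roughly what I was hoping to do once the read works, to get a DataFrame rather than a raw RDD. It is only a sketch: I have not run it, and how DynamoDBItemWritable surfaces on the Python side is an assumption I would first verify with dynamoRDD.first().

import json

# Sketch only: assumes each value in the pair RDD can be serialized to a JSON
# string of the item's attributes; the real shape of kv[1] may differ.
items = dynamoRDD.map(lambda kv: kv[1])
df = spark.read.json(items.map(lambda item: json.dumps(item)))
df.printSchema()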
To make the emr-dynamodb-connector available, I built it with Maven following the awslabs instructions:

- clone the repo
- run mvn clean install
- the build produces emr-dynamodb-hadoop-4.8.0-SNAPSHOT.jar under the emr-dynamodb-hadoop module

I copied this jar into the repo where my code lives and renamed it to emr-dynamodb-hadoop.jar. I then submit the job with:
spark-submit --master "local[4]" --jars /Users/vaerk/dev/myproject/emr-dynamodb-hadoop.jar dynamodb.py
The job fails with:

java.lang.ClassNotFoundException: com.amazonaws.services.dynamodbv2.model.AttributeValue
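
To sanity-check what I am actually shipping with --jars, I can list the jar's contents, e.g. with the snippet below (my own quick check, not part of the awslabs instructions; the class path is copied from the stack trace):

import zipfile

# Inspect the jar passed to --jars and report whether the class from the
# stack trace is bundled, plus which com.amazonaws classes (if any) are.
JAR = "/Users/vaerk/dev/myproject/emr-dynamodb-hadoop.jar"
MISSING = "com/amazonaws/services/dynamodbv2/model/AttributeValue.class"

with zipfile.ZipFile(JAR) as jar:
    names = jar.namelist()
    print(MISSING in names)
    print([n for n in names if n.startswith("com/amazonaws/")][:10])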
My question: how do I resolve this ClassNotFoundException so that the table can be read into an RDD or DataFrame?