
Map a dataframe (with NA) to an n×n adjacency matrix (as a data.frame object)

adjacency-matrix reshape dataframe r

Chris T. · 6 years ago

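I have a three-column dataframe object recording bilateral trade data between 161 countries. The data are in dyadic format, with 19,687 rows and 3 columns: the reporter ( rid ), the partner ( pid ), and the bilateral trade flow between them ( TradeValue ) in a given year. rid and pid take values from 1 to 161, and a country is assigned the same value as rid and as pid . For any given pair ( rid , pid ) where rid =/= pid , TradeValue( rid , pid ) = TradeValue( pid , rid ).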

The data (run in R) look like this:

#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")

head(example_data, n = 10)
   rid pid TradeValue
1    2   3        500
2    2   7       2328
3    2   8    2233465
4    2   9      81470
5    2  12     572893
6    2  17     488374
7    2  19    3314932
8    2  23      20323
9    2  25         10
10   2  29    9026220

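The data were sourced from the UN Comtrade database. Each rid is paired with multiple pid to obtain their bilateral trade data, but as can be seen, not every pid has a numeric id value, because I only assigned a rid or pid to a country if a list of relevant economic indicators is available for that country. This is why there are NA values in the pid column even though a TradeValue exists between that country and the reporting country ( rid ). The same applies when a country acts as the "reporter": in that case the country did not report any TradeValue with its partners, and its id is absent from the rid column (hence, as the output above shows, the rid column starts at 2 rather than 1).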

length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
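The two counts above can be turned into an explicit list of which ids are absent. The following is a minimal sketch on a hypothetical toy data frame (`toy`, with ids 1 to 5 and made-up values), since the real file is not reproduced here; `all_ids`, `missing_reporters`, and `na_partner_rows` are illustrative names.

```r
# Hypothetical stand-in for example_data: ids run from 1 to 5
toy <- data.frame(
  rid = c(2, 2, 3, 3, 4),
  pid = c(3, NA, 2, 4, 3),
  TradeValue = c(10, 7, 10, 5, 5)
)
all_ids <- 1:5

# ids that never appear as a reporter
missing_reporters <- setdiff(all_ids, toy$rid)

# number of rows whose partner has no id assigned
na_partner_rows <- sum(is.na(toy$pid))
```

On the real data, replacing `toy` with `example_data` and `1:5` with `1:161` should list the reporters that need an all-zero row.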

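Most countries reported bilateral trade data with their partners, and those that did not tend to be small economies. I therefore want to keep the complete list of 161 countries and transform the example_data dataframe into a 161 x 161 adjacency matrix, where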

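for those countries that are absent from the rid column (e.g., rid == 1), a row is created for each and the entire row (in the 161 x 161 matrix) is set to 0.
for those countries ( pid ) that do not share a TradeValue entry with a particular rid , those cells are set to 0.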
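The two rules above could be sketched in base R by starting from an all-zero matrix and filling only the reported cells. This is an illustration on hypothetical toy data (4 countries, made-up values), not a tested solution for the real file; it also assumes that rows with an NA pid should simply be dropped, which the question does not state explicitly.

```r
# Hypothetical toy data: country 1 never reports, one partner has no id
toy <- data.frame(
  rid = c(2, 2, 3, 4, 4),
  pid = c(3, NA, 2, 2, 3),
  TradeValue = c(10, 99, 10, 6, 4)
)
n <- 4

# Assumption: rows with an unknown partner cannot be placed, so drop them
known <- toy[!is.na(toy$pid), ]

# Start from zeros so absent reporters and unreported pairs stay 0
adj <- matrix(0, nrow = n, ncol = n, dimnames = list(1:n, 1:n))

# Two-column matrix indexing: each (rid, pid) pair addresses one cell
adj[cbind(known$rid, known$pid)] <- known$TradeValue

adj_df <- as.data.frame(adj)
```

Because the matrix is pre-filled with zeros, no separate step is needed to create rows for missing reporters.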

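For example, suppose that in a 5 x 5 adjacency matrix, country 1 did not report any trade statistics with partners, while the other four countries reported bilateral trade statistics with each other (but not with country 1). The original dataframe looks like this: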

rid	pid	TradeValue
2	3	223
2	4	13
2	5	9
3	2	223
3	4	57
3	5	28
4	2	13
4	3	57
4	5	82
5	2	9
5	3	28
5	4	82

I want to convert this to a 5 x 5 adjacency matrix (in data.frame format), which should look like this:

	V1	V2	V3	V4	V5
1	0	0	0	0	0
2	0	0	223	13	9
3	0	223	0	57	28
4	0	13	57	0	82
5	0	9	28	82	0
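One way this layout might be produced from the 5-country table is `xtabs()` with factors whose levels cover every id, so that country 1 still gets a row and a column. This is only a sketch; `five`, `rid_f`, `pid_f`, and `adj_df` are illustrative names, and the `V1` to `V5` column labels are set by hand to match the table above.

```r
# The 5-country example from the question
five <- data.frame(
  rid = c(2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  pid = c(3, 4, 5, 2, 4, 5, 2, 3, 5, 2, 3, 4),
  TradeValue = c(223, 13, 9, 223, 57, 28, 13, 57, 82, 9, 28, 82)
)
ids <- 1:5

# Factors with the full set of levels keep ids that never occur
five$rid_f <- factor(five$rid, levels = ids)
five$pid_f <- factor(five$pid, levels = ids)

# Sum TradeValue per (rid, pid) cell; empty cells come out as 0
tab <- xtabs(TradeValue ~ rid_f + pid_f, data = five)

# Convert the 2-d table to a data.frame in matrix layout
adj_df <- as.data.frame.matrix(tab)
colnames(adj_df) <- paste0("V", ids)
```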

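I want to apply the same strategy to example_data to create a 161 x 161 adjacency matrix. However, after several rounds of trial and error with reshape , I still have not managed the conversion.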

It would be much appreciated if anyone could shed some light on this.

1 answer | 6 years ago

phil_t · 6 years ago

I could not read the dropbox file, but I have tried it with your 5-country example dataframe:

library(reshape2)  # provides dcast()

country_num = 5

# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = setdiff(1:country_num, example_data$pid)
# if no pid is missing, fall back to pid 1 so the dummy rows still have a valid pid
if (length(pid_miss) == 0) pid_miss = 1

# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)

# add dummy dataframe to original
example_data = rbind(example_data, add_data)

# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")

# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])

# fill in upper triangular matrix with missing values of lower triangular matrix 
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]

# change NAs to 0 according to preference - would keep as NA to differentiate 
# from actual zeros
mat[is.na(mat)] = 0
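To see what the transpose-fill step does on its own, here is a small sketch on a hypothetical 3 x 3 matrix (`m`, made-up values) in which each pair is recorded in only one direction. It assumes, as in the question, that the value for (i, j) equals the value for (j, i).

```r
# Hypothetical matrix: each off-diagonal pair is known in one direction only
m <- matrix(c(NA,  5, NA,
              NA, NA,  7,
               2, NA, NA), nrow = 3, byrow = TRUE)

filled <- m
# For every NA cell (i, j), take the value stored at (j, i) in the original
filled[is.na(filled)] <- t(m)[is.na(m)]

# Cells that are NA in both directions (here the diagonal) stay NA
filled
```

This only works when the matrix is square with rows and columns in the same order, which is why the dummy rows are added before `dcast()`.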

Does this help?