
How to convert columns into matrices and a table stored in a list under complicated conditions [R]

purrr data-conversion igraph dplyr r

HSJ · Tech Community · 6 years ago

I have a data frame that contains information on the trips made by the members of a household in one day.

df <- data.frame(
  hid = c("10001", "10001", "10001", "10001"),
  mid = c(1, 2, 3, 4),
  thc = c("010", "01010", "0", "02030"),
  mdc = c("000", "01010", "0", "02020"),
  thc1 = c(0, 0, 0, 0),
  thc2 = c(1, 1, NA, 2),
  thc3 = c(0, 0, NA, 0),
  thc4 = c(NA, 1, 0, 3),
  thc5 = c(NA, 0, NA, 0),
  mdc1 = c(0, 0, 0, 0),
  mdc2 = c(0, 1, NA, 2),
  mdc3 = c(0, 0, NA, 0),
  mdc4 = c(NA, 1, NA, 2),
  mdc5 = c(NA, 0, NA, 0)
)

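hid: household ID (the actual data frame contains many more households)
mid: household member ID
thc: a string giving the order of each member's movements during the day;
0 = home; 1, 2, ... = the unique ID of a place he/she visited

So if it is coded as 01020, he/she went from home (0) to place 1, returned home (0), then went to another place 2, and returned home (0) within the day.

The IDs in thc are also split into one column each: thc1, thc2, thc3, thc4 and thc5. The number of thc columns is set by the longest movement sequence in the household.
If one member's sequence has 5 codes and another member's has 3, the other member's thc4 and thc5 are filled with NA.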

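mdc: a variable describing the activity carried out at each place, for example 1 = work, 2 = school. It is also split into the columns that follow (mdc1 to mdc5).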
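
To make the thc encoding concrete, here is a minimal sketch that decodes one movement string into its consecutive origin/destination steps. The helper name decode_trips is hypothetical (it does not appear in the post), and the sketch assumes every area ID is a single digit, as in the sample data.

```r
# Hypothetical helper (not from the original post): decode a movement string
# into its consecutive origin -> destination steps.
# Assumes every area ID is a single digit (0-9), as in the sample data.
decode_trips <- function(code) {
  stops <- as.integer(strsplit(as.character(code), "")[[1]])
  data.frame(
    from = head(stops, -1),  # every stop except the last
    to   = tail(stops, -1)   # every stop except the first
  )
}

# "01020": home -> 1 -> home -> 2 -> home, i.e. four steps
trips <- decode_trips("01020")
print(trips)
```

A member who stays home all day (code "0") yields a data frame with zero rows, since there is no step to record.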

Now, what I want to get from df is a list containing the adjacency matrix and node list for network analysis with, e.g., igraph.

This is the desired result:

# Desired list
[1] # It represents first element grouped by `hid`.
    # In the actual data frame, there are around 40,000
    # households which contains different `hid`.

$hid # `hid` of each record
[1]10001
[2]10001
[3]10001
[4]10001

$mid # `mid` of each record
[1]1
[2]2
[3]3
[4]4

$trip # `adjacency matrix` of each `mid`
      # head of line indicates destination area id
      # leftmost column indicates origin area id
      # for example of [1], 'mid'=1 took 1 trip from 0 to 1 and 1 trip from 1 to 0
[1] # It represents `mid`=1
  0 1
0 0 1
1 1 0
[2] # It represents `mid`=2 
  0 1
0 0 2
1 2 0
[3]
  0
0 0
[4]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0

$node # Attribute of each area defined in `mdc`
      # for instance, for the mdc of `mid`=4, that is `02020`, s/he had activity `2` twice,
      # in area ids `2` and `3`, as indicated in `thc` and `thc1-4`.
      # The number does not indicate "how many times s/he took activity in the area"
      # but indicates "what s/he did in the area"
area mdc1 mdc2 mdc3 mdc4
   0   0    0    0     0
   1   0    1   NA    NA
   2  NA   NA   NA     2
   3  NA   NA   NA     2

[2] # Next element continues same information of other hid
    # Thus, from `hid` to `mdc` are one set of attributes of one element

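Converting df into the desired list is rather complicated for my current knowledge of lists and data conversion. For example, to create the adjacency matrix I need to refer to thc or thc1-5. For node, I also need to get the maximum area ID and store the information from mdc or mdc1-5.
I would appreciate any suggestions on how to get started.

I prefer to use tidyverse, purrr and their family, but I have not used purrr for list operations. I have used these packages for data manipulation before, but I am not familiar with list operations.

After this step, I will look at the movement and activity pattern of each household (not each member) with igraph or other packages such as ggnetwork or networkD3, to look for emerging patterns in the distribution of those patterns.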

1 reply | as of 6 years ago

Luke C · 6 years ago

Here are two helper functions that can build the adjacency and activity matrices:

## Build the adjacency matrix (see comments for details)

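build_adj_mat <- function(thc_) {
  # Split a code such as "01020" into the numeric stops 0 1 0 2 0.
  # Checking !is.numeric() covers both factors and plain character strings
  # (data.frame() no longer creates factors by default since R 4.0).
  if (!is.numeric(thc_)) {
    thc_ <- as.numeric(unlist(strsplit(as.character(thc_), "")))
  }

  # Create a matrix with the correct dimensions, and give names
  # (area IDs run from 0, home, up to the largest ID visited)
  mat <- matrix(0, nrow = max(thc_) + 1, ncol = max(thc_) + 1)
  rownames(mat) <- colnames(mat) <- 0:max(thc_)

  # Add one trip for every consecutive pair of stops;
  # seq_len() yields an empty loop when there is only a single stop
  for (i in seq_len(length(thc_) - 1)) {
    from <- thc_[i] + 1
    to <- thc_[i + 1] + 1
    mat[from, to] <- mat[from, to] + 1
  }
  return(mat)
}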
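
As an aside, the same counting can probably be done without an explicit loop. The following is only a sketch of an alternative, not the method used in this answer: it cross-tabulates origins against destinations with table(). The function name adj_from_code is made up for illustration, and single-digit area IDs are assumed.

```r
# Sketch of a vectorised alternative (an assumption, not the answer's method):
# count transitions with table() instead of an explicit loop.
# Assumes single-digit area IDs, with 0 meaning home.
adj_from_code <- function(code) {
  stops <- as.integer(strsplit(as.character(code), "")[[1]])
  areas <- 0:max(stops)                       # area IDs 0..max
  from  <- factor(head(stops, -1), levels = areas)
  to    <- factor(tail(stops, -1), levels = areas)
  # table() cross-tabulates origins (rows) against destinations (columns);
  # fixing the factor levels keeps unvisited pairs in the result as zeros
  mat <- unclass(table(from, to))
  names(dimnames(mat)) <- NULL                # drop the "from"/"to" labels
  mat
}

adj <- adj_from_code("02030")
print(adj)
```

Fixing the factor levels to 0:max(stops) is what keeps the matrix square even when some origin/destination pairs never occur.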


## Build the activity matrix / node

build_node_df <- function(df_) {
  # Split a code string into numeric values; works for factors and characters
  split_codes <- function(x) {
    if (is.numeric(x)) x else as.numeric(unlist(strsplit(as.character(x), "")))
  }
  # get the maximum area ID in the household
  max_len <- max(split_codes(df_$thc))
  # Build one member's activity vector: the activity recorded at each area
  build_act_mat <- function(loc_, act_, max_area = max_len) {
    loc_ <- split_codes(loc_)
    act_ <- split_codes(act_)
    area <- rep(NA, max_area + 1)
    for (i in seq_along(loc_)) {
      area[loc_[i] + 1] <- act_[i]
    }
    return(area)
  }
  # Call the function for every member; SIMPLIFY = FALSE always returns a list,
  # so the result keeps one column per member even when max_len is 0
  out <- mapply(build_act_mat, df_$thc, df_$mdc, SIMPLIFY = FALSE)
  # cbind the output with the areas
  out <- data.frame(cbind(0:max_len, do.call(cbind, unname(out))))
  # Assign proper column names
  colnames(out) <- c("area", paste("mid_", df_$mid, sep = ""))
  return(out)
}

Then apply these functions to df, together with your hid and mid output:

build_list <- function(dfo) {
  hid_ <- as.numeric(as.character(dfo$hid))
  mid_ <- as.numeric(as.character(dfo$mid))
  trip_ <- lapply(dfo$thc, build_adj_mat)
  node_ <- build_node_df(dfo)

  return(list(
    hid = hid_,
    mid = mid_,
    trip = trip_,
    node = node_)
    )
}

Output:

> build_list(df)
$hid
[1] 10001 10001 10001 10001

$mid
[1] 1 2 3 4

$trip
$trip[[1]]
  0 1
0 0 1
1 1 0

$trip[[2]]
  0 1
0 0 2
1 2 0

$trip[[3]]
  0
0 0

$trip[[4]]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0


$node
  area mid_1 mid_2 mid_3 mid_4
1    0     0     0     0     0
2    1     0     1    NA    NA
3    2    NA    NA    NA     2
4    3    NA    NA    NA     2
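
Since the goal is network analysis, it may help to flatten each matrix in $trip into a weighted edge list. The sketch below is one possible way to do that in base R; adj_to_edges is a hypothetical helper, and the example matrix is the one printed for mid = 4 above, typed in by hand.

```r
# Hypothetical helper: turn an adjacency matrix into a weighted edge list.
adj_to_edges <- function(mat) {
  idx <- which(mat > 0, arr.ind = TRUE)       # row/column positions of trips
  data.frame(
    from   = as.integer(rownames(mat)[idx[, "row"]]),
    to     = as.integer(colnames(mat)[idx[, "col"]]),
    weight = mat[idx]                         # number of trips on each edge
  )
}

# The matrix for mid = 4 from the output above, typed in by hand
m4 <- matrix(0, 4, 4, dimnames = list(0:3, 0:3))
m4["0", "2"] <- 1; m4["0", "3"] <- 1
m4["2", "0"] <- 1; m4["3", "0"] <- 1

edges <- adj_to_edges(m4)
print(edges)
```

A data frame with from/to columns like this is the kind of input that igraph::graph_from_data_frame() accepts, although that step is not shown here.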

I'm sure there is a way to make this work with dplyr, but it might be easier to use split from base R. With this slightly modified data frame:

df2 <- data.frame(
  hid = c("10001", "10002", "10002", "10003"),
  mid = c(1, 2, 3, 4),
  thc = c("010", "01010", "0", "02030"),
  mdc = c("000", "01010", "0", "02020")
)

Now split the new data frame into a list and use lapply to apply the build_list function to each piece:

split_df2 <- split(df2, df2$hid)
names(split_df2) <- paste("hid_", names(split_df2), sep = "")
lapply(split_df2, build_list)

Output:

$hid_10001
$hid_10001$hid
[1] 10001

$hid_10001$mid
[1] 1

$hid_10001$trip
$hid_10001$trip[[1]]
  0 1
0 0 1
1 1 0


$hid_10001$node
  area mid_1
1    0     0
2    1     0


$hid_10002
$hid_10002$hid
[1] 10002 10002

$hid_10002$mid
[1] 2 3

$hid_10002$trip
$hid_10002$trip[[1]]
  0 1
0 0 2
1 2 0
...
...
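
The question mentions looking at patterns per household rather than per member. One possible next step, sketched here under the assumption that area IDs always start at 0, is to add up the members' matrices into a single household matrix. The helper sum_adj is hypothetical, and the two input matrices are the ones shown earlier for mid = 1 and mid = 4, typed in by hand.

```r
# Hypothetical helper: add up member-level adjacency matrices of different sizes.
sum_adj <- function(mats) {
  n <- max(vapply(mats, nrow, integer(1)))    # size of the largest matrix
  total <- matrix(0, n, n, dimnames = list(0:(n - 1), 0:(n - 1)))
  for (m in mats) {
    k <- seq_len(nrow(m))
    # smaller matrices fill the top-left corner, because area IDs start at 0
    total[k, k] <- total[k, k] + m
  }
  total
}

# Matrices for mid = 1 ("010") and mid = 4 ("02030"), typed in by hand
m1 <- matrix(c(0, 1, 1, 0), 2, 2, dimnames = list(0:1, 0:1))
m4 <- matrix(0, 4, 4, dimnames = list(0:3, 0:3))
m4[1, 3] <- m4[1, 4] <- m4[3, 1] <- m4[4, 1] <- 1

household <- sum_adj(list(m1, m4))
print(household)
```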

Hopefully this gets you headed in the right direction!