我有一个数据框
df
(见下面的代码),其中包含近100000行,显示我的程序联系人列表。列表中有一列显示程序
program
和组织
O_ID
联系人与之关联,以及联系人在程序中的角色列。当联系人在多个程序中或在程序中有多个角色时,会为该联系人创建另一行,程序和联系人角色字段值会发生变化。
First Last C_ID OrgName O_ID Program Role
John Smith 10045 Acme 901 Buildings Primary
John Smith 10045 Acme 901 Buildings Communications
John Smith 10045 Acme 901 Homes Primary
Teddy Bush 10046 Acme 901 Buildings Primary
Teddy Bush 10046 Acme 901 Buildings Signatory
Jess Clinton 10050 Consult 904 Homes Signatory
Jess Clinton 10050 Consult 904 Homes Primary
Jess Clinton 10050 Consult 904 Homes Communications
出于演示目的,我尽量减少行数。具体来说,如果一个联系人在同一个组织和同一个程序中,我只希望该联系人出现在一行上(而不是现在的几行),并将联系人角色组合成一个字符串。
我尝试了这段代码,它部分有效:
ddply(df,.(df$C_ID, df$Program, df$O_ID), paste, sep=",")
以下是结果:
df$C_ID df$Program df$O_ID V1 V2
1 10045 Buildings 901 c("John", "John") c("Smith", "Smith")
2 10045 Homes 901 John Smith
3 10046 Buildings 901 c("Teddy", "Teddy") c("Bush", "Bush")
4 10050 Homes 904 c("Jess", "Jess", "Jess") c("Clinton", "Clinton", "Clinton")
V3 V4 V5 V6
1 c(10045, 10045) c("Acme", "Acme") c(901, 901) c("Buildings", "Buildings")
2 10045 Acme 901 Homes
3 c(10046, 10046) c("Acme", "Acme") c(901, 901) c("Buildings", "Buildings")
4 c(10050, 10050, 10050) c("Consult", "Consult", "Consult") c(904, 904, 904) c("Homes", "Homes", "Homes")
V7
1 c("Primary", "Communications")
2 Primary
3 c("Primary", "Signatory")
4 c("Signatory", "Primary", "Communications")
问题是
1) 列被重新排列(注意,在我的实际数据集中有更多的列),列名消失了
2) 值发生更改的唯一列应位于
Role
柱但是,即使合并的值相同,结果也会合并大多数列的值。例如,在“结果”列中
V1
(名字列),返回
c("John", "John")
.应该是“约翰”。唯一应该具有不同值的列是column
V7
c("Primary", "Communications")
df<-structure(list(First = c("John", "John", "John", "Teddy", "Teddy",
"Jess", "Jess", "Jess"), Last = c("Smith", "Smith", "Smith",
"Bush", "Bush", "Clinton", "Clinton", "Clinton"), C_ID = c(10045L,
10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L), OrgName = c("Acme",
"Acme", "Acme", "Acme", "Acme", "Consult", "Consult", "Consult"
), O_ID = c(901L, 901L, 901L, 901L, 901L, 904L, 904L, 904L),
Program = c("Buildings", "Buildings", "Homes", "Buildings",
"Buildings", "Homes", "Homes", "Homes"), Role = c("Primary",
"Communications", "Primary", "Primary", "Signatory", "Signatory",
"Primary", "Communications")), .Names = c("First", "Last",
"C_ID", "OrgName", "O_ID", "Program", "Role"), class = "data.frame", row.names = c(NA,
-8L))