You can do this with dplyr's group_by() and mutate() functions combined with an ifelse statement, as follows:
# Load library
library(dplyr)
# Create example data.frame
x <- read.table(text =
"cbsa_code cbsa_name county_code_long Population
936 10180 Abilene 48059 13544
967 10180 Abilene 48253 20202
993 10180 Abilene 48441 131506
765 10420 Akron 39133 161419
768 10420 Akron 39153 541781")
# Desired result
new_x <- x %>%
  group_by(cbsa_code) %>%
  mutate(Population = ifelse(Population == max(Population),
                             sum(Population), 0)) %>%
  ungroup()
The result is:
# A tibble: 5 x 4
  cbsa_code cbsa_name county_code_long Population
      <int>    <fctr>            <int>      <dbl>
1     10180   Abilene            48059          0
2     10180   Abilene            48253          0
3     10180   Abilene            48441     165252
4     10420     Akron            39133          0
5     10420     Akron            39153     703200
Now suppose two counties in the same CBSA share the maximum population (I have added such a county for Akron):
# Create example data.frame
y <- read.table(text =
"cbsa_code cbsa_name county_code_long Population
936 10180 Abilene 48059 13544
967 10180 Abilene 48253 20202
993 10180 Abilene 48441 131506
765 10420 Akron 39133 161419
768 10420 Akron 39153 541781
769 10420 Akron 39154 541781")
In this case, if we apply the code above...
y %>%
  group_by(cbsa_code) %>%
  mutate(Population = ifelse(Population == max(Population),
                             sum(Population), 0)) %>%
  ungroup()
...we get two Akron rows carrying the total, one for each tied county:
# A tibble: 6 x 4
  cbsa_code cbsa_name county_code_long Population
      <int>    <fctr>            <int>      <dbl>
1     10180   Abilene            48059          0
2     10180   Abilene            48253          0
3     10180   Abilene            48441     165252
4     10420     Akron            39133          0
5     10420     Akron            39153    1244981
6     10420     Akron            39154    1244981
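Before choosing a fix, it can help to check which groups actually contain a tie for the maximum. The following is only a minimal sketch: it rebuilds the example data without the row-name column so that it runs on its own, and the names ties and n_max are made up for illustration.

```r
library(dplyr)

# Same example data as y above (row names omitted, header stated explicitly)
y <- read.table(header = TRUE, text =
"cbsa_code cbsa_name county_code_long Population
10180 Abilene 48059 13544
10180 Abilene 48253 20202
10180 Abilene 48441 131506
10420 Akron 39133 161419
10420 Akron 39153 541781
10420 Akron 39154 541781")

# Count how many counties reach the group maximum in each CBSA,
# then keep only the CBSAs where more than one county does
ties <- y %>%
  group_by(cbsa_code) %>%
  summarise(n_max = sum(Population == max(Population))) %>%
  filter(n_max > 1)
```

With this data, ties should contain a single row for cbsa_code 10420 with n_max equal to 2.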
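If you want the full table, zeros included, but with the total assigned to only one county per group, here is a solution (see this dplyr vignette for more information on the approach):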
# Rank the Population values in descending order, so that the county with the
# maximum is ranked 1 (if there are ties, only the first of them gets rank 1).
y %>%
  group_by(cbsa_code) %>%
  mutate(pop_rank = row_number(desc(Population)),
         Population = ifelse(pop_rank == 1,
                             sum(Population), 0)) %>%
  ungroup() %>%
  select(-pop_rank)
which gives:
# A tibble: 6 x 4
  cbsa_code cbsa_name county_code_long Population
      <int>    <fctr>            <int>      <dbl>
1     10180   Abilene            48059          0
2     10180   Abilene            48253          0
3     10180   Abilene            48441     165252
4     10420     Akron            39133          0
5     10420     Akron            39153    1244981
6     10420     Akron            39154          0
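The tie-breaking comes from row_number(), which gives tied values distinct ranks in order of appearance, whereas min_rank() would give both tied counties rank 1 and reproduce the duplicate. A small sketch on the Akron populations alone (the vector pop and the two result names are illustrative only):

```r
library(dplyr)

# Akron populations from the example above, with a tie for the maximum
pop <- c(161419, 541781, 541781)

ranks_first <- row_number(desc(pop))  # ties broken by position: 3 1 2
ranks_min   <- min_rank(desc(pop))    # ties share the lowest rank: 3 1 1
```

Only one element of ranks_first equals 1, which is why the ifelse() above assigns the group total to a single row.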
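If you only want to keep one row per CBSA with the total population, you can use summarise() like this (arbitrarily taking the first county_code_long in each group, which is not necessarily the most populous county):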
y %>%
  group_by(cbsa_code, cbsa_name) %>%
  summarise(Population = sum(Population),
            county_code_long = county_code_long[1]) %>%
  ungroup()
which gives:
# A tibble: 2 x 4
  cbsa_code cbsa_name Population county_code_long
      <int>    <fctr>      <int>            <int>
1     10180   Abilene     165252            48059
2     10420     Akron    1244981            39133
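Note that the county codes above are simply the first in each group. If the code of the most populous county is wanted instead, one possible variant uses which.max(), which returns the position of the first maximum. This is a sketch, not part of the original solution; it rebuilds the data so it runs on its own, and the name res is illustrative.

```r
library(dplyr)

# Same example data as y above (row names omitted, header stated explicitly)
y <- read.table(header = TRUE, text =
"cbsa_code cbsa_name county_code_long Population
10180 Abilene 48059 13544
10180 Abilene 48253 20202
10180 Abilene 48441 131506
10420 Akron 39133 161419
10420 Akron 39153 541781
10420 Akron 39154 541781")

res <- y %>%
  group_by(cbsa_code, cbsa_name) %>%
  # county_code_long must be computed first: once Population is overwritten
  # by its sum, which.max(Population) would no longer see the per-county values
  summarise(county_code_long = county_code_long[which.max(Population)],
            Population = sum(Population)) %>%
  ungroup()
```

With this data, res should pair Abilene with county 48441 and Akron with county 39153 (the first of the two tied counties), alongside the same totals as before.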