
Pivot table returns only the index columns and ignores the pivoted columns

pentaho pandas python

user3871 · 7 years ago

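I am trying to pivot my measure column so that its values become fields.

That is, net_revenue and vic should each become their own field.

In the image below, the input is on the left and the desired output is on the right:

I know measure has duplicate keys (e.g., net_revenue appears multiple times), but date_budget is distinct within each such block of data. date_budget does repeat, but only once measure has changed, so we never actually duplicate rows for the index columns.

Question: In the Pentaho CPython script step, when I look at the script's output, only the index columns are returned, not the pivoted columns net_revenue and vic. Why is that?

Script: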

import pandas as pd

budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'], errors='coerce')

# Perform the pivot.
budget = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure'
    )

budget.reset_index(inplace=True)

result_df = budget

Sample DataFrame:

d = {
    'country': ['us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us'],
    'customer': ['customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2',],
    'measure': ['net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic', 'net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018'],
    'monthly_budget_phasing': ['$55', '$23', '$42', '$29', '$35', '$98', '$87', '$77', '$34', '$90', '$75', '$12']
    }
df = pd.DataFrame(data=d)
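
The claim that the index columns plus measure never repeat can be checked directly. The following is a minimal sketch on a trimmed copy of the sample data (one customer only); the names key_cols and has_duplicates are made up for illustration.

```python
import pandas as pd

# Same shape as the sample data, trimmed to one customer for brevity.
df = pd.DataFrame({
    'country': ['us'] * 6,
    'customer': ['customer1'] * 6,
    'measure': ['net_revenue'] * 3 + ['vic'] * 3,
    'date_budget': ['1/1/2018', '2/1/2018', '3/1/2018'] * 2,
    'monthly_budget_phasing': ['$55', '$23', '$42', '$29', '$35', '$98'],
})

key_cols = ['country', 'customer', 'date_budget', 'measure']

# measure on its own repeats...
print(df['measure'].duplicated().any())

# ...but the full (country, customer, date_budget, measure) key does not.
has_duplicates = df.duplicated(subset=key_cols).any()
print(has_duplicates)
```

If has_duplicates were true, a pivot would have to aggregate several values into one cell, and the choice of aggfunc would start to matter.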

This works in pandas with aggfunc='first', but not in Pentaho. Pentaho still outputs only country, customer, and measure.

pandas output from the terminal:

   country   customer date_budget      measure monthly_budget_phasing
0       us  customer1    1/1/2018  net_revenue                    $55
1       us  customer1    2/1/2018  net_revenue                    $23
2       us  customer1    3/1/2018  net_revenue                    $42
3       us  customer1    1/1/2018          vic                    $29
4       us  customer1    2/1/2018          vic                    $35
5       us  customer1    3/1/2018          vic                    $98
6       us  customer2    1/1/2018  net_revenue                    $87
7       us  customer2    2/1/2018  net_revenue                    $77
8       us  customer2    3/1/2018  net_revenue                    $34
9       us  customer2    1/1/2018          vic                    $90
10      us  customer2    2/1/2018          vic                    $75
11      us  customer2    3/1/2018          vic                    $12
measure country   customer date_budget net_revenue  vic
0            us  customer1    1/1/2018         $55  $29
1            us  customer1    2/1/2018         $23  $35
2            us  customer1    3/1/2018         $42  $98
3            us  customer2    1/1/2018         $87  $90
4            us  customer2    2/1/2018         $77  $75
5            us  customer2    3/1/2018         $34  $12

Although the Python above works correctly, I still have the problem with the Pentaho 8.0 CPython plugin.

First I melted the dates:

Then I unstacked the measures:

Where are my net_revenue and vic fields?

2 answers | 7 years ago

jezrael 7 years ago

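It looks like you need to add replace to strip the $ sign before converting. With errors='coerce', values such as '$55' cannot be parsed and become NaN: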

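# Strip the literal '$' first; otherwise errors='coerce' turns every value into NaN.
budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'].replace(r'\$', '', regex=True), errors='coerce')
# Alternative: cast straight to int (raises if a value is not a clean integer).
#budget['monthly_budget_phasing'] = budget['monthly_budget_phasing'].replace(r'\$', '', regex=True).astype(int)

# Pivot, taking the first value for each index/measure combination.
df = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure',
    aggfunc='first'
    ).reset_index()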
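
To see why the $ matters, here is a small sketch, independent of Pentaho, of what pd.to_numeric does to the raw strings. It uses the Series string method str.replace with regex=False instead of the regex form above; the variable names are illustrative only. If every value is coerced to NaN, pivot_table has nothing to place in the net_revenue and vic columns, which would be consistent with only the index columns surviving.

```python
import pandas as pd

raw = pd.Series(['$55', '$23', '$42'])

# Without stripping '$', nothing parses, so every value is coerced to NaN.
coerced = pd.to_numeric(raw, errors='coerce')
print(coerced.isna().all())   # True

# Strip the literal '$' first, then convert.
cleaned = pd.to_numeric(raw.str.replace('$', '', regex=False), errors='coerce')
print(cleaned.tolist())       # [55, 23, 42]
```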

Alternative:

cols = ['country', 'customer', 'date_budget', 'measure']
# If there are duplicate key combinations, drop them first.
df = budget.drop_duplicates(cols)
# Pivot by moving the last index level (measure) into the columns.
df = df.set_index(cols)['monthly_budget_phasing'].unstack().reset_index()

print(df)
measure country   customer date_budget  net_revenue  vic
0            us  customer1    1/1/2018           55   29
1            us  customer1    2/1/2018           23   35
2            us  customer1    3/1/2018           42   98
3            us  customer2    1/1/2018           87   90
4            us  customer2    2/1/2018           77   75
5            us  customer2    3/1/2018           34   12
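
As a quick cross-check, the sketch below runs both approaches on a trimmed, already-numeric copy of the data so the results can be compared side by side. The names via_pivot, via_unstack and index_cols are made up for this example, and the data is assumed to have no duplicate keys.

```python
import pandas as pd

budget = pd.DataFrame({
    'country': ['us'] * 4,
    'customer': ['customer1'] * 4,
    'measure': ['net_revenue', 'net_revenue', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '1/1/2018', '2/1/2018'],
    'monthly_budget_phasing': [55, 23, 29, 35],
})

index_cols = ['country', 'customer', 'date_budget']

# Approach 1: pivot_table, taking the first value per cell.
via_pivot = pd.pivot_table(
    budget,
    values='monthly_budget_phasing',
    index=index_cols,
    columns='measure',
    aggfunc='first',
).reset_index()

# Approach 2: drop duplicate keys, then set_index + unstack.
via_unstack = (
    budget.drop_duplicates(index_cols + ['measure'])
    .set_index(index_cols + ['measure'])['monthly_budget_phasing']
    .unstack()
    .reset_index()
)

print(via_pivot)
print(via_unstack)
```

Both frames should carry the same five columns: the three index columns followed by net_revenue and vic.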

Andrei Luksha 7 years ago

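Kettle needs to know which columns each step will produce before the transformation runs, which is why I don't think this can be done in Python (select * queries are something of an exception, but they too quietly fetch metadata before the transformation runs). The usual way to perform a pivot in Kettle is the Row denormalizer step. That step requires you to specify the column names for the unpivoted values, but if you cannot hard-code them, you can supply them through the ETL Metadata Injection step.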

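To pass the values dynamically, create 2 transformations. The child transformation takes its input data from the parent transformation and performs the pivot with Row denormalizer. The parent transformation reads the input data, collects the unique values that will become column names, and passes them to the ETL Metadata Injection step. The injection step fills the Row denormalizer metadata with those column names and executes the child transformation, feeding it the input data.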
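
The same idea, that the output columns must be known before any data flows, can be illustrated on the pandas side. This is only a sketch under assumptions: EXPECTED_MEASURES, INDEX_COLS and pivot_with_fixed_columns are hypothetical names, and the list of measures is assumed to be known up front. It does not replace the Row denormalizer or ETL Metadata Injection steps, and whether the CPython step picks up the resulting columns depends on how that step's output fields are configured, which is not covered here.

```python
import pandas as pd

# Hypothetical: the measures expected in the output, declared up front.
EXPECTED_MEASURES = ['net_revenue', 'vic']
INDEX_COLS = ['country', 'customer', 'date_budget']

def pivot_with_fixed_columns(budget):
    """Pivot `measure` into columns, always returning the same column set."""
    wide = pd.pivot_table(
        budget,
        values='monthly_budget_phasing',
        index=INDEX_COLS,
        columns='measure',
        aggfunc='first',
    )
    # Add any expected measure that is absent (as NaN) and drop unexpected ones.
    wide = wide.reindex(columns=EXPECTED_MEASURES)
    wide.columns.name = None
    return wide.reset_index()

budget = pd.DataFrame({
    'country': ['us', 'us'],
    'customer': ['customer1', 'customer1'],
    'measure': ['net_revenue', 'net_revenue'],   # note: no 'vic' rows at all
    'date_budget': ['1/1/2018', '2/1/2018'],
    'monthly_budget_phasing': [55, 23],
})

result_df = pivot_with_fixed_columns(budget)
print(list(result_df.columns))
# ['country', 'customer', 'date_budget', 'net_revenue', 'vic']
```

Reindexing to a fixed list keeps the shape of the result independent of which measures happen to appear in a given batch of rows.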