|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|W 27-May-18 10:1...|false|
| ...|false| ##this one should not be flagged
|W 27-May-18 10:1...|false|
If a line does not start with W, I, E or U, I want to join it to the line above, so afterwards it should look like this:
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|I 27-May-18 10:1...|false|
|W 27-May-18 10:1......|false| ##the row after this one was joined to the one before
|W 27-May-18 10:1...|false|
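To do this, I thought I would flag the rows, somehow assign a group to each row, and then use a group by statement.
However, I am already stuck at flagging the rows, because the regular expression does not work.
The regex for it is: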
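The flag, group and join idea can be sketched in plain Python first. This is only an illustration of the logic, not Spark code: the sample lines are made up (the real content is truncated above), and the names `starts_record`, `group_ids` and `joined` are hypothetical.

```python
import re
from itertools import accumulate, groupby

# Hypothetical sample lines; the real log content is truncated in the post.
lines = [
    "I 27-May-18 10:10 first message",
    "W 27-May-18 10:11 second message",
    "    continuation of the warning",
    "W 27-May-18 10:12 third message",
]

# Step 1: flag lines that start a new record (E, U, W or I, then whitespace).
starts_record = [bool(re.match(r"[EUWI]\s", line)) for line in lines]

# Step 2: a running sum of the flags gives every line a group id;
# a continuation line inherits the id of the record above it.
group_ids = list(accumulate(int(flag) for flag in starts_record))

# Step 3: join the lines that share a group id, keeping their order.
joined = [
    "".join(line for _, line in group)
    for _, group in groupby(zip(group_ids, lines), key=lambda pair: pair[0])
]
```

In Spark the running sum in step 2 would probably be a window function, which needs an explicit ordering column because a DataFrame has no inherent row order.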
'^[EUWI]\s'
When I use it in PySpark, it returns false for every row...
Here is the code:
df_with_x5 = a_7_df.withColumn("x5", a_7_df.line.startswith("[EUWI]\s"))
## I am using startswith, which is why I can drop the `^`
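A likely reason every row comes back false is that `startswith` compares against a literal prefix rather than a regular expression, so it looks for the characters `[EUWI]\s` themselves. The plain-Python sketch below shows the same difference on a made-up sample line; the commented-out last line is an assumed PySpark equivalent using `Column.rlike` and is not executed here.

```python
import re

line = "I 27-May-18 10:10 some message"  # hypothetical sample line

# startswith checks for the literal text "[EUWI]\s" at the start of the
# string, so a line beginning with "I " is not a match.
literal_match = line.startswith(r"[EUWI]\s")

# A regex match interprets [EUWI]\s as a pattern instead.
regex_match = re.match(r"^[EUWI]\s", line) is not None

# Assumed PySpark equivalent (not run here); rlike takes a regex,
# so the ^ anchor is needed again:
# df_with_x5 = a_7_df.withColumn("x5", a_7_df.line.rlike(r"^[EUWI]\s"))
```

Here `literal_match` is `False` and `regex_match` is `True`, which mirrors the all-false column described above.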