How to group by with a condition in PySpark?

Here is some sample data:

+-----+-------+-------------+------------+
| zip | state | Agegrouping | patient_id |
+-----+-------+-------------+------------+
| 123 | x     | Adult       |        123 |
| 124 | x     | Children    |        231 |
| 123 | x     | Children    |        456 |
| 156 | x     | Adult       |        453 |
| 124 | y     | Adult       |         34 |
| 432 | y     | Adult       |         23 |
| 234 | y     | Children    |         13 |
| 432 | z     | Children    |         22 |
| 234 | z     | Adult       |         44 |
+-----+-------+-------------+------------+

I would then like to see the data as:

+-----+-------+-------+----------+------------+
| zip | state | Adult | Children | patient_id |
+-----+-------+-------+----------+------------+
| 123 | x     |     1 |        1 |          2 |
| 124 | x     |     1 |        1 |          2 |
| 156 | x     |     1 |        0 |          1 |
| 432 | y     |     1 |        1 |          2 |
| 234 | z     |     1 |        1 |          2 |
+-----+-------+-------+----------+------------+

How can I do this?

-2
Niranjan Grandhi 28 August 2020 at 19:03

3 answers

Best answer

Here is the Spark SQL version.

df.createOrReplaceTempView('table')

spark.sql('''
    select zip, state,
           count(if(Agegrouping = 'Adult', 1, null)) as adult,
           count(if(Agegrouping = 'Children', 1, null)) as children,
           count(1) as patient_id
    from table
    group by zip, state;
''').show()

+---+-----+-----+--------+----------+
|zip|state|adult|children|patient_id|
+---+-----+-----+--------+----------+
|123|    x|    1|       1|         2|
|156|    x|    1|       0|         1|
|234|    z|    1|       0|         1|
|432|    z|    0|       1|         1|
|234|    y|    0|       1|         1|
|124|    y|    0|       0|         1|
|124|    x|    0|       1|         1|
|432|    y|    1|       0|         1|
+---+-----+-----+--------+----------+
0
Lamanus 28 August 2020 at 23:43

Use conditional aggregation:

select 
    zip,
    state,
    sum(case when Agegrouping = 'Adult'    then 1 else 0 end) as adult,
    sum(case when Agegrouping = 'Children' then 1 else 0 end) as children,
    count(*) as patient_id
from mytable
group by zip, state
0
GMB 28 August 2020 at 16:05

You can use conditional aggregation:

select zip, state,
       sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
       sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
       count(*) as num_patients
from t
group by zip, state;
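
As a quick way to try this query without a Spark cluster, the sketch below runs it against the question's sample rows using Python's built-in `sqlite3` module. SQLite stands in for Spark SQL here purely as an assumption for illustration. The `order by` clause and the names `conn`, `rows`, `query`, and `results` are additions, not part of the original answer.

```python
import sqlite3

# Sample rows copied from the question.
rows = [
    (123, "x", "Adult", 123),
    (124, "x", "Children", 231),
    (123, "x", "Children", 456),
    (156, "x", "Adult", 453),
    (124, "y", "Adult", 34),
    (432, "y", "Adult", 23),
    (234, "y", "Children", 13),
    (432, "z", "Children", 22),
    (234, "z", "Adult", 44),
]

# Assumption: an in-memory SQLite table named t stands in for the Spark table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (zip integer, state text, agegrouping text, patient_id integer)"
)
conn.executemany("insert into t values (?, ?, ?, ?)", rows)

query = """
select zip, state,
       sum(case when agegrouping = 'Adult' then 1 else 0 end) as adult,
       sum(case when agegrouping = 'Children' then 1 else 0 end) as children,
       count(*) as num_patients
from t
group by zip, state
order by state, zip
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

With these rows the query returns eight groups, and only (123, x) contains both an adult and a child, because the grouping key is the pair (zip, state). The desired table in the question appears to combine some rows across states, which grouping by both columns does not do.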
0
Gordon Linoff 28 August 2020 at 16:05