Understanding Steps

Add Column

This step adds a new column to a data stream. You can set the default value and column name.

users:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

Adding “isHappy” with value true to users returns:

id name email isHappy
1 John john@gmail.com true
2 Phil phil@gmail.com true
3 Mark mark@gmail.com true

Adding “id” with value true to users returns:

id name email
true John john@gmail.com
true Phil phil@gmail.com
true Mark mark@gmail.com

Adding “” (an empty column name) with value true to users returns:

id name email
1 John john@gmail.com true
2 Phil phil@gmail.com true
3 Mark mark@gmail.com true

Unique

This step drops any non unique columns.

users:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

Running unique on users returns:

id name email
true John john@gmail.com
true Phil phil@gmail.com
true Mark mark@gmail.com

Binary Join

Binary join performs a natural join on two data streams. You can choose the left and right columns on which the join takes place.

users:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

usersSatisfaction:

id isHappy
1 true
2 false
3 true
4 false

Combining users and usersSatisfaction with the left_field “id” and right_field “id” returns:

id name email isHappy
1 John john@gmail.com true
2 Phil phil@gmail.com false
3 Mark mark@gmail.com true

Here is another example:

stream1:

a b
1 a1 stream 1
1 a2 stream 1

stream2:

a b
1 c2 stream 1
1 c2 stream 2
1 c2 stream 3

Combining stream1 and stream2 with the left_field “a” and right_field “a” returns:

a b
1 c2 stream 3
1 c2 stream 3

Multi Column Binary Join

Multi Column Binary Join join performs a natural join on two data streams along multiple columns. You need to merge along columns that both sources share.

cases:

state date cases
ny 05/01 1
ca 05/01 2
ca 06/01 3
ny 06/01 4

deaths:

state date deaths
ny 05/01 10
ny 06/01 40
ca 05/01 20
ca 06/01 30

Joining cases and deaths along “state” and “date” returns:

state date cases deaths
ny 05/01 1 10
ca 05/01 2 20
ca 06/01 3 30
ny 06/01 4 40

Merge

Merge appends streams together.

users1:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

users2:

id name email
4 Jane jane@gmail.com
5 Jack jack@gmail.com

users3:

id name email
6 Jo jo@gmail.com

Merging users1, users2 and users3 returns:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com
4 Jane jane@gmail.com
5 Jack jack@gmail.com
6 Jo jo@gmail.com

Here is another example:

users:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

animals:

number animal title
4 cat caty
33 dog doge
10 zebra ze

Merging users and animals returns:

id name email number animal title
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com
4 cat caty
33 dog doge
10 zebra ze

Renaming column

Renaming a column allows you to change the name of a column. It takes in the current column name and the new one.

users:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

Renaming “name” to “first_name” in users returns:

id first_name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

Select

Select allows you to select a subset of a data stream. It takes in a list of columns to be included.

users:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

Selecting id and email in users returns:

id email
1 john@gmail.com
2 phil@gmail.com
3 mark@gmail.com

Drop Columns

Drop Columns allows you to select a subset of columns to drop.

users:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

Dropping name and email in users returns:

id
1
2
3

Columns that don’t exist will be ignored.

Dropping name, email and fake_column in users returns:

id
1
2
3

Pivot Columns

This step is pefect for when you want to turn a selection of columns into rows.

countries:

id name 01/05 01/06 01/07
1 US 5 6 7
2 IS 50 60 70
3 CA 500 600 700

Pivoting countries with

  • Pivot Column Name: date
  • Value Column Name: cases
  • Columns to Keep: id and name

returns:

id name date cases
1 US 01/05 5
1 US 01/06 6
1 US 01/07 7
2 IS 01/05 50
2 IS 01/06 60
2 IS 01/07 70
3 CA 01/05 500
3 CA 01/06 600
3 CA 01/07 700

String Replace

This step performs a string replace on every value in a given column. A RegEx expression can be used.

countries:

id name date cases
1 US 01/05/20 5
1 US 01/06/20 6
1 US 01/07/20 7
2 IS 01/05/20 50
2 IS 01/06/20 60
2 IS 01/07/20 70
3 CA 01/05/20 500
3 CA 01/06/20 600
3 CA 01/07/20 700

String replacing countries with

  • Column Name: date
  • Replace: /
  • Using a String
  • With: -

returns:

id name date cases
1 US 01-05-20 5
1 US 01-06-20 6
1 US 01-07-20 7
2 IS 01-05-20 50
2 IS 01-06-20 60
2 IS 01-07-20 70
3 CA 01-05-20 500
3 CA 01-06-20 600
3 CA 01-07-20 700

The replace string can also be a RegEx expression.

countries:

id name date cases
1 US 01-m-20 5
1 US 01-m-20 6
1 US 01-mm-20 7

String replacing countries with

  • Column Name: date
  • Replace: -(.)-
  • Using a REGEX
  • With: -

returns:

id name date cases
1 US 01-20 5
1 US 01-20 6
1 US 01-mm-20 7

Combine Columns

This step allows you to combine columns to make a new column.

dates:

id day month year
0 11 03 2020
1 12 03 2020
2 13 03 2020

Combining dates with

  • New Column Name: date
  • Column Separator: -
  • Columns to Join (order matters): year,month,day

returns:

id day month year date
0 11 03 2020 2020-03-11
1 12 03 2020 2020-03-12
2 13 03 2020 2020-03-13

Drop Rows

This step allows you to conditionally drop rows.

people:

id name birthday
1 Jon 05/03/2000
2 Bob 03/2/2012
3 Jack 02/12/1983
4 Jacky 10/12/1933
5 Jon 05/02/1930

Row drop on people with

  • Column Name: name
  • Condition: Jon
  • Using Equals and Keep

returns:

id name birthday
1 Jon 05/03/2000
5 Jon 05/02/1930

Row drop on people with

  • Column Name: name
  • Condition: Jon
  • Using a Equals and Reject

returns:

id name birthday
2 Bob 03/2/2012
3 Jack 02/12/1983
4 Jacky 10/12/1933

Row drop on people with

  • Column Name: birthday
  • Condition: /20
  • Using Contains and Keep

returns:

id name birthday
1 Jon 05/03/2000
2 Bob 03/2/2012

Row drop on people with

  • Column Name: birthday
  • Condition: /20
  • Using Contains and Reject

returns:

id name birthday
3 Jack 02/12/1983
4 Jacky 10/12/1933
5 Jon 05/02/1930

Row drop on people with

  • Column Name: birthday
  • Condition: (.)(.)/(.)(.)/(.)(.)
  • Using RegEx and Keep

returns:

id name birthday
1 Jon 05/03/2000
3 Jack 02/12/1983
4 Jacky 10/12/1933
5 Jon 05/02/1930

Row drop on people with

  • Column Name: birthday
  • Condition: (.)(.)/(.)(.)/(.)(.)
  • Using RegEx and Reject

returns:

id name birthday
2 Bob 03/2/2012

Format Values

This step formats every values in a given column.

people:

id name birthday
1 bob 03/2/2012
2 Jack 02/12/1983
3 jacky 10/12/1933

Format Values on people with

  • Column Name: name
  • Format: Uppercase

returns:

id name birthday
1 BOB 03/2/2012
2 JACK 02/12/1983
3 JACKY 10/12/1933

Format Values on people with

  • Column Name: name
  • Format: Lowercase

returns:

id name birthday
1 bob 03/2/2012
2 jack 02/12/1983
3 jacky 10/12/1933

Format Values on people with

  • Column Name: name
  • Format: Capitalize

returns:

id name birthday
1 Bob 03/2/2012
2 Jack 02/12/1983
3 Jacky 10/12/1933

Coding

This step allows you to write custom code on notebooks.

When you create the step, you are given files which is an array of inputs. Each input in files is a list of maps. At the end of the step, you have to return an array of outputs. Currently, the return array needs to include exactly one output.

Let’s say that this is your starting data:

id name email
1 John john@gmail.com
2 Phil phil@gmail.com
3 Mark mark@gmail.com

If you add a coding step to the input and write:

[file] = files

file contains the information in the CSV. It is stored as a list of maps.

 [
        %{"id" => 1, "first_name" => "John", "email" => "john@gmail.com"},
        %{"id" => 2, "first_name" => "Phil", "email" => "phil@gmail.com"},
        %{"id" => 3, "first_name" => "Mark", "email" => "mark@gmail.com"}
      ]

If you run:

[file] = files
new_file = Stream.map(file, fn record -> Map.take(record, ["id"]) end)
[new_file]

It will return:

 [
        %{"id" => 1},
        %{"id" => 2},
        %{"id" => 3}
      ]

Which is effectively:

id
1
2
3

Notice how you have to return an array of files when you finish the step. Hence

[new_file]