Instabase Drive


AA3.ipynb

AA3: Fun with Window Functions

  • Assigned: Mar 15
  • Due: Apr 8, 10AM
  • Extra credit: up to 3.75%

In this assignment you'll take your data science skills to the next level, learning how to use window functions to perform advanced SELECT queries.

Please work individually for this assignment.

Copy the AA3 folder into your Instabase repo before starting this assignment.

Preparation

First, make sure that you fully understand how to use SQL concepts previously taught in class like aggregations, subqueries etc., as you will also need to use them.

Then, read this tutorial to understand window functions. Note that they are not PostgreSQL-specific, and some might refer to them as "Analytical Functions" instead.

Reference:

Token Setup

In [1]:
# Your columbia uni that is used in SSOL
#
# IMPORTANT:  make sure this is consistent with the uni/alias used as your @columbia.edu email in SSOL
#
UNI = "cu1111"

# your instabase username (if you go to the instabase homepage, your username should be in the URL)
USER = "custudent"

# your repository name containing this assignment
REPO = "my-repo"

print "Provided UNI is {0}. Make sure it is correct.".format(UNI)
Provided UNI is cu1111. Make sure it is correct.

Database Connection

Run following command to connect to database. You should see "Connected to ..." for successful connection.

In [2]:
# connect to db in public repo
ib.connect_db('ib://ewu/w4111-public/databases/w4111')
Connected to: ib://ewu/w4111-public/databases/w4111

Jupyter Notes: Read these carefully

  • You may create new IPython notebook cells to use for e.g. testing, debugging, exploring, etc.- this is encouraged in fact!- just make sure that you run the final cell to submit your results
    • you can press shift+enter to execute to code in the cell that your cursor is in.
  • When you see In [*]: to the left of the cell you are executing, this means that the code / query is running. Please wait for the execution to complete
    • If the cell is hanging- i.e. running for too long: you can restart the kernel
    • To restart kernel using the menu bar: "Kernel >> Restart >> Clear all outputs & restart"), then re-execute cells from the top
  • Have fun!

Warm Up

STOP! Make sure you have read the tutorial mentioned above before proceeding.

You can imagine window functions as special functions that, for every row, will compute some value by looking at other rows as well. These other rows are the "window", hence the name.

An expression will have the following format:

some_function( args ... ) OVER ( <WINDOW_DEFINITION> )

The "window definition" is the key on how the function will look at other rows. It has three optional parts:

<WINDOW_DEFINITION> =
    [<PARTITIONING>] [<ORDERING>] [<FRAME_CLAUSE>]
  • <PARTITIONING> defines how to split windows based on some column(s) or even expression(s);
  • <ORDERING> defines how the rows should be ordered inside a window;
  • <FRAME_CLAUSE> defines how rows that come before and after the current one should be included in the window or not. Intuitively, it has no meaning if you are not using an ordering. The default (when ordering is set) is to include every row between the first one in the window and the current one.

 

Now, consider this sample table containing the number of cars that entered a shopping mall, breaking down by entrance and date.

In [3]:
%%sql
select * from fact_parking_entrances
Out[3]:
day entrance number_of_cars
0 2018-02-01 A 222
1 2018-02-02 A 122
2 2018-02-03 A 203
3 2018-02-04 A 474
4 2018-02-05 A 383
5 2018-02-06 A 292
6 2018-02-01 B 129
7 2018-02-02 B 178
8 2018-02-03 B 156
9 2018-02-04 B 381
10 2018-02-05 B 233
11 2018-02-06 B 198

 

Now let's see some window functions in practice. Feel free to play with them!

In [12]:
%%sql
SELECT
    day,
    entrance,
    number_of_cars,
    
    --" Total # of cars over all entrances for the same day. "
    sum(number_of_cars) OVER (PARTITION BY day)
        as daily_sum_all_entrances,
        
    --" Look how the record's number_of_cars attribute can be divided"
    --" by the SUM() from above to calculate the percent of this record's"
    --" to the day's total. "
    number_of_cars*100.0 /
        ( sum(number_of_cars) OVER (PARTITION BY day) )
        as daily_entrance_percentage,
    
    --" How about defining the position in which each entrance stands if  "
    --"  we rank them by the number of cars on a specific day "
    rank() OVER (PARTITION BY day ORDER BY number_of_cars DESC)
        as daily_rank
        
FROM fact_parking_entrances
ORDER BY day, entrance
Out[12]:
day entrance number_of_cars daily_sum_all_entrances daily_entrance_percentage daily_rank
0 2018-02-01 A 222 351 63.2478632478632479 1
1 2018-02-01 B 129 351 36.7521367521367521 2
2 2018-02-02 A 122 300 40.6666666666666667 2
3 2018-02-02 B 178 300 59.3333333333333333 1
4 2018-02-03 A 203 359 56.5459610027855153 1
5 2018-02-03 B 156 359 43.4540389972144847 2
6 2018-02-04 A 474 855 55.4385964912280702 1
7 2018-02-04 B 381 855 44.5614035087719298 2
8 2018-02-05 A 383 616 62.1753246753246753 1
9 2018-02-05 B 233 616 37.8246753246753247 2
10 2018-02-06 A 292 490 59.5918367346938776 1
11 2018-02-06 B 198 490 40.4081632653061224 2
In [13]:
%%sql
SELECT
    day,
    entrance,
    number_of_cars,
    
    --" Let's do a CUMULATIVE sum per entrance over days. "
    --" This SUM only differs from the above query's in the ORDER BY day clause"
    --" Note: Remember that the default frame clause is between the first row"
    --"  and the current record, so we omit it here. "
    sum(number_of_cars) OVER (PARTITION BY entrance ORDER BY day)
        as cum_sum_by_entrance,

    --" Same as the previous SUM expression, but we make the frame clause explicit"
    sum(number_of_cars) OVER (
            PARTITION BY entrance 
            ORDER BY day
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        as cum_sum_by_entrance2,
        
    --" Now, let's say that we want to get the max value per entrance, but only 
    --" between the previous day, the current one, and the next one."
    max(number_of_cars) OVER (
            PARTITION BY entrance
            ORDER BY day
            ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
        ) as max_yday_today_tmrw_by_entrance
        
FROM fact_parking_entrances
ORDER BY entrance, day
Out[13]:
day entrance number_of_cars cum_sum_by_entrance cum_sum_by_entrance2 max_yday_today_tmrw_by_entrance
0 2018-02-01 A 222 222 222 222
1 2018-02-02 A 122 344 344 222
2 2018-02-03 A 203 547 547 474
3 2018-02-04 A 474 1021 1021 474
4 2018-02-05 A 383 1404 1404 474
5 2018-02-06 A 292 1696 1696 383
6 2018-02-01 B 129 129 129 178
7 2018-02-02 B 178 307 307 178
8 2018-02-03 B 156 463 463 381
9 2018-02-04 B 381 844 844 381
10 2018-02-05 B 233 1077 1077 381
11 2018-02-06 B 198 1275 1275 233

"The interview"

You are being interviewed for a data scientist position at a major online store, and to get the job, you need to prove that you can answer some of their questions regarding past sales.

To do so, you have access to a dataset containing every sale between 2014 and 2017, stored in a single table. For simplicity, every transaction contains only one item.

The columns are the following:

  • transaction_id - unique integer id for that transaction (do not assume any ordering!)
  • transaction_datetime - date and time of the transaction, encoded as Postgres' timestamp type
  • customer_id - unique integer id for that customer
  • customer_first_name
  • customer_last_name
  • shipping_state
  • item_id - unique integer id for that item
  • item_description
  • item_price

You may run this query to have some ideas on the schema you are dealing with:

In [14]:
%%sql
select * from fact_sales limit 5;
Out[14]:
transaction_id transaction_datetime customer_id customer_first_name customer_last_name shipping_state item_id item_description item_price
0 1 2014-01-01 00:06:40 100000236 Carol Blake Missouri 8068199275 Slacks 87.7799988
1 2 2014-01-01 00:29:26 300000427 Dominic Welch Michigan 2695940506 Sweater 59.4000015
2 3 2014-01-01 00:53:39 100000197 Blake Ross Delaware 1336619171 Slacks 63.8400002
3 4 2014-01-01 01:41:48 100000116 Alison Johnston Delaware 5187373849 Short Slip 12.8699999
4 5 2014-01-01 02:31:54 100000007 Dorothy Thomson Alabama 8586814459 Gown & Robe Set 65.6399994

Now, please write the SQL query for each of the following questions.

You must use window functions to answer each one.

You may need to check the list of available window functions listed in the Preparation section.

Here's a list of some useful expressions for timestamp/datetime:

extract(year from some_timestamp_column) -> gets year from datetime
extract(month from some_timestamp_column) -> gets month from datetime
cast(some_timestamp_column as date) -> gets full date (year, month and day) from datetime

Q1

For each of the customers listed below by ID, show their individual cumulative sum over transactions made in 2015.

Customer IDs: 300000073, 100000381, 100000160

You must provide a single query, containing the following columns:

  • transaction_id
  • transaction_datetime
  • customer_id
  • customer_last_name
  • item_id
  • item_description
  • item_price
  • cumulative_sum, containing the cumulative sum over time for such customer
In [ ]:
%%sql
-- Complete your query here
In [4]:
# submit query in string format
q1 = {
    "query": """
    REPLACE ME WITH YOUR QUERY
    """
}

Q2

What were the top 5 shipping states per year, based on the total value of transactions, for 2015 and 2016?

You must provide a single query, containing the following columns

  • year, string
  • shipping_state, string
  • total_sales, containing the total value of transactions
  • position, containing the position of such item among the top 5 for such state.

Hints:

  • Use the WITH clause to prep temporary tables
  • First calculate the total value of transactions per state, year pair
  • Order the states in each year by total transaction value and compute their rank()
  • Find the states in years 2015 and 2016 with the appropriate ranks
In [ ]:
%%sql
-- Complete your query here
In [5]:
# submit query in string format
q2 = {
    "query": """
    REPLACE ME WITH YOUR QUERY
    """
}

Q3

Compute the daily total value of sales from 2014-01-01 to 2014-01-20 (inclusive), along with a running 7d average and a month-to-date total

You must provide a single query, containing the following columns:

  • date
  • total_sales, containing the total for such date
  • sales_7d_avg, containing the running 7d average (i.e. between 6 days before and the current date). Note: the average should be computed even if there's some missing data for the past 6 days (e.g. 2014-01-01).
  • month_to_date, containing the cumulative sum from the start of the month to the current date.

Hints:

  • Check the syntax reference on how to define a frame clause for a running avg.
  • Use WITH to compute daily totals first
In [ ]:
%%sql
-- Complete your query here
In [6]:
# submit query in string format
q3 = {
    "query": """
    REPLACE ME WITH YOUR QUERY
    """
}

Q4

Compute the yearly total value of sales for New York, Massachusetts and Michigan, along with % variation in relation to the previous year.

You must provide a single query, containing the following columns:

  • shipping_state
  • year
  • total_sales, containing the total for such state and year
  • variation, containg the % variation in relation to the previous year, i.e. if the yerly total for a state has doubled this field would contain a value of 100.0 . For the first year, this field should be NULL.

Hint:

  • Take a look at the lag() function in the WINDOW documentation for computing variation.
In [ ]:
%%sql
-- Complete your query here
In [7]:
# submit query in string format
q4 = {
    "query": """
    REPLACE ME WITH YOUR QUERY
    """
}

Q5

The store has decided to award customers that spent at least $2500 in a single year. Show which customer(s) will receive an award, along with the datetime of the transaction that made each of them pass the minimum value to get the prize.

You must provide a single query, containing the following columns:

  • year
  • customer_id
  • award_transaction_datetime
In [ ]:
%%sql
-- Complete your query here
In [8]:
# submit query in string format
q5 = {
    "query": """
    REPLACE ME WITH YOUR QUERY
    """
}

Submission

Create your submission file¶

Run the following cell to create a results file for your homework

DO NOT MODIFY THE FOLLOWING CELL!!

In [5]:
import datetime
import json

script_path = '{0}/{1}/fs/Instabase%20Drive'.format(USER, REPO)

with ib.open('results'.format(script_path), "w") as f:
    result = dict(
        q1=q1,
        q2=q2,
        q3=q3,
        q4=q4,
        q5=q5,
        uni=UNI,
        user=USER
    )
    f.write(json.dumps(result))
    print "UNI is {0}".format(UNI)
    print "Result file created at: {0}".format(datetime.datetime.now())
  
    print
    print "Check your results: http://www.instabase.com/{0}/AA3/results".format(script_path)

Finally, Submit your entire AA3 folder including your notebook and results file at the following URL:

https://www.instabase.com/apps/file-submission/cmd/submit/1a755559-190c-4a5d-8fb6-a9f40b5aca0e