Quotes

Friday, November 13, 2020

Hive, Scala snippets

import org.apache.spark.sql.{Column, SaveMode, SparkSession}

val df1 = spark.read.format("orc").option("header", "true").option("delimiter", "|").load("s3a:

val df2 = spark.read.format("orc").option("header", "true").option("delimiter", "|").load("s3a:

val dropColLst = "insert_datetime_utc,extract_datetime"

val dropColumns = dropColLst.split(",")

val leftTableFilteredDF = df1.select(df1.columns.filter(colName => !dropColumns.contains(colName)).map(colName => new Column(colName)): _*)

val rightTableFilteredDF = df2.select(df2.columns.filter(colName => !dropColumns.contains(colName)).map(colName => new Column(colName)): _*)

val df12 = leftTableFilteredDF.except(rightTableFilteredDF)

val df21 = rightTableFilteredDF.except(leftTableFilteredDF)


export SPARK_MAJOR_VERSION=2

SHOW PARTITIONS schema.tablename;

ALTER TABLE schema.tablename DROP IF EXISTS PARTITION(partition_by_month_id=202010);

Wednesday, May 6, 2020

Copied - The Best Medium-Hard Data Analyst SQL Interview Questions

Between the fall of 2015 and the summer of 2019 I interviewed for data analyst and data scientist positions four separate times, getting to onsite interviews at over a dozen companies. After an interview in 2017 went poorly — mostly due to me floundering at the more difficult SQL questions they asked me — I started putting together a study guide of medium and hard SQL questions to better prepare and found it particularly useful during my 2019 interview cycle. Over the past year I have shared that guide with a couple of friends, and with the extra time on my hands due to the coronavirus pandemic, I have polished it up into this doc.

There are plenty of great beginner SQL guides out there. My favorites are Codecademy’s interactive SQL courses and Zi Chong Kao’s Select Star SQL. However, like I told a friend, while the first 70% of SQL is pretty straightforward, the remaining 30% can be pretty tricky. Data analyst and data scientist interview questions at technology companies often pull from that 30%.

Strangely, I have never really found a comprehensive source online for those medium-hard SQL questions, which is why I put together this guide.

Working through this guide should improve your performance on data analyst interviews. It should also make you better at your current and future job positions. Personally, I find some of the SQL patterns in this doc useful for ETLs powering reporting tools that feature trends over time.

To be clear, data analyst and data scientist interviews consist of more than SQL questions. Other common topics include explaining past projects, A/B testing, metric development and open-ended analytical problems. This Quora answer has Facebook’s product analyst interview guide circa 2017, which discusses this topic in more depth. That said, if improving your SQL skills can make your interviews less stressful than they already are, it could very well be worth your time.

In the future, I may transition this doc to a website like Select Star SQL with an embedded SQL editor so that readers can write SQL statements to questions and get real-time feedback on their code. Another option could be adding these questions as problems on Leetcode. For the time being though I just wanted to publish this doc so that people could find it useful now.

I would love to get your feedback on this doc. Please drop a note if you find this useful, have improvements/corrections, or encounter other good resources for medium/hard difficulty SQL questions.

Assumptions & How to use this guide

Assumptions about SQL proficiency: This guide assumes you have a working knowledge of SQL. You probably use it frequently at work already but want to sharpen your skills on topics like self-joins and window functions.

How to use this guide: Since interviews usually utilize a whiteboard or a virtual (non-compiling) notepad, my recommendation is to get out a pencil and paper and write out your solutions to each part of the problem, and once complete compare your answers to the answer key. Or, complete these with a friend who can act as the interviewer!

  • Small SQL syntax errors aren’t a big deal during whiteboard/notepad interviews. However, they can be distracting to the interviewer, so ideally practice reducing these so your logic shines through in the interview.
  • The answers I provide may not be the only way to successfully solve the question. Feel free to message with additional solutions and I can add them to this guide!

Tips on solving difficult SQL interview questions

This advice mirrors typical code interview advice ...

  1. Listen carefully to problem description, repeat back the crux of the problem to the interviewer
  2. Spell out an edge case to demonstrate you actually understand the problem (i.e. a row that wouldn’t be included in the output of the SQL you are about to sketch out)
  3. (If the problem involves a self-join) For your own benefit sketch out what the self-join will look like — this will typically be at least three columns: a column of interest from the main table, the column to join from the main table, and the column to join from the secondary table
    1. Or, as you get more used to self-join problems, you can explain this step verbally
  4. Start writing SQL — err towards writing SQL versus trying to perfectly understand the problem. Verbalize your assumptions as you go so your interviewer can correct you if you go astray.

Acknowledgments and Additional Resources

Some of the problems listed here are adapted from old Periscope blog posts (mostly written around 2014 by Sean Cook, although his authorship seems to have been removed from the posts following SiSense's merger with Periscope) or discussions from Stack Overflow; I've noted them at the start of questions as appropriate.

Select Star SQL has good challenge questions that are complementary to the questions in this doc.

Please note that these questions are not literal copies of SQL interview questions I have encountered while interviewing nor were they interview questions used at a company I have worked at or work at.


Self-Join Practice Problems

#1: MoM Percent Change

Context: Oftentimes it's useful to know how much a key metric, such as monthly active users, changes between months. Say we have a table logins in the form:

| user_id | date       |
|---------|------------|
| 1       | 2018-07-01 |
| 234     | 2018-07-02 |
| 3       | 2018-07-02 |
| 1       | 2018-07-02 |
| ...     | ...        |
| 234     | 2018-10-04 |
Task: Find the month-over-month percentage change for monthly active users (MAU).


Solution:

(This solution, like other solution code blocks you will see in this doc, contains comments about SQL syntax that may differ between flavors of SQL or other comments about the solutions as listed)

WITH mau AS 
(
  SELECT 
   /* 
    * Typically, interviewers allow you to write pseudocode for date functions 
    * i.e. will NOT be checking if you have memorized date functions. 
    * Just explain what your function does as you whiteboard 
    *
    * DATE_TRUNC() is available in Postgres, but other SQL date functions or 
    * combinations of date functions can give you identical results   
    * See https://www.postgresql.org/docs/9.0/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
    */ 
    DATE_TRUNC('month', date) month_timestamp,
    COUNT(DISTINCT user_id) mau
  FROM 
    logins 
  GROUP BY 
    DATE_TRUNC('month', date)
  )
 
 SELECT 
    /*
    * You don't literally need to include the previous month in this SELECT statement. 
    * 
    * However, as mentioned in the "Tips" section of this guide, it can be helpful 
    * to at least sketch out self-joins to avoid getting confused which table 
    * represents the prior month vs current month, etc. 
    */ 
    a.month_timestamp previous_month, 
    a.mau previous_mau, 
    b.month_timestamp current_month, 
    b.mau current_mau, 
    ROUND(100.0*(b.mau - a.mau)/a.mau,2) AS percent_change 
 FROM
    mau a 
 JOIN 
    /*
    * Could also have done `ON b.month_timestamp = a.month_timestamp + interval '1 month'` 
    */
    mau b ON a.month_timestamp = b.month_timestamp - interval '1 month' 
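
Since a whiteboard doesn't compile, it can help to sanity-check queries like this on a toy in-memory database. Below is a sketch of the MAU query ported to SQLite via Python's sqlite3; SQLite has no DATE_TRUNC or interval arithmetic, so months are emulated with strftime() and date(), and the sample rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Invented sample logins: 3 distinct users in July, 2 in August.
con.executescript("""
CREATE TABLE logins (user_id INT, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-02'),
  (1, '2018-07-02'), (234, '2018-08-04'), (5, '2018-08-10');
""")

rows = con.execute("""
WITH mau AS (
  -- strftime('%Y-%m', ...) stands in for DATE_TRUNC('month', ...)
  SELECT strftime('%Y-%m', date) AS month, COUNT(DISTINCT user_id) AS mau
  FROM logins
  GROUP BY 1
)
SELECT a.month, a.mau, b.month, b.mau,
       ROUND(100.0 * (b.mau - a.mau) / a.mau, 2) AS percent_change
FROM mau a
JOIN mau b
  -- date(x || '-01', '+1 month') stands in for "+ interval '1 month'"
  ON b.month = strftime('%Y-%m', date(a.month || '-01', '+1 month'))
""").fetchall()
print(rows)
```

With the sample rows above, MAU drops from 3 to 2, so the query reports roughly a -33.33% month-over-month change.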
  


#2: Tree Structure Labeling

Context: Say you have a table tree with a column of nodes and a column of corresponding parent nodes

node   parent
1       2
2       5
3       5
4       3
5       NULL 
Task: Write SQL such that we label each node as a “Leaf”, “Inner”, or “Root” node, such that for the nodes above we get:

node    label  
1       Leaf
2       Inner
3       Inner
4       Leaf
5       Root
(Side note: this link has more details on Tree data structure terminology. Not needed to solve the problem though!)


Solution:

Acknowledgement: this more generalizable solution was contributed by Fabian Hofmann on 5/2/20. Thanks, FH!

WITH join_table AS
(
    SELECT 
        cur.node, 
        cur.parent, 
        COUNT(next.node) AS num_children
    FROM 
        tree cur
    LEFT JOIN 
        tree next ON (next.parent = cur.node)
    GROUP BY 
        cur.node, 
        cur.parent
)

SELECT
    node,
    CASE
        WHEN parent IS NULL THEN 'Root'
        WHEN num_children = 0 THEN 'Leaf'
        ELSE 'Inner'
    END AS label
FROM 
    join_table 
An alternate solution, without explicit joins:

Acknowledgement: William Chargin on 5/2/20 noted that WHERE parent IS NOT NULL is needed to make this solution return Leaf instead of NULL. Thanks, WC!

SELECT 
    node,
    CASE 
        WHEN parent IS NULL THEN 'Root'
        WHEN node NOT IN 
            (SELECT parent FROM tree WHERE parent IS NOT NULL) THEN 'Leaf'
        WHEN node IN (SELECT parent FROM tree) AND parent IS NOT NULL THEN 'Inner'
    END AS label 
 FROM 
    tree
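
The subquery-based solution ports to SQLite almost verbatim, so the expected labels can be verified in a few lines of Python using the exact rows from the prompt:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The tree rows from the prompt.
con.executescript("""
CREATE TABLE tree (node INT, parent INT);
INSERT INTO tree VALUES (1, 2), (2, 5), (3, 5), (4, 3), (5, NULL);
""")

labels = dict(con.execute("""
SELECT
    node,
    CASE
        WHEN parent IS NULL THEN 'Root'
        WHEN node NOT IN (SELECT parent FROM tree WHERE parent IS NOT NULL)
            THEN 'Leaf'
        ELSE 'Inner'
    END AS label
FROM tree
""").fetchall())
print(labels)
```

Note the WHERE parent IS NOT NULL filter inside the subquery: without it, NOT IN against a set containing NULL evaluates to unknown and the Leaf branch never fires.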


#3: Retained Users Per Month (multi-part)

Acknowledgement: this problem is adapted from SiSense’s “Using Self Joins to Calculate Your Retention, Churn, and Reactivation Metrics” blog post

Part 1:

Context: Say we have login data in the table logins:

| user_id | date       |
|---------|------------|
| 1       | 2018-07-01 |
| 234     | 2018-07-02 |
| 3       | 2018-07-02 |
| 1       | 2018-07-02 |
| ...     | ...        |
| 234     | 2018-10-04 |
Task: Write a query that gets the number of retained users per month. In this case, retention for a given month is defined as the number of users who logged in that month who also logged in the immediately previous month.


Solution:

SELECT 
    DATE_TRUNC('month', a.date) month_timestamp, 
    COUNT(DISTINCT a.user_id) retained_users 
 FROM 
    logins a 
 JOIN 
    logins b ON a.user_id = b.user_id 
        AND DATE_TRUNC('month', a.date) = DATE_TRUNC('month', b.date) + 
                                             interval '1 month'
 GROUP BY 
    date_trunc('month', a.date)
Acknowledgement: Tom Moertel pointed out de-duping user-login pairs before the self-join would make the solution more efficient and contributed the alternate solution below. Thanks, TM!

Alternate solution:

WITH DistinctMonthlyUsers AS (
  /*
  * For each month, compute the *set* of users having logins.
  */
    SELECT DISTINCT
      DATE_TRUNC('MONTH', date) AS month_timestamp,
      user_id
    FROM logins
  )

SELECT
  CurrentMonth.month_timestamp month_timestamp,
  COUNT(PriorMonth.user_id) AS retained_user_count
FROM 
    DistinctMonthlyUsers AS CurrentMonth
LEFT JOIN 
    DistinctMonthlyUsers AS PriorMonth
  ON
    CurrentMonth.month_timestamp = PriorMonth.month_timestamp + INTERVAL '1 MONTH'
    AND 
    CurrentMonth.user_id = PriorMonth.user_id
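
If you want to check the de-duped retention logic end to end, here is a SQLite port (months again emulated with strftime(), and the login rows invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Invented logins: users 1 and 234 appear in both July and August.
con.executescript("""
CREATE TABLE logins (user_id INT, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-15'),
  (1, '2018-08-03'), (234, '2018-08-04'), (7, '2018-08-09');
""")

rows = con.execute("""
WITH monthly AS (
  -- de-duped (month, user) pairs
  SELECT DISTINCT strftime('%Y-%m', date) AS month, user_id
  FROM logins
)
SELECT cur.month, COUNT(prior.user_id) AS retained_user_count
FROM monthly cur
LEFT JOIN monthly prior
  ON cur.month = strftime('%Y-%m', date(prior.month || '-01', '+1 month'))
 AND cur.user_id = prior.user_id
GROUP BY cur.month
ORDER BY cur.month
""").fetchall()
print(rows)
```

July has no prior month in the data, so its retained count is 0; August retains users 1 and 234.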

Part 2:

Task: Now we’ll take retention and turn it on its head: Write a query to find how many users from last month did not come back this month, i.e. the number of churned users.


Solution:

SELECT 
    DATE_TRUNC('month', b.date) + interval '1 month' month_timestamp, 
    COUNT(DISTINCT b.user_id) churned_users 
FROM 
    logins a 
FULL OUTER JOIN 
    logins b ON a.user_id = b.user_id 
        AND DATE_TRUNC('month', a.date) = DATE_TRUNC('month', b.date) + 
                                         interval '1 month'
WHERE 
    a.user_id IS NULL 
GROUP BY 
    DATE_TRUNC('month', b.date) + interval '1 month'
Note that there are solutions to this problem that can use LEFT or RIGHT joins.
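
One such LEFT JOIN variant can be sketched in SQLite: take each month's user set, left join to the following month, and keep the users with no match. The churn month has to come from the prior month's date plus one month, since the current-month side is NULL for exactly the churned users. Sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Invented logins: user 3 is active in July but not August.
con.executescript("""
CREATE TABLE logins (user_id INT, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-15'),
  (1, '2018-08-03'), (234, '2018-08-04'), (7, '2018-08-09');
""")

rows = con.execute("""
WITH monthly AS (
  SELECT DISTINCT strftime('%Y-%m', date) AS month, user_id
  FROM logins
)
SELECT strftime('%Y-%m', date(prior.month || '-01', '+1 month')) AS month,
       COUNT(DISTINCT prior.user_id) AS churned_users
FROM monthly prior
LEFT JOIN monthly cur
  ON cur.month = strftime('%Y-%m', date(prior.month || '-01', '+1 month'))
 AND cur.user_id = prior.user_id
WHERE cur.user_id IS NULL
GROUP BY 1
ORDER BY 1
""").fetchall()
print(rows)
```

User 3 shows up as churned in August. The September row reports all three August users as churned only because the sample data ends in August; real data would have later logins.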


Part 3:

Note: this question is probably more complex than the kind you would encounter in an interview. Consider it a challenge problem, or feel free to skip it!

Context: Good work on the previous two problems! Data engineering has decided to give you a helping hand by creating a table of churned users per month, user_churns. If a user is active last month but then not active this month, then that user gets an entry for this month. user_churns has the form:

| user_id | month_date |
|---------|------------|
| 1       | 2018-05-01 |
| 234     | 2018-05-01 |
| 3       | 2018-05-01 |
| 12      | 2018-05-01 |
| ...     | ...        |
| 234     | 2018-10-01 |

Task: You now want to do a cohort analysis of active users this month who churned at some point in the past (i.e. reactivated users). Create a table that contains these users. You may use the tables user_churns as well as logins to create this cohort. In Postgres, the current timestamp is available through current_timestamp.


Solution:

WITH user_login_data AS 
(
    SELECT 
        DATE_TRUNC('month', a.date) month_timestamp,
        a.user_id,
        /* 
        * Referencing SELECT aliases in HAVING is not portable across SQL 
        * flavors; some (e.g. Postgres) require repeating the aggregate 
        * expressions there. The aliased columns are kept in the SELECT 
        * for clarity.  
        */ 
        MAX(b.month_date) as most_recent_churn, 
        MAX(DATE_TRUNC('month', c.date)) as most_recent_active 
     FROM 
        logins a
     JOIN 
        user_churns b 
            ON a.user_id = b.user_id AND DATE_TRUNC('month', a.date) > b.month_date 
     JOIN
        logins c 
            ON a.user_id = c.user_id 
            AND 
            DATE_TRUNC('month', a.date) > DATE_TRUNC('month', c.date)
     WHERE 
        DATE_TRUNC('month', a.date) = DATE_TRUNC('month', current_timestamp)
     GROUP BY 
        DATE_TRUNC('month', a.date),
        a.user_id
     HAVING 
        MAX(b.month_date) > MAX(DATE_TRUNC('month', c.date))
)

SELECT 
    user_id 
FROM 
    user_login_data


#4: Cumulative Sums

Acknowledgement: This problem was inspired by Sisense’s “Cash Flow modeling in SQL” blog post

Context: Say we have a table transactions in the form:

| date       | cash_flow |
|------------|-----------|
| 2018-01-01 | -1000     |
| 2018-01-02 | -100      |
| 2018-01-03 | 50        |
| ...        | ...       |
Where cash_flow is the revenues minus costs for each day.

Task: Write a query to get cumulative cash flow for each day such that we end up with a table in the form below:

| date       | cumulative_cf |
|------------|---------------|
| 2018-01-01 | -1000         |
| 2018-01-02 | -1100         |
| 2018-01-03 | -1050         |
| ...        | ...           |

Solution:

SELECT 
    a.date date, 
    SUM(b.cash_flow) as cumulative_cf 
FROM
    transactions a
JOIN 
    transactions b ON a.date >= b.date 
GROUP BY 
    a.date 
ORDER BY 
    date ASC
Alternate solution using a window function (more efficient!):

SELECT 
    date, 
    SUM(cash_flow) OVER (ORDER BY date ASC) as cumulative_cf 
FROM
    transactions 
ORDER BY 
    date ASC
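
SQLite (3.25+) supports window functions, so the more efficient form can be verified directly on the rows from the prompt:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The transactions rows from the prompt.
con.executescript("""
CREATE TABLE transactions (date TEXT, cash_flow INT);
INSERT INTO transactions VALUES
  ('2018-01-01', -1000), ('2018-01-02', -100), ('2018-01-03', 50);
""")

rows = con.execute("""
SELECT date,
       SUM(cash_flow) OVER (ORDER BY date ASC) AS cumulative_cf
FROM transactions
ORDER BY date ASC
""").fetchall()
print(rows)
```

This reproduces the expected running totals: -1000, -1100, -1050.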

#5: Rolling Averages

Acknowledgement: This problem is adapted from Sisense’s “Rolling Averages in MySQL and SQL Server” blog post

Note: there are different ways to compute rolling/moving averages. Here we'll use a preceding average which means that the metric for the 7th day of the month would be the average of the preceding 6 days and that day itself.

Context: Say we have table signups in the form:

| date       | sign_ups |
|------------|----------|
| 2018-01-01 | 10       |
| 2018-01-02 | 20       |
| 2018-01-03 | 50       |
| ...        | ...      |
| 2018-10-01 | 35       |
Task: Write a query to get 7-day rolling (preceding) average of daily sign ups.


Solution:

SELECT 
  a.date, 
  AVG(b.sign_ups) average_sign_ups 
FROM 
  signups a 
JOIN 
  signups b ON a.date <= b.date + interval '6 days' AND a.date >= b.date
GROUP BY 
  a.date
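
Here is a SQLite port of the rolling-average self-join, using julianday() in place of interval arithmetic; the three days of data are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Invented sign-up counts for three consecutive days.
con.executescript("""
CREATE TABLE signups (date TEXT, sign_ups INT);
INSERT INTO signups VALUES
  ('2018-01-01', 10), ('2018-01-02', 20), ('2018-01-03', 50);
""")

rows = con.execute("""
SELECT a.date,
       AVG(b.sign_ups) AS average_sign_ups
FROM signups a
JOIN signups b
  -- keep b rows from a.date's preceding 6 days through a.date itself
  ON julianday(a.date) - julianday(b.date) BETWEEN 0 AND 6
GROUP BY a.date
ORDER BY a.date
""").fetchall()
print(rows)
```

A window-function alternative, AVG(sign_ups) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), gives the same answer only when there is exactly one row per day with no gaps; the self-join works off actual dates.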


#6: Multiple Join Conditions

Acknowledgement: This problem was inspired by Sisense’s “Analyzing Your Email with SQL” blog post

Context: Say we have a table emails that includes emails sent to and from zach@g.com:

| id | subject  | from         | to           | timestamp           |
|----|----------|--------------|--------------|---------------------|
| 1  | Yosemite | zach@g.com   | thomas@g.com | 2018-01-02 12:45:03 |
| 2  | Big Sur  | sarah@g.com  | thomas@g.com | 2018-01-02 16:30:01 |
| 3  | Yosemite | thomas@g.com | zach@g.com   | 2018-01-02 16:35:04 |
| 4  | Running  | jill@g.com   | zach@g.com   | 2018-01-03 08:12:45 |
| 5  | Yosemite | zach@g.com   | thomas@g.com | 2018-01-03 14:02:01 |
| 6  | Yosemite | thomas@g.com | zach@g.com   | 2018-01-03 15:01:05 |
| .. | ..       | ..           | ..           | ..                  |
Task: Write a query to get the response time per email (id) sent to zach@g.com . Do not include ids that did not receive a response from zach@g.com. Assume each email thread has a unique subject. Keep in mind a thread may have multiple responses back-and-forth between zach@g.com and another email address.


Solution:

SELECT 
    a.id, 
    MIN(b.timestamp) - a.timestamp as time_to_respond 
FROM 
    emails a 
JOIN
    emails b 
        ON 
            b.subject = a.subject 
        AND 
            a.to = b.from
        AND 
            a.from = b.to 
        AND 
            a.timestamp < b.timestamp 
 WHERE 
    a.to = 'zach@g.com' 
 GROUP BY 
    a.id 
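
A SQLite check of the response-time logic using the prompt's rows. Two adaptations are assumptions of the port, not part of the original: the from/to columns are renamed sender/recipient to avoid SQL keyword quoting, and the timestamp difference is computed in epoch seconds via strftime('%s', ...), since SQLite cannot subtract datetime strings directly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The emails from the prompt, with columns renamed sender/recipient.
con.executescript("""
CREATE TABLE emails (id INT, subject TEXT, sender TEXT, recipient TEXT, ts TEXT);
INSERT INTO emails VALUES
  (1, 'Yosemite', 'zach@g.com',   'thomas@g.com', '2018-01-02 12:45:03'),
  (2, 'Big Sur',  'sarah@g.com',  'thomas@g.com', '2018-01-02 16:30:01'),
  (3, 'Yosemite', 'thomas@g.com', 'zach@g.com',   '2018-01-02 16:35:04'),
  (4, 'Running',  'jill@g.com',   'zach@g.com',   '2018-01-03 08:12:45'),
  (5, 'Yosemite', 'zach@g.com',   'thomas@g.com', '2018-01-03 14:02:01'),
  (6, 'Yosemite', 'thomas@g.com', 'zach@g.com',   '2018-01-03 15:01:05');
""")

rows = con.execute("""
SELECT a.id,
       MIN(strftime('%s', b.ts)) - strftime('%s', a.ts) AS seconds_to_respond
FROM emails a
JOIN emails b
  ON b.subject = a.subject
 AND a.recipient = b.sender
 AND a.sender = b.recipient
 AND a.ts < b.ts
WHERE a.recipient = 'zach@g.com'
GROUP BY a.id
""").fetchall()
print(rows)
```

Only email 3 received a reply from zach@g.com (email 5, about 21.5 hours later); emails 4 and 6 drop out of the inner join, as the task requires.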


Window Function Practice Problems

#1: Get the ID with the highest value

Context: Say we have a table salaries with data on employee salary and department in the following format:

  depname  | empno | salary |     
-----------+-------+--------+
 develop   |    11 |   5200 | 
 develop   |     7 |   4200 | 
 develop   |     9 |   4500 | 
 develop   |     8 |   6000 | 
 develop   |    10 |   5200 | 
 personnel |     5 |   3500 | 
 personnel |     2 |   3900 | 
 sales     |     3 |   4800 | 
 sales     |     1 |   5000 | 
 sales     |     4 |   4800 | 
Task: Write a query to get the empno with the highest salary. Make sure your solution can handle ties!


Solution:

WITH max_salary AS (
    SELECT 
        MAX(salary) max_salary
    FROM 
        salaries
    )
SELECT 
    s.empno
FROM 
    salaries s
JOIN 
    max_salary ms ON s.salary = ms.max_salary 
Alternate solution using RANK():

WITH sal_rank AS 
  (SELECT 
    empno, 
    RANK() OVER(ORDER BY salary DESC) rnk
  FROM 
    salaries)
SELECT 
  empno
FROM
  sal_rank
WHERE 
  rnk = 1;
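
To see the tie handling in action, the RANK() version can be run in SQLite (3.25+) with one invented extra row, empno 12, duplicating the top salary:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE salaries (depname TEXT, empno INT, salary INT);
INSERT INTO salaries VALUES
  ('develop', 11, 5200), ('develop', 7, 4200), ('develop', 9, 4500),
  ('develop', 8, 6000), ('develop', 10, 5200),
  ('personnel', 5, 3500), ('personnel', 2, 3900),
  ('sales', 3, 4800), ('sales', 1, 5000), ('sales', 4, 4800);
-- Invented row (not in the prompt) to force a tie at the top salary:
INSERT INTO salaries VALUES ('develop', 12, 6000);
""")

rows = con.execute("""
WITH sal_rank AS (
  SELECT empno, RANK() OVER (ORDER BY salary DESC) AS rnk
  FROM salaries
)
SELECT empno FROM sal_rank WHERE rnk = 1 ORDER BY empno
""").fetchall()
print(rows)  # both tied employees are returned
```

Both empno 8 and the invented empno 12 come back with rank 1, which is exactly the tie behavior the task asks for.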


#2: Average and rank with a window function (multi-part)

Part 1:

Context: Say we have a table salaries in the format:

  depname  | empno | salary |     
-----------+-------+--------+
 develop   |    11 |   5200 | 
 develop   |     7 |   4200 | 
 develop   |     9 |   4500 | 
 develop   |     8 |   6000 | 
 develop   |    10 |   5200 | 
 personnel |     5 |   3500 | 
 personnel |     2 |   3900 | 
 sales     |     3 |   4800 | 
 sales     |     1 |   5000 | 
 sales     |     4 |   4800 | 
Task: Write a query that returns the same table, but with a new column that has average salary per depname. We would expect a table in the form:

  depname  | empno | salary | avg_salary |     
-----------+-------+--------+------------+
 develop   |    11 |   5200 |       5020 |
 develop   |     7 |   4200 |       5020 | 
 develop   |     9 |   4500 |       5020 |
 develop   |     8 |   6000 |       5020 | 
 develop   |    10 |   5200 |       5020 | 
 personnel |     5 |   3500 |       3700 |
 personnel |     2 |   3900 |       3700 |
 sales     |     3 |   4800 |       4867 | 
 sales     |     1 |   5000 |       4867 | 
 sales     |     4 |   4800 |       4867 |

Solution:

SELECT 
    *, 
    /*
    * The OVER clause attaches to the aggregate AVG(), so ROUND() has to 
    * wrap the whole window expression 
    */ 
    ROUND(AVG(salary) OVER (PARTITION BY depname), 0) avg_salary
FROM
    salaries
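
The per-department averages can be checked in SQLite (3.25+); note that ROUND() has to wrap the whole window expression, since OVER binds to the aggregate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The salaries table from the prompt.
con.executescript("""
CREATE TABLE salaries (depname TEXT, empno INT, salary INT);
INSERT INTO salaries VALUES
  ('develop', 11, 5200), ('develop', 7, 4200), ('develop', 9, 4500),
  ('develop', 8, 6000), ('develop', 10, 5200),
  ('personnel', 5, 3500), ('personnel', 2, 3900),
  ('sales', 3, 4800), ('sales', 1, 5000), ('sales', 4, 4800);
""")

avg_by_emp = {empno: avg for (_, empno, _, avg) in con.execute("""
SELECT depname, empno, salary,
       ROUND(AVG(salary) OVER (PARTITION BY depname), 0) AS avg_salary
FROM salaries
""")}
print(avg_by_emp)
```

This reproduces the expected table: 5020 for develop, 3700 for personnel, and 4867 for sales (14600/3 rounded).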

Part 2:

Task: Write a query that adds a column with the rank of each employee based on their salary within their department, where the employee with the highest salary gets the rank of 1. We would expect a table in the form:

  depname  | empno | salary | salary_rank |     
-----------+-------+--------+-------------+
 develop   |    11 |   5200 |           2 |
 develop   |     7 |   4200 |           5 | 
 develop   |     9 |   4500 |           4 |
 develop   |     8 |   6000 |           1 | 
 develop   |    10 |   5200 |           2 | 
 personnel |     5 |   3500 |           2 |
 personnel |     2 |   3900 |           1 |
 sales     |     3 |   4800 |           2 | 
 sales     |     1 |   5000 |           1 | 
 sales     |     4 |   4800 |           2 | 

Solution:

SELECT 
    *, 
    RANK() OVER(PARTITION BY depname ORDER BY salary DESC) salary_rank
 FROM  
    salaries 


Other Medium/Hard SQL Practice Problems

#1: Histograms

Context: Say we have a table sessions where each row is a video streaming session with length in seconds:

| session_id | length_seconds |
|------------|----------------|
| 1          | 23             |
| 2          | 453            |
| 3          | 27             |
| ..         | ..             |
Task: Write a query to count the number of sessions that fall into bands of size 5, i.e. for the above snippet, produce something akin to:

| bucket  | count |
|---------|-------|
| 20-25   | 1     |
| 25-30   | 1     |
| 450-455 | 1     |
Get complete credit for the proper string labels (“5-10”, etc.) but near-complete credit for anything that clearly communicates the bin.


Solution:

WITH bin_label AS 
(SELECT 
    session_id, 
    FLOOR(length_seconds/5) as bin_label 
 FROM
    sessions 
 )
 SELECT 
    CONCAT(CAST(bin_label*5 AS VARCHAR), '-', CAST(bin_label*5+5 AS VARCHAR)) bucket, 
    COUNT(DISTINCT session_id) count 
 FROM 
    bin_label 
 GROUP BY 
    bin_label
 ORDER BY 
    bin_label ASC 
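
A runnable SQLite version using the rows from the prompt; integer division already floors for non-negative integers, and || does the string concatenation. Note that 23 lands in 20-25 while 27 lands in 25-30.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The sessions from the prompt.
con.executescript("""
CREATE TABLE sessions (session_id INT, length_seconds INT);
INSERT INTO sessions VALUES (1, 23), (2, 453), (3, 27);
""")

rows = con.execute("""
WITH bin_label AS (
  -- integer / integer floors in SQLite, so no FLOOR() needed here
  SELECT session_id, length_seconds / 5 AS bin
  FROM sessions
)
SELECT (bin * 5) || '-' || (bin * 5 + 5) AS bucket,
       COUNT(DISTINCT session_id) AS count
FROM bin_label
GROUP BY bin
ORDER BY bin
""").fetchall()
print(rows)
```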


#2: CROSS JOIN (multi-part)

Part 1:

Context: Say we have a table state_streams where each row is a state and the total number of hours of streaming from a video hosting service:

| state | total_streams |
|-------|---------------|
| NC    | 34569         |
| SC    | 33999         |
| CA    | 98324         |
| MA    | 19345         |
| ..    | ..            |
(In reality these kinds of aggregate tables would normally have a date column, but we’ll exclude that component in this problem)

Task: Write a query to get the pairs of states with total streaming amounts within 1000 of each other. For the snippet above, we would want to see something like:

| state_a | state_b |
|---------|---------|
| NC      | SC      |
| SC      | NC      |

Solution:

SELECT
    a.state as state_a, 
    b.state as state_b 
 FROM   
    state_streams a
 CROSS JOIN 
    state_streams b 
 WHERE 
    ABS(a.total_streams - b.total_streams) < 1000
    AND 
    a.state <> b.state 
FYI, CROSS JOINs can also be written without explicitly specifying a join:

SELECT
    a.state as state_a, 
    b.state as state_b 
 FROM   
    state_streams a, state_streams b 
 WHERE 
    ABS(a.total_streams - b.total_streams) < 1000
    AND 
    a.state <> b.state 
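
Both forms are plain standard SQL, so a quick SQLite check with the four sample states confirms the pairing logic:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The state_streams rows from the prompt.
con.executescript("""
CREATE TABLE state_streams (state TEXT, total_streams INT);
INSERT INTO state_streams VALUES
  ('NC', 34569), ('SC', 33999), ('CA', 98324), ('MA', 19345);
""")

rows = con.execute("""
SELECT a.state AS state_a, b.state AS state_b
FROM state_streams a
CROSS JOIN state_streams b
WHERE ABS(a.total_streams - b.total_streams) < 1000
  AND a.state <> b.state
ORDER BY state_a
""").fetchall()
print(rows)
```

Only NC and SC are within 1000 streams of each other, and the pair appears in both orderings.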


Part 2:

Note: This question is considered more of a bonus problem than an actual SQL pattern. Feel free to skip it!

Task: How could you modify the SQL from the solution to Part 1 of this question so that duplicates are removed? For example, if we used the sample table from Part 1, the pair NC and SC should only appear in one row instead of two.


Solution:

SELECT
    a.state as state_a, 
    b.state as state_b 
 FROM   
    state_streams a, state_streams b 
 WHERE 
    ABS(a.total_streams - b.total_streams) < 1000
    AND 
    a.state > b.state 
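
And the de-duplicated version, checked the same way: the a.state > b.state predicate keeps exactly one ordering of each pair.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The state_streams rows from the prompt.
con.executescript("""
CREATE TABLE state_streams (state TEXT, total_streams INT);
INSERT INTO state_streams VALUES
  ('NC', 34569), ('SC', 33999), ('CA', 98324), ('MA', 19345);
""")

rows = con.execute("""
SELECT a.state AS state_a, b.state AS state_b
FROM state_streams a, state_streams b
WHERE ABS(a.total_streams - b.total_streams) < 1000
  AND a.state > b.state
""").fetchall()
print(rows)
```

Since 'SC' > 'NC' alphabetically, only the (SC, NC) row survives.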


#3: Advanced Counting

Acknowledgement: This question is adapted from this Stack Overflow question by me (zthomas.nc)

Note: this question is probably more complex than the kind you would encounter in an interview. Consider it a challenge problem, or feel free to skip it!

Context: Say I have a table table in the following form, where a user can be mapped to multiple values of class:

| user | class |
|------|-------|
| 1    | a     |
| 1    | b     |
| 1    | b     |
| 2    | b     |
| 3    | a     |
Task: Assume there are only two possible values for class. Write a query to count the number of users in each class such that any user who has labels a and b gets sorted into b, any user with just a gets sorted into a, and any user with just b gets sorted into b.

For the table above, that would result in:

| class | count |
|-------|-------|
| a     | 1     |
| b     | 2     |

Solution:

WITH usr_b_sum AS 
(
    SELECT 
        user, 
        SUM(CASE WHEN class = 'b' THEN 1 ELSE 0 END) num_b
    FROM 
        table
    GROUP BY 
        user
), 

usr_class_label AS 
(
    SELECT 
        user, 
        CASE WHEN num_b > 0 THEN 'b' ELSE 'a' END class 
    FROM 
        usr_b_sum
)

SELECT 
    class, 
    COUNT(DISTINCT user) count 
FROM
    usr_class_label
GROUP BY 
    class 
ORDER BY 
    class ASC

    
Alternate solution: Using SELECTs in the SELECT statement and UNION:

SELECT 
    'a' class,
    COUNT(DISTINCT user) - 
        (SELECT COUNT(DISTINCT user) FROM table WHERE class = 'b') count 
FROM 
    table
UNION
SELECT 
    'b' class,
    (SELECT COUNT(DISTINCT user) FROM table WHERE class = 'b') count 
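
Here is a SQLite check of the CTE solution; the table and column are renamed (user_classes, usr) in this port, since table is a reserved word and user is awkward to quote:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The rows from the prompt, with renamed table/column.
con.executescript("""
CREATE TABLE user_classes (usr INT, class TEXT);
INSERT INTO user_classes VALUES (1, 'a'), (1, 'b'), (1, 'b'), (2, 'b'), (3, 'a');
""")

rows = con.execute("""
WITH usr_b_sum AS (
  -- count each user's 'b' rows; > 0 means the user sorts into b
  SELECT usr, SUM(CASE WHEN class = 'b' THEN 1 ELSE 0 END) AS num_b
  FROM user_classes
  GROUP BY usr
)
SELECT CASE WHEN num_b > 0 THEN 'b' ELSE 'a' END AS class,
       COUNT(DISTINCT usr) AS count
FROM usr_b_sum
GROUP BY 1
ORDER BY 1
""").fetchall()
print(rows)
```

User 1 has both labels and sorts into b, matching the expected a:1, b:2 output.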

Tuesday, May 5, 2020

self in python Class/Instance variables/methods

Once you start using Python, there is no escaping from this word “self”. It is seen in method definitions and in variable initialization. But getting the idea behind it seems somewhat troublesome. Hopefully, at the end of this, you will get an intuitive idea of what “self” is and where you should use it.
But before talking about the self keyword (which is actually not a python keyword or any special literal), first, let’s recall what class variables and instance variables are. Class variables are variables that are shared by all instances (objects) created from that particular class, so when a class variable is accessed from any instance, the value will be the same. Instance variables, on the other hand, are variables that each instance keeps for itself (i.e. a particular object owns its instance variables), so the values of instance variables typically differ from instance to instance.
Class variables in python are defined just after the class definition and outside of any methods:
class SomeClass:
    variable_1 = " This is a class variable"
    variable_2 = 100   #this is also a class variable
Unlike class variables, instance variables should be defined within methods:
class SomeClass:
    variable_1 = " This is a class variable"
    variable_2 = 100    #this is also a class variable.

    def __init__(self, param1, param2):
        self.instance_var1 = param1
        #instance_var1 is an instance variable
        self.instance_var2 = param2
        #instance_var2 is an instance variable
Let’s instantiate above class and do some introspections about those instances and above class:
>>> obj1 = SomeClass("some thing", 18) 
#creating an instance of SomeClass named obj1
>>> obj2 = SomeClass(28, 6) 
#creating an instance of SomeClass named obj2

>>> obj1.variable_1
' This is a class variable'

>>> obj2.variable_1
' This is a class variable'
So as seen above, both obj1 and obj2 gives the same value when variable_1 is accessed, which is the normal behavior that we should expect from a class variable. Let’s find about instance variables:
>>> obj1.instance_var1
'some thing'
>>> obj2.instance_var1
28
So the expected behavior of instance variables can be seen above without any error. That is, both obj1 and obj2 have two different instance variables for themselves.

Instance and class methods in python

Just as there are instance and class variables, there are instance and class methods. These are intended to set or get the status of the relevant class or instance. So the purpose of class methods is to set or get the details (status) of the class, and the purpose of instance methods is to set or get details about instances (objects). That being said, let's see how to create instance and class methods in python.
When defining an instance method, the first parameter of the method should always be self. Why it should always be “self” is discussed in the next section (which is the main aim of this post). However, one can name it anything other than self; what that parameter represents will always be the same. And it’s a good idea to stick with self, as it’s the convention.
class SomeClass:    
    def create_arr(self): # An instance method
        self.arr = []
    
    def insert_to_arr(self, value):  #An instance method
        self.arr.append(value)
We can instantiate above class as obj3, and do some investigations as follows:
>>> obj3 = SomeClass()
>>> obj3.create_arr()
>>> obj3.insert_to_arr(5)
>>> obj3.arr
[5]
So as you can notice from above, although when defining an instance method the first parameter is self, when calling that method, we do not pass anything for self as arguments. How come this does not give errors? What’s going on behind the scene? These are explained in the next section.
Ok, with instance methods explained, all we have left is class methods ( — so I say). Just like instance methods, class methods also have a special parameter that should be placed as the first parameter. It is the cls parameter, which represents the class:
class SomeClass:
    def create_arr(self): # An instance method
        self.arr = []
    
    def insert_to_arr(self, value):  #An instance method
        self.arr.append(value)
        
    @classmethod
    def class_method(cls):
        print("the class method was called")
Without even instantiating an object, we can access class methods as follows:
SomeClass.class_method()
So all we have to do is call the class method with the name of the class. And here too, just like with instance methods, although there is a parameter defined as cls, we do not pass any argument when calling the method — explained next.
Note: Python has another type of methods known as static methods. These are normal methods which do not have any special parameter as with instance methods or class methods. Therefore these static methods can neither modify object state nor class state.
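
A minimal sketch contrasting the three kinds of methods side by side (the Counter class and its names are invented for illustration):

```python
class Counter:
    count = 0  # class variable, shared by every instance

    def __init__(self):
        self.items = []          # instance variable, one list per object
        Counter.count += 1       # update shared class state

    def add(self, value):        # instance method: receives the instance as self
        self.items.append(value)

    @classmethod
    def how_many(cls):           # class method: receives the class as cls
        return cls.count

    @staticmethod
    def is_valid(value):         # static method: receives neither
        return value >= 0

c1 = Counter()
c2 = Counter()
c1.add(5)
print(Counter.how_many())    # 2
print(Counter.is_valid(-1))  # False
```

Note that c1.add(5) touched only c1's items list, while both instantiations bumped the shared count.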
Now with all those reminders in place (instance/class variables and methods), let’s talk about the use of self in python ( — finally).

self — intuition

Some of you may have got it by now, or some may have got it partially; anyway, self in python represents, or points to, the instance on which a method was called. Let’s clarify this with an example:
class SomeClass:
    def __init__(self):
        self.arr = [] 
        #All SomeClass objects will have an array arr by default
    
    def insert_to_arr(self, value):
        self.arr.append(value)
So now let’s create two objects of SomeClass and append some values for their arrays:
obj1 = SomeClass()
obj2 = SomeClass()
obj1.insert_to_arr(6)
Important: Unlike some other languages, when a new object is created in python, it does not create a new set of instance methods for itself. So instance methods live within the class object without being created with each object initialization — a nice way to save memory. Recall that python is a fully object-oriented language, so a class is also an object, and it lives within memory.
Being said that, let’s look at the above example. There we have created obj1 and are calling the instance method insert_to_arr() of SomeClass while passing an argument 6. But now how does that method know “which object is calling me and whose instance attributes should be updated”. Here, to whose arr array should I append the value 6? Ok, now I think you got it. That’s the job of self. Behind the scene, in every instance method call, python sends the instance also with that method call. So what actually happens is, python convert the above calling of the instance method to something like below:
SomeClass.insert_to_arr(obj1, 6)
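You can check this equivalence yourself: both call styles below append to the same object's arr (a small runnable sketch reusing the SomeClass definition from above):

```python
class SomeClass:
    def __init__(self):
        self.arr = []

    def insert_to_arr(self, value):
        self.arr.append(value)

obj1 = SomeClass()
obj1.insert_to_arr(6)             # Python passes obj1 as self implicitly
SomeClass.insert_to_arr(obj1, 7)  # the same call, spelled out explicitly
print(obj1.arr)  # [6, 7]
```

Note that a second object gets its own arr, but both objects share the single insert_to_arr function stored on the class.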

Monday, May 4, 2020

Scala CheatSheet

variables
var x = 5

GOOD
x = 6
Variable.
val x = 5

BAD
x = 6
Constant.
var x: Double = 5
Explicit type.
functions
GOOD
def f(x: Int) = { x * x }

BAD
def f(x: Int)   { x * x }
Define function.
Hidden error: without = it’s a procedure returning Unit; causes havoc. Deprecated in Scala 2.13.
GOOD
def f(x: Any) = println(x)

BAD
def f(x) = println(x)
Define function.
Syntax error: need types for every arg.
type R = Double
Type alias.
def f(x: R)
vs.
def f(x: => R)
Call-by-value.

Call-by-name (lazy parameters).
(x: R) => x * x
Anonymous function.
(1 to 5).map(_ * 2)
vs.
(1 to 5).reduceLeft(_ + _)
Anonymous function: underscore is positionally matched arg.
(1 to 5).map(x => x * x)
Anonymous function: to use an arg twice, have to name it.
(1 to 5).map { x =>
  val y = x * 2
  println(y)
  y
}
Anonymous function: block style returns last expression.
(1 to 5) filter {
  _ % 2 == 0
} map {
  _ * 2
}
Anonymous functions: pipeline style (or parens too).
def compose(g: R => R, h: R => R) =
  (x: R) => g(h(x))

val f = compose(_ * 2, _ - 1)
Anonymous functions: to pass in multiple blocks, need outer parens.
val zscore =
  (mean: R, sd: R) =>
    (x: R) =>
      (x - mean) / sd
Currying, obvious syntax.
def zscore(mean: R, sd: R) =
  (x: R) =>
    (x - mean) / sd
Currying, obvious syntax.
def zscore(mean: R, sd: R)(x: R) =
  (x - mean) / sd
Currying, sugar syntax. But then:
val normer =
  zscore(7, 0.4) _
Need trailing underscore to get the partial, only for the sugar version.
def mapmake[T](g: T => T)(seq: List[T]) =
  seq.map(g)
Generic type.
5.+(3); 5 + 3

(1 to 5) map (_ * 2)
Infix sugar.
def sum(args: Int*) =
  args.reduceLeft(_+_)
Varargs.
packages
import scala.collection._
Wildcard import.
import scala.collection.Vector

import scala.collection.{Vector, Seq}
Selective import.
import scala.collection.{Vector => Vec28}
Renaming import.
import java.util.{Date => _, _}
Import all from java.util except Date.
At start of file:
package pkg

Packaging by scope:
package pkg {
  ...
}

Package singleton:
package object pkg {
  ...
}
Declare a package.
data structures
(1, 2, 3)
Tuple literal (Tuple3).
var (x, y, z) = (1, 2, 3)
Destructuring bind: tuple unpacking via pattern matching.
BAD
var x, y, z = (1, 2, 3)
Hidden error: each assigned to the entire tuple.
var xs = List(1, 2, 3)
List (immutable).
xs(2)
Paren indexing.
1 :: List(2, 3)
Cons.
1 to 5
same as
1 until 6

1 to 10 by 2
Range sugar.
()
Empty parens is singleton value of the Unit type.
Equivalent to void in C and Java.
control constructs
if (check) happy else sad
Conditional.
if (check) happy

same as
if (check) happy else ()
Conditional sugar.
while (x < 5) {
  println(x)
  x += 1
}
While loop.
do {
  println(x)
  x += 1
} while (x < 5)
Do-while loop.
import scala.util.control.Breaks._

breakable {
  for (x <- xs) {
    if (Math.random < 0.1)
      break
  }
}
Break.
for (x <- xs if x % 2 == 0)
  yield x * 10

same as
xs.filter(_ % 2 == 0).map(_ * 10)
For-comprehension: filter/map.
for ((x, y) <- xs zip ys)
  yield x * y

same as
(xs zip ys) map {
  case (x, y) => x * y
}
For-comprehension: destructuring bind.
for (x <- xs; y <- ys)
  yield x * y

same as
xs flatMap { x =>
  ys map { y =>
    x * y
  }
}
For-comprehension: cross product.
for (x <- xs; y <- ys) {
  val div = x / y.toFloat
  println("%d/%d = %.1f".format(x, y, div))
}
For-comprehension: imperative-ish.
sprintf style.
for (i <- 1 to 5) {
  println(i)
}
For-comprehension: iterate including the upper bound.
for (i <- 1 until 5) {
  println(i)
}
For-comprehension: iterate omitting the upper bound.
pattern matching
GOOD
(xs zip ys) map {
  case (x, y) => x * y
}

BAD
(xs zip ys) map {
  (x, y) => x * y
}
Use case in function args for pattern matching.
BAD
val v42 = 42
3 match {
  case v42 => println("42")
  case _   => println("Not 42")
}
v42 is interpreted as a name matching any Int value, and “42” is printed.
GOOD
val v42 = 42
3 match {
  case `v42` => println("42")
  case _     => println("Not 42")
}
`v42` with backticks is interpreted as the existing val v42, and “Not 42” is printed.
GOOD
val UppercaseVal = 42
3 match {
  case UppercaseVal => println("42")
  case _            => println("Not 42")
}
UppercaseVal is treated as an existing val, rather than a new pattern variable, because it starts with an uppercase letter. Thus, the value contained within UppercaseVal is checked against 3, and “Not 42” is printed.
object orientation
class C(x: R)
Constructor params - x is only available in class body.
class C(val x: R)

var c = new C(4)

c.x
Constructor params - automatic public member defined.
class C(var x: R) {
  assert(x > 0, "positive please")
  var y = x
  val readonly = 5
  private var secret = 1
  def this() = this(42)
}
Constructor is class body.
Declare a public member.
Declare a gettable but not settable member.
Declare a private member.
Alternative constructor.
new {
  ...
}
Anonymous class.
abstract class D { ... }
Define an abstract class (non-createable).
class C extends D { ... }
Define an inherited class.
class D(var x: R)

class C(x: R) extends D(x)
Inheritance and constructor params (wishlist: automatically pass-up params by default).
object O extends D { ... }
Define a singleton (module-like).
trait T { ... }

class C extends T { ... }

class C extends D with T { ... }
Traits.
Interfaces-with-implementation. No constructor params. Mixin-able.
trait T1; trait T2

class C extends T1 with T2

class C extends D with T1 with T2
Multiple traits.
class C extends D { override def f = ...}
Must declare method overrides.
new java.io.File("f")
Create object.
BAD
new List[Int]

GOOD
List(1, 2, 3)
Type error: abstract type.
Instead, convention: callable factory shadowing the type.
classOf[String]
Class literal.
x.isInstanceOf[String]
Type check (runtime).
x.asInstanceOf[String]
Type cast (runtime).
x: String
Ascription (compile time).
options
Some(42)
Construct a non-empty optional value.
None
The singleton empty optional value.
Option(null) == None
Option(obj.unsafeMethod)
but
Some(null) != None
Null-safe optional value factory.
val optStr: Option[String] = None
same as
val optStr = Option.empty[String]
Explicit type for empty optional value.
Factory for empty optional value.
val name: Option[String] =
  request.getParameter("name")
val upper = name.map {
  _.trim
} filter {
  _.length != 0
} map {
  _.toUpperCase
}
println(upper.getOrElse(""))
Pipeline style.
val upper = for {
  name <- request.getParameter("name")
  trimmed <- Some(name.trim)
    if trimmed.length != 0
  upper <- Some(trimmed.toUpperCase)
} yield upper
println(upper.getOrElse(""))
For-comprehension syntax.
option.map(f(_))
same as
option match {
  case Some(x) => Some(f(x))
  case None    => None
}
Apply a function on the optional value.
option.flatMap(f(_))
same as
option match {
  case Some(x) => f(x)
  case None    => None
}
Same as map but function must return an optional value.
optionOfOption.flatten
same as
optionOfOption match {
  case Some(Some(x)) => Some(x)
  case _             => None
}
Extract nested option.
option.foreach(f(_))
same as
option match {
  case Some(x) => f(x)
  case None    => ()
}
Apply a procedure on optional value.
option.fold(y)(f(_))
same as
option match {
  case Some(x) => f(x)
  case None    => y
}
Apply function on optional value, return default if empty.
option.collect {
  case x => ...
}
same as
option match {
  case Some(x) if f.isDefinedAt(x) => ...
  case Some(_)                     => None
  case None                        => None
}
Apply partial pattern match on optional value.
option.isDefined
same as
option match {
  case Some(_) => true
  case None    => false
}
true if not empty.
option.isEmpty
same as
option match {
  case Some(_) => false
  case None    => true
}
true if empty.
option.nonEmpty
same as
option match {
  case Some(_) => true
  case None    => false
}
true if not empty.
option.size
same as
option match {
  case Some(_) => 1
  case None    => 0
}
0 if empty, otherwise 1.
option.orElse(Some(y))
same as
option match {
  case Some(x) => Some(x)
  case None    => Some(y)
}
Evaluate and return alternate optional value if empty.
option.getOrElse(y)
same as
option match {
  case Some(x) => x
  case None    => y
}
Evaluate and return default value if empty.
option.get
same as
option match {
  case Some(x) => x
  case None    => throw new Exception
}
Return value, throw exception if empty.
option.orNull
same as
option match {
  case Some(x) => x
  case None    => null
}
Return value, null if empty.
option.filter(f)
same as
option match {
  case Some(x) if f(x) => Some(x)
  case _               => None
}
Optional value satisfies predicate.
option.filterNot(f(_))
same as
option match {
  case Some(x) if !f(x) => Some(x)
  case _                => None
}
Optional value doesn't satisfy predicate.
option.exists(f(_))
same as
option match {
  case Some(x) if f(x) => true
  case Some(_)         => false
  case None            => false
}
Apply predicate on optional value or false if empty.
option.forall(f(_))
same as
option match {
  case Some(x) if f(x) => true
  case Some(_)         => false
  case None            => true
}
Apply predicate on optional value or true if empty.
option.contains(y)
same as
option match {
  case Some(x) => x == y
  case None    => false
}