Lesson #1: Calculate Internal Rate of Return (IRR) Directly in Redshift
Internal Rate of Return (IRR) is a calculation frequently used to estimate an investment’s rate of return.
I encountered it at a large mortgage lender where IRR was one of many calculations used to determine the health of individual loans. The inputs were the mortgage principal, the previous monthly payments, and the estimated monthly payments to the end of the loan period. So, there could easily be more than 300 inputs to the IRR calculation. The standard practice was to transfer a sample of borrower data from the data warehouse to the SAS server, run the IRR calculation there, and transfer the results back to the data warehouse.
The type of iterative calculation required by IRR is not traditionally done on MPP data warehouse platforms. However, with a bit of advanced SQL thinking, this type of calculation can take advantage of the distributed nature of the MPP to score very large populations.
Step 1 - Introduction to IRR
So, what is the IRR calculation?
Advanced SQL writers come from a range of backgrounds. You may not have a computer science degree, or remember much of your high school or college math, so I’ll cover a few details here. The Additional Resources section below links to several useful articles regarding IRR; since the explanation here is minimal, you should read those as well. The purpose here is to demonstrate some useful SQL techniques. If you need to implement this, or any other financial calculation, in production, work with an experienced financial engineer.
The IRR calculation is built on the Net Present Value (NPV) equation:

NPV = C0/(1+r)^0 + C1/(1+r)^1 + C2/(1+r)^2 + ... + Cn/(1+r)^n = 0

where Ct is the payment in period t. Our goal is to find “r”, the interest rate that makes the Net Present Value equal to 0. In the early 19th century, the work of Abel and Galois showed that polynomials of degree five and higher have no general solution in radicals; that is, there is no “formula” for finding “r”. You have probably encountered these types of functions in your mathematical career. Hopefully, your math teacher discussed a bit about how your fancy scientific calculator, or computer program, was able to solve these equations without a formula.
In Computer Science, the study of finding “good enough” solutions to impossible mathematical problems is called Numerical Methods or Numerical Analysis. For this problem, we are trying to find the “r” that makes NPV equal to zero. So, with our computer program, we can use this method:
1. Make a guess for “r”
2. Calculate the NPV based on that guess
3. Evaluate whether or not the NPV resulting from that guess is “close enough” to 0
   a. If it is “close enough”, then we are done and our guess is a “good enough” result for r
   b. If not, then we need a better guess
4. Improve the guess and return to step 2
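As a point of reference, here is that loop in a procedural language. This is a minimal Python sketch using bisection to “improve the guess”; the cash flows match the example data used later in this lesson (an outlay of -10000 followed by six payments of 2400), and the bracket, tolerance, and iteration cap are my own arbitrary choices, not part of the lesson.

```python
def npv(rate, payments):
    # Net Present Value: sum of payment / (1 + r)^period, periods from 0
    return sum(p / (1 + rate) ** t for t, p in enumerate(payments))

def irr_bisection(payments, lo=0.0, hi=1.0, tol=1e-6, max_iter=200):
    # Bisection assumes NPV is decreasing in r on [lo, hi], which holds
    # for a loan-style cash flow (one outlay, then positive payments).
    guess = (lo + hi) / 2
    for _ in range(max_iter):
        guess = (lo + hi) / 2          # step 1: make a guess
        value = npv(guess, payments)   # step 2: calculate NPV for it
        if abs(value) < tol:           # step 3: close enough to 0?
            break                      # step 3a: good enough, done
        if value > 0:                  # step 4: improve the guess
            lo = guess
        else:
            hi = guess
    return guess

cash_flows = [-10000, 2400, 2400, 2400, 2400, 2400, 2400]
```

For this data, `irr_bisection(cash_flows)` lands very close to the 0.115305 that Excel reports for the same payments.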
Step 2 - Relational Database Equivalent of Iteration
It is clear how this type of algorithm can be implemented in a traditional programming language, maybe even in a database system with a procedural language available. However, the power of the MPP lies in SQL, not in procedural code. In order to implement this in SQL, we have to change our thinking from “one at a time” to “all at once”. Consider this simple case. In a programming language, you might print the numbers one through 10 like this:
for i=1, 10 do
  print(i)
end
Assuming that I have a numbers table, in SQL I achieve the same result like this:
select num
from numbers
where num <= 10
order by num
;
In the procedural language, it is clear that the computer “did” something (printed 1), then “did” something else (printed 2). Did this happen in the database system? Maybe, maybe not. The point is that SQL is focused on the result set. The SET is only in a particular order because I asked for it to be displayed that way. There is a SET of numbers. I don’t care about the order in which they were read off the disk. Let’s go a step further:
total = 0
for i=1, 10 do
  total = total + i
end
print(total)
select sum(num) as total
from (
  select num
  from numbers
  where num <= 10
) as inner_select
;
Again, the order of the process described by the procedural language matches what the system actually does. Does the database system keep a running total when it calculates a SUM? I don’t know. It doesn’t matter. The SQL says that, of the SET returned by the inner_select, give me the sum. When there is a logical ordering of operations – in order to SUM over a set I first have to identify a set – then that order of operations is clear in the SQL.
So, how does this apply to our problem of making guesses? In procedural languages, we make one guess at a time, evaluate it, and make another guess; in SQL we can make all of our guesses at the same time.
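Here is that “all at once” idea sketched in Python (my own illustration, not from the lesson): generate every guess up front, compute one NPV per guess, and keep the guess whose NPV is closest to 0. The grid of 999 guesses is an arbitrary choice.

```python
cash_flows = [-10000, 2400, 2400, 2400, 2400, 2400, 2400]

# Make all of the guesses at the same time: one candidate rate per "row".
guesses = [g / 1000 for g in range(1, 1000)]   # 0.001 .. 0.999

# One NPV per guess, computed over the whole set at once.
npvs = {g: sum(p / (1 + g) ** t for t, p in enumerate(cash_flows))
        for g in guesses}

# The winner is the guess whose NPV is closest to 0 in absolute value.
best_guess = min(npvs, key=lambda g: abs(npvs[g]))
```

There is no “previous guess” feeding the next one; the set of guesses exists all at once, which is exactly the shape the SQL version will take.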
Step 3 - IRR Baby Steps
In breaking down a calculation like this, I like to start with all the terms on a single row and work backward from there. We can easily get as many examples as we want by using the IRR function in Excel. So, let’s look at a simple Excel example:
IRR Excel Function
Remember that anything raised to the power of “0” is equal to 1. So, we can use a “0” period to get the value of the payment “as-is”.
If “Net Present Value” is the sum of a bunch of things, let’s call those pre-summed things “Present Value”. So, for every payment period we need:
- an ID so we can handle multiple borrowers
- the period number
- the payment for that period
-- example 1
drop table if exists payment_data;

create temp table payment_data (id int, period int, payment int);

insert into payment_data VALUES
(1, 0, -10000),
(1, 1, 2400),
(1, 2, 2400),
(1, 3, 2400),
(1, 4, 2400),
(1, 5, 2400),
(1, 6, 2400)
;

-- section 1
select *
from payment_data
order by id, period
;
Let’s build up the calculation. Remember, we are trying to get an NPV of 0. For our example data, Excel has told us that happens when r is about “0.115305”.
- for a given period, the numerator is the payment for that period
- the term inside the parentheses is (1 + 0.115305)
- the denominator is that term raised to the power of the period: power((1 + 0.115305), period)
So, the query is
select
  id,
  period,
  payment,
  payment / (power(1 + 0.115305, period)) as pv
from payment_data
order by id, period
;
Let’s sum that to see whether it is correct:
select sum(pv) as npv
from (
  select
    id,
    period,
    payment,
    payment / (power(1 + 0.115305, period)) as pv
  from payment_data
) as inner_query
;
So, that’s pretty close to 0. Theoretically, we can get as close to 0 as we want by continually adding decimal places to our value of r.
In this case, we “cheated” by having Excel tell us the correct value up front. Next we are going to evaluate r over a range of “guesses” to determine which value of r produces an NPV close enough to 0.
Step 4 - Longer Strides
So, we know from Excel that the IRR for our data is between 11% and 12%. Let’s explore a few values.
First a table of guesses:
drop table if exists guesses;

create temp table guesses as
select num * .01 as guess
from numbers
;
In the previous example, we had the “guess” hard coded. Now we want our guesses table to drive the guesses, so every row of our payment data needs its own guess. In SQL, we achieve this with a cross join. In SQL development we always need to keep an estimate of the sizes of our input and output sets in mind so things don’t get out of hand. We have 7 periods and we’ll look at 10 guesses initially: 70 inner rows that will aggregate to 10 npv_guesses to evaluate. Rather than enforce this by pre-limiting our guesses table, we’ll assume that Redshift is smart enough to apply the limitation for us. If not, then we’ll have 7 million inner rows that are filtered down to 70. For Redshift, we won’t consider that to be a big deal for now.
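The cardinality math above is easy to check. In Python terms, a cross join is itertools.product: every row of one set paired with every row of the other. The ranges below are illustrative, matching the 7 periods and 10 guesses discussed in the text.

```python
from itertools import product

periods = range(7)                          # periods 0 through 6
guesses = [g / 100 for g in range(6, 16)]   # 0.06 .. 0.15, ten guesses

# cross join: every period paired with every guess
rows = list(product(periods, guesses))
# 7 periods x 10 guesses = 70 inner rows,
# which aggregate to 10 npv_guesses (one per guess)
```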
So our inner query is this:
select
  id,
  period,
  payment,
  guess,
  payment / (power(1 + guess, period)) as pv
from payment_data
cross join guesses
where guess between 0.06 and .15
order by id, period, guess
;
We can’t tell much by looking at this level of detail, so let’s aggregate:
select
  id,
  guess,
  sum(pv) as npv_guess
from (
  select
    id,
    period,
    payment,
    guess,
    payment / (power(1 + guess, period)) as pv
  from payment_data
  cross join guesses
  where guess between 0.06 and .15  -- an arbitrary limit on guess for easy viewing
  order by id, period, guess
) as inner_query
group by id, guess
order by id, guess
;
We can see that the value closest to 0 is .12. Let’s dial in additional precision by adding decimals to our guesses, then re-running the aggregate query:
drop table if exists guesses;

create temp table guesses as
select num * .001 as guess
from numbers
;
Run the npv_guess query again.
Now there are 100 rows of output and closest to 0 is .115. Let’s jump a couple levels of precision and re-run the aggregate query.
drop table if exists guesses;

create temp table guesses as
select num * .00001 as guess
from numbers
;
Now we have 10,000 rows with the closest being 0.11530.
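The three grids we just ran by hand follow a pattern: scan at one step size, find the winner, then rescan at a finer step size. A procedural sketch of that refinement in Python (my own illustration; the step sizes match the num*.01, num*.001, and num*.00001 guesses tables above):

```python
def npv(rate, payments):
    # Net Present Value: sum of payment / (1 + r)^period
    return sum(p / (1 + rate) ** t for t, p in enumerate(payments))

def irr_by_grids(payments, steps=(0.01, 0.001, 0.00001)):
    # Rebuild the "guesses table" at each precision level and keep the
    # guess whose NPV is closest to 0, exactly as the queries above did.
    best = None
    for step in steps:
        grid = [n * step for n in range(1, int(round(1 / step)))]
        best = min(grid, key=lambda g: abs(npv(g, payments)))
    return best

cash_flows = [-10000, 2400, 2400, 2400, 2400, 2400, 2400]
```

Each pass here rescans the whole unit interval, just as the queries above did; in practice you could narrow each finer scan to the neighborhood of the previous winner.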
If you are working through these queries, go ahead and put the decimal back to “0.01” in the guesses table so we get faster execution times for the rest of the examples.
Step 5 - Crossing the Finish Line
Now we can see the shape of where we are going. We are making all of our guesses “at the same time”; at least as part of the same result set. From that set, we need to find the one that is closest to 0; that is, the npv_guess that has the minimum absolute value.
As our levels of aggregation continue to grow, we need to be comfortable with the technique of SQL Window functions.
Let’s rewrite the previous query with a window function. Also, remember the previous note to take the guesses table back down to two decimal places for faster execution.
select
  *,
  payment / (power(1 + guess, period)) as pv,
  sum(payment / (power(1 + guess, period)))
    over (partition by id, guess order by period rows unbounded preceding) as npv_guess,
  max(period) over (partition by id, guess) as max_period
from payment_data
cross join guesses
order by id, guess, period
;
Now we have our payment data, pv, and npv_guess on the same row. The npv_guess aggregation is being driven by the sum using the window function. For aggregating at the next level, where we are going to find the npv_guess closest to 0, we need to choose a row. The row we want is the one with the last period for our data. So, we have a max_period aggregation that we’ll use for a filter at the next level. Note that the final “order by” clause here and in examples below is for us to use in visualizing the output. The aggregation here is based on the “order by” clause inside the window function.
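To make the window-function mechanics concrete, here is the same running-sum-plus-max_period logic traced in Python for a single (id, guess) partition (an illustrative sketch; the variable names are mine):

```python
cash_flows = [-10000, 2400, 2400, 2400, 2400, 2400, 2400]  # periods 0..6
guess = 0.12

# sum(...) over (partition by id, guess order by period
#                rows unbounded preceding): a running total by period.
max_period = len(cash_flows) - 1   # max(period) over (partition by id, guess)
running = 0.0
npv_guess = None
for period, payment in enumerate(cash_flows):
    running += payment / (1 + guess) ** period
    if period == max_period:       # the "where period = max_period" filter
        npv_guess = running        # the running total over the whole partition
```

The row at max_period is the only one where the running total equals the full NPV for that guess, which is why the next level of the query filters on it.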
Any time we use window functions, we want to add test cases to make sure that the functions are working as expected. So, let’s add a couple more IRR examples from Excel:
insert into payment_data VALUES
(2, 0, -1000),
(2, 1, 120),
(2, 2, 360),
(2, 3, 100),
(2, 4, 240),
(2, 5, 480),
(3, 0, -18000),
(3, 1, 3100),
(3, 2, 2400),
(3, 3, 2400),
(3, 4, 2400),
(3, 5, 2400),
(3, 6, 2400),
(3, 7, 3000),
(3, 8, 3200),
(3, 9, 3600)
;
At this level, we have all of our guesses, along with their distances from 0 (absolute value), and an identification of which of them is closest to 0:
select
  id,
  guess,
  abs(npv_guess) as abs_npv_guess,
  min(abs(npv_guess)) over (partition by id) as min_abs_npv_guess
from (
  select
    *,
    payment / (power(1 + guess, period)) as pv,
    sum(payment / (power(1 + guess, period)))
      over (partition by id, guess order by period rows unbounded preceding) as npv_guess,
    max(period) over (partition by id, guess) as max_period
  from payment_data
  cross join guesses
  order by id, guess, period
) as payment_level
where period = max_period
order by id
;
So, one additional filter gives the final query:
select
  id,
  guess as irr
from (
  select
    id,
    guess,
    abs(npv_guess) as abs_npv_guess,
    min(abs(npv_guess)) over (partition by id) as min_abs_npv_guess
  from (
    select
      *,
      payment / (power(1 + guess, period)) as pv,
      sum(payment / (power(1 + guess, period)))
        over (partition by id, guess order by period rows unbounded preceding) as npv_guess,
      max(period) over (partition by id, guess) as max_period
    from payment_data
    cross join guesses
    order by id, guess, period
  ) as payment_level
  where period = max_period
  order by id, guess
) as guess_level
where abs_npv_guess = min_abs_npv_guess
order by id
;