Forensic Accounting Techniques with R

Growth Analysis

Your browser needs to be JavaScript capable to view this video

Try reloading this page, or reviewing your browser settings

This video segment takes a look at our expense report data using growth analysis techniques and time series visualization.

Keywords

  • Growth analysis
  • R
  • count
  • sum
  • min
  • max
  • mean
  • median
  • time series

About this video

Author(s)
Kevin Feasel
First online
20 December 2019
DOI
https://doi.org/10.1007/978-1-4842-5673-2_2
Online ISBN
978-1-4842-5673-2
Publisher
Apress
Copyright information
© Kevin Feasel 2019

Video Transcript

In this section, we will talk about growth analysis. Growth analysis focuses on time series data, that is, looking at information aggregated across time. Often we will aggregate this information by a date range, such as a month or year, although aggregation is not necessary.

Given data which stretches over time, we will often aggregate this information to make use of it. We might want to get the count of records by month or year, such as the count of new customers. We could get the sum of an additive measure, such as cost revenue or gross margin. We might be interested in looking at changes in the average over time where average could be either the mean or the median.

We might even be interested in looking at percentiles. The extreme percentiles, minimum and maximum, are the most common. But we could also look at trends in say the 95th percentile of response time over time. The goal here is to understand whether the data we are looking at is fairly stable or if there are variations or trends over time.

One of the tools we can use to gauge trends is line charts. Line charts are excellent visuals for understanding changes over time as we will see in our demo. In this demo, we will continue our high-level analysis of this expenses data and see if we find anything suspicious or weird.

Let’s first review the results by year. To do this, we will take expense data, group it by year, and summarize the values. We will create the count with n, the total with sum, and then create mean and median amounts with the mean and median functions respectively.

At this level, nothing looks particularly out of line. We can see variance from year to year. But the only thing that stands out to me is that the mean expense report amount has increased consistently over time, though not by a huge amount. But people aren’t always great at finding patterns in tables of text. So let’s visualize this a little bit.

First I’m going to save this data frame as aggregates. Next I want to plot the mean and the median over time. To do so, I will use the ggplot2 library. We wants to call the ggplot function, passing in aggregates as our data set. We will build out an aesthetic setting the x-axis to year and y-axis to mean amount.

Once we have that in place, we will create a line geometry. ggplot has some default behavior around displaying numbers and axes, which doesn’t make much sense when you’re looking at years. So let’s override that behavior by setting the x-axis to scale from 2011 through 2019. Finally we will use a minimalist theme, because I think it looks nicer than the default.

Now let’s run the script. What we see here is a steady growth over time. Given the default y-axis range, we can see that, yes, there is real growth. Although as we noted before, it has risen from about $50 in 2011 to $54 per expense report in 2019– so not a major change.

If we want to plot the median, we can do the same thing. We will build out an aesthetic setting the x-axis to year and the y-axis to the median amount. Then we will create a line geometry.

We will next override the x-axis to scale again between 2011 and 2019. Finally, we will use the minimalist theme once again. Running this cell, we can see that the median fluctuates throughout the time period.

Now let’s go one level deeper and try to understand the change in median amount over time for inexpensive, moderately expensive, and expensive cities respectively. To do this, we first need to create a new data frame, which is aggregated by year and city type. Just as before, we’re going to build out count, sum, mean, and median measures for our data but this time group it by city type as well as year. And let’s finish up these measures with mean and median.

Once we have the data frame in place, we can use ggplot2 to build out another line chart. This time our x-axis will be year, y-axis median amount, and a different color will represent each city type. Once we have this new aesthetic set up, we will create a line chart. Just as before, we want to scale the x-axis as a continuous value ranging from 2011 through 2019 and breaks at each year.

And now that we have introduced color as a differentiating visual feature, I want to use the ColorBrewer function to choose colors from a color vision deficiency safe palette. We will tell ColorBrewer that we want a qualitative palette, meaning that each city type is logically distinct from the others and there is no gradation from best to worst, top to bottom, or any other implicit ordering.

I like how the Dark2 palette looks, so we’ve chosen that one. Finally, we will add the minimalist theme. Just as before, let it run, and we’ll see the results.

This time around, we see three line charts, one per city type. What is interesting to me is that the expense reports for expensive and moderately expensive cities have been pretty much flat from 2011 through 2019. But we can see inexpensive cities’ expense reports grow starting in 2017. This might just be a blip, but it is something that we will want to keep in mind as we work our way through other analytical techniques.

This wraps up the section on growth analysis. In the next section, we will continue our investigation with linear regression.