bunchr
A growing literature in empirical economics is focused on bunching. Loosely speaking, bunching occurs when many people self select to a specific location in the range of some variable. When this happens, a histogram of this variable might show a visible “bunch” - a large mass that would not be otherwise predicted by the surrounding bins. This bunching behavior is then interpreted as the response to some economic reality. In words of someone smarter than me:
“This approach uses bunching around points that feature discontinuities in incentives to elicit behavioral responses and estimate structural parameters.” 1
With the increasing use of administrative data, bunching has been used mostly by public and labor economists. Two main types of bunch-inducing setting are explored. First, where budget sets suddenly change slopes, creating a point where budget sets are continuous but non-differentiable. These “kinks” in budget sets have been explored in the case of Earned Income Tax Credits in the US2 and income tax rates in Denmark3. The second type are notches, where a policy sets a discontinuity in the budget set. For example, bunching in earnings of individuals is observed in Pakistan, where the average tax rates, rather than the marginal ones, change when earnings cross thresholds4. This creates a drop in the budget line of individuals, as all earnings are subject to a higher tax rate once earnings exceed the threshold.
When a policy introduces a kink or a notch, a structural analysis can be used to estimate a parameter of interest, such as the elasticity of earnings with respect to marginal tax rates. To do this, a specific function must be assumed for the agents. In many cases, a convenient function for analysis is quasi-linear and iso-elastic.
bunchr
?NEW (Dec. 2016)! Explore bunching simulations in an
interactive interface, using the bunchApp. After loading
bunchr, run bunchApp()
and start playing.
The bunchr
package can be used to visualize bunching,
calculate counter-factual distributions, and estimate the compensated
elasticity of earnings w.r.t. marginal tax rates, as usually done in the
literature cited above. There is also a function to simulate earnings
given certain parameters. bunchr
’s language is of earnings
and elasticities of earnings, but can be used for other suitable
bunching analysis. Indeed, you can provide a title and x-axis label for
the output plots.
Remember that, as usual, more observations are better. Bunching analysis involves the creation of a histogram, similar to a density curve. The narrower the bins of this histogram, the better. Wider bins could bias elasticity estimates. For a kink analysis, the elasticity formula takes into account the extra bunching over the width of the excluded area. Setting bins too wide could attenuate the estimates. For notches, the procedure involves finding the “end” of the notch, where the discontinuity in incentives is no longer relevant. This limit, referred to as delta_zed in the code, is determined in number of bins after the bunching bin. Setting bins too wide limits the set of potential delta_zed, which might bias the estimates.
Having more observations, one can use narrower bins and still avoid a “jumpy” looking histogram. It is better to have a smoother looking histogram, as the counter-factual histogram for the area of kink or notch is made by interpolation from the area outside of the notch. Smoother histograms make more reliable counter-factuals. As a robustness check for any analysis, I recommend comparing the results of analysis with varying bin widths.
All you need is a vector of earnings. Using
bunch_viewer
, you can see how a histogram of your data
looks like. Additionally, you can see where you could potentially set
some of the parameters for analysis. And, you can save the histogram as
an R histogram object.
Let’s create an earning distribution for a kinked budget set. First,
I create a vector of latent abilities, with 100,000 values. Then, I
simulate the earnings for a situation where everyone’s elasticity of
earnings w.r.t. marginal tax rate is 0.2 (elas = 0.2). The
simulation includes a kink at the earning point of 1000 (zstar
= 1000), where the marginal tax rate increases from 0% to 10%
(t1 = 0, t2 = 0.1), the notch height is zero
(Tax = 0). After creating the earning vector, we can view it
with bunch_viewer
.
As a preparation for further analysis, I can specify where the
counter-factual distribution will be calculated (setting
cf_start and cf_end), and where the bunching seems to
be (exclude_before and exclude_after) relative to the
place of the notch. Note that these are stated in number of bins. The
wider the bins, the greater the distances from the kink! That’s where
bunch_viewer
comes handy.
In this case, one can observe the whole bunching bin and the rest of the distribution quite well. In other cases, specially notches (where bunching is more substantial), the height of the bunching bin can be so high that, in order to include it in the graph, the rest of the distribution can hardly be seen. In these cases, I would keep the default option of “trimming” the vertical axis by keeping the default options and not specifying trimy = F. Also note that you can personalize the plot’s title and x-axis label.
We viewed the distribution. Now we want to get an estimate for the elasticity of earnings. We can use the same specification as we did while viewing. Note that, in order to calculate the elasticity, we must provide the marginal tax rates!
kink_est <- bunch(earning_vec, zstar = 1000, t1 = 0, t2 = 0.1, Tax = 0,
cf_start = 10, cf_end = 10,
exclude_before = 2, exclude_after = 2, binw = 50, nboots = 100)
## [1] 0.186233
## 0% 5% 10% 50% 90% 95% 100%
## 0.05263268 0.09054744 0.12119915 0.18033041 0.22830304 0.24487537 0.28896880
## [1] 1170.216
## 0% 5% 10% 50% 90% 95% 100%
## 344.1881 588.5049 771.4352 1130.6119 1406.7392 1514.3070 1761.6433
Note: we modeled the earnings with elasticity of 0.2, but got an estimate of 0.186. That’s no so bad, considering that the width of the bins and the excluded area have an attenuating effect. We might not be very happy with the bootstrapped confidence interval we got. It’s pretty wide, both for the elasticity estimate and the amount of bunching. Since the number of observations allows it, and the simulation is pretty “clean”, we can try this with narrower bins (width 1 instead of 50), and excluded area (same number of bins, but bins are narrower!). This does shrink the confidence intervals, and gives a point estimate closer to the value we used to generate the data. Also, the plot title and x-axis label can be modified by the user. In this case, however, plotting is suppressed.
kink_est <- bunch(earning_vec, zstar = 1000, t1 = 0, t2 = 0.1, Tax = 0,
cf_start = 500, cf_end = 500,
exclude_before = 10, exclude_after = 10, binw = 1,
nboots = 100, seed = 123, draw = F)
kink_est$e
## [1] 0.1932915
## 0% 5% 10% 50% 90% 95% 100%
## 0.1773313 0.1832356 0.1853342 0.1932290 0.2027058 0.2037494 0.2094628
## [1] 1223.928
## 0% 5% 10% 50% 90% 95% 100%
## 1137.399 1163.280 1177.909 1225.562 1279.609 1289.926 1322.022
First, let’s create an earning distribution with a notch. This time,
I will use bunch_viewer
without trimming the vertical axis.
Now, the excluded area is where the notch seems to be taking place. I
want to make sure I have enough area inside the red lines (where
counter-factual is calculated) and outside the green lines (the excluded
area). Of course, in this particular case we could just set the red
lines at both ends of the histogram. In other cases, however, we might
want to limit the analysis area (e.g. there are other notches or
kinks).
It’s best to set the cf_end variable so it lays to the right of what seems to be the end of the effect of the notch, but not too much to the right. The function uses the area between the red and green lines to calculate an initial counter-factual distribution, so you want to make it easy by having quite a few bins there. I would refrain from including areas with idiosyncratic density patterns in the counter-factual area, as it might make the interpolation harder and less accurate.
Basically, what we need to calculate the elasticity is the size of the notch, or Δz*. While this is clear in this case, where elasticity is heterogeneous it might not be so clear cut. The process goes like this: a counter-factual histogram is created. Then, the extra bunching (sum differences between the actual and counter- factual histograms from the beginning of the excluded area to the notch point) is calculated. That bunching comes from the right side of the histogram. Running on the bins to the right of the notch bin, the differences between the counter-factual and real histogram are added to the initial bunching sum (they should be negative). When the sum hits zero, the process stops and the current bin is declared the end of the notch. Δz* is the distance from the center of that bin to the notch point.
Notch analysis takes a little longer, mainly because of an iterative process in searching for the end of the notch.
earning_vec <- sapply(ability_vec, earning_fun, 0.2, 0.1, 0.1, 200, 1000)
bunch_viewer(earning_vec, 1000, 20, 50, 2, 25, binw = 20)
notch_est <- bunch(earning_vec, zstar = 1000, t1 = 0.1, t2 = 0.1, Tax = 200,
cf_start = 20, cf_end = 50, force_after = FALSE,
exclude_before = 2, exclude_after = 25, binw = 20,
nboots = 100, seed = 123)
## [1] 0.1843316
## 0% 5% 10% 50% 90% 95% 100%
## 0.1203677 0.1407937 0.1407937 0.2074509 0.2074509 0.2074509 0.2074509
The example above is a good one: _delta_zed seems to be the actual end of the notch. Sometimes, the output plot will show Δz* not concurring with what visibly seems like the end of the notch. This usually means that the elasticity estimates will be wrong. In that case, it might be useful to change the exclude_after parameter and see if the plot come out better.
Another option is using the force_after option to set the end of the notch manually. However, this means that not all bunching comes from inside the notch. When using this option, make sure to understand exactly what is the parameter you are estimating.
Another issue is the counter-factual distribution. For larger notches, the interpolation is harder. Remember the interpolation is based on the histogram bins outside the excluded area (green lines) but inside the analysis area (red). If the counter-factual distribution in the output plot does not seem convincing, a few measures can be taken:
As for the distribution of bootstrapped estimates, note that it might be less “smooth” than in the kink analysis. Recall that the major change in the different runs is the estimated number of bins forming delta_zed. As these are integers, the distribution of estimates will not be very smooth. With narrower bins, there should be more potential values for the elasticity estimates.
How meaningful is a having a large sample size? This demonstration shows some of the effects of sample size on kink estimations. First, I’ll use a small earning vector: 10,000 observations in total. The confidence interval is quite large.
ability_vec_small <- 4000 * rbeta(10000, 2, 5)
earnings_small <- sapply(ability_vec_small, earning_fun, 0.2, 0, 0.1, 0, 1000)
kink_est <- bunch(earnings_small, zstar = 1000, t1 = 0, t2 = 0.1, Tax = 0,
cf_start = 10, cf_end = 10,
exclude_before = 1, exclude_after = 1, binw = 50,
nboots = 100, seed = 123)
## [1] 0.2126395
## 0% 5% 10% 50% 90% 95% 100%
## 0.02482187 0.06540559 0.08508116 0.20008149 0.32884933 0.35271794 0.41942236
Now, let’s look at a much larger vector, of one million observations. Naturally, the confidence interval generated is much smaller. Note that, with a larger population, we could reliably narrow the bins and get a more accurate estimate.
ability_vec_large <- 4000 * rbeta(1000000, 2, 5)
earnings_large <- sapply(ability_vec_large, earning_fun, 0.2, 0, 0.1, 0, 1000)
kink_est <- bunch(earnings_large, zstar = 1000, t1 = 0, t2 = 0.1, Tax = 0,
cf_start = 50, cf_end = 50,
exclude_before = 5, exclude_after = 5, binw = 10,
nboots = 100, seed = 123)
## [1] 0.187467
## 0% 5% 10% 50% 90% 95% 100%
## 0.1700840 0.1773875 0.1802491 0.1877173 0.1939444 0.1963106 0.2034796
In fact, with a larger population, we could reliably narrow the bins and get a more accurate estimate. Using only 10 bins before and after the bunching bin:
kink_est <- bunch(earnings_large, zstar = 1000, t1 = 0, t2 = 0.1, Tax = 0,
cf_start = 50, cf_end = 50,
exclude_before = 1, exclude_after = 1, binw = 5,
nboots = 100, seed = 123, draw = F)
kink_est$e
## [1] 0.2002643
## 0% 5% 10% 50% 90% 95% 100%
## 0.1931161 0.1963163 0.1968887 0.1997593 0.2032068 0.2041389 0.2070643
Generally speaking, the estimation strategy in these procedures are likely to attenuate, rather than exacerbate, the estimates for elasticity. However, it is worthwhile to see what happens if, by mistake, we specify the wrong earnings or elasticity.
We can simulate an earning vector with no bunching and see the
estimates, and run bunch
on it.
earning_vec <- sapply(ability_vec, earning_fun, 0.2, 0.1, 0.1, 0, 1000)
bunch_viewer(earning_vec, 1000, 10, 10, 1, 1, binw = 50, trimy = F)
kink_est <- bunch(earning_vec, zstar = 1000, t1 = 0.1, t2 = 0.1, Tax = 0,
cf_start = 50, cf_end = 50,
exclude_before = 1, exclude_after = 1, binw = 10,
nboots = 100, seed = 123, draw = F)
## Error in kink_estimator(earnings, zstar, t1, t2, cf_start, cf_end, exclude_before, : No change in marginal tax rate - can't calculate elasticity!
Woops! Can’t do this. The formula for elasticity includes the logarithm of the proportion of net-of-tax rates before and after the kink. Setting them equal generates a NaN, as the zeroed logarithm term is in the denominator.
But let’s see what bunch
does when we try to estimate a
marginal tax rate change, when in fact there was no change. Presumably,
as there is no visible bunching, the estimates would be zero.
Indeed:
kink_est <- bunch(earning_vec, zstar = 1000, t1 = 0, t2 = 0.1, Tax = 0,
cf_start = 50, cf_end = 50,
exclude_before = 1, exclude_after = 1, binw = 10,
nboots = 100, seed = 123, draw = F)
kink_est$e
## [1] 0
## 0% 5% 10% 50% 90% 95%
## 0.000000000 0.000000000 0.000000000 0.000000000 0.003295070 0.006054123
## 100%
## 0.012957148
What would happen if bunching does occur, but we did not specify the
tax rates correctly when running bunch
? Let’s create an
earnings vector with a tax rate change from 0% to 30%, but run
bunch
for a smaller tax change, 0% to 20%. As the observed
bunching is higher than what would be expected for a change from 0% to
20% with elasticity of 0.2, bunch
estimates a higher
elasticity of earnings. This kind of user-induced error cannot be
detected, so make sure to input the parameters correctly.
earning_vec <- sapply(ability_vec, earning_fun, elas = 0.2, t1 = 0, t2 = 0.3,
Tax = 0, zstar = 1000)
kink_est <- bunch(earning_vec, zstar = 1000, t1 = 0, t2 = 0.2, Tax = 0,
cf_start = 50, cf_end = 50,
exclude_before = 1, exclude_after = 1, binw = 10,
nboots = 100, seed = 123, draw = F)
kink_est$e
## [1] 0.2947799
## 0% 5% 10% 50% 90% 95% 100%
## 0.2786584 0.2873745 0.2887431 0.2958581 0.3022360 0.3048818 0.3098970
To my knowledge, no other R package for bunching analysis is
publically available at the time of release of this current version.
Chetty et al (2011) did distribute their Stata code,
which was used for kink analysis, and I will use it to validate
bunchr for kinks. Unfortunately, I cannot use it with the
original administrative data used by the researchers. However, I can run
a simulation and test for similarity in results. The Stata
program was written by Tore Olsen, and can be found in Raj Chetty’s website.
To replicate these results, you will need a copy of Stata of
course. Download and install the bunch_count
function for
Stata, and define the working directories properly.
First, I create a dataset with bunchr
. The
Stata function requires collapsed, binned data. I use
bunch_viewer
to create and export that. Note that
bunch_viewer
will create the histogram by placing the kink
point in the middle of a bin.
set.seed(1982)
ability_vec <- 4000 * rbeta(100000, 2, 5)
earning_vec <- sapply(ability_vec, earning_fun, 0.3, 0, 0.2, 0, 1000)
data <- bunch_viewer(earning_vec, zstar = 1000, binw = 50, report = T)
Now, save the simulated earning histogram in comma delimited file, which is easy to read in Stata:
Now, run this code in Stata:
insheet using "sim_data.csv", clear
bunch_count earnings counts, bpoint(1000) ig_low(-10) ig_high(10) low_bunch(-1) high_bunch(1) plot(1) binwidth(50)
outsheet using "compare_results.csv", replace delim(",")
This will generate a plot, output some estimates, and change the dataset so the estimated counter-factuals are included as the plotabc3 variable. Then, the data will be exported as a .csv file which we will import back to R.
Now, we can run bunch
on the data. I use the same
specification regarding the bin width, counter-factual area, excluded
area, and correcting for shift on the right side of the notch. The main
difference is that, since bunch
calculates the estimated
elasticity, it needs the marginal tax rates as inputs (this will not
matter for the counter-factual estimation). Also note that I shut down
the default option for model selection for the counter-factual (mostly
useful for notches, and non-existing in bunch_count
), and
use the default 7th degree
polynomial.
estim <- bunch(earning_vec, zstar = 1000, t1 = 0, t2 = 0.2, Tax = 0,
cf_start = 10, cf_end = 10, exclude_before = 1, exclude_after = 1,
binw = 50, max_iter = 200, correct = T, select = F, poly_size = 7,
draw = F)
# creating comparison data-frame
bunchr_res <- estim$data
comp_data <- cbind(bunchr_res, chetty_res)
comp_data$cf_diff <- comp_data$cf_counts - comp_data$plotabc3
comp_data$per_diff <- 100 * comp_data$cf_diff / comp_data$plotabc3
show_data <- comp_data[,c(1,2,3,9,10,11)]
colnames(show_data) <- c("earnings", "counts",
"bunchr_cf_counts", "bunchcount_cf_counts", "diff",
"percent_diff")
# write.csv(show_data, file="show_data.csv", row.names=F)
# plot the results
plot(show_data$earnings, show_data$counts, type="h",
main="comparing bunching counter-factuals",
xlab="earnings", ylab="counts",
xlim=c(200,2000))
points(show_data$earnings, show_data$bunchr_cf_counts, col="red", pch="*")
points(show_data$earnings, show_data$bunchcount_cf_counts, col="blue",pch="*")
legend("topleft", lty=c(1, NA, NA), pch=c(NA, "*","*" ), col=c("black","red","blue"),
legend=c("real counts","bunchr cf", "bunch_count cf"))
sim_data.csv, compare_results.csv and show_data.csv are available in the extdata folder of the installation.
Kleven, H J (2016). “Bunching”, Annual Review of Economics, 8(1).↩︎
Saez, E. (2010). “Do taxpayers bunch at kink points?”↩︎
Chetty, R., Friedman, J., Olsen, T., Pistaferri, L. (2011). “Adjustment Costs, Firm Responses, and Micro vs. Macro Labor Supply Elasticities: Evidence from Danish Tax Records”, Quarterly Journal of Economics, 126(2).↩︎
Kleven, H.J., Waseem, M., “Using notches to uncover optimization frictions and structural elasticities: Theory and evidence from Pakistan”, The Quarterly Journal of Economics, 128(2)↩︎