Problem set 1

Due by 2:20 PM on Friday, September 17, 2021

Welcome

Welcome to Problem Set 1! There are two “problem” exercises and one extended Stata exercise. You’ll need to submit three things on Blackboard: a problem set, log file, and a do-file.

Note that there are superscripted numbers¹ throughout the page that provide additional information/suggestions to help you out.

Exercises

X	-1	0	1	4
$P(X=x)$	0.25	0.30	0.40	0.05

Consider the above random variable, $X$, with its associated probability distribution:
1. Draw the probability distribution function and the cumulative distribution function.
2. What is the expected value of X? That is, what is $E[X]$?
3. What is the variance of X?
The following table gives the joint probability distribution between employment status and college graduation among those either employed or looking for work (unemployed) in the working-age U.S. population for 2012:

	Unemployed ($Y=0$)	Employed ($Y=1$)	Total
Non-college grads ($X = 0$)	0.053	0.586	0.639
College grads ($X = 1$)	0.015	0.346	0.361
Total	0.068	0.932	1.000

Compute $E[Y]$.
The unemployment rate is the fraction of the labor force that is unemployed. Show that the unemployment rate is given by $1 - E[Y]$.
Calculate $E[Y|X=1]$ and $E[Y|X=0]$.
Calculate the unemployment rate for college graduates and for non-college graduates
A randomly selected member of this population reports beign unemployed. What is the probability that this worker is a college graduate? A non-college graduate? f. Are educational achievement and employment status independent? Explain.

Compute the following probabilities²:

If $Y$ is distributed $N(1,4)$, find $Pr(Y \leq 3)$
If $Y$ is distributed $N(3,9)$, find $Pr(Y >0 )$
If $Y$ is distributed $N(50,25)$, find $Pr(40 \leq Y \leq 52)$
If $Y$ is distributed $N(5,2)$, find $Pr(6 \leq Y \leq 8)$

For a randomly selected county in the United States, let $X$ represent the proportion of adults over age 65 who are employed (the elderly employment rate). Then, $X$ is restricted to a value between zero and one. Suppose that the cumulative distribution function for $X$ is given by $F(x) = 3x^2 - 2x^3$ for $0 \leq x \leq 1$.
1. What is the probability that the elderly employment rate is at least 0.5 (50%)?
2. What is the probability that the elderly employment rate is between 0.4 (40%) and 0.6 (60%)?

4. In any year, the weather can inflict storm damage to a home. From year to year, the damage is random. Let $Y$ denote the dollar value of damage in any given year. Suppose that in 95% of the years, there is no damage ($Y=0$), but that in 5% of the years, $Y = 20000$.

What are the mean and standard deviation of the damage in any year?
Consider an “insurance pool” of 100 people whose homes are sufficiently dispersed so that, in any year, the damange to different homes can be viewed as inddependently distributed random variables. Let $\bar{Y}$ denote the average damage to these 100 homes in a year (i) What is $E[\bar{Y}]$? (i) What is the probability that $\bar{Y}$ exceeds $2000?

5. Grades on a standardized test are known to have a mean of 1000 for students in the United States. The test is administered to 453 randomly selected students in Florida; in this sample, the mean is 1013 and the standard deviation ($s$) is 108.

Construct a 95% confidence interval for the average test score for Florida students
Is there statistically significant evidence that Florida students perform differently than other students in the United States? How do you know?
Another 503 students are selected at random from Florida. They are given a 3-hour preparation course before the test is administered Their average test scores is 1019 with a standard deviation of 95.

First, construct a 95% confidence interval for the change in the average test score associated with the prep course.
Is there statistically significant evidence that the prep course helped?

For the following question, make sure you submit your do-file and log-file alongside your answers!

Download countymurders.dta to answer this question. The variable $murders$ is the number of murders reported in the county. The variable $execs$ is the number of executions that took place of people sentenced to death in the given country. Most states in the United States have the death penalty, but several do not.
1. Keep only data from the year 1996. How many counties are there in the data set? Of these, how many have zero murders. What percentage of countries have zero executions?
2. What is the largest number of murders in a county? What is the largest number of executions in a county?
3. Compute the correlation coefficient $r$ between murders and execs and describe what you find. ³ Estimate the correlation coefficient between murdrate and execrate. Why do the two coefficients differ so much?
4. What are two characteristics in the data that are highly correlated with county murder rates?⁴ What are their correlation coefficients?
5. What is median real per-capita income?⁵
6. Generate a variable, highinc that is equal to 1 if a county has above-median real per capita income, and 0 otherwise. What is $E[rpcpersinc | highinc =0]$? What is $E[rpcpersinc | highinc =1]$?
7. Consider a two-sided hypothesis test of whether murder rates are different between counties with high (above median) vs low (below median) real per-capita personal income. Assume the two samples are independent, with equal variances.
  1. First, write a null and alternative hypothesis
  2. Use Stata to conduct the hypothesis test. What is the relevant t-statistic?⁶
  3. Can you reject the null hypothesis at the 5% level?
8. Generate a variable, perc1029, that is equal to the share of the population between the ages of 10 and 29. What is the median share of the population by county that is ages 10-29?
9. Generate a variable, young, that is equal to 1 if a county has an above-median share of the population that is age 10-29, and 0 otherwise. What is $E[perc1029| young = 0]$? What is $E[perc1029| young =1]$?
10. Consider a two-sided hypothesis test of whether murder rates are different between states with a high (above-median) share of the population ages 10-29 versus a low share. Assume the two samples are independent, with equal variances.
  1. First, write a null and alternative hypothesis
  2. Use Stata to conduct the hypothesis test. What is the relevant t-statistic?
  3. Can you reject the null hypothesis at the 5% level?

Sources

countymurders.dta

Source: Compiled by J. Monroe Gamble for a Summer Research Opportunities Program (SROP) at Michigan State University, Summer 2014. Monroe obtained data from the U.S. Census Bureau, the FBI Uniform Crime Reports, and the Death Penalty Information Center.

See?! Neat! :) ↩︎
Remember that we conventionally $N(\mu,\sigma^2)$, so the second term is the variance, not the standard deviation↩︎
Remember, you can use correlate var1 var2 to look at the correlation between two variables.↩︎
If you want to look at the correlation between lots of variables, you can use correlate var1 var2 ... var99. If you want to refer to a lot of variables, an asterisk (*) can act as a “wild.” So if you use correlate var*, it you’ll receive a correlation matrix of every variable with a name that starts with “var.” If you use correlate *var*, it will give you a correlation matrix of every variable with the letters “var” somewhere in the name.↩︎
tabstat is your friend!↩︎
The help file for ttest will be useful. Here we are conducting a two-sample t-test using groups. You will want to use the highinc variable you generated earlier.↩︎

Last updated on September 8, 2021

Edit this page

	Unemployed (\(Y=0\))	Employed (\(Y=1\))	Total
Non-college grads (\(X = 0\))	0.053	0.586	0.639
College grads (\(X = 1\))	0.015	0.346	0.361
Total	0.068	0.932	1.000