SOLUTION: Air Force Institute of Technology Regression Project Models and Codes in Python Worksheet

Winter 2022
Final Project
DASC 512
1 Data
The data in this problem were collected by two economists to be used in constructing a regression equation
to serve as a price index for owner-occupied housing in a region containing a large U.S. city. Data were
obtained for each of 506 census tracts in and around the city. (The U.S. Census Bureau has partitioned the
entire country into geographical regions called census tracts that contain approximately the same number of
people.)
Some variables were reported on a census tract basis while others were reported on a community basis. For
example, the property tax rate is determined by each community. If a community consists of more than one
census tract, the property tax rate will be the same for each census tract in that community.
Census tracts 357 – 488 (inclusive) are all part of the city proper. The remaining census tracts are in towns
or suburbs in the surrounding metropolitan area, but they are not in the city. The census tracts in the city
have the same values for the property tax, pupil-teacher ratio, zoning, and highway access variables.
The data for 506 census tracts are in the associated data file student_data.csv, although you should note
that the last 50 data points are missing Y values. These are the test points. There is one line of data for
each census tract. Values for the variables appear in the order they are listed below. Use these variable
names in formulas and tables presented in your report.
With the exception of Census Tract, which is a three-digit identification, the variables are described in Table
1.
Variable  Description
Y         Median value of owner-occupied homes in the census tract
X1        Per capita crime rate in the community
X2        Percentage of a community’s residential land zoned for lots greater than 25,000 square feet
X3        Percentage of acres in the community zoned for non-retail business
X4        Dummy variable equaling 1 if the tract borders a specific river and 0 otherwise
X5        Average concentration (parts per 100 million) of nitrogen oxides in the air (a measure of air pollution)
X6        Average number of rooms per owner-occupied home
X7        Percentage of owner-occupied homes that are more than thirty years old
X8        Natural logarithm of the weighted distances to five major employment centers in the metropolitan area. Larger values indicate the tract is farther from the major employment centers.
X9        Natural logarithm of an index of accessibility to radial highways. Calculated on a community basis. Larger values represent better highway access.
X10       Property tax rate in dollars per $10,000 of property value. This measures costs paid by homeowners to maintain schools and public services in each community.
X11       Pupil-teacher ratio in each school district. Lower values may indicate higher quality public schools.
X12       Percentage of adults without a high school diploma or classified as laborers
Table 1: Variables used in student_data.csv
2 Task
You have been commissioned by a real estate investor to develop a regression model for this data and generate
a formal report of the results. This investor conducts some political lobbying to reinforce property values
where she has a stake and has a particular interest in the effect of air pollution on median home values.
Your task is to analyze the data for the 456 census tracts for which you have complete data and construct
one or more good regression models for predicting Y, the median value of owner-occupied homes. Include
additional explanatory variables constructed from functions of the variables on the data file if you think that
they are worthwhile. No raw Python output should be present in the report. Summarize your analysis in a
report in PDF format that includes the following discussions.
1. A 1–2 paragraph “Executive Summary” of your major conclusions about the relationships between
median housing prices and the explanatory variables. Whether it is included in your model or not, you
should address the nitrogen oxide variable (X5). This should not contain any formulas or mathematical
symbols. It should be written so that it could be easily understood by a real estate investor with no
formal training in statistics.
2. A description of the steps taken to identify your best model(s). Do not submit any raw Python output
in this section. Graphical analysis and summary statistics are encouraged. Simply outline the issues
you considered, your decisions, and the sequence of steps you took to develop a model. Be detailed —
tell me what you did, why you did it, and if it worked.
• For the purposes of this course, consider only models with main effects, quadratic effects (X^2),
and the following interactions: X1*X5, X4*X5, X5*X6, and X5*X8. This should be reasonable
while still forcing you to explore the model building process.
• Because of this model limitation, it is possible that there may be some higher order effects and/or
non-linearity in the data that you cannot model. Remember this when looking at residual plots.
Point out if you think there may be deficiencies caused in this way. Y can still be transformed as
you see fit if you find it necessary/useful.
3. A formula for your best model(s), standard errors for coefficients, and the R2 value. Summarize Python
results in tables of your own creation — do not report any values that you do not intend to discuss.
Discuss and interpret any important features of your model. Pay some attention to the nitrogen oxide
(air pollution) variable as a predictor of median housing values, although you may conclude that it is
not important.
4. Convincing evidence that the model you selected is a good model for using some or all of the 12
explanatory variables to predict median housing values. Discussion of residual plots and other diagnostic checks would be appropriate. Statistical tests should be formulated correctly with appropriate
hypotheses and conclusions. Graphs and tables are encouraged, but raw Python output should not be
submitted and will be ignored.
5. (Optional) One paragraph outlining additional analyses that you would have done if you had more
time or were not artificially restricted in your model parameters. You will earn points for suggestions
with high potential value, but you will also lose points for suggestions with little potential value.
Separate from the PDF report, submit a CSV file with your predictions for the missing Y data points (the
last 50 observations). Use your final, best model to predict Y and create a 95% prediction interval for each
point. Points for the “Predictive Ability” section will be based on the following
1. Mean Square Prediction Error (MSPE): lower is better
$\mathrm{MSPE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
2. Coverage of your confidence intervals — 2–3 intervals are expected to miss due to random error
3. Width of your confidence intervals — narrow is better as long as coverage is acceptable
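The three scoring criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the four data points are hypothetical (they do not come from student_data.csv), and the grader's actual scoring code is not shown in this document.

```python
from statistics import fmean


def mspe(y_true, y_pred):
    """Mean square prediction error: the average squared prediction error."""
    return fmean((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred))


def coverage(y_true, lower, upper):
    """Fraction of true values that fall inside their intervals."""
    hits = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(hits) / len(hits)


def mean_width(lower, upper):
    """Average interval width (narrower is better if coverage holds)."""
    return fmean(hi - lo for lo, hi in zip(lower, upper))


# Hypothetical values for illustration only.
y_true = [20.0, 25.0, 30.0, 35.0]
y_pred = [21.0, 24.0, 33.0, 35.0]
lower = [18.0, 21.0, 31.0, 32.0]
upper = [24.0, 27.0, 35.0, 38.0]

print(mspe(y_true, y_pred), coverage(y_true, lower, upper), mean_width(lower, upper))
```

With 50 test points and 95% intervals, a coverage near 0.95 corresponds to the 2–3 expected misses.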
3 Deliverables
1. last_first.pdf — PDF document with your write-up. Use MS Word’s “print to PDF” feature as
necessary. You should have no raw Python output in this document. This should be a well-structured
report with narrative, graphics, and tables as needed.
2. last_first.csv — CSV file with your predictions for the last 50 observations. This should have the
following columns:
(a) Census Tract
(b) Prediction
(c) Lower Prediction CI
(d) Upper Prediction CI
3. last_first.ipynb or last_first.py — Python file with your complete analysis, including plot generation, statistical tests, and predictions. Only include relevant code and comments and/or narrative
blocks to explain the code, just as it would be delivered to a client. I should be able to run it top-to-bottom without errors. It should not be the digital equivalent of scratch paper.
4 Grading
This is an individual assignment, and you are expected to do your own work. Do not discuss this project with
anyone other than the course instructor. The primary task of this assignment is to write a report detailing
how and why you came to a regression model relating median home price to predictor variables. Write a
coherent and concise report that flows well and clearly describes your analysis and conclusions. There is no
absolutely best answer, and I expect to receive many different models.
This final will be worth 80 points. You will be graded on:
1. Writing (10 points): Emphasis on precision, clarity, and efficiency. You should use paragraphs, transitions, sections and incorporate any figures and tables into the flow of the document.
2. Executive Summary (10 points): Clear and concise use of language to convey your model in a limited
space.
3. Model Building Process, Logic, and Conclusions (40 points): Appropriate use of tools from this course
applied correctly and communicated effectively.
4. Predictive Ability (20 points): You will provide point estimates and confidence intervals for the withheld
test points. Coverage of true values in those intervals and Mean Square Prediction Error (MSPE) will
determine this score.
Dr. Chris Weimer
Homework 1 Solutions
DASC 512
Problem 1
Opinion Polls — 1 point
Pollsters regularly conduct opinion polls to determine the approval and disapproval rating of the
current President of the United States. Suppose a poll is to be conducted in which 2,000 individuals
will be asked whether they approve or disapprove of the President’s performance. The individuals will
be selected by random-digit telephone dialing and asked the question over the phone by a live pollster.
Phone numbers will be dialed until 2,000 responses are gathered.
(a) What is the population of interest?
Typically these surveys are intended to reach all adult U.S. residents. In the run-up to elections, polls
are sometimes adjusted to target “likely voters.”
(b) What is the variable of interest? Is it quantitative or qualitative?
The variable of interest is approval or disapproval of Presidential performance. It is qualitative.
(c) What is the sample?
The sample is 2,000 individuals who responded to the poll.
(d) What does the pollster wish to infer from the survey?
The pollster wants to estimate the proportion of U.S. residents who approve (approval rating) or disapprove (disapproval rating) of the President’s performance. Sometimes they also report the difference between
the two (net approval rating).
(e) Do you think that the sample will be representative of the population? Why or why not?
No, it is extremely unlikely that this will be a representative sample. The response rate for telephone
surveys has been dropping for years, and Pew Research Center reported that in 2018 it was down to 6% (see
here). Couple that with selection biases such as phone ownership, phone sharing, and availability to answer
the phone due to work, and you can get a very unrepresentative sample. Pollsters use a variety of models to
try to un-bias their results to cope with this.
Problem 2
New Cancer Screening Method — 1 point
Low-dose computed tomography (LDCT) is being investigated as a screening technique for lung
cancer early detection and screening. Results published in 2019 followed a sample of 4,052 long-term
smokers, aged 50-69 years of age from near Heidelberg, Germany as part of the German Lung Cancer
Screening Intervention trial. Each participating smoker is randomly assigned to either receive four
annual LDCT screens (2,029 participants) or usual care without screening (2,023 participants). Upon
first-time detection of a nodule, the largest nodule diameter was measured as a variable of interest.
Ignore the rest of the study for this question.
(a) What type of study was used by the researchers? (Hint: see section 5.4.3 in the book)
This is a designed experiment. Specifically, it is a randomized controlled trial.
(b) What is the population of interest? What is the sample?
The population is long-term smokers, aged 50-69 years of age, in Germany, although results are probably
intended to be generalized beyond that group. The sample is the 4,052 participants in the study from the
area of Heidelberg.
(c) What is the experimental unit of the study?
The experimental unit is a smoker. There are 4,052 experimental units in the sample.
(d) Is the variable of interest quantitative or qualitative?
The variable of interest is quantitative.
(e) What inference do you think will be ultimately drawn from the clinical trial?
By comparing the size at which nodules were detected, the nodule size will be used to represent how
early the cancer was identified. If LDCT identifies smaller nodules, it will be considered to be more sensitive
than standard care.
Problem 3
The Meaning of Life — 1 point
In 2021, Pew Research Center conducted a survey of 2,596 adults in the United States (report here).
They asked “What aspects of your life do you currently find meaningful, fulfilling, or satisfying?” They
did not report raw data, but below is a generated dataset based on reported results.
Topic                 Frequency
Family                716
Friends               300
Material Well-being   265
Career                262
Challenges            285
Spirituality          250
Society               210
Health                172
Hobbies               136
(a) Compute the Relative Frequencies for each response category.
Topic                 Frequency   Relative Frequency
Family                716         0.2758
Friends               300         0.1156
Material Well-being   265         0.1021
Career                262         0.1009
Challenges            285         0.1098
Spirituality          250         0.0963
Society               210         0.0809
Health                172         0.0663
Hobbies               136         0.0524
(b) Construct a bar graph of the Relative Frequencies.
(c) Interpret the data in a paragraph (2 or more sentences).
The most commonly cited aspect of life that Americans find meaningful, fulfilling, or satisfying was
family, with more than one in four Americans providing that response. About one in eight responded that
either friends or challenges was meaningful, while one in ten cited material well-being, career, or spirituality.
The least common response at about 5% was hobbies.
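The relative frequencies in part (a) and a rough version of the bar graph in part (b) can be reproduced with a short sketch. It uses only the frequencies given in the problem; the text-based bars are an assumption made here to keep the example dependency-free and are not the plotting approach used in the original solution.

```python
counts = {
    "Family": 716, "Friends": 300, "Material Well-being": 265,
    "Career": 262, "Challenges": 285, "Spirituality": 250,
    "Society": 210, "Health": 172, "Hobbies": 136,
}

total = sum(counts.values())  # total number of responses in the generated dataset
rel_freq = {topic: n / total for topic, n in counts.items()}

for topic, share in rel_freq.items():
    bar = "#" * round(share * 100)  # one '#' per percentage point
    print(f"{topic:<20}{share:7.4f}  {bar}")
```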
Problem 4
Board Game Weights — 1 point
The file bgg.csv on Canvas contains a database of every board game on the popular site “Board
Game Geek.” These data can be used to answer the following questions.
(a) The column averageweight gives the average user assessment of the weight (i.e., complexity) of
each game. Games with a value of 0 have not been rated and should not be included in any of the
following analysis.
Create a table of summary statistics for the average weight of games. This table should include
the Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum, Sample Variance, and Sample
Standard Deviation (note the use of the word ‘Sample’ even though this is arguably a census).
The describe function gives us all of the information except the Variance. I recommend that when
you use a function like describe, you confirm that it uses the correct standard deviation form by trying
np.std(data, ddof=1) and confirming it is the same value.
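As a minimal sketch of the same summary table, the standard library's statistics module can stand in for describe. Here statistics.variance and statistics.stdev divide by n - 1, which is the same sample form that np.std(data, ddof=1) returns. The weights list is hypothetical (it is not taken from bgg.csv), and the function name summarize is an assumption of this example.

```python
import statistics


def summarize(values):
    """Summary statistics using the sample (n - 1) variance."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between the data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1],
        "mean": statistics.fmean(data),
        "variance": statistics.variance(data),  # divides by n - 1
        "std": statistics.stdev(data),          # square root of the sample variance
    }


# Hypothetical weights; 0 marks an unrated game and is dropped first.
weights = [0, 1.0, 1.5, 2.0, 2.5, 5.0]
rated = [w for w in weights if w > 0]
summary = summarize(rated)
print(summary)
```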
Minimum  1st Quartile  Median  3rd Quartile  Maximum  Mean  Variance  Standard Deviation
1.00     1.34          2.00    2.57          5.00     2.04  0.65      0.81
(b) Create Box Plots for the average weight of games by whether it is ranked as a Family Game or not.
If Family Game Rank is blank (coded as NaN), then it is not ranked as a Family Game. I recommend
adding a new column using the function np.isnan.
Are family games more or less complex? What can you say about the relative complexity of family
games and non-family games?
The median family game is lighter than the median non-family game. In fact, the 3rd Quartile weight
of family games is equal to the median weight of non-family games. Furthermore, non-family games’ weight
varies across the entire spectrum of possible scores (1–5) while family games have weights no higher than
approximately 3.
(c) Create a scatterplot of weight to average rating (average). What can you say about this relationship?
BoardGameGeek users tend to rate heavier games more highly than lighter games.
Problem 5
Baseball Hall-of-Famers — 2 points
Baseball Hall-of-Famers (HoF) played during different eras of baseball. One common classification
of eras is ‘19th Century’ (up to the 1900 season), ‘Dead Ball’ (1901–1919), ‘Lively Ball’ (1920–1941),
‘Integration’ (1942–1960), ‘Expansion’ (1961–1976), ‘Free Agency’ (1977–1993), and ‘Long Ball’ (after
1993). For this exercise, define the era of a player based on the mid-point of their career (rounding up
if necessary).
Using the file hofbatting.csv, containing non-pitching HoFs as of 2013, classify each player according to their era to answer the following questions.
(a) Create a Bar Graph and a Pie Chart for the number of HoFs from each era as of 2013. Interpret
the data.
You will need to take multiple steps to solve this problem.
• Import the CSV.
• Create a column of data to define the mid-career.
• Find a way to count the number of HoFs in each era by the mid-career column.
• Create the graphs.
• Provide a written interpretation of the data.
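The classification and counting steps above can be sketched as follows. The career years are hypothetical stand-ins (the column names and contents of hofbatting.csv are not shown in this document), and the helper names mid_career and era are assumptions of this example.

```python
import math
from bisect import bisect_left
from collections import Counter

# Last season of each era except the open-ended final one.
ERA_ENDS = [1900, 1919, 1941, 1960, 1976, 1993]
ERA_NAMES = ["19th Century", "Dead Ball", "Lively Ball", "Integration",
             "Expansion", "Free Agency", "Long Ball"]


def mid_career(first_year, last_year):
    """Mid-point of a career, rounding up when it falls on a half year."""
    return math.ceil((first_year + last_year) / 2)


def era(year):
    """Map a season to its era; boundary years belong to the earlier era."""
    return ERA_NAMES[bisect_left(ERA_ENDS, year)]


# Hypothetical (first season, last season) pairs for illustration only.
careers = [(1914, 1935), (1939, 1960), (1954, 1976), (1897, 1917)]
counts = Counter(era(mid_career(start, end)) for start, end in careers)
print(counts)
```

The resulting counts are what the bar graph and pie chart would display, one bar or slice per era.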
The Lively Ball era produced more HoFs than any other era. Long Ball has so far produced the fewest
HoFs (as of 2013), while every other era has roughly equal numbers of HoFs. This can be interpreted in
many ways. Long Ball era players are still young and may not have retired and been inducted into the Hall
of Fame. The Lively Ball era meanwhile corresponds with the period between World Wars in which baseball
was immensely popular.
(b) Create a histogram showing the distribution of non-pitching HoFs’ Mid-Career year.
This shows a similar trend to the graphs by era. In the period between World Wars, there were more
HoFs active. The wars clearly interrupted baseball and removed many eligible athletic men from the pool
of possible players. There are likely many other explanations.
(c) There are two major dimensions to hitting: the ability to get on base (measured by the on-base
percentage OBP) and the ability to advance runners already on base (measured by the slugging percentage
SLG). Create a scatterplot of OBP vs. SLG. Are there any outliers? If so, identify them by name. Is
there a relationship between OBP and SLG?
There is one clear outlier: Willard Brown. If you Google him, you’ll find that he was inducted in the
Hall of Fame largely based on his performance in the Negro Leagues, which is not reflected in his official
Hall of Fame statistics. There is a clear positive relationship between OBP and SLG — HoFs with higher
OBP tend to have higher SLG as well.
(d) Consider a combined metric for hitting, the On-base Plus Slugging (OPS) statistic, which is the
sum of OBP and SLG. Normalize this data (i.e., calculate the z-scores), then create a scatterplot with
OPS on the y-axis and Mid-Career Year on the x-axis. Identify any outliers by name. Do you notice
any patterns in the scatterplot of the data? What can you say (if anything) about the cause of any
pattern?
There is a slight bump in standardized OPS during the Lively Ball era (it’s small, and it’s ok if you
didn’t see it). It could be explained by the popularity of the sport driving increased performance, or maybe
just the competition of playing against Babe Ruth increases performance across the board.
As defined by a standardized value of magnitude greater than 3, there are 3 outliers: Willard Brown,
Babe Ruth, and Ted Williams. Willard Brown has already been discussed, and Babe Ruth and Ted Williams
are some of the most famous ballplayers in history.
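The outlier rule used here (a standardized value of magnitude greater than 3) can be sketched as below. The player labels and OPS values are hypothetical, and the use of the sample standard deviation is an assumption; the original script may have standardized differently.

```python
import statistics


def z_scores(values):
    """Standardize values using the mean and sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def outliers(names, values, threshold=3.0):
    """Names whose standardized value exceeds the threshold in magnitude."""
    return [n for n, z in zip(names, z_scores(values)) if abs(z) > threshold]


# Hypothetical OPS values: nineteen typical hitters and one extreme one.
names = [f"player_{i}" for i in range(20)]
ops = [0.80] * 19 + [1.20]
print(outliers(names, ops))
```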
(e) Create a Box Plot for the Home-Run Rate (HRR), defined as home-runs per at-bat (HR/AB), of
HoFs during each era (i.e., you should have 7 box-plots). Also calculate descriptive statistics of HRR
including Min, Q1, Median, Q3, Max, Mean, Range, and Sample Standard Deviation for each era.
Provide a table of these values from the Expansion era only (to limit time spent copying and pasting
from Python).
Era        Min     Q1      Med     Q3      Max     Mean    Range   Std Dev
Expansion  0.0081  0.0254  0.0420  0.0586  0.0703  0.0410  0.0622  0.0189
Problem 6
Intrusion Detection System — 1 point
A computer intrusion detection system (IDS) is designed to provide an alarm whenever an intrusion
(e.g., unauthorized access) into a computer system is being attempted. A probabilistic evaluation of
a system with two independent operating intrusion detection systems (a double IDS) was published in
the Journal of Research of the National Institute of Standards and Technology (Nov/Dec 2003).
Consider a double IDS with System A and System B. If there is an intruder, System A sounds an
alarm with probability 0.9, and System B sounds an alarm with probability 0.95. If there is no intruder,
System A sounds an alarm with probability 0.2, and System B sounds an alarm with probability 0.1.
Assume that Systems A and B operate independently.
(a) Formally express the four probabilities given in the example including defining events.
A: The event that System A sounds an alarm
B: The event that System B sounds an alarm
I: The event that there is an intruder
$P(A \mid I) = 0.9$
$P(B \mid I) = 0.95$
$P(A \mid I^c) = 0.2$
$P(B \mid I^c) = 0.1$
(b) If there is an intruder, what is the probability that both systems sound an alarm?
Because A and B operate independently,
$P(A \cap B \mid I) = P(A \mid I) P(B \mid I) = 0.9 \times 0.95 = 0.855$
(c) If there is no intruder, what is the probability that both systems sound an alarm?
Again because A and B operate independently,
$P(A \cap B \mid I^c) = P(A \mid I^c) P(B \mid I^c) = 0.2 \times 0.1 = 0.02$
(d) Given an intruder, what is the probability that at least one of the systems sound an alarm?
Using the Additive Rule,
$P(A \cup B \mid I) = P(A \mid I) + P(B \mid I) - P(A \cap B \mid I) = 0.9 + 0.95 - 0.855 = 0.995$
(e) Assume that the probability of an intruder is 0.4. Also continue to assume that both systems
operate independently. If both systems sound an alarm, what is the probability that an intruder is
detected?
We now have another probability, $P(I) = 0.4$.
$P(I \mid A \cap B) = \frac{P(I \cap A \cap B)}{P(A \cap B)}$   (Bayes’s Rule)
$= \frac{P(A \cap B \mid I) P(I)}{P(A \cap B \cap I) + P(A \cap B \cap I^c)}$   (Bayes’s Rule again)
$= \frac{P(A \mid I) P(B \mid I) P(I)}{P(A \mid I) P(B \mid I) P(I) + P(A \mid I^c) P(B \mid I^c) P(I^c)}$   (Independence of Events)
$= \frac{(0.9)(0.95)(0.4)}{(0.9)(0.95)(0.4) + (0.2)(0.1)(1 - 0.4)}$
$P(I \mid A \cap B) = 0.966$
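The arithmetic in parts (b) through (e) can be checked with a short sketch. The variable names are assumptions of this example; the probabilities are the ones given in the problem.

```python
p_a_i, p_b_i = 0.9, 0.95     # P(A | I), P(B | I)
p_a_ic, p_b_ic = 0.2, 0.1    # P(A | I^c), P(B | I^c)
p_i = 0.4                    # P(I)

both_given_i = p_a_i * p_b_i                    # (b) independence given an intruder
both_given_ic = p_a_ic * p_b_ic                 # (c) independence given no intruder
either_given_i = p_a_i + p_b_i - both_given_i   # (d) additive rule

# (e) total probability for the denominator, then Bayes's Rule.
p_both = both_given_i * p_i + both_given_ic * (1 - p_i)
p_i_given_both = both_given_i * p_i / p_both

print(both_given_i, both_given_ic, either_given_i, round(p_i_given_both, 3))
```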
Problem 7
Lie Detector Test — 1 point
A new type of lie detector, called the Computerized Voice Stress Analyzer (CVSA) has been developed. The manufacturer claims that the CVSA is 98% accurate, and unlike a polygraph machine, will
not be thrown off by drugs and medical factors. However, laboratory studies by the DoD found that
the CVSA had an accuracy rate of 49.8% — slightly less than pure chance. Suppose the CVSA is used
to test the veracity of four suspects. Assume the suspects’ responses are independent.
(a) If the manufacturer’s claim is true, what is the probability that the CVSA will correctly determine
the veracity of all four suspects?
Let $A_i$ be the event that the CVSA correctly determines the veracity of suspect $i$. Each suspect’s response
is independent, so
$P(A_1 \cap A_2 \cap A_3 \cap A_4) = \prod_{i=1}^{4} P(A_i) = (0.98)^4 = 0.9224$
Note that the pi operator is to the product what the sigma operator is to the sum.
(b) If the manufacturer’s claim is true, what is the probability that the CVSA will yield an incorrect
result for at least one of the four suspects?
This is the complement of the answer to part (a).
$P((A_1 \cap A_2 \cap A_3 \cap A_4)^c) = 1 - P(A_1 \cap A_2 \cap A_3 \cap A_4) = 1 - 0.9224 = 0.0776$
(c) Suppose that in a laboratory experiment conducted by the DoD on four suspects, the CVSA yielded
incorrect results for two of the suspects. Use this result to make an inference about the true accuracy
rate of the new lie detector.

This result is equivalent to flipping a coin. Of all possible outcomes there are $\binom{4}{2}$ ways to choose
two suspects for whom results were inaccurate. Each of these outcomes has $A_i^c$ occurring twice and $A_i$
occurring twice, so each instance has probability $(0.02)^2 \times (0.98)^2$. The probability of this happening if the
manufacturer’s claim were true is then
$\binom{4}{2} \times (0.02)^2 \times (0.98)^2 = 6 \times 0.000384 = 0.002305$.
It is extremely unlikely, about a 1 in 400 chance, that the DoD outcome would occur if the manufacturer’s claims were accurate.
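The same calculation is a binomial probability, which a short sketch can confirm. The helper binom_pmf is written here for illustration; the comparison against the DoD's 49.8% accuracy figure is an extra step added in this example and is not part of the original solution.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(exactly k events in n independent trials, each with probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_all_correct = 0.98 ** 4                  # part (a)
p_claim = binom_pmf(2, 4, 0.02)            # two errors in four at 98% accuracy
p_dod = binom_pmf(2, 4, 1 - 0.498)         # same outcome at 49.8% accuracy

print(round(p_all_correct, 4), round(p_claim, 6), round(p_dod, 3))
```

Two errors in four tests is far more plausible under the DoD's accuracy rate than under the manufacturer's claim.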
Problem 8
Auditing an Accounting System — 1 point
In auditing a firm’s financial statements, an auditor will (1) assess the capability of the firm’s
accounting system to accumulate, measure, and synthesize transactional data properly and (2) assess
the operational effectiveness of the accounting system. In performing the second assessment, the auditor
frequently relies on a random sample of actual transactions (Stickney and Weil, Financial Accounting:
An Introduction to Concepts, Methods, and Uses, 2002). A particular firm has 5,382 customer accounts
that are numbered from 0001 to 5382.
(a) One account is to be selected at random for audit. What is the probability that account number
3,241 is selected?
By the definition of random selection, each account number is equally likely to be selected, so the
probability is $\frac{1}{5{,}382}$.
(b) Draw a random sample of 10 accounts, and explain in detail the procedure you used. (Hint: Python
can do this)
There are many ways to do this. As seen in my Python script, I used Numpy: np.random.randint(low=1,
high=5383, size=10), where high is exclusive. Results will vary but I got {3401, 1929, 2362, 3980, 4311, 398, 2271, 1496, 1327,
3234}.
(c) Referring to part b, is one sample of size 10 more likely to be chosen than any other? What is the
probability that the sample you drew in part b was selected?
No, every sample of size 10 is equally likely because they are randomly selected. There are $\binom{5382}{10}$ possible
samples of size 10, so the probability that any given sample is selected is
$\frac{1}{\binom{5382}{10}} \approx \frac{1}{5.57 \times 10^{30}}$.
Whatever sample you got, it was an extremely unlikely outcome!
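A sketch of parts (b) and (c) using only the standard library is shown below. It draws without replacement, which matches the combination count used in part (c); note that np.random.randint samples with replacement and treats high as exclusive, so numpy.random.Generator.choice with replace=False would be the closer NumPy equivalent. The seed value is arbitrary and only makes the draw repeatable.

```python
import random
from math import comb

N_ACCOUNTS = 5382

rng = random.Random(512)  # arbitrary seed, chosen only for repeatability
sample = rng.sample(range(1, N_ACCOUNTS + 1), k=10)  # 10 distinct account numbers

n_samples = comb(N_ACCOUNTS, 10)  # number of distinct samples of size 10
p_sample = 1 / n_samples          # probability of any one particular sample

print(sorted(sample))
print(f"{n_samples:.2e} possible samples")
```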
Problem 9
Fish Contamination — 1 point
A U.S. Army Corps of Engineers (USACE) study focused on DDT contamination of fish in the
Tennessee River in Alabama. Part of that investigation studied how far upstream contaminated fish
have migrated. A fish is considered to be contaminated if its measured DDT concentration is greater
than 5.0 parts per million (ppm).
(a) Considering only contaminated fish captured from the Tennessee River, the data reveal that 52%
of the fish are found 275–300 miles upstream, 39% are found 305–325 miles upstream, and 9% are found
330–350 miles upstream. Use the percentages to estimate the probabilities P (275–300), P (305–325),
and P (330–350).
The best estimate of probability is the proportion observed.
P (275–300) = 0.52
P (305–325) = 0.39
P (330–350) = 0.09
(b) Given that a contaminated fish is found a certain distance upstream, the probability that it is a
channel catfish (CC) is determined from the data as P (CC|275–300) = 0.775, P (CC|305–325) = 0.77,
and P (CC|330–350) = 0.86. If a contaminated channel catfish is captured from the Tennessee River,
what is the probability that it was captured 275–300 miles upstream?
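Using Bayes’s Rule,
$P(\text{275–300} \mid CC) = \frac{P(CC \mid \text{275–300}) P(\text{275–300})}{P(CC)}$
$= \frac{P(CC \mid \text{275–300}) P(\text{275–300})}{P(CC \mid \text{275–300}) P(\text{275–300}) + P(CC \mid \text{305–325}) P(\text{305–325}) + P(CC \mid \text{330–350}) P(\text{330–350})}$
$= \frac{(0.775)(0.52)}{(0.775)(0.52) + (0.77)(0.39) + (0.86)(0.09)}$
$P(\text{275–300} \mid CC) = 0.5162$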
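The same Bayes's Rule calculation generalizes to any partition of outcomes. A minimal sketch follows; the function name posterior and the dictionary keys are assumptions of this example.

```python
def posterior(priors, likelihoods):
    """Bayes's Rule over a partition: returns P(H | E) for every hypothesis H."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(E) by the law of total probability
    return {h: j / evidence for h, j in joint.items()}


priors = {"275-300": 0.52, "305-325": 0.39, "330-350": 0.09}
p_cc_given = {"275-300": 0.775, "305-325": 0.77, "330-350": 0.86}

post = posterior(priors, p_cc_given)
print({zone: round(p, 4) for zone, p in post.items()})
```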
Homework 2 Solutions
DASC 512
Problem 1
Get Out of Jail Free — 2 points
In the recent Monopoly Gamer version of the classic board game Monopoly, players can either pay
the fine to be released from Jail or roll a die and be released for free upon rolling a 6. In the actual
rules, the player is released automatically after 3 turns, but let’s assume a house rule in which you can
attempt the roll infinitely. Let x be the number of rolls of a 6-sided die until rolling the first 6. This
has the known probability distribution
$P(X = x) = p(x) = \left(\frac{5}{6}\right)^{x-1} \frac{1}{6}$
(a) Find p(1) and interpret the result.
$p(1) = \frac{1}{6} \times \left(\frac{5}{6}\right)^{0} = \frac{1}{6}$
The probability of being released for free on the first attempt is the probability of rolling a 6 in a single attempt: $\frac{1}{6}$.
(b) Find p(5) and interpret the result.
$p(5) = \frac{1}{6} \times \left(\frac{5}{6}\right)^{4} = \frac{5^4}{6^5} = \frac{625}{7776} \approx 0.0804$
About 1 in 12 stays in Jail will be ended for free on the 5th attempt, assuming the player never pays the
fine.
(c) Find P (X ≥ 2) and interpret the result.
$P(X \ge 2) = 1 - P(X = 1) = 1 - p(1) = 1 - \frac{1}{6} = \frac{5}{6}$
(d) If we played by the original rules where the player was released for free without rolling at the start
of their third turn, what would be the probability of this outcome?
This outcome is $P(X \ge 3)$.
$P(X \ge 3) = 1 - P(X = 1) - P(X = 2) = 1 - \frac{1}{6} - \frac{5}{6^2} = 1 - \frac{11}{36} = \frac{25}{36}$
Just over two-thirds of attempts to escape jail for free before the 3rd turn will be unsuccessful, resulting in
two wasted turns.
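These values can be verified exactly with rational arithmetic. The sketch below is an illustration, with the helper names pmf and at_least chosen here; it relies on the fact that needing at least x rolls means the first x - 1 rolls all failed.

```python
from fractions import Fraction

P_SIX = Fraction(1, 6)


def pmf(x, p=P_SIX):
    """P(X = x): x - 1 failures followed by one success."""
    return (1 - p) ** (x - 1) * p


def at_least(x, p=P_SIX):
    """P(X >= x): the first x - 1 rolls all fail."""
    return (1 - p) ** (x - 1)


print(pmf(1), pmf(5), float(pmf(5)))
print(at_least(2), at_least(3))
```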
(e) In the original Monopoly, players rolled two dice and escaped Jail on doubles (i.e., when rolling the
same number on both dice). How would this affect the probability distribution? Explain your logic.
It would not affect the probability distribution at all. There are 6 ways to roll doubles with two dice:
(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6). There are 36 total ways to roll two dice. So the probability of rolling
a success remains $\frac{1}{6}$.
Problem 2
Six Sigma processes — 2 points
The “Six Sigma” quality control process dictates that manufacturing processes should be controlled
such that all products within 3 standard deviations of the mean on some measure are within allowable
deviations. However, this can be expensive, and many companies specify different quality control
standards depending on the cost of failure.
Consider two companies that produce computer monitors sold on an online marketplace similar to Amazon. Company A adheres to six-sigma for their monitors, so only 0.3% of monitors do not function
correctly. Company B sells fewer monitors and adheres to a four-sigma process, so 5% of monitors do
not function correctly.
(a) Let X be the number of non-functioning monitors in the first production run of 1,000 monitors
from company A. Let Y be the number of non-functioning monitors in the first production run of 50
monitors from company B. Define the distribution and pmf for both X and Y assuming that each
monitor’s quality is independent.
These are both binomial random variables.
$X \sim \mathrm{Binom}(p = 0.003, n = 1000)$
$f_X(x) = \binom{1000}{x} 0.003^{x} \, 0.997^{1000-x}$
$Y \sim \mathrm{Binom}(p = 0.05, n = 50)$
$f_Y(y) = \binom{50}{y} 0.05^{y} \, 0.95^{50-y}$
(b) What is the mean number of non-functioning monitors from each company’s first batch? What is
the variance?
$\mu_X = np = 1000(0.003) = 3$
$\sigma_X^2 = np(1 - p) = 1000(0.003)(0.997) = 2.991$
$\mu_Y = np = 50(0.05) = 2.5$
$\sigma_Y^2 = np(1 - p) = 50(0.05)(0.95) = 2.375$
(c) What is the probability that each company’s first run will be perfect (i.e., have no non-functioning
monitors)?

$f_X(0) = \binom{1000}{0} 0.003^{0} \, 0.997^{1000} = 0.997^{1000} \approx 0.0496$
$f_Y(0) = \binom{50}{0} 0.05^{0} \, 0.95^{50} = 0.95^{50} \approx 0.0769$
There is about a 5% chance that Company A will have a perfect first run. There is about a 7.7% chance
that Company B will have a perfect (albeit much smaller) first run.
(d) Plot the pmf for both distributions with some reasonable x- and y-limits and compare them. What
can you say in general about the relative number of failed monitors each company can expect to ship
in their first batches?
Company B is more likely to have 0, 1, or 2 defects than Company A. Company A is more likely to have
3 or more defects. In general, Company A will likely have more defective monitors than Company B in their
first batch (as a total, clearly not as a proportion).
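A sketch of how this comparison might be produced is below. It tabulates both pmfs for 0 to 10 defects and lists the counts at which Company B's probability is the larger one. The optional `plot_pmfs` helper is my own addition and assumes matplotlib is installed; it is defined but not called.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

ks = list(range(0, 11))  # 0..10 defects covers nearly all of the probability
pmf_a = [binom_pmf(k, 1000, 0.003) for k in ks]
pmf_b = [binom_pmf(k, 50, 0.05) for k in ks]

# Defect counts at which Company B's pmf exceeds Company A's.
b_more_likely = [k for k, a, b in zip(ks, pmf_a, pmf_b) if b > a]
print(b_more_likely)

def plot_pmfs():
    """Optional: draw both pmfs side by side (requires matplotlib)."""
    import matplotlib.pyplot as plt

    width = 0.4
    plt.bar([k - width / 2 for k in ks], pmf_a, width, label="Company A")
    plt.bar([k + width / 2 for k in ks], pmf_b, width, label="Company B")
    plt.xlabel("non-functioning monitors in first run")
    plt.ylabel("probability")
    plt.legend()
    plt.show()
```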
Problem 3
Emergency Room arrivals — 2 points
Each day a hospital records the number of people who come to the emergency room for treatment.
(a) Assume that people arrive at a constant rate each day — that is, that they arrive according to a
Poisson distribution — with an average of 25 patients arriving per day. What is the probability that
less than 20 patients arrive today?
We are given that the distribution of arrivals is X ∼ Poisson(λ = 25). We can then use the cdf in Python.
P(X > 35) ≈ 0.02246
stats.poisson.sf(k=35, mu=25)
Multiplying this by 365 days in a year (365.24 if you want to be really specific to a solar year) gives us the
expected number of days that this will occur.
P (X > 35) × 365 ≈ 8.2
So we expect the ER to be overwhelmed about 8–9 times per year.
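The Poisson probabilities used here can also be reproduced without SciPy. The sketch below sums the pmf directly; `poisson_cdf` is a helper name of my own. With SciPy, the equivalent calls are `stats.poisson.cdf(19, mu=25)` and `stats.poisson.sf(k=35, mu=25)`.

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

lam = 25

# "Fewer than 20" means X <= 19 for a count.
p_fewer_than_20 = poisson_cdf(19, lam)

# "More than 35" is the upper tail beyond 35.
p_more_than_35 = 1 - poisson_cdf(35, lam)

# Expected number of days per year with more than 35 arrivals.
expected_days = 365 * p_more_than_35

print(round(p_fewer_than_20, 4), round(p_more_than_35, 5), round(expected_days, 1))
```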
(c) In a particular week, the arrivals to the ER are:
Day        Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
Arrivals   10      8       14       7          21        44      60
Do you think that the Poisson distribution might describe the random variability in arrivals adequately? Why or why not?
No. The daily average number of arrivals was 23.4. If we assume that the number of daily arrivals is
Poisson distributed with λ = 23.4, the pmf would look like below.
Both the Friday and Saturday arrivals of 44 and 60 would be exceedingly rare events. It is far more likely
that Fridays and Saturdays are busier than the rest of the week. Realistically, you can expect to see weekly
and seasonal trends.
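To put a number on "exceedingly rare," the sketch below computes the upper-tail probabilities for the Friday and Saturday counts under a Poisson model fitted to the weekly mean. The helper `poisson_upper_tail` is my own name, and treating the week's mean as λ is the same simplifying assumption made above.

```python
from math import exp, factorial

arrivals = {
    "Sunday": 10, "Monday": 8, "Tuesday": 14, "Wednesday": 7,
    "Thursday": 21, "Friday": 44, "Saturday": 60,
}

# Estimate lambda with the observed daily mean.
lam = sum(arrivals.values()) / len(arrivals)

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

p_friday = poisson_upper_tail(arrivals["Friday"], lam)
p_saturday = poisson_upper_tail(arrivals["Saturday"], lam)

print(f"lambda = {lam:.1f}")
print(f"P(X >= 44) = {p_friday:.2e}")
print(f"P(X >= 60) = {p_saturday:.2e}")
```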
(d) Building upon your answer to part c, would you expect the Poisson distribution to better describe,
or more poorly describe, the number of weekly admissions to the ER? Why?
It would likely far better describe weekly admissions, because that would smooth out the fluctuations by
day of the week by including a Friday and a Saturday in each data point.
Problem 4
Normal Location Families — 2 points
Lake Wobegon Junior College admits students only if they score above 400 on a standardized achievement test. Applicants from Group A have a mean of 500 and a standard deviation of 100 on this test,
and applicants from Group B have a mean of 450 and a standard deviation of 100. Both distributions
are approximately normal, and both groups have the same size.
(a) Find the proportion not admitted for each group.
For each group, we are looking for the cdf at x = 400. We have two distributions:
A ∼ N (µ = 500, σ = 100)
B ∼ N (µ = 450, σ = 100)
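The cdf values at the cutoff can be computed with the standard library's `statistics.NormalDist`. This is a small sketch; the variable names are mine. `scipy.stats.norm.cdf(400, loc=..., scale=...)` gives the same numbers.

```python
from statistics import NormalDist

cutoff = 400
group_a = NormalDist(mu=500, sigma=100)
group_b = NormalDist(mu=450, sigma=100)

# Proportion scoring below the cutoff, i.e. not admitted.
not_admitted_a = group_a.cdf(cutoff)
not_admitted_b = group_b.cdf(cutoff)

print(round(not_admitted_a, 4), round(not_admitted_b, 4))
```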
So the proportion not admitted for each group is
P(A < 400) = Φ(−1) ≈ 0.1587
P(B < 400) = Φ(−0.5) ≈ 0.3085
Ha : … > 0
This is a one-sided alternative hypothesis because it specifies a range and a direction.
(d) The median family income is the same in Colorado Springs as in Duluth.
Let θC be the median income in Colorado Springs and θD be the median income in Duluth.
H0 : θC = θD
This is a null hypothesis because it takes a specific value.
(e) The variance in resting heart rates is lower for collegiate rowers than for collegiate volleyball
players.
Let σR² be the variance for rowers and σV² be the variance for volleyball players.
Ha : σR² < σV²
… > 0.5
Type of test: Binomial exact test (Could use z-test)
Significance: α = 0.05. The sample size is large enough that any practical difference should be visible
with high confidence.
Test statistic: p̂ = 0.52
P value: p = 0.0874
Conclusion: There is insufficient evidence to reject the null hypothesis that neither group represents a
majority.
Problem 4
Who’s a good dog? — 1 point
In a study to determine whether dogs prefer petting or vocal praise, researchers randomly placed
14 dogs into two groups of 7 each. In group 1, the owner would pet the dog. In group 2, the owner
would provide vocal praise. Researchers measured the time, in seconds, that the dog interacted with its
owner.
(a) Owners in group 1 got 114, 203, 217, 254, 256, 284, and 296 seconds of interaction. Owners in
group 2 got 4, 7, 24, 25, 48, 71, and 294 seconds of interaction. The low outlier value in group 1 and the
high outlier value in group 2 indicate the distributions may be highly skewed, so perform a hypothesis
test on the median.
BLUF: This data shows that dogs prefer petting over vocal praise, with dogs being petted interacting
with owners longer than those being praised.
Let θ1 and θ2 be the medians of the two groups.
Hypotheses: H0 : θ1 = θ2 Ha : θ1 ≠ θ2
Type of test: Mann-Whitney U test
Significance: Not specified. Given the small samples and low cost of a type I error, I’d go with a higher
value like α = 0.1.
Test statistic: U = 43
P value: p = 0.0215
Conclusion: We reject the null hypothesis and find that these dogs prefer petting over vocal praise.
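The U statistic and its p-value can be worked out by hand in Python, which also shows where the reported numbers come from. The sketch below is standard library only and the helper names are my own. The reported p = 0.0215 matches the normal approximation with a continuity correction; enumerating all 7/7 splits gives an exact two-sided p-value of about 0.0175. Both fall below α = 0.1. `scipy.stats.mannwhitneyu` reports the same statistic, and its p-value depends on whether the exact or asymptotic method is used.

```python
from itertools import combinations
from statistics import NormalDist

petting = [114, 203, 217, 254, 256, 284, 296]
praise = [4, 7, 24, 25, 48, 71, 294]

def u_statistic(xs, ys):
    """Number of (x, y) pairs with x > y (there are no ties in this data)."""
    return sum(x > y for x in xs for y in ys)

u = u_statistic(petting, praise)
n1, n2 = len(petting), len(praise)

# Normal approximation with continuity correction (two-sided).
mean_u = n1 * n2 / 2
sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
z = (abs(u - mean_u) - 0.5) / sd_u
p_asymptotic = 2 * (1 - NormalDist().cdf(z))

# Exact two-sided p-value: enumerate every way to split the 14 values 7/7.
pooled = petting + praise
extreme = total = 0
for idx in combinations(range(len(pooled)), n1):
    chosen = set(idx)
    xs = [pooled[i] for i in chosen]
    ys = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    total += 1
    if abs(u_statistic(xs, ys) - mean_u) >= abs(u - mean_u):
        extreme += 1
p_exact = extreme / total

print(u, round(p_asymptotic, 4), round(p_exact, 4))
```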
Homework 3 Solutions
DASC 512
(b) Suppose that the measurements in part a were gathered for the same owner/dog pair on separate
days. Perform an appropriate hypothesis test to determine if there was a difference between groups.
The differences between groups still exhibit significant left-skew. Furthermore, the Shapiro-Wilk test indicates non-normality (p = 0.0212).
While a paired t-test is an acceptable choice, a paired Wilcoxon signed-rank test is probably a
better choice and more comparable to part a. However, we didn’t explicitly go over that as an option in the
lessons. Philosophically, a paired test can be executed for any one-sample test method by simply testing the
difference. Notably, Python implements the Wilcoxon test as a paired test.
BLUF: This data gives strong evidence that dogs prefer petting over vocal praise, with dogs being petted
interacting with owners longer than those being praised.
Hypotheses: H0 : θ1 = θ2 Ha : θ1 ≠ θ2
Type of test: Paired Wilcoxon signed-rank test
Significance: Not specified. Given the small samples and low cost of a type I error, I’d go with a higher
value like α = 0.1.
Test statistic: W = 0
P value: p = 0.0156
Conclusion: We reject the null hypothesis and find that these dogs prefer petting over vocal praise.
If the paired t-test had been performed, the results would have been similar with the following differences:
Hypotheses: H0 : µ1 = µ2 Ha : µ1 ≠ µ2
Type of test: Paired t-test
Test statistic: t = 5.36
P value: p = 0.0017
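Both paired statistics can be reproduced from the differences. The sketch below assumes the values are paired in the order listed in part (a), which is the pairing that reproduces the reported W, p, and t. It uses only the standard library; `scipy.stats.wilcoxon` and `scipy.stats.ttest_rel` are the library equivalents.

```python
from itertools import product
from statistics import mean, stdev

petting = [114, 203, 217, 254, 256, 284, 296]
praise = [4, 7, 24, 25, 48, 71, 294]

# Assumption: the i-th value in each list belongs to the same owner/dog pair.
diffs = [a - b for a, b in zip(petting, praise)]
n = len(diffs)

# Rank the absolute differences (no ties or zeros in this data).
order = sorted(range(n), key=lambda i: abs(diffs[i]))
ranks = [0] * n
for rank, i in enumerate(order, start=1):
    ranks[i] = rank

w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
w = min(w_plus, w_minus)

# Exact two-sided p-value: under H0 each rank carries + or - with equal chance.
count = 0
for signs in product([0, 1], repeat=n):
    plus = sum(r for r, s in zip(range(1, n + 1), signs) if s)
    minus = n * (n + 1) // 2 - plus
    if min(plus, minus) <= w:
        count += 1
p_exact = count / 2**n

# Paired t statistic: a one-sample t-test on the differences.
t_stat = mean(diffs) / (stdev(diffs) / n**0.5)

print(w, p_exact, round(t_stat, 2))
```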
Problem 5
Three’s a crowd — 1 point
A recent General Social Survey (GSS) asked the question, “What do you think is the ideal number
of children to have?” The mean value of 1302 responses was 2.49 with a standard deviation of 0.85. Do
Americans on average think that the ideal number of children is more than 2?
BLUF: This survey gives very strong evidence that Americans on average think that the ideal number of
children is more than 2.
With such a high sample size, we’ll have no problem using a one-sample t-test or a z-test. Let µ be the
average response of all Americans.
Hypotheses: H0 : µ = 2 Ha : µ > 2
Type of test: One-sample t-test.
Significance: Not specified, but with such a large sample we can use α = 0.05.
Test statistic: t = (2.49 − 2)/(0.85/√1302) = 20.8
Critical value: t∗ = 1.65
P value: p = 1.7 × 10^−83
Conclusion: We reject the null hypothesis and find that Americans on average think that the ideal number
of children is more than 2.
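Because only summary statistics are given, the test statistic is computed directly from the formula. A minimal sketch follows; the variable names are mine, and the critical value uses the normal quantile as a large-sample stand-in for the t quantile.

```python
from statistics import NormalDist

n = 1302            # number of responses
sample_mean = 2.49  # mean reported ideal number of children
sample_sd = 0.85    # sample standard deviation
mu_0 = 2            # value under the null hypothesis

standard_error = sample_sd / n**0.5
t_stat = (sample_mean - mu_0) / standard_error

# One-sided critical value at alpha = 0.05 (normal approximation for large n).
critical = NormalDist().inv_cdf(0.95)
reject_null = t_stat > critical

print(round(t_stat, 1), round(critical, 3), reject_null)
```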
Problem 6
You batter bell-lieve it — 2 points
Use the file BattingAverages.csv, containing batting averages for all players with at least 100 at
bats for the 2009 season, for the following questions. Assume this is a random sample rather than a
census.
(a) Are the batting averages data (BattingAvg) normally distributed? Use both graphical and analytical methods to make your argument.
BLUF: The batting averages data appear to be normally distributed.
Let X be the batting averages data.
Hypotheses: H0 : X ∼ Norm(x̄, s²) Ha : X ≁ Norm(x̄, s²)
Type of test: Visual and Lilliefors (you may have chosen another test)
Significance: Not specified, but I would want very strong evidence of non-normality to possibly push me
to a non-parametric test. I’ll use α = 0.01.
Test statistic: T = 0.0232
P value: p = 0.8471
Conclusion: We fail to reject the null hypothesis that the data is normally distributed. Visual assessment
with a Q-Q plot confirms near-normality for this data.
(b) Is the mean value of batting averages at least .265? Perform a test to find out. Use α = 0.05 for
your test.
BLUF: There is insufficient evidence to conclude that the mean batting average in the 2009 season was
greater than .265. In fact, the observed average was less than .265 (.261).
Since we are comparing means and the data is approximately normal with a large sample size, we can
use either the t-test or the z-test. I’ll use the t-test.
Hypotheses: H0 : µ = 0.265 Ha : µ > 0.265
Type of test: One-sample t-test.
Significance: α = 0.05 as specified.
Test statistic: t = −2.37
Critical value: t∗ = 1.65
P value: p = 0.9910
Conclusion: We fail to reject the null hypothesis that the mean batting average is .265.
(c) Was there a difference between batting averages in the National League and American League
(column League)? Use α = 0.05 for your test.
BLUF: There is insufficient evidence to conclude that the National League mean batting average was
different from the American League mean batting average in the 2009 season.
Since we are comparing means and the data is approximately normal with a large sample size, we can
use either the two-sample t-test or the two-sample z-test. I’ll use the t-test.
Hypotheses: H0 : µN − µA = 0 Ha : µN − µA ≠ 0
Type of test: Two-sample t-test with pooled variance. The sample variance for the NL is 0.0011 and the
sample variance for the AL is 0.0012.
Significance: α = 0.05 as specified.
Test statistic: t = −0.0985
Critical value: t∗ ≈ 1.97 (two-sided)
P value: p = 0.9216
Conclusion: We fail to reject the null hypothesis that the mean batting average is the same between the
leagues.
Problem 7
Cowbell usage must have increased… — 2 points
A researcher studying true body temperature in adult humans collected the data in BodyTemp.csv
in degrees Fahrenheit.
(a) Is the body temperature normally distributed? Use graphical and analytical methods to make your
argument.
BLUF: The body temperature data appear to be normally distributed.
Let X be the body temperature observations.
Hypotheses: H0 : X ∼ Norm(x̄, s²) Ha : X ≁ Norm(x̄, s²)
Type of test: Visual and Lilliefors (you may have chosen another test)
Significance: Not specified, but I would want very strong evidence of non-normality to possibly push me
to a non-parametric test. I’ll use α = 0.01.
Test statistic: T = 0.0692
P value: p = 0.1195
Conclusion: We fail to reject the null hypothesis that the data is normally distributed. Visual assessment
with a Q-Q plot confirms near-normality for this data with perhaps some deviation at the tails.
(b) Is the body temperature equal to 98.6?
BLUF: There is very strong evidence that the average body temperature is not 98.6.
Since we are comparing means and the data is approximately normal with a large sample size, we can
use either the t-test or the z-test. I’ll use the t-test.
Hypotheses: H0 : µ = 98.6 Ha : µ ≠ 98.6
Type of test: One-sample t-test.
Significance: α = 0.05 because we have enough data to make a strong conclusion.
Test statistic: t = −6.03
Critical value: t∗ = 1.98 (two-sided)
P value: p = 1.3 × 10^−8
Conclusion: We reject the null hypothesis that the mean body temperature is 98.6 degrees.
(c) For the α you selected, what is the power to detect a difference of 0.2 degrees? Assume the sample
variance is equal to the population variance.
For a difference of 0.2 degrees, the effect size (as a multiple of s) is 0.27. With 148 observations and
α = 0.05, the power is 0.9061.
(d) Create a plot showing how α (x-axis) affects the power to detect a difference of 0.2 degrees (y-axis).
(e) Is there a difference between body temperature in males and females?
BLUF: There is enough evidence to conclude that males and females have different average body temperatures.
Since we are comparing means and the data is approximately normal with a large (and equal) sample
size, we can use either the two-sample t-test or the two-sample z-test. I’ll use the t-test.
Hypotheses: H0 : µM − µF = 0 Ha : µM − µF ≠ 0
Type of test: Two-sample t-test with pooled variance. The sample variance for males is 0.4912 and the
sample variance for females is 0.5497.
Significance: α = 0.05 because we have enough data to make a strong conclusion.
Test statistic: t = −2.77
Critical value: t∗ = 1.98
P value: p = 0.0064
Conclusion: We reject the null hypothesis that average body temperature is equal for males and females.
Problem 8
The power of love — 1 point
I want to design a study to determine if my daughter can average more than five minutes without
asking me a question while teleworking. I plan on using a t-test regardless of sample size.
(a) I want to detect a difference of 90 seconds with 80% power and 90% confidence. How many intervals
do I need to measure? Assume a standard deviation of 2 minutes.
For a difference of 90 seconds, the effect size (as a multiple of s) is 0.75. I’ll need a sample size of at least
12.46 intervals, so I must measure at least 13 intervals.
(b) For differences of 30, 60, and 90 seconds, construct a plot showing the effect that sample size
(x-axis) will have on power (y-axis) for α = 0.1.
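One way to carry out both parts is sketched below with SciPy's central and noncentral t distributions. The reported 12.46 appears consistent with a two-sided t-test at α = 0.1, so the sketch assumes a two-sided test; that choice, the helper `t_test_power`, and the optional `plot_power` function (which needs matplotlib and is not called) are my own assumptions rather than something the problem statement fixes.

```python
import math

from scipy import stats
from scipy.optimize import brentq

def t_test_power(effect_size, n, alpha):
    """Power of a two-sided one-sample t-test with n observations."""
    df = n - 1
    nc = effect_size * math.sqrt(n)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

sd = 120.0   # assumed standard deviation, in seconds
alpha = 0.1  # 90% confidence

# (a) Smallest n that gives 80% power for a 90-second difference.
d90 = 90 / sd
n_required = brentq(lambda n: t_test_power(d90, n, alpha) - 0.8, 2, 200)
print(round(n_required, 2), math.ceil(n_required))

# (b) Power as a function of sample size for each difference.
sample_sizes = list(range(2, 61))
curves = {
    diff: [t_test_power(diff / sd, n, alpha) for n in sample_sizes]
    for diff in (30, 60, 90)
}

def plot_power():
    """Optional: plot the three power curves (requires matplotlib)."""
    import matplotlib.pyplot as plt

    for diff, powers in curves.items():
        plt.plot(sample_sizes, powers, label=f"{diff} s difference")
    plt.xlabel("sample size")
    plt.ylabel("power")
    plt.legend()
    plt.show()
```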
Winter Quarter 2022
Midterm Exam
DASC 512
Cover Sheet and Instructions
There are five (5) pages and nine (9) questions on this exam. This mid-term exam covers Weeks 1–5 of
DASC 512 material. It is worth 35% of your overall course grade. It is due not later than 2359 EST, 14 Feb
2022, which is two weeks from the date the exam is posted. I will accept late submissions up to 2 days late at
a 10% reduction in grade. I will not accept submissions later than that. Note that a 10% reduction in your
mid-term grade is roughly equivalent to failing to submit an entire homework. Submissions must include
both a PDF fully detailing your response to the questions (i.e., results, narrative, tables, and graphs) along
with any Python code you used in a .py or .ipynb format. Code does not need to be part of the PDF.
Integrity Rules: If you have questions on this exam, you may contact the instructor or post a question
on the Mid-term Discussion Forum. You are not allowed to answer anyone else’s question on that forum —
the instructor will choose the level of information to provide.
Instructions: For all problems, be sure to give full details of your analysis.
I recommend using an Assume, Given, Find, Solution, Answer method to organize your thoughts and
response for problems that are not hypothesis tests. This is not required, but it helps guide your thought
process.
For hypothesis tests, be sure to include:
• a non-technical summary of results
• hypothesis statements
• assumptions
• test chosen with justification
• significance level
• appropriate results, such as test statistic, p-value, rejection region, confidence intervals, and/or ANOVA
tables
• technical conclusion
Examples of well-formulated solutions are given on the next page.
Example Solutions
Example Problem 1: In the board game Gloomhaven, characters start with decks of 20 cards that
provide modifiers for an attack: 1 miss (0 damage), 1 -2 (base – 2 damage), 5 -1 (base – 1 damage), 6 +0
(base damage), 5 +1 (base + 1 damage), 1 +2 (base + 2 damage), and 1 critical hit (2x base damage). A
character attacks with a 3-damage attack and uses advantage, taking the higher of two random modifiers.
What is the probability that they do at least 4 damage?
Assume: The full deck is available and well shuffled. Let X be the damage from a single card draw. Let
Y be the damage with advantage.
Given: For X, P(0) = 1/20, P(1) = 1/20, P(2) = 5/20, P(3) = 6/20, P(4) = 5/20, P(5) = 1/20, P(6) =
1/20.
Find: P (Y ≥ 4)
Solution: On the first draw, P(X ≥ 4) = 7/20. If the first draw results in less than 4 damage, then
P(Y ≥ 4) = 7/20. If the first draw results in at least 4 damage, then P(Y ≥ 4) = 1. Thus
P(Y ≥ 4) = P(Y ≥ 4 | X ≥ 4) P(X ≥ 4) + P(Y ≥ 4 | X < 4) P(X < 4) = (1)(7/20) + (7/20)(13/20) = 231/400 ≈ 0.578
… > 0.
Assumptions: Both dice are fair dice with 1/6 chance of rolling each outcome — a uniform distribution
between 1 and 6. Each roll is iid. Although the underlying distribution is uniform, the sample size is 100 for
both groups, so assume the population is normally distributed with µ = 3.5, σ2 = 35/12 — the mean and
variance of a discrete uniform distribution with 6 outcomes.
Type of test: Two-sample t-test with equal variance. Sample size is sufficient for the Central Limit
Theorem to apply to the sampling distribution despite underlying uniform distributions.
Significance: The risk of a type-I error is low, so let α = 0.1.
Test statistic: t = 2.07
Rejection region: t > 1.29
P-value: p = 0.0199
Confidence interval: With 90% confidence, Beau’s rolls are better than Chris’s by between 0.22 and 0.78
on average.
Conclusion: At the 0.1 significance level, we reject the null hypothesis that Chris and Beau roll equally
well and conclude that Beau’s rolls are higher on average than Chris’s.
Exam Questions
Problem 1: 5 points
Suppose that there are four inspectors at a film factory who are supposed to stamp the expiration date
on each package of film at the end of the assembly line: John, Tina, Wayne, and Amy. John processes
20% of all packages, and he fails to stamp 1/200 packages that he processes. Tina processes 60% of all
packages, and she fails to stamp 1/100 packages. Wayne processes 15% of all packages, and he fails to
stamp 1/90 packages. Amy processes 5% of all packages, and she fails to stamp 1/200 packages.
A customer calls to complain that her package of film does not show an expiration date. What is
the probability that it was inspected by John?
Assume: Each inspector's error rate is constant, and every package is processed by exactly one inspector.
F: the event that a package is missing its stamp.
J: the event that John processed the package.
T: the event that Tina processed the package.
W: the event that Wayne processed the package.
A: the event that Amy processed the package.
Given:
P(J) = 20/100 = 0.2, P(F|J) = 1/200 = 0.005
P(T) = 60/100 = 0.6, P(F|T) = 1/100 = 0.01
P(W) = 15/100 = 0.15, P(F|W) = 1/90 ≈ 0.0111
P(A) = 5/100 = 0.05, P(F|A) = 1/200 = 0.005
Find: P(J|F)
Solution: By Bayes' rule, P(A|B) = P(B|A)P(A)/P(B). By the law of total probability,
P(A) = \sum_{k=1}^{n} P(B_k) P(A|B_k).
P(F) = 0.2 × 0.005 + 0.6 × 0.01 + 0.15 × 0.0111 + 0.05 × 0.005 ≈ 0.00892
Then P(J|F) = P(F|J)P(J)/P(F) = (0.005 × 0.2)/0.00892
Answer: P(J|F) ≈ 0.1121
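The Bayes' rule calculation is easy to check with exact fractions, which avoids rounding 1/90. This is a minimal sketch; the dictionary layout and names are my own. Summing the four products gives P(F) = 107/12000 ≈ 0.00892, so P(J|F) = 12/107 ≈ 0.112.

```python
from fractions import Fraction

# inspector: (share of packages processed, probability of a missed stamp)
inspectors = {
    "John": (Fraction(20, 100), Fraction(1, 200)),
    "Tina": (Fraction(60, 100), Fraction(1, 100)),
    "Wayne": (Fraction(15, 100), Fraction(1, 90)),
    "Amy": (Fraction(5, 100), Fraction(1, 200)),
}

# Law of total probability: P(F) = sum of P(inspector) * P(F | inspector).
p_missing = sum(share * miss for share, miss in inspectors.values())

# Bayes' rule: P(John | F) = P(F | John) * P(John) / P(F).
share_john, miss_john = inspectors["John"]
p_john_given_missing = share_john * miss_john / p_missing

print(p_missing, float(p_missing))
print(p_john_given_missing, float(p_john_given_missing))
```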
Problem 2: 5 points
A regional telephone company operates three identical relay stations at different locations. During a
one-year period, the number of malfunctions reported by each station and the causes are shown below.
Causes                               Station A   Station B   Station C
Problems with Electricity Supplied   2           6           4
Computer Malfunction                 4           3           1
Malfunctioning Equipment             3           4           3
Human Error                          9           4           7
Suppose that a malfunction was reported and it was found to be caused by human error. What is
the probability that it came from Station C?
Assume: All the stations are distinct and independent.
C: the event that a malfunction came from Station C.
H: the event that a malfunction was caused by human error.
Solution: There are 2 + 6 + 4 + 4 + 3 + 1 + 3 + 4 + 3 + 9 + 4 + 7 = 50 malfunctions in total.
P(C) = (4 + 1 + 3 + 7)/50 = 0.3
P(H) = (9 + 4 + 7)/50 = 0.4
P(H|C) = 7/(4 + 1 + 3 + 7) = 0.4667
By Bayes' rule, P(C|H) = P(H|C)P(C)/P(H) = (0.4667 × 0.3)/0.4 = 0.35
Answer: P(C|H) = 0.35
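The same answer falls out of the counts directly, which makes a useful cross-check on the Bayes' rule route. The sketch below stores the table as a nested dictionary (a layout of my own choosing) and computes P(C|H) both ways.

```python
from fractions import Fraction

# Malfunction counts by cause and station, from the table in the problem.
counts = {
    "electricity": {"A": 2, "B": 6, "C": 4},
    "computer": {"A": 4, "B": 3, "C": 1},
    "equipment": {"A": 3, "B": 4, "C": 3},
    "human_error": {"A": 9, "B": 4, "C": 7},
}

total = sum(sum(row.values()) for row in counts.values())
station_c_total = sum(row["C"] for row in counts.values())
human_total = sum(counts["human_error"].values())

p_c = Fraction(station_c_total, total)
p_h = Fraction(human_total, total)
p_h_given_c = Fraction(counts["human_error"]["C"], station_c_total)

# Bayes' rule, and the direct conditional count for comparison.
p_c_given_h = p_h_given_c * p_c / p_h
direct = Fraction(counts["human_error"]["C"], human_total)

print(p_c_given_h, float(p_c_given_h))
```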
Problem 3: 10 points
Describe the effect of sample size, effect size, and level of significance on statistical power.
Sample size is the number of observations drawn from a population. When a population is large, it is
not realistic to study all of it, so we take a random sample that represents it. The size of that
sample matters for accurate, statistically significant results. If the sample is too small, it may
include a disproportionate number of outliers and anomalies, which skew the results so that they do
not reflect the entire population. If the sample is too big, the study becomes complex, expensive,
and time-consuming, and although the results are more accurate, the benefits may not outweigh the
costs. Holding everything else fixed, a larger sample size increases statistical power.
Effect size measures the strength of the relationship between two variables on a numeric scale. It
indicates the practical significance of a difference, as opposed to whether the difference could be
due to chance. In hypothesis testing, effect size, power, sample size, and significance level are
related to each other. Holding everything else fixed, a larger effect is easier to detect, so power
increases with effect size.
The significance level, α, is the probability of rejecting the null hypothesis when it is true. A
significant result simply means that the test's p-value was less than or equal to α, which is
commonly set to 0.05. For example, a significance level of 0.05 indicates a 5% risk of concluding
that a difference exists when there is no actual difference. Increasing the significance level
increases statistical power, at the cost of a higher type I error rate.
Problem 4: 5 points
Suppose that we are interested in the IQ of incoming students at AFIT. We want to run a test that
will detect a 5 IQ point difference between the true mean and our new class. From previous research,
we can assume a standard deviation of 15 (i.e., the parameter is known), and leadership wants to have
options for α = 0.05 and α = 0.1. Create a plot for a power analysis where the y-axis is the power of
our test and the x-axis is the sample size. It should have a line for each significance level.
Figure 1: Number of observations Vs Power of Test
Suppose we are interested in the IQ of incoming students at AFIT. We want to run a test that will
detect a 5 IQ point difference between the true mean and our new class. From previous research,
we can assume a standard deviation of 15 (i.e., the parameter is known), and leadership wants to
have options for alpha = 0.05 and alpha = 0.1.
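Since the standard deviation is treated as known, a z-test applies and its power has a closed form. The sketch below is one way to generate the curves for such a plot. It assumes a two-sided test and sample sizes from 5 to 200, neither of which the problem specifies; `z_test_power` and `plot_power_curves` are names of my own, and the plotting helper needs matplotlib and is not called.

```python
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha):
    """Power of a two-sided one-sample z-test to detect a mean shift of delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * n**0.5 / sigma  # standardized shift of the sample mean
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

delta, sigma = 5, 15  # 5 IQ points, known standard deviation of 15
sample_sizes = list(range(5, 201, 5))
power_curves = {
    alpha: [z_test_power(delta, sigma, n, alpha) for n in sample_sizes]
    for alpha in (0.05, 0.1)
}

def plot_power_curves():
    """Optional: one line per significance level (requires matplotlib)."""
    import matplotlib.pyplot as plt

    for alpha, powers in power_curves.items():
        plt.plot(sample_sizes, powers, label=f"alpha = {alpha}")
    plt.xlabel("sample size")
    plt.ylabel("power")
    plt.legend()
    plt.show()
```

Calling `z_test_power(5, 15, 100, 0.05)` gives the power for a class of 100 at the 0.05 level, for example.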
Problem 5: 5 points
The accuracy of a new precision air drop system being tested by the US Air Force follows a normal
distribution with a mean of 50 ft and a standard deviation of 10 ft. A particular resupply mission drops
12 payloads. It is considered to be successful if at least 9 of the 12 payloads are delivered at between
45 and 60 feet. What is the probability that the resupply mission will be successful?
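This problem combines a normal probability with a binomial count, and a short sketch makes the two stages explicit. It assumes the 12 payloads land independently; the variable names are mine. The first stage finds the chance that one payload lands between 45 and 60 ft, and the second finds the chance that at least 9 of 12 do.

```python
from math import comb
from statistics import NormalDist

accuracy = NormalDist(mu=50, sigma=10)

# Stage 1: probability that a single payload lands between 45 and 60 ft.
p_hit = accuracy.cdf(60) - accuracy.cdf(45)

# Stage 2: number of hits out of 12 is Binomial(12, p_hit); need at least 9.
n_payloads, needed = 12, 9
p_success = sum(
    comb(n_payloads, k) * p_hit**k * (1 - p_hit) ** (n_payloads - k)
    for k in range(needed, n_payloads + 1)
)

print(round(p_hit, 4), round(p_success, 4))
```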
Let p be the probability that a single payload lands between 45 and 60 ft. The mission succeeds if
at least 9 of the 12 payloads do.
µ = 50, σ = 10
Using the normal distribution, z = (x − µ)/σ.
For the upper boundary, z = (60 − 50)/10 = 1, so P(X < …
… > 3.3^2
Assumptions: Assume the population is normally distributed.
Type of test: One-sample chi-squared test for variance (standard deviation), given that the population is normally distributed.
Significance: The standard risk of a type-I error is α = 0.05.
Test statistic: χ² = 34.41
Rejection region: χ² > 54.57
P-value: p = 0.6789
Conclusion: At the 0.05 significance level, we fail to reject the null hypothesis; there is insufficient evidence that the
standard deviation of the rod diameters is greater than 3.3. We may have to inform management.
This test gave sufficient evidence to conclude that the mean chlorine content (in ppm of chlorine)
is less than 71 ppm on average.
Hypotheses: H0 : μ = 71, Ha : μ < 71
…
PR(>F): 7.683892e-08, NaN
resulting anova table 2
            df     sum_sq         mean_sq         F          PR(>F)
C(alloy)    1.0    1.058155e+05   105815.511111   5.215857   0.024789
Residual    88.0   1.785280e+06   20287.273232    NaN        NaN
resulting anova table 3
Problem 9: 55 points
The data given in faithful.csv records data of 272 eruptions at the Old Faithful geyser at Yellowstone
National Park. Column 1 reports the length of the eruption (in minutes) and column 2 reports the length
of the interval between the previous eruption and this one (i.e., wait time).
This test gave sufficient evidence to conclude that the mean wait time for eruptions shorter than
3 minutes is less than 60 minutes.
Hypotheses: H0 : μ = 60, Ha : μ < 60