# SOLUTION: Air Force Institute of Technology Regression Project Models and Codes in Python Worksheet

DASC 512 Final Project

## 1 Data

The data in this problem were collected by two economists to be used in constructing a regression equation to serve as a price index for owner-occupied housing in a region containing a large U.S. city. Data were obtained for each of 506 census tracts in and around the city. (The U.S. Census Bureau has partitioned the entire country into geographical regions called census tracts that contain approximately the same number of people.)

Some variables were reported on a census tract basis while others were reported on a community basis. For example, the property tax rate is determined by each community. If a community consists of more than one census tract, the property tax rate will be the same for each census tract in that community.

Census tracts 357–488 (inclusive) are all part of the city proper. The remaining census tracts are in towns or suburbs in the surrounding metropolitan area, but they are not in the city. The census tracts in the city have the same values for the property tax, pupil-teacher ratio, zoning, and highway access variables.

The data for 506 census tracts are in the associated data file student_data.csv, although you should note that the last 50 data points are missing Y values. These are the test points. There is one line of data for each census tract. Values for the variables appear in the order they are listed below. Use these variable names in formulas and tables presented in your report.

With the exception of Census Tract, which is a three-digit identification, the variables are described in Table 1.

| Variable | Description |
|---|---|
| Y | Median value of owner-occupied homes in the census tract |
| X1 | Per capita crime rate in the community |
| X2 | Percentage of a community's residential land zoned for lots greater than 25,000 square feet |
| X3 | Percentage of acres in the community zoned for non-retail business |
| X4 | Dummy variable equaling 1 if the tract borders a specific river and 0 otherwise |
| X5 | Average concentration (parts per 100 million) of nitrogen oxides in the air (a measure of air pollution) |
| X6 | Average number of rooms per owner-occupied home |
| X7 | Percentage of owner-occupied homes that are more than thirty years old |
| X8 | Natural logarithm of the weighted distances to five major employment centers in the metropolitan area. Larger values indicate the tract is farther from the major employment centers. |
| X9 | Natural logarithm of an index of accessibility to radial highways. Calculated on a community basis. Larger values represent better highway access. |
| X10 | Property tax rate in dollars per $10,000 of property value. This measures costs paid by homeowners to maintain schools and public services in each community. |
| X11 | Pupil-teacher ratio in each school district. Lower values may indicate higher quality public schools. |
| X12 | Percentage of adults without a high school diploma or classified as laborers |

Table 1: Variables used in student_data.csv

Winter 2022

## 2 Task

You have been commissioned by a real estate investor to develop a regression model for this data and generate a formal report of the results. This investor conducts some political lobbying to reinforce property values where she has a stake and has a particular interest in the effect of air pollution on median home values.

Your task is to analyze the data for the 456 census tracts for which you have complete data and construct one or more good regression models for predicting Y, the median value of owner-occupied homes. Include additional explanatory variables constructed from functions of the variables on the data file if you think that they are worthwhile. No raw Python output should be present in the report. Summarize your analysis in a report in PDF format that includes the following discussions.

1. A 1–2 paragraph "Executive Summary" of your major conclusions about the relationships between median housing prices and the explanatory variables. Whether it is included in your model or not, you should address the nitrogen oxide variable (X5). This should not contain any formulas or mathematical symbols. It should be written so that it could be easily understood by a real estate investor with no formal training in statistics.

2. A description of the steps taken to identify your best model(s). Do not submit any raw Python output in this section. Graphical analysis and summary statistics are encouraged. Simply outline the issues you considered, your decisions, and the sequence of steps you took to develop a model. Be detailed — tell me what you did, why you did it, and if it worked.
   - For the purposes of this course, consider only models with main effects, quadratic effects (X²), and the following interactions: X1*X5, X4*X5, X5*X6, and X5*X8. This should be reasonable while still forcing you to explore the model building process.
   - Because of this model limitation, it is possible that there may be some higher order effects and/or non-linearity in the data that you cannot model. Remember this when looking at residual plots. Point out if you think there may be deficiencies caused in this way. Y can still be transformed as you see fit if you find it necessary/useful.

3. A formula for your best model(s), standard errors for coefficients, and the R² value. Summarize Python results in tables of your own creation — do not report any values that you do not intend to discuss. Discuss and interpret any important features of your model. Pay some attention to the nitrogen oxide (air pollution) variable as a predictor of median housing values, although you may conclude that it is not important.

4. Convincing evidence that the model you selected is a good model for using some or all of the 12 explanatory variables to predict median housing values. Discussion of residual plots and other diagnostic checks would be appropriate. Statistical tests should be formulated correctly with appropriate hypotheses and conclusions. Graphs and tables are encouraged, but raw Python output should not be submitted and will be ignored.

5. (Optional) One paragraph outlining additional analyses that you would have done if you had more time or were not artificially restricted in your model parameters. You will earn points for suggestions with high potential value, but you will also lose points for suggestions with little potential value.

Separate from the PDF report, submit a CSV file with your predictions for the missing Y data points (the last 50 observations). Use your final, best model to predict Y and create a 95% prediction interval for each point. Points for the "Predictive Ability" section will be based on the following:

1. Mean Square Prediction Error (MSPE) — lower is better

$$\mathrm{MSPE} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}$$

2. Coverage of your confidence intervals — 2–3 intervals are expected to miss due to random error

3. Width of your confidence intervals — narrow is better as long as coverage is acceptable

## 3 Deliverables

1. last_first.pdf — PDF document with your write-up. Use MS Word's "print to PDF" feature as necessary. You should have no raw Python output in this document. This should be a well-structured report with narrative, graphics, and tables as needed.

2. last_first.csv — CSV file with your predictions for the last 50 observations. This should have the following columns:

   (a) Census Tract
   (b) Prediction
   (c) Lower Prediction CI
   (d) Upper Prediction CI

3. last_first.ipynb or last_first.py — Python file with your complete analysis, including plot generation, statistical tests, and predictions. Only include relevant code and comments and/or narrative blocks to explain the code, just as it would be delivered to a client. I should be able to run it top-to-bottom without errors. It should not be the digital equivalent of scratch paper.

## 4 Grading

This is an individual assignment, and you are expected to do your own work. Do not discuss this project with anyone other than the course instructor. The primary task of this assignment is to write a report detailing how and why you came to a regression model relating median home price to predictor variables. Write a coherent and concise report that flows well and clearly describes your analysis and conclusions. There is no absolute best answer, and I expect to receive many different models.

This final will be worth 80 points. You will be graded on:

1. Writing (10 points): Emphasis on precision, clarity, and efficiency. You should use paragraphs, transitions, and sections, and incorporate any figures and tables into the flow of the document.

2. Executive Summary (10 points): Clear and concise use of language to convey your model in a limited space.

3. Model Building Process, Logic, and Conclusions (40 points): Appropriate use of tools from this course applied correctly and communicated effectively.

4. Predictive Ability (20 points): You will provide point estimates and confidence intervals for the withheld test points. Coverage of true values in those intervals and Mean Square Prediction Error (MSPE) will determine this score.

# Homework 1 Solutions

Dr. Chris Weimer, DASC 512

## Problem 1: Opinion Polls — 1 point

Pollsters regularly conduct opinion polls to determine the approval and disapproval rating of the current President of the United States. Suppose a poll is to be conducted in which 2,000 individuals will be asked whether they approve or disapprove of the President's performance. The individuals will be selected by random-digit telephone dialing and asked the question over the phone by a live pollster. Phone numbers will be dialed until 2,000 responses are gathered.

(a) What is the population of interest?

Typically these surveys are intended to reach all adult U.S. residents. In the run-up to elections, polls are sometimes adjusted to target "likely voters."

(b) What is the variable of interest? Is it quantitative or qualitative?

The variable of interest is approval or disapproval of Presidential performance. It is qualitative.

(c) What is the sample?

The sample is 2,000 individuals who responded to the poll.

(d) What does the pollster wish to infer from the survey?

The pollster wants to estimate the proportion of U.S. residents who approve (approval rating) or disapprove (disapproval rating) of the President's performance. Sometimes they also report the difference between the two (net approval rating).

(e) Do you think that the sample will be representative of the population? Why or why not?

No, it is extremely unlikely that this will be a representative sample. The response rate for telephone surveys has been dropping for years, and Pew Research Center reported that in 2018 it was down to 6% (see here). Couple that with selection biases such as phone ownership, phone sharing, and availability to answer the phone due to work, and you can get a very unrepresentative sample. Pollsters use a variety of models to try to un-bias their results to cope with this.

## Problem 2: New Cancer Screening Method — 1 point

Low-dose computed tomography (LDCT) is being investigated as a screening technique for lung cancer early detection and screening. Results published in 2019 followed a sample of 4,052 long-term smokers, aged 50–69 years, from near Heidelberg, Germany as part of the German Lung Cancer Screening Intervention trial. Each participating smoker was randomly assigned to either receive four annual LDCT screens (2,029 participants) or usual care without screening (2,023 participants). Upon first-time detection of a nodule, the largest nodule diameter was measured as a variable of interest. Ignore the rest of the study for this question.

(a) What type of study was used by the researchers? (Hint: see section 5.4.3 in the book)

This is a designed experiment. Specifically, it is a randomized controlled trial.

(b) What is the population of interest? What is the sample?

The population is long-term smokers, aged 50–69 years, in Germany, although results are probably intended to be generalized beyond that group. The sample is the 4,052 participants in the study from the area of Heidelberg.

(c) What is the experimental unit of the study?

The experimental unit is a smoker. There are 4,052 experimental units in the sample.

(d) Is the variable of interest quantitative or qualitative?

The variable of interest is quantitative.

(e) What inference do you think will be ultimately drawn from the clinical trial?

The nodule size at first detection will be used to represent how early the cancer was identified. If LDCT identifies smaller nodules, it will be considered to be more sensitive than standard care.

## Problem 3: The Meaning of Life — 1 point

In 2021, Pew Research Center conducted a survey of 2,596 adults in the United States (report here). They asked "What aspects of your life do you currently find meaningful, fulfilling, or satisfying?" They did not report raw data, but below is a generated dataset based on reported results.

| Topic | Frequency |
|---|---|
| Family | 716 |
| Friends | 300 |
| Material Well-being | 265 |
| Career | 262 |
| Challenges | 285 |
| Spirituality | 250 |
| Society | 210 |
| Health | 172 |
| Hobbies | 136 |

(a) Compute the Relative Frequencies for each response category.

| Topic | Frequency | Relative Frequency |
|---|---|---|
| Family | 716 | 0.2758 |
| Friends | 300 | 0.1156 |
| Material Well-being | 265 | 0.1021 |
| Career | 262 | 0.1009 |
| Challenges | 285 | 0.1098 |
| Spirituality | 250 | 0.0963 |
| Society | 210 | 0.0809 |
| Health | 172 | 0.0663 |
| Hobbies | 136 | 0.0524 |

(b) Construct a bar graph of the Relative Frequencies.
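The relative frequencies above can be reproduced in a couple of lines of pandas, typing in the frequencies from the problem statement:

```python
import pandas as pd

# Frequencies from the generated Pew-based dataset in the problem
freq = pd.Series({
    "Family": 716, "Friends": 300, "Material Well-being": 265, "Career": 262,
    "Challenges": 285, "Spirituality": 250, "Society": 210, "Health": 172,
    "Hobbies": 136}, name="Frequency")

rel_freq = (freq / freq.sum()).round(4)  # relative frequency = count / total
# rel_freq.plot(kind="bar") then draws the chart asked for in part (b)
```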


(c) Interpret the data in a paragraph (2 or more sentences).

The most commonly cited aspect of life that Americans find meaningful, fulfilling, or satisfying was family, with more than one in four Americans providing that response. About one in eight responded that either friends or challenges was meaningful, while one in ten cited material well-being, career, or spirituality. The least common response at about 5% was hobbies.

## Problem 4: Board Game Weights — 1 point

The file bgg.csv on Canvas contains a database of every board game on the popular site "Board Game Geek." These data can be used to answer the following questions.

(a) The column averageweight gives the average user assessment of the weight (i.e., complexity) of each game. Games with a value of 0 have not been rated and should not be included in any of the following analysis.

Create a table of summary statistics for the average weight of games. This table should include the Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum, Sample Variance, and Sample Standard Deviation (note the use of the word 'Sample' even though this is arguably a census).

The describe function gives us all of the information except the Variance. I recommend that when you use a function like describe, you confirm that it uses the correct standard deviation form by trying np.std(data, ddof=1) and confirming it is the same value.

| Statistic | Value |
|---|---|
| Minimum | 1.00 |
| 1st Quartile | 1.34 |
| Median | 2.00 |
| 3rd Quartile | 2.57 |
| Maximum | 5.00 |
| Mean | 2.04 |
| Variance | 0.65 |
| Standard Deviation | 0.81 |
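The check recommended above can be written directly. The series here is a small stand-in, since bgg.csv itself is not reproduced in this worksheet:

```python
import numpy as np
import pandas as pd

# Stand-in for the averageweight column of bgg.csv
weights = pd.Series([0.0, 1.0, 1.3, 2.0, 2.6, 3.4, 5.0, 2.1, 1.8])
weights = weights[weights > 0]  # games with weight 0 are unrated: drop them

summary = weights.describe()    # count, mean, std, min, quartiles, max
variance = weights.var(ddof=1)  # sample variance (n - 1 denominator)

# describe() reports the sample standard deviation, matching np.std with ddof=1
assert np.isclose(summary["std"], np.std(weights, ddof=1))
```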

(b) Create Box Plots for the average weight of games by whether it is ranked as a Family Game or not. If Family Game Rank is blank (coded as NaN), then it is not ranked as a Family Game. I recommend adding a new column using the function np.isnan.

Are family games more or less complex? What can you say about the relative complexity of family games and non-family games?

The median family game is lighter than the median non-family game. In fact, the 3rd Quartile weight of family games is equal to the median weight of non-family games. Furthermore, non-family games' weight varies across the entire spectrum of possible scores (1–5) while family games have weights no higher than approximately 3.
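A sketch of the flag-and-boxplot step. The bgg.csv column names here are assumptions and the rows are a stand-in; the Agg backend just avoids needing a display:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in rows; in bgg.csv the family-game rank is NaN for unranked games
df = pd.DataFrame({
    "averageweight":  [1.2, 1.8, 2.5, 3.9, 1.5, 2.2, 4.4, 1.1],
    "familygamerank": [3, np.nan, 7, np.nan, 12, np.nan, np.nan, 25],
})
df["is_family"] = ~np.isnan(df["familygamerank"])  # True = ranked as a Family Game

fig, ax = plt.subplots()
df.boxplot(column="averageweight", by="is_family", ax=ax)
ax.set_xlabel("Ranked as Family Game")
ax.set_ylabel("Average weight")
```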


(c) Create a scatterplot of weight to average rating (average). What can you say about this relationship?

BoardGameGeek users tend to rate heavier games more highly than lighter games.

## Problem 5: Baseball Hall-of-Famers — 2 points

Baseball Hall-of-Famers (HoF) played during different eras of baseball. One common classification of eras is '19th Century' (up to the 1900 season), 'Dead Ball' (1901–1919), 'Lively Ball' (1920–1941), 'Integration' (1942–1960), 'Expansion' (1961–1976), 'Free Agency' (1977–1993), and 'Long Ball' (after 1993). For this exercise, define the era of a player based on the mid-point of their career (rounding up if necessary).

Using the file hofbatting.csv, containing non-pitching HoFs as of 2013, classify each player according to their era to answer the following questions.

(a) Create a Bar Graph and a Pie Chart for the number of HoFs from each era as of 2013. Interpret the data.

You will need to take multiple steps to solve this problem.

- Import the CSV.
- Create a column of data to define the mid-career.
- Find a way to count the number of HoFs in each era by the mid-career column.
- Create the graphs.
- Provide a written interpretation of the data.
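The mid-career and era steps might look like this in pandas. The From/To column names are assumptions about hofbatting.csv, and the rows are a toy stand-in:

```python
import pandas as pd

def era(mid_career: int) -> str:
    """Map a mid-career year to its era, using the cut-points in the problem."""
    cutoffs = [(1900, "19th Century"), (1919, "Dead Ball"), (1941, "Lively Ball"),
               (1960, "Integration"), (1976, "Expansion"), (1993, "Free Agency")]
    for last_year, name in cutoffs:
        if mid_career <= last_year:
            return name
    return "Long Ball"

# Toy rows; the real analysis reads hofbatting.csv
hof = pd.DataFrame({"From": [1914, 1936, 1995], "To": [1935, 1960, 2012]})
hof["MidCareer"] = (hof["From"] + hof["To"] + 1) // 2  # midpoint, rounding .5 up
hof["Era"] = hof["MidCareer"].apply(era)
counts = hof["Era"].value_counts()  # feeds .plot(kind="bar") or .plot(kind="pie")
```

The integer expression `(From + To + 1) // 2` rounds half-years up, matching the "rounding up if necessary" instruction.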


The Lively Ball era produced more HoFs than any other era. Long Ball has so far produced the fewest HoFs (as of 2013), while every other era has roughly equal numbers of HoFs. This can be interpreted in many ways. Long Ball era players are still young and may not have retired and been inducted into the Hall of Fame. The Lively Ball era meanwhile corresponds with the period between World Wars in which baseball was immensely popular.

(b) Create a histogram showing the distribution of non-pitching HoFs' Mid-Career year.

This shows a similar trend to the graphs by era. In the period between World Wars, there were more HoFs active. The wars clearly interrupted baseball and removed many eligible athletic men from the pool of possible players. There are likely many other explanations.

(c) There are two major dimensions to hitting: the ability to get on base (measured by the on-base percentage OBP) and the ability to advance runners already on base (measured by the slugging percentage SLG). Create a scatterplot of OBP vs. SLG. Are there any outliers? If so, identify them by name. Is there a relationship between OBP and SLG?

there a relationship between OBP and SLG?

There is one clear outlier: Willard Brown. If you Google him, you'll find that he was inducted in the Hall of Fame largely based on his performance in the Negro Leagues, which is not reflected in his official Hall of Fame statistics. There is a clear positive relationship between OBP and SLG — HoFs with higher OBP tend to have higher SLG as well.

(d) Consider a combined metric for hitting, the On-base Plus Slugging (OPS) statistic, which is the sum of OBP and SLG. Normalize this data (i.e., calculate the z-scores), then create a scatterplot with OPS on the y-axis and Mid-Career Year on the x-axis. Identify any outliers by name. Do you notice any patterns in the scatterplot of the data? What can you say (if anything) about the cause of any pattern?


There is a slight bump in standardized OPS during the Lively Ball era (it's small, and it's ok if you didn't see it). It could be explained by the popularity of the sport driving increased performance, or maybe just the competition of playing against Babe Ruth increased performance across the board.

As defined by a standardized value of magnitude greater than 3, there are 3 outliers: Willard Brown, Babe Ruth, and Ted Williams. Willard Brown has already been discussed, and Babe Ruth and Ted Williams are some of the most famous ballplayers in history.

(e) Create a Box Plot for the Home-Run Rate (HRR), defined as home-runs per at-bat (HR/AB), of HoFs during each era (i.e., you should have 7 box-plots). Also calculate descriptive statistics of HRR including Min, Q1, Median, Q3, Max, Mean, Range, and Sample Standard Deviation for each era. Provide a table of these values from the Expansion era only (to limit time spent copying and pasting from Python).

| Era | Min | Q1 | Med | Q3 | Max | Mean | Range | Std Dev |
|---|---|---|---|---|---|---|---|---|
| Expansion | 0.0081 | 0.0254 | 0.0420 | 0.0586 | 0.0703 | 0.0410 | 0.0622 | 0.0189 |
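A per-era table like the one above can be assembled with a single groupby. The HR and AB column names are assumptions about hofbatting.csv, and the rows below are a toy stand-in:

```python
import pandas as pd

# Toy rows; the real table comes from hofbatting.csv
hof = pd.DataFrame({
    "Era": ["Expansion", "Expansion", "Lively Ball", "Lively Ball"],
    "HR": [300, 50, 714, 100],
    "AB": [8000, 6000, 8399, 7000],
})
hof["HRR"] = hof["HR"] / hof["AB"]  # home-run rate: home runs per at-bat

# Per-era descriptive statistics; Range is not built in, so add it by hand
summary = hof.groupby("Era")["HRR"].agg(["min", "median", "max", "mean", "std"])
summary["range"] = summary["max"] - summary["min"]
```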

## Problem 6: Intrusion Detection System — 1 point

A computer intrusion detection system (IDS) is designed to provide an alarm whenever an intrusion (e.g., unauthorized access) into a computer system is being attempted. A probabilistic evaluation of a system with two independent operating intrusion detection systems (a double IDS) was published in the Journal of Research of the National Institute of Standards and Technology (Nov/Dec 2003).

Consider a double IDS with System A and System B. If there is an intruder, System A sounds an alarm with probability 0.9, and System B sounds an alarm with probability 0.95. If there is no intruder, System A sounds an alarm with probability 0.2, and System B sounds an alarm with probability 0.1. Assume that Systems A and B operate independently.

(a) Formally express the four probabilities given in the example including defining events.

Define the events:

- A: the event that System A sounds an alarm
- B: the event that System B sounds an alarm
- I: the event that there is an intruder

$$P(A \mid I) = 0.9, \qquad P(B \mid I) = 0.95, \qquad P(A \mid I^c) = 0.2, \qquad P(B \mid I^c) = 0.1$$

(b) If there is an intruder, what is the probability that both systems sound an alarm?

Because A and B operate independently,

$$P(A \cap B \mid I) = P(A \mid I)\,P(B \mid I) = 0.9 \times 0.95 = 0.855$$

(c) If there is no intruder, what is the probability that both systems sound an alarm?

Again because A and B operate independently,

$$P(A \cap B \mid I^c) = P(A \mid I^c)\,P(B \mid I^c) = 0.2 \times 0.1 = 0.02$$

(d) Given an intruder, what is the probability that at least one of the systems sounds an alarm?

Using the Additive Rule,

$$P(A \cup B \mid I) = P(A \mid I) + P(B \mid I) - P(A \cap B \mid I) = 0.9 + 0.95 - 0.855 = 0.995$$


(e) Assume that the probability of an intruder is 0.4. Also continue to assume that both systems operate independently. If both systems sound an alarm, what is the probability that an intruder is detected?

We now have another probability, $P(I) = 0.4$. Applying Bayes's Rule and the independence of events,

$$
\begin{aligned}
P(I \mid A \cap B) &= \frac{P(I \cap A \cap B)}{P(A \cap B)} \\
&= \frac{P(A \cap B \mid I)\,P(I)}{P(A \cap B \cap I) + P(A \cap B \cap I^c)} \\
&= \frac{P(A \mid I)\,P(B \mid I)\,P(I)}{P(A \mid I)\,P(B \mid I)\,P(I) + P(A \mid I^c)\,P(B \mid I^c)\,P(I^c)} \\
&= \frac{(0.9)(0.95)(0.4)}{(0.9)(0.95)(0.4) + (0.2)(0.1)(1 - 0.4)} \approx 0.966
\end{aligned}
$$
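The arithmetic in the double-IDS posterior is easy to verify numerically:

```python
# Numeric check of the double-IDS posterior computed above
p_a_i, p_b_i = 0.9, 0.95  # P(A|I), P(B|I)
p_a_n, p_b_n = 0.2, 0.1   # P(A|I^c), P(B|I^c)
p_i = 0.4                 # P(I)

numerator = p_a_i * p_b_i * p_i
denominator = numerator + p_a_n * p_b_n * (1 - p_i)
posterior = numerator / denominator  # P(I | A and B)
print(round(posterior, 3))  # 0.966
```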

## Problem 7: Lie Detector Test — 1 point

A new type of lie detector, called the Computerized Voice Stress Analyzer (CVSA), has been developed. The manufacturer claims that the CVSA is 98% accurate and, unlike a polygraph machine, will not be thrown off by drugs and medical factors. However, laboratory studies by the DoD found that the CVSA had an accuracy rate of 49.8% — slightly less than pure chance. Suppose the CVSA is used to test the veracity of four suspects. Assume the suspects' responses are independent.

(a) If the manufacturer's claim is true, what is the probability that the CVSA will correctly determine the veracity of all four suspects?

Let $A_i$ be the event that the CVSA correctly determines the veracity of suspect $i$. Each suspect's response is independent, so

$$P(A_1 \cap A_2 \cap A_3 \cap A_4) = \prod_{i=1}^{4} P(A_i) = (0.98)^4 \approx 0.9224$$

Note that the pi operator is to the product what the sigma operator is to the sum.

(b) If the manufacturer's claim is true, what is the probability that the CVSA will yield an incorrect result for at least one of the four suspects?

This is the complement of the answer to part (a).

$$P\big((A_1 \cap A_2 \cap A_3 \cap A_4)^c\big) = 1 - P(A_1 \cap A_2 \cap A_3 \cap A_4) \approx 0.0776$$

(c) Suppose that in a laboratory experiment conducted by the DoD on four suspects, the CVSA yielded incorrect results for two of the suspects. Use this result to make an inference about the true accuracy rate of the new lie detector.

This result is equivalent to flipping a coin. Of all possible outcomes, there are $\binom{4}{2} = 6$ ways to choose two suspects for whom results were inaccurate. Each of these outcomes has $A_i^c$ occurring twice and $A_i$ occurring twice, so each instance has probability $(0.02)^2 \times (0.98)^2$. The probability of this happening if the manufacturer's claim were true is then

$$\binom{4}{2} \times (0.02)^2 \times (0.98)^2 = 6 \times 0.00038416 \approx 0.0023.$$

It is extremely unlikely — about a 1 in 400 chance — that the DoD outcome would occur if the manufacturer's claims were accurate.
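If scipy is available, the same number falls out of the binomial pmf, with X counting incorrect results:

```python
from math import comb

from scipy import stats

# By hand: choose which 2 of 4 are wrong, times the probability of each outcome
p_manual = comb(4, 2) * (0.02 ** 2) * (0.98 ** 2)

# Same value as a binomial pmf: X ~ Binom(n=4, p=0.02) counts incorrect results
p_binom = stats.binom.pmf(k=2, n=4, p=0.02)
```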

## Problem 8: Auditing an Accounting System — 1 point

In auditing a firm's financial statements, an auditor will (1) assess the capability of the firm's accounting system to accumulate, measure, and synthesize transactional data properly and (2) assess the operational effectiveness of the accounting system. In performing the second assessment, the auditor frequently relies on a random sample of actual transactions (Stickney and Weil, Financial Accounting: An Introduction to Concepts, Methods, and Uses, 2002). A particular firm has 5,382 customer accounts that are numbered from 0001 to 5382.

(a) One account is to be selected at random for audit. What is the probability that account number 3,241 is selected?

By the definition of random selection, each account number is equally likely to be selected, so the probability is $\frac{1}{5382}$.

(b) Draw a random sample of 10 accounts, and explain in detail the procedure you used. (Hint: Python can do this)

There are many ways to do this. As seen in my Python script, I used Numpy: np.random.randint(low=1, high=5383, size=10) (the high endpoint is exclusive, so it must be 5383 for account 5382 to be selectable). Results will vary, but I got {3401, 1929, 2362, 3980, 4311, 398, 2271, 1496, 1327, 3234}.
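One refinement worth noting: `randint` samples with replacement, so the same account could appear twice. For an audit sample of 10 distinct accounts, `choice` with `replace=False` is a cleaner sketch:

```python
import numpy as np

# Draw 10 distinct account numbers from 0001-5382; randint can return duplicates,
# while choice with replace=False samples without replacement.
rng = np.random.default_rng(seed=512)  # seed only so the draw is reproducible
sample = rng.choice(np.arange(1, 5383), size=10, replace=False)
```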

(c) Referring to part b, is one sample of size 10 more likely to be chosen than any other? What is the probability that the sample you drew in part b was selected?

No, every sample of size 10 is equally likely because they are randomly selected. There are $\binom{5382}{10}$ possible samples of size 10, so the probability that any given sample is selected is

$$\frac{1}{\binom{5382}{10}} \approx \frac{1}{5.57 \times 10^{30}}.$$

Whatever sample you got, it was an extremely unlikely outcome!

## Problem 9: Fish Contamination — 1 point

A U.S. Army Corps of Engineers (USACE) study focused on DDT contamination of fish in the Tennessee River in Alabama. Part of that investigation studied how far upstream contaminated fish have migrated. A fish is considered to be contaminated if its measured DDT concentration is greater than 5.0 parts per million (ppm).

(a) Considering only contaminated fish captured from the Tennessee River, the data reveal that 52% of the fish are found 275–300 miles upstream, 39% are found 305–325 miles upstream, and 9% are found 330–350 miles upstream. Use the percentages to estimate the probabilities P(275–300), P(305–325), and P(330–350).

The best estimate of probability is the proportion observed.

$$P(275\text{–}300) = 0.52, \qquad P(305\text{–}325) = 0.39, \qquad P(330\text{–}350) = 0.09$$

(b) Given that a contaminated fish is found a certain distance upstream, the probability that it is a channel catfish (CC) is determined from the data as P(CC|275–300) = 0.775, P(CC|305–325) = 0.77, and P(CC|330–350) = 0.86. If a contaminated channel catfish is captured from the Tennessee River, what is the probability that it was captured 275–300 miles upstream?

Using Bayes's Rule,

$$
\begin{aligned}
P(275\text{–}300 \mid CC) &= \frac{P(CC \mid 275\text{–}300)\,P(275\text{–}300)}{P(CC)} \\
&= \frac{P(CC \mid 275\text{–}300)\,P(275\text{–}300)}{P(CC \mid 275\text{–}300)\,P(275\text{–}300) + P(CC \mid 305\text{–}325)\,P(305\text{–}325) + P(CC \mid 330\text{–}350)\,P(330\text{–}350)} \\
&= \frac{(0.775)(0.52)}{(0.775)(0.52) + (0.77)(0.39) + (0.86)(0.09)} \approx 0.5162
\end{aligned}
$$

# Homework 2 Solutions

DASC 512

## Problem 1: Get Out of Jail Free — 2 points

In the recent Monopoly Gamer version of the classic board game Monopoly, players can either pay the fine to be released from Jail or roll a die and be released for free upon rolling a 6. In the actual rules, the player is released automatically after 3 turns, but let's assume a house rule in which you can attempt the roll infinitely. Let X be the number of rolls of a 6-sided die until rolling the first 6. This has the known probability distribution

$$P(X = x) = p(x) = \left(\frac{5}{6}\right)^{x-1}\left(\frac{1}{6}\right)$$

(a) Find p(1) and interpret the result.

$$p(1) = \left(\frac{5}{6}\right)^{0} \times \frac{1}{6} = \frac{1}{6}$$

The probability of being released for free on the first attempt is the probability of rolling a 6 in a single attempt, $\frac{1}{6}$.

(b) Find p(5) and interpret the result.

$$p(5) = \left(\frac{5}{6}\right)^{4} \times \frac{1}{6} = \frac{5^4}{6^5} = \frac{625}{7776} \approx 0.0804$$

About 1 in 12 stays in Jail will be ended for free on the 5th attempt, assuming the player never pays the fine.

(c) Find P(X ≥ 2) and interpret the result.

$$P(X \ge 2) = 1 - P(X = 1) = 1 - \frac{1}{6} = \frac{5}{6}$$

Five of every six players attempting the roll will still be in Jail after their first attempt.

(d) If we played by the original rules where the player was released for free without rolling at the start of their third turn, what would be the probability of this outcome?

This outcome is P(X ≥ 3).

$$P(X \ge 3) = 1 - P(X = 1) - P(X = 2) = 1 - \frac{1}{6} - \frac{5}{36} = 1 - \frac{11}{36} = \frac{25}{36}$$

Just over two-thirds of attempts to escape jail for free before the 3rd turn will be unsuccessful, resulting in two wasted turns.
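These hand calculations match `scipy.stats.geom`, which implements exactly this distribution (number of trials up to and including the first success):

```python
from scipy import stats

p = 1 / 6  # probability of rolling a 6 on any attempt

# pmf/sf checks against the hand calculations above
p1 = stats.geom.pmf(k=1, p=p)      # part (a): 1/6
p5 = stats.geom.pmf(k=5, p=p)      # part (b): 625/7776
p_ge_3 = stats.geom.sf(k=2, p=p)   # part (d): P(X >= 3) = P(X > 2) = 25/36
```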


(e) In the original Monopoly, players rolled two dice and escaped Jail on doubles (i.e., when rolling the same number on both dice). How would this affect the probability distribution? Explain your logic.

It would not affect the probability distribution at all. There are 6 ways to roll doubles with two dice: (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6). There are 36 total ways to roll two dice. So the probability of rolling a success remains $\frac{6}{36} = \frac{1}{6}$.

## Problem 2: Six Sigma Processes — 2 points

The "Six Sigma" quality control process dictates that manufacturing processes should be controlled such that all products within 3 standard deviations of the mean on some measure are within allowable deviations. However, this can be expensive, and many companies specify different quality control standards depending on the cost of failure.

Consider two companies that produce computer monitors sold on an online marketplace similar to Amazon. Company A adheres to six-sigma for their monitors, so only 0.3% of monitors do not function correctly. Company B sells fewer monitors and adheres to a four-sigma process, so 5% of monitors do not function correctly.

(a) Let X be the number of non-functioning monitors in the first production run of 1,000 monitors from company A. Let Y be the number of non-functioning monitors in the first production run of 50 monitors from company B. Define the distribution and pmf for both X and Y assuming that each monitor's quality is independent.

These are both binomial random variables.

$$X \sim \mathrm{Binom}(p = 0.003,\, n = 1000), \qquad f_X(x) = \binom{1000}{x}\, 0.003^x\, 0.997^{1000-x}$$

$$Y \sim \mathrm{Binom}(p = 0.05,\, n = 50), \qquad f_Y(y) = \binom{50}{y}\, 0.05^y\, 0.95^{50-y}$$

(b) What is the mean number of non-functioning monitors from each company's first batch? What is the variance?

$$\mu_X = np = 1000(0.003) = 3, \qquad \sigma_X^2 = np(1-p) = 1000(0.003)(0.997) = 2.991$$

$$\mu_Y = np = 50(0.05) = 2.5, \qquad \sigma_Y^2 = np(1-p) = 50(0.05)(0.95) = 2.375$$

(c) What is the probability that each company’s first run will be perfect (i.e., have no non-functioning

monitors)?

fX (0) = C(1000, 0) · 0.003^0 · 0.997^1000 = 0.997^1000 ≈ 0.0496

fY (0) = C(50, 0) · 0.05^0 · 0.95^50 = 0.95^50 ≈ 0.0769
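The same probabilities follow directly from the binomial pmf; a sketch with scipy:

```python
from scipy.stats import binom

# P(X = 0) = 0.997**1000 and P(Y = 0) = 0.95**50
p_perfect_a = binom.pmf(0, n=1000, p=0.003)
p_perfect_b = binom.pmf(0, n=50, p=0.05)

print(round(float(p_perfect_a), 4))  # 0.0496
print(round(float(p_perfect_b), 4))  # 0.0769
```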

There is about a 5% chance that Company A will have a perfect first run. There is about a 7.7% chance

that Company B will have a perfect (albeit much smaller) first run.


(d) Plot the pmf for both distributions with some reasonable x- and y-limits and compare them. What

can you say in general about the relative number of failed monitors each company can expect to ship

in their first batches?

Company B is more likely to have 0, 1, or 2 defects than Company A. Company A is more likely to have

3 or more defects. In general, Company A will likely have more defective monitors than Company B in their

first batch (as a total, clearly not as a proportion).
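One way to produce the comparison plot; the x-limit of 10 is a judgment call (both pmfs are negligible beyond that):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
from scipy.stats import binom

k = np.arange(0, 11)
pmf_a = binom.pmf(k, n=1000, p=0.003)
pmf_b = binom.pmf(k, n=50, p=0.05)

fig, ax = plt.subplots()
ax.bar(k - 0.2, pmf_a, width=0.4, label="Company A: Binom(1000, 0.003)")
ax.bar(k + 0.2, pmf_b, width=0.4, label="Company B: Binom(50, 0.05)")
ax.set_xlabel("Number of non-functioning monitors")
ax.set_ylabel("Probability")
ax.legend()
fig.savefig("monitor_pmfs.png")
```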


Problem 3

Emergency Room arrivals — 2 points

Each day a hospital records the number of people who come to the emergency room for treatment.

(a) Assume that people arrive at a constant rate each day — that is, that they arrive according to a

Poisson distribution — with an average of 25 patients arriving per day. What is the probability that

less than 20 patients arrive today?

We are given that the distribution of arrivals is X ∼ Poisson(λ = 25). We can then use the cdf in Python.

P (X > 35) ≈ 0.02246

stats.poisson.sf(k=35, mu=25)

Multiplying this by 365 days in a year (365.24 if you want to be really specific to a solar year) gives us the

expected number of days that this will occur.

P (X > 35) × 365 ≈ 8.2

So we expect the ER to be overwhelmed about 8–9 times per year.
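A sketch of the full calculation in Python:

```python
from scipy import stats

# P(X > 35) for X ~ Poisson(25): survival function evaluated at k = 35
p_overwhelmed = stats.poisson.sf(k=35, mu=25)
days_per_year = p_overwhelmed * 365

print(round(float(p_overwhelmed), 5))  # ≈ 0.022
print(round(float(days_per_year), 1))  # ≈ 8.2 days per year
```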

(c) In a particular week, the arrivals to the ER are:

Day:      Sun  Mon  Tue  Wed  Thu  Fri  Sat
Arrivals:  10    8   14    7   21   44   60

Do you think that the Poisson distribution might describe the random variability in arrivals adequately? Why or why not?

No. The daily average number of arrivals was 23.4. If we assume that the number of daily arrivals is

Poisson distributed with λ = 23.4, both the Friday and Saturday arrivals of 44 and 60 would be exceedingly rare events. It is far more likely

that Fridays and Saturdays are busier than the rest of the week. Realistically, you can expect to see weekly

and seasonal trends.
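To quantify "exceedingly rare," a sketch of the upper-tail probabilities under Poisson(λ = 23.4):

```python
from scipy import stats

lam = 164 / 7  # 164 weekly arrivals over 7 days ≈ 23.4 per day

# sf(k) gives P(X > k), so use k - 1 for P(X >= k)
p_friday = stats.poisson.sf(k=43, mu=lam)    # P(X >= 44)
p_saturday = stats.poisson.sf(k=59, mu=lam)  # P(X >= 60)

print(float(p_friday))    # well under 0.001
print(float(p_saturday))  # smaller still
```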


(d) Building upon your answer to part c, would you expect the Poisson distribution to better describe,

or more poorly describe, the number of weekly admissions to the ER? Why?

It would likely describe weekly admissions far better, because that would smooth out the fluctuations by

day of the week by including a Friday and a Saturday in each data point.


Problem 4

Normal Location Families — 2 points

Lake Wobegon Junior College admits students only if they score above 400 on a standardized achievement test. Applicants from Group A have a mean of 500 and a standard deviation of 100 on this test,

and applicants from Group B have a mean of 450 and a standard deviation of 100. Both distributions

are approximately normal, and both groups have the same size.

(a) Find the proportion not admitted for each group.

For each group, we are looking for the cdf at x = 400. We have two distributions:

A ∼ N (µ = 500, σ = 100)

B ∼ N (µ = 450, σ = 100)

So the proportion not admitted for each group is

P (A < 400) = Φ((400 − 500)/100) = Φ(−1) ≈ 0.1587

P (B < 400) = Φ((400 − 450)/100) = Φ(−0.5) ≈ 0.3085
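A sketch of the same computation with scipy's normal cdf:

```python
from scipy.stats import norm

# Proportion of each group scoring below the 400 cutoff
p_a = norm.cdf(400, loc=500, scale=100)  # Φ(-1)
p_b = norm.cdf(400, loc=450, scale=100)  # Φ(-0.5)

print(round(float(p_a), 4))  # 0.1587
print(round(float(p_b), 4))  # 0.3085
```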

This is a one-sided alternative hypothesis because it specifies a range and a direction.

(d) The median family income is the same in Colorado Springs as in Duluth.

Let θC be the median income in Colorado Springs and θD be the median income in Duluth.

H0 : θ C = θ D

This is a null hypothesis because it specifies a single value for the parameter.

(e) The variance in resting heart rates is lower for collegiate rowers than for collegiate volleyball

players.

Let σ²R be the variance for rowers and σ²V be the variance for volleyball players.

Ha : σ²R < σ²V

Ha : p > 0.5

Type of test: Binomial exact test (Could use z-test)

Significance: α = 0.05. The sample size is large enough that any practical difference should be visible

with high confidence.

Test statistic: p̂ = 0.52

P value: p = 0.0874

Conclusion: There is insufficient evidence to reject the null hypothesis; we cannot conclude that either group represents a majority.

Problem 4

Who’s a good dog? — 1 point

In a study to determine whether dogs prefer petting or vocal praise, researchers randomly placed

14 dogs into two groups of 7 each. In group 1, the owner would pet the dog. In group 2, the owner

would provide vocal praise. Researchers measured the time, in seconds, that the dog interacted with its

owner.

(a) Owners in group 1 got 114, 203, 217, 254, 256, 284, and 296 seconds of interaction. Owners in

group 2 got 4, 7, 24, 25, 48, 71, and 294 seconds of interaction. The low outlier value in group 1 and the

high outlier value in group 2 indicate the distributions may be highly skewed, so perform a hypothesis

test on the median.

BLUF: This data shows that dogs prefer petting over vocal praise, with dogs being petted interacting

with owners longer than those being praised.

Let θ1 and θ2 be the medians of the two groups.

Hypotheses: H0 : θ1 = θ2 Ha : θ1 ̸= θ2

Type of test: Mann-Whitney U test

Significance: Not specified. Given the small samples and low cost of a type I error, I’d go with a higher

value like α = 0.1.

Test statistic: U = 43

P value: p = 0.0215

Conclusion: We reject the null hypothesis and find that these dogs prefer petting over vocal praise.
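A sketch of the test in Python; scipy reports the U statistic for the first sample:

```python
from scipy.stats import mannwhitneyu

petting = [114, 203, 217, 254, 256, 284, 296]  # group 1
praise = [4, 7, 24, 25, 48, 71, 294]           # group 2

res = mannwhitneyu(petting, praise, alternative="two-sided")
print(res.statistic)         # 43.0
print(round(res.pvalue, 4))
```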


Homework 3 Solutions

DASC 512

(b) Suppose that the measurements in part a were gathered for the same owner/dog pair on separate

days. Perform an appropriate hypothesis test to determine if there was a difference between groups.

The differences between groups still exhibit significant left-skew. Furthermore, the Shapiro-Wilk test indicates non-normality (p = 0.0212).

While a paired t-test is an acceptable choice, a paired Wilcoxon signed-rank test is probably a

better choice and more comparable to part a. However, we didn't explicitly go over that as an option in the

lessons. Philosophically, a paired test can be executed for any one-sample test method by simply testing the

difference. Notably, Python implements the Wilcoxon test as a paired test.

BLUF: This data gives strong evidence that dogs prefer petting over vocal praise, with dogs being petted

interacting with owners longer than those being praised.

Hypotheses: H0 : θ1 = θ2 Ha : θ1 ̸= θ2

Type of test: Paired Wilcoxon signed-rank test

Significance: Not specified. Given the small samples and low cost of a type I error, I’d go with a higher

value like α = 0.1.

Test statistic: W = 0

P value: p = 0.0156

Conclusion: We reject the null hypothesis and find that these dogs prefer petting over vocal praise.

If the paired t-test had been performed, the results would have been similar with the following differences:

Hypotheses: H0 : µ1 = µ2 Ha : µ1 ̸= µ2

Type of test: Paired t-test

Test statistic: t = 5.36

P value: p = 0.0017
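Both paired tests can be reproduced with scipy (a sketch; scipy's wilcoxon is the paired signed-rank test discussed above):

```python
from scipy.stats import ttest_rel, wilcoxon

petting = [114, 203, 217, 254, 256, 284, 296]
praise = [4, 7, 24, 25, 48, 71, 294]

# Paired Wilcoxon signed-rank test on the owner/dog pair differences
w = wilcoxon(petting, praise)
print(w.statistic, round(w.pvalue, 4))  # 0.0 0.0156 (all differences positive)

# Paired t-test for comparison
t = ttest_rel(petting, praise)
print(round(t.statistic, 2))  # 5.36
```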


Problem 5

Three’s a crowd — 1 point

A recent General Social Survey (GSS) asked the question, “What do you think is the ideal number

of children to have?” The mean value of 1302 responses was 2.49 with a standard deviation of 0.85. Do

Americans on average think that the ideal number of children is more than 2?

BLUF: This survey gives very strong evidence that Americans on average think that the ideal number of

children is more than 2.

With such a high sample size, we’ll have no problem using a one-sample t-test or a z-test. Let µ be the

average response of all Americans.

Hypotheses: H0 : µ = 2 Ha : µ > 2

Type of test: One-sample t-test.

Significance: Not specified, but with such a large sample we can use α = 0.05.

Test statistic: t = (2.49 − 2)/(0.85/√1302) = 20.8

Critical value: t∗ = 1.65

P value: p = 1.7 × 10−83

Conclusion: We reject the null hypothesis and find that Americans on average think that the ideal number

of children is more than 2.
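Only summary statistics are given, so the test can be computed by hand; a sketch:

```python
import math
from scipy.stats import t as t_dist

n, xbar, s, mu0 = 1302, 2.49, 0.85, 2

# One-sample t statistic from summary statistics
t_stat = (xbar - mu0) / (s / math.sqrt(n))
# One-sided p-value for Ha: µ > 2
p_value = t_dist.sf(t_stat, df=n - 1)

print(round(t_stat, 1))  # 20.8
print(p_value)           # vanishingly small
```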


Problem 6

You batter bell-lieve it — 2 points

Use the file BattingAverages.csv, containing batting averages for all players with at least 100 at

bats for the 2009 season, for the following questions. Assume this is a random sample rather than a

census.

(a) Are the batting averages data (BattingAvg) normally distributed? Use both graphical and analytical methods to make your argument.

BLUF: The batting averages data appear to be normally distributed.

Let X be the batting averages data.

Hypotheses: H0 : X ∼ Norm(x̄, s²)  Ha : X ̸∼ Norm(x̄, s²)

Type of test: Visual and Lilliefors (you may have chosen another test)

Significance: Not specified, but I would want very strong evidence of non-normality to possibly push me

to a non-parametric test. I’ll use α = 0.01.

Test statistic: T = 0.0232

P value: p = 0.8471

Conclusion: We fail to reject the null hypothesis that the data is normally distributed. Visual assessment

with a Q-Q plot confirms near-normality for this data.
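The Lilliefors test is available in statsmodels. BattingAverages.csv is not reproduced here, so the sketch below runs on simulated stand-in data; substitute the actual BattingAvg column:

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

# Placeholder sample; replace with the BattingAvg column from BattingAverages.csv
rng = np.random.default_rng(512)
batting_avg = rng.normal(loc=0.261, scale=0.03, size=300)

# Lilliefors test of H0: data are normal with unknown mean and variance
stat, p_value = lilliefors(batting_avg, dist="norm")
print(stat, p_value)
```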


(b) Is the mean value of batting averages at least .265? Perform a test to find out. Use α = 0.05 for

your test.

BLUF: There is insufficient evidence to conclude that the mean batting average in the 2009 season was

greater than .265. In fact, the observed average was less than .265 (.261).

Since we are comparing means and the data is approximately normal with a large sample size, we can

use either the t-test or the z-test. I’ll use the t-test.

Hypotheses: H0 : µ = 0.265 Ha : µ > 0.265

Type of test: One-sample t-test.

Significance: α = 0.05 as specified.

Test statistic: t = −2.37

Critical value: t∗ = 1.65

P value: p = 0.9910

Conclusion: We fail to reject the null hypothesis that the mean batting average is .265.

(c) Was there a difference between batting averages in the National League and American League

(column League)? Use α = 0.05 for your test.

BLUF: There is insufficient evidence to conclude that the National League average batting average was different from the American League average batting average in the 2009 season.

Since we are comparing means and the data is approximately normal with a large sample size, we can

use either the two-sample t-test or the two-sample z-test. I’ll use the t-test.

Hypotheses: H0 : µN − µA = 0 Ha : µN − µA ̸= 0

Type of test: Two-sample t-test with pooled variance. The sample variance for the NL is 0.0011 and the

sample variance for the AL is 0.0012.

Significance: α = 0.05 as specified.

Test statistic: t = −0.0985

Critical value: t∗ = ±1.97

P value: p = 0.9216

Conclusion: We fail to reject the null hypothesis that the mean batting average is the same between the

leagues.


Problem 7

Cowbell usage must have increased… — 2 points

A researcher studying true body temperature in adult humans collected the data in BodyTemp.csv

in degrees Fahrenheit.

(a) Is the body temperature normally distributed? Use graphical and analytical methods to make your

argument.

BLUF: The body temperature data appear to be normally distributed.

Let X be the body temperature observations.

Hypotheses: H0 : X ∼ Norm(x̄, s²)  Ha : X ̸∼ Norm(x̄, s²)

Type of test: Visual and Lilliefors (you may have chosen another test)

Significance: Not specified, but I would want very strong evidence of non-normality to possibly push me

to a non-parametric test. I’ll use α = 0.01.

Test statistic: T = 0.0692

P value: p = 0.1195

Conclusion: We fail to reject the null hypothesis that the data is normally distributed. Visual assessment

with a Q-Q plot confirms near-normality for this data with perhaps some deviation at the tails.

(b) Is the body temperature equal to 98.6?

BLUF: There is very strong evidence that the average body temperature is not 98.6.

Since we are comparing means and the data is approximately normal with a large sample size, we can

use either the t-test or the z-test. I’ll use the t-test.

Hypotheses: H0 : µ = 98.6 Ha : µ ̸= 98.6

Type of test: One-sample t-test.

Significance: α = 0.05 because we have enough data to make a strong conclusion.

Test statistic: t = −6.03

Critical value: t∗ = ±1.98

P value: p = 1.3 × 10−8

Conclusion: We reject the null hypothesis that the mean body temperature is 98.6 degrees.


(c) For the α you selected, what is the power to detect a difference of 0.2 degrees? Assume the sample

variance is equal to the population variance.

For a difference of 0.2 degrees, the effect size (as a multiple of s) is 0.27. With 148 observations and

α = 0.05, the power is 0.9061.

(d) Create a plot showing how α (x-axis) affects the power to detect a difference of 0.2 degrees (y-axis).
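Parts (c) and (d) can be reproduced with statsmodels' power tools (a sketch; the effect size 0.27 comes from part (c)):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# Part (c): power at alpha = 0.05, effect size 0.27, n = 148
power = analysis.solve_power(effect_size=0.27, nobs=148, alpha=0.05)
print(round(power, 4))  # ≈ 0.906

# Part (d): power as a function of alpha
alphas = np.linspace(0.001, 0.2, 100)
powers = [analysis.solve_power(effect_size=0.27, nobs=148, alpha=a) for a in alphas]

fig, ax = plt.subplots()
ax.plot(alphas, powers)
ax.set_xlabel("significance level α")
ax.set_ylabel("power to detect a 0.2-degree difference")
fig.savefig("alpha_vs_power.png")
```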

(e) Is there a difference between body temperature in males and females?

BLUF: There is enough evidence to conclude that males and females have different average body temperatures.

Since we are comparing means and the data is approximately normal with a large (and equal) sample

size, we can use either the two-sample t-test or the two-sample z-test. I’ll use the t-test.

Hypotheses: H0 : µM − µF = 0 Ha : µM − µF ̸= 0

Type of test: Two-sample t-test with pooled variance. The sample variance for males is 0.4912 and the

sample variance for females is 0.5497.

Significance: α = 0.05 because we have enough data to make a strong conclusion.

Test statistic: t = −2.77

Critical value: t∗ = 1.98

P value: p = 0.0064

Conclusion: We reject the null hypothesis that average body temperature is equal for males and females.


Problem 8

The power of love — 1 point

I want to design a study to determine if my daughter can average more than five minutes without

asking me a question while teleworking. I plan on using a t-test regardless of sample size.

(a) I want to detect a difference of 90 seconds with 80% power and 90% confidence. How many intervals

do I need to measure? Assume a standard deviation of 2 minutes.

For a difference of 90 seconds, the effect size (as a multiple of s) is 0.75. I’ll need a sample size of at least

12.46 intervals, so I must measure at least 13 intervals.

(b) For differences of 30, 60, and 90 seconds, construct a plot showing the effect that sample size

(x-axis) will have on power (y-axis) for α = 0.1.
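A sketch of both parts with statsmodels; with a 2-minute (120-second) standard deviation, differences of 30, 60, and 90 seconds correspond to effect sizes of 0.25, 0.5, and 0.75:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# Part (a): solve for the sample size at effect size 0.75, power 0.8, alpha 0.1
n_required = analysis.solve_power(effect_size=0.75, power=0.8, alpha=0.1)
print(round(n_required, 2))  # ≈ 12.46, so measure at least 13 intervals

# Part (b): power vs. sample size for each effect size
nobs = np.arange(2, 61)
fig, ax = plt.subplots()
for d in (0.25, 0.5, 0.75):
    ax.plot(nobs, [analysis.solve_power(effect_size=d, nobs=n, alpha=0.1) for n in nobs],
            label=f"effect size {d}")
ax.set_xlabel("sample size")
ax.set_ylabel("power")
ax.legend()
fig.savefig("interval_power_curves.png")
```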


Winter Quarter 2022


Midterm Exam

DASC 512

Cover Sheet and Instructions

There are five (5) pages and nine (9) questions on this exam. This mid-term exam covers Weeks 1–5 of

DASC 512 material. It is worth 35% of your overall course grade. It is due not later than 2359 EST, 14 Feb

2022, which is two weeks from the date the exam is posted. I will accept late submissions up to 2 days late at

a 10% reduction in grade. I will not accept submissions later than that. Note that a 10% reduction in your

mid-term grade is roughly equivalent to failing to submit an entire homework. Submissions must include

both a PDF fully detailing your response to the questions (i.e., results, narrative, tables, and graphs) along

with any Python code you used in a .py or .ipynb format. Code does not need to be part of the PDF.

Integrity Rules: If you have questions on this exam, you may contact the instructor or post a question

on the Mid-term Discussion Forum. You are not allowed to answer anyone else’s question on that forum —

the instructor will choose the level of information to provide.

Instructions: For all problems, be sure to give full details of your analysis.

I recommend using an Assume, Given, Find, Solution, Answer method to organize your thoughts and

response for problems that are not hypothesis tests. This is not required, but it helps guide your thought

process.

For hypothesis tests, be sure to include:

• a non-technical summary of results

• hypothesis statements

• assumptions

• test chosen with justification

• significance level

• appropriate results, such as test statistic, p-value, rejection region, confidence intervals, and/or ANOVA

tables

• technical conclusion

Examples of well-formulated solutions are given on the next page.


Example Solutions

Example Problem 1: In the board game Gloomhaven, characters start with decks of 20 cards that

provide modifiers for an attack: 1 miss (0 damage), 1 -2 (base – 2 damage), 5 -1 (base – 1 damage), 6 +0

(base damage), 5 +1 (base + 1 damage), 1 +2 (base + 2 damage), and 1 critical hit (2x base damage). A

character attacks with a 3-damage attack and uses advantage — taking the higher of two random modifiers.

What is the probability that they do at least 4 damage?

Assume: The full deck is available and well shuffled. Let X be the damage from a single card draw. Let

Y be the damage with advantage.

Given: For X, P(0) = 1/20, P(1) = 1/20, P(2) = 5/20, P(3) = 6/20, P(4) = 5/20, P(5) = 1/20, P(6) =

1/20.

Find: P (Y ≥ 4)

Solution: On the first draw, P (X ≥ 4) = 7/20. If the first draw results in less than 4 damage, then

P (Y ≥ 4) = 7/20. If the first draw results in at least 4 damage, then (P (Y ≥ 4) = 1. Thus

P (Y ≥ 4) = P (Y ≥ 4|X ≥ 4)P (X ≥ 4) + P (Y ≥ 4|X < 4)P (X < 4) = (1)(7/20) + (7/20)(13/20) = 231/400 ≈ 0.578

Hypotheses: H0 : µB − µC = 0  Ha : µB − µC > 0

Assumptions: Both dice are fair dice with 1/6 chance of rolling each outcome — a uniform distribution

between 1 and 6. Each roll is iid. Although the underlying distribution is uniform, the sample size is 100 for

both groups, so assume the population is normally distributed with µ = 3.5, σ2 = 35/12 — the mean and

variance of a discrete uniform distribution with 6 outcomes.

Type of test: Two-sample t-test with equal variance. Sample size is sufficient for the Central Limit

Theorem to apply to the sampling distribution despite underlying uniform distributions.

Significance: The risk of a type-I error is low, so let α = 0.1.

Test statistic: t = 2.07

Rejection region: t > 1.29

P-value: p = 0.0199

Confidence interval: With 90% confidence, Beau's rolls are better than Chris's by between 0.22 and 0.78

on average.

Conclusion: At the 0.1 significance level, we reject the null hypothesis that Chris and Beau roll equally

well and conclude that Beau’s rolls are higher on average than Chris’s.


Exam Questions

Problem 1: 5 points

Suppose that there are four inspectors at a film factory who are supposed to stamp the expiration date

on each package of film at the end of the assembly line: John, Tina, Wayne, and Amy. John processes

20% of all packages, and he fails to stamp 1/200 packages that he processes. Tina processes 60% of all

packages, and she fails to stamp 1/100 packages. Wayne processes 15% of all packages, and he fails to

stamp 1/90 packages. Amy processes 5% of all packages, and she fails to stamp 1/200 packages.

A customer calls to complain that her package of film does not show an expiration date. What is

the probability that it was inspected by John?

Assume: Worker mistakes occur independently and at constant rates.

F − the event that a package is missing its stamp.

J − the event that John processed the package.

T − the event that Tina processed the package.

W − the event that Wayne processed the package.

A − the event that Amy processed the package.

P(J) = 20/100 = 0.2, P(F|J) = 1/200 = 0.005

P(T) = 60/100 = 0.6, P(F|T) = 1/100 = 0.01

P(W) = 15/100 = 0.15, P(F|W) = 1/90 ≈ 0.0111

P(A) = 5/100 = 0.05, P(F|A) = 1/200 = 0.005

Find: P(J|F)

Solution: By Bayes' rule, P(J|F) = P(F|J)P(J)/P(F), where the law of total probability gives

P(F) = P(J)P(F|J) + P(T)P(F|T) + P(W)P(F|W) + P(A)P(F|A)

P(F) = 0.2 × 0.005 + 0.6 × 0.01 + 0.15 × 0.0111 + 0.05 × 0.005 = 0.0089

Then, P(J|F) = P(F|J)P(J)/P(F) = (0.005 × 0.2)/0.0089 ≈ 0.112

Answer: P(J|F) ≈ 0.112
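A sketch of the computation in Python; note the total works out to P(F) ≈ 0.0089, which gives P(J|F) ≈ 0.112:

```python
# Priors: share of packages each inspector processes
priors = {"John": 0.20, "Tina": 0.60, "Wayne": 0.15, "Amy": 0.05}
# Conditional miss rates from the problem statement
miss_rates = {"John": 1 / 200, "Tina": 1 / 100, "Wayne": 1 / 90, "Amy": 1 / 200}

# Law of total probability, then Bayes' rule
p_f = sum(priors[w] * miss_rates[w] for w in priors)
p_john_given_f = priors["John"] * miss_rates["John"] / p_f

print(round(p_f, 4))             # 0.0089
print(round(p_john_given_f, 3))  # 0.112
```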


Problem 2: 5 points

A regional telephone company operates three identical relay stations at different locations. During a

one-year period, the number of malfunctions reported by each station and the causes are shown below.

Causes                                Station A   Station B   Station C
Problems with Electricity Supplied        2           6           4
Computer Malfunction                      4           3           1
Malfunctioning Equipment                  3           4           3
Human Error                               9           4           7

Suppose that a malfunction was reported and it was found to be caused by human error. What is

the probability that it came from Station C?

Assume: All the stations are identical, and the reported malfunctions are representative.

Let C be the event that the malfunction came from Station C, and H be the event that it was caused by human error.

Solution: Out of the 50 reported malfunctions,

P(C) = (4 + 1 + 3 + 7)/50 = 0.3

P(H) = (9 + 4 + 7)/50 = 0.4

P(H|C) = 7/(4 + 1 + 3 + 7) = 0.4667

P(C|H) = P(H|C)P(C)/P(H) = (0.4667 × 0.3)/0.4 = 0.35

Answer: 0.35
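The final step can be checked directly: conditioning on human error, only the human-error row of the table matters. A sketch:

```python
# Malfunction counts by cause for stations (A, B, C)
counts = {
    "electricity": (2, 6, 4),
    "computer": (4, 3, 1),
    "equipment": (3, 4, 3),
    "human_error": (9, 4, 7),
}

# P(Station C | human error) = human-error count at C / total human errors
human_error_total = sum(counts["human_error"])
p_c_given_h = counts["human_error"][2] / human_error_total

print(round(p_c_given_h, 2))  # 0.35
```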


Problem 3: 10 points

Describe the effect of sample size, effect size, and level of significance on statistical power.

Sample size matters whenever you study a large population. You are interested in the entire population, but it is rarely realistic to measure it all, so you take a random sample that represents it. The size of the sample is important for accurate, statistically significant results and a successful study. If your sample is too small, it may include a disproportionate number of outliers and anomalies, which can skew the results so that they do not accurately reflect the population. If the sample is too big, the whole study becomes complex, expensive, and time-consuming, and although the results are more accurate, the benefits don't outweigh the costs.

Effect size is a statistical concept that measures the strength of the relationship between two variables on a numeric scale; it indicates the practical significance of that relationship and helps determine whether an observed difference is real or due to chance. In hypothesis testing, effect size, power, sample size, and significance level are all related to each other. The significance level α is the probability of rejecting the null hypothesis when it is true. A significant result simply means that your statistical test's p-value was equal to or less than your α, which is usually 0.05. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference. A larger sample size or a larger effect size makes an effect easier to detect, and the statistical power of a test can also be increased by raising the significance level.

Problem 4: 5 points

Suppose that we are interested in the IQ of incoming students at AFIT. We want to run a test that

will detect a 5 IQ point difference between the true mean and our new class. From previous research,

we can assume a standard deviation of 15 (i.e., the parameter is known), and leadership wants to have

options for α = 0.05 and α = 0.1. Create a plot for a power analysis where the y-axis is the power of

our test and the x-axis is the sample size. It should have a line for each significance level.


Figure 1: Number of observations vs. power of test

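Because σ is known, the power of the two-sided one-sample z-test can be computed directly from the normal distribution (a sketch; the effect size is 5/15 = 1/3):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from scipy.stats import norm

def z_power(n, alpha, effect_size=5 / 15):
    """Power of a two-sided one-sample z-test for a given sample size."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = effect_size * np.sqrt(n)
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

n = np.arange(2, 151)
fig, ax = plt.subplots()
for alpha in (0.05, 0.10):
    ax.plot(n, z_power(n, alpha), label=f"α = {alpha}")
ax.set_xlabel("sample size")
ax.set_ylabel("power")
ax.legend()
fig.savefig("iq_power.png")
```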

Problem 5: 5 points

The accuracy of a new precision air drop system being tested by the US Air Force follows a normal

distribution with a mean of 50 ft and a standard deviation of 10 ft. A particular resupply mission drops

12 payloads. It is considered to be successful if at least 9 of the 12 payloads are delivered at between

45 and 60 feet. What is the probability that the resupply mission will be successful?

Assume: Each payload's accuracy is independent, with X ∼ N (µ = 50, σ = 10).

Find: The probability that at least 9 of the 12 payloads land between 45 and 60 feet.

Solution: For a single payload,

p = P (45 ≤ X ≤ 60) = Φ((60 − 50)/10) − Φ((45 − 50)/10) = Φ(1) − Φ(−0.5) = 0.8413 − 0.3085 = 0.5328

Let Y be the number of payloads delivered between 45 and 60 feet, so Y ∼ Binom(n = 12, p = 0.5328). Then

P (Y ≥ 9) ≈ 0.110

Answer: The probability that the resupply mission is successful is about 0.11.
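The answer can be computed with scipy: first the per-payload probability of landing in the 45–60 ft window, then the binomial upper tail for at least 9 of 12 (a sketch):

```python
from scipy.stats import binom, norm

# Per-payload probability of landing between 45 and 60 feet
p = norm.cdf(60, loc=50, scale=10) - norm.cdf(45, loc=50, scale=10)
print(round(float(p), 4))  # 0.5328

# Mission succeeds if at least 9 of 12 payloads land in the window
p_success = binom.sf(8, n=12, p=p)
print(round(float(p_success), 3))  # ≈ 0.11
```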

Hypotheses: H0 : σ = 3.3  Ha : σ > 3.3

Assumptions: Assume the population is normally distributed.

Type of test: One-sample chi-squared test for variance (standard deviation), since the population is normally distributed.

Significance: The standard risk of a type-I error is α = 0.05.

Test statistic: χ² = 34.41

Rejection region: χ² > 54.57

P-value: p = 0.6789

Conclusion: At the 0.05 significance level, we fail to reject the null hypothesis; there is insufficient evidence that the standard deviation of the rod diameters is greater than 3.3. We may have to inform management.


This test gave sufficient evidence to conclude that the mean chlorine content (in ppm of chlorine) is less than 71 ppm on average.

Hypotheses: H0 : µ = 71  Ha : µ < 71

PR(>F): 7.683892e-08

resulting anova table 2

              df        sum_sq        mean_sq         F    PR(>F)
C(alloy)     1.0  1.058155e+05  105815.511111  5.215857  0.024789
Residual    88.0  1.785280e+06   20287.273232       NaN       NaN

resulting anova table 3


Problem 9: 55 points

The data given in faithful.csv records data of 272 eruptions at the Old Faithful geyser at Yellowstone

National Park. Column 1 reports the length of the eruption (in minutes) and column 2 reports the length

of the interval between the previous eruption and this one (i.e., wait time).


This test gave sufficient evidence to conclude that the mean wait time before eruptions lasting less than 3 minutes is less than 60 minutes.

Hypotheses: H0 : µ = 60  Ha : µ < 60
