# I need some tutoring with statistics using SPSS

Question 5 indicates there are 23 outliers, based on Jackknife distances. We are not using Jackknife distances; we are using boxplots. You will get a different number of outliers using boxplots, so ignore the Jackknife figure. Most likely you will see 31 outliers using boxplots. Go with what you get with boxplots. I’m more concerned with the process and logic of your responses than with the specific answers, which will vary somewhat from person to person depending on the number of outliers you identify.
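For anyone unsure what "outliers using boxplots" means in practice, here is a minimal sketch of the usual boxplot rule (values more than 1.5 × IQR beyond the quartiles). The function name, the sample values, and the quartile method are assumptions for illustration only; SPSS computes its hinges slightly differently, so its counts may not match this sketch exactly.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" treats the data as the whole population of interest;
    # other quartile conventions can move the fences slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical total-flight-time values (hours), not the assignment data.
sample = [1500, 1800, 2200, 2600, 2900, 3500, 4800, 35000]
print(boxplot_outliers(sample))
```

Note that this rule is applied one variable at a time, which is one reason the count can differ from a multivariate method such as Jackknife distances.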

1. What challenges did Research Project 3 pose as you were completing it?

2. In Question 9 you ran four independent bivariate regression analyses. Suppose that instead of keeping these analyses separate, you ran a single multiple regression analysis that included all four IVs, analyzed collectively and in the presence of each other. How different do you think the respective B values would be? Explain.

3. As you begin thinking about your own *dissertation* research, what role do you perceive correlation and regression analysis playing relative to the statistical strategy you would use?

Week 11

Applying Linear Regression and Prediction: A Guided Example

This handout material provides a guided example of a regression analysis. The example is

similar in structure to the previous guided examples that have been presented in Weeks 7, 8, and

10. Prior to working through this example you are encouraged to review Chapter 9 of the

assigned textbook by Wilson and Joye as well as the Gallo supplement on regression.

Guided Example Context

The research context for this example is to determine if pilots’ sex (male vs. female) has

any relationship to their annual salary, and to be able to generate a prediction equation that could

be used to predict pilots’ salary based on their sex. A random sample of N = 62 full-time ATPs

from Spirit Airlines was selected and participants were asked to report their annual salary and

sex (male or female). A copy of the data acquired from this study is given in the Excel file,

“Week 11 Guided Example Data.”

Pre-Data Analysis

Pre-data analysis involves stating the research question, identifying the correct research

methodology to answer the RQ, and conducting an a priori power analysis to determine the

minimum sample size needed.

What is the RQ? The overriding research question for the current example is: “What is

the effect of sex on the annual salaries of ATPs?” In the context of the current study, an ATP is

defined as a full-time pilot working for Spirit Airlines who holds an ATP certificate, which is the

highest level of aircraft pilot certificate issued by the FAA. In part 121, or air carrier operations,

each pilot is required to have an ATP certificate. Annual salary is defined in U.S. dollars, and sex

refers to male or female.

What is the research methodology? One research methodology that could be used to

answer this question is ex post facto because the study involves a pre-existing group membership

that cannot be manipulated. Alternatively, the research methodology could be correlational

because the study involves a single group (ATPs) and is examining the relationship between

multiple variables (salary and sex) of this single group.

So which is the more appropriate methodology: ex post facto or correlational? If the

focus of the study were to examine the differences in salary between male and female ATPs to

see if one group’s mean annual salary was significantly higher (or lower) than a second group’s

Michael A. Gallo © 2018

Week 11: Gallo Guided Example: Applying Linear Regression and Prediction Page 1

mean annual salary, then effects-type ex post facto would be appropriate. In fact, this would be

analogous to the second guided example given in Week 7 that examined the difference in mean

FAA IRA scores between two groups of students: those who prepared for the exam using

Gleim’s software and those who prepared for the exam using Sheppard Air’s software.

However, given that one of the study’s goals is to generate a prediction equation that can

be used to predict salaries based on sex, then this is a prediction study and therefore the

appropriate research methodology/design is prediction correlational research. This is because

the results of the study will be used to forecast future behavior by examining the correlations

between variables. More concretely, because we endeavor to estimate the effect on a dependent

measure (annual salary) relative to a change in a predictor variable (sex), the focus is on the

regression coefficient (B) because the regression coefficient is a reflection of causal effects.

What is the minimum sample size needed? To determine the minimum sample size, we

consult G*Power based on the following parameters:

• Test family = t tests

• Statistical test = Linear multiple regression: Fixed model, single regression coefficient

• Type of power analysis: A priori

• Tail(s) = Two

• Effect size f² = 0.15, which is a medium effect

• α error prob = .05

• Power = .8

• Number of predictors = 1

The minimum total sample size is N = 55. We will use a two-tailed test because we do not know

what the effect will be. Furthermore, in the absence of any guidance from theory, the literature,

or a preliminary study, we set the expected effect size to medium.

Data Analysis

We now conduct the hypothesis test. Following is a summary of the steps associated with

the corresponding hypothesis test.

Step 1: Formulate the null and alternative hypotheses.

H0: β = 0: The slope of the regression line in the population is zero (horizontal line). More specifically, sex (x) has no significant effect on annual salary (y) among full-time ATPs.

H1: β ≠ 0: The slope of the regression line in the population differs significantly from zero (oblique line). More specifically, sex (x) has a significant effect on annual salary (y) among full-time ATPs.

Step 2: Determine the test criteria. The test statistic is t, the level of significance is α = .05, and the boundary of the critical region is determined from the table of critical values for t given in Table 7.1 (p. 4) of the Gallo supplement for Week 7. From the Excel file, N = 62 and therefore the degrees of freedom are df = 62 – 2 = 60. For a two-tailed test with α = .05, the critical value is tcritical = ±2.0. Thus, to be significant, the calculated t must be greater than or equal to tcritical = 2.0 or less than or equal to tcritical = −2.0. This is shown in Figure 1. Note that the sample size is greater than the minimum sample size needed.

Figure 1. Critical regions for two-tailed t test for α = .05 and df = 60 (N = 62). Reject H0 if t(60) ≤ −2.0 or t(60) ≥ 2.0, with .025 in each tail.

Step 3: Collect data and compute sample statistics. We import the given Excel file into

our statistical software program and then run a regression analysis. The first thing we do is

observe that the independent variable, x = sex, is categorical. Before we can perform a regression

analysis we have to express this categorical variable as a continuous variable. Because x

represents a dichotomy, we will use a binary coding scheme of 0 and 1 and assign 0 to males and

1 to females. Thus, we must create a new variable in the data set that has these assignments

before we conduct any analyses.
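The 0/1 coding step described above can be sketched as follows. This is only an illustration: the `CODES` mapping matches the handout's assignment (0 = male, 1 = female), but the function name and the raw entries are hypothetical.

```python
# Coding scheme from the guided example: 0 = male, 1 = female.
CODES = {"male": 0, "female": 1}

def dummy_code(labels):
    """Convert text labels to 0/1; an unexpected label raises KeyError."""
    return [CODES[label.strip().lower()] for label in labels]

raw_sex = ["Male", "Female", "female ", "male"]  # hypothetical raw entries
print(dummy_code(raw_sex))  # [0, 1, 1, 0]
```

Raising an error on an unexpected label (rather than silently coding it as 1) is a deliberate choice here, so that typos in the data are caught before the analysis is run.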

Check assumptions. The next thing to do is check to see if the data satisfy the underlying

assumptions. As indicated in the Gallo supplement, there are four main assumptions: (a)

linearity, (b) homoscedasticity, (c) independence of the residuals, and (d) normality. The single

best way to confirm these assumptions is by examining residual plots. Recall that residuals are

nothing more than the difference between an observed y score and the corresponding predicted y

score for a particular x value. Your statistical software should provide scatter plots of the

residuals as part of its output for a regression analysis. If it does not do so by default, then you


will have to invoke additional menu commands to get the plots. For readers using SPSS, here is a

link that describes how to test for these assumptions using SPSS:

http://blog.uwgb.edu/bansalg/statistics-data-analytics/linear-regression/what-are-the-fourassumptions-of-linear-regression/. You also must make sure that you will be working with the

modified continuous version of the x variable and not the categorical x variable.

Linearity of residuals. To confirm the linearity assumption, we examine a residual plot in

which the y scores are placed on the y axis and the residuals are placed on the x axis.

(Alternatively, we could examine a scatter plot that involves plotting the residuals on the y axis

and the predicted y values on the x axis.) A copy of this plot is shown in Figure 2. We now

examine this plot to see if the relationship is linear in form. This appears to be the case. The

scatter plot does not suggest a nonlinear pattern. As a result, this assumption is satisfied.


Figure 2. Scatter plot of y vs. residuals.

Homoscedasticity of residuals. Recall from the guided example in Week 10 involving

correlation that homoscedasticity means “equal variances,” which implies that the variances are

the same for all values of each variable. To confirm this is the case for a regression analysis, we

examine the same residuals vs. predicted scatter plot we used for the linearity assumption. What

we seek is a plot that has no discernible pattern. From Figure 2, there appears to be no detected

systematic trend (other than the linear relationship). More specifically, the data do not “fan out”

in a horizontal “V” form or show a curvilinear form. As a result, this assumption is satisfied.


Independence of residuals. In a regression analysis the residuals also must be

independent of each other. There are several instances when this assumption has the potential to

be violated. One form of dependency in the residuals occurs when there is a systematic change

over time in the nature of the participants such as in a longitudinal study. Although this is not the

case of the current study because it is cross-sectional in nature (the data were collected only one

time), it would still be prudent to confirm this assumption. For our purposes, the best strategy for

detecting violations of this assumption is to examine a plot of the residuals vs. the case numbers.

If there is no detectable/discernible pattern, then there is a good indication that the residuals are

independent. To do so, we plot the residuals on the y axis and the case numbers on the x axis. As

shown in Figure 3, there appears to be no detectable/discernible pattern and therefore this

assumption is satisfied.

Figure 3. Scatter plot of residuals vs. case numbers (1–62).

Normality of residuals. To confirm the normality assumption, we examine a normal q-q

plot similar to what we have done in previous examples. For the current situation, though, the q-q plot involves the residuals. As shown in Figure 4, the points appear to fall along the line and

are contained within the 95% confidence band. This assumption also can be confirmed by

examining the corresponding Shapiro-Wilk Goodness of Fit test. The results of this test confirm

the residuals are from a normal distribution.


Figure 4. Normal q-q plot of the residuals.

These preliminary analyses confirm that the four principal assumptions of regression are

satisfied. As a result, we now run the regression analysis and review the primary results.

Run the analysis. Now that the assumptions have been met, we run the analysis and

review the findings. As part of our review we also will interpret the findings in the context of the

research setting.

The scatter plot. A copy of the scatter plot that corresponds to this analysis is shown in

Figure 5. Note how the scatter plot shows the dichotomy between Males (coded 0) and Females

(coded 1) relative to their salaries.

Figure 5. Results of regression analysis in which y = annual

salaries were regressed on x = sex (male or female).


The regression equation. The results of the analysis produced the following regression equation: y = -3902.09x + 56515.06, with B = -3902.09, which is the slope, and B0 = 56515.06, which is the y value of the y intercept. From a generic perspective, B is interpreted as follows: for every one-unit change in x, we can expect on average a B-unit change in y. Because x represents a dichotomy with 0 = Males and 1 = Females, the only possible one-unit change in x is from 0 to 1. This can be seen from Figure 5. Note how the slope of the line is negative: It decreases from Males to Females, which indicates that the annual salary of males is higher than that of females. This can be confirmed by simply substituting the coded values into the regression equation:

• When we substitute 0 (Males) for x into the equation, we get

y = -3902.09(0) + 56515.06 = 0 + 56515.06 = 56515.06

This result represents the predicted average salary for males.

• When we substitute 1 (Females) for x into the equation, we get

y = -3902.09(1) + 56515.06 = -3902.09 + 56515.06 = 52612.97

This result represents the predicted average salary for females.

Thus, based on these results, the mean salary for males is $56,515.06 and the mean salary for females is $52,612.97. Note that the difference in these mean salaries is

$56,515.06 – $52,612.97 = $3,902.09

This difference is exactly the magnitude of B in the regression equation. So when conducting a regression analysis involving a dichotomy where one group is coded 1 and another group is coded 0, B represents the difference in group means (the mean of the group coded 1 minus the mean of the group coded 0), and B0 represents the mean of the group that was coded 0.

The coefficient of determination. The coefficient of determination is r² = .04. This is

interpreted from an explained variance perspective as follows: 4% of the variance in salaries (y)

is explained by the variance in sex (x). Because the variance in sex is the difference between

males and females, we would conclude that 4% of the variance in salaries is being explained by

whether a pilot is male or female. This implies that 96% of the variance in salaries is being

explained by something else. We also can interpret this from a prediction perspective: If we

know the sex of a pilot (male or female), then we have 4% of the information to perfectly predict

his or her annual salary.

The result of the t test. The t test result is t(60) = -1.59, p = .1173, which means that the

regression equation is not significant and the corresponding r² also is not significant.
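As a rough consistency check (not part of the handout), the t value for a bivariate regression slope can be recovered from r² and the degrees of freedom. Here r is taken as the square root of the reported r² = .04, with the negative sign coming from the negative slope; because r² is rounded to two decimals, the result is only approximate.

```latex
t = r\sqrt{\frac{df}{1 - r^{2}}}
  \approx -0.20\sqrt{\frac{60}{1 - .04}}
  = -0.20\sqrt{62.5}
  \approx -1.58
```

This is close to the reported t(60) = -1.59; the small gap is what one would expect from rounding r² (an assumption, since the unrounded value is not given).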


Standard error of estimate. Recall that the standard error of estimate is a metric that

provides an idea of how far “off” we will be on average in using x to predict y. Our statistical

software reports this error as the root mean square error. Based on the results, RMSE = 9586.92,

which indicates that if we use the sex of a pilot (male vs. female) to predict pilots’ annual salary,

we will be “off” on average by $9,586.92 or by approximately $9,600.

Outliers. Before we continue to Step 4 of hypothesis testing, it might be prudent to first

check for outliers. Although this is not an assumption of regression as it is with correlation, it is

still possible that outliers could have an impact on the results. Running an outlier analysis using

Jackknife distances, we discover two outliers: Case 48 (male pilot) and Case 60 (female pilot),

and both appear to have relatively high salaries compared to the rest of the sample. The results of

a regression analysis in the absence of these two outliers are as follows:

• y = -3944.10x + 55721.29

• r² = .05

• t(58) = -1.76, p = .0831

• RMSE = 8584.68

Note that although the results appear to be nearly the same as those with outliers present,

there is a slight improvement in the analysis without outliers: (a) There is a gain of one

additional percent in explained variance or prediction (5% vs. 4%); (b) There is a stronger t

value, and the corresponding p value is closer to the preset threshold for committing a Type I

error (.0831 vs. .1173); and (c) There is less error when using the regression equation to predict

salaries (RMSE = 8584.68 vs. 9586.92). Based on these findings, it appears the outliers are

having a slight impact on the findings in that they are suppressing (or masking) the relationship.

Because the two outliers represent only 3% of the sample, and because they are not having that

great of an impact on the results, we choose to keep the outliers to reflect a real-world situation.

Step 4: Make a decision: Either reject or fail to reject the null hypothesis. The

decision is fail to reject the null hypothesis and conclude that sex has no significant effect on the

annual salaries of full-time ATPs from Spirit Airlines.
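The decision rule from Step 2 can be written as a one-line check. This is only a sketch; the function name is hypothetical, and the critical value of 2.0 is the rounded table value used in the handout.

```python
def decide(t_calculated, t_critical):
    """Two-tailed rule: reject H0 when |t| is at least the critical value."""
    if abs(t_calculated) >= t_critical:
        return "reject H0"
    return "fail to reject H0"

print(decide(-1.59, 2.0))  # fail to reject H0
```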

Post-Data Analysis

We now determine and report the corresponding effect size, power, and 95% confidence

interval. We also discuss plausible explanations for the results.


What is the effect size? Recall that the effect size for regression is f², which is equal to the ratio of r² to (1 – r²). Given r² = .04, f² = .04/.96 = .042, which is a small effect size.
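The effect size calculation is simple enough to verify directly. A minimal sketch follows; the function name is an assumption made for illustration.

```python
def f_squared(r_squared):
    """Effect size f² = r² / (1 - r²)."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return r_squared / (1 - r_squared)

print(round(f_squared(0.04), 3))  # 0.042
```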

What is the power of the study? To determine the actual power of the study we consult G*Power and focus on the post hoc side by entering the parameters to reflect the actual results from the study: Effect size f² = .042 and Total sample size = 62. When these changes are made, the power of the study is .355, which indicates that if an effect of this size truly exists in the population, a study with this sample size has a 35.5% probability of detecting it.

What is the corresponding confidence interval? The 95% CI reported in the output from

our statistical software is as follows: 95% CI = [-8420.99, 532.79], which tells us that 95% of the

time we can expect the true effect sex has on salaries to lie within the interval [-8420.99, 532.79].

In other words, 95% of the time we can expect female ATPs from Spirit Airlines to earn anywhere

from $8,420.99 less than their male counterparts to $532.79 more than their male counterparts.

Another way of looking at this is to say that if we were to randomly select 100 samples of size N =

62 from the same population, then in 95 of these samples the effect of sex on salaries would be

between -8420.99 and 532.79. Furthermore, because the null hypothesis states that the relationship

will be 0, and because 0 is within the 95% CI, we fail to reject the corresponding null hypothesis.

Finally, we would conclude that based on the width of the 95% CI, the corresponding accuracy in

parameter estimation (AIPE) is not that good (judgment call). In other words, the sample data are

not a good source for accurately predicting the true effect sex has on salaries.

What are some plausible explanations for the result? One plausible explanation for the

result of this study is that Spirit Airlines might have made a sincere effort to address the gender gap

among its ATPs. A second plausible explanation is a combination of sample size and sampling

strategy. Of the 62 participants, 27 were females, which represented approximately 44% of the

sample. According to the FAA, approximately 6% of the nation’s ATPs are female. Thus, if the

sampling strategy had been stratified random sampling instead of simple random sampling, and the

sample size was larger than 62, it is possible that the sex effect would have been significant. What

other plausible explanations for the results can you think of?


Week 12

Research Project 3

This week is dedicated exclusively to engaging in a research project that will enable you to apply

the various concepts you learned primarily from the previous 2 weeks, but also throughout the

entire course thus far. This research project uses selected variables from the certified flight

instructors (CFIs) study, which was examined in Research Project 1 (Week 4) and Research

Project 2 (Week 9). The variables that have been targeted for the current project are as follows:

X2 = Participants’ age in years

X6 = Total years participants have held a CFI certificate

X7 = Participants’ total hours dual given

X9 = Participants’ total flight time (in hours)

Y = Complacency scores, which were measured on a 7-item Likert scale ranging

from 1 = Strongly Disagree to 5 = Strongly Agree. Thus, scores could range from

7 to 35, with higher scores indicating a greater likelihood toward complacency as

a flight instructor.

The dataset is given in the Excel file “Week 12–Research Project 3 Data.” Also note that the

sample size is now N = 276 instead of the original N = 340.

1. Using your statistical software package, generate a correlation matrix that involves all five

variables. This matrix is essentially a table that contains the bivariate correlations of all the

variables. The table also will have 1s along its diagonal due to symmetry as follows:

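|    | Y | X2 | X6 | X7 | X9 |
|----|---|----|----|----|----|
| Y  | 1 |    |    |    |    |
| X2 |   | 1  |    |    |    |
| X6 |   |    | 1  |    |    |
| X7 |   |    |    | 1  |    |
| X9 |   |    |    |    | 1  |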

2. Interpret each r value associated with y in the context of the given research setting. (Note:

From the correlation matrix table, the correlations will fall along the first column.)

3. For each r value in Question 2, interpret the corresponding r² value from both an explained

variance perspective and a prediction perspective.

4. Of all four correlation coefficients between each IV and y, which factor has the strongest

relationship with y? What is a plausible explanation for this strong relationship?

5. Conduct an outlier analysis using Jackknife distances involving all the variables. In other

words, include all the variables (Y, X2, X6, X7, and X9) in the outlier analysis collectively

(there should be 23 outliers). Repeat Question 1 and compare the two correlation matrices.

What impact did the outliers have on the correlations with y?

6. Using the data set without outliers, conduct a hypothesis test involving the IV that has the

strongest relationship with y. You are to structure this test in exactly the same manner as

given in the guided example: pre-data analysis, data analysis, and post-data analysis. Provide

a summary report of your findings.

7. Extend Question 6 by conducting a hypothesis test for regression. You are to structure this

test in exactly the same manner as given in the guided example: pre-data analysis, data

analysis, and post-data analysis. Provide a summary report of your findings.

8. Answer the following two questions:

a. Does it make sense to develop a regression equation to predict a score on one variable

from a score on a second variable if the two variables are not correlated? Why or why

not? Give an example that would support a “yes” response and an example that would

support a “no” response.

b. Based on your response to Question 8a, to what extent would the prediction equation

from Question 7 be beneficial from a practical perspective?

9. Conduct four independent bivariate regression analyses, one for each IV.

a. Summarize the results in the following table. (The first row is completed for you.)

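| Factor | Bi | B0 | ti(251) | p | r² |
|--------|----|----|---------|---|----|
| X2 = Age | -0.0247 | 17.15 | -1.63 | .1047 | .01 |
| X6 = Years Held CFI Cert. | | | | | |
| X7 = Hours Dual Given | | | | | |
| X9 = Total Flight Time | | | | | |

Note. N = 253. Y = Level of complacency.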

b. Interpret each Bi value in the context of the research setting.

c. Of the four IVs, which has the strongest predictive power? Is this consistent with your

response to Question 4?

10. Let’s now apply this study to research fundamentals:

a. To what extent are the results from Questions 6, 7, and 9 generalizable from both

population and ecological external validity perspectives?

b. What are some limitations/delimitations of this study?

c. What would be an appropriate recommendation for future research?

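| Y = Complacency | X2 = Age | X6 = Years Held CFI Cert. | X7 = Hours Dual Given | X9 = Total Flight Time |
|---|---|---|---|---|
| 21 | 50 | 30 | 2500 | 12500 |
| 19 | 44 | 13 | 2500 | 3500 |
| 18 | 28 | 6 | 1500 | 1800 |
| 24 | 34 | 14 | 2000 | 2600 |
| 13 | 36 | 15 | 1800 | 2200 |
| 14 | 25 | 2.5 | 1050 | 1500 |
| 18 | 29 | 2 | 1075 | 1075 |
| 13 | 30 | 7 | 2500 | 2900 |
| 19 | 31 | 7 | 1300 | 1600 |
| 18 | 23 | 1 | 75 | 350 |
| 12 | 37 | 18 | 574 | 4413 |
| 16 | 38 | 13 | 2500 | 3000 |
| 7 | 65 | 43 | 4500 | 6000 |
| 17 | 68 | 38 | 2500 | 4500 |
| 16 | 66 | 28 | 5381 | 8748 |
| 15 | 61 | 30 | 3000 | 7000 |
| 21 | 65 | 10 | 1100 | 2000 |
| 13 | 72 | 17 | 8000 | 12000 |
| 14 | 66 | 10 | 1800 | 3500 |
| 17 | 34 | 13 | 1300 | 4800 |
| 24 | 30 | 5 | 700 | 1350 |
| 17 | 78 | 50 | 5000 | 12500 |
| 15 | 75 | 54 | 4500 | 6500 |
| 20 | 56 | 15 | 800 | 2300 |
| 18 | 55 | 26 | 12000 | 15500 |
| 15 | 62 | 33 | 1200 | 3600 |
| 8 | 55 | 26 | 6000 | 8000 |
| 13 | 77 | 34 | 250 | 2500 |
| 16 | 59 | 20 | 20 | 5500 |
| 16 | 68 | 43 | 4000 | 15000 |
| 15 | 64 | 41 | 4500 | 9000 |
| 14 | 69 | 34 | 1500 | 4800 |
| 17 | 65 | 42 | 2000 | 14500 |
| 19 | 70 | 25 | 4100 | 5600 |
| 15 | 68 | 45 | 10000 | 35000 |
| 13 | 69 | 20 | 6500 | 8300 |
| 13 | 59 | 12 | 3100 | 4600 |
| 14 | 70 | 34 | 12000 | 14782 |
| 14 | 70 | 41 | 2000 | 3500 |
| 15 | 65 | 40 | 4000 | 7000 |
| 14 | 43 | 20 | 5410 | 7800 |
| 23 | 59 | 13 | 4000 | 5800 |
| 16 | 76 | 25 | 5000 | 7500 |
| 19 | 68 | 36 | 3600 | 8400 |
| 16 | 63 | 5 | 410 | 1500 |
| 19 | 37 | 10 | 1500 | 13000 |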

Rows 47–93, values listed by variable in row order:

Y = Complacency: 14, 18, 9, 20, 14, 19, 9, 12, 12, 8, 14, 15, 17, 15, 18, 20, 15, 8, 18, 17, 12, 16, 15, 17, 17, 13, 16, 8, 13, 9, 15, 18, 16, 17, 16, 15, 29, 18, 21, 19, 17, 19, 15, 16, 17, 17, 18

X2 = Age: 44, 67, 55, 60, 68, 35, 63, 66, 56, 78, 72, 47, 48, 58, 72, 28, 58, 71, 73, 48, 35, 71, 78, 58, 61, 67, 61, 55, 64, 80, 74, 85, 76, 61, 72, 60, 62, 66, 48, 62, 78, 38, 64, 36, 56, 58, 48

X6 = Years Held CFI Cert.: 21, 8, 33, 37, 10, 15, 30, 10, 25, 59, 50, 29, 10, 25, 42, 7, 36, 40, 20, 20, 15, 48, 60, 38, 37, 21, 5, 25, 29, 55, 15, 70, 55, 30, 21, 8, 15, 43, 7, 20, 22, 4, 9, 9, 14, 17, 3

X7 = Hours Dual Given: 2400, 268, 3000, 2000, 50, 2500, 1000, 1500, 3000, 4000, 8000, 4000, 3000, 750, 3500, 2500, 3000, 1000, 3500, 2400, 1000, 4000, 4200, 5000, 1400, 2000, 500, 500, 7500, 5000, 1500, 3500, 8000, 1200, 5599, 820, 2800, 20000, 1500, 5300, 2500, 1400, 4000, 550, 375, 2000, 125

X9 = Total Flight Time: 8800, 4310, 12000, 5500, 1200, 3900, 11500, 2700, 20000, 27000, 16000, 18800, 4000, 1600, 5000, 3100, 20000, 15000, 6300, 3500, 4500, 6500, 8200, 24000, 10000, 4500, 3800, 1300, 8500, 12000, 3000, 7500, 13000, 4740, 10166, 1800, 4000, 10000, 3000, 8780, 6700, 2200, 5000, 1425, 1600, 3100, 650

Rows 94–140, values listed by variable in row order:

Y = Complacency: 20, 16, 16, 22, 13, 11, 15, 18, 21, 16, 17, 10, 13, 17, 10, 18, 16, 17, 16, 14, 16, 20, 19, 8, 13, 19, 16, 17, 16, 18, 14, 13, 14, 15, 8, 17, 16, 17, 24, 24, 18, 16, 13, 12, 14, 17, 17

X2 = Age: 69, 42, 55, 58, 49, 66, 74, 48, 54, 66, 69, 50, 67, 52, 60, 60, 69, 31, 32, 50, 40, 40, 21, 78, 66, 67, 69, 65, 89, 53, 34, 29, 68, 58, 73, 55, 50, 52, 56, 72, 53, 51, 63, 51, 36, 79, 67

X6 = Years Held CFI Cert.: 46, 20, 1, 35, 15, 25, 20, 25, 9, 45, 21, 30, 48, 8, 27, 12, 23, 12, 4, 24, 17, 1, 1, 50, 4, 23, 25, 3, 60, 26, 1, 1.5, 42, 15, 9, 5, 12, 8, 6, 3, 20, 15, 39, 13, 5, 41, 19

X7 = Hours Dual Given: 1400, 1100, 250, 2000, 423, 700, 1165, 1000, 2000, 2000, 2700, 1000, 1210, 500, 1100, 800, 2000, 4000, 400, 1555, 2000, 120, 75, 1000, 700, 2900, 10, 800, 5477, 1000, 426, 400, 5000, 2100, 350, 50, 5000, 1000, 3, 500, 1465, 1000, 11000, 1300, 700, 10000, 1950

X9 = Total Flight Time: 14600, 1900, 900, 23700, 1200, 15000, 3538, 3000, 4400, 3000, 6850, 4000, 74, 5000, 11000, 2200, 2600, 5000, 2900, 7300, 3000, 750, 400, 2000, 1500, 5400, 450, 2800, 12212, 1500, 1103, 700, 13000, 4700, 3000, 640, 6000, 1450, 1100, 1700, 2350, 1700, 25500, 3400, 1300, 11700, 3800

Rows 141–187, values listed by variable in row order:

Y = Complacency: 18, 17, 11, 22, 17, 12, 17, 14, 18, 14, 10, 17, 18, 15, 17, 14, 10, 15, 11, 13, 17, 18, 14, 25, 14, 29, 15, 14, 15, 15, 17, 17, 11, 14, 18, 16, 13, 16, 15, 18, 12, 19, 13, 14, 22, 16, 15

X2 = Age: 62, 71, 51, 58, 61, 51, 55, 65, 54, 64, 24, 71, 57, 70, 70, 83, 35, 66, 57, 33, 24, 40, 53, 36, 75, 23, 63, 28, 42, 44, 72, 80, 41, 52, 70, 63, 36, 55, 39, 27, 57, 49, 60, 62, 45, 33, 54

X6 = Years Held CFI Cert.: 13, 20, 26, 12, 42, 29, 25, 30, 21, 26, 0.5, 10, 9, 25, 47, 39, 10, 5, 10, 11, 3, 1.5, 13, 15, 53, 1, 33, 9, 10, 8, 7, 32, 1, 30, 15, 30, 4, 25, 1.5, 6, 28, 9, 6, 15, 10, 3, 5

X7 = Hours Dual Given: 2500, 2300, 9900, 1000, 5000, 600, 6000, 4000, 2100, 950, 1, 100, 1600, 650, 3500, 5000, 2300, 1900, 200, 500, 650, 1300, 1900, 5000, 6900, 66, 1500, 894, 1750, 3000, 625, 4800, 200, 1200, 2500, 12000, 350, 1800, 50, 600, 2500, 1000, 1005, 500, 1000, 39, 750

X9 = Total Flight Time: 4000, 5050, 10100, 1800, 20500, 1800, 12000, 11000, 2800, 15460, 305, 1500, 2500, 3265, 12000, 8000, 4300, 4000, 1850, 1020, 2000, 1650, 2900, 7000, 11000, 510, 3000, 5175, 4500, 4000, 2100, 7100, 650, 1800, 4700, 19000, 900, 6000, 1065, 800, 8700, 2500, 3500, 2200, 2000, 670, 4000

Rows 188–234, values listed by variable in row order:

Y = Complacency: 8, 17, 12, 20, 17, 17, 16, 9, 16, 14, 16, 15, 29, 18, 19, 11, 12, 9, 18, 20, 16, 16, 13, 11, 9, 22, 21, 15, 12, 16, 18, 13, 8, 9, 11, 8, 16, 19, 14, 14, 7, 18, 14, 11, 15, 13, 11

X2 = Age: 67, 39, 72, 52, 73, 62, 55, 67, 60, 66, 69, 46, 42, 53, 47, 35, 67, 70, 64, 64, 50, 35, 41, 86, 67, 54, 57, 74, 64, 60, 33, 48, 31, 34, 64, 54, 48, 50, 69, 64, 63, 56, 60, 65, 52, 51, 63

X6 = Years Held CFI Cert.: 42, 12, 39, 4, 53, 24, 5, 41, 8, 5, 44, 25, 13, 22, 15, 13, 34, 25, 15, 15, 6, 13, 14, 33, 44, 2, 35, 73, 42, 30, 6, 1, 7, 4, 38, 27, 10, 5, 42, 32, 37, 11, 40, 41, 30, 1, 41

X7 = Hours Dual Given: 5600, 560, 3000, 230, 3500, 140, 400, 6000, 900, 450, 2800, 1000, 2600, 2500, 2000, 1230, 4500, 6000, 1200, 800, 400, 4000, 1500, 19000, 7700, 347, 2200, 6995, 4000, 3100, 1800, 50, 5900, 500, 5000, 2400, 1850, 450, 3500, 3400, 1200, 860, 2500, 450, 4000, 45, 2500

X9 = Total Flight Time: 7000, 6300, 7400, 1250, 10500, 1400, 1250, 8000, 1500, 3200, 4800, 12000, 8600, 4500, 4500, 4500, 11100, 14000, 1800, 3000, 1100, 6000, 3000, 23400, 24200, 1005, 5000, 8816, 11000, 8700, 3500, 1700, 6800, 1050, 12000, 4000, 3500, 1400, 9800, 4100, 3200, 1385, 20000, 22200, 7000, 225, 7500

Rows 235–276, values listed by variable in row order:

Y = Complacency: 22, 12, 27, 16, 17, 16, 19, 15, 15, 19, 20, 15, 17, 19, 24, 19, 21, 16, 16, 10, 16, 22, 16, 14, 14, 16, 15, 10, 18, 22, 12, 16, 11, 15, 20, 18, 23, 17, 10, 16, 16, 16

X2 = Age: 71, 61, 59, 38, 57, 75, 61, 44, 75, 30, 45, 77, 64, 34, 65, 39, 63, 56, 35, 66, 27, 59, 60, 67, 51, 65, 57, 68, 68, 77, 60, 25, 61, 44, 54, 70, 68, 57, 61, 55, 45, 54

X6 = Years Held CFI Cert.: 15, 41, 29, 17, 9, 47, 33, 22, 41, 2, 10, 13, 8, 5, 35, 2, 42, 32, 12, 35, 3, 25, 40, 40, 7, 36, 4, 12, 35, 20, 37, 6, 15, 3, 35, 38, 4, 15, 21, 35, 25, 4

X7 = Hours Dual Given: 2000, 5000, 1500, 3000, 250, 2800, 1000, 3700, 8000, 150, 2000, 1800, 2000, 1000, 4024, 200, 400, 500, 900, 4000, 900, 600, 1000, 9000, 450, 2000, 260, 115, 5000, 75, 1523, 150, 500, 100, 1100, 3000, 20, 600, 1100, 5000, 1700, 1450

X9 = Total Flight Time: 2500, 25000, 4300, 12000, 2800, 6600, 12000, 9700, 13000, 650, 2500, 4000, 4000, 1600, 8860, 800, 8000, 2200, 2400, 6000, 2100, 5000, 4600, 11000, 1450, 13000, 750, 1200, 22000, 7900, 3195, 775, 1700, 4500, 15000, 5000, 2020, 1200, 1800, 20000, 4500, 2000

Week 4

Research Project 1

This week is dedicated exclusively to engaging in a research project that will enable you

to apply the various concepts you learned from the previous 3 weeks. The Research Project 1

dataset, which is given as an Excel file, contains actual research data collected from a random

sample of N = 340 certified flight instructors (CFIs). The dataset consists of 10 independent

variables (IVs) and one dependent variable (DV). A description of each variable follows.

X1 = Participants’ gender

X2 = Participants’ age in years

X3 = Participants’ race/ethnicity

X4 = Participants’ marital status

X5 = Participants’ highest level of education

X6 = Total years participants have held a CFI certificate

X7 = Participants’ total hours dual given

X8 = Participants’ total hours dual given in previous 90 days

X9 = Participants’ total flight time (in hours)

X10 = Types of certificates participants currently hold

Y = Complacency scores, which were measured on a 7-item Likert scale, with each item ranging from 1 = Strongly Disagree to 5 = Strongly Agree. Thus, scores could range from 7 to 35, with higher scores indicating a greater likelihood toward complacency as a flight instructor.

1. Based on the IVs and DV, what would be an appropriate overall RQ for this study?

There are two RQs:

(a) What is the relationship between CFIs’ demographics (X1 through X5) and their level of

complacency?

(b) What is the relationship between CFIs’ professional experience/background (X6 through X10)

and their level of complacency?

2. What would be an appropriate research methodology/design for this study? Explain.

The research methodology is correlational. This is because we are dealing with one group (CFIs)

and multiple variables.

3. Using your statistical software package, find the appropriate measures of central tendency and

variability for Y and X1– X9. Complete a chart similar to the one below. The first row is

completed for you as a guide.

| Factor | Appropriate Measure of Central Tendency | Appropriate Measure of Variability | Reason | Actual Measure of Central Tendency |
|---|---|---|---|---|
| X1 = gender | Mode | Not applicable | Data are nominal | Male |
| X2 = age | Median* | Standard deviation and range (20 to 89) | Data are skewed | 60 |
| X3 = race/ethnicity | Mode | Not applicable | Data are nominal | Caucasian |
| X4 = marital status | Mode | Not applicable | Data are nominal | Married |
| X5 = education | Mode | Not applicable | Data are nominal | 4-year degree |
| X6 = years held CFI | Median* | Standard deviation and range (0.5 to 73) | Data are skewed | 20 |
| X7 = hours dual given | Median* | Standard deviation and range (1 to 20000) | Data are skewed | 1850 |
| X8 = hours dual past 90 days | Median* | Standard deviation and range (0 to 3000) | Data are skewed | 20 |
| X9 = total flight time | Median* | Standard deviation and range (11.7 to 35000) | Data are skewed | 6185 |
| Y | Mean | Standard deviation | Data are continuous | 15.8 |

*Although the median is appropriate because of the skewness, note that it also is appropriate

to report the mean. It also is important to report the range (low to high). Also note that the

skewness could be related to outliers. Nevertheless, for descriptive statistics, we keep the

outliers because they are part of the story we are telling.
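As a rough illustration of why the table pairs the mode with nominal data and the median with skewed data, the following Python sketch uses small made-up lists (`gender` and `hours` are hypothetical, not values from the Research Project 1 dataset). It shows how a single extreme value pulls the mean far from the median.

```python
from statistics import mean, median, mode

# Hypothetical nominal variable: only the mode is meaningful.
gender = ["Male", "Male", "Female", "Male"]

# Hypothetical skewed variable with one extreme value.
hours = [20, 15, 30, 25, 3000]

gender_mode = mode(gender)     # most frequent category
hours_median = median(hours)   # middle value, resistant to the extreme case
hours_mean = mean(hours)       # pulled upward by the extreme case

print(gender_mode, hours_median, hours_mean)
```

In this toy example the median stays near the typical values while the mean does not, which is the reason given for preferring the median when data are skewed.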

4. Explain how you would determine the measure of central tendency for X10?

There is no direct measure of central tendency because participants simply listed the types of

certificates they held. One way to report a “central” metric would be to indicate the mean

number of certificates. For example, participants averaged 2.5 certificates. Another way

would be to simply report the mode, which would be the most frequently held certificate

(CFI). Other responses are acceptable as long as they are reasonable.
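The two options described above can be sketched in Python. The `responses` list below is hypothetical and only illustrates the structure of X10 (each participant lists the certificates held); it is not taken from the dataset.

```python
from collections import Counter
from statistics import mean

# Hypothetical X10 responses: one list of certificates per participant.
responses = [
    ["CFI", "CFII"],
    ["CFI", "CFII", "MEI"],
    ["CFI"],
    ["CFI", "MEI", "ATP", "CFII"],
]

# Option 1: mean number of certificates held per participant.
mean_count = mean(len(certs) for certs in responses)

# Option 2: the mode, i.e. the most frequently held certificate.
counts = Counter(cert for certs in responses for cert in certs)
most_common_cert, holders = counts.most_common(1)[0]

print(mean_count, most_common_cert, holders)
```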

5. Prepare a descriptive statistics summary table for all the continuous variables. Use the table

below as a guide. (Round to two decimal places.)

| Factor | N | M | SD | Range (L–H) | Shape of Distribution |
|---|---|---|---|---|---|
| X2 = age | 321 | 56.71 | 14.8 | 20–89 | Skewed left |
| X6 = years held CFI | 307 | 22.26 | 15.4 | 0.5–73 | Skewed right |
| X7 = hours dual given | 309 | 2559 | 2804.0 | 1–20000 | Skewed right |
| X8 = hours dual past 90 days | 318 | 53.29 | 238.4 | 0–3000 | Skewed right |
| X9 = total flight time | 306 | 6185 | 5836.0 | 11.7–35000 | Skewed right |
| Y | 337 | 15.8 | 3.8 | 7–29 | Symmetrical |

6. Consider the outliers associated with X8. Determine which outliers you think are rare cases

and which you think might be contaminants. Explain.

There appear to be two outliers that are contaminants: Row 60 (X8 = 3000) and Row 128 (X8 = 2990). Given that X8 represents the total number of dual hours given in the past 90 days, it is impossible for someone to accumulate 3000 hours of dual given in this time frame, because 90 days contain only 90 × 24 = 2,160 hours. All other hours listed appear to be reasonable. These two cases definitely need to be reconciled or eliminated for inferential statistics.
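The plausibility argument can be checked with a short Python sketch. The two flagged rows come from the answer above; the remaining rows in `x8` are hypothetical placeholders, and using the number of hours in 90 days as a hard ceiling is a simplifying assumption (realistic instructional limits would be far lower).

```python
# Hours that exist in a 90-day window: an absolute upper bound for X8.
MAX_HOURS_90_DAYS = 90 * 24

# Row number -> X8 value. Rows 60 and 128 are the cases named in the answer;
# rows 1-3 are hypothetical, plausible values added for contrast.
x8 = {60: 3000, 128: 2990, 1: 20, 2: 150, 3: 0}

# A value above the physical ceiling cannot be a rare case; it must be an error.
contaminants = {row: hrs for row, hrs in x8.items() if hrs > MAX_HOURS_90_DAYS}

print(MAX_HOURS_90_DAYS, contaminants)
```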

7. Based on the results for Y, how would you assess CFIs’ overall level of complacency?

As noted in the problem description, Y scores could range from 7 to 35 with higher scores

indicating a greater likelihood toward complacency as a flight instructor. The mean score was

M = 15.8, the median was Mdn = 16, the range was 7 to 29, and SD = 3.8. If we were to

consider the midrange of the possible scores (Note: The midrange is the sum of the low score and the high

score, divided by 2), we get 7 + 35 = 42, and 42 divided by 2 is equal to 21.

If we now compare the mean of M = 15.8 (or about 16) to the midrange of 21, the mean is below the

midrange, which indicates that overall complacency scores were low. Thus, for this sample of

CFIs, it appears that they have a relatively low level of complacency with respect to flight

instruction.
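The midrange comparison is simple arithmetic; a minimal Python sketch of it, using only the values reported above, might look like this (the variable names are illustrative).

```python
# Lowest and highest possible scores on the 7-item instrument.
low, high = 7, 35

# Midrange of the possible scores.
midrange = (low + high) / 2

# Reported sample mean for Y.
mean_score = 15.8

# The mean falls below the midrange, pointing to relatively low complacency.
below_midrange = mean_score < midrange
gap = round(midrange - mean_score, 1)

print(midrange, below_midrange, gap)
```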

8. Based on the results for X1 and X5, to what population do you think the results of this study

would be generalizable? Why?

Note that for gender, there were 301 male CFIs and 36 female CFIs. This makes us wonder if

gender is really a variable or a constant when 89% of the sample is male. For highest level of

education, it looks like 4-year (37.5%) and master’s degree (30.7%) are the two highest

categories. Thus, 68% (approximately two-thirds) of the sample’s highest level of education is

a 4-year degree or a master’s degree. Putting this all together, it would be reasonable to

conclude that the results of the study are generalizable to male CFIs who have at least a 4-year college degree.
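The percentages quoted in this answer can be reproduced from the reported counts. The sketch below uses only the figures given above; the variable names are illustrative.

```python
# Reported gender counts for X1.
males, females = 301, 36
pct_male = 100 * males / (males + females)

# Reported percentages for the two largest education categories (X5).
four_year, masters = 37.5, 30.7
pct_four_year_or_masters = round(four_year + masters, 1)

print(round(pct_male, 1), pct_four_year_or_masters)
```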

9. Interpret in the context of the given research setting Q1, Q2, and Q3 for Y.

• Q1 = 14 (25% of the CFIs’ scores were lower than 14)

• Q2 = 16 (50% of the CFIs’ scores were lower than 16)

• Q3 = 18 (75% of the CFIs’ scores were lower than 18)

10. What type of distribution do the complacency (Y) scores form? What statistical data do you

have to support your answer? If we were to assume that the complacency scores

approximate a normal distribution, what is the probability that a CFI selected at random will

have a complacency score of at least 21?

• Symmetrical (bell-shaped)

• The mean (M = 15.8) and the median (Mdn = 16.0) are nearly identical

• Round the mean to 16 and the standard deviation to 4. If x = 21, then the corresponding z

score is z = (21 − 16)/4 = 1.25. The area under the curve between z = 0 and z = 1.25 is .3944. Therefore, the

probability a Y score is at least 21 is 0.5000 – 0.3944 = .1056. Thus, there is a 10.56%

likelihood of a CFI scoring at least 21 on the complacency instrument.
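As a cross-check on the table lookup, the same probability can be computed with `NormalDist` from Python's standard `statistics` module. This is a sketch that assumes, as the answer does, a normal distribution with the mean rounded to 16 and the standard deviation rounded to 4.

```python
from statistics import NormalDist

# Rounded parameters used in the answer above.
mean_y, sd_y = 16, 4
x = 21

# Standardize the score.
z = (x - mean_y) / sd_y

# P(Y >= 21) is the upper-tail area beyond x.
p_at_least_21 = 1 - NormalDist(mu=mean_y, sigma=sd_y).cdf(x)

print(z, round(p_at_least_21, 4))
```

Using the unrounded values (M = 15.8, SD = 3.8) would give a larger z score and therefore a somewhat smaller probability.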
