Survey research and design in psychology/Assessment/Lab report/Data screening

General steps

 * 1) Obtain a copy of the data.
 * 2) The data file is provided as received, with minimal screening. There are likely to be several errors in the data file (e.g., due to mistakes made during completing the survey and/or data entry). Thus, it is important to thoroughly screen the data (check, find, and correct errors) before conducting data analysis.
 * 3) Only the data you decide to use needs screening, although it may prove worthwhile to simply screen all data.
 * 4) Follow the How to screen steps below.
 * 5) Save and back up the screened data file as you would any other valuable electronic file - you don't want to have to redo these steps.
 * 6) Once data screening is complete, proceed to data recoding, then conduct the lab report data analyses.
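
A first pass at the "check" part of screening can be as simple as listing the minimum, maximum, and amount of missing data for each variable, then looking for values that cannot be right. The sketch below does this in plain Python under several assumptions: the data has already been read into a list of dictionaries, missing data is represented as `None`, and the names `caseid`, `age`, and `tm01` and all of the values are invented for illustration. The same check can be done with the frequency or descriptive tools of whatever statistics package you use.

```python
def summarise(rows, variables):
    """Return {variable: (n_valid, n_missing, minimum, maximum)}.

    Missing data is assumed to be stored as None.
    """
    summary = {}
    for var in variables:
        values = [row[var] for row in rows if row.get(var) is not None]
        n_missing = len(rows) - len(values)
        lowest = min(values) if values else None
        highest = max(values) if values else None
        summary[var] = (len(values), n_missing, lowest, highest)
    return summary


# Invented rows for illustration only; not taken from the real data file.
rows = [
    {"caseid": 1, "age": 19, "tm01": 5},
    {"caseid": 2, "age": 2, "tm01": 9},     # age of 2 and a 9 on an 8-point scale
    {"caseid": 3, "age": None, "tm01": 7},  # age not answered
]

summary = summarise(rows, ["age", "tm01"])
for var, stats in summary.items():
    print(var, stats)
```

A minimum age of 2 or a maximum of 9 on an 8-point item is the kind of result that should send you back to the individual case to decide on a correction.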

Known issues for the 2018 data
This is a list of known data file errors and suggested possible corrections ... so far ...

Post anomalies you find in the data to the discussion forum - many hands make light work or, as Linus's Law puts it, "given enough eyeballs, all bugs are shallow".


 * Data enterer ID
 * 1) Data enterer 2 (Cases 6 - 10): Hard copies were not submitted, so unable to verify.  Recommended action: Remove cases.
 * 2) Data enterer 13 (Cases 61 - 65): Other than the background variables, the data is missing for all cases. Hard copies were not submitted, so unable to verify. Recommended action: Remove cases.
 * 3) Data enterer 32 (Cases 156 - 160): Hard copies were not submitted, so unable to verify.  Recommended action: Remove cases.
 * 4) Data enterer 38 (Cases 186 - 190): None of the 18 time management variables is above 4 on an 8-point scale for any of the 5 cases. This seems highly improbable and may indicate that the data was fabricated. Hard copies were not submitted, so unable to verify. Recommended action: Remove cases.
 * 5) Data enterer 78 (Cases 386 - 390): Hard copies were not submitted, so unable to verify.  Recommended action: Remove cases.
 * 6) Data enterer 80 (Cases 396 - 400): Hard copies were not submitted, so unable to verify.  Recommended action: Remove cases.
 * 7) Data enterer 109 (Cases 542 - 546): Several problems including: numeric responses appear in the open-ended variables; ages are single-digits; hours are unrealistically low. Recommended action: Remove cases.
 * Case ID
 * 1) Case 17 - hours add up to less than 24. Estimated total hours seem to be unrealistically low. Recommended action: Change all hour estimates to missing data for this case.
 * 2) Case 20 - hours add up to less than 24. Estimated total hours seem to be unrealistically low. Recommended action: Change all hour estimates to missing data for this case.
 * 3) Case 32 - pss items are all 1. Recommended action: Change all pss items to missing data for this case.
 * 4) Case 33 - hours estimates are missing or very low. Recommended action: Change all hour estimates to missing data for this case.
 * 5) Case 98 -  International04 is entered as 11 and enrol05 is missing. It was probably meant to be entered as 1 for each variable. Recommended action: Either change the 11 to missing data or replace both 04 and 05 with 1 each.
 * 6) Case 122 - hours add up to less than 24. Estimated total hours seem to be unrealistically low. Recommended action: Change all hour estimates to missing data for this case.
 * 7) Cases 126 and 251 - Virtually identical. Recommended action: Remove one or both cases.
 * 8) Case 129 - sleepinghours09 - 18 hours: Only one estimate of hours is provided - for sleeping. And 18 seems to be unrealistic/unlikely. Recommended action: Change sleepinghours09 for this case to missing data.
 * 9) Case 149 - workhours09 - 150 hours: This is over 21 hours per day, 7 days a week, for a full-time student. There are no estimates of hours for other activities. This seems to be an unlikely number of work hours. Recommended action: Change workhours09 for this case to missing data.
 * 10) Case 185 - hours estimates are missing or very low. Recommended action: Change all hour estimates to missing data for this case.
 * 11) Case 285 - hours add up to less than 24. Estimated total hours seem to be unrealistically low. Recommended action: Change all hour estimates to missing data for this case.
 * 12) Case 307 - hours add up to more than 168. Estimated hours are greater than the maximum hours in a week. Recommended action: Change all hour estimates to missing data for this case.
 * 13) Case 339 - hours add up to more than 168. Estimated hours are greater than the maximum hours in a week. Recommended action: Change all hour estimates to missing data for this case.
 * 14) Cases 347 and 358 - Virtually identical. Recommended action: Remove one or both cases.
 * 15) Case 363 - hours estimates are missing or very low. Recommended action: Change all hour estimates to missing data for this case.
 * 16) Case 437 - hours estimates are very low. Recommended action: Change all hour estimates to missing data for this case.
 * 17) Case 466 - tm variables from 06 to 18 are all 8. Seems likely that the respondent didn't think about the last dozen or so questions and just circled 8. Recommended action: Change all tm variables to missing data for this case.
 * 18) Case 470 - hours add up to more than 168. Estimated hours are greater than the maximum hours in a week. Recommended action: Change all hour estimates to missing data for this case.
 * 19) Case 475 - hours add up to more than 168. Estimated hours are greater than the maximum hours in a week. Recommended action: Change all hour estimates to missing data for this case.
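
Two of the recommended actions above are mechanical enough to script: removing whole blocks of cases by data enterer, and setting all hour estimates to missing when their total is below 24 or above 168 (7 x 24). The following is a minimal sketch in plain Python, not the unit's official procedure. It assumes the data is held as a list of dictionaries with missing data stored as `None`, that the case identifier is called `caseid`, and that the six hour variables have the names listed in `HOUR_VARS` (only `workhours09` and `sleepinghours09` are named in the list above; the rest are assumptions). The helper `make_row` and all hour values are invented for the example.

```python
HOURS_IN_WEEK = 7 * 24      # 168
MIN_PLAUSIBLE_TOTAL = 24    # threshold used in the list above

# Assumed variable names; adjust to match the data file.
HOUR_VARS = ["workhours09", "classhours09", "studyhours09",
             "rechours09", "socialhours09", "sleepinghours09"]

# Case ranges recommended for removal in the Data enterer ID list.
REMOVE = [(6, 10), (61, 65), (156, 160), (186, 190),
          (386, 390), (396, 400), (542, 546)]


def drop_cases(rows, case_ranges):
    """Remove rows whose caseid falls inside any inclusive (first, last) range."""
    return [row for row in rows
            if not any(first <= row["caseid"] <= last
                       for first, last in case_ranges)]


def screen_hours(row, hour_vars=HOUR_VARS):
    """Return the row with all hour estimates set to None if their total is implausible."""
    total = sum(row[v] for v in hour_vars if row.get(v) is not None)
    if total < MIN_PLAUSIBLE_TOTAL or total > HOURS_IN_WEEK:
        return {**row, **{v: None for v in hour_vars}}
    return row


def make_row(caseid, *hours):
    """Hypothetical helper: build a row from six hour estimates in HOUR_VARS order."""
    return {"caseid": caseid, **dict(zip(HOUR_VARS, hours))}


# Invented values for illustration; they do not describe the real cases.
rows = [
    make_row(6, 10, 12, 10, 10, 10, 56),    # inside a removed range (6 - 10)
    make_row(11, 10, 12, 15, 10, 10, 56),   # total 113: kept as is
    make_row(12, 2, 3, 2, 1, 1, 8),         # total 17: below 24
    make_row(14, 60, 20, 40, 20, 20, 56),   # total 216: above 168
]

screened = [screen_hours(row) for row in drop_cases(rows, REMOVE)]
for row in screened:
    print(row)
```

A total-based rule like this does not catch everything in the list. A single implausible estimate with no other hours reported (such as 150 work hours) still totals between 24 and 168, so cases like that need to be judged individually.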

<!--
 * Variables
 * 1) sleepinghours09 - estimates that seem too low e.g., because they were estimated daily rather than weekly:
 * 2) Case 7 - 7. Change to 49.
 * 3) Case 8 - 0. Change to missing data.
 * 4) Case 13 - 4. Change to 28.
 * 5) Case 32 - 6. Change to 42.
 * 6) Case 39 - 10. Change to 70.
 * 7) Case 84 - 12. Change to 84.
 * 8) Case 128 - 20. Change to missing.
 * 9) Case 160 - 7. Change to 49.
 * 10) Case 165 - 4. Change to 28.
 * 11) Case 197 - 7. Change to 49.
 * 12) Case 213 - 25. Change to missing.
 * 13) Case 219 - 7.5. Change to 56.
 * 14) Case 228 - 7. Change to 49.
 * 15) Case 303 - 4.5. Change to 31.5.
 * 16) Case 334 - 6. Change to 42.
 * 17) Case 407 - 7. Change to 49.
 * 18) Case 408 - 7. Change to 49.
 * 19) Case 409 - 11. Change to 77.
 * 20) Case 414 - 6.5. Change to 45.5.
 * 21) Case 424 - 7.5. Change to 52.5.
 * 22) Case 425 - 7.5. Change to 56.
 * 23) Case 433 - 20. Change to missing.
 * 24) Case 434 - 20. Change to missing.
 * 25) Case 500 - 25. Change to missing.
 * 26) Case 518 - 18. Change to missing.
 * 27) pss03
 * 28) Case 468 - 999. Recommended action: Change pss03 to missing data.
 * 29) tm10
 * 30) Case 7 - 0. Recommended action: Change tm10 to missing data.
 * 31) tm16
 * 32) Case 75 - 9. Recommended action: Change tm16 to missing data.

Known issues for the 2017 data
This is a list of known data file errors and suggested possible corrections ... so far ...


 * Data enterer
 * 1) Data enterer 36 (Cases 176 - 180): At least some of the cases appear likely to have been fabricated based on similar hand-writing on the hard copies for the open-ended responses -> remove cases
 * 2) Data enterer 71 (Cases 351 - 355): At least some of the cases appear likely to have been fabricated based on similar hand-writing on the hard copies for the open-ended responses -> remove cases
 * 3) Data enterer 75 (Cases 371 - 375): At least some of the cases appear likely to have been fabricated based on similar hand-writing on the hard copies for the open-ended responses -> remove cases
 * 4) Data enterer 98 (Cases 486 - 490): Pattern and style of responses indicate a high likelihood that some responses may have been fabricated -> remove cases


 * Cases
 * 1) Case 052: Lacks meaningful variation -> remove case
 * 2) Case 132: Lacks meaningful variation at least for ztpi items -> remove case
 * 3) Case 135: Largely empty, seemingly abandoned survey -> remove case
 * 4) Case 163: Lacks meaningful variation -> remove case
 * 5) Case 181: Veracity in doubt because PSS out of range values were entered -> remove case
 * 6) Case 244: TMS responses lack internal consistency -> remove case
 * 7) Case 363: Near identical duplicate of Case 403 -> remove case
 * 8) Case 402: Lacks meaningful variation (answered 3 to all ztpi items) -> remove case
 * 9) Case 403: Near identical duplicate of Case 363 -> remove case
 * 10) Case 432: Lacks meaningful variation (answered 3 to all ztpi items) -> remove case
 * 11) Case 535: Lacks meaningful variation -> remove case


 * Variables
 * 1) Age Case 068 (2) -> missing data
 * 2) Time per week estimates need careful scrutiny:
 * 3) The lowest values should be examined (e.g., it seems likely that some responses may have been per day rather than per week estimates). These responses could either be converted to per week estimates or changed to missing data. e.g., Case 070: Sleeping hours (1) -> missing data
 * 4) The highest values should be examined e.g., Case 516 Sleep hours (2417) -> to missing data
 * 5) Total hours (calculate via COMPUTE totalhours09.1=workhours09+classhours09+studyhours09+rechours09+socialhours09+sleepinghours09.) for Cases 121, 319, 379, 380, 387, and 438 exceeds the total hours in a week (168) - which seems unlikely (although technically, some activities could overlap). Nevertheless, it seems that the cases do not have accurate time estimates -> missing data for all the time estimates for these cases
 * 6) Stress
 * 7) Case 340: pss07 - out of range value -> missing data
 * 8) Case 540: pss03 - out of range value -> missing data
 * 9) Time perspective
 * 10) Case 013: ztpi06 - out of range value -> missing data
 * 11) Case 183: ztpi15 - out of range value -> missing data


 * 1) Time management
 * 2) Case 048: tm13 - out of range value -> missing data
 * 3) Case 154: tm10 - out of range value -> missing data

If you find additional errors, post to the LearnOnline discussion, email the unit convener, or edit this page.

Known issues for the 2016 data
This is a list of known data file errors and suggested possible corrections ... so far ...

If you find additional errors, post to the Moodle discussion, email the unit convener, or edit this page.

--><!--
 * 1) Duplicate cases
 * 2) CaseID 28 and 29 have very similar background profiles - remove both cases as their veracity is in doubt
 * 3) CaseID 68 and 642 have very similar responses - remove both cases as their veracity is in doubt
 * 4) CaseID 115 and 483 have responded "6" to all time management items which seems to indicate a lack of thoughtful/genuine response - remove both cases
 * 5) CaseID 133 and 602 have very similar responses - remove both cases as their veracity is in doubt
 * 6) CaseID 134 and 603 have very similar responses - remove both cases as their veracity is in doubt
 * 7) CaseID 182 and 202 have very similar responses - remove both cases as their veracity is in doubt
 * 8) Lack of variation in responses
 * 9) CaseID 443 and 444 have very similar responses to satmost1 satmost2 satleast1 and satleast2 and several other background information variables - remove both cases as their veracity is in doubt
 * 10) CaseID 635 and 636 have very similar responses to satmost1 satmost2 satleast1 and satleast2 and several other background information variables - remove both cases as their veracity is in doubt
 * 11) Inconsistent responses
 * 12) CaseID 305 - Time management responses appear to be contradictory (e.g., high responses to the positively worded items but also high responses to the negatively worded items) → remove case
 * 13) Not a current student
 * 14) CaseID 230 data indicates being 100% complete and not enrolled in any units, and is thus not currently a student → remove case
 * 15) CaseID 581 is not enrolled in any credit points, is 0% complete, and has 0 class hours, and thus appears to not be a current student → remove case
 * 16) Gender:
 * 17) In Variable View, a 3 has been specified as a missing value. This is an error, so in Variable View, change Missing values for Gender to None.
 * 18) Faculty:
 * 19) There is a 0 (→ missing data) and a 9 (→ missing data).
 * 20) Enrol:
 * 21) There are a hodge-podge of responses which are meant to be in credit points.
 * 22) Multiples of 3cp are the most frequent responses and make sense (although 24 cp probably means 12 cp - a full-time load).
 * 23) The other responses need to be each considered. For example, it is quite possible that the 4s were intended to mean 4 units (i.e., a full-time load of 12cp). So, these could be changed. In other cases, consider replacing with missing data.
 * 24) Hours:
 * 25)  There are some implausible high responses to the number of work, study, and class hours. Decide what is a maximum feasible/plausible value for hours per week in each of these activities and either make higher values missing or recode them to the maximum plausible value.
 * 26) Satisfaction:
 * 27) satis02: CaseID 412 is 40 → sysmis
 * 28) satis30: CaseID 151 is 95 → sysmis
 * 29) satis30: CaseID 133 is 85 → sysmis
 * 30) satis37: CaseID 335 is 13 → sysmis
 * 31) General life satisfaction:
 * 32) gls01: Case ID 96 is 8 → sysmis
 * 33) General health and well-being:
 * 34) ghwb04: Case ID 95 is 8 → sysmis
 * 35) ghwb08: Case ID 401 is 8 → sysmis
 * 36) Gender identity
 * 37) Case ID 93 genderidentity is 6 → sysmis

Known issues for the 2015 data
This is a list of known data file errors and suggested possible corrections. ... so far ...

If you find additional errors, post to the Moodle discussion, email the unit convener, or edit this page.


 * By variable
 * 1) Age
 * 2) Case 336 has an age of 2 - suggest replacing with missing data
 * 3) International student
 * 4) Case 3 has an international student value of 11 - suggest changing to 1
 * 5) Faculty
 * 6) Case 119 has a faculty of 12 - suggest changing to missing data
 * 7) Credit points (enrol)
 * 8) Case 354 has an excessive number of credit points (112) - may be a typo, so could be changed to 12 (the mode) or missing data.
 * 9) Case 315 has a large number of credit points (72 - the number of points for 6 full-time semesters) - change to missing data.
 * 10) Case 128 has a large number of credit points (42 - this would be 13 units in a semester) - change to missing data.
 * 11) Cases 37, 291, 565 and 595 have a large number of credit points (24 - this is the full-time load for 2 semesters) - change to either 12 (the mode, full-time load for a semester) or missing data
 * 12) There are 17 cases with 4 credit points (including Case 192 with 3.7 credit points). This is an unusual number of credit points. It seems quite likely that these respondents meant 4 units (12 credit points), a full-time load. Change these values to either 12 (the mode) or missing data.
 * 13) There are some responses indicating 7 or 8 credit points - change to missing data.
 * 14) Case 576 indicated 20 credit points - seems unlikely as most units are 3 credit points - change to missing data.
 * 15) Cases 11, 15, 27, and 224 indicated 16 credit points - seems unlikely as most units are 3 credit points - change to missing data.
 * 16) Case 23 indicated 14 credit points - seems unlikely as most units are 3 credit points - change to missing data.
 * 17) Case 358 indicated 10 credit points - seems unlikely as most units are 3 credit points - change to missing data.
 * 18) Case 490 indicated 2 credit points - this is probably meant to be 2 units - change to 6 credit points or missing data.
 * 19) Case 280 indicated 1 credit point - this is probably meant to be 1 unit - change to 3 credit points or missing data.
 * 20) There are 3 cases with 0 credit points - this is quite possible - they are students who simply haven't enrolled in any units this semester.
 * 21) There are 23 cases (initially) with missing data for credit points. These respondents may not have known how many credit points they are enrolled in or simply opted not to answer.
 * 22) Hours
 * 23) Cases 12, 263, 454, 645, and 647 have seemingly excessive estimates of weekly work, study, and/or classes hours - probably best to change these values to missing data.
 * 24) Cases 389 and 454 (80 hours of study per week), cases 33, 282, 290 and 479 (70 hours of study per week), cases 13 and 407 (80 and 84 hours of work per week respectively), cases 25 and 263 (56 hours of classes per week), and possibly some other cases have high but plausible (or at least possible) estimates. However, these do appear to be outlying values as they are ~20 hours greater than the next highest estimates. If any of the hours variables are to be used as IVs in MLRs, these could be (overly) influential cases and it may be appropriate to recode these values to, say, the next highest values.
 * 25) Genderidentity - there are several scores of 6 and one score of 8 (the scale was from 1-5) (Cases 171, 173, 203, 615, 545) - change to missing data.
 * 26) Satisfaction
 * 27) Case 48 satis34 is 0 - suggest changing to missing data
 * 28) Case 296 satis32 and satis35 are 66 and - suggest changing to 6
 * 29) Case 544 satis05 is 64 - suggest changing to missing data
 * 30) Case 620 satis32 is 77 - suggest changing to 7
 * 31) Case 645 satis18 is 0 - suggest changing to missing data
 * 32) Case 668 satis15 is 20 - suggest changing to missing data
 * 33) GLS
 * 34) Case 334 gls05 is 8 - out of range - suggest changing to missing data
 * 35) Case 541 gls01 is 8 - out of range - suggest changing to missing data
 * 36) Case 614 gls04 is 54 - out of range - look at the intended sequence - suggest changing gls02, gls03 and gls04 to 5, 5, 4 respectively
 * 37) Time management
 * 38) Case 167 does not have a logical set of time management responses (e.g., 8 for all of the reverse-coded items indicates very poor time management, but most of the positively worded items are rated very positively) - suggest deleting the case.
 * 39) Case 345 does not have a logical set of time management responses (e.g., 1 or 2 for all of the reverse-coded items indicates very good time management, but most of the positively worded items are 1 or 2, indicating very poor time management) - suggest deleting the case.
 * 40) Case 406 tm17 is 0 - out of range - suggest changing to missing data
 * 41) Case 475 does not have a logical set of time management responses (e.g., low scores for the reverse-coded items indicate very good time management, but most of the positively worded items have low scores too, indicating very poor time management) - suggest deleting the case.
 * 42) Cases 679 to 683 are missing all time management variables. In addition, no hard copies of these surveys were submitted, thus the veracity of the data may be doubtful - suggest removing these cases.
 * 43) Missing data
 * 44) 576 is missing one page on student satisfaction, but seems to have reasonable responses for the other variables, so there is no particular reason to remove the case
 * 45) 673 has missing data in patches, but seems to have reasonable responses for many variables, so there is no particular reason to remove the case

-->
 * By case
 * 1) Cases 51 to 55 are under investigation for possible violation of academic integrity (submitted surveys appeared to have the same handwriting; fabrication suspected) - removal of these cases is recommended.
 * 2) Cases 159 to 163 are under investigation for possible violation of academic integrity (submitted surveys appeared to have the same handwriting; fabrication suspected) - removal of these cases is recommended.
 * 3) Cases 169 to 173 - veracity is doubtful (e.g., there are no satisfaction item scores above 7, there are some out of range values for genderidentity, and very brief responses are consistently provided for the open-ended questions) - removal of these cases is recommended.
 * 4) Cases 194 to 198 are under investigation for possible violation of academic integrity (e.g., Cases 197, 259, and 265 are near duplicates) - removal of these cases is recommended.
 * 5) Cases 259 to 263 are under investigation for possible violation of academic integrity (e.g., Cases 197, 259, and 265 are near duplicates) - removal of these cases is recommended.
 * 6) Cases 264 to 268 are under investigation for possible violation of academic integrity (e.g., Cases 197, 259, and 265 are near duplicates) - removal of these cases is recommended.
 * 7) Case 291 has several sets of consecutive responses with the same score which seems to suggest that the quality of response is doubtful - suggest removing the whole case, although the case could be kept and some of the more doubtful data could be made missing.
 * 8) Cases 444 to 448 are under investigation for possible violation of academic integrity (submitted surveys appeared to have the same handwriting; fabrication suspected) - removal of these cases is recommended. These also have missing data or very minimal/similar entries for the open-ended variables.
 * 9) Cases 549-550 have 6 for each of the time management items. In addition, 549 has several unlikely runs of consecutive responses for student satisfaction items. The veracity of these cases is doubtful - removal of these cases is recommended.
 * 10) Case 550 has 6s for all time management items. Given that both 549 and 550 are from the same researcher and both have 6s for all time management items, it may be best to doubt the veracity of both cases.
 * 11) Case 608 seems unlikely to be valid - satis items are all 10 and time management items are all 3 - suggest removing this case.
 * 12) Case 620 has little variation in response within multi-item sections - suggest deleting the case.
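
Two of the patterns flagged above, the same score given to every item in a section and pairs of near-duplicate cases, can be searched for automatically before inspecting cases by eye. The sketch below is one possible way to do this in plain Python. It assumes the data is a list of dictionaries with missing data stored as `None`, that the 18 time management items are named `tm01` to `tm18`, and that 90% agreement is a reasonable cut-off for "near duplicate". The cut-off, the helper `make_case`, and the example responses are all invented for illustration. A flag is only a prompt to look at the case, not proof of a problem.

```python
from itertools import combinations

# Assumed item names for the 18 time management variables.
TM_ITEMS = [f"tm{n:02d}" for n in range(1, 19)]


def is_straightlined(row, items):
    """True if all non-missing responses to the items are identical (needs at least 2)."""
    answers = [row[i] for i in items if row.get(i) is not None]
    return len(answers) >= 2 and len(set(answers)) == 1


def agreement(row_a, row_b, variables):
    """Proportion of variables on which two cases share the same non-missing value."""
    same = sum(1 for v in variables
               if row_a.get(v) is not None and row_a.get(v) == row_b.get(v))
    return same / len(variables)


def near_duplicates(rows, variables, threshold=0.9):
    """Return (caseid, caseid) pairs whose agreement is at or above the threshold."""
    return [(a["caseid"], b["caseid"])
            for a, b in combinations(rows, 2)
            if agreement(a, b, variables) >= threshold]


def make_case(caseid, answers):
    """Hypothetical helper: build a row from 18 time management responses."""
    return {"caseid": caseid, **dict(zip(TM_ITEMS, answers))}


# Invented responses for illustration only.
varied = [(n % 8) + 1 for n in range(18)]   # 1..8 repeating
rows = [
    make_case(1, varied),
    make_case(2, [6] * 18),                 # same answer to every item
    make_case(3, varied[:-1] + [8]),        # differs from case 1 on one item
]

straightlined = [r["caseid"] for r in rows if is_straightlined(r, TM_ITEMS)]
duplicates = near_duplicates(rows, TM_ITEMS)
print("Same response throughout:", straightlined)
print("Near duplicates:", duplicates)
```

Comparing every pair of cases is slow for very large files but is fine for a few hundred survey responses.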

Perceived Stress Scale

 * 1) pss04, pss05, pss07, and pss08 need recoding prior to analysis so that higher scores represent higher stress for all the pss variables.

Time Management Skills

 * 1) tm07, tm09, tm11, tm13, tm18 need recoding prior to analysis.
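
Assuming "recoding" here means reverse-scoring, the usual rule is new = lowest + highest - old, so that on a 1 to 8 scale an 8 becomes 1 and a 3 becomes 6. The sketch below applies that rule in plain Python to the items listed above. The response ranges are assumptions: 1 to 8 for the time management items (the list of known issues refers to an 8-point scale) and 1 to 5 for the PSS items, which is purely illustrative. Check both against the survey before use. Missing data is assumed to be stored as `None`, and the example row is invented.

```python
PSS_REVERSED = ["pss04", "pss05", "pss07", "pss08"]
TM_REVERSED = ["tm07", "tm09", "tm11", "tm13", "tm18"]


def reverse_code(row, items, low, high):
    """Return a copy of row with each listed item flipped: new = low + high - old.

    Missing values (None) are left alone. Out-of-range values raise an error,
    because they should have been dealt with during screening.
    """
    recoded = dict(row)
    for item in items:
        value = row.get(item)
        if value is None:
            continue
        if not low <= value <= high:
            raise ValueError(
                f"{item}={value} is outside {low}-{high}; screen before recoding")
        recoded[item] = low + high - value
    return recoded


# Invented row for illustration only.
row = {"caseid": 1, "pss04": 1, "pss05": None, "tm07": 8, "tm09": 3}

# Assumed ranges: PSS 1-5 (illustrative), time management 1-8.
recoded = reverse_code(row, PSS_REVERSED, low=1, high=5)
recoded = reverse_code(recoded, TM_REVERSED, low=1, high=8)
print(recoded)
```

Writing the recoded values to a copy rather than over the original row keeps the screened data intact, so recoding cannot accidentally be applied twice.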