Data Collection
Grayson White
Math 141
Week 3 | Fall 2025
😢 “I have no idea how to do this problem.”
→ Ask someone to point you to a similar example from the lecture, handouts, and guides.
→ Talk it through with a course assistant, a fellow Math 141 student, or Grayson so together we can verbalize the process of going from Q to A.
😡 “I am getting a weird error but really think my code is correct/on the right track/matches the examples from class.”
→ It is time for a second pair of eyes. Don’t stare at the error for over 10 minutes.
🤩 And lots of other times too! 😬
Remember:
→ Struggling is part of learning.
→ But let us help you ensure it is a productive struggle.
→ Struggling does NOT mean you are bad at stats. It actually means you are doing the work to learn the material!
Key questions:
Census: We have data on the whole population!
Key questions:
Sampling bias: When the sampled units are systematically different from the non-sampled units on the variables of interest.
The Literary Digest was a political magazine that correctly predicted the presidential outcomes from 1916 to 1932. In 1936, they conducted the most extensive (to that date) public opinion poll. They mailed questionnaires to over 10 million people (about 1/3 of US households) whose names and addresses they obtained from telephone books and vehicle registration lists.
Population of Interest:
Sample:
Sampling bias:
Use random sampling (a random mechanism for selecting cases from the population) to remove sampling bias.
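To see why sample size alone does not fix sampling bias, here is a minimal simulation sketch. Everything in it is hypothetical: the population size, the share of phone owners, and the support rates are made-up numbers chosen for illustration, not the 1936 poll data.

```python
import random

random.seed(141)

# Hypothetical population of 100,000 voters; vote = 1 means "supports candidate A".
# Assumption for illustration only: phone owners (about 25% of people) support A
# at a lower rate (40%) than everyone else (65%).
population = []
for _ in range(100_000):
    has_phone = random.random() < 0.25
    p_support = 0.40 if has_phone else 0.65
    vote = 1 if random.random() < p_support else 0
    population.append((has_phone, vote))

true_share = sum(vote for _, vote in population) / len(population)

# Biased sample: very large, but drawn only from phone owners.
phone_owner_votes = [vote for has_phone, vote in population if has_phone]
biased_sample = random.sample(phone_owner_votes, 20_000)
biased_share = sum(biased_sample) / len(biased_sample)

# Simple random sample: much smaller, but every voter is equally likely to be chosen.
srs = random.sample(population, 1_000)
srs_share = sum(vote for _, vote in srs) / len(srs)

print(f"True share supporting A:      {true_share:.3f}")
print(f"Biased sample (n = 20,000):   {biased_share:.3f}")
print(f"Simple random sample (1,000): {srs_share:.3f}")
```

In this made-up setting the small random sample lands much closer to the truth than the large biased one, because the sampled units differ systematically from the non-sampled units on the variable of interest.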
Simple random sampling
Cluster sampling
Stratified random sampling
Systematic sampling
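The four designs above can be sketched on a toy sampling frame. This is only an illustration: the frame of 1,000 plots in 10 counties, the sample size of 50, and all variable names are hypothetical.

```python
import random

random.seed(141)

# Hypothetical sampling frame: 1,000 plots, 100 in each of 10 counties.
frame = [{"plot": i, "county": i // 100} for i in range(1000)]
n = 50

# Simple random sampling: every set of n plots is equally likely.
srs = random.sample(frame, n)

# Group the frame by county for the designs that use groups.
by_county = {}
for unit in frame:
    by_county.setdefault(unit["county"], []).append(unit)

# Stratified random sampling: take a simple random sample within EVERY county.
stratified = []
for units in by_county.values():
    stratified.extend(random.sample(units, n // len(by_county)))

# Cluster sampling: randomly select a few counties, then sample plots within them.
sampled_counties = random.sample(sorted(by_county), 2)
cluster = []
for county in sampled_counties:
    cluster.extend(random.sample(by_county[county], n // 2))

# Systematic sampling: pick a random start, then take every k-th plot.
k = len(frame) // n
start = random.randrange(k)
systematic = frame[start::k]

for name, sample in [("SRS", srs), ("Stratified", stratified),
                     ("Cluster", cluster), ("Systematic", systematic)]:
    counties = sorted({unit["county"] for unit in sample})
    print(f"{name:<11} n = {len(sample)}, counties represented: {counties}")
```

Notice that the stratified sample always covers all 10 counties, while the cluster sample covers only the 2 counties that were selected.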
Why aren’t all samples generated using simple random sampling?
Mission: “Make and keep current a comprehensive inventory and analysis of the present and prospective conditions of and requirements for the renewable resources of the forest and rangelands of the US.”
Need a random sample of ground plots to say something about the state of our nation’s forests!
Thoughts on this sampling design?
Thoughts on this sampling design?
Subsampling within each sampled cluster is much more common than measuring every unit in each sampled cluster!
Are our clusters based on counties homogeneous?
Why is homogeneity important for cluster sampling?
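One way to explore this question is a small simulation. The sketch below assumes "homogeneous" means the clusters resemble one another, and it uses two hypothetical populations with made-up values. It repeats a cluster sample many times and compares how much the estimates vary.

```python
import random
import statistics

random.seed(141)

def make_population(cluster_means, size=200, noise_sd=5.0):
    """Build a hypothetical population as a list of clusters of values."""
    return [[random.gauss(m, noise_sd) for _ in range(size)] for m in cluster_means]

def cluster_estimates(clusters, n_clusters=2, n_per_cluster=20, reps=2000):
    """Repeat a cluster sample (with subsampling) and return the sample means."""
    estimates = []
    for _ in range(reps):
        chosen = random.sample(clusters, n_clusters)
        values = []
        for cluster in chosen:
            values.extend(random.sample(cluster, n_per_cluster))
        estimates.append(statistics.mean(values))
    return estimates

# Clusters that resemble one another: every cluster is centered at 50.
similar = make_population([50] * 10)
# Clusters that differ a lot: cluster centers run from 5 to 95 (overall center 50).
different = make_population([5 + 10 * i for i in range(10)])

sd_similar = statistics.stdev(cluster_estimates(similar))
sd_different = statistics.stdev(cluster_estimates(different))

print(f"SD of estimates, similar clusters:   {sd_similar:.2f}")
print(f"SD of estimates, different clusters: {sd_different:.2f}")
```

When the clusters look alike, it matters little which ones get selected. When they differ a lot, the estimate swings widely depending on which clusters happen to be drawn.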
Thoughts on this sampling design?
This is FIA’s actual sampling design (okay, slightly simplified).
Why is this design better than simple random sampling?
Mission: “Assess the health and nutritional status of adults and children in the United States.”
How are these data collected?
Stage 1: US is stratified by geography and distribution of minority populations. Counties are randomly selected within each stratum.
Stage 2: From the sampled counties, city blocks are randomly selected. (City blocks are clusters.)
Stage 3: From sampled city blocks, households are randomly selected. (Households are clusters.)
Stage 4: From sampled households, people are randomly selected. For the sampled households, a mobile health vehicle goes to the house and medical professionals take the necessary measurements.
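The four stages can be sketched as nested random selections. This is a simplified illustration, not the survey's actual procedure: the frame, the labels, and the number of units selected at each stage are all hypothetical.

```python
import random

random.seed(141)

def build_frame():
    """Hypothetical nested frame: strata -> counties -> blocks -> households -> people."""
    frame = {}
    for s in range(3):
        counties = {}
        for c in range(6):
            blocks = {}
            for b in range(8):
                households = {}
                for h in range(10):
                    people = [f"s{s}-c{c}-b{b}-h{h}-person{p}" for p in range(4)]
                    households[f"house{h}"] = people
                blocks[f"block{b}"] = households
            counties[f"county{c}"] = blocks
        frame[f"stratum{s}"] = counties
    return frame

frame = build_frame()
sampled_people = []

for stratum, counties in frame.items():
    # Stage 1: within EVERY stratum, randomly select counties.
    for county in random.sample(sorted(counties), 2):
        blocks = counties[county]
        # Stage 2: within each sampled county, randomly select city blocks.
        for block in random.sample(sorted(blocks), 3):
            households = blocks[block]
            # Stage 3: within each sampled block, randomly select households.
            for house in random.sample(sorted(households), 2):
                # Stage 4: within each sampled household, randomly select people.
                sampled_people.extend(random.sample(households[house], 1))

print(f"Sampled {len(sampled_people)} people, for example: {sampled_people[:3]}")
```

Only the first stage is stratified (every stratum is used). The later stages are cluster stages, so field staff only need to visit the sampled counties, blocks, and households.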
Why don’t they use simple random sampling?
“Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations.” – Committee on Professional Ethics of the American Statistical Association (ASA)
The ASA has created “Ethical Guidelines for Statistical Practice”
→ These guidelines are for EVERYONE doing statistical work.
→ There are ethical decisions at all steps of the Data Analysis Process.
→ We will periodically refer to specific guidelines throughout this class.
“Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical.”
“The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research.”
Why do you think the Age variable maxes out at 80?
“Protects the privacy and confidentiality of research subjects and data concerning them, whether obtained from the subjects directly, other persons, or existing records.”
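One common way to protect privacy in released data is top-coding: values above a cutoff are recorded as the cutoff, so very old (and therefore rare and more identifiable) respondents cannot be singled out. A minimal sketch, with a hypothetical `top_code` helper and made-up ages:

```python
def top_code(ages, cap=80):
    """Record every age above the cap as the cap itself."""
    return [min(age, cap) for age in ages]

# Hypothetical ages, not real survey data.
ages = [34, 67, 80, 85, 93]
released = top_code(ages)
print(released)  # [34, 67, 80, 80, 80]
```

The trade-off is that an age of 80 in the released data really means "80 or older", which matters when summarizing the Age variable.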