Real-World Issues that Impact Inferences that Can Be Made from Samples
Most textbooks ignore or gloss over some important issues that must be addressed when statistics are used to make inferences from a sample to a population. Failure to adequately address these issues can result in reporting misleading results. A few of these issues are addressed below.
Most textbooks emphasize the importance of a probability sample (initially focusing on a simple random sample). Yet few clearly communicate all of the steps involved in selecting and collecting such a sample. For simplicity, the following discussion will focus on selecting a simple random sample from a finite population.
In order to make a statistical inference from a sample, the sample needs to be a probability sample selected from a population. There are multiple types of probability sampling methods (e.g., simple random sampling, cluster sampling, stratified random sampling). These are usually discussed in textbooks--with simple random sampling being the focus in most introductory courses.
To obtain a simple random sample, each member of the population must have an equal chance of being selected (more generally, a probability sample requires that each member have a known, nonzero chance of selection). This requires the ability to identify every member of the population. The "idea" of the population is usually fairly easy to express. We call this "idea" the target population. Said another way, the target population represents the group that we want to study.
In practical applications, identifying the members of the population is often problematic, or at least not completely straightforward. There are often multiple ways to attempt to identify the members of the population--and the different approaches do not always include the same members. The "list" that attempts to identify the members of the target population forms what is called a sampled population (or a frame). Said another way, the sampled population is the list that can be used to identify individual elements (members) of the population that could be selected for the sample--that could provide data. There can be multiple candidate sampled populations for any target population. The choice among them is based on how closely each one matches the target population. Subject-matter expertise is necessary to identify the sampled population that most closely matches the "idea" expressed in the target population. [One trick for identifying a sampled population for a finite group is to find a way to justify the answer to the following question: "How many members of the population exist?" If you can answer this question by saying, "##, because the list I obtained from source X contained that many items," you could consider that list as a sampled population--i.e., the list that you will use to select your sample.]
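The comparison between a candidate list and the target population can be sketched with set operations. The following is only a minimal illustration in Python: the identifiers and the function name `coverage_report` are hypothetical, and in a real study the full target population is usually not known, which is why subject-matter expertise is needed to judge the match.

```python
def coverage_report(target, frame):
    """Compare a candidate frame (list) with the target population.

    Both arguments are collections of identifiers. In real studies the
    target set is usually unknown; it appears here only to illustrate
    the kinds of mismatch that can occur.
    """
    target, frame = set(target), set(frame)
    return {
        "covered": target & frame,        # in the target and on the list
        "undercoverage": target - frame,  # in the target but missing from the list
        "overcoverage": frame - target,   # on the list but not in the target
    }


# Hypothetical identifiers, made up for illustration only.
target_population = {"S1", "S2", "S3", "S4", "S5"}
frame_a = {"S1", "S2", "S3", "P9"}        # misses S4 and S5; includes P9
frame_b = {"S1", "S2", "S3", "S4", "P9"}  # misses only S5; includes P9

for name, frame in [("A", frame_a), ("B", frame_b)]:
    report = coverage_report(target_population, frame)
    print(name, {key: sorted(value) for key, value in report.items()})
```

In this made-up case, list B would be preferred because it leaves out fewer members of the target population while including the same extra item.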
Once a sampled population is identified, the sample is drawn from that group using a well-defined sampling method, data are collected, a sample statistic is calculated, and a statistical inference is provided. The sample consists of the data obtained from the members of the sampled population that actually respond. [Note: Statistical inference assumes that all members selected for the sample will be located and will respond.] The statistical inference (generalization) applies to the population that produced the sample (therefore, the statistical inference is to the sampled population).
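As a rough sketch of the selection step, a simple random sample can be drawn from a list with Python's standard `random.sample`, which selects distinct items without replacement. The frame, the prices, and the function name `draw_simple_random_sample` below are invented for illustration; in a real study, data would be collected only from the members that were selected.

```python
import random
import statistics


def draw_simple_random_sample(frame, n, seed=None):
    """Select n distinct members of the frame; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(list(frame), n)


# Hypothetical frame: station identifiers mapped to made-up prices.
# (The full set of prices is shown only so the example can run.)
frame_prices = {
    f"station_{i:03d}": round(3.00 + 0.01 * (i % 40), 2) for i in range(200)
}

selected = draw_simple_random_sample(sorted(frame_prices), n=25, seed=1)
sample_prices = [frame_prices[station] for station in selected]
sample_mean = statistics.mean(sample_prices)  # the sample statistic

print(len(selected), round(sample_mean, 3))
```

The resulting mean is an estimate for the list that was sampled, not automatically for the target population.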
Statistical inferences are accompanied by a measure of sampling error. Sampling error helps explain differences between the observed sample statistic and the associated population parameter that can be accounted for by the way that the sample was selected (the fact that you looked at a subset of the population). Sampling error does not account for every error that can occur when data are collected. In fact, sampling error does not capture the impact of errors that could happen even if you attempted to collect data on each member of the sampled population.
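One common measure of sampling error for a sample mean is its estimated standard error. The sketch below assumes a simple random sample drawn without replacement from a finite list of size N and applies the usual finite population correction (1 - n/N); the prices and the function name `standard_error_of_mean` are made up for illustration.

```python
import math
import statistics


def standard_error_of_mean(sample, population_size=None):
    """Estimated standard error of the sample mean under simple random sampling.

    If population_size (N) is given, the finite population correction
    (1 - n/N) for sampling without replacement is applied.
    """
    n = len(sample)
    s2 = statistics.variance(sample)  # sample variance, divisor n - 1
    variance_of_mean = s2 / n
    if population_size is not None:
        variance_of_mean *= 1 - n / population_size
    return math.sqrt(variance_of_mean)


# Made-up prices for eight sampled stations.
prices = [3.19, 3.25, 3.09, 3.29, 3.15, 3.22, 3.31, 3.18]

se_without_fpc = standard_error_of_mean(prices)
se_with_fpc = standard_error_of_mean(prices, population_size=200)
print(round(se_without_fpc, 4), round(se_with_fpc, 4))
```

Note that this quantity reflects only the variability that comes from observing a subset of the list. It becomes zero when every member of the list is observed, even though other kinds of error can still be present.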
Examples of errors that are not included in any estimate of sampling error include: non-response (identifying a member of the population to sample and that member of the population not responding); a mismatch between the sampled population and the target population (identifying a member of the sampled population as a member of the population and collecting data from that "member" when, in reality, they are not part of the target population OR failing to identify an actual member of the target population when the sampled population is determined and, as a consequence, not allowing that member to have a chance to be selected for the sample); collecting data in one way when a different method is appropriate (collecting data face-to-face when privacy is an issue); inaccurately recording data (intentional or not); lack of operational definitions; and inaccurate answers from a member selected for the sample (intentional or not).
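A small simulation can illustrate why non-response is not captured by sampling error. The sketch below is built entirely on assumptions chosen for illustration: the hours worked are simulated, and the response rule (students who work more hours are less likely to respond) is hypothetical.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical sampled population: weekly hours worked by 2,000 students.
hours = [max(0.0, rng.gauss(15, 8)) for _ in range(2000)]
population_mean = statistics.mean(hours)


def respondent_mean(sample):
    """Mean among respondents only.

    Assumed response rule (illustration only): the chance of responding
    falls as hours worked rise, but never drops below 10%.
    """
    respondents = [h for h in sample if rng.random() < max(0.1, 1 - h / 40)]
    return statistics.mean(respondents)


# Repeat the study 500 times, with and without non-response.
full_response_means = [statistics.mean(rng.sample(hours, 100)) for _ in range(500)]
respondent_means = [respondent_mean(rng.sample(hours, 100)) for _ in range(500)]

print("population mean:         ", round(population_mean, 2))
print("average, full response:  ", round(statistics.mean(full_response_means), 2))
print("average, respondents only:", round(statistics.mean(respondent_means), 2))
```

Under these assumptions, the means based on complete response center on the population mean, while the respondent-only means are systematically too low. A standard error computed from the respondents would describe the spread of those means but not their offset.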
Two examples will be used to illustrate the concepts:
Example 1: Suppose you wanted to select a random sample of gas stations in the state of Georgia in order to estimate the average price of a gallon of regular gas.
Example 2: Suppose you wanted to select a random sample of business students at North Georgia College & State University in order to estimate the average number of hours per week these students work.
| Concept/Topic | Example 1: Gas Stations | Example 2: Students |
| Target population | All gas stations in the state of Georgia | All business students at North Georgia |
| Characteristic of interest | price of a gallon of regular gas | number of hours per week students work |
| Sampled population | 1) a list of gas stations obtained from the State Department of Agriculture (the agency that certifies pumps) 2) a list of gas stations obtained from the State Department of Revenue (from tax collection records) 3) a list of gas stations obtained from each county clerk's office based on business licenses 4) a list of gas stations compiled from the yellow pages (of major phone companies providing service in Georgia) [or the electronic counterpart] 5) a list of gas stations compiled from a website where individuals can post information about gas prices | 1) a list obtained from the Registrar's Office - on a given day or at a given point in the semester 2) a list obtained from the School of Business Record's Coordinator based on advisor assignments 3) a list obtained by combining names of students on the rolls of courses taught by the School of Business 4) a list obtained by listing all students in the yearbook who list business as a major |
| Differences between the sampled population and the target population | 1) new stations may not have been added; closed stations may not have been deleted; private pumps may be included 2) same issues plus taxes may be grouped from several stations owned by the same company/individual 3) same issues as the first case plus some stations may be part of another company (e.g., Kroger, Sam's) and not be identified as a gas station 4) same issues as the first case plus limited to stations with landline phones and that request/pay to have their number in the yellow pages; may also miss some phone books 5) essentially this is a self selecting sample rather than any kind of population; these sites provide names and prices that individuals voluntarily select to place on the site | 1) is based on the major the student has entered in Banner at some point in time (usually prior to Drop/Add for the given semester); some students may have changed majors since then but not have them show up; others may have made the decision to change majors, but not told the university 2) students are more likely to tell the School of Business when they change into Business than when they change out of Business 3) this list includes many students who are not Business majors but are taking such courses as ACCT 2101, ECON 2105, MGMT 3661, MKTG 3700; the list also fails to include freshmen Business majors who are not yet taking Business courses 4) this list omits a large number of students since many do not have their picture taken for the yearbook |
| "Best" sampled population (from those suggested) | The first option is probably best since all pumps are expected to be inspected and elimination of private pumps should be easy to address. All of the other options include the same differences between the sampled population and the target population plus some additional differences. Therefore, the first option appears to be the closest to identifying all of the gas stations in Georgia without too much "extra." | A combination of the first two options probably does the best to identify all students who currently consider themselves as Business majors. Unfortunately, this list will probably include some students who were, but are no longer, Business majors. The last two options produce lists that would be expected to differ substantially from the target population. |
| Parameter of interest | Average price of a gallon of regular gas in Georgia (for all stations) | Average number of hours per week worked by all business students at North Georgia |
| Sample (this is where we are talking about data) | The gas stations (actually the prices at the gas stations) that are actually selected to be included in the study. | The students (actually the hours per week worked by students) that are actually selected to be included in the study. |
| Statistic to use for estimate | Average price of a gallon of regular gas in a random sample selected from the sampled population | Average number of hours per week worked by the business students in a random sample selected from the sampled population |
| Examples of errors that are not included in sampling error | a) the method of collecting the data may impact the responses (phone calls can be made in a shorter period of time than visiting stations to observe prices--since prices change often, prices are more likely to change while data are collected if the collection process takes more time) b) data may be recorded incorrectly (numbers transposed, rounded, typo, etc.) c) regular gas has not been operationally defined (is regular 85, 87, or 89 octane?) d) some stations selected for the sample may no longer be in business (so there is a mismatch between the sampled and target populations) e) some people asked for information may fail to respond f) some stations charge more than one price for regular gas (e.g., one price for members and another price for non-members); there may not be uniform reporting of the price at such stations | a) the method of collecting the data may impact the responses (face-to-face question asked by career placement or by another student) b) the responses may be influenced by the perceived use of the data (selection for job recommendation vs. recommendation for number of hours of academic credit) c) data may be recorded incorrectly d) "worked" is not operationally defined (is a student worker counted? what about work for a non-profit agency where the student is not paid?) e) some individuals questioned may no longer be business majors f) some people may fail to respond g) some people may provide false information h) some people may respond with information about a typical week while others may respond with information about the most recent week or an extreme week |