Data, Information, & Knowledge
in BUS 110
Learning objectives. Students will be able to:
- Identify facts as either data, information, or knowledge
- Discuss ways in which data can be relevantly transformed into information
- Describe the integrity and usefulness of a dataset
- Apply statistical methods to a dataset to make meaningful conclusions about a business
Presentation
Outline
- Exploratory Analysis
- Organizing Data - Graphical Displays
- [Analyzing Data] (https://ramnauth.github.io/bus%20110/2019/02/14/data/#analyzing-data) using center, spread, shape, clusters, gaps, modes, and outliers.
- Planning & Conducting Studies
- Summary
Exploratory Analysis
In today’s technologically advanced world, everything is data. Data is a collection of facts that is meaningless without the context or information that makes that data relevant. As the famous statistician, Nate Silver wrote in his book The Signal and the Noise, “The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning.” When we give data meaning and context, it becomes information. When we apply this information to a business, for example, it becomes business knowledge.
Organizing Data - Graphical Displays
There are a variety of ways to imbue data with meaning. Statistics is the mathematical science including methods of collecting, organizing, and analyzing data in such a way that meaningful conclusions can be drawn from them. To begin, most all forms of data can be put into tables, but these arrays of bare figures tend to be spiritless and perhaps even forbidding:
Some form of graphical display is often best for seeing patterns and shapes and for presenting an immediate summary of the data. Among the most common visual representations of data are bar charts, histograms, and stem plots.
Bar charts are useful when representing categorical or qualitative variables; these are “buckets” or categories for which each individual belongs. This contrasts with quantitative variables which are usually of numerical values. Sizes can be measured as frequencies or percentages.
Example 1.1: In a survey taken a week before Valentine’s Day, 25 participants recorded which color of flowers were they expecting to buy for their Valentine. 5 individuals expected to buy pink flowers, 1 expected purple, 8 expected white, 8 expected red, and 3 expected to buy yellow flowers. Above are two bar charts illustrating the frequencies and percentages of each flower type. What conclusions can you draw from these charts?
Dot plots can be used with categorical or quantitative variables. There are several kinds of dot plots, but the type demonstrated below is informally called the “Kindergarten” dot plot. It is built by adding a dot to the chart every time you encounter the given value in a set of numbers. See another value, dip your finger in paint and put another blot on the paper. In dot plots that deal with larger quantities, a single dot can represent a certain amount of records.
Example 1.2: A dot plot showing the age of 38 children in a school daycare:
Histograms are usually a better representation of the distribution of your data. The simple bars allow you to easily find the values without having to count all the dots. Unlike bar graphs, both axes in a histogram are of numerical values, representing quantitative variables. Notice that a histogram uses “buckets” such as in a bar graph, but these bins represent ranges as opposed to categorical values.
Example 1.3: In a survey taken after Valentine’s Day, 25 participants recorded how many days before Valentine’s Day they bought their valentine a gift as well as how much they spent on the gift. How does the above histogram compare to the histogram below? How does the size of the bins affect our interpretation of the data?
A scatterplot is a type of dot plot in which two quantitative variables are related with a series of dots (or markers).
Example 1.4:
Analyzing Data
Center & Spread
In order to interpret data in a graphical display, we must notice two essentials aspects of the overall pattern:
- The center, which separates the values. In the case of a histogram, we can determine this mathematically by dividing the area under the curve by two.
- The spread, that is, the scope of the values from smallest to largest.
In the dot plot of Example 1.2, the center is about 2 years old, and the spread is from 0 to 11 years of age.
Example 1.5: In a survey taken a week before Valentine’s Day, 25 participants recorded their heights in inches. What type of graphical display is the following? What are the center and spread?
Clusters & Gaps
Other important aspects of the overall pattern are
- Clusters, which show natural subgroups in which the values fall. For example, for the salaries of teachers in New York, fall into three overlapping clusters, one for public school teachers, a higher one for LIU professors, and an even higher one for NYU professors.
- Gaps, which show holes where no values fall. For example, the Office of the Dean sends letters to students being put on the honor roll and to those receiving an academic warning for low grades; thus, the GPA distribution of students receiving letters from the Dean has a huge middle gap.
Example 1.6: In a survey taken a week before Valentine’s Day, 25 participants recorded their age. What clusters and gaps do you see in the following histogram? How can a Valentine’s gift shop use this information to increase their profit? Is our interpretation data, information or knowledge?
Outliers
Extreme values, called outliers, are found in many distributions. Sometimes they are the result of errors in measurements and deserve scrutiny. However, outliers can also be the result of natural chance variation. Outliers can occur on one side or both sides of a distribution. When analyzing your data, highlight the outliers and speculate as to why that outlier exists.
Identify the outlier(s) in the scatterplot that is Example 1.5. How can a Valentine’s gift shop use this information to determine its market for and when to sell Valentine’s Day gifts?
Modes
Some distributions have one or more major peaks, called modes. With only one peak (such as in Example 1.5), the distribution is said to be unimodal. With two peaks (such as in Example 1.6), the distribution is said to be bimodal.
Not every peak is a mode! You should always look at the big picture to decide whether or not two or more phenomena are affecting the graph. Then, ask yourself, what can you do with this information?
Compare the two histograms in Example 1.3. How many modes? Which histogram is more effective for a Valentine’s gift shop manager that wants to determine when they should open
Shape
Distributions come in many shapes. Some common and certain patterns are:
- A symmetric distribution is one in which two halves are mirror images of each other. For example, for the heights of all students at LIU fall into symmetric distributions, perhaps with two mirror-image peaks, one for men’s heights and one for women’s heights.
- A distribution is skewed to the right if it spreads far and thinly towards higher values. A poorly written exam may show a right-skewed distribution since fewer students received high grades. In contrast, an easy exam may show a distribution skewed to the left where results are spread far and thinly towards lower scores.
- A bell-shaped distribution is unimodal, symmetric, and has two sloping tails. For example, the distribution of IQ scores across the general American population is roughly symmetric with a center of 100 and slopes downwards near extreme high and low values.
- A distribution is uniform if its graphical display is a horizontal line. For example, tossing a fair die and not how many pips (i.e., the spots) appear on top yields a uniform distribution with 1 through 6 all equally likely to appear.
Planning & Conducting Studies
In the real world, time and potential costs usually make it impossible to gather data on every member of a population. Does the government question you or your parents before announcing the monthly unemployment rates? Does a television producer ask every household’s viewing preferences before deciding whether a pilot program will be continued? In statistics, we learn how to estimate characteristics of the population by taking a sample, or a small portion, of it.
To make relevant conclusions about the larger population, we need to be confident that the sample we have chosen represents the population fairly. Many people misattribute computers algorithms for handling data to mean that the results are reliable. But it’s frequently quoted, “Garbage in, garbage out.” Nothing can help if the data was initially badly collected. Taking badly collected (e.g., biased) data and running it through the computer, make the conclusions that come from it just as bad, perhaps worse. Unfortunately, many of the statistics with which we are bombarded by newspapers, radio, and television are based on poorly designed data collection procedures.
Census
A census is a complete enumeration of the entire population. It is often thought of as an official attempt to contact every member of the population, usually with regards to details such as age, marital status, race, gender, occupation, income, education, and so on. Every 10 years, the U.S. Census Bureau divides the nation into nine regions and attempts to gather information about everyone in the country. A massive amount of data is obtained, but even with the resources of the U.S. government, the census is not complete. Before you continue reading, can you think of reasons why the census is never complete?
For example, many homeless people are missed or double counted at temporary residences. There are always households that do not respond even after repeated requests for information. It is estimated that the 2010 census missed about 2.1% of African Americans and 1.5% of Hispanics, together accounting for some 1.5 million people.
In most studies, a complete census is unreasonable because of the time and costs involved. Furthermore, attempted to gather complete population data have been known to lead to carelessness. Ultimately, a well-designed, well-conducted sample survey is far better than a poorly-designed study involving a complete census. For example, a poorly worded question might give meaningless data even if everyone in the population answers it.
Sample Survey
Because a census tries to count everyone, it is not a sample. A sample survey aims to obtain information about the whole population by studying a part of it. However, one thing that quickly invalidates a sample and makes useful information impossible to obtain is bias. A sample is biased if it does not, in some critical way, represent the entire population. The main technique to avoid bias is to incorporate randomness into the selection processes. That’s because randomization protects us from the multiple variables at play, the effects and influences, both known and unknown. Finally, what is critical is the sample size—the percentage or fraction of the population is not important. For example, a random sample size of 500 from a population of 100,000 is just as representative as a random sample of size 500 from a population of size 1,000,000.
Sampling methods
- Systematic sampling involves listing the population in some order (for example, alphabetically), choosing a random point to start, and then picking every nth person from the list (e.g., every tenth person). This gives a reasonable sample as long as the original order of the list is not in any way related to the variables under considerations.
- In stratified sampling, the population is divided into homogeneous groups called strata, and random samples of persons from all strata are chosen. For example, we can strategy by gender and pick a sample of people from each stratum if we wanted to compare data from the different genders (e.g., the wage gap between females and males). We could further do proportional sampling, where the size of random samples from each stratum depend on the proportion of the total population represented by the stratum (e.g., the number of female professors is less than the number of male professors in LIU Business, so if selected proportionally from each stratum, the sample size of female professors would be smaller than the sample size of male professors).
- Cluster sampling is when the population is divided into heterogeneous groups called clusters and we then take a random sample of clusters from among all the clusters. For example, to survey Business-major college seniors, we could randomly pick several senior-level business classes to study. Make sure that each cluster resembles the entire population.
- Multistage sampling refers to a procedure of many steps that could involve any of the various sampling techniques. For example, a charity organization may follow a process where nationwide locations are randomly selected, then neighborhoods are randomly selected in each of these locations, and finally, households are randomly selected in each of these neighborhoods.
- Simple random sampling is randomly selecting n members of the population.
- Convenience sampling is any of the above methods with one less degree of randomness, usually for the convenience of the surveyor. For example, it is perhaps more convenient to test my BUS 110 prototype with a cluster sample of current LIU undergraduates than to sample across the numerous schools nationwide. When using this sampling method, you evaluate whether your results may include an under-coverage bias—the sample is not fully representative of its population.
Sources of Bias in Surveys
- Household bias is when a sample includes only one member of any given household because large households are underrepresented. To respond to this, pollsters sometimes choose larger samples for larger households.
- Nonresponse bias. A good example is that most mailed questionnaires tend to have very low response percentages, and it is often unclear which part of the population is responding. To maximize response rates, one can use multiple follow-up contacts and cash, or other incentives. When a portion of your population refuses to respond, it may be for some underlying reason.
- Quota sampling bias happens when interviewers are given free choice in picking people. For example, in order to demonstrate diversity—as if to fill a certain quota— the school may admit a particular percentage of women to the bachelors in computer science program.
- Response bias is when a survey question itself can lead to misleading results. For example, students may leave their name on their teacher’s evaluations, hoping to be in the good graces of their teacher when it comes to grading. Patients may lie about following a doctors’ orders, dieters may be dishonest about how strictly they’ve followed a weight loss program, students may exaggerate how many hours they’ve studied for exams.
- Selection bias is simply a sample is not representative of the entire population. For example, the Literary Digest opinion poll predicted a landslide victory for Alfred Landon over Franklin D. Roosevelt in the 1936 presidential election. The Digest surveyed people with cars and telephones, but in 1936 only the wealthy minority, who mainly voted Republican, had cars and telephones. Despite having obtained more than two million responses, the Digest picked a landslide win for the wrong presidential candidate.
- Size bias. Throwing darts at a U.S. map to decide which states to sample would be biased in favor of geographically large states. Having each student pick a coin out of a bag of various size coins would bias in favor of large coins, for example, quarters over dimes.
Experiment
In a controlled study (aka an experiment), the researcher should randomly divide subjects into appropriate groups. Some action is taken on one or more of the groups, and the response is observed. For example, patients may be given unmarked capsules of either aspirin or drug X and the effects of both measured. Experiments usually have a treatment group and a control group in order to measure the effect of the treatment.
Observational Study
In an observational study, the researcher plays no other part except to observe. Usually, the researcher strives to determine which variables affect the noted response. While results may suggest relationships, it is difficult to conclude cause and effect. Furthermore, observational studies on the impact of some variable on another variable often fail because, in a natural environment, there are countless variables at play; as such, we say that these variables are confounded with other variables.
Summary
- When describing a distribution, always comment on its shape, outliers, center, and spread.
- Also consider clusters, gaps, modes, and outliers; look for reasons behind these features.
- Always provide context.
- Common shapes arise from symmetric, skewed, bell-shaped, and uniform distributions.
- For categorical (qualitative) data, dot plots and bar charts give useful displays.
For discrete (quantitative) data, histograms and scatterplots give useful displays.
- A complete census is unreasonable usually because of time and cost constraints.
- Estimate population characteristics by considering statistics from a sample.
- Analysis of badly gathered sample data results in meaningless conclusions.
- A sample is biased if it is not representative of the population.
- A technique to avoid bias is to incorporate randomness into the selection process.
- Experiments involve applying a treatment group(s) and observing the responses.
- Observational studies involve observing natural responses and nothing more.
- A simple random sample is one in which every possible sample of the desired size as an equal chance of being selected.
- Under-coverage bias occurs when part of the popular is ignored/missed.