A guide to Political Science resources at Cressman Library.

- Statistical Methods in Social Science Research (Publication Date: 2018)
- Translating Statistics to Make Decisions: A Guide for the Non-Statistician (Publication Date: 2017)

- Freely Accessible Statistics Online Resources: resources from the Library of Congress
- STATcompiler: organize demographic data by country and indicator. A tool from USAID.
- Tips for journalists working with math, statistics: A list of key resources. Denise-Marie Ordway, May 20, 2016. A project of Harvard Kennedy School's Shorenstein Center.

This article first appeared on The Journalist's Resource and is republished here under a Creative Commons license.

You're reading a delightful paper all about your chosen topic, when BAM! Suddenly there are tables, charts, percentages, and math just *ruining* your flow. Don't skip this section or fall into despair just yet! This guide will help you understand what results sections are trying to say, so you can confidently make the best decision based on the data.

So, what is statistics? **Statistics is finding the story behind the numbers.** It's a field of study that makes sense of data by organizing, analyzing, and interpreting it. With statistics, we can uncover patterns, trends, and relationships hidden within the data. These findings help us make informed decisions, predict outcomes, and better understand the world around us.

There are two main branches of statistics: descriptive and inferential.

**Descriptive statistics** summarizes and organizes data to give us a clear picture of its characteristics. **Inferential statistics** helps us make predictions or generalizations about a larger group based on a sample.

For example, if you were doing a study surveying the Cressman librarians about their favorite authors, then descriptive statistics would help you *summarize and describe* the most popular authors among them. However, if you found that a certain author was highly favored among the Cressman librarians, then inferential statistics could help you *predict* whether that author is likely to be popular among librarians in general.

To understand statistics, you first have to understand data. **Data is a collection of observations, typically coming from a sample of a population.** An observation is the unit of measurement in your data. Observations will represent different things for different data.

For example, if your data is describing a population of students, each student is considered an observation. If your data is measuring the price of fruit at the grocery store, each apple or orange would represent a distinct observation.

The characteristics that describe these observations (price, weight, height, gender) are called variables. **A variable is a measure of something that differs between observations or can change over time.**

Descriptive statistics is what it sounds like: measures that are intended to *describe* the variables in your data. **Descriptive statistics include measures like minimums, maximums, averages (means), medians, modes, percentiles, and ranges.**

The **mean** (also called the average) is calculated by adding up each observation’s value for a certain variable (like height) and dividing by the number of observations. This gives you a sense of what value the variable tends to take on for the observations in your data – like how tall a 5th grader tends to be.

*Note:* **The mean is sensitive to outliers.** This means that if you are taking a sample and there are some really big (or small) numbers in the group, they can pull the mean up (or down), *even if most of the numbers are not extreme*. Those outliers tug the mean toward themselves, so the mean may not show what most of the numbers are like.
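To see the mean's sensitivity to outliers in action, here is a quick sketch using Python's standard `statistics` module and made-up height numbers (the values are illustrative, not from the guide):

```python
from statistics import mean, median

# Hypothetical heights (in inches) of five 5th graders
heights = [54, 55, 56, 57, 58]
print(mean(heights))  # 56

# Add one unusually tall observation (an outlier)
with_outlier = heights + [80]
print(mean(with_outlier))    # 60: pulled upward by the single outlier
print(median(with_outlier))  # 56.5: barely moves
```

Notice that the median hardly budges, which is why it is often preferred for skewed data like incomes or house prices.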

The **median** is calculated by listing the values of a variable for all observations from least to greatest and by finding the center value. If there are an even number of observations, the median is calculated by taking the average of the two center values.

For example: Suppose your variable is age

Student A: 8 y.o. | Student B: 9 y.o. | Student C: 10 y.o. | Student D: 11 y.o.

If we were only taking the median of students A, B, and C, then the median would be 9 years old because that is the center value. But since there are an even number of students in the sample, the median is the average of the two center values, 9 and 10. So the median student’s age is 9 and a half years old.
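The student-age example above can be checked directly with the standard `statistics` module:

```python
from statistics import median

ages = [8, 9, 10, 11]    # Students A through D
print(median(ages))      # 9.5: the average of the two center values, 9 and 10
print(median(ages[:3]))  # 9: with an odd count, the center value itself
```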

**Percentiles** are used to describe how observations rank *relative to other observations *in a sample.

For example, having a GPA in the 90th percentile means that 90% of students (observations) have a lower GPA.

**Quartile** refers to the 25th, 50th, and 75th percentiles. You might see the term “IQR” or “interquartile range” which refers to the range of values between the 25th and 75th percentile.

Most often, you'll see descriptive data neatly summarized in what is called a **box plot**. A box plot contains five key pieces of information: the minimum, the first quartile, the median, the third quartile, and the maximum.
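Those five numbers can be computed with the standard library. One caveat: there are several conventions for calculating quartiles, and different software can give slightly different answers; this sketch uses the `'inclusive'` method and made-up scores:

```python
from statistics import quantiles

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# 'inclusive' is one common convention for computing quartiles
q1, q2, q3 = quantiles(scores, n=4, method='inclusive')

five_number_summary = (min(scores), q1, q2, q3, max(scores))
print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 9)
print(q3 - q1)              # IQR = 4.0
```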

**Inferential statistics** helps us make conclusions or predictions about a big group by looking at a smaller part of it. We use inferential statistics when we can't measure or observe everything in a group, but we still want to know something about it.

**Point estimation** is the process of *estimating* a characteristic (statistic) about a population when you only have access to a sample of that population.

For example, you might be interested in knowing the class’s average test score for an exam you just took, but only three of your friends agreed to tell you their grade! If you take the average score across those three students, you can use this to guess at the average score of the class – but that guess might not be very accurate. The more students you survey (the larger the number of observations) the more likely you are to guess correctly.

Because point estimates are guesses at what the population looks like, they inherently carry what we call **sampling error**: we can't know exactly what the population looks like if we only have a sample. In general, a larger sample reduces the sampling error. (Other things can reduce sampling error too, like making sure to sample randomly. For example, if you only ask the slackers what they got on the exam, your sample average is likely to be lower than the class average.)
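A small simulation makes the "larger sample, smaller sampling error" point concrete. This sketch invents a fake population of 1,000 exam scores, then repeatedly samples it at different sizes and measures how far each sample mean lands from the true population mean:

```python
import random
from statistics import mean

random.seed(1)
# Hypothetical "population" of 1,000 exam scores (mean near 75)
population = [random.gauss(75, 10) for _ in range(1000)]
true_mean = mean(population)

# Average sampling error over 200 repeated random samples of each size
for n in (5, 50, 500):
    errors = [abs(mean(random.sample(population, n)) - true_mean)
              for _ in range(200)]
    print(f"n = {n:3d}: average sampling error = {mean(errors):.2f}")
```

The average error shrinks steadily as the sample size grows.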

An example of a point estimate is the average score that a sample of 30 high school students received on the SAT. The average describes all 30 scores with one number (making it a point), and that point may differ from the average SAT score for all high school students nationwide (making it an estimate).

Because point estimates are only *estimates*, **margins of error** tell you how much a point estimate from a sample may differ from the true population value.

**The larger the margin of error, the less confident researchers can be that the point estimate is approximating the population value. **

A **confidence interval** is calculated by adding and subtracting the margin of error from the point estimate. Confidence intervals suggest what *range of values* around the point estimate is likely to include the population characteristic.

**The wider the margin of error, the wider the confidence interval, and the more uncertainty about what the population characteristic might be.**
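Here is a sketch of the "point estimate ± margin of error" arithmetic, using made-up exam scores and the normal approximation (z ≈ 1.96 for a 95% confidence level); real studies often use slightly different formulas depending on sample size:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical sample of ten exam scores
sample = [72, 85, 78, 90, 66, 81, 75, 88, 70, 79]
n = len(sample)
point_estimate = mean(sample)

# 95% margin of error using the normal approximation (z = 1.96)
margin_of_error = 1.96 * stdev(sample) / sqrt(n)

# Confidence interval: point estimate plus and minus the margin of error
print(f"point estimate: {point_estimate:.1f}")
print(f"95% confidence interval: ({point_estimate - margin_of_error:.1f}, "
      f"{point_estimate + margin_of_error:.1f})")
```

Doubling the margin of error (more uncertainty) would widen the interval; a bigger sample shrinks it.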

**Regression analysis** is a statistical tool that is used to understand how and if two factors are connected, and how one factor can change when another factor does. For example, regression analysis might be used to estimate the average effect an extra hour of studying will have on a student’s exam grade. Regressions use one or more **explanatory variables** (like time spent studying) to estimate an **outcome variable** (like a test score).
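The studying-and-grades example can be sketched as a simple least-squares regression. The data here are invented for illustration; the slope answers "how many points does one extra hour of studying buy, on average?":

```python
# Least-squares simple linear regression, with made-up data
hours  = [1, 2, 3, 4, 5]       # explanatory variable: hours studied
scores = [65, 70, 74, 81, 85]  # outcome variable: exam score

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# slope = covariance(hours, scores) / variance(hours)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
         / sum((x - mean_x) ** 2 for x in hours))
intercept = mean_y - slope * mean_x

print(f"estimated effect of one extra hour: {slope:.1f} points")  # 5.1
print(f"intercept (predicted score at zero hours): {intercept:.1f}")
```

Statistical software (R, SPSS, Excel) does this same calculation, plus the standard errors and p-values that tell you how much to trust the slope.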

The researchers choose which explanatory variables are used in their regression analysis. **The importance of this choice cannot be overstated.** Having too few or too many explanatory variables (or simply choosing the wrong ones) can render the results of a regression analysis completely useless!

For example, you probably know that "time spent studying for an exam" is not the only determinant of a student’s score. Other factors could include: how much sleep the student got the night before, whether the student has test anxiety, and the student's general mastery of the material. These factors can have equal, if not greater, influence on the student’s score. If you don't include those variables in your analysis, then you get something called **omitted variable bias**, and that can discredit the results of the whole regression analysis.

**One consequence of omitting key explanatory variables is that it can make two factors seem related when they are not.** The classic example is a regression that looks at the effect of ice cream sales on shark attacks. If you were to run a regression analysis on data that measures ice cream sales and shark attacks over time, you would find that ice cream sales are heavily correlated with shark attacks! Before you start crafting theories about sharks having a sweet tooth, remember that correlation does not equal causation!

**A regression result showing that two variables are related does not prove that the two variables are causally related. A regression is not a complex model that replicates the real world.**
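The ice-cream-and-sharks pattern can be reproduced with simulated data. In this sketch (all numbers invented), temperature drives *both* variables: hot days mean more ice cream sold and more people swimming. Neither variable causes the other, yet they come out strongly correlated:

```python
import random

random.seed(0)
# Hypothetical daily data: temperature drives BOTH quantities
temps = [random.uniform(10, 35) for _ in range(200)]          # daily temperature (C)
ice_cream_sales = [2.0 * t + random.gauss(0, 3) for t in temps]
shark_attacks = [0.1 * t + random.gauss(0, 0.5) for t in temps]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Strongly positive, despite there being no causal link between the two
print(f"correlation: {corr(ice_cream_sales, shark_attacks):.2f}")
```

Including the omitted variable (temperature) in the regression would make the spurious ice-cream effect disappear.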

Statistics can inform your understanding of a research topic, and it can provide evidence to inform your choices. But! **It’s important to think about statistics as being able to support an idea but being unable to prove it.** Statistics can be a powerful type of evidence, but there are several pitfalls to avoid.

The safest rule for interpreting research statistics in peer-reviewed sources is to rely on authors’ own description of statistical results, the authors’ own interpretation and discussion of the results as evidence to inform a research question, and the authors’ own assessment of the limitations of the statistical evidence.

Researchers typically focus on narrow questions, but their data can be misinterpreted when used to address a different question. After identifying research studies that use statistics that seem to directly address your research question, read the authors’ own interpretation of the statistics. Then, ask yourself:

Did the authors design their statistical analysis in a way that *directly* helps address my question, or would it take a leap to use this data for my research?

**Statistical tests are complicated and designed for a narrow question**, so only use statistical evidence that was designed to address your specific question.

*Note:* More complex statistical methods generally require narrower applications and interpretations.

Finally, recognize that a statistically significant result is not necessarily a meaningful result in the real world. Ask yourself:

Is the result summarized in the study clinically meaningful or compelling evidence for answering your research question?

In describing your conclusions about the statistics, be sure to stick with other lessons from this guide: correlation is not causation, and good research means acknowledging the limitations of your research sources.

Whenever we do an experiment, we need to rule out that our results could have occurred simply due to chance. **Hypothesis testing** is a subset of inferential statistics, and its main goal is to help determine whether the results you found in your study (which were found with a sample of a population) could be applied to the larger population and still hold true with a certain level of confidence.

But let's back up a bit. Logically speaking, it is much easier to prove something is false than to prove something is true. If we want to prove something is false, then we only need one counterexample. But if we want to prove something is true, then we need to prove it is true in every possible situation.

For example, the classic argument is the claim "all swans are white." If I want to prove that the statement "all swans are white" is true, then I would need to go out and *check every single swan*. That's impossible! However, I can prove that the statement "all swans are white" is false by finding just *one swan that isn't white*.

When we start doing an experiment, we create a hypothesis about a population. A **hypothesis (H_A)**, also known as an alternative hypothesis, is a proposed explanation or prediction about a phenomenon or a relationship between variables. For example, we may hypothesize that Cedar Crest students, on average, chat with their librarians more often than students at other colleges.

The **null hypothesis (H_0)** is a statement that suggests there is no significant difference, effect, or relationship between the phenomena or variables being studied. The null hypothesis exists as the status quo, or default assumption. In our example, the null hypothesis is that Cedar Crest students chat with their librarians at the same rate as students at other colleges.

Because it would be much harder to prove our hypothesis true in every case, we instead look for evidence that the null hypothesis is false in this particular case. If the null hypothesis (that there is no change) looks false, then our hypothesis (that there is a significant change) is *more likely* to be true.

The **p-value** is a measure of how likely the sample results are, assuming the null hypothesis is true. To put it simply, the p-value is the probability that your observed phenomenon happened simply due to chance.

The p-value is the probability that any particular outcome would have arisen by chance. Standard scientific practice usually deems a p-value of less than 1 in 20 (expressed as p = 0.05) "statistically significant" and a p-value of less than 1 in 100 (p = 0.01) "statistically highly significant."
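A coin flip makes the idea concrete. Suppose you flip a coin 10 times and see 9 heads; under the null hypothesis (the coin is fair), the p-value is the probability of a result at least that extreme happening by chance:

```python
from math import comb

# 9 heads in 10 flips: how likely is a result at least this extreme
# if the null hypothesis (a fair coin) is true?
n, k = 10, 9
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(f"p = {p_value:.4f}")  # p = 0.0107, below 0.05: "statistically significant"
```

Because p < 0.05, standard practice would call the result statistically significant and reject the null hypothesis that the coin is fair.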

- 1,001 Statistics Practice Problems for Dummies (Publication Date: 2014)
- Starting Statistics: A Short, Clear Guide (Publication Date: 2011)
- Statistical Methods in Social Science Research (Publication Date: 2018)

- Common Sources of Data Errors and Error-Checking Techniques: a brief guide from the Institute of Education Statistics
- Decision Errors, Effect Sizes, and Power: a chapter from Understanding Inferential Statistics

- Survey Methodology and Missing Data: Tools and Techniques for Practitioners (Publication Date: 2018)

- Our World in Data

- Data Analysis with SPSS Software: Data Types, Graphs, and Measurement Tendencies (Publication Date: 2016)
- SPSS Statistics for Dummies (Publication Date: 2015)

- Excel 2016 for Social Science Statistics: A Guide to Solving Practical Problems (Publication Date: 2016)
- Excel Data Analysis: Modeling and Simulation (Publication Date: 2019)

- Introduction to Probability, Statistics & R: Foundations for Data-Based Sciences (Publication Date: 2024)
- Political Analysis Using R (Publication Date: 2015)

- Posit Recipes: instructions by type in RStudio