A guide to Political Science resources at Cressman Library.

- Statistical Methods in Social Science Research (Publication Date: 2018)
- Translating Statistics to Make Decisions: A Guide for the Non-Statistician (Publication Date: 2017)

- Freely Accessible Statistics Online Resources: resources from the Library of Congress
- STATcompiler: organize demographic data by country and indicator. A tool from USAID.
- Tips for journalists working with math, statistics: A list of key resources. Denise-Marie Ordway, May 20, 2016. A project of Harvard Kennedy School's Shorenstein Center.

This article first appeared on The Journalist's Resource and is republished here under a Creative Commons license.

You're reading a delightful paper all about your chosen topic, when BAM! Suddenly there are tables, charts, percentages, and math just *ruining* your flow. Don't skip this section or fall into despair just yet! This guide will help you understand what results sections are trying to say, so you can confidently make the best decision based on the data.

So, what is statistics? **Statistics is finding the story behind the numbers.** It's a field of study that makes sense of data by organizing, analyzing, and interpreting it. With statistics, we can uncover patterns, trends, and relationships hidden within the data. These findings help us make informed decisions, predict outcomes, and better understand the world around us.

There are two main branches of statistics: descriptive and inferential.

**Descriptive statistics** summarizes and organizes data to give us a clear picture of its characteristics. **Inferential statistics** helps us make predictions or generalizations about a larger group based on a sample.

For example, if you were doing a study surveying the Cressman librarians about their favorite authors, then descriptive statistics would help you *summarize and describe* the most popular authors among them. However, if you found that a certain author was highly favored among the Cressman librarians, then inferential statistics could help you *predict* whether that author is likely to be popular among librarians in general.

To understand statistics, you first have to understand data. **Data is a collection of observations, typically coming from a sample of a population.** An observation is the unit of measurement in your data. Observations will represent different things for different data.

For example, if your data is describing a population of students, each student is considered an observation. If your data is measuring the price of fruit at the grocery store, each apple or orange would represent a distinct observation.

The characteristics that describe these observations (price, weight, height, gender) are called variables. **A variable is a measure of something that differs between observations or can change over time.**

Descriptive statistics is what it sounds like: measures that are intended to *describe* the variables in your data. **Descriptive statistics include measures like minimums, maximums, averages (means), medians, modes, percentiles, and ranges.**

The **mean** (also called the average) is calculated by adding up each observation’s value for a certain variable (like height) and dividing by the number of observations. This gives you a sense of what value the variable tends to take on for the observations in your data – like how tall a 5th grader tends to be.

*Note:* **The mean is sensitive to outliers.** This means that if you are taking a sample and there are some really big (or small) numbers in the group, they can pull the mean up (or down), *even if most of the numbers are not extreme*. Those outliers tug the mean toward themselves, so the mean may not show what most of the numbers are like.
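To see the mean's sensitivity to outliers in action, here is a quick sketch using Python's standard `statistics` module and made-up height numbers (the values are illustrative, not from the guide):

```python
from statistics import mean, median

# Hypothetical heights (in inches) of five 5th graders
heights = [54, 55, 56, 57, 58]
print(mean(heights))  # 56

# Add one unusually tall observation (an outlier)
with_outlier = heights + [80]
print(mean(with_outlier))    # 60: pulled upward by the single outlier
print(median(with_outlier))  # 56.5: barely moves
```

Notice that the median hardly budges, which is why it is often preferred for skewed data like incomes or house prices.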

The **median** is calculated by listing the values of a variable for all observations from least to greatest and by finding the center value. If there are an even number of observations, the median is calculated by taking the average of the two center values.

For example: Suppose your variable is age

Student A: 8 y.o. | Student B: 9 y.o. | Student C: 10 y.o. | Student D: 11 y.o.

If we were only taking the median of students A, B, and C, then the median would be 9 years old because that is the center value. But since there are an even number of students in the sample, the median is the average of the two center values, 9 and 10. So the median student’s age is 9 and a half years old.
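The student-age example above can be checked directly with the standard `statistics` module:

```python
from statistics import median

ages = [8, 9, 10, 11]    # Students A through D
print(median(ages))      # 9.5: the average of the two center values, 9 and 10
print(median(ages[:3]))  # 9: with an odd count, the center value itself
```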

**Percentiles** are used to describe how observations rank *relative to other observations *in a sample.

For example, having a GPA in the 90th percentile means that 90% of students (observations) have a lower GPA.

**Quartile** refers to the 25th, 50th, and 75th percentiles. You might see the term “IQR” or “interquartile range” which refers to the range of values between the 25th and 75th percentile.

Most often, you'll see descriptive data neatly summarized in what is called a **box plot**. A box plot contains five key pieces of information: the minimum, the first quartile, the median, the third quartile, and the maximum.
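Those five numbers can be computed with the standard library. One caveat: there are several conventions for calculating quartiles, and different software can give slightly different answers; this sketch uses the `'inclusive'` method and made-up scores:

```python
from statistics import quantiles

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# 'inclusive' is one common convention for computing quartiles
q1, q2, q3 = quantiles(scores, n=4, method='inclusive')

five_number_summary = (min(scores), q1, q2, q3, max(scores))
print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 9)
print(q3 - q1)              # IQR = 4.0
```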

**Inferential statistics** helps us make conclusions or predictions about a big group by looking at a smaller part of it. We use inferential statistics when we can't measure or observe everything in a group, but we still want to know something about it.

**Point estimation** is the process of *estimating* a characteristic (statistic) about a population when you only have access to a sample of that population.

For example, you might be interested in knowing the class’s average test score for an exam you just took, but only three of your friends agreed to tell you their grade! If you take the average score across those three students, you can use this to guess at the average score of the class – but that guess might not be very accurate. The more students you survey (the larger the number of observations) the more likely you are to guess correctly.

Because point estimates are guesses at what the population looks like, they inherently carry what we call **sampling error**: we can't know exactly what the population looks like if we only have a sample. In general, a larger sample reduces the sampling error. (Other things can reduce sampling error too, like making sure to sample randomly. For example, if you only ask the slackers what they got on the exam, your sample average is likely to be lower than the class average.)
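A small simulation makes the "larger sample, smaller sampling error" point concrete. This sketch invents a fake population of 1,000 exam scores, then repeatedly samples it at different sizes and measures how far each sample mean lands from the true population mean:

```python
import random
from statistics import mean

random.seed(1)
# Hypothetical "population" of 1,000 exam scores (mean near 75)
population = [random.gauss(75, 10) for _ in range(1000)]
true_mean = mean(population)

# Average sampling error over 200 repeated random samples of each size
for n in (5, 50, 500):
    errors = [abs(mean(random.sample(population, n)) - true_mean)
              for _ in range(200)]
    print(f"n = {n:3d}: average sampling error = {mean(errors):.2f}")
```

The average error shrinks steadily as the sample size grows.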

An example of a point estimate is the average score that a sample of 30 high school students received on the SAT. The average describes all 30 scores with one number (making it a point), and that point may differ from the average SAT score for all high school students nationwide (making it an estimate).

Because point estimates are only *estimates*, **margins of error** tell you how much a point estimate from a sample may differ from the true population value.

**The larger the margin of error, the less confident researchers can be that the point estimate is approximating the population value. **

A **confidence interval** is calculated by adding and subtracting the margin of error from the point estimate. Confidence intervals suggest what *range of values* around the point estimate is likely to include the population characteristic.

**The wider the margin of error, the wider the confidence interval, and the more uncertainty about what the population characteristic might be.**
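Here is a sketch of the "point estimate ± margin of error" arithmetic, using made-up exam scores and the normal approximation (z ≈ 1.96 for a 95% confidence level); real studies often use slightly different formulas depending on sample size:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical sample of ten exam scores
sample = [72, 85, 78, 90, 66, 81, 75, 88, 70, 79]
n = len(sample)
point_estimate = mean(sample)

# 95% margin of error using the normal approximation (z = 1.96)
margin_of_error = 1.96 * stdev(sample) / sqrt(n)

# Confidence interval: point estimate plus and minus the margin of error
print(f"point estimate: {point_estimate:.1f}")
print(f"95% confidence interval: ({point_estimate - margin_of_error:.1f}, "
      f"{point_estimate + margin_of_error:.1f})")
```

Doubling the margin of error (more uncertainty) would widen the interval; a bigger sample shrinks it.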

**Regression analysis** is a statistical tool that is used to understand how and if two factors are connected, and how one factor can change when another factor does. For example, regression analysis might be used to estimate the average effect an extra hour of studying will have on a student’s exam grade. Regressions use one or more **explanatory variables** (like time spent studying) to estimate an **outcome variable** (like a test score).
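The studying-and-grades example can be sketched as a simple least-squares regression. The data here are invented for illustration; the slope answers "how many points does one extra hour of studying buy, on average?":

```python
# Least-squares simple linear regression, with made-up data
hours  = [1, 2, 3, 4, 5]       # explanatory variable: hours studied
scores = [65, 70, 74, 81, 85]  # outcome variable: exam score

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# slope = covariance(hours, scores) / variance(hours)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
         / sum((x - mean_x) ** 2 for x in hours))
intercept = mean_y - slope * mean_x

print(f"estimated effect of one extra hour: {slope:.1f} points")  # 5.1
print(f"intercept (predicted score at zero hours): {intercept:.1f}")
```

Statistical software (R, SPSS, Excel) does this same calculation, plus the standard errors and p-values that tell you how much to trust the slope.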

The researchers choose which explanatory variables are used in their regression analysis. **The importance of this choice cannot be overstated.** Having too few or too many explanatory variables (or simply choosing the wrong ones) can render the results of a regression analysis completely useless!

For example, you probably know that "time spent studying for an exam" is not the only determinant of a student’s score. Other factors could include: how much sleep the student got the night before, whether the student has test anxiety, and the student's general mastery of the material. These factors can have equal, if not greater, influence on the student’s score. If you don't include those variables in your analysis, then you get something called **omitted variable bias**, and that can discredit the results of the whole regression analysis.

**One consequence of omitting key explanatory variables is that it can make two factors seem related when they are not.** The classic example is a regression that looks at the effect of ice cream sales on shark attacks. If you were to run a regression analysis on data that measures ice cream sales and shark attacks over time, you would find that ice cream sales are heavily correlated with shark attacks! Before you start crafting theories about sharks having a sweet tooth, remember that correlation does not equal causation!

**A regression result showing that two variables are related does not prove that the two variables are causally related. A regression is not a complex model that replicates the real world.**
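The ice-cream-and-sharks pattern can be reproduced with simulated data. In this sketch (all numbers invented), temperature drives *both* variables: hot days mean more ice cream sold and more people swimming. Neither variable causes the other, yet they come out strongly correlated:

```python
import random

random.seed(0)
# Hypothetical daily data: temperature drives BOTH quantities
temps = [random.uniform(10, 35) for _ in range(200)]          # daily temperature (C)
ice_cream_sales = [2.0 * t + random.gauss(0, 3) for t in temps]
shark_attacks = [0.1 * t + random.gauss(0, 0.5) for t in temps]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Strongly positive, despite there being no causal link between the two
print(f"correlation: {corr(ice_cream_sales, shark_attacks):.2f}")
```

Including the omitted variable (temperature) in the regression would make the spurious ice-cream effect disappear.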

Statistics can inform your understanding of a research topic, and it can provide evidence to inform your choices. But! **It’s important to think about statistics as being able to support an idea but being unable to prove it.** Statistics can be a powerful type of evidence, but there are several pitfalls to avoid.

The safest rule for interpreting research statistics in peer-reviewed sources is to rely on authors’ own description of statistical results, the authors’ own interpretation and discussion of the results as evidence to inform a research question, and the authors’ own assessment of the limitations of the statistical evidence.

Researchers typically focus on narrow questions, but their data can be misinterpreted when used to address a different question. After identifying research studies that use statistics that seem to directly address your research question, read the authors’ own interpretation of the statistics. Then, ask yourself:

Did the authors design their statistical analysis in a way that *directly* helps address my question, or would it take a leap to use this data for my research?

**Statistical tests are complicated and designed for a narrow question**, so only use statistical evidence that was designed to address your specific question.

*Note:* More complex statistical methods generally require narrower applications and interpretations.

Finally, recognize that a statistically significant result is not necessarily a meaningful result in the real world. Ask yourself:

Is the result summarized in the study clinically meaningful or compelling evidence for answering your research question?

In describing your conclusions about the statistics, be sure to stick with other lessons from this guide: correlation is not causation, and good research means acknowledging the limitations of your research sources.

Whenever we do an experiment, we need to rule out that our results could have occurred simply due to chance. **Hypothesis testing** is a subset of inferential statistics, and its main goal is to help determine whether the results you found in your study (which were found with a sample of a population) could be applied to the larger population and still hold true with a certain level of confidence.

But let's back up a bit. Logically speaking, it is much easier to prove something is false than to prove something is true. If we want to prove something is false, then we only need one counterexample. But if we want to prove something is true, then we need to prove it is true in every possible situation.

For example, the classic argument is the claim "all swans are white." If I want to prove that the statement "all swans are white" is true, then I would need to go out and *check every single swan*. That's impossible! However, I can prove that the statement "all swans are white" is false by finding just *one swan that isn't white*.

When we start doing an experiment, we create a hypothesis about a population. A **hypothesis (H_A)**, also known as an alternative hypothesis, is a proposed explanation or prediction about a phenomenon or a relationship between variables. For example, we may hypothesize that Cedar Crest students, on average, chat with their librarians more often than students at other colleges.

The **null hypothesis (H_0)** is a statement that suggests there is no significant difference, effect, or relationship between the phenomena or variables being studied. The null hypothesis exists as the status quo, or default assumption. In our example, the null hypothesis is that Cedar Crest students chat with their librarians at the same rate as students at other colleges.

Because it would be much harder to prove our hypothesis true in every case, we instead look for evidence that the null hypothesis is false in this particular case. If the null hypothesis (that there is no change) looks false, then our hypothesis (that there is a significant change) is *more likely* to be true.

The **p-value** is a measure of how likely the sample results are, assuming the null hypothesis is true. To put it simply, the p-value is the probability that your observed phenomenon happened simply due to chance.

The p-value is the probability that any particular outcome would have arisen by chance. Standard scientific practice usually deems a p-value of less than 1 in 20 (expressed as p = 0.05) "statistically significant" and a p-value of less than 1 in 100 (p = 0.01) "statistically highly significant."
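A coin flip makes the idea concrete. Suppose you flip a coin 10 times and see 9 heads; under the null hypothesis (the coin is fair), the p-value is the probability of a result at least that extreme happening by chance:

```python
from math import comb

# 9 heads in 10 flips: how likely is a result at least this extreme
# if the null hypothesis (a fair coin) is true?
n, k = 10, 9
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(f"p = {p_value:.4f}")  # p = 0.0107, below 0.05: "statistically significant"
```

Because p < 0.05, standard practice would call the result statistically significant and reject the null hypothesis that the coin is fair.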

- 1,001 Statistics Practice Problems for Dummies (Publication Date: 2014)
- Starting Statistics: A Short, Clear Guide (Publication Date: 2011)
- Statistical Methods in Social Science Research (Publication Date: 2018)

- Common Sources of Data Errors and Error-Checking Techniques: a brief guide from the Institute of Education Statistics
- Decision Errors, Effect Sizes, and Power: a chapter from Understanding Inferential Statistics

- Survey Methodology and Missing Data: Tools and Techniques for Practitioners (Publication Date: 2018)

- Our World in Data

- Data Analysis with SPSS Software: Data Types, Graphs, and Measurement Tendencies (Publication Date: 2016)
- SPSS Statistics for Dummies (Publication Date: 2015)

- Excel 2016 for Social Science Statistics: A Guide to Solving Practical Problems (Publication Date: 2016)
- Excel Data Analysis: Modeling and Simulation (Publication Date: 2019)

- Introduction to Probability, Statistics & R: Foundations for Data-Based Sciences (Publication Date: 2024)
- Political Analysis Using R (Publication Date: 2015)

- Posit Recipes: instructions by type in RStudio