Objectives
SL 4.1
Concepts of population, sample, random sample, discrete and continuous data.
Sampling techniques and their effectiveness.
Reliability of data sources and bias in sampling.
Interpretation of outliers.
SL 4.2
Presentation of data (discrete and continuous): frequency distributions (tables).
Histograms. Cumulative frequency; cumulative frequency graphs; use to find median, quartiles, percentiles, range and interquartile range (IQR).
Production and understanding of box and whisker diagrams.
SL 4.3
Measures of central tendency (mean, median and mode).
Measures of dispersion (interquartile range, standard deviation and variance).
Introduction
The study of statistics deals with data collection, presentation, analysis and interpretation. The data can come from a population (an entire group) or a sample (a subset of the population).
Usually, a small sample of a population is used to represent the whole population.
Data
Data can be split into two categories.
Qualitative Data - non-numerical data, e.g. colours, names or labels.
Quantitative Data - numerical data that can be measured.
Quantitative data can be either discrete or continuous.
Watch Sampling Discrete or Continuous by TLMaths.
Discrete data - can only take specific, separate values. Often involves counting, e.g. the number of pets or shoe sizes.
Continuous data - can take any value within a range. Often involves measurement, e.g. height, weight or temperature.
You need to know which type of data you are dealing with so that you can make the right decisions about how to present and analyse the data.
Sampling
When it comes to sampling, that is, selecting a subset of a population, it is important to ensure that the sample is representative of the population being studied. For example, suppose you want to investigate the effect exercise has on memory and you choose children from a local primary school as the sample group. The results could not be applied to children in other areas of the country, children in different countries, or people in different age groups. It is very important to make sure your sample group is selected from the population you wish to study.
How you choose your sample is important, as you need to avoid bias. Sampling bias occurs when some members of the population are more likely to be selected than others, which makes the sample more likely to produce a certain outcome. This can lead to misleading conclusions about a population. To avoid this, the sample needs to be random. This means that each member of the population has an equal chance of being selected to be part of the sample, which makes the sample more likely to be representative of the larger population.
Types of Sampling
Simple random sampling - All 1,000 members of the population place their name in a hat. We select 100 names out of the hat. Each member has an equal probability of being chosen.
Systematic sampling - Every nth member of the population is selected, beginning from a random starting point. We pick a random starting point (e.g. the 8th person) and then pick every 1000th person after that (i.e. 8th, 1008th, 2008th, …).
Stratified sampling - We divide the population into subgroups (strata), say blue eyes and brown eyes, or under and over 40 years old. We then pick a random sample from each stratum.
Quota sampling - We divide the population into strata based on pre-defined characteristics and set a quota for each stratum in proportion to its share of the population. Unlike stratified sampling, the individuals who fill each quota are not chosen at random.
Convenience sampling - Select individuals that are readily available to you, e.g. the students of your school. This method often produces bias.
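The first three techniques above can be tried out in a few lines of Python. This is only a rough sketch: the population of 1,000 numbered members, the step size of 10, the two age strata and the 10% sampling fraction are all made-up values chosen for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable (illustrative choice)

# Hypothetical population: 1,000 members labelled 1 to 1000
population = list(range(1, 1001))

# Simple random sampling: 100 members, each equally likely to be chosen
simple_sample = random.sample(population, 100)

# Systematic sampling: random starting point, then every k-th member
k = 10                                  # assumed step size
start = random.randrange(k)             # random index from 0 to k - 1
systematic_sample = population[start::k]

# Stratified sampling: split into strata, then sample randomly within each.
# The strata here are invented: the first 600 members are "under 40",
# the remaining 400 are "40 and over".
strata = {
    "under_40": population[:600],
    "40_and_over": population[600:],
}
stratified_sample = {
    name: random.sample(members, len(members) // 10)   # 10% of each stratum
    for name, members in strata.items()
}

print(len(simple_sample), len(systematic_sample))
print({name: len(chosen) for name, chosen in stratified_sample.items()})
```

Because each stratum contributes the same fraction, the stratified sample keeps the 600 : 400 split of the population (60 and 40 members).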
Watch Sampling Techniques by IB Mathematics Revision.
Watch OSC Sampling and Outliers.
Methods of Data Presentation
Watch the video below from my YouTube channel.
Complete Khan Academy Algebra 1 Unit 1 - One-Variable Statistics down to the end of comparing data displays.
Measure of Central Tendency
Central tendency refers to statistical measures that summarise a dataset by identifying a single value that represents the centre or average of the data. These measures provide insight into where most of the data values lie, helping us understand the overall "typical" value in a dataset. The three main measures of central tendency are mean, median, and mode. It is essential you understand these measures and can use them successfully.
1. Mean (Arithmetic Average)
The mean is the sum of all data values divided by the total number of values in the dataset. It is commonly used when the data is evenly distributed without extreme outliers.
Formula:
$$\text{Mean} = \frac{\sum x_i}{n}$$
Where:
$x_i$ represents each data value
$n$ is the total number of values
Example:
For the dataset {4,7,9}
$$\text{Mean} = \frac{4+7+9}{3} = \frac{20}{3} \approx 6.67$$
Strengths: Accounts for all data values, providing an overall average.
Limitations: Sensitive to extreme values (outliers), which can distort the mean.
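To see how sensitive the mean is to an outlier, here is a short Python sketch using the standard `statistics` module. The extra value of 100 is invented purely to show the effect.

```python
from statistics import mean

data = [4, 7, 9]
mean_original = mean(data)              # (4 + 7 + 9) / 3, about 6.67

# Add one made-up extreme value and recompute
data_with_outlier = data + [100]
mean_with_outlier = mean(data_with_outlier)   # (4 + 7 + 9 + 100) / 4 = 30

print(round(mean_original, 2))
print(mean_with_outlier)
```

A single extreme value moves the mean from about 6.67 to 30, even though three of the four values are below 10.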
2. Median (Middle Value)
The median is the middle value when the data is arranged in order. If there is an even number of values, the median is the average of the two middle values. It is particularly useful when the dataset contains outliers or is skewed.
Steps to Find the Median:
Arrange the data in ascending order.
If n is odd, the median is the middle value.
If n is even, the median is the average of the two middle values.
Example:
For the dataset {3,5,8,12,15}
Median = 8 (middle value).
For the dataset {3,5,8,12}
$$\text{Median} = \frac{5+8}{2} = 6.5$$
Strengths: Not affected by outliers.
Limitations: Does not consider all data points.
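The steps above can be checked with Python's `statistics.median`, which sorts the data for you. The lists are the two example datasets written out of order, and the value 1000 is an invented outlier.

```python
from statistics import median

odd_data = [12, 3, 15, 8, 5]    # five values, deliberately unsorted
even_data = [8, 3, 12, 5]       # four values, deliberately unsorted

median_odd = median(odd_data)    # middle value of 3, 5, 8, 12, 15
median_even = median(even_data)  # average of 5 and 8

# Replace the largest value with a made-up outlier: the median does not move
median_outlier = median([3, 5, 8, 12, 1000])

print(median_odd, median_even, median_outlier)
```

This prints 8, 6.5 and 8, matching the worked examples and showing that the outlier leaves the median unchanged.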
3. Mode (Most Frequent Value)
The mode is the data value that occurs most frequently. A dataset can have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all if all values are unique.
Example:
For the dataset {2,4,4,6,8,8,8}
Mode = 8 (occurs most frequently).
Strengths: Useful for categorical data or datasets with repeating values.
Limitations: May not provide meaningful information if all values occur with the same frequency.
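A short Python sketch of the mode, again with the standard `statistics` module (`multimode` needs Python 3.8 or later). The second and third datasets are invented to show the bimodal and "all unique" cases.

```python
from statistics import mode, multimode

data = [2, 4, 4, 6, 8, 8, 8]
single_mode = mode(data)                   # 8 occurs three times

# Made-up bimodal dataset: 1 and 2 both occur twice
two_modes = multimode([1, 1, 2, 2, 3])

# Made-up dataset with no repeats: every value is returned,
# which is why the mode tells us little here
no_repeats = multimode([5, 6, 7])

# The mode also works for categorical data
favourite_colour = mode(["blue", "brown", "brown", "green"])

print(single_mode, two_modes, no_repeats, favourite_colour)
```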
Choosing the Right Measure
Use mean for symmetrical data without extreme values.
Use median for skewed data or when outliers are present.
Use mode for categorical data or when identifying the most common value is useful.
Understanding these measures allows you to summarise and interpret data effectively, a critical skill in statistics and real-world problem-solving.
Definitions of Key Data Types
1. Symmetrical Data
Symmetrical data refers to datasets where the values are evenly distributed around the center. When plotted as a graph (e.g., a histogram or curve), symmetrical data forms a shape where the left and right sides are mirror images.
Example: In a bell-shaped normal distribution, the mean, median, and mode are the same and lie at the center of the distribution.
Characteristics:
The data is evenly spread around the mean.
There are no extreme outliers or significant skew.
Commonly found in natural phenomena (e.g., heights, test scores).
2. Skewed Data
Skewed data occurs when the distribution of values is not symmetrical, causing one "tail" to be longer than the other. This means the data is stretched out more on one side.
Positively Skewed Data: The tail on the right side (higher values) is longer. Most data values are concentrated on the left. The mean is usually greater than the median.
Negatively Skewed Data: The tail on the left side (lower values) is longer. Most data values are concentrated on the right. The mean is usually less than the median.
Example:
Positively skewed: Household incomes in a population (most people earn less, but a few high earners stretch the right tail).
Negatively skewed: Age at retirement (most people retire at a similar age, but a few retire earlier than the norm).
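The link between skew and the mean and median can be illustrated in Python. Both datasets below are invented (incomes in thousands, retirement ages in years) and are only meant to mimic the two examples.

```python
from statistics import mean, median

# Made-up positively skewed data: one high earner stretches the right tail
incomes = [20, 22, 25, 27, 30, 32, 35, 200]
income_mean = mean(incomes)       # 391 / 8 = 48.875
income_median = median(incomes)   # (27 + 30) / 2 = 28.5

# Made-up negatively skewed data: a few early retirements stretch the left tail
retirement_ages = [45, 52, 63, 64, 65, 65, 66, 67]
age_mean = mean(retirement_ages)      # 487 / 8 = 60.875
age_median = median(retirement_ages)  # (64 + 65) / 2 = 64.5

print(income_mean > income_median)    # mean pulled up by the right tail
print(age_mean < age_median)          # mean pulled down by the left tail
```

In both cases the mean is dragged towards the long tail, while the median stays close to where most of the values lie.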
3. Categorical Data
Categorical data represents qualitative information and is divided into distinct categories or groups. These categories are typically non-numerical, though they can sometimes be assigned numbers for coding purposes.
Types of Categorical Data:
Nominal Data: Categories with no natural order (e.g., colors, types of pets).
Ordinal Data: Categories with a meaningful order but no consistent difference between them (e.g., rankings like "low," "medium," "high").
Examples:
Eye color: blue, brown, green (nominal).
Education level: high school, undergraduate, graduate (ordinal).
Characteristics:
Cannot be used directly in numerical calculations.
Best visualised using bar charts or pie charts.
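Although categorical data cannot be averaged, it can be counted. As a small sketch, Python's `collections.Counter` builds a frequency table from a list of categories. The eye colours below are invented sample data.

```python
from collections import Counter

# Made-up survey responses (nominal data)
eye_colours = ["blue", "brown", "brown", "green", "brown", "blue"]

frequencies = Counter(eye_colours)

# most_common() lists categories from most to least frequent
for colour, count in frequencies.most_common():
    print(colour, count)
```

These counts are exactly what a bar chart or pie chart of the data would display, and the first entry is the mode.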
Watch this Revision Village video on mean, median and mode.
Watch Advanced Maths calculator video.
Measure of Spread
The measure of spread gives an indication of how spread out the data is. The measure of spread helps us understand how well the mean or median represents a set of data and how reliable our conclusions are.
1. Range
The simplest way to measure spread is to find the range of a set of data. The range is the difference between the largest and smallest values in a dataset.
2. Interquartile Range (IQR)
This is the difference between the upper quartile (Q3) and the lower quartile (Q1): IQR = Q3 − Q1.
3. Interpercentile Range
This is the difference between the values of two given percentiles.
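These three measures can be computed with a short Python sketch. The dataset is invented, and the choice of the 10th and 90th percentiles is just an example. Note that textbooks, calculators and software use slightly different conventions for quartiles and percentiles; `statistics.quantiles` with its default settings is one of them, so your calculator may give slightly different answers for other datasets.

```python
from statistics import quantiles

# Made-up dataset of 11 values
data = sorted([7, 2, 13, 4, 9, 4, 15, 5, 11, 8, 12])

# 1. Range: largest value minus smallest value
data_range = max(data) - min(data)          # 15 - 2 = 13

# 2. Interquartile range: Q3 - Q1
q1, q2, q3 = quantiles(data, n=4)           # quartiles (default method)
iqr = q3 - q1                               # 12 - 4 = 8

# 3. Interpercentile range: here, 90th percentile minus 10th percentile
deciles = quantiles(data, n=10)             # 10th, 20th, ..., 90th percentiles
p10, p90 = deciles[0], deciles[-1]
interpercentile_range = p90 - p10

print(data_range, iqr, round(interpercentile_range, 1))
```

For this dataset the quartiles are 4, 8 and 12, which is also what you get by taking the median of the lower and upper halves by hand.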
4. Standard Deviation
The standard deviation is perhaps the most “reliable” measure of spread, as it takes all data into consideration. It measures the amount of variation of the values of a variable about its mean. A low standard deviation indicates that the values tend to be close to the mean; a high standard deviation indicates that the values are spread out over a wider range. The standard deviation is denoted either by σ or by Sn.
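As a sketch of how the standard deviation is built up, the Python below computes it step by step and then checks the result against `statistics.pstdev`. It assumes the population form, where the variance is the mean of the squared deviations from the mean (dividing by n). All three datasets are invented.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step by step: mean, squared deviations, variance, standard deviation
x_bar = mean(data)                                      # 40 / 8 = 5
squared_deviations = [(x - x_bar) ** 2 for x in data]   # sum is 32
variance = sum(squared_deviations) / len(data)          # 32 / 8 = 4
std_dev = sqrt(variance)                                # 2

# The library functions for the population variance and standard deviation
library_variance = pvariance(data)
library_std_dev = pstdev(data)

# Two made-up datasets with the same mean (5) but different spread
close_to_mean = [4, 5, 5, 5, 6]
spread_out = [1, 3, 5, 7, 9]

print(std_dev, library_std_dev)
print(round(pstdev(close_to_mean), 2), round(pstdev(spread_out), 2))
```

The last line shows a small standard deviation for the values clustered near the mean and a larger one for the values that are spread out, even though both datasets have the same mean.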
Watch Mr Flynns Averages and Spread.
Watch TLMaths Introducing the Variance and Standard Deviation to dig deeper into standard deviation.
Watch Advanced Maths calculator video to help you learn how to do this all on a calculator.
Task
In your maths journal, create a cover page for this unit of work: Statistics and Probability.
On the next page, make a note of all the necessary basics for this topic. Ensure you can define all the keywords on this page.
Depending on your level of confidence, complete the Transum activity from either Level 1 or Level 4 to Level 7.
MEP Chapter 3 Descriptive Statistics. This chapter will give you practice using and interpreting different types of data presentation, finding the mean, median and mode, and calculating spread and variance.
Exam practice - Answer questions 1-11 of descriptive statistics from Christos Nikolaidis' site.