Which formula is used when the sample size is larger than 30 and the population standard deviation is unknown?

In probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (i.e., a “bell curve”) as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the shape of the population's actual distribution.

Put another way, the CLT is a statistical premise that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all sample means from the same population will be approximately equal to the mean of the whole population. Furthermore, the sample means approximate a normal distribution whose variance is approximately equal to the population variance divided by the sample size, so the sample means cluster more tightly around the population mean as the sample size gets larger, in line with the law of large numbers.

Although this concept was first developed by Abraham de Moivre in 1733, it was not formalized until 1920, when the Hungarian mathematician George Pólya dubbed it the central limit theorem.

  • The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population's distribution.
  • Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.
  • A key aspect of the CLT is that the average of the sample means will equal the population mean, and the standard deviation of the sample means will equal the population standard deviation divided by the square root of the sample size.
  • A sufficiently large sample size can predict the characteristics of a population more accurately.
  • CLT is useful in finance when analyzing a large collection of securities to estimate portfolio distributions and traits for returns, risk, and correlation.

According to the central limit theorem, the mean of a sample of data will be closer to the mean of the overall population in question as the sample size increases, notwithstanding the actual distribution of the data. In other words, the result holds whether the population distribution is normal or not.

As a general rule, sample sizes of around 30-50 are deemed sufficient for the CLT to hold, meaning that the distribution of the sample means is fairly close to normal. The more samples one takes, the more the graphed results take the shape of a normal distribution. Note, however, that the normal approximation can still be reasonable in many cases for much smaller sample sizes, such as n=8 or n=5.
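The effect of sample size on the shape of the sampling distribution can be checked with a small simulation. The sketch below is illustrative only: it assumes an exponential population (a strongly right-skewed distribution chosen here for demonstration, not taken from the text) and measures the skewness of the sample means, which should shrink toward 0 as n grows. The helper names `skewness` and `sample_means` are hypothetical.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def sample_means(n, repeats, rng):
    """Means of `repeats` independent samples of size n from an exponential population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
skew_by_n = {n: skewness(sample_means(n, 5000, rng)) for n in (5, 30, 100)}
for n, skew in skew_by_n.items():
    print(f"n={n:>3}: skewness of sample means = {skew:.2f}")
```

The printed skewness values should fall as n rises, which is the sense in which the sample means "take the shape" of a normal distribution.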

The central limit theorem is often used in conjunction with the law of large numbers, which states that the average of the sample means and standard deviations will come closer to equaling the population mean and standard deviation as the sample size grows. This is extremely useful in accurately predicting the characteristics of populations.


The central limit theorem has several key characteristics. These characteristics largely revolve around samples, sample sizes, and the population of data.

  1. Sampling is successive. This means some sample units are common with sample units selected on previous occasions.
  2. Sampling is random. All samples must be selected at random so that they have the same statistical possibility of being selected.
  3. Samples should be independent. The selections or results from one sample should have no bearing on future samples or other sample results.
  4. Samples should be limited. It's often cited that a sample should be no more than 10% of a population if sampling is done without replacement. In general, larger population sizes warrant the use of larger sample sizes.
  5. Sample size is increasing. The central limit theorem is relevant as more samples are selected.

The CLT is useful when examining the returns of an individual stock or broader indices, because the analysis is simple, due to the relative ease of generating the necessary financial data. Consequently, investors of all types rely on the CLT to analyze stock returns, construct portfolios, and manage risk.

Say, for example, an investor wishes to analyze the overall return for a stock index that comprises 1,000 equities. In this scenario, that investor may simply study a random sample of stocks to estimate the returns of the total index. To be safe, at least 30-50 randomly selected stocks across various sectors should be sampled for the central limit theorem to hold. Furthermore, previously selected stocks must be swapped out with different names to help eliminate bias.

The central limit theorem is useful when analyzing large data sets because it allows one to assume that the sampling distribution of the mean will be normally distributed in most cases. This allows for easier statistical analysis and inference. For example, investors can use the central limit theorem to aggregate individual security performance data and generate a distribution of sample means that represents a larger population distribution for security returns over a period of time.

A sample size of 30 is fairly common across statistics. A sample size of 30 often narrows the confidence interval around your estimate enough to support assertions based on your findings. The higher your sample size, the more likely the sample will be representative of your population set.

The central limit theorem doesn't have its own formula, but it relies on sample mean and standard deviation. As sample means are gathered from the population, standard deviation is used to distribute the data across a probability distribution curve.

The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. This will hold true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30). If the population is normal, then the theorem holds true even for samples smaller than 30. In fact, this also holds true even if the population is binomial, provided that min(np, n(1-p)) is at least 5, where n is the sample size and p is the probability of success in the population. This means that we can use the normal probability model to quantify uncertainty when making inferences about a population mean based on the sample mean.

For the random samples we take from the population, we can compute the mean of the sample means:

$$\mu_{\bar{X}} = \mu$$

and the standard deviation of the sample means:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
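The mean and standard deviation of the sample means follow from a short derivation. The sketch below assumes the n observations in a sample, written here as X_1, ..., X_n (notation introduced for illustration), are independent draws with common mean μ and variance σ²:

```latex
\begin{aligned}
E[\bar{X}] &= \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n\mu = \mu, \\
\operatorname{Var}(\bar{X}) &= \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{n\sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n}, \\
\sigma_{\bar{X}} &= \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}.
\end{aligned}
```

Independence is what allows the variances to be added in the second line, which is why sampling with replacement (or sampling a small fraction of a large population) matters.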

Before illustrating the use of the Central Limit Theorem (CLT), we will first illustrate the result. For the result of the CLT to hold, the sample must be sufficiently large (n > 30). Again, there are two exceptions to this. If the population is normal, then the result holds for samples of any size (i.e., the sampling distribution of the sample means will be approximately normal even for samples of size less than 30). If the population is binomial, the result holds provided that min(np, n(1-p)) is at least 5.

Central Limit Theorem with a Normal Population

The figure below illustrates a normally distributed characteristic, X, in a population in which the population mean is 75 with a standard deviation of 8.

[Figure: normal population distribution of X with mean 75 and standard deviation 8]

If we take simple random samples (with replacement) of size n=10 from the population and compute the mean for each of the samples, the distribution of sample means should be approximately normal according to the Central Limit Theorem. Note that the sample size (n=10) is less than 30, but the source population is normally distributed, so this is not a problem. The distribution of the sample means is illustrated below. Note that the horizontal axis is different from the previous illustration, and that the range is narrower.

[Figure: distribution of sample means for samples of size n=10]

The mean of the sample means is 75 and the standard deviation of the sample means is 2.5, with the standard deviation of the sample means computed as follows:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{10}} \approx 2.5$$

If we were to take samples of n=5 instead of n=10, we would get a similar distribution, but the variation among the sample means would be larger. In fact, when we did this, the mean of the sample means was 75 and the standard deviation of the sample means was 3.6.
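These figures can be reproduced approximately by simulation. The following is a minimal sketch, assuming a normal population with mean 75 and standard deviation 8 as in the example; the number of repetitions (20,000), the seed, and the function name `simulate_sample_means` are arbitrary choices made here for illustration.

```python
import math
import random
import statistics

MU, SIGMA = 75, 8  # population mean and standard deviation from the example

def simulate_sample_means(n, repeats, rng):
    """Draw `repeats` samples of size n and return the list of their means."""
    return [
        statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(repeats)
    ]

rng = random.Random(42)  # fixed seed so the run is repeatable
results = {}
for n in (10, 5):
    means = simulate_sample_means(n, 20_000, rng)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    theory = SIGMA / math.sqrt(n)
    print(f"n={n}: mean of means={results[n][0]:.2f}, "
          f"sd of means={results[n][1]:.2f}, sigma/sqrt(n)={theory:.2f}")
```

The simulated standard deviations should land close to 8/√10 ≈ 2.5 and 8/√5 ≈ 3.6, with the mean of the sample means close to 75 in both cases.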

Central Limit Theorem with a Dichotomous Outcome

Now suppose we measure a characteristic, X, in a population and that this characteristic is dichotomous (e.g., success of a medical procedure: yes or no) with 30% of the population classified as a success (i.e., p=0.30) as shown below.

[Figure: dichotomous population with 30% classified as a success (p=0.30)]

The Central Limit Theorem applies even to binomial populations like this provided that the minimum of np and n(1-p) is at least 5, where "n" refers to the sample size, and "p" is the probability of "success" on any given trial. In this case, we will take samples of n=20 with replacement, so min(np, n(1-p)) = min(20(0.3), 20(0.7)) = min(6, 14) = 6. Therefore, the criterion is met.

We saw previously that the population mean and standard deviation for a binomial distribution are:

Mean binomial probability:

$$\mu = np$$

Standard deviation:

$$\sigma = \sqrt{np(1-p)}$$

The distribution of sample means based on samples of size n=20 is shown below.

[Figure: distribution of sample means for samples of size n=20]

The mean of the sample means is

[equation not recoverable from the source]

and the standard deviation of the sample means is:

[equation not recoverable from the source]

Now, instead of taking samples of n=20, suppose we take simple random samples (with replacement) of size n=10. Note that in this scenario we do not meet the sample size requirement for the Central Limit Theorem (i.e., min(np, n(1-p)) = min(10(0.3), 10(0.7)) = min(3, 7) = 3). The distribution of sample means based on samples of size n=10 is not quite normally distributed. The sample size must be larger in order for the distribution to approach normality.
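The sample size check used in both scenarios is easy to automate. This is a small sketch, assuming the "at least 5" form of the rule of thumb stated in this section; the function name `normal_approximation_ok` and its `threshold` parameter are hypothetical, introduced only for illustration.

```python
def normal_approximation_ok(n, p, threshold=5):
    """Return True when min(n*p, n*(1-p)) is at least `threshold`."""
    return min(n * p, n * (1 - p)) >= threshold

# The two scenarios from the text: p = 0.30 with n = 20 and n = 10.
for n in (20, 10):
    smallest = min(n * 0.3, n * 0.7)
    print(f"n={n}: min(np, n(1-p)) = {smallest:g}, "
          f"criterion met: {normal_approximation_ok(n, 0.3)}")
```

With p=0.30, the check passes for n=20 and fails for n=10, matching the two cases worked through above.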

Central Limit Theorem with a Skewed Distribution

The Poisson distribution is another probability model that is useful for modeling discrete variables such as the number of events occurring during a given time interval. For example, suppose you typically receive about 4 spam emails per day, but the number varies from day to day. Today you happened to receive 5 spam emails. What is the probability of that happening, given that the typical rate is 4 per day? The Poisson probability is:

$$P(X; \mu) = \frac{e^{-\mu}\,\mu^{X}}{X!}$$

Mean = μ

Standard deviation = $\sqrt{\mu}$

The mean for the distribution is μ (the average or typical rate), "X" is the actual number of events that occur ("successes"), and "e" is the constant approximately equal to 2.71828. So, in the example above:

$$P(5; 4) = \frac{e^{-4}\,4^{5}}{5!} \approx 0.156$$
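The spam-email calculation can be verified numerically. The snippet below is a minimal sketch using only the Python standard library; the function name `poisson_pmf` is introduced here for illustration.

```python
import math

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson distribution with mean mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

# Probability of receiving exactly 5 spam emails when the typical rate is 4 per day.
p_five = poisson_pmf(5, 4)
print(f"P(X = 5 | mu = 4) = {p_five:.4f}")
```

So a day with exactly 5 spam emails is expected a little under 16% of the time when the typical rate is 4 per day.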

Now let's consider another Poisson distribution, with μ=3 and σ=1.73. The distribution is shown in the figure below.

 

[Figure: Poisson distribution with μ=3 and σ=1.73]

This population is not normally distributed, but the Central Limit Theorem will apply if the sample size is about 30 or more. In fact, if we take samples of size n=30, the sample means are distributed as shown in the first graph below, with a mean of 3 and a standard deviation of 0.32. In contrast, with small samples of n=10, the sample means are distributed as shown in the lower graph. Note that n=10 does not meet the criterion for the Central Limit Theorem, and the small samples give a distribution that is not quite normal. Also note that the standard deviation of the sample means (also called the "standard error") is larger with smaller samples, because it is obtained by dividing the population standard deviation by the square root of the sample size. Another way of thinking about this is that extreme values will have less impact on the sample mean when the sample size is large.

[Figure: distribution of sample means for samples of size n=30]

[Figure: distribution of sample means for samples of size n=10]
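The standard errors for the Poisson example can also be approximated by simulation. The sketch below assumes μ=3 as in the text; because Python's standard library has no Poisson sampler, it uses Knuth's multiplication method, which is a technique chosen here for illustration and not something the source describes. The function name `poisson_draw`, the seed, and the 10,000 repetitions are likewise arbitrary.

```python
import math
import random
import statistics

def poisson_draw(mu, rng):
    """One Poisson(mu) variate via Knuth's multiplication method (fine for small mu)."""
    limit = math.exp(-mu)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

MU = 3  # Poisson mean from the example; the population sd is sqrt(3), about 1.73
rng = random.Random(7)  # fixed seed so the run is repeatable
spread = {}
for n in (30, 10):
    means = [
        statistics.fmean([poisson_draw(MU, rng) for _ in range(n)])
        for _ in range(10_000)
    ]
    spread[n] = statistics.stdev(means)
    print(f"n={n}: sd of sample means={spread[n]:.3f}, "
          f"sqrt(mu)/sqrt(n)={math.sqrt(MU / n):.3f}")
```

The simulated spread for n=30 should be close to 1.73/√30 ≈ 0.32, and the spread for n=10 should be noticeably larger, at roughly 1.73/√10 ≈ 0.55.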