What happens if a hypothesis cannot be tested?

A hypothesis is a testable explanation of cause and effect.

A scientific hypothesis must be testable. Hypotheses that cannot be tested, such as cause and effect attributed to a supernatural being or to an invisible fifth dimension that cannot be detected, are not part of science. They are pseudoscience.

A good hypothesis is a productive one. A productive hypothesis can:

Be easily learned and applied

Explain the past accurately and persuasively

Make accurate predictions about the future

Generate new, even more useful hypotheses

Be applied to a wide variety of situations

Be easily tested

The more of these attributes a hypothesis has, the more problems it can solve.

Why this abstraction is important for solving difficult problems

The Scientific Method

1. Observe a phenomenon that has no good explanation.

2. Formulate a hypothesis.

3. Design an experiment(s) to test the hypothesis.

4. Perform the experiment(s).

5. Accept, reject, or modify the hypothesis.

The second step in the Scientific Method is to formulate a hypothesis. Here the challenge is to synthesize the most productive hypothesis possible.

Problem solvers draw on a constant stream of hypotheses. Some are reused; the rest are new. A problem is solved by the particular combination of hypotheses that leads to its solution. The more hypotheses you can reuse when solving a problem, the greater your chance of success and the faster and more efficiently you will solve it.

The best way to maximize hypothesis reuse is to use a mature problem solving process. A process is a reusable series of steps and related practices for achieving a goal, such as solving a class of similar problems. Each step is a reusable hypothesis of what will help you the most in the many steps required to achieve the goal.

What process are you using to solve your biggest problems? How well is it working? What percentage of the hypotheses needed to solve the problem do you have to work out on your own? The higher that percentage, the harder the problem is to solve.

What happens if a hypothesis cannot be tested?

By Mariecor Agravante. Updated April 26, 2018.

The most common framework used when performing an experiment is the Scientific Method. The hallmarks of the Scientific Method include: asking a specific question, devising a hypothesis, experimenting to gather data, analyzing the data, and then evaluating whether the hypothesis is correct based on the experimental data. When the data support the hypothesis, the findings can be published or shared. However, what happens if the findings do not confirm the hypothesis? Here are possible next steps to take.

The write-up is part of the evaluation process of the experiment. No matter what happened during the experiment, the results have to be shared, whether they confirm or refute the hypothesis. Assess all stages of the experiment – the hypothesis, the experimental stage and the analysis phase – and disclose the results. Next, identify problems that arose during the experimental process, and follow that in the write-up with suggestions for improvements and future courses of action. The key to crafting the section on future courses of action is to work systematically backward to ascertain where the error might have taken place, and then to make corrections and see whether changes in those areas lead to different results. The write-up documents what happened during the experiment, and it becomes part of the background literature surrounding the issue being questioned or experimented on.

Make slight changes in the process by methodically working backward, starting with a check on the analysis process. Was the analysis off? Sometimes experimental data are incorrectly assessed, so you have to ascertain whether the analysis is where the error lies. For example, some physics experiments require mathematical calculations. If these calculations contain errors, the analysis will show data that do not coincide with the hypothesis. Correcting any mathematical calculations is a necessary step after any experiment, especially if they have a bearing on whether the data confirm the hypothesis. Besides mathematical calculations, analyses may center on comparisons, predictions or discoveries. If the analysis reveals discrepancies, check whether errors crept into the comparison, prediction or discovery process. Rooting out these errors can resolve data-to-hypothesis discrepancies.

Human error can skew experimental data, and it can rear its head at the experimental stage – whether in setting up the experiment, running it, observing it or tabulating the results. Minimizing errors at the experimental stage can affect whether the results confirm the hypothesis. There may also have been unanticipated or unmeasurable variables that affected the experimental results.

Perhaps a different experiment can better test the hypothesis. There are situations in which an experiment is not the appropriate type to test a hypothesis. Perhaps design problems arose that were not evident in theory or on paper but became apparent during actual application. If so, an entirely different experiment may be needed. Experiments are essentially approaches and data-gathering methodologies to test a hypothesis. In other words, Experiment A utilizes Approach/Methodology A to test the hypothesis. If the results do not confirm the hypothesis, devise Experiment B with Approach/Methodology B.

If several different experiments all reveal that the hypothesis has not been confirmed, a revision of the hypothesis is in order. Perhaps it was the hypothesis all along that needed amendment. If so, devise a new way to ask a question and formulate an educated guess. Was there something amiss in the cause-and-effect relationship? Were associations and correlations assumed incorrectly? Remember that a hypothesis is a tentative description of some phenomenon. If several reproducible experiments show the hypothesis does not work, then it might be time to reject the hypothesis and replace it with a more viable one.

By Amitav Banerjee, U. B. Chitnis, S. L. Jadhav and J. S. Bhawalkar (Department of Community Medicine, D. Y. Patil Medical College, Pune, India) and S. Chaudhury (Department of Psychiatry, RINPAS, Kanke, Ranchi, India).

Hypothesis testing is an important activity of empirical research and evidence-based medicine. A well-worked-up hypothesis is half the answer to the research question. For this, both knowledge of the subject derived from an extensive review of the literature and a working knowledge of basic statistical concepts are desirable. The present paper discusses the methods of working up a good hypothesis and the statistical concepts of hypothesis testing.

Keywords: Effect size, Hypothesis testing, Type I error, Type II error

Karl Popper is probably the most influential philosopher of science in the 20th century (Wulff et al., 1986). Many scientists, even those who do not usually read books on philosophy, are acquainted with the basic principles of his views on science. The popularity of Popper’s philosophy is due partly to the fact that it has been well explained in simple terms by, among others, the Nobel Prize winner Peter Medawar (Medawar, 1969). Popper makes the very important point that empirical scientists (those who stress observation alone as the starting point of research) put the cart before the horse when they claim that science proceeds from observation to theory, since there is no such thing as a pure observation which does not depend on theory. Popper states, “… the belief that we can start with pure observation alone, without anything in the nature of a theory, is absurd: As may be illustrated by the story of the man who dedicated his life to natural science, wrote down everything he could observe, and bequeathed his ‘priceless’ collection of observations to the Royal Society to be used as inductive (empirical) evidence …”

The first step in the scientific process is not observation but the generation of a hypothesis which may then be tested critically by observations and experiments. Popper also makes the important claim that the goal of the scientist’s efforts is not the verification but the falsification of the initial hypothesis. It is logically impossible to verify the truth of a general law by repeated observations, but, at least in principle, it is possible to falsify such a law by a single observation. Repeated observations of white swans did not prove that all swans are white, but the observation of a single black swan sufficed to falsify that general statement (Popper, 1976).

A good hypothesis must be based on a good research question. It should be simple, specific and stated in advance (Hulley et al., 2001).

A simple hypothesis contains one predictor and one outcome variable, e.g., a positive family history of schizophrenia increases the risk of developing the condition in first-degree relatives. Here the single predictor variable is positive family history of schizophrenia and the outcome variable is schizophrenia. A complex hypothesis contains more than one predictor variable or more than one outcome variable, e.g., a positive family history and stressful life events are associated with an increased incidence of Alzheimer’s disease. Here there are 2 predictor variables (positive family history and stressful life events) and one outcome variable (Alzheimer’s disease). A complex hypothesis like this cannot be easily tested with a single statistical test and should always be separated into 2 or more simple hypotheses.

A specific hypothesis leaves no ambiguity about the subjects and variables, or about how the test of statistical significance will be applied. It uses concise operational definitions that summarize the nature and source of the subjects and the approach to measuring variables (History of medication with tranquilizers, as measured by review of medical store records and physicians’ prescriptions in the past year, is more common in patients who attempted suicides than in controls hospitalized for other conditions). This is a long-winded sentence, but it explicitly states the nature of predictor and outcome variables, how they will be measured and the research hypothesis. Often these details may be included in the study proposal and may not be stated in the research hypothesis. However, they should be clear in the mind of the investigator while conceptualizing the study.

The hypothesis must be stated in writing during the proposal stage. This will help to keep the research effort focused on the primary objective and create a stronger basis for interpreting the study’s results as compared to a hypothesis that emerges as a result of inspecting the data. The habit of post hoc hypothesis testing (common among researchers) is nothing but using third-degree methods on the data (data dredging) to yield at least something significant. This leads to overrating the occasional chance associations in the study.

For the purpose of testing statistical significance, hypotheses are classified by the way they describe the expected difference between the study groups.

The null hypothesis states that there is no association between the predictor and outcome variables in the population (There is no difference between tranquilizer habits of patients with attempted suicides and those of age- and sex- matched “control” patients hospitalized for other diagnoses). The null hypothesis is the formal basis for testing statistical significance. By starting with the proposition that there is no association, statistical tests can estimate the probability that an observed association could be due to chance.

The proposition that there is an association — that patients with attempted suicides will report different tranquilizer habits from those of the controls — is called the alternative hypothesis. The alternative hypothesis cannot be tested directly; it is accepted by exclusion if the test of statistical significance rejects the null hypothesis.
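
To make this null-versus-alternative logic concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they are not data from the studies discussed here), and the use of the scipy library is an assumption about the available tooling.

    # Minimal sketch: testing the null hypothesis of "no association"
    # with a chi-square test on hypothetical counts.
    from scipy.stats import chi2_contingency

    # Rows: attempted-suicide patients, matched controls
    # Columns: used tranquilizers in the past year (yes, no)
    observed = [[30, 70],   # hypothetical cases
                [15, 85]]   # hypothetical controls

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")

    # The alternative hypothesis is accepted only by exclusion:
    # reject the null when p falls below the pre-specified alpha.
    alpha = 0.05
    if p_value < alpha:
        print("Reject the null hypothesis: evidence of an association.")
    else:
        print("Fail to reject the null hypothesis.")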

A one-tailed (or one-sided) hypothesis specifies the direction of the association between the predictor and outcome variables. The prediction that patients with attempted suicides will have a higher rate of use of tranquilizers than control patients is a one-tailed hypothesis. A two-tailed hypothesis states only that an association exists; it does not specify the direction. The prediction that patients with attempted suicides will have a different rate of tranquilizer use — either higher or lower than control patients — is a two-tailed hypothesis. (The word tails refers to the tail ends of the statistical distribution, such as the familiar bell-shaped normal curve, that is used to test a hypothesis. One tail represents a positive effect or association; the other, a negative effect.) A one-tailed hypothesis has the statistical advantage of permitting a smaller sample size than a two-tailed hypothesis. Unfortunately, one-tailed hypotheses are not always appropriate; in fact, some investigators believe that they should never be used. However, they are appropriate when only one direction for the association is important or biologically meaningful. An example is the one-sided hypothesis that a drug has a greater frequency of side effects than a placebo; the possibility that the drug has fewer side effects than the placebo is not worth testing. Whatever strategy is used, it should be stated in advance; otherwise, it would lack statistical rigor. Dredging the data after they have been collected and then deciding post hoc to switch to one-tailed hypothesis testing in order to reduce the sample size and P value are indicative of a lack of scientific integrity.
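
The practical difference between one-tailed and two-tailed testing is easiest to see side by side. The following sketch uses hypothetical simulated side-effect scores for a drug and a placebo group (scipy and numpy assumed) and runs the same comparison both ways.

    # Minimal sketch: one-tailed vs. two-tailed tests on the same simulated data.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    drug = rng.normal(loc=5.5, scale=2.0, size=50)      # hypothetical drug group
    placebo = rng.normal(loc=5.0, scale=2.0, size=50)   # hypothetical placebo group

    # Two-tailed: "the groups differ" (direction unspecified)
    _, p_two = ttest_ind(drug, placebo, alternative="two-sided")

    # One-tailed: "the drug group scores higher than the placebo group"
    _, p_one = ttest_ind(drug, placebo, alternative="greater")

    print(f"two-tailed p = {p_two:.3f}")
    print(f"one-tailed p = {p_one:.3f}")  # half the two-tailed p when the effect lies in the predicted direction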

A hypothesis (for example, Tamiflu [oseltamivir], drug of choice in H1N1 influenza, is associated with an increased incidence of acute psychotic manifestations) is either true or false in the real world. Because the investigator cannot study all people who are at risk, he must test the hypothesis in a sample of that target population. No matter how many data a researcher collects, he can never absolutely prove (or disprove) his hypothesis. There will always be a need to draw inferences about phenomena in the population from events observed in the sample (Hulley et al., 2001). In some ways, the investigator’s problem is similar to that faced by a judge judging a defendant [Table 1]. The absolute truth whether the defendant committed the crime cannot be determined. Instead, the judge begins by presuming innocence — the defendant did not commit the crime. The judge must decide whether there is sufficient evidence to reject the presumed innocence of the defendant; the standard is known as beyond a reasonable doubt. A judge can err, however, by convicting a defendant who is innocent, or by failing to convict one who is actually guilty. In similar fashion, the investigator starts by presuming the null hypothesis, or no association between the predictor and outcome variables in the population. Based on the data collected in his sample, the investigator uses statistical tests to determine whether there is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis that there is an association in the population. The standard for these tests is shown as the level of statistical significance.

Table 1: The analogy between a judge’s decisions and statistical tests

Judge’s decision | Statistical test
Innocence: The defendant did not commit the crime | Null hypothesis: No association between Tamiflu and psychotic manifestations
Guilt: The defendant did commit the crime | Alternative hypothesis: There is an association between Tamiflu and psychosis
Standard for rejecting innocence: Beyond a reasonable doubt | Standard for rejecting the null hypothesis: Level of statistical significance (α)
Correct judgment: Convict a criminal | Correct inference: Conclude that there is an association when one does exist in the population
Correct judgment: Acquit an innocent person | Correct inference: Conclude that there is no association between Tamiflu and psychosis when one does not exist
Incorrect judgment: Convict an innocent person | Incorrect inference (Type I error): Conclude that there is an association when there actually is none
Incorrect judgment: Acquit a criminal | Incorrect inference (Type II error): Conclude that there is no association when there actually is one

Just like a judge’s conclusion, an investigator’s conclusion may be wrong. Sometimes, by chance alone, a sample is not representative of the population. Thus the results in the sample do not reflect reality in the population, and the random error leads to an erroneous inference. A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population. Although type I and type II errors can never be avoided entirely, the investigator can reduce their likelihood by increasing the sample size (the larger the sample, the lesser is the likelihood that it will differ substantially from the population).
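
A small simulation makes the two error types, and the benefit of a larger sample, tangible. The sketch below uses hypothetical parameters (numpy and scipy assumed): it repeatedly draws samples, first when the null hypothesis is really true and then when a real effect exists, and counts how often the test is misled.

    # Minimal simulation: type I errors under a true null, type II errors
    # under a real effect, and the effect of a larger sample size.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)
    alpha, n_trials = 0.05, 2000

    def rejection_rate(true_difference, n):
        """Fraction of simulated studies that reject the null hypothesis."""
        rejections = 0
        for _ in range(n_trials):
            group_a = rng.normal(0.0, 1.0, n)
            group_b = rng.normal(true_difference, 1.0, n)
            _, p = ttest_ind(group_a, group_b)
            rejections += p < alpha
        return rejections / n_trials

    # Null hypothesis true: every rejection is a type I error (expect about alpha).
    print("type I error rate:", rejection_rate(0.0, 30))

    # Null hypothesis false (true effect of 0.5 SD): every non-rejection is a
    # type II error; a larger sample makes them rarer.
    print("type II error rate, n = 30:", 1 - rejection_rate(0.5, 30))
    print("type II error rate, n = 100:", 1 - rejection_rate(0.5, 100))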

False-positive and false-negative results can also occur because of bias (observer, instrument, recall, etc.). (Errors due to bias, however, are not referred to as type I and type II errors.) Such errors are troublesome, since they may be difficult to detect and cannot usually be quantified.

The likelihood that a study will be able to detect an association between a predictor variable and an outcome variable depends, of course, on the actual magnitude of that association in the target population. If it is large (such as 90% increase in the incidence of psychosis in people who are on Tamiflu), it will be easy to detect in the sample. Conversely, if the size of the association is small (such as 2% increase in psychosis), it will be difficult to detect in the sample. Unfortunately, the investigator often does not know the actual magnitude of the association — one of the purposes of the study is to estimate it. Instead, the investigator must choose the size of the association that he would like to be able to detect in the sample. This quantity is known as the effect size. Selecting an appropriate effect size is the most difficult aspect of sample size planning. Sometimes, the investigator can use data from other studies or pilot tests to make an informed guess about a reasonable effect size. When there are no data with which to estimate it, he can choose the smallest effect size that would be clinically meaningful, for example, a 10% increase in the incidence of psychosis. Of course, from the public health point of view, even a 1% increase in psychosis incidence would be important. Thus the choice of the effect size is always somewhat arbitrary, and considerations of feasibility are often paramount. When the number of available subjects is limited, the investigator may have to work backward to determine whether the effect size that his study will be able to detect with that number of subjects is reasonable.
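
One way to do this working backward in practice is with a standard power calculation. The sketch below uses hypothetical incidence figures and assumes the statsmodels library is available; it first computes the sample size needed for a chosen effect size, then reverses the question for a fixed number of available subjects.

    # Minimal sketch: choosing an effect size and working backward from a
    # limited number of subjects, using statsmodels power calculations.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    power_calc = NormalIndPower()

    # Smallest clinically meaningful effect (hypothetical): psychosis incidence
    # rising from a 10% baseline to 20%.
    effect = proportion_effectsize(0.20, 0.10)
    n_needed = power_calc.solve_power(effect_size=effect, alpha=0.05, power=0.80)
    print(f"subjects needed per group: {n_needed:.0f}")

    # Working backward: with only 100 subjects per group available, what
    # standardized effect size could be detected with 80% power?
    detectable = power_calc.solve_power(nobs1=100, alpha=0.05, power=0.80)
    print(f"detectable standardized effect size: {detectable:.2f}")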

After a study is completed, the investigator uses statistical tests to try to reject the null hypothesis in favor of its alternative (much in the same way that a prosecuting attorney tries to convince a judge to reject innocence in favor of guilt). Depending on whether the null hypothesis is true or false in the target population, and assuming that the study is free of bias, 4 situations are possible, as shown in Table 2 below. In 2 of these, the findings in the sample and reality in the population are concordant, and the investigator’s inference will be correct. In the other 2 situations, either a type I (α) or a type II (β) error has been made, and the inference will be incorrect.

Table 2: Truth in the population versus the results in the study sample: the four possibilities

Statistical test result | Association present in the population | No association in the population
Reject null hypothesis | Correct | Type I error
Fail to reject null hypothesis | Type II error | Correct

The investigator establishes the maximum chance of making type I and type II errors in advance of the study. The probability of committing a type I error (rejecting the null hypothesis when it is actually true) is called α (alpha); another name for it is the level of statistical significance.

If a study of Tamiflu and psychosis is designed with α = 0.05, for example, then the investigator has set 5% as the maximum chance of incorrectly rejecting the null hypothesis (and erroneously inferring that use of Tamiflu and psychosis incidence are associated in the population). This is the level of reasonable doubt that the investigator is willing to accept when he uses statistical tests to analyze the data after the study is completed.

The probability of making a type II error (failing to reject the null hypothesis when it is actually false) is called β (beta). The quantity (1 − β) is called power, the probability of observing in the sample an effect of a specified size or greater, if one exists in the population.

If β is set at 0.10, then the investigator has decided that he is willing to accept a 10% chance of missing an association of a given effect size between Tamiflu and psychosis. This represents a power of 0.90, i.e., a 90% chance of finding an association of that size. For example, suppose that there really would be a 30% increase in psychosis incidence if the entire population took Tamiflu. Then 90 times out of 100, the investigator would observe an effect of that size or larger in his study. This does not mean, however, that the investigator will be absolutely unable to detect a smaller effect; just that he will have less than 90% likelihood of doing so.
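
The same power machinery shows what β = 0.10 buys and what it costs. The sketch below uses a hypothetical baseline incidence (statsmodels assumed): it finds the sample size that gives 90% power against a 30% relative increase, then shows that the same study has far less power against a smaller true effect.

    # Minimal sketch: power = 1 - beta for a chosen effect, and lower power
    # against a smaller true effect, using hypothetical incidence figures.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    power_calc = NormalIndPower()

    # Hypothetical: 10% baseline psychosis incidence, a true 30% relative
    # increase under the drug (i.e., 13%).
    effect = proportion_effectsize(0.13, 0.10)
    n_per_group = power_calc.solve_power(effect_size=effect, alpha=0.05, power=0.90)
    print(f"subjects per group for 90% power: {n_per_group:.0f}")

    # With that sample size, a smaller true effect (a 15% relative increase)
    # is detected far less than 90% of the time.
    smaller = proportion_effectsize(0.115, 0.10)
    power_small = power_calc.solve_power(effect_size=smaller, nobs1=n_per_group,
                                         alpha=0.05)
    print(f"power against the smaller effect: {power_small:.2f}")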

Ideally alpha and beta errors would be set at zero, eliminating the possibility of false-positive and false-negative results. In practice they are made as small as possible. Reducing them, however, usually requires increasing the sample size. Sample size planning aims at choosing a sufficient number of subjects to keep alpha and beta at acceptably low levels without making the study unnecessarily expensive or difficult.

Many studies set alpha at 0.05 and beta at 0.20 (a power of 0.80). These are somewhat arbitrary values, and others are sometimes used; the conventional range for alpha is between 0.01 and 0.10; and for beta, between 0.05 and 0.20. In general the investigator should choose a low value of alpha when the research question makes it particularly important to avoid a type I (false-positive) error, and he should choose a low value of beta when it is especially important to avoid a type II error.
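
The cost of more stringent alpha and beta values shows up directly in the required sample size. This short sketch (a hypothetical standardized effect size of 0.5; statsmodels assumed) compares the conventional choices with stricter ones.

    # Minimal sketch: required sample size grows as alpha and beta shrink.
    from statsmodels.stats.power import TTestIndPower

    power_calc = TTestIndPower()
    effect_size = 0.5  # hypothetical effect, in standard-deviation units

    for alpha, power in [(0.05, 0.80), (0.01, 0.90), (0.001, 0.95)]:
        n = power_calc.solve_power(effect_size=effect_size, alpha=alpha, power=power)
        print(f"alpha = {alpha}, power = {power}: about {n:.0f} subjects per group")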

The null hypothesis acts like a punching bag: It is assumed to be true so that a statistical test can try to knock it down. When the data are analyzed, such tests determine the P value, the probability of obtaining the study results by chance if the null hypothesis is true. The null hypothesis is rejected in favor of the alternative hypothesis if the P value is less than alpha, the predetermined level of statistical significance (Daniel, 2000). “Nonsignificant” results — those with a P value greater than alpha — do not imply that there is no association in the population; they only mean that the association observed in the sample is not large compared with what could have occurred by chance alone. For example, an investigator might find that men with a family history of mental illness were twice as likely to develop schizophrenia as those with no family history, but with a P value of 0.09. This means that even if family history and schizophrenia were not associated in the population, there was a 9% chance of finding such an association due to random error in the sample. If the investigator had set the significance level at 0.05, he would have to conclude that the association in the sample was “not statistically significant.” It might be tempting for the investigator to change his mind about the level of statistical significance ex post facto and report that the results “showed statistical significance at P < .10”. A better choice would be to report that the “results, although suggestive of an association, did not achieve statistical significance (P = .09)”. This solution acknowledges that statistical significance is not an “all or none” situation.

Hypothesis testing is the sheet anchor of empirical research and of the rapidly emerging practice of evidence-based medicine. However, empirical research and, ipso facto, hypothesis testing have their limits. The empirical approach to research cannot eliminate uncertainty completely. At best, it can quantify uncertainty. This uncertainty can be of 2 types: Type I error (falsely rejecting a null hypothesis) and type II error (falsely accepting a null hypothesis). The acceptable magnitudes of type I and type II errors are set in advance and are important for sample size calculations. Another important point to remember is that we cannot ‘prove’ or ‘disprove’ anything by hypothesis testing and statistical tests. We can only knock down or reject the null hypothesis and by default accept the alternative hypothesis. If we fail to reject the null hypothesis, we accept it by default.


  • Daniel W. W. Hypothesis testing. In: Biostatistics. 7th ed. New York: John Wiley and Sons, Inc.; 2002. pp. 204–294.
  • Hulley S. B., Cummings S. R., Browner W. S., Grady D., Hearst N., Newman T. B. Getting ready to estimate sample size: Hypotheses and underlying principles. In: Designing Clinical Research: An Epidemiologic Approach. 2nd ed. Philadelphia: Lippincott Williams and Wilkins; 2001. pp. 51–63.
  • Medawar P. B. Induction and Intuition in Scientific Thought. Philadelphia: American Philosophical Society; 1969.
  • Popper K. Unended Quest: An Intellectual Autobiography. Fontana Collins; 1976. p. 42.
  • Wulff H. R., Pedersen S. A., Rosenberg R. Empiricism and Realism: A philosophical problem. In: Philosophy of Medicine. Oxford: Blackwell Scientific Publications; 1986.