2. Sample size
The sample size relates to the number of experimental units in each group at the start of the study, and is usually represented by n (see item 1 – Study design for further guidance on identifying and reporting experimental units). This information is crucial to assess the validity of the statistical model and the robustness of the experimental results.
The sample size in each group at the start of the study may differ from the n numbers in the analysis (see item 3 – Inclusion and exclusion criteria). This information helps readers identify attrition or exclusions, and the group in which they occurred. Reporting the total number of animals used in the study is also useful to identify whether any were re-used between experiments.
Report the exact value of n per group and the total number in each experiment (including any independent replications). If the experimental unit is not the animal, also report the total number of animals to help readers understand the study design. For example, in a study investigating diet using cages of animals housed in pairs, the number of animals is double the number of experimental units.
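As a minimal illustration of the distinction between experimental units and animals, the Python sketch below tallies both for a pair-housed diet study. The group names, cage counts and the `summarise_sample_size` helper are hypothetical, chosen only for this example.

```python
def summarise_sample_size(units_per_group, animals_per_unit):
    """Return n per group, total experimental units and total animals."""
    total_units = sum(units_per_group.values())
    return {
        "n_per_group": dict(units_per_group),
        "total_units": total_units,
        "total_animals": total_units * animals_per_unit,
    }

# Hypothetical study: the cage is the experimental unit, two animals per cage.
summary = summarise_sample_size(
    {"control diet": 8, "test diet": 8}, animals_per_unit=2
)
print(summary)
```

Here n is 8 cages per group, while the total number of animals to report is twice the total number of cages.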
“Treatment and transplantation received by the animals for each group. The group were named after the treatment they received in treatment phase 1 and 2: S = saline, L = L-DOPA thus, SS group received saline in treatment phase 1 and 2, SL group received saline in treatment phase 1 and L-DOPA in treatment phase 2, LS received L-DOPA in Treatment phase 1 and saline in treatment phase 2 and LL received L-DOPA in treatment phase 1 and 2. *These groups originally numbered 12 animals but some developed tumors unrelated to the experiment, so had be removed from the study.” 
- Breger LS, Kienle K, Smith GA, Dunnett SB and Lane EL (2017). Influence of chronic L-DOPA treatment on immune response following allogeneic and xenogeneic graft in a rat model of Parkinson's disease. Brain Behav Immun. doi: 10.1016/j.bbi.2016.11.014
For any type of experiment, it is crucial to explain how the sample size was determined. For hypothesis-testing experiments, where inferential statistics are used to estimate the size of the effect and to determine the weight of evidence against the null hypothesis, the sample size needs to be justified to ensure experiments are of an optimal size to test the research question [1,2] (see item 13 – Objectives). Sample sizes that are too small (i.e. underpowered studies) produce inconclusive results, whereas sample sizes that are too large (i.e. overpowered studies) raise ethical issues over unnecessary use of animals and may produce trivial findings that are statistically significant but not biologically relevant. Low power has three effects: first, within the experiment, real effects are more likely to be missed; second, where an effect is detected, this will often be an over-estimation of the true effect size; and finally, when low power is combined with publication bias, there is an increase in the false positive rate in the published literature. Consequently, low-powered studies contribute to the poor internal validity of research and risk wasting animals used in inconclusive research.
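The second effect of low power, over-estimation of the effect size among experiments that reach significance, can be illustrated with a small simulation. The sketch below is a simplified illustration only: it assumes two groups, normally distributed data, a known standard deviation (so a z-test is used rather than a t-test), and hypothetical values for the true effect, the group size and the number of simulated experiments.

```python
import math
import random
from statistics import NormalDist, mean

def simulate_significant_effects(true_effect, sd, n_per_group,
                                 n_sims=2000, alpha=0.05, seed=1):
    """Simulate two-group experiments and return the estimated effects
    from those that reach two-sided significance (z-test, known SD)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    significant = []
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, sd) for _ in range(n_per_group)]
        diff = mean(treated) - mean(control)
        if abs(diff) / se > z_crit:
            significant.append(diff)
    return significant

# Hypothetical underpowered design: true effect of 0.5 SD, 10 units per group.
effects = simulate_significant_effects(true_effect=0.5, sd=1.0, n_per_group=10)
print("proportion significant (estimated power):", len(effects) / 2000)
print("mean effect among significant results:", mean(effects))
```

With so few units per group, only differences well above the true effect can cross the significance threshold, so the average of the significant estimates exceeds the true value of 0.5.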
Study design can influence the statistical power of an experiment and the power calculation used needs to be appropriate for the design implemented. Statistical programs to help perform a priori sample size calculations exist for a variety of experimental designs and statistical analyses, both freeware (web-based applets and functions in R) and commercial software [7-9]. Choosing the appropriate calculator or algorithm depends on the type of outcome measures and independent variables, and the number of groups. Consultation with a statistician is recommended, especially when the experimental design is complex or unusual.
Where the experiment tests the effect of an intervention on the mean of a continuous outcome measure, the sample size can be calculated a priori, based on a mathematical relationship between the predefined, biologically relevant effect size, variability estimated from prior data, chosen significance level, power and sample size (see "Information used in a power calculation", below, and [10,11] for practical advice). If you have used an a priori sample size calculation, report:
- the analysis method (e.g. two-tailed Student's t-test with a 0.05 significance threshold)
- the effect size of interest and a justification explaining why an effect size of that magnitude is relevant
- the estimate of variability used (e.g. standard deviation) and how it was estimated
- the power selected
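The relationship between these reported quantities can be sketched for the simplest case, a two-group comparison of means. The example below uses the common normal approximation n = 2(z_(1-α/2) + z_(1-β))² σ²/δ² per group, where δ is the effect size and σ the standard deviation. It is an approximation rather than an exact t-test calculation and gives slightly smaller numbers than dedicated power software; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, sd, alpha=0.05, power=0.80, two_sided=True):
    """Approximate units per group for comparing two means
    (normal approximation, equal group sizes, equal variances)."""
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)   # critical value for the significance level
    z_beta = z.inv_cdf(power)       # quantile corresponding to the target power
    n = 2 * ((z_alpha + z_beta) * sd / effect_size) ** 2
    return math.ceil(n)

# Hypothetical inputs: relevant difference of 10 units, SD of 10 units.
print(n_per_group(effect_size=10, sd=10, alpha=0.05, power=0.80))
print(n_per_group(effect_size=10, sd=10, alpha=0.05, power=0.90))
```

Each argument corresponds to one item in the list above, which is why all of them need to be reported for the calculation to be reproducible.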
Information used in a power calculation
Sample size calculation is based on a mathematical relationship between the following parameters: effect size, variability, significance level, power and sample size. Questions to consider are:
The primary objective of the experiment – what is the main outcome measure?
The primary outcome measure should be identified in the planning stage of the experiment; it is the outcome of greatest importance, which will answer the main experimental question.
The predefined effect size – what is a biologically relevant effect size?
The effect size is estimated as a biologically relevant change in the primary outcome measure between the groups under study. This can be informed by similar studies and involves scientists exploring what magnitude of effect would generate interest and would be worth taking forward into further work. In preclinical studies, the clinical relevance of the effect should also be taken into consideration.
What is the estimate of variability?
Estimates of variability can be obtained from prior data, such as previous studies or systematic reviews.
Significance threshold – what risk of a false positive is acceptable?
The significance level or threshold (α) is the probability of obtaining a false positive when there is no true effect. If it is set at 0.05 then the risk of obtaining a false positive is 1 in 20 for a single statistical test. However, the threshold or the p-values will need to be adjusted in scenarios of multiple testing (e.g. by using a Bonferroni correction).
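To make the multiple-testing point concrete, the sketch below shows how the chance of at least one false positive grows with the number of tests, and how a Bonferroni correction (dividing α by the number of tests) restores it. The familywise calculation assumes independent tests, and the choice of five tests is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

n_tests = 5  # hypothetical number of comparisons
uncorrected = familywise_error_rate(0.05, n_tests)
adjusted_alpha = bonferroni_threshold(0.05, n_tests)
corrected = familywise_error_rate(adjusted_alpha, n_tests)
print(f"uncorrected familywise risk: {uncorrected:.3f}")
print(f"Bonferroni per-test threshold: {adjusted_alpha:.3f}")
print(f"corrected familywise risk: {corrected:.3f}")
```

A stricter per-test threshold also lowers power, so the adjusted threshold is the one that should be used in the sample size calculation.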
Power – what risk of a false negative is acceptable?
For a predefined, biologically meaningful effect size, the power (1-β) is the probability that the statistical test will detect the effect if it genuinely exists (i.e. true positive result). A target power between 80% and 95% is normally deemed acceptable, which entails a risk of false negative between 5% and 20%.
Directionality – will you use a one- or two-sided test?
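The link between group size, power and the risk of a false negative can be sketched as follows. The example uses a normal approximation for a two-sided, two-group comparison of means; the function name, the effect size of one standard deviation and the group sizes are hypothetical.

```python
import math
from statistics import NormalDist

def power_two_group(effect_size, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means
    (normal approximation, equal group sizes, equal variances)."""
    z = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)   # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size / se               # expected test statistic under H1
    # Probability that the statistic falls in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for n in (5, 10, 16, 22, 30):
    power = power_two_group(effect_size=1.0, sd=1.0, n_per_group=n)
    print(f"n = {n:2d}  power = {power:.2f}  false negative risk = {1 - power:.2f}")
```

Reading down the output shows how quickly the false negative risk falls as units are added, and where further increases bring little gain.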
The directionality of a test depends on the distribution of the test statistic for a given analysis. For tests based on t or z distributions (such as t-tests), whether the data will be analysed using a one- or two-sided test relates to whether the alternative hypothesis (H1) is directional or not. An experiment with a directional (one-sided) H1 can be powered and analysed with a one-sided test with the goal of maximising the sensitivity to detect this directional effect. Controversy exists within the statistics community on when it is appropriate to use a one-sided test. The use of a one-sided test requires investigators to justify why a treatment effect is only of interest when it is in a defined direction and why they would treat a large effect in the unexpected direction no differently from a non-significant difference. Following the use of a one-sided test, the investigator cannot then test for the possibility of missing an effect in the untested direction. Choosing a one-tailed test for the sole purpose of attaining statistical significance is not appropriate.
Two-sided tests with a non-directional H1 are much more common and allow researchers to detect the effect of a treatment regardless of its direction.
Note that analyses such as ANOVA and chi-square are based on asymmetrical distributions (F-distribution and chi-square distribution) with only one tail. Therefore, these tests do not have a directionality option.
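The consequences of choosing a one- or two-sided test for a z-based statistic can be sketched with a short example. The `p_value` helper and the two test statistics are hypothetical; the sketch only illustrates how the same result is judged under each alternative hypothesis.

```python
from statistics import NormalDist

def p_value(z_stat, alternative="two-sided"):
    """p-value for a z statistic under a chosen alternative hypothesis."""
    z = NormalDist()
    if alternative == "two-sided":
        return 2 * (1 - z.cdf(abs(z_stat)))
    if alternative == "greater":   # H1: effect in the positive direction only
        return 1 - z.cdf(z_stat)
    if alternative == "less":      # H1: effect in the negative direction only
        return z.cdf(z_stat)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

# A modest effect in the expected direction and a large one in the opposite direction.
for z_stat in (1.8, -3.0):
    print(z_stat,
          round(p_value(z_stat, "two-sided"), 4),
          round(p_value(z_stat, "greater"), 4))
```

The one-sided test is more sensitive to the modest effect in the expected direction, but it treats the large effect in the opposite direction as non-significant, which is the trade-off that needs to be justified.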
There are several types of studies where a priori sample size calculations are not appropriate. For example, the number of animals needed for antibody or tissue production is determined by the amount required and the production ability of an individual animal. For studies where the outcome is the successful generation of a sample or a condition (e.g. the production of transgenic animals), the number of animals is determined by the probability of success of the experimental procedure.
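Where the number of animals depends on the probability of success of a procedure, one possible way to reason about it is with a binomial model: find the smallest number of attempts that yields the required number of successes with a chosen level of assurance. The sketch below assumes independent attempts with a constant success probability; the function names, the 30% success rate and the 95% assurance level are hypothetical.

```python
import math

def prob_at_least(k, n, p):
    """Binomial probability of at least k successes in n independent attempts."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def animals_needed(successes_required, p_success, assurance=0.95):
    """Smallest n giving at least the required successes with the given assurance."""
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    if not 0 < assurance < 1:
        raise ValueError("assurance must be in (0, 1)")
    n = successes_required
    while prob_at_least(successes_required, n, p_success) < assurance:
        n += 1
    return n

# Hypothetical procedure that succeeds in 30% of animals.
print(animals_needed(successes_required=1, p_success=0.3))
print(animals_needed(successes_required=3, p_success=0.3))
```

The reasoning and the assumed success rate should be reported in place of a power calculation.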
In early feasibility or pilot studies, the number of animals required depends on the purpose of the study. Where the objective of the preliminary study is primarily logistic or operational (e.g. to improve procedures and equipment), the number of animals needed is generally small. In such cases power calculations are not appropriate and sample sizes can be estimated based on operational capacity and constraints. Pilot studies alone are unlikely to provide adequate data on variability for a power calculation for future experiments. Systematic reviews and previous studies are more appropriate sources of information on variability.
If no power calculation was used to determine the sample size, state this explicitly and provide the reasoning that was used to decide on the sample size per group. Regardless of whether a power calculation was used or not, when explaining how the sample size was determined take into consideration any anticipated loss of animals or data, for example due to exclusion criteria established upfront or expected attrition (see item 3 – Inclusion and exclusion criteria).
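One simple way to allow for anticipated losses is to inflate the number required for analysis by the expected attrition rate. The sketch below assumes losses occur independently of group and treatment; the function name and the example values are hypothetical.

```python
import math

def n_at_start(n_for_analysis, expected_attrition):
    """Units to allocate per group so that, after the expected proportion
    is lost or excluded, about n_for_analysis remain."""
    if not 0 <= expected_attrition < 1:
        raise ValueError("expected_attrition must be in [0, 1)")
    return math.ceil(n_for_analysis / (1 - expected_attrition))

# Hypothetical example: 16 units needed for analysis, 15% expected loss.
print(n_at_start(16, 0.15))
```

Both numbers, the n required for analysis and the n allocated at the start, should be reported together with the basis for the attrition estimate.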
- Vahidy F, Schäbitz W-R, Fisher M and Aronowski J (2016). Reporting standards for preclinical studies of stroke therapy. Stroke. doi: 10.1161/STROKEAHA.116.013643
- Muhlhausler BS, Bloomfield FH and Gillman MW (2013). Whole animal experiments should be more like human randomized controlled trials. PLoS Biol. doi: 10.1371/journal.pbio.1001481
- Jennions MD and Møller AP (2003). A survey of the statistical power of research in behavioral ecology and animal behavior. Behavioral Ecology. doi: 10.1093/beheco/14.3.438
- Lazic SE, Clarke-Williams CJ and Munafò MR (2018). What exactly is ‘N’ in cell culture and animal experiments? PLOS Biology. doi: 10.1371/journal.pbio.2005282
- Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ and Munafò MR (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. doi: 10.1038/nrn3475
- Würbel H (2017). More than 3Rs: The importance of scientific validity for harm-benefit analysis of animal research. Lab animal. doi: 10.1038/laban.1220
- R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Available at: https://www.R-project.org/
- Peng C-YJ, Long H and Abaci S (2012). Power Analysis Software for Educational Researchers. The Journal of Experimental Education. doi: 10.1080/00220973.2011.647115
- Charan J and Kantharia ND (2013). How to calculate sample size in animal studies? J Pharmacol Pharmacother. doi: 10.4103/0976-500X.119726
- Bate ST and Clark RA (2014). The design and statistical analysis of animal experiments. Cambridge University Press. https://www.cambridge.org/core/books/design-and-statistical-analysis-of-animal-experiments/BDD758F3C49CF5BEB160A9C54ED48706
- Festing MFW (2018). On determining sample size in experiments involving laboratory animals. Laboratory Animals. doi: 10.1177/0023677217738268
- Freedman LS (2008). An analysis of the controversy over classical one-sided tests. Clin Trials. doi: 10.1177/1740774508098590
- Ruxton GD and Neuhäuser M (2010). When should we use one-tailed hypothesis testing? Methods in Ecology and Evolution. doi: 10.1111/j.2041-210X.2010.00014.x
- Reynolds PS (2019). When power calculations won’t do: Fermi approximation of animal numbers. Lab Animal. doi: 10.1038/s41684-019-0370-2
- Bate S. How to decide your sample size when the power calculation is not straightforward. (Access Date: 02/08/2018). Available at: https://www.nc3rs.org.uk/news/how-decide-your-sample-size-when-power-calculation-not-straightforward
“The sample size calculation was based on postoperative pain numerical rating scale (NRS) scores after administration of buprenorphine (NRS AUC mean = 2.70; noninferiority limit = 0.54; standard deviation = 0.66) as the reference treatment…and also Glasgow Composite Pain Scale (GCPS) scores…using online software (Experimental design assistant; https://eda.nc3rs.org.uk/eda/login/auth). The power of the experiment was set to 80%. A total of 20 dogs per group were considered necessary.” 
“We selected a small sample size because the bioglass prototype was evaluated in vivo for the first time in the present study, and therefore, the initial intention was to gather basic evidence regarding the use of this biomaterial in more complex experimental designs.” 
- Bustamante R, Daza MA, Canfrán S, García P, Suárez M, Trobo I and Gómez de Segura IA (2018). Comparison of the postoperative analgesic effects of cimicoxib, buprenorphine and their combination in healthy dogs undergoing ovariohysterectomy. Veterinary Anaesthesia and Analgesia. doi: 10.1016/j.vaa.2018.01.003
- Spin JR, Oliveira GJPLd, Spin-Neto R, Pires JR, Tavares HS, Ykeda F and Marcantonio RAC (2015). Avaliação histomorfométrica da associação entre biovidro e osso bovino liofilizado no tratamento de defeitos ósseos críticos criados em calvárias de ratos. Estudo piloto. Revista de Odontologia da UNESP. http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1807-25772015000100037&nrm=iso