Thanks man. Just for further information: I'm following the methods sections of two similar systematic reviews (extracts below), so my thinking is based on how they've handled this same challenge (i.e. different research papers reporting efficacy in different statistical ways):
So one article giving me good data in an effect-size format for what I'm interested in would be this:
Then there's a selection of articles giving what I think is effect size data in a different form (McNemar), like this one:
And then there are less useful articles reporting efficacy in ways like this:
1 - Each study was examined for the question(s) which it addressed and relevant pre- and post-therapy data were extracted. We computed statistical significance for the pre- and post-treatment scores using the McNemar’s change test (p < 0.05; Siegel & Castellan, 1988). There were two primary reasons for performing the statistical computations. Some studies failed to report any statistical measure (e.g., Faroqi & Chengappa, 1996; Gil & Goral, 2004; Khamis, Venkert-Olenik, & Gil, 1996). A few other studies reported parametric statistical tests whose assumptions of normality and independence were not met by the study design and data. McNemar’s change test is a non-parametric test for paired nominal measures such as accuracy data, and has been used by several aphasiologists to compute statistical significance of treatment-induced changes in behavioral scores (Faroqi-Shah, 2008; Rochon, Laird, Bose, & Scofield, 2005). The use of a consistent statistical measure makes comparisons of statistical significance across studies more valid.
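Just to check my own understanding of extract 1, here's a minimal sketch of that McNemar computation in Python. The item-level naming data is made up by me (not from any of these papers), and I'm using statsmodels' mcnemar function on the paired 2x2 table, which I gather is the standard implementation for paired nominal data:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical item-level data for one participant's treated set:
# 1 = item named correctly, 0 = error (same items pre and post)
pre  = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
post = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# Paired 2x2 table: rows = pre outcome (1, 0), cols = post outcome (1, 0).
# The off-diagonal (discordant) cells are what McNemar's test evaluates.
both      = sum(1 for a, b in zip(pre, post) if a == 1 and b == 1)
pre_only  = sum(1 for a, b in zip(pre, post) if a == 1 and b == 0)
post_only = sum(1 for a, b in zip(pre, post) if a == 0 and b == 1)
neither   = sum(1 for a, b in zip(pre, post) if a == 0 and b == 0)
table = [[both, pre_only], [post_only, neither]]

# exact=True uses the binomial form, sensible for small item sets
result = mcnemar(table, exact=True)
print(f"McNemar p = {result.pvalue:.4f}")  # p < 0.05 -> significant pre/post change
```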
2 - As well as describing the treatment outcomes of included studies, the clinical efficacy of SFA was determined by calculating effect sizes. Effect sizes could be calculated only in those studies that reported sufficient data. To calculate, it was necessary to determine the individual values for the pretreatment and posttreatment phases for each set of trained items. Cohen’s d statistic was used to calculate effect size as described by Busk and Serlin (1992). The magnitude of change in performance was determined according to the benchmarks for lexical retrieval studies described by Beeson and Robey (2006). The benchmarks were 4.0, 7.0, and 10.1 for small, medium, and large effect sizes, respectively.

Where Cohen’s d could not be calculated, the percent of nonoverlapping data (PND) was calculated. PND is the most widely used method of calculating effect size in single case experimental designs (Gast, 2010; Schlosser, Lee, & Wendt, 2008). PND is the percentage of Phase B data points (the treatment phase) that do not overlap with Phase A data points (baseline or no treatment). To determine the magnitude of effect, benchmarks put forth by Scruggs, Mastropieri, and Casto (1987) were used. PND scores higher than 90% were considered to demonstrate a highly effective treatment, PND of 70%–90% were interpreted as a moderate treatment outcome, and PND scores of 50%–70% were considered a questionable effect. PND scores less than 50% were interpreted as an ineffective intervention because performance during intervention had not affected behavior beyond baseline performance.
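And for extract 2, here's my rough sketch of the Busk & Serlin d and PND calculations as I understand them. Again the probe scores are made up just to illustrate; I'm assuming higher scores = improvement, and using the baseline SD as the denominator, which is how I read Busk and Serlin (1992):

```python
import statistics

# Hypothetical per-probe scores (e.g., % correct on trained items per session)
baseline  = [10, 12, 11, 13, 10]      # Phase A: pre-treatment probes
treatment = [35, 42, 47, 55, 60, 58]  # Phase B: treatment-phase probes

# Busk & Serlin d: difference in phase means, scaled by the baseline SD
d = (statistics.mean(treatment) - statistics.mean(baseline)) / statistics.stdev(baseline)

# Beeson & Robey (2006) lexical retrieval benchmarks: 4.0 / 7.0 / 10.1
if d >= 10.1:
    magnitude = "large"
elif d >= 7.0:
    magnitude = "medium"
elif d >= 4.0:
    magnitude = "small"
else:
    magnitude = "below the small benchmark"

# PND: percentage of Phase B points that exceed the highest Phase A point
# (i.e., do not overlap with baseline; assumes improvement = higher scores)
pnd = 100 * sum(1 for x in treatment if x > max(baseline)) / len(treatment)

print(f"d = {d:.1f} ({magnitude}); PND = {pnd:.0f}%")
```

With these made-up numbers you'd get a large d and a PND of 100% (highly effective by the Scruggs et al. benchmarks), which at least lets me sanity-check that the two metrics point the same way when the data are clean.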