Randomized Trialomania

This story is old, but its age should not detract from its lessons.

It was 1982, and the place was Tsukuba, Ibaraki Prefecture, Japan. Workers at Fujisawa Pharmaceuticals began testing fermented broths of Streptomyces species retrieved from soil samples at the base of Mount Tsukuba. They were working to solve the remaining Achilles' heel of organ transplantation – effective suppression of the immune system that would prevent the body from attacking its new guest. It had quickly become apparent to the medical community that the key to long-term survival of patients now lay in the development of effective, non-toxic immunosuppressive agents.


After two years of testing, isolate no. 9993, which later came to be named FK506, or tacrolimus, showed promise in inhibiting lymphocyte reactions. The first reports emerged in the literature in 1987, and they were impressive. The agent appeared to suppress mixed lymphocyte cultures at concentrations 30 to 100 times lower than the gold standard at the time: cyclosporine. (1)

The father of organ transplantation, Thomas Starzl, was in Pittsburgh at the time, and quickly seized on the potential of this new agent. By 1990, he had used the drug successfully in patients who were rejecting their liver transplants on conventional cyclosporine-based immunosuppression. The positive results of the ‘rescue’ trial prompted initiation of a randomized controlled trial in Pittsburgh that compared cyclosporine to FK506 from the time of transplant.
At the time, the randomized controlled trial was in its relative infancy, and had not yet achieved the hallowed status it holds today. This, of course, was changing rapidly. Physicians recognized the fallacy of an epistemology sourced purely from intuition and tradition, and sought the shelter of certainty that randomized controlled trials (RCTs) promised through the random allocation of patients to treatment and control arms. The Pittsburgh team thus randomized 81 patients: 40 to cyclosporine – the conventional treatment – and 41 to FK506, the new kid on the block. Investigators studied patient mortality and survival of the transplanted organ at various time points. By convention, results were analyzed using statistical hypothesis testing – and to the lay person they would seem underwhelming.

At every time point studied after initiation of the drugs, patient and graft survival were better in the FK506 group, but the differences were not statistically significant. Explaining this seeming paradox of better but not statistically better requires understanding p-values. Hypothesis testing of this kind traditionally assumes normally distributed (Gaussian) populations – the familiar bell curve. A p-value can be thought of as the probability of observing a difference at least as large as the one seen between two groups if chance alone were at work. By convention, p-values less than .05, or 5%, are imparted the title of statistical significance.

The p-values in this case suggested that patient and graft survival were not significantly different, but the metric of patient and graft survival was always one doomed to fail, for it neglected the investigators' actions in the two arms. Patients who rejected their organs received augmented immunosuppression. In the case of those randomized to cyclosporine, 29/40 patients were rescued with FK506, while only 8/41 patients in the FK506 arm required additional immunosuppression. (2) The Pittsburgh group, not surprisingly, concluded that the superiority of FK506 was settled. Statistical purists, and the FDA, unfortunately, did not. The seemingly reasonable conclusions of Starzl's team were statistically unreasonable because the trial results ran afoul of another important statistical tenet – the intention-to-treat analysis.
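
To make the idea of a p-value concrete, consider the rescue counts from the Pittsburgh trial (29/40 cyclosporine patients vs. 8/41 FK506 patients). A permutation test asks how often a difference in rescue rates this large would arise if group labels were dealt out by pure chance. This is an illustrative sketch in plain Python, not the trial's actual analysis:

```python
import random

# Rescue counts from the Pittsburgh trial: 29 of 40 cyclosporine patients
# were rescued with FK506, vs. 8 of 41 FK506 patients needing augmentation.
outcomes = [1] * 29 + [0] * 11 + [1] * 8 + [0] * 33  # 1 = needed rescue
n_cyclo, n_fk506 = 40, 41

def rate_difference(data):
    """Rescue rate in the first 40 patients minus the rate in the last 41."""
    return sum(data[:n_cyclo]) / n_cyclo - sum(data[n_cyclo:]) / n_fk506

observed = rate_difference(outcomes)  # roughly 0.53

def permutation_p_value(data, observed_diff, trials=20_000, seed=1):
    """Shuffle the group labels repeatedly and count how often chance alone
    produces a rate difference at least as extreme as the one observed."""
    rng = random.Random(seed)
    data = data[:]
    extreme = 0
    for _ in range(trials):
        rng.shuffle(data)
        if abs(rate_difference(data)) >= abs(observed_diff):
            extreme += 1
    return extreme / trials

p = permutation_p_value(outcomes, observed)
```

The difference in rescue rates is so large that shuffled labels essentially never reproduce it, and the estimated p-value lands far below the conventional .05 threshold – the crossover pattern, unlike raw survival, was anything but chance.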

You see, a weakness of RCTs centers on patients who do not follow through and receive their assigned treatment for the duration of the trial. These are patients who get ‘lost’ after being assigned to one group or the other. A new drug may, for instance, cause kidney failure and force half the patients to stop taking it. Ignoring those ‘censored’ patients could lead to the misleading conclusion that the new drug was better than standard therapy. Accounting for this potential bias means using intention-to-treat (ITT) analysis – evaluating the results of an experiment based on the treatment assigned rather than the treatment eventually received. In this sterile world far removed from the trenches of clinical medicine, it mattered not that almost 75% of the patients randomized to conventional cyclosporine had to be rescued by crossing over to FK506 – it only mattered that those initially randomized to cyclosporine did not fare significantly differently from those initially randomized to FK506.
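
The distinction can be made concrete with a toy sketch – the patient records below are invented for illustration, not trial data – showing how the same outcomes look under intention-to-treat versus grouping by the drug actually received:

```python
from collections import defaultdict

# Hypothetical records: (assigned drug, drug actually received, graft survived).
# The first patient was rescued with FK506 after rejecting on cyclosporine.
patients = [
    ("cyclosporine", "FK506", True),          # crossed over, then did well
    ("cyclosporine", "cyclosporine", False),
    ("FK506", "FK506", True),
    ("FK506", "FK506", True),
]

def survival_by(records, key):
    """Graft survival rate grouped by assignment (key=0) or receipt (key=1)."""
    groups = defaultdict(list)
    for assigned, received, survived in records:
        groups[(assigned, received)[key]].append(survived)
    return {drug: sum(lived) / len(lived) for drug, lived in groups.items()}

itt = survival_by(patients, 0)         # intention-to-treat
as_treated = survival_by(patients, 1)  # by drug actually received
```

Under ITT, the crossover patient's success is credited to cyclosporine (survival 1/2); grouped by drug received, cyclosporine's survival is 0/1. The rescue drug's contribution is rendered invisible by the first analysis.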

So, an FDA advisory committee demanded that two multicenter randomized controlled trials be done – one in Europe and one in North America.

The trials were designed independently of those with the most experience on the matter – the Pittsburgh group. No doubt the intention was to replicate results without bias, but as a result, the historical record shows, both the design and the analysis of the trials suffered. The Pittsburgh protocol for FK506 administration was rejected in favor of a protocol mandating starting doses of FK506 50% higher than those being used successfully in Pittsburgh. In addition, adjustments in immunosuppression levels took days because samples had to be shipped to reference labs in distant cities. Starting doses of FK506 were ultimately revised, but not before a significant number of patients had been enrolled.

Not surprisingly, analysis of the North American trial published in the prestigious New England Journal of Medicine led to the flaccid conclusion that immunosuppressive regimens based on tacrolimus and cyclosporine were comparable in terms of patient and graft survival (3). Glossed over was the fact that 32 patients in the cyclosporine group had refractory rejection compared to only 6 patients in the FK506 group, and that, again, 22/32 of the cyclosporine rejections were switched to FK506. Reanalysis of the original database using an endpoint that included rejection confirmed the superiority of FK506 over cyclosporine.


Fig 1. Re-analysis of the Multi-center FK506 trials

The end result of all of this was that FK506 became the standard for immunosuppression after organ transplant. The FDA, in the name of patient safety, forced a four-year detour that did little to change what was known unequivocally to the Pittsburgh researchers in the spring of 1990. Sadly, the idolaters of empiricism cared little for the protestations of clinicians like Starzl charged with taking care of patients with transplants. They now mandated similar multicenter randomized trials for FK506, organ by organ.

The interests of sick patients were forgotten in all of this. The patients who entered these trials had no option other than to refuse participation in the experiment. If they refused, they continued on standard therapy. If they accepted, they had a coin flip's chance of receiving the therapy already regarded as superior because of the convincing single-center work done by Starzl. Starzl had been forced to jump through the same hoops ten years earlier, when cyclosporine had first been introduced. It was deja vu all over again. Despite the accumulated evidence from the transplant community at the time suggesting cyclosporine was superior to standard therapy, the FDA had demanded – as it did now – another trial of standard therapy vs. cyclosporine. Starzl wrote painfully of this initial experience with what he considered unethical human experimentation in his memoir, The Puzzle People (4):

No one who drew the long straw (actually a sealed envelope) ever asked for the conventional therapy.  Those who drew the short straw that meant consignment to the older therapy were angry.  They had lost their chance to obtain the drug that had brought many of them to Pittsburgh.  Their anger deepened when they began to see the actual results in cyclosporine treated patients with whom they shared the hospital ward.  Not only was graft survival better with cyclosporine, but the doses of prednisone needed were lower.  They came to understand that they were part of a human experiment comparing two methods of treatment in which the answer was already known.

The trial was disastrous. At the end of one year, primary graft survival was 90% in the cyclosporine arm and 50% in the control group. Starzl and his colleagues who had protested the ethics of the trial had been right. Try to imagine the despair of patients whose transplanted organs failed – some of whom died – let down by a clueless bureaucracy that was supposed to protect them.

Randomized trials are not sacrosanct – mandating them requires genuine uncertainty that the treatment arm is no better than the control arm. This, of course, requires the now much-maligned clinical judgement. There has never been an RCT of defibrillation for sudden cardiac arrest due to ventricular fibrillation, and there never will be. We can thank clinical judgement for that. Traumatized by being forced to oversee these trials in order to win approval for these new therapies, Starzl coined an apt moniker – randomized trialomania. (5)

It should come as little surprise that the current regulatory excesses grew from well-intentioned policy in the form of a bill benignly named the National Research Act. The bill, signed by Richard Nixon in 1974, created the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The commission, made up of 11 people, 3 of whom were physicians, generated the now-famous Belmont Report. The Belmont Report identified three principles: respect for persons, beneficence, and justice. The report spoke movingly of anecdotes through history of egregious misconduct with regard to medical research. Interestingly, one of the famous injustices noted was the Tuskegee syphilis experiment (1932–1972), conducted by the US Public Health Service, which withheld treatment for syphilis even after penicillin was validated in the 1940s as an effective cure for the disease.

Shortly after the widely acclaimed Belmont Report was released, the Secretary of the Department of Health, Education, and Welfare released regulations – failure to comply would result in loss of all federal funds. Institutions scrambled to comply, and overnight an enormous and complicated bureaucracy arose. Local Institutional Review Boards (IRBs) were mandated and given unprecedented autonomy to block proposed work or formulate their own experiments. Unfortunately, those who sat on IRBs sat in conference rooms far removed from the heat of clinical battle, and not infrequently had little role in patient care. It was easy to see why this group of relative laypeople would retreat to a place of comfort – the randomized controlled trial – when having to make decisions on topics they understood little about.

Proponents of randomized trials point to egregious examples where clinical judgement has led to harm. They would be right. There are many examples where conventional wisdom turned out not to be wisdom at all. Randomized controlled trials have certainly informed my clinical decision making in a meaningful way. But this can be, and is, taken too far by some. The famous CAST trial confirmed the suspicions of some that suppressing extra heart beats after a heart attack was not beneficial and could actually be harmful. Certainly, without the CAST trial, this very well could be common practice even now. But initiating the CAST trial required some level of clinical uncertainty about the practice. Expertise and clinical judgement should still have primacy. If I were asked to enroll a patient in a CAST-like trial today, I would be tweeting Sergio Pinsky and Andrea Natale first.

Many see these trials as antidotes to manipulation by the medical-pharmaceutical-industrial complex. Unfortunately, those concerned about the very real conflicts of interest inherent in industry or pharmaceutical funding ignore the conflicts of interest inherent in a frequently non-clinical academic class infected with trialomania. Professional careers were built on the multicenter transplant RCTs mandated by the FDA. Starzl was even pointedly told that the only path to publishing papers in prestigious journals like the NEJM now ran through these types of randomized controlled trials. To these academic purists and intellectuals, the currency that mattered was high-impact journal publishing – patients became secondary concerns.

Unfortunately, the lessons of this story have been lost. Trialomania continues unchecked and unabated. No new class of immunosuppressives has been developed since FK506 was discovered in 1982.

Everyone has conflicts of interest, and I would be remiss not to include mine. My daughter is 9 years old, loves sushi, and thinks it's funny when her dad looks like he has bunny ears. She is also lucky enough to have two birthdays – the day she was born, and the day she received her liver transplant. She dutifully takes FK506 every day. Her daddy loves her very much.



  1. Wallemacq, P. E., and R. Reding. 1993. “FK506 (tacrolimus), a Novel Immunosuppressant in Organ Transplantation: Clinical, Biomedical, and Analytical Aspects.” Clinical Chemistry 39 (11 Pt 1): 2219–28.
  2. Fung, J., K. Abu-Elmagd, A. Jain, R. Gordon, A. Tzakis, S. Todo, S. Takaya, M. Alessiani, A. Demetris, and O. Bronster. 1991. “A Randomized Trial of Primary Liver Transplantation under Immunosuppression with FK 506 vs Cyclosporine.” Transplantation Proceedings 23 (6): 2977–83.
  3. “A Comparison of Tacrolimus (FK 506) and Cyclosporine for Immunosuppression in Liver Transplantation. The U.S. Multicenter FK506 Liver Study Group.” 1994. The New England Journal of Medicine 331 (17): 1110–15.
  4. Starzl, T. E. 1992. The Puzzle People: Memoirs of a Transplant Surgeon. Pittsburgh: University of Pittsburgh Press, 231–42.
  5. Starzl, T. E., A. Donner, M. Eliasziw, L. Stitt, P. Meier, J. J. Fung, J. P. McMichael, and S. Todo. 1995. “Randomised Trialomania? The Multicentre Liver Transplant Trials of Tacrolimus.” The Lancet 346 (8986): 1346–50.


2 replies

  1. A p-value is a point estimate like any other empirical derivation, one with its own “distribution” upon repeated re-calculation. And, yes, its (limited) utility assumes Gaussian normality. More importantly, no serious practitioners of applied commercial stats use the p-value cut-offs that are still uncritically taught, they use stress-tested “expected value” calculations. (I worked for a number of years in subprime credit risk modeling and mgmt. When serious money is on the line, you don’t screw around with sophomoric p-value stuff. You check your distributional characteristics, and you stress-test for worst-mid-best case expected values.)

    “Naturally, there are many data that are skewed and are not symmetrical.”

    First, thank you for not saying “data IS.” Second, tests for “significance” of differences between sampled distributions of MEANS are actually pretty “robust” with respect to parent distributional abnormalities.

    You can easily test this. Assemble a data set of, say, sequential integers ranging from 1 to 10,000 (a “flat distribution,” as flat as they get). Use a random number generator to pick off samples of, say, 100 out of those data. Compute the mean. Lather, rinse, repeat. The “distribution” of those means (ranging around 5,000) will pass a “normality test” every time, notwithstanding that the source data distribution is flat as a pancake.

    I’ve done that in SAS. It works.

    BTW, Gauss, color me Chebychev.
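
    The experiment described above is easy to reproduce without SAS; a quick stand-in sketch in plain Python, using the same sample size and a flat population of 1 to 10,000:

    ```python
    import random
    import statistics

    # Flat "population": the integers 1..10,000 (mean 5000.5).
    population = list(range(1, 10_001))
    rng = random.Random(0)

    # Repeatedly sample 100 values and record each sample mean.
    sample_means = [statistics.mean(rng.sample(population, 100))
                    for _ in range(1_000)]

    # The central limit theorem says these means pile up around 5000.5
    # with a standard error near sigma/sqrt(100) ~ 289 – roughly
    # bell-shaped despite the pancake-flat parent distribution.
    center = statistics.mean(sample_means)
    spread = statistics.stdev(sample_means)
    ```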

  2. Very good story. Fun reading. Thank you.

    When one uses the P value, one has to be dealing with data that have what is called a Gaussian or normal distribution. This essentially means that the shape of the curve is symmetrical and that the mean, the mode, and the median are the same number.

    Naturally, there are many data that are skewed and are not symmetrical. This is true whenever outlier values cause lethal effects, for example, if they are extreme toward either high or low end.

    When the sources of variance are all independent and additive, then one can get a normal distribution and use the P value. Otherwise, it should not be used.