If John Carlisle had a cat flap, scientific fraudsters could sleep more easily at night. Carlisle routinely gets up at 4.30 a.m. to let out Wizard, the family pet. Then, unable to sleep, he opens his laptop and starts typing up data from published papers on clinical trials. Before his wife's alarm clock rings 90 minutes later, he has usually filled a spreadsheet with the ages, weights and heights of hundreds of people, some of whom, he suspects, never existed.
During the day, Carlisle is an anesthetist working for England's National Health Service in the coastal town of Torquay. But in his spare time, he combs the scientific record for suspicious data in clinical research. Over the past decade, his sleuthing has covered trials used to investigate a wide range of health issues, from the benefits of specific diets to guidelines for hospital treatment. It has led to hundreds of papers being retracted or corrected because of misconduct or mistakes. And it has helped end the careers of some large-scale fabricators: of the six scientists in the world with the most retractions, three were brought down using variants of Carlisle's data analyses.
"His technique has proven incredibly useful," says Paul Myles, director of anesthesia and perioperative medicine at Alfred Hospital in Melbourne, Australia, who worked with Carlisle to examine research articles that contain unreliable statistics. "He has used it to demonstrate some important examples of fraud."
Carlisle's statistical sleuthing is not popular with everyone. Critics argue that it has sometimes cast unjustified suspicion on articles that have no real defects.
But Carlisle believes he is helping to protect patients, so he continues to spend his free time scrutinizing other people's studies. "I do it because my curiosity drives me to," he says, not out of any overriding zeal to expose wrongdoing: "It is important not to become a crusader against misconduct."
Together with the work of other researchers who doggedly scrutinize academic papers, his efforts suggest that the gatekeepers of science (journals and institutions) could do much more to spot mistakes. In medical trials, the kind of studies Carlisle focuses on, errors can be a matter of life or death.
Anesthetists behaving badly
Torquay looks like any other genteel town in the English provinces, with fine floral displays on its roundabouts and just enough pastel-colored cottages to catch the eye. Carlisle has lived in the area for 18 years and works at the town's general hospital. In an empty operating room, shortly after a patient has been stitched up and wheeled out, he explains how he started looking for false data in medical research.
More than ten years ago, Carlisle and other anesthetists began talking about the results published by a Japanese researcher, Yoshitaka Fujii. In a series of randomized controlled trials (RCTs), Fujii, who worked at Toho University in Tokyo, claimed to have examined the impact of various drugs on preventing vomiting and nausea in patients after surgery. But the data looked too clean to be true. Carlisle, one of many interested parties, decided to check the figures, using statistical tests to spot unlikely patterns in the data. In 2012 he showed that, in many cases, the likelihood that the patterns had arisen by chance was "infinitesimally small"1. Prompted by this analysis, journal editors asked Fujii's current and former universities to investigate; Fujii was fired from Toho University in 2012 and has had 183 of his papers retracted, an all-time record. Four years later, Carlisle co-published an analysis of the results of another Japanese anesthetist, Yuhji Saitoh, a frequent co-author of Fujii's, showing that his data were also extremely suspicious2. Saitoh currently has 53 retractions.
Other researchers soon cited Carlisle's work in their own analyses, which used variants of his approach. In 2016, for example, researchers in New Zealand and the United Kingdom reported problems in papers by Yoshihiro Sato, a bone researcher at a hospital in southern Japan3. That work ultimately led to 27 retractions; in total, 66 of Sato's articles have been retracted.
Anesthesia had been shaken by several fraud scandals before the Fujii and Saitoh cases, including that of the German anesthetist Joachim Boldt, who has had more than 90 articles retracted. But Carlisle began to wonder whether his own field was the only guilty one. So he picked eight leading journals and, working in his spare time, reviewed thousands of randomized trials they had published.
In 2017, he published an analysis in the journal Anesthesia indicating that he had found suspicious data in 90 of more than 5,000 trials published over 16 years4. At least ten of these papers have since been retracted and six corrected, including a high-profile study published in the New England Journal of Medicine (NEJM) on the health benefits of the Mediterranean diet. In that case, however, there was no suggestion of fraud: the authors had made a mistake in how they randomized participants. After the authors removed the erroneous data, the article was republished with similar conclusions5.
Carlisle has kept going. This year, he warned that dozens of anesthesia studies by an Italian surgeon, Mario Schietroma of the University of L’Aquila in central Italy, were not a reliable basis for clinical practice6. Myles, who worked on the report with Carlisle, had raised the alarm the year before after spotting suspicious similarities in the raw data for the control and patient groups in five of Schietroma's papers.
The challenges to Schietroma's claims have had an impact on hospitals around the world. The World Health Organization (WHO) cited Schietroma's work when, in 2016, it issued a recommendation that anesthetists should routinely increase the oxygen levels they give patients during and after surgery, to help reduce infection. That was a controversial call: anesthetists know that in some procedures too much oxygen can be associated with an increased risk of complications, and the recommendation would have meant that hospitals in poorer countries spent more of their budgets on expensive bottled oxygen, says Myles.
The five papers that Myles flagged were quickly retracted, and the WHO downgraded its recommendation from "strong" to "conditional", meaning that doctors have more freedom to make different decisions for individual patients. Schietroma says that his calculations were checked by an independent statistician and by peer reviewers, and that he deliberately selected similar groups of patients, so it is not surprising that the data match. He also says that he lost raw data and documents related to the trials when L'Aquila was struck by an earthquake in 2009. A university spokesperson says inquiries have been referred to "the competent investigative bodies", but did not explain which bodies those were or whether any investigation was under way.
Spotting unnatural data
The essence of Carlisle's approach is nothing new, he says: real-life data have natural patterns that fabricated data struggle to replicate. Such phenomena were spotted in the 1880s, popularized by the American electrical engineer and physicist Frank Benford in 1938, and have been used by many statistical checkers since. Political scientists, for example, have long used a similar approach to analyse survey data, a technique they call the Stouffer method after the sociologist Samuel Stouffer, who popularized it in the 1950s.
In the case of RCTs, Carlisle looks at the baseline measurements that describe the characteristics of the groups of volunteers in a trial, typically a control group and an intervention group. These include height, weight and relevant physiological characteristics, usually reported in the first table of a paper.
In a genuine RCT, volunteers are randomly assigned to the control group or to one or more intervention groups. As a result, the mean and standard deviation of each characteristic should be roughly similar across the groups, but not too similar: that would be suspiciously perfect.
Carlisle first constructs a P value for each baseline comparison: a statistical measure of how likely the reported data points are, assuming that the volunteers really were randomly assigned to each group. He then pools all these P values to get a sense of how random the measurements are overall. A combined P value that seems too high suggests the data are suspiciously well balanced; one that is too low could indicate that patients were randomized incorrectly.
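The idea can be sketched in a few lines of Python. This is not Carlisle's actual code: it is a minimal illustration that assumes only the published summary statistics (mean, standard deviation, group size) for each baseline variable, uses a normal approximation for each two-group comparison, and pools the P values with Stouffer's method. The baseline table below is invented.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def baseline_p(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided P value for a difference in group means,
    using a normal approximation (reasonable for large groups)."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)
    z = (mean1 - mean2) / se
    return 2 * (1 - norm.cdf(abs(z)))

def stouffer_pool(p_values):
    """Pool P values with Stouffer's method. A pooled value near 1
    means the groups are suspiciously well balanced; near 0, the
    opposite."""
    z_sum = sum(norm.inv_cdf(p) for p in p_values)
    return norm.cdf(z_sum / sqrt(len(p_values)))

# Invented baseline table: (mean, sd, n) for control vs intervention
table1 = {
    "age":    ((40.1, 10.0, 50), (40.2, 10.0, 50)),
    "weight": ((70.0, 12.0, 50), (70.1, 12.0, 50)),
    "height": ((170.0, 8.0, 50), (170.05, 8.0, 50)),
}

p_values = [baseline_p(*ctrl, *interv) for ctrl, interv in table1.values()]
combined = stouffer_pool(p_values)
print(f"combined P = {combined:.4f}")  # close to 1: suspiciously balanced
```

With these made-up numbers, every baseline variable matches almost perfectly between groups, so each comparison yields a P value near 1 and the pooled value ends up close to 1, exactly the "too well balanced" signature described above.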
The method is not infallible. The statistical checks require the variables in the table to be truly independent, whereas in reality they often are not (height and weight are correlated, for example). In practice, this means that some papers flagged as problematic actually are not, which is why some statisticians have criticized Carlisle's work.
But Carlisle says that applying his method is a good first step, and that it can highlight studies that deserve a closer look, such as requesting the individual patient data behind a paper.
"You can put up a red flag, or an amber flag, or five or ten red flags to say it's very unlikely to be real data," says Myles.
Errors versus misconduct
Carlisle says he is careful not to attribute a cause to any of the problems he identifies. In 2017, however, when Carlisle's analysis of 5,000 trials appeared in Anesthesia (of which he is an editor), an accompanying editorial by anesthetists John Loadsman and Tim McCulloch of the University of Sydney, Australia, took a more provocative line7.
It spoke of "dishonest authors" and "wrongdoers", and suggested that "more authors of already published RCTs will eventually be getting a tap on the shoulder". It also said: "A strong argument could be made that every journal in the world should now apply the Carlisle method to all the RCTs they have published."
This provoked a strongly worded response from the editors of another journal, Anesthesiology, which had published 12 of the articles Carlisle flagged as problematic. "Carlisle's article is ethically questionable and detrimental to the authors of the previously published articles it flagged," wrote the journal's editor-in-chief, Evan Kharasch, an anesthesiologist at Duke University in Durham, North Carolina8. His editorial, co-written with anesthesiologist Timothy Houle of Massachusetts General Hospital in Boston, the statistical consultant for Anesthesiology, highlighted problems such as the method's potential to flag false positives. "A valid method to detect fabrication and falsification (analogous to plagiarism-checking software) would be welcome. The Carlisle method is not that," they wrote in correspondence to Anesthesia9.
In May, Anesthesiology corrected one of the papers that Carlisle had flagged, noting that it had reported "systematically incorrect" P values in two tables, and that the authors had lost the original data and could not recalculate the values. Kharasch, however, says he stands by the views in his editorial. Carlisle says that the Loadsman and McCulloch editorial was "reasonable" and that the criticism of his work does not undermine its value. "I feel comfortable thinking the effort is worthwhile, while others don't," he says.
The data checkers
Carlisle's is not the only method to have emerged in recent years for double-checking published data.
Michèle Nuijten, who studies analytical methods at Tilburg University in the Netherlands, has developed what she calls a "spellchecker for statistics", which can scan journal articles to check whether the statistics they describe are internally consistent. Called statcheck, it verifies, for example, that the P values reported in a results section match their accompanying test statistics. It has been used to flag errors, usually numerical typos, in journal articles dating back decades.
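The core idea behind such a checker can be illustrated in a few lines. The sketch below is not statcheck itself (which parses t, F, χ² and correlation statistics in APA format); it simply recomputes the two-sided P value implied by a reported z statistic and compares it with the P value printed alongside it. The tolerance and the example numbers are invented.

```python
from statistics import NormalDist

def z_report_consistent(z, reported_p, tol=0.005):
    """Recompute the two-sided P value implied by a z statistic and
    check it against the P value reported in the paper."""
    recomputed = 2 * (1 - NormalDist().cdf(abs(z)))
    return abs(recomputed - reported_p) <= tol

print(z_report_consistent(2.10, 0.036))  # consistent with the statistic
print(z_report_consistent(2.10, 0.36))   # inconsistent: likely a typo
```

Run over a whole article, checks like this catch the "numerical typos" mentioned above, such as a decimal point slipping one place.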
And Nick Brown, a graduate student in psychology at the University of Groningen, also in the Netherlands, and James Heathers, who studies scientific methods at Northeastern University in Boston, Massachusetts, have used a program called GRIM to verify the calculation of statistical means, as another way to flag suspicious data.
Neither technique would work on articles describing RCTs, such as the studies Carlisle has evaluated. Statcheck relies on the strict data-reporting format used by the American Psychological Association. GRIM works only when the data are integers, such as the discrete values generated in psychology questionnaires where responses are scored from 1 to 5.
There is growing interest in these kinds of checks, says John Ioannidis of Stanford University in California, who studies scientific methods and advocates better use of statistics to improve reproducibility in science. "They are wonderful and very ingenious tools." But he warns against jumping to conclusions about the reasons for the problems they find. "It is a completely different story if we are talking about fraud rather than a typo," he says.
Brown, Nuijten and Carlisle agree that their tools can do no more than highlight problems that merit investigation. "I really don't want to associate statcheck with fraud," says Nuijten. The real value of such tools, says Ioannidis, will lie in screening papers for problematic data before they are published, preventing fraud or errors from entering the literature in the first place.
Carlisle says that a growing number of journal editors have contacted him about using his technique in this way. At present, most such screening happens unofficially and on an ad hoc basis, and only when editors are already suspicious.
At least two journals have gone further and now use statistical checks as part of the publication process for all submissions. Carlisle's own journal, Anesthesia, uses them routinely, as do the editors of the NEJM. "We are looking to prevent a rare but potentially damaging negative event," says a spokesperson for the NEJM. "It's worth the extra time and expense."
Carlisle says he is impressed that a journal with the stature of the NEJM has introduced such checks, which he knows first-hand are laborious, time-consuming and not universally popular. Automation would be needed, though, to apply them at the scale required to screen even a fraction of the roughly two million articles published worldwide each year, he says. He thinks it could be done. Statcheck works this way, and is being used routinely by several psychology journals to screen submissions, says Nuijten. And text-mining techniques have allowed researchers to assess, for example, P values across thousands of articles as a way of investigating P-hacking, in which data are tweaked until they produce significant P values.
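A minimal version of that kind of text mining needs little more than a regular expression. The snippet below, an invented illustration rather than any published pipeline, pulls exactly reported P values out of article text and counts those falling just under the conventional 0.05 threshold, one crude signal that P-hacking analyses look for. The example text is made up.

```python
import re

text = """
Primary outcome: t(34) = 2.11, p = 0.042.
Secondary outcome: p = 0.049; exploratory analysis: p = 0.38.
"""

# Extract exactly reported P values (ignores inequalities like "p < 0.05")
p_values = [float(m) for m in re.findall(r"p\s*=\s*(0?\.\d+)", text)]
just_under_threshold = [p for p in p_values if 0.04 <= p < 0.05]

print(p_values)              # [0.042, 0.049, 0.38]
print(just_under_threshold)  # [0.042, 0.049]
```

A real study would of course aggregate such counts over thousands of papers and compare the distribution against what honest reporting would predict; this only shows the extraction step.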
One problem, several researchers in the field say, is that funders, journals and many in the scientific community give such checks relatively low priority. "It is not a very rewarding kind of work," says Nuijten. "You are essentially trying to find fault with other people's work, and that is not something that makes you very popular."
Even showing that a study is fraudulent does not always end the matter. In 2012, researchers in South Korea submitted to Anesthesia and Analgesia a report of a trial examining how facial muscle tone could indicate the best time to insert breathing tubes into the throat. Asked unofficially to take a look, Carlisle found discrepancies between the patient data and the summary statistics, and the paper was rejected.
Remarkably, it was then submitted to Carlisle's own journal with different patient data, but Carlisle recognized the paper. It was rejected again, and the editors of both journals contacted the authors and their institutions with their concerns. To Carlisle's astonishment, a few months later the paper, unchanged from the previous version, was published in the European Journal of Anaesthesiology. After Carlisle shared the paper's dubious history with that journal's editor, it was retracted in 2017 because of "irregularities in their data, including misrepresentation of results"10.
After seeing so many cases of fraud, alongside typos and honest mistakes, Carlisle has developed his own theory of what drives some researchers to fabricate data. "They think that random chance has, on this occasion, got in the way of the truth of how they know the Universe really works," he says. "So they change the result to what they think it should have been."
As Carlisle has shown, it sometimes takes an equally determined data checker to detect the deception.