Self-citation and corruption: cross-sectional, cross-country study

# Background
Self-citation appears to be widely prevalent. However, the structural drivers of self-citation are poorly understood. 

# Methods
Data for this study were obtained from a recently published study of Scopus data aggregated across all authors with \>5 publications, across all scientific fields, which yielded aggregate, country-level data on the mean co-author self-citation rate for the period 1960-2018. These data were merged with 2018 data from Transparency International on corruption, and additional data extracted from the World Development Indicators. The country-level association between the self-citation rate and the corruption index was estimated using multivariable linear regression.

# Results
Across 178 countries, the correlation between the mean self-citation rate and the corruption index was -0.52 (95% confidence interval, CI=-0.62 to -0.41). Among the 49 countries in the lowest quartile of the corruption index, the mean self-citation rate was 0.24 (standard deviation, SD=0.06). Among the 44 countries in the highest quartile of the corruption index, the mean self-citation rate was 0.21 (SD=0.05). In a weighted linear regression model with robust estimates of variance, the corruption index had a statistically significant association with the mean self-citation rate (2nd quartile compared with 1st quartile: *b*=-0.08 (95% CI=-0.17 to -0.01); 3rd quartile: *b*=-0.11 (95% CI=-0.19 to -0.02); 4th quartile: *b*=-0.10 (95% CI=-0.19 to -0.01); N=165). The implied effect size was large in magnitude and robust to potential confounding by unmeasured covariates.

# Conclusions
In this cross-sectional, cross-country analysis, there was a strong correlation between a country’s overall level of corruption and the mean self-citation rate. The estimated association was statistically significant, large in magnitude, and unlikely to be explained away by unmeasured confounding. Better understanding of how corruption norms evolve is likely to be critical in addressing the problem of extreme self-citation and other forms of citation manipulation.


Self-citation, which occurs when the authors of a published journal article cite a previously published journal article in which any of their names appear as authors, 1 is widely employed by researchers to disseminate their findings and influence the trajectory of the literature. 2,3 This behavior is not inappropriate by definition, given that it can also reflect genuine acknowledgment of scientific influence and priority. 4,5 Further, while this behavior is most often discussed in reference to individual authors, 6 it can also be a characteristic of journals and journal editors (i.e., to inflate journal impact factors) 7,8 and institutions. 9 One type of self-citation behavior, extreme self-citation, 10 which occurs when an inordinately large proportion of an author's total citation count is derived from self-citation behavior, has also been described in the literature. No specific threshold proportion (i.e., of self-citations relative to total citations) has been identified, but Ioannidis et al. 10 identified more than 250 researchers for whom at least half of their total citations were derived from self-citation. This practice is a problematic behavior in academic research, as it is a form of citation manipulation that can distort decisions about hiring, promotions, and research funding. There is debate in the literature about the extent to which men are more likely to engage in self-citation compared with women, 11-13 but other structural drivers of self-citation are poorly understood.

DATA SOURCES
Data for this study were obtained from Baas et al. 14 In brief, Scopus data were aggregated across all authors with >5 publications, across all scientific fields, to calculate the mean co-author self-citation rate for 1960-2018. 10,14 These data were merged with: 2018 data from Transparency International, which calculates an annual country-level Corruption Perceptions Index, a composite measure based on data from a variety of sources intended to measure "the overall extent of corruption (frequency and/or size of bribes) in the public or political sectors"; 15 and 2018 data on per capita gross domestic product and the total number of scientific and engineering articles, both extracted from the World Development Indicators. 16

STATISTICAL ANALYSIS
First, I estimated the correlation, at the country level, between the mean self-citation rate and the Corruption Perceptions Index. I also estimated the mean self-citation rate in each quartile of the Corruption Perceptions Index. Then I used linear regression with robust estimates of variance to estimate the association between the two variables, specifying the mean self-citation rate as the dependent variable and quartiles of the Corruption Perceptions Index as the primary explanatory variables of interest, adjusting for per capita gross domestic product and the total scientific publication output. Observations were weighted by the total number of authors, computed by Baas et al. 14 I conducted several sensitivity analyses. First, I specified the median rather than the mean self-citation rate as the dependent variable. Second, I used the e-value to estimate the degree of unmeasured confounding that would be needed to completely explain the observed association. 17 Third, I used the method proposed by Oster 18 to estimate how strong selection on unmeasured variables would need to be, relative to selection on measured variables (per capita gross domestic product and total scientific publication output), to completely explain the observed association.
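The regression described above can be illustrated with a minimal sketch. It is not the author's code: the data are synthetic, the variable names (`cpi`, `gdp_pc`, `articles`, `n_authors`) are hypothetical, and the quartile cut-point convention, covariate scaling, and the HC1-style small-sample correction are all assumptions, since the text specifies only "robust estimates of variance" and weighting by the number of authors.

```python
import numpy as np


def quartile_dummies(index):
    """Indicator columns for quartiles 2-4 of `index`; quartile 1 is the reference."""
    index = np.asarray(index, dtype=float)
    cuts = np.quantile(index, [0.25, 0.50, 0.75])
    # 0 = lowest quartile ... 3 = highest; values equal to a cut fall in the lower group
    quartile = np.searchsorted(cuts, index, side="left")
    return np.column_stack([(quartile == k).astype(float) for k in (1, 2, 3)])


def wls_robust(y, X, weights):
    """Weighted least squares with HC1-style robust standard errors.

    An intercept column is prepended to X.
    Returns (coefficients, standard errors).
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    n, k = X.shape
    bread = np.linalg.inv(X.T @ (w[:, None] * X))   # (X'WX)^-1
    beta = bread @ (X.T @ (w * y))                  # weighted least-squares estimate
    resid = y - X @ beta
    # Sandwich "meat": sum over observations of (w_i * e_i)^2 * x_i x_i'
    meat = X.T @ (((w * resid) ** 2)[:, None] * X)
    cov = bread @ meat @ bread * n / (n - k)        # HC1 degrees-of-freedom scaling
    return beta, np.sqrt(np.diag(cov))


# Synthetic stand-in data (NOT the study data): 165 hypothetical countries.
rng = np.random.default_rng(0)
n = 165
cpi = rng.uniform(10, 90, n)               # hypothetical index values
gdp_pc = rng.lognormal(2.5, 1.0, n)        # hypothetical, thousands of dollars
articles = rng.lognormal(1.0, 1.5, n)      # hypothetical, thousands of articles
n_authors = rng.integers(50, 50_000, n)    # hypothetical regression weights

q = quartile_dummies(cpi)
self_cite = 0.30 - q @ np.array([0.08, 0.11, 0.10]) + rng.normal(0, 0.05, n)

X = np.column_stack([q, gdp_pc, articles])
beta, se = wls_robust(self_cite, X, n_authors)
names = ["const", "Q2", "Q3", "Q4", "gdp_pc", "articles"]
for name, b, s in zip(names, beta, se):
    print(f"{name:>9}: b={b:+.3f}  95% CI=({b - 1.96 * s:+.3f}, {b + 1.96 * s:+.3f})")
```

The quartile coefficients are read relative to the lowest quartile, which mirrors the "2nd quartile compared with 1st quartile" presentation used in the paper. The intervals here use a normal approximation; statistical packages typically use a t reference distribution instead.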

RESULTS
Specifying the median self-citation rate as the dependent variable did not substantively shift the estimated associations (2nd quartile compared with 1st quartile: b=-0.08 (95% CI=-0.18 to 0.01); 3rd quartile: b=-0.12 (95% CI=-0.21 to -0.03); 4th quartile: b=-0.11 (95% CI=-0.20 to -0.01); N=165). The e-value associated with the highest quartile of the Corruption Perceptions Index was 7.58 for the point estimate and 5.13 for the confidence interval, indicating that an unmeasured confounding variable would need to have a very strong association with both corruption and self-citation (greater than 5 on the risk ratio scale) in order to shift the confidence interval of the estimated association to include zero. The R-squared from the regression model with all covariates was 0.60, so I assumed a maximum R-squared value of 0.60 × 1.3 = 0.78 in applying the procedures described by Oster. 18 I calculated a delta of 7.23, indicating that the model for the association between corruption and self-citation is fairly robust: selection on unmeasured variables would need to be more than 7 times as important as selection on the measured variables to generate an estimated regression coefficient equal to zero.
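For readers unfamiliar with these sensitivity measures, the arithmetic can be sketched as follows. The e-value function uses the standard formula for a risk ratio, E = RR + sqrt(RR × (RR − 1)). The paper does not state how the regression coefficient was converted to the risk ratio scale, so the conversion from a standardized mean difference shown here (RR ≈ exp(0.91 × d)) is an assumption based on a commonly used approximation. The function names are hypothetical.

```python
import math


def e_value(rr):
    """E-value for a risk ratio: RR + sqrt(RR * (RR - 1)), after inverting RR < 1."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    rr = max(rr, 1.0 / rr)  # protective associations are inverted first
    return rr + math.sqrt(rr * (rr - 1.0))


def rr_from_e_value(e):
    """Invert the E-value formula: the risk ratio (>= 1) that yields E-value `e`."""
    if e < 1:
        raise ValueError("E-values are at least 1")
    return e * e / (2.0 * e - 1.0)


def rr_from_standardized_difference(d):
    """Approximate risk ratio for a standardized mean difference d (assumed conversion)."""
    return math.exp(0.91 * abs(d))


def oster_r_max(r_squared, multiplier=1.3):
    """Assumed maximum R-squared: multiplier times the full-model R-squared, capped at 1."""
    return min(1.0, multiplier * r_squared)


# Risk ratios implied by the reported e-values (algebra only, not values from the paper)
print(round(rr_from_e_value(7.58), 2), round(rr_from_e_value(5.13), 2))
# Maximum R-squared used for the Oster procedure
print(round(oster_r_max(0.60), 2))
```

Under this formula, the reported e-values of 7.58 and 5.13 correspond to approximate risk ratios of about 4.06 and 2.84. These figures follow from inverting the formula and are not reported in the paper.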

DISCUSSION
In this cross-sectional, cross-country analysis of data from 178 countries, I estimated a strong correlation between the country's overall level of corruption and the mean self-citation rate. The estimated association was statistically significant, large in magnitude, and unlikely to be explained away by unmeasured confounding. My findings are consistent with prior work from Italy showing that self-citation behavior can be shifted dramatically in response to incentives. 19,20 What my analysis adds is an assessment of country-level norms in explaining a behavior that is widely understood to be a form of citation manipulation. In this regard, my findings are consistent with those of Fisman and Miguel, 21 who found that New York City-based foreign diplomats from high-corruption countries were more likely than diplomats from low-corruption countries to accumulate unpaid parking tickets.
Interpretation of my findings is subject to several important limitations. First, unlike the study by Fisman and Miguel, 21 the data are ecological in nature and therefore potentially subject to the ecological fallacy: it would be an overreach to conclude that individuals from more corrupt countries are more likely to engage in self-citation behavior, or to conclude that individuals who are more corrupt are more likely to engage in self-citation behavior. Second, the ecological variables used in this analysis were aggregate (derived) measures, and no covariate adjustment for individual-level variables was used. 22 Third, the estimated association between corruption and self-citation could potentially be confounded by unmeasured variables. Fourth, and relatedly, the self-citation measures were based on data from Scopus, 10,14 which is known to have significantly broader coverage of journals and research output compared with other leading scholarly databases such as Web of Science. 23 If the coverage is over-inclusive of lower-quality research output characterized by a greater rate of self-citation, and is also over-inclusive of research output from countries with a higher degree of corruption, the estimated association between corruption and self-citation could be biased away from the null. However, the sensitivity analyses indicate that only very strong confounding could completely explain the estimated associations.

CONCLUSIONS
The limitations notwithstanding, my analysis shows that, at the country level, corruption is strongly associated with self-citation. Better understanding of how corruption norms evolve is likely to be critical in addressing the problem of extreme self-citation and other forms of citation manipulation.

ACKNOWLEDGMENTS
I thank Jeroen Baas (Director, Funding Content & Analytics, Elsevier Research Intelligence) for providing the data on self-citation.

FUNDING
None.

COMPETING INTERESTS
The author completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available upon request from the corresponding author) and declares no conflicts of interest.