Are Countries Manipulating COVID-19 Data?

Testing Benford's Law on COVID-19 Data Reported by Countries

Are COVID-19 numbers manipulated? I test the validity of daily COVID-19 cases reported worldwide using Benford's Law. Since the pandemic took global centre stage, accusations of data manipulation have surged. Independent media agencies questioned country-level data, and all of us drew our own conclusions about whether the data was correct.

Formally, Benford's Law "states that in many naturally occurring collections of numbers, the leading digit is likely to be small", according to Wikipedia. The probability of each leading digit is modelled using Benford's distribution, with the following probability mass function.
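For reference, the first-digit form of Benford's distribution is usually written as below, where d is the leading digit. The notation is mine and not taken from the original figure.

```latex
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d \in \{1, 2, \dots, 9\}
```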

The probability of each leading digit works out as follows.
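One quick way to reproduce these probabilities is to evaluate log10(1 + 1/d) for each digit from 1 to 9. The snippet below is a minimal base-R sketch. The variable names are my own and are not part of the analysis code further down.

```r
# Expected first-digit probabilities under Benford's Law, in percent
digits <- 1:9
benford_pct <- 100 * log10(1 + 1 / digits)
names(benford_pct) <- digits

# Rounded to one decimal place for display
print(round(benford_pct, 1))
#    1    2    3    4    5    6    7    8    9
# 30.1 17.6 12.5  9.7  7.9  6.7  5.8  5.1  4.6
```

A leading 1 is expected about 30% of the time, while a leading 9 is expected in fewer than 5% of values.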

The law is so widely applicable that the Income Tax Department uses it to detect fraud, courts have admitted it as evidence, regulators analyse prices with it to spot cartel-like behaviour, and forensic analysts use it to identify deepfakes and doctored videos, among other uses. In our case, it is applied to COVID-19 data reported by countries. The Netflix series "Connected" has an episode, "Digits", on Benford's Law. It is brilliant and you should watch it.


My approach is simple. Using the COVID-19 data available at Our World in Data (Johns Hopkins University), I analysed each country's daily cases in R and found the first-digit distribution using the benford package. Then, I measured how much the observed proportions differed from the expected proportions as Root-Mean-Square Error (RMSE). A lower RMSE value means closer agreement with Benford's Law, which I read as more accurate data reporting. As you can see from the figure below, most countries reported their COVID-19 data correctly, including China. A high-res version of the map-plot can be found here.
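To make the RMSE step concrete, here is a minimal base-R sketch of the same idea without the benford package: extract each value's leading digit, tabulate the observed percentages, and compare them with Benford's expected percentages. The helper names (first_digit, benford_rmse) and the toy series are hypothetical. Measuring the error in percentage points is my assumption. The analysis code later in the post reads these columns from the package's output instead.

```r
# Leading digit of each count; missing values and zeros are dropped
# (hypothetical helper, assumes whole-number case counts)
first_digit <- function(x) {
  x <- abs(x[!is.na(x)])
  x <- x[x >= 1]  # zeros have no leading digit
  as.integer(substr(as.character(as.integer(x)), 1, 1))
}

# RMSE between observed and Benford first-digit percentages
# (hypothetical helper, error measured in percentage points)
benford_rmse <- function(x) {
  d <- first_digit(x)
  observed <- 100 * tabulate(d, nbins = 9) / length(d)
  expected <- 100 * log10(1 + 1 / (1:9))
  sqrt(mean((observed - expected)^2))
}

# Toy series: powers of 2 are known to follow Benford's Law closely
toy_cases <- 2^(0:29)
benford_rmse(toy_cases)
```

A series stuck on a single leading digit, such as rep(500, 30), scores far higher than the powers of two, which is the kind of gap the country ranking relies on.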

The countries with little evidence of manipulation (RMSE less than 5) are the following.

Botswana, Republic of Congo, Comoros, Dominica, Eritrea, Federated States of Micronesia, Equatorial Guinea, Grenada, Laos, Saint Lucia, Marshall Islands, Malawi, Nicaragua, Solomon Islands, United Republic of Tanzania, Vatican, Vanuatu, Samoa.

The countries with evidence of manipulation (RMSE more than 15) are the following.

Albania, Argentina, Bangladesh, Bahrain, Belarus, Bolivia, Brazil, Chile, Colombia, Costa Rica, Cuba, Egypt, Ethiopia, Iran, Iraq, Italy, Japan, Kuwait, Sri Lanka, Mexico, Macedonia, Malta, Netherlands, Philippines, Poland, Portugal, Paraguay, Qatar, Russia, Saudi Arabia, Senegal, Syria, Tajikistan, Tunisia, Turkey, Uzbekistan, Venezuela.

The countries with very strong evidence of manipulation (RMSE more than 20) are the following.

Belarus, Chile, Colombia, Egypt, Iran, Kuwait, Qatar, Russia, Tajikistan, Turkey, Venezuela.

If you are curious about a specific country, I made a simple Shiny app that plots each country's Benford distribution. Check it out below or here.

Here is the complete list of countries and their RMSE values. Following that, I have included the R code I used for this analysis and for generating the plots.

Of course, much more detailed analysis is required to conclude anything confidently. Benford's Law can give misleading conclusions, as it did when applied to the 2020 US elections, and that might well be the case here. This is only a first-level analysis. Beyond that, expert eyes are required to judge how fair a test Benford's Law is in this case.


R Code



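# Packages: dplyr for %>%, filter() and select(); purrr for discard(); ggplot2, sf and rnaturalearth for the map
library(dplyr)
library(purrr)
library(ggplot2)
library(sf)
library(rnaturalearth)
# benford() comes from the benford package mentioned in the article; load that package here as well

dat = read.csv("owid-covid-data.csv")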

# data snapshot on 17th Feb, 2021 at 7 pm IST from

# check_benford() takes a country name and plots its first-digit distribution against Benford's.
# The benford package does provide a plot, but I did not find it very informative, so I made my own.

check_benford = function(country) {
  # Daily new cases for the selected country, with missing values dropped
  daily_cases = dat %>% filter(location == country) %>% select(new_cases) %>% unlist() %>% as.numeric() %>% na.omit()

  bn = benford(daily_cases)

  # Columns used below: 1 = digit, 2 = actual frequency, 3 = expected frequency
  mat = bn[[2]]

  # Expected distribution in black
  plot(mat[,1], mat[,3], ylim = c(0, max(mat[,2], mat[,3])+5), type = "b", pch = 20, xlab = "Digit", ylab = "Frequency", main = paste0("Benford Distribution for ", country))

  # Actual distribution in red
  lines(mat[,1], mat[,2], col = "red")
  points(mat[,1], mat[,2], col = "red", pch = 20)

  legend("topright", c("Expected", "Actual"), fill = c("black", "red"))
}



# Calculating Benford RMSE

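# Pre-fill with NA so that countries skipped by `next` still keep their slot
rmse = rep(NA_real_, length(unique(dat$location)))

for (i in 1:length(unique(dat$location))) {
  country = unique(dat$location)[i]

  # Daily new cases with missing values dropped, as in check_benford()
  daily_cases = dat %>% filter(location == country) %>% select(new_cases) %>% unlist() %>% as.numeric() %>% discard(is.na)

  # Nothing to analyse for this location
  if (length(daily_cases) == 0) next

  bn = benford(daily_cases)
  mat = bn[[2]]
  actual = mat[,2]
  expected = mat[,3]

  rmse[i] = sqrt(mean((actual - expected)^2, na.rm = TRUE))
}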

world1 = data.frame(location = unique(dat$location), iso = unique(dat$iso_code), rmse = rmse)

##### Map plot




world = ne_countries(scale = "medium", returnclass = "sf")

world1 = merge(world,world1,by.x = "iso_a3", by.y = "iso")

world_points<- st_centroid(world)

world_points <- cbind(world, st_coordinates(st_centroid(world$geometry)))

ggplot(data = world1) +
  theme_bw() +
  geom_sf(aes(fill = rmse)) +
  geom_text(data = world_points, aes(x = X, y = Y, label = name), col = "grey", check_overlap = T, size = 1.5) +
  scale_fill_viridis_c(option = "plasma") +
  labs(x = "", y = "", fill = "RMSE", caption = "Benford analysis is used for fraud detection. Low RMSE is associated with low fraud probability.\nData from Johns Hopkins University (Our World in Data, Feb 17, 2021). Analysis and viz by Harshvardhan.", title = "Are Countries Manipulating COVID-19 Data?", subtitle = "Benford Analysis on COVID-19 Daily Cases")



# Saving RMSE and Country Rank

country_summ = data.frame("Country/Region" = world1$admin, "RMSE" = world1$rmse)

country_summ$Rank[order(country_summ$RMSE)] = 1:nrow(country_summ)

# Finding countries beyond thresholds

x = c(country_summ$Country.Region[which(country_summ$RMSE <3)])

cat(x, sep = ", ")

write.csv(country_summ, file = "summary_results.csv")