# How a Simple Statistical Law Can Help Detect Fraud With AWS and Python | by Alexandre Bruffa | May, 2022

## Checking the validity of Benford’s law with Amazon Web Services and Python

Frank Benford was an electrical engineer known for rediscovering a statistical curiosity about the occurrence of digits in lists of data. This curiosity, known as Benford’s law or the law of anomalous numbers, has applications in accounting fraud detection, criminal trials, and election analysis, among other areas.

This is very simple. Make a list of everyday numbers: the number of pages of your favorite book, the length of the river near your home, the number of inhabitants of your town, the land area of your country, etc.

Then, calculate how many of those numbers begin with 1, with 2, and so on. Common sense would suggest that each leading digit appears about equally often, but it doesn’t. The distribution is logarithmic, given by the following formula:

`P(d) = log(1 + 1/d), d ∈ [1, ..., 9]`

Notes:

• `d` is an integer between 1 and 9 (0 is not taken into account).
• `P(d)` stands for the probability that `d` is the leading digit.
• `log` refers to the logarithm base 10.
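The expected frequencies follow directly from the formula; a few lines of Python compute them:

```python
import math

# Expected Benford distribution: P(d) = log10(1 + 1/d) for d = 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
# 1: 30.1%  2: 17.6%  3: 12.5%  ...  9: 4.6%
```

Note that the probabilities sum to exactly 1, since the product of the terms `(d + 1)/d` telescopes to 10.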

In other words, your city is more likely to have 100,000 or 1 million inhabitants (a leading 1) than 900,000 or 9 million (a leading 9). It sounds incredible, right? Let’s check it!

First, we need to build an enormous list of everyday-life numbers. Those numbers can easily be found on the internet; we just have to copy them from websites and then analyze them. But oh boy, that sounds boring! And since we want to work with massive data, it could take hours!

Another solution would be to work with websites or platforms that expose an API, like Reddit. However, this is quite limited: not all websites have an API, and we would have to build one integration per website, which is laborious and, eventually, just as boring.

Fortunately, there is a better approach. Do you remember my previous article about the PDF generation system? Well, we will reuse its main components to create an automated data extraction system.

This is how we will do it: run a Chromium instance on Lambda, visit a list of websites, and retrieve relevant data from them. Then we will process the data and compare it with the expected result.

This article will focus on the Lambda part and the data extraction. If you want to make a great integration with Cognito, API Gateway + Authorizer, and an RDS database, I invite you to read the following article:

Now, we need to find websites containing relevant data for our experiment. I highly recommend the following:

• Make sure that the data provided is consistent and well-referenced: we need high-quality data to check the validity of Benford’s law.
• Prefer websites with well-formatted data, ideally contained in HTML tables.

Let’s begin with this great Wikipedia article: List of mountains by elevation.

We can see that the relevant data of the article (the elevation of each mountain in meters and feet) sits in the second and third columns of several HTML tables with the `wikitable` class.

We know where the data is, so we can now extract it with the `querySelectorAll` JavaScript method, using the `.wikitable` and `td` CSS selectors together with the `:nth-child` pseudo-class.

Not bad! We retrieved all the elements we need. Now we will extract the value of each element using the spread syntax, the `map` method, and the `textContent` property.
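Putting the two steps together: the string below is the JavaScript expression we will send to the browser (the `:nth-child(2)` index comes from inspecting the tables, so adjust it for other sites), and a small helper (my own addition, not part of the original code) reduces each scraped cell to its leading digit:

```python
# JavaScript expression for the page: spread the NodeList returned by
# querySelectorAll into an array, then map each cell to its text content.
# Targets the 2nd column (elevation in meters) of every wikitable.
JS_EXTRACT = (
    "[...document.querySelectorAll('.wikitable td:nth-child(2)')]"
    ".map(td => td.textContent)"
)

def leading_digit(raw):
    """Return the first significant digit of a scraped cell like '8,848\\n'."""
    for ch in raw.strip():
        if ch.isdigit():
            if ch != "0":
                return int(ch)
        elif ch not in ",. \u00a0":
            return None  # not a numeric cell
    return None

print(leading_digit("8,848"))  # -> 8
```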

Awesome! Our selectors and JavaScript expression are ready; let’s move to the server side.

We will reuse the same Lambda Layer as in my previous article. It contains the headless Chromium, the Pyppeteer library, and other dependencies.
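As a minimal sketch of the server side, assuming the Layer exposes Pyppeteer (the function name and arguments here are mine, so adapt them to your setup): launch the headless Chromium, visit the page, and evaluate the extraction expression there.

```python
import asyncio

async def extract_values(url, js_expression):
    """Visit `url` with headless Chromium and evaluate a JS expression there."""
    from pyppeteer import launch  # provided by the Lambda Layer

    # In Lambda you would also pass executablePath pointing at the
    # Chromium binary bundled in the Layer (path depends on your setup).
    browser = await launch(headless=True, args=["--no-sandbox"])
    try:
        page = await browser.newPage()
        await page.goto(url)
        # force_expr=True: the string is an expression, not a function
        return await page.evaluate(js_expression, force_expr=True)
    finally:
        await browser.close()

# Inside the Lambda handler, something like:
# values = asyncio.get_event_loop().run_until_complete(
#     extract_values(url, js_expression))
```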

## The config file

In a config file, we set up the list of websites with relevant data and their corresponding selectors.
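For example (the structure, URL, and selector here are illustrative assumptions, matching the Wikipedia case above), the config could be a simple Python list:

```python
# config.py: one entry per website, pairing a URL with its extraction expression.
SITES = [
    {
        "url": "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation",
        # 2nd column of each wikitable: elevation in meters
        "js": "[...document.querySelectorAll('.wikitable td:nth-child(2)')]"
              ".map(td => td.textContent)",
    },
    # ... one entry per additional website
]
```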