Practice Python analysis with a #MakeOverMonday dataset

#MakeOverMonday Week 33 2019; A bird’s-eye view of clinical trials

In this blog post I explore and visualize a dataset with Python.

Exploring the data

The dataset, A bird’s-eye view of clinical trials, is provided as an Excel file. It is bigger than in other weeks: the file has 13,748 rows and 11 columns.

First I need to know what is in the dataset, so I use the .head() method.
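As a minimal sketch of this first look, the snippet below builds a tiny made-up DataFrame that stands in for the real file and calls .shape and .head() on it. All rows and the sponsor names other than GSK are invented for illustration, and the file name in the comment is an assumption.

```python
import pandas as pd

# Tiny stand-in for the real dataset; these rows are made up for illustration.
# With the actual file you would load it with something like:
# df = pd.read_excel("clinical_trials.xlsx")  # hypothetical file name
df = pd.DataFrame({
    "Sponsor": ["GSK", "Sponsor B", "GSK", "Sponsor C", "Sponsor B", "GSK"],
    "Phase": ["Phase 3", "Phase 1", "Phase 2", "Phase 3", "Phase 3", "Phase 1"],
    "Start_Year": [2005, 2006, 2007, 2001, 2010, 2006],
    "Start_Month": [1, 6, 3, 11, 9, 4],
    "Enrollment": [120, 40, 365, 5000, 0, 250],
})

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # the first five rows by default
```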

Then I want to know the dtypes.
There are 3 columns with numbers and 7 with text (categorical variables).
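A small sketch of how the dtypes can be inspected and split into numeric and text columns. The DataFrame is a made-up stand-in with only a few of the columns, and the status "Status X" is invented.

```python
import pandas as pd

# Made-up stand-in rows; only the column names come from the dataset.
df = pd.DataFrame({
    "Sponsor": ["GSK", "Sponsor B", "GSK"],
    "Status": ["Completed", "Completed", "Status X"],
    "Start_Year": [2005, 2006, 2007],
    "Enrollment": [120, 40, 365],
})

print(df.dtypes)  # one dtype per column

# Split the columns into numeric ones and everything else (text here).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
text_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(text_cols)
```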

For the 3 columns with numbers I can use the .describe() function. For 2 out of 3 that gives a funny result, because there is no use in describing ‘Start_Year’ and ‘Start_Month’. ‘Enrollment’ is the only one that gives useful answers to this request.
Now I need some domain knowledge: what does enrollment mean? Ah, I found it in the text of the original visualization: enrollment is the number of patients enrolled in a trial.
Result: The mean is 441 patients, with a max of 84,496 and a min of 0. The value 365 at the 75% quartile is below the mean, which tells me that at least three quarters of the trials are done with fewer patients than the mean.
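To illustrate why a 75% quartile below the mean points to a few very large trials, here is a minimal sketch with made-up enrollment numbers (not the real data). One large value pulls the mean above the 75% quartile, just like in the real column.

```python
import pandas as pd

# Made-up enrollment values with one very large trial.
enrollment = pd.Series([0, 20, 50, 100, 150, 300, 365, 5000], name="Enrollment")

summary = enrollment.describe()
print(summary)  # count, mean, std, min, quartiles and max

# Share of trials that enrolled fewer patients than the mean.
share_below_mean = (enrollment < enrollment.mean()).mean()
print(share_below_mean)
```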

Next up is counting the unique categories of the categorical variables.
For this I wrote a small function.

There are 10 different sponsors, 7 phases and 9 different statuses, and in total 867 conditions/diseases.
There are no duplicate NCT numbers, so every row is unique. There are 13,434 different titles and 13,564 different summaries.
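The function itself is not shown here, so the following is only a sketch of what such a helper could look like. The name count_unique, the column name "NCT" and all rows are assumptions made for illustration.

```python
import pandas as pd

def count_unique(frame, columns):
    """Return the number of distinct values in each of the given columns."""
    return {col: frame[col].nunique() for col in columns}

# Made-up stand-in rows; "NCT" is a hypothetical name for the ID column.
df = pd.DataFrame({
    "NCT": ["NCT001", "NCT002", "NCT003", "NCT004"],
    "Sponsor": ["GSK", "Sponsor B", "GSK", "Sponsor C"],
    "Phase": ["Phase 3", "Phase 1", "Phase 3", "Phase 2"],
    "Status": ["Completed", "Completed", "Status X", "Completed"],
})

counts = count_unique(df, ["Sponsor", "Phase", "Status"])
print(counts)

# True if any ID occurs more than once, so False means every row is unique.
has_duplicate_ids = df["NCT"].duplicated().any()
print(has_duplicate_ids)
```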

Visualizing the data

Bar Charts

First, I will examine the frequency distributions of the categorical variables ‘Sponsor’, ‘Phase’ and ‘Status’ with bar charts. Visualizing the others with bar charts would be crazy, because of the number of bars you would get.

The ‘Sponsor’ GSK has the highest number of trials, with almost 2,500.

Looking at the phases, most trials have the label Phase 3.

And logically, most trials have the ‘Status’ Completed.
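A bar chart like these can be sketched with value_counts() and matplotlib. The rows below are made up, and the choice of plain matplotlib is my assumption here, not necessarily the library behind the original charts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows; only GSK is a sponsor name taken from the dataset.
df = pd.DataFrame({
    "Sponsor": ["GSK", "Sponsor B", "GSK", "Sponsor C", "GSK", "Sponsor B"],
})

# Count the trials per sponsor, sorted from most to fewest.
sponsor_counts = df["Sponsor"].value_counts()

fig, ax = plt.subplots()
sponsor_counts.plot.bar(ax=ax)
ax.set_xlabel("Sponsor")
ax.set_ylabel("Number of trials")
ax.set_title("Trials per sponsor")
fig.tight_layout()
```

The same pattern works for ‘Phase’ and ‘Status’ by swapping the column name.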


Histograms

I made histograms of the numeric variables ‘Start_Year’, ‘Start_Month’ and ‘Enrollment’.

The histogram of ‘Start_Year’ shows that in the beginning there were not many trials, and that from 2000 the number of trials went up, with a peak in 2005, 2006 and 2007.

The histogram of ‘Start_Month’ shows a more or less even distribution.

The histogram of ‘Enrollment’ confirms what the describe function already showed for this column: most trials have a low number of participants, so the distribution is right-skewed.
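A minimal sketch of such a histogram, together with a numeric check of the skew. The enrollment values are made up, and matplotlib is again my assumption for the plotting library.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up enrollment values with one very large trial.
df = pd.DataFrame({"Enrollment": [0, 20, 50, 100, 150, 300, 365, 5000]})

fig, ax = plt.subplots()
# hist() returns the count per bin and the bin edges, next to drawing the bars.
counts, bin_edges, _ = ax.hist(df["Enrollment"], bins=10)
ax.set_xlabel("Enrollment")
ax.set_ylabel("Number of trials")

# A positive skewness value means a long tail to the right.
skewness = df["Enrollment"].skew()
print(counts)
print(skewness)
```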

Kernel density estimation

For ‘Start_Year’ and ‘Enrollment’ I would like to see the KDE (kernel density estimation) to find out where the trials are concentrated.
For ‘Start_Year’ the graph does not give extra insight; for ‘Enrollment’ it does.

This KDE shows that there are also some large trials with a bigger group of participants.
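As a rough sketch of what a KDE does, the snippet below estimates the density of some made-up enrollment values with scipy.stats.gaussian_kde and its default bandwidth. Using SciPy directly is an assumption on my part. The original graph may have been made with another library.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Made-up enrollment values with one very large trial.
enrollment = np.array([0, 20, 50, 100, 150, 300, 365, 5000], dtype=float)

# Fit a Gaussian KDE and evaluate it on an evenly spaced grid.
kde = gaussian_kde(enrollment)
grid = np.linspace(enrollment.min(), enrollment.max(), 200)
density = kde(grid)

fig, ax = plt.subplots()
ax.plot(grid, density)
ax.set_xlabel("Enrollment")
ax.set_ylabel("Density")

# Enrollment value where the estimated density is highest.
peak_location = grid[density.argmax()]
print(peak_location)
```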

Scatter Plot

I have made a Scatter Plot. A Scatter Plot is used to find out if there is a relationship between two variables. I wanted to see what the relationship was between ‘Enrollment’ and ‘Start_Year’. In this Scatter Plot it is visible when the bigger trials started.
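A scatter plot of these two columns can be sketched as follows. The rows are made up for illustration, and plain matplotlib is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows; only the column names come from the dataset.
df = pd.DataFrame({
    "Start_Year": [1995, 2001, 2005, 2006, 2006, 2007, 2010],
    "Enrollment": [30, 80, 120, 5000, 40, 365, 250],
})

fig, ax = plt.subplots()
ax.scatter(df["Start_Year"], df["Enrollment"])
ax.set_xlabel("Start_Year")
ax.set_ylabel("Enrollment")

# Look up the start year of the biggest trial, to compare with the plot.
year_of_largest = int(df.loc[df["Enrollment"].idxmax(), "Start_Year"])
print(year_of_largest)
```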

Box Plot

With a Box plot I was able to show when each sponsor was running its trials.
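One way to sketch such a box plot is to group ‘Start_Year’ by ‘Sponsor’ and draw one box per group. The rows are made up, "Sponsor B" is an invented name, and plain matplotlib is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows; only GSK is a sponsor name taken from the dataset.
df = pd.DataFrame({
    "Sponsor": ["GSK", "GSK", "GSK", "Sponsor B", "Sponsor B", "Sponsor B", "Sponsor B"],
    "Start_Year": [2001, 2005, 2007, 2006, 2008, 2010, 2012],
})

# One array of start years per sponsor, in alphabetical order of sponsor.
groups = {name: grp["Start_Year"].to_numpy() for name, grp in df.groupby("Sponsor")}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))
ax.set_xticklabels(list(groups.keys()))
ax.set_xlabel("Sponsor")
ax.set_ylabel("Start_Year")

# The median start year per sponsor is the line inside each box.
medians = df.groupby("Sponsor")["Start_Year"].median()
print(medians)
```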

Now I have explored the data and I can start the next step: telling a story with visualizations in Power BI.
