Czechitas Digital Academy Project: Trial Licenses Analysis

10 min readNov 27, 2020

Data presented in this blog post were anonymized. I’ve selected random dates for displayed visuals not to break the non-disclosure agreement I have signed with the company. For future reference I’ve chosen random company name as DataSec to be used within this post.

Introduction

My name is Jitka, I like to look at the things from different perspective and analyze them. I’ve always worked with data during my career. Recently I’ve decided to work on my analytical skills and joint the Czechitas Digital Academy: Data to learn more about data analytics, databases, SQL, Python and visualization tools. This project is a great way to utilize what I’ve learned during 3 months of intensive training.

In the beginning there were many ideas on what I want to cover — and it was quite clear for me that I wanted to cover everything I learned in Digital academy.

My dream project was about something practical and business oriented. I was lucky to get in touch with the company DataSec to give me possibility to work with their data on Trial Licenses.

Trial License is usually offered to customer to try fully working product for limited period of time to decide on purchase.

Objectives of The Project

Original goal of the project was to find out as much as I can about customers who got Trial License and answer the following questions:

What is a real conversion rate — how many of customers purchase?
Is the conversion rate influenced by a number of trial licenses provided to the customer?
What is the time to purchase?
Is there any correlation between number of trial licenses and orders?
Is there any seasonality regarding conversion?
Can number of seats influence conversion?

Data Cleaning

For the start I got several data tables to work with (in CSV format). Only knowledge I got by that time was data modeling and some SQL. I started with what I’ve already learned and put together first data model. Then I started with data import to MS SQL. The changes made in MS SQL were:

Removal of disabled licenses
Removal of data entries with missing values— customer ID, country, name
Removal of entries that were supposed to be numbers but weren’t
Removal of data older than 2016 (agreed with DataSec that those should not be considered)
Creation of new trial licenses table
Set up relationships — set primary and foreign keys and select right data types

Example of simple SQL script used to create the new table:

CREATE TABLE Trial_Licenses_Appended
(
 [id] [bigint] NOT NULL,
 [created] [datetime] NOT NULL,
 [partner] [bigint] NULL,
 [customer] [bigint] NOT NULL,
 [serial] [nvarchar](50) NOT NULL,
 [valid_from] [datetime] NOT NULL,
 [valid_to] [datetime] NOT NULL,
 [count] [bigint] NOT NULL,
 [info] [nvarchar](50) NULL,
);INSERT INTO Trial_Licenses_Appended
 SELECT 
  *
 FROM trial_licenses
 WHERE DATEPART(YEAR,[valid_from]) > 2015;INSERT INTO Trial_Licenses_Appended
 SELECT 
  *
 FROM new_trial_licenses
 WHERE DATEPART(YEAR,[valid_from]) > 2015;

Python Pandas and Jupyter Notebook

Python was my favorite class since the day one. In this project I’ve used that for data merging and to work with the exported data from license server. The same could be done in MS SQL or Power BI, however this was a good practice for me.

Examples of scripts I run in Jupyter Notebook are below.

How to add renewals dimension into orders table:

import pandas as pdorders = pd.read_csv('orders_suborders_items_valid_from_Y2016.csv', encoding='utf-8')rwns = pd.read_csv('renewals.csv', encoding='utf-8')merged = pd.merge(orders, rwns, how='left', left_on='id', right_on='original order', validate='many_to_many')merged['renewal'] = merged['new order'].notna()
merged = merged.rename(columns={'id_x': 'id'})
del = merged.loc[:,'id_y':'original order']merged.drop(columns=['id_y','new order','original order'], inplace=True)merged.to_csv('orders_updated.csv', encoding='utf-8', index=False)

2. Merging two tables where IDs were in upper case and needed them in lower case:

import pandas as pdtelem = pd.read_excel('data.xlsx', sheet_name='Sheet1', header=2, index_col='Telemetry ID')telem.rename(columns={'Customer': 'Customer guid'}, inplace=True)
telem_trial = telem[telem['active_trial'] > 0 ] 
print(type(telem_trial['Customer guid']))  # <class 'pandas.core.series.Series'>telem_trial['Customer guid'] = telem_trial['Customer guid'].str.lower()
telem_trial.to_csv('telem_trial.csv')customers = pd.read_csv('customers.csv', encoding='utf-8', index_col='id')customer_trial = trial.merge(customers, left_on='customer', right_on='id')final = customer_trial.merge(telem_trial, how='left', left_on='guid', right_on='Customer guid')
final.to_csv('final.csv')

Data Analyses — Power BI

Most of analyses in this project are done in Power BI . I did some data transformation and more data cleaning there such as:

Changed the dates to be able to work with calendar (set up unified format where needed)
Created a calendar that allows me to work with a date dimension
Added a new dimension — table with regions
Elimination of inactive customers

Looking for the answers

What is a real conversion rate — how many customers purchase?

There is not a direct relationship between purchase record and trial licenses — this has to be done through Customer ID. I’ve decided to keep all records regarding trial licenses under new columns in Customers table.

There are also customers that purchase product first and have trial license after that — further analysis showed that this is around 1,8% from all orders for available sample. Those were filtered out from the report.

Customers who renewed should not be taken into consideration even though they had trial license prior renewal. Those were filtered out from the report.

Final custom measure used for calculation of conversion rate is below.

m_KPI_TrialConversion =
 CALCULATE 
 (
    COUNT(customers[id]),
     FILTER(trial_licenses_appended, 
      COUNT(trial_licenses_appended[id] ) > 0 ),
     FILTER (orders_suborders_items_valid_from_Y2016,
      orders_suborders_items_valid_from_Y2016[standard_order] = 1),
     FILTER (customers, customers[OrderBeforeTrial] <> 1 )
 )
    / 
 CALCULATE 
 (
   COUNT(customers[id]),
    FILTER(trial_licenses_appended,
     COUNT(trial_licenses_appended[customer_id]) > 0)
 )

2. Is the conversion rate influenced by the number of trial licenses provided to customer?

One customer can have more trial licenses during the time — higher conversion achieved for those with 2–5 licenses. This is quite important for DataSec for further decision about possible limit for number of trial licenses provided to customer.

Overview for selected random period of time.

3. What is the time to purchase?

This was calculated as a new calculated column in Customers table.

#Calculated Column
TimeToPurchase = 
   IF ( customers[order_count] > 0 && 
        customers[trial_count_per_customer] > 0, 
        [m_TimeToPurchaseFromTrial], BLANK())#Custom Measure  
m_TimeToPurchaseFromTrial = 
   MIN(orders_suborders_items_valid_from_Y2016[valid_from]) 
   - 
   MAX(trial_licenses_appended[valid_from])

Based on the visible trend I’ve split customers into 6 categories to display data. Around 60% of customers purchase within 90 days from trial installation from available data set. There is a gap for 91–120 days and small peak for those to purchase within 121 to 365 days. The reason is that we compare 30 days period to 6 months period. Specific deals may take longer time.

Split into 6 categories on Time to Purchase for selected random period

4. Is there any correlation between number of trial licenses and orders?

Trial conversions are only around 20–25% from all purchases — this can be for two reasons:

Trial licenses are not that much used to help with conversion
Problem with the data tracking

When looking further into the data there are big differences for different countries and the way how DataSec business partners work with customers.

Order Ratio For Random Period (per country on y axes)

5. Is there any seasonality regarding conversion?

From available data there is visible minimal seasonality for DataSec Trial customers.

6. Can number of seats influence conversion?

From available data there is no visible connection between number of seats and conversion.

All questions above helped me to think about what should be included in dashboards to be provided to DataSec management.

Final set of dashboards contains:

Main metrics: trial licenses, orders, converted licenses, conversion rate, number of active trial customers , ratio of trial orders from all orders
Yearly, monthly, quarterly (slicers available to filter data), overviews over countries, business partners, regions — example below shows performance per country and per business partner (data are anonymized and only random period was used)

Example Dashboard that was created for DataSec (Active Trials are blank — filter for past dates)

What was originally not in plan but became a part of the project

1. Forecast on orders

I’ve split trial customers into 4 categories based on time to purchase (0–30 days, 30–60 days, 60–90 days, > 90), calculated average conversion rate and used that to calculate expected number of orders (in calculated table).

// Custom Measure for trial conversion rate m_TimeSlotAPct = 
 CALCULATE (
  COUNT(customers[id]),
  FILTER(customers,customers[TimeSlots] = "A 0-30 days"),
  FILTER(orders_suborders_items_valid_from_Y2016, orders_suborders_items_valid_from_Y2016[standard_order] = 1)
)
/
 CALCULATE (
  COUNT(customers[id]),
  FILTER(trial_licenses_appended, count(trial_licenses_appended[customer_id]) > 0)
)

Below part of DAX code used to calculate table with forecasted numbers.

m_BIForecastDetailed =// Calculation of trial licenses for forecast calculation
VAR ActiveTrials = [m_ActiveTrials]VAR ActiveTrialsHistSLotA = 
 CALCULATE (
  COUNT(trial_licenses_appended[id]),
  FILTER(trial_licenses_appended, 
   (trial_licenses_appended[valid_to] <= DATE(2020,9,15)) && 
   (trial_licenses_appended[valid_to] > DATE(2020,8,15)))
 )VAR ActiveTrialsHistSLotB = 
 CALCULATE (
  COUNT(trial_licenses_appended[id]),
  FILTER(trial_licenses_appended, 
   (trial_licenses_appended[valid_to] <= DATE(2020,8,15)) && 
   (trial_licenses_appended[valid_to] > DATE(2020,7,15)))
)VAR ActiveTrialsHistSLotC = 
 CALCULATE (
  COUNT(trial_licenses_appended[id]),
  FILTER(trial_licenses_appended, 
   (trial_licenses_appended[valid_to] <= DATE(2020,7,15)) && 
   (trial_licenses_appended[valid_to] > DATE(2020,6,15)))
)RETURN 
   SWITCH( SELECTEDVALUE(BI_ForecastDetailed[DaysToPurchase]),
 30  , ([m_TimeSlotAPct] * ActiveTrials + 
     [m_TimeSlotBPct] * ActiveTrialsHistSLotA + 
     [m_TimeSlotCPct] * ActiveTrialsHistSLotB + 
     [m_TimeSlotDPct] * ActiveTrialsHistSLotC),
 60  , ([m_TimeSlotBPct]  * ActiveTrials + 
        [m_TimeSlotCPct] * ActiveTrialsHistSLotA + 
     [m_TimeSlotDPct] * ActiveTrialsHistSLotB),
 90 , ([m_TimeSlotCPct] * ActiveTrials + 
        [m_TimeSlotDPct] * ActiveTrialsHistSLotA),
 120 , [m_TimeSlotDPct] * ActiveTrials,
 275 , [m_TimeSlotEPct] * ActiveTrialsHistSLotC,
 305 , [m_TimeSlotEPct]  * ActiveTrialsHistSLotB,
 310 , [m_TimeSlotFPct]  * ActiveTrialsHistSLotC,
 335 , [m_TimeSlotEPct]  * ActiveTrialsHistSLotA,
 340 , [m_TimeSlotFPct] * ActiveTrialsHistSLotB,
 365 , [m_TimeSlotEPct]  * ActiveTrials,
 370 , [m_TimeSlotFPct] * ActiveTrialsHistSLotA,
 400 , [m_TimeSlotFPct] * ActiveTrials 
)

Picture below shows example of forecast in case we’re looking at data in September (total numbers removed). Having also knowledge of average number of seats helped me to run forecast on number of seats.

Example dashboard for calculated forecast

2. Data from License Server about product usage

I was lucky to get additional data for the project from license server on how the product is used. Those needed to be cleaned to show only data regarding trial licenses. I’ve used Python for that (as showed above).

For final representation of results I’ve chosen funnel report that nicely shows positive impact of usage of the product on conversion.

Left: Usage of product for Trial Licenses, Right: Usage of product for those who purchased *(random period selected for this visual)*

3. Rolling Averages

I was looking for the way to provide DataSec with a view that allows them to monitor trends and allows to raise a flag for trial licenses. Rolling Average is a nice way how to show the trend and see a possible drop in licenses (that can have by the end of the day impact on the business).

I’ve created two views on trial licenses:

90 days rolling average to check the trend for shorter period
6 months rolling average to monitor longer trend

Rolling Averages for random period of time.

For calculation of rolling averages I’ve created custom measures working with calendar and basic calculation as you can see below.

m_rolling_avg_trials_count_daily =m_rolling_avg_trials_count_daily = 
    VAR SumPeriod = 
        CALCULATE([m_trial_count], 
            DATESINPERIOD(Kalendar[Date], 
    LASTDATE(Kalendar[Date]), -90, DAY))RETURN SumPeriod / 90
 
m_rolling_avg_trials_count_monthly = 
    VAR SumPeriod = 
        CALCULATE([m_trial_count], 
            DATESINPERIOD(Kalendar[Date],
    LASTDATE(Kalendar[Date]),-6, MONTH))RETURN SumPeriod / 6

Recommendations for DataSec

There is my list of recommendations for DataSec:

Around 1.8% of customers purchase and then install trial licenses. It would be great to look at this further to see possible process gaps or think about some training towards business partners.
There are visible differences between approach of different business partners towards customers — some of them push more direct sales (without trial license) and some of them utilize more trial licenses. I’m not sure what is the better approach but it will be good thing to look at.
Review the way the orders are tracked — the ratio of trial orders from all orders seems low (20–25%) and also the differences for various business partners are quite big.

Conclusion & Benefits

The goal of the project was to get as much as possible from data regarding trial licenses and help to drive DataSec business.

Having this report, DataSec can easily see conversions and monitor different parts of the business (per country, business partner) and see the gaps and differences through the different sales channels.

In addition to that, they can utilize forecast and work with data gathered about the product.

I’ve presented my results to the company and below I’ve included feedback from DataSec manager who will be working with the data:

“I was very pleased with the results, that Jitka prepared. She hasn’t just met the set expectation, but looked at the data from different perspective and showed us how we can better forecast the coming business. The Trial Dashboard with main overview showed us several areas, where we can better work with our partners and train them on better approach to Trial and potential customers. Now we have a dashboard, where we can easily see the current trend and forecast and actually start to use KPIs to measure the improvements that we are planning to put in place. Jitka also highlighted few areas where the data looked strange and helped us to improve internal processes”

Lessons learned

I’m very excided about how much I’ve learned with this Project. This is my top 3 for my next data analysis project:

Setup a clear methodology on how to name measures, columns and calculated columns from the beginning — this will make it easier to maintain for me and whoever will need to work with dashboards after that.
Spent more time on data cleaning and transformation — make it as clear as possible in MS SQL already — use views and select only columns that will be really used (including smarter names of columns)
It’s better to start with small and extend — this approach helped me a lot in time when my team-mate decided not to continue with Digital Academy and I had to work on the project by myself. Do not insist on using everything (e.g. Python was good to play with but not needed in this part of the project however can be good in case I need to get data from different source — e.g. CRM, telephony and other data sources).

What next?

When having direct access to DataSec database I can set up online dashboards to show real-time data. It shouldn’t be that complicated, logic works already, however I would need to find the way to clean up new data.

Model can be extended by Revenue dimension. This can really help to see detailed impact on business and overall sales results.

If we have it linked with data from CRM we can also see impact on Customer Support department and see how much time they spent with trial license customers compared to paid customers. The cost dimension can be added to see the profit.