Beyond the Download Button: Decoding Google Play's Hidden Trends using SQL

Hi, I'm Navneet. Welcome to my first blog!

As a data enthusiast, I thrive on extracting golden insights from messy data. We data analysts wrestle with chaotic information and uncover meaningful insights. From SQL queries to advanced analytics, from visualization to machine learning, we'll explore it all. Fascinated by transforming raw data into valuable knowledge? You're in the right place. Subscribe to my blog for amazing content that will teach you something new with each post.

In this blog, we delve into Google Play Store data to uncover insights. Along the way, each step is explained, from intermediate and advanced SQL concepts to building the logic behind each query. I would also like to extend my gratitude to CampusX for their invaluable SQL case study lectures, which have greatly enriched this exploration.

Dataset: https://drive.google.com/file/d/1FtzHIfYe6B4vEwXNet-4E4M1iWFA7sgy/view?usp=sharing

We'll use this cleaned dataset for our SQL queries. If you're interested in the data cleaning process, the raw (uncleaned) dataset and the Python cleaning notebook are linked below. To keep this post concise, we'll focus on the SQL analysis.

Impure Data With Notebook

Let's get started!

About the Data:

Our dataset comprises 9,360 entries, each representing a unique app from the Google Play Store. It contains 13 columns describing various attributes of these apps.

App: Name of the application
Category: App's classification in Play Store
Rating: Average user rating (0-5 scale)
Reviews: Number of user reviews
Size: Storage space required by the app
Installs: Estimated number of app installations
Type: Free or Paid app
Price: Cost of the app (if paid)
Content Rating: Age appropriateness (e.g., Everyone, Teen)
Genres: Specific categories or themes of the app
Last Updated: Most recent update date
Current Ver: Current version number of the app
Android Ver: Minimum compatible Android version

This dataset allows for analysis of app characteristics, popularity trends, and market dynamics in the Google Play Store ecosystem.

Let's check the data:

Select * from playstore; -- Name of the table

Now that we're familiar with our dataset's structure, let's dive into the query section. We'll explore a series of scenario-based questions, each designed to extract meaningful insights from our Google Play Store data.

You're working as a market analyst for a mobile app development company. Your task is to identify the five most promising categories for launching new free apps, based on their average ratings.

INTUITION

To find the top 5 promising categories for launching new free apps, we'll focus on average ratings. First, we confirm that our dataset includes both free and paid apps. Then, we'll calculate the average rating for each category, but only considering free apps. By sorting these averages from highest to lowest and selecting the top 5, we can identify which categories have the best-performing free apps. This approach gives us insight into where new free apps might find the most success, based on how users are rating existing apps in those categories. Our query essentially distills the data to show us the most highly-rated free app categories, providing a clear direction for potential new app development.

APPROACH

The first query retrieves the distinct values from the "type" column in the "playstore" table. This confirms which types of apps are present in the dataset, such as Free and Paid. The second query focuses on free apps by filtering the data where the type is 'Free'. It calculates the average rating of these free apps within each category and rounds it to two decimal places. The results are grouped by category and sorted in descending order of average rating, and only the top five categories are displayed.

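-- Step 1: confirm which app types exist (expected: Free and Paid)
select distinct type from playstore;

-- Step 2: average rating per category, free apps only, top 5
select category, round(avg(rating), 2) as average_rating
from playstore
where type = 'Free'
group by category
order by average_rating desc
limit 5;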

Output:

EVENTS	4.44
EDUCATION	4.38
ART_AND_DESIGN	4.36
BOOKS_AND_REFERENCE	4.35
PARENTING	4.34
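
One caveat with ranking by average rating alone: a category with only a handful of free apps can reach the top on very little evidence. A minimal sketch of a guard against that, assuming MySQL and the same playstore table. The threshold of 50 apps is an arbitrary, hypothetical cut-off and the alias free_app_count is illustrative; neither comes from the dataset or the original queries.

```sql
-- Same ranking as above, but ignoring thinly populated categories.
-- The cut-off of 50 is an assumed value; tune it to your data.
SELECT category,
       COUNT(*)              AS free_app_count,
       ROUND(AVG(rating), 2) AS average_rating
FROM playstore
WHERE type = 'Free'
GROUP BY category
HAVING COUNT(*) >= 50
ORDER BY average_rating DESC
LIMIT 5;
```

Comparing this list with the unfiltered one shows whether any top category owes its rank to a small sample.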

As a business strategist for a mobile app company, your objective is to pinpoint the three categories that generate the most revenue from paid apps. This calculation is based on the product of the app price and its number of installations.

INTUITION

As a business strategist, you want to understand which categories of paid apps generate the most revenue to maximize profitability and inform future investment decisions. By analyzing the revenue generated from the combination of app price and the number of installations, you can pinpoint which categories are the most financially successful and prioritize them for further development or marketing efforts.

APPROACH

To achieve this, the query estimates the revenue of each paid app by multiplying its price by the number of times it has been installed. Because the Installs column is itself an estimate, this figure is a rough proxy rather than actual revenue. The query then averages these revenue figures for each category of paid apps, groups the apps by category, sorts the categories from highest to lowest average revenue, and selects the top three. This way, you can easily see which app categories are the most profitable and prioritize them in your business strategy.

SELECT category,
       ROUND(AVG(price * installs), 2) AS revenue
FROM playstore
WHERE type = 'Paid'
GROUP BY category
ORDER BY revenue DESC
LIMIT 3;

Output:

Category	revenue
LIFESTYLE	3199340.56
FINANCE	1979115.38
PHOTOGRAPHY	1162143.33
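
The scenario asks which categories generate the most revenue, while the query above ranks them by average revenue per app. A minimal sketch of ranking by total estimated revenue instead, with the average kept alongside for comparison. It assumes MySQL and the same playstore table; the aliases total_revenue and avg_revenue_per_app are illustrative, and the resulting top three may differ from the output shown above.

```sql
-- Rank paid-app categories by total estimated revenue (price * installs),
-- showing the per-app average next to it for comparison.
SELECT category,
       ROUND(SUM(price * installs), 2) AS total_revenue,
       ROUND(AVG(price * installs), 2) AS avg_revenue_per_app
FROM playstore
WHERE type = 'Paid'
GROUP BY category
ORDER BY total_revenue DESC
LIMIT 3;
```

The total favors categories with many paid apps, while the average favors categories where a typical paid app earns more, so which one to use depends on the business question.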

As a data analyst for a gaming company, you're tasked with calculating the percentage of games within each category. This information will help the company understand the distribution of gaming apps across different categories.
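
The scenario above asks for a percentage breakdown by category. A minimal sketch of one way to compute it, assuming MySQL, the same playstore table, and that "games" is read as "apps" (each category's share of all apps in the dataset). The aliases app_count and pct_of_apps are illustrative names.

```sql
-- Share of all apps that falls into each category, as a percentage.
SELECT category,
       COUNT(*) AS app_count,
       ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM playstore), 2) AS pct_of_apps
FROM playstore
GROUP BY category
ORDER BY pct_of_apps DESC;
```

The scalar subquery returns the total row count once, so every category's count is divided by the same denominator and the percentages add up to roughly 100 after rounding.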

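Next, as a data analyst, you are asked to recommend whether the company should develop paid or free apps in each category, based on user ratings.

INTUITION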

As a data analyst, your goal is to determine whether your company should develop paid or free apps in each category based on user ratings. By comparing the average ratings of free and paid apps within each category, you can make an informed recommendation on which type of app (free or paid) tends to be better received by users. Higher ratings indicate better user satisfaction, which is crucial for app success and retention.

APPROACH

To accomplish this, you'll first calculate the average ratings for free and paid apps in each category. This involves two intermediate result sets, written as common table expressions (CTEs): one for free apps and one for paid apps. You'll then join them on category to compare the averages, and recommend whichever type (free or paid) has the higher average rating in each category.

The three parts below form a single statement:

-- CTE 1: average rating of free apps per category
with freeapp as
(
    select category, round(avg(rating), 2) as avg_rating_free
    from playstore
    where type = 'Free'
    group by category
),
-- CTE 2: average rating of paid apps per category
paidapp as
(
    select category, round(avg(rating), 2) as avg_rating_paid
    from playstore
    where type = 'Paid'
    group by category
)
-- Join the CTEs and compare ratings. The inner join keeps only categories
-- that have both free and paid apps; ties go to 'Develop Paid app'
-- because the comparison is strictly greater-than.
select f.category, f.avg_rating_free, p.avg_rating_paid,
       if(f.avg_rating_free > p.avg_rating_paid, 'Develop Free app', 'Develop Paid app') as Development
from freeapp as f
inner join paidapp as p
on f.category = p.category;

Output:

Category	Free_avg_rating	Paid_avg_rating	Decision
BUSINESS	4.12	4.2	paid app
COMMUNICATION	4.17	4.06	free app
DATING	3.98	3.62	free app
EDUCATION	4.38	4.75	paid app
ENTERTAINMENT	4.12	4.6	paid app
FOOD_AND_DRINK	4.16	4.35	paid app
HEALTH_AND_FITNESS	4.27	4.39	paid app
GAME	4.28	4.37	paid app
FAMILY	4.18	4.3	paid app
MEDICAL	4.17	4.26	paid app
PHOTOGRAPHY	4.2	4.04	free app
SPORTS	4.22	4.25	paid app
PERSONALIZATION	4.31	4.44	paid app
PRODUCTIVITY	4.21	4.21	paid app
WEATHER	4.23	4.37	paid app
TOOLS	4.04	4.17	paid app
TRAVEL_AND_LOCAL	4.11	4.1	free app
LIFESTYLE	4.09	4.25	paid app
AUTO_AND_VEHICLES	4.18	4.6	paid app
NEWS_AND_MAGAZINES	4.13	4.8	paid app
SHOPPING	4.26	4.5	paid app
BOOKS_AND_REFERENCE	4.35	4.28	free app
SOCIAL	4.26	3.7	free app
ART_AND_DESIGN	4.36	4.73	paid app
VIDEO_PLAYERS	4.06	4.1	paid app
FINANCE	4.14	3.83	free app
MAPS_AND_NAVIGATION	4.06	3.86	free app
PARENTING	4.34	3.35	free app

Suppose you're a database administrator. Your database has been hacked, and the hackers are changing the prices of certain apps. The IT team is taking a long time to neutralize the hack. You can't stop the hackers from making changes, but as a responsible manager you don't want those changes to go untracked, so you need a measure that records every change in price.

Now this is an interesting problem and also an industry-relevant question. You have to create a trigger.

Trigger Definition: A trigger is a database object that automatically executes a specified action in response to certain events on a particular table or view. Triggers are useful for maintaining data integrity, enforcing business rules, and recording changes to data for auditing purposes.

To learn More about triggers

INTUITION

As a responsible manager, you need a way to track changes to app prices in the database, especially since the system is currently compromised by hackers altering these prices. By recording every price change, you can maintain a log of what the prices were before and after each change, which is critical for data recovery and analysis once the breach is resolved.

APPROACH

Create a logging table: First, create a table to log the price changes, capturing the app name, old price, new price, type of operation, and the timestamp of the operation.
Create a trigger for updates: Then, create a trigger that fires after any update to the play table, which is a working copy of playstore. This trigger records the relevant details in the logging table.

Here is the approach with code snippets:

Create the logging table

-- This table will store the details of each price change.
CREATE TABLE PriceChangeLog ( 
    App VARCHAR(255),
    Old_Price DECIMAL(10, 2),
    New_Price DECIMAL(10, 2),
    Operation_Type VARCHAR(10),
    Operation_Date TIMESTAMP
);

Create a copy of the playstore table:

-- This step creates a working table (play) from the existing 
-- playstore table.
CREATE TABLE play AS
SELECT * FROM playstore;

-- Create the trigger:
DELIMITER //
CREATE TRIGGER price_change_update
AFTER UPDATE ON play
FOR EACH ROW
BEGIN
    INSERT INTO PriceChangeLog (App, Old_Price, New_Price, Operation_Type, Operation_Date)
    VALUES (NEW.App, OLD.Price, NEW.Price, 'update', CURRENT_TIMESTAMP);
END;
//
DELIMITER ;

This trigger is set to activate after any update operation on the play table. It logs the app name, old price, new price, operation type (update), and the current timestamp into the PriceChangeLog table.
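
Because the trigger fires on every update, it also writes a log row when some other column changes and the price stays the same. A minimal sketch of a variant that logs only real price changes, assuming MySQL. The name price_change_only is hypothetical, and the trigger is meant as a replacement for price_change_update rather than an addition to it, since having both would log each price change twice.

```sql
DELIMITER //
CREATE TRIGGER price_change_only
AFTER UPDATE ON play
FOR EACH ROW
BEGIN
    -- <=> is MySQL's NULL-safe equality, so a change from or to NULL
    -- is also treated as a price change.
    IF NOT (OLD.Price <=> NEW.Price) THEN
        INSERT INTO PriceChangeLog (App, Old_Price, New_Price, Operation_Type, Operation_Date)
        VALUES (NEW.App, OLD.Price, NEW.Price, 'update', CURRENT_TIMESTAMP);
    END IF;
END//
DELIMITER ;
```

If you use this variant, remember to drop price_change_only instead of price_change_update when logging is no longer needed.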

Use this snippet to check the effect of the trigger:

SET SQL_SAFE_UPDATES = 0; -- disable safe-update mode so UPDATEs can filter on a non-key column
UPDATE play
SET price = 4
WHERE app = 'Infinite Painter';

UPDATE play
SET price = 5
WHERE app = 'Sketch - Draw & Paint';

SELECT * FROM play WHERE app = 'Sketch - Draw & Paint'; -- the updated row
SELECT * FROM PriceChangeLog; -- the changes recorded by the trigger

Your IT team has neutralized the threat. However, the hackers changed some prices in the meantime. Because of your measure, those changes have been recorded, and you now want to restore the correct data in the database.

INTUITION

After neutralizing the threat, you need to restore the original prices of the apps that were altered by hackers. Since you have a record of the old prices in the PriceChangeLog table, you can use this information to update the play table and revert the prices back to their correct values. This ensures data integrity and accuracy in your app pricing.

APPROACH

After dropping the trigger to stop further logging, we update the play table by joining it with PriceChangeLog based on the app names. This update reverts the prices back to their original values recorded before the hacking incident. Finally, we verify the correct restoration of prices by checking a specific app's data in the play table. This ensures the database reflects accurate app prices following the security breach.

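Drop the trigger and update the prices in the play table:

DROP TRIGGER price_change_update;

-- Revert each logged app to the price recorded before the change.
-- The table name is written as it was created, because table names
-- are case-sensitive on some platforms.
UPDATE play AS p1
INNER JOIN PriceChangeLog AS p2 ON p1.app = p2.app
SET p1.price = p2.old_price;

SELECT * FROM play WHERE app = 'Sketch - Draw & Paint'; -- To verify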
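
The join above works when each app was changed once, as in the test updates earlier. If an app's price was changed several times, the log holds several Old_Price values for it, and the original price is the one in the earliest entry. A minimal sketch of that case, assuming MySQL and that Operation_Date distinguishes the entries (TIMESTAMP has one-second resolution by default, so changes made within the same second would tie). The alias first_change is illustrative.

```sql
-- Restore each app to the Old_Price from its earliest logged change.
-- Run this after the trigger has been dropped, so the restore itself
-- is not logged.
UPDATE play AS p
INNER JOIN PriceChangeLog AS l
        ON l.App = p.App
INNER JOIN (
    -- earliest logged change per app
    SELECT App, MIN(Operation_Date) AS first_change
    FROM PriceChangeLog
    GROUP BY App
) AS f
        ON f.App = l.App
       AND f.first_change = l.Operation_Date
SET p.Price = l.Old_Price;
```

Adding an AUTO_INCREMENT id column to the log table would make the ordering unambiguous even when several changes share a timestamp.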

As a data analyst, you are asked to investigate the correlation between two numeric columns: app ratings and the number of reviews.

INTUITION

The correlation coefficient calculation aims to quantify the relationship between app ratings and the quantity of reviews in the playstore dataset. A positive coefficient suggests that higher ratings typically correlate with more reviews, indicating strong user engagement. Conversely, a negative coefficient would imply the opposite relationship. This analysis provides valuable insights into how user perception (ratings) aligns with user activity (reviews), essential for strategic decisions in app development and marketing.

APPROACH

To determine the correlation between app ratings and the quantity of reviews in the playstore dataset, we begin by calculating the average rating (@x) and the average number of reviews (@y). Using these averages, we compute the deviations from the mean for both ratings and reviews, along with their squares, in a CTE (t). This prepares the components of the correlation coefficient: the sum of the products of the deviations (@numerator) and the sums of the squared deviations (@deno_1 and @deno_2). Finally, we divide @numerator by the square root of the product of @deno_1 and @deno_2, which gives a value between -1 and 1 that measures the linear relationship between app ratings and reviews.
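
Written out, this is the standard Pearson correlation coefficient. The symbols x_i (rating of app i) and y_i (review count of app i) are notation introduced here for illustration; the comments map each part to the SQL variables used in this section.

```latex
r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i} (x_i - \bar{x})^2 \; \sum_{i} (y_i - \bar{y})^2}}
% \bar{x} = @x (average rating), \bar{y} = @y (average reviews)
% numerator = @numerator
% the two sums under the square root = @deno_1 and @deno_2
```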


-- Calculate average rating and average reviews (kept unrounded so the
-- deviations are not distorted by rounding)
SET @x = (SELECT AVG(rating) FROM playstore);
SET @y = (SELECT AVG(reviews) FROM playstore);

-- CTE t: deviations from the mean and their squares.
-- The CTE and the SELECT that follows it are one statement.
WITH t AS
(
    SELECT  (rating - @x) AS rat,
            (reviews - @y) AS rev,
            (rating - @x) * (rating - @x) AS sqr_x,
            (reviews - @y) * (reviews - @y) AS sqr_y
    FROM playstore
)
-- Calculate numerator and denominators for the correlation coefficient
SELECT  SUM(rat * rev), SUM(sqr_x), SUM(sqr_y)
INTO    @numerator, @deno_1, @deno_2
FROM t;

-- Calculate correlation coefficient (only the final result is rounded)
SELECT ROUND(@numerator / SQRT(@deno_1 * @deno_2), 2) AS corr_coeff;
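
The same coefficient can also be computed in a single query with MySQL's built-in aggregates, which is a handy cross-check for the step-by-step version. A minimal sketch, assuming MySQL and the same playstore table. It relies on the identity that the covariance equals AVG(x*y) - AVG(x)*AVG(y) and that the correlation is the covariance divided by the product of the two population standard deviations.

```sql
-- Pearson correlation between rating and reviews in one pass.
-- Rows with a NULL in either column are excluded so that every
-- aggregate is computed over the same set of rows.
SELECT ROUND(
         (AVG(rating * reviews) - AVG(rating) * AVG(reviews))
         / (STDDEV_POP(rating) * STDDEV_POP(reviews)),
       2) AS corr_coeff
FROM playstore
WHERE rating IS NOT NULL
  AND reviews IS NOT NULL;
```

Both versions should agree to two decimal places; a mismatch usually points to NULL handling or rounding of intermediate values.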

Your boss noticed that some rows in the genres column contain multiple genres, which was causing issues when developing a recommender system from the data. You have been assigned the task of cleaning the genres column by splitting it into two genre columns; rows that have only one genre will have the second column blank.

INTUITION

The task involves cleaning up the genres column in the dataset to facilitate the development of a recommender system. Many rows contain multiple genres separated by semicolons, which complicates the analysis. To address this, we need to split the genres column into two separate columns: one for the primary genre and another for the secondary genre. Rows that originally had only one genre will have the secondary genre column as blank. This cleanup ensures that each app's genre information is structured consistently, which is crucial for accurate recommendations in the system.

APPROACH

To achieve this, we'll use two custom SQL functions: f_name and l_name. The f_name function extracts the first genre from the genres column, handling cases where multiple genres are separated by semicolons. It identifies the position of the semicolon and retrieves the substring before it. The l_name function extracts the second genre, returning an empty string if there's only one genre present. By applying these functions in a query, we can transform the genres column into two separate columns (gene 1 and gene 2), ensuring each row is structured correctly for the recommender system's needs.

Function to extract the first genre (the text to the left of ';'):

DELIMITER //
CREATE FUNCTION f_name(a VARCHAR(100))
RETURNS VARCHAR(100)
DETERMINISTIC
BEGIN
    DECLARE l INT;
    DECLARE s VARCHAR(100);

    SET l = LOCATE(';', a);
    SET s = IF(l > 0, LEFT(a, l - 1), a);

    RETURN s;
END//
DELIMITER ;

Function to extract the second genre (the text to the right of ';'):

-- Function to extract the second genre
DELIMITER //
CREATE FUNCTION l_name(a VARCHAR(100))
RETURNS VARCHAR(100)
DETERMINISTIC
BEGIN
    DECLARE l INT;
    DECLARE s VARCHAR(100);

    SET l = LOCATE(';', a);
    SET s = IF(l = 0, '', SUBSTRING(a, l + 1, LENGTH(a)));

    RETURN s;
END//
DELIMITER ;

-- Query to transform genres column
SELECT app, genres, f_name(genres) AS 'gene 1', l_name(genres) AS 'gene 2'
FROM playstore;

The final query calls f_name and l_name to return the cleaned genres as two separate columns.
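
For comparison, MySQL's built-in SUBSTRING_INDEX and LOCATE functions can produce the same split without custom functions. A minimal sketch, assuming at most two genres per row: with three or more, SUBSTRING_INDEX(genres, ';', -1) returns only the last one, whereas l_name returns everything after the first semicolon. The aliases genre_1 and genre_2 are illustrative.

```sql
SELECT app,
       genres,
       -- text before the first ';' (the whole string if there is no ';')
       SUBSTRING_INDEX(genres, ';', 1) AS genre_1,
       -- text after the last ';', or blank when there is only one genre
       IF(LOCATE(';', genres) > 0,
          SUBSTRING_INDEX(genres, ';', -1),
          '') AS genre_2
FROM playstore;
```

The custom functions remain useful when the same logic is reused across many queries; the built-in version is simply less to maintain for a one-off cleanup.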

Your senior manager wants to know which apps are not performing up to par in their category. However, they are not interested in handling a separate file or list for every category, so they have asked you to build a dynamic tool: the manager inputs a category of interest, and the tool responds by displaying the apps within that category whose ratings are lower than the category's average rating.

INTUITION

The task is to create a dynamic tool that allows a senior manager to input a category of apps and receive real-time feedback on apps within that category that are performing below average. This tool aims to streamline decision-making by highlighting underperforming apps within specific categories. By focusing on categories of interest and comparing app ratings against their respective category averages, the manager can quickly identify areas needing attention or improvement.

APPROACH

To achieve this, a stored procedure named checking is created. The procedure takes a category (cate) as input and calculates the average rating (@c) for that category from the playstore dataset. It uses a subquery to compute the average ratings grouped by category. Once the average rating for the specified category is determined, the procedure then selects and displays all apps within that category (cate) where the rating is lower than the computed average (@c). This approach provides a direct, real-time feedback mechanism for evaluating app performance relative to category norms, supporting informed managerial decisions.

To Learn More About SQL Procedure

Stored Procedure to check underperforming apps in a specific category

DELIMITER //
CREATE PROCEDURE checking(IN cate VARCHAR(30))
BEGIN
    -- Calculate the average rating for the specified category
    SET @c = (
        SELECT average
        FROM (
            SELECT category, ROUND(AVG(rating), 2) AS average
            FROM playstore
            GROUP BY category
        ) m
        WHERE category = cate
    );
    -- Select apps within the specified category that have ratings lower than the category average
    SELECT *
    FROM playstore
    WHERE category = cate
    AND rating < @c;
END//
DELIMITER ;

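Call the stored procedure with a specific category ('business'):

CALL checking('business');

This call returns the underperforming apps in the business category. The lowercase 'business' matches the stored value BUSINESS as long as the column uses a case-insensitive collation, which is MySQL's default.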
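
If a report across all categories is ever needed, the same comparison can be written without a procedure by using a window function. A minimal sketch, assuming MySQL 8.0 or later (window functions are not available in earlier versions) and the same playstore table. The aliases rated and category_avg are illustrative, and unlike the procedure this version compares against the unrounded average.

```sql
-- Apps rated below their own category's average, for every category at once.
SELECT app, category, rating, category_avg
FROM (
    SELECT app,
           category,
           rating,
           AVG(rating) OVER (PARTITION BY category) AS category_avg
    FROM playstore
) AS rated
WHERE rating < category_avg
ORDER BY category, rating;
```

Adding a filter such as WHERE category = 'BUSINESS' to the outer query narrows the result to one category, which mirrors what the stored procedure returns.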

Bonus Theory Question

What are duration time and fetch time?

Duration Time: Duration time is how long the database server takes to execute the query, from receiving the statement to finishing its processing.

Fetch Time: Once execution is complete, fetch time is how long it takes to transfer the results from the server back to the client. It depends mainly on how much data is returned and how quickly it can be sent.

For example, if a query is simple but returns a large volume of data, the fetch time will likely be longer because many records have to be transferred. Conversely, if the query is complex, with multiple joins, filters, or aggregations, the duration time may be longer because the server has more work to do before any results can be fetched.

Wrapping Up: Exploring SQL Concepts and Looking Ahead!

That concludes this blog post on these concepts, which can be quite intricate. If you have any suggestions or questions, feel free to let me know. Don't forget to subscribe for updates on the next blog, where we'll dive into data cleaning using SQL. Goodbye!