Beyond the Download Button: Decoding Google Play's Hidden Trends using SQL
Hi, I'm Navneet. Welcome to my first blog!
As a data enthusiast, I thrive on extracting golden insights from messy data. We data analysts wrestle with chaotic information and uncover meaningful insights. From SQL queries to advanced analytics, from visualization to machine learning, we'll explore it all. Fascinated by transforming raw data into valuable knowledge? You're in the right place. Subscribe to my blog for amazing content that will teach you something new with each post.
In this blog, we delve into Google Play Store data to uncover remarkable insights. From intermediate and advanced SQL concepts to building logical queries, each step is explained in detail. I would also like to extend my gratitude to CampusX for their invaluable SQL case study lectures, which have greatly enriched this exploration.
We'll use this cleaned dataset for our SQL queries. For those interested in the data cleaning process, you can find the original uncleaned dataset and the Python cleaning notebook at the links below. We'll focus on SQL analysis in this blog to keep it concise.
Let's get started!
About the Data:
Our dataset comprises 9,360 entries, each representing a unique app from the Google Play Store. It contains 13 columns describing various attributes of these apps.
App: Name of the application
Category: App's classification in Play Store
Rating: Average user rating (0-5 scale)
Reviews: Number of user reviews
Size: Storage space required by the app
Installs: Estimated number of app installations
Type: Free or Paid app
Price: Cost of the app (if paid)
Content Rating: Age appropriateness (e.g., Everyone, Teen)
Genres: Specific categories or themes of the app
Last Updated: Most recent update date
Current Ver: Current version number of the app
Android Ver: Minimum compatible Android version
This dataset allows for analysis of app characteristics, popularity trends, and market dynamics in the Google Play Store ecosystem.
Let's check the data:
Select * from playstore; -- Name of the table
Now that we're familiar with our dataset's structure, let's dive into the query section. We'll explore a series of scenario-based questions, each designed to extract meaningful insights from our Google Play Store data.
You're working as a market analyst for a mobile app development company. Your task is to identify the five most promising categories for launching new free apps, based on their average ratings.
INTUITION
APPROACH
select distinct(type) from playstore;
select category , round(avg(Rating),2) as average_rating
from playstore
where type = 'Free'
group by category
order by average_rating desc
limit 5;
Output:
Category | average_rating |
EVENTS | 4.44 |
EDUCATION | 4.38 |
ART_AND_DESIGN | 4.36 |
BOOKS_AND_REFERENCE | 4.35 |
PARENTING | 4.34 |
As a business strategist for a mobile app company, your objective is to pinpoint the three categories that generate the most revenue from paid apps. Revenue for an app is estimated as its price multiplied by its number of installs, and the query below ranks categories by the average of that figure per app.
INTUITION
APPROACH
SELECT category,
ROUND(AVG(price * installs), 2) AS revenue
FROM playstore
WHERE type = 'Paid'
GROUP BY category
ORDER BY revenue desc
limit 3;
Output:
Category | revenue |
LIFESTYLE | 3199340.56 |
FINANCE | 1979115.38 |
PHOTOGRAPHY | 1162143.33 |
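The query above ranks categories by average revenue per paid app. If the goal is the total revenue a category generates, SUM would be the aggregate to use, and the ranking can come out differently. Here is a minimal sketch of that difference in Python, with SQLite standing in for MySQL and a few made-up rows (the rows and the helper name top_category are assumptions for illustration, not part of the dataset):

```python
import sqlite3

# Hypothetical rows for illustration only; not taken from the real dataset.
rows = [
    ("LIFESTYLE", "Paid", 10.0, 1000),
    ("FINANCE", "Paid", 2.0, 4000),
    ("FINANCE", "Paid", 1.0, 5000),
    ("TOOLS", "Free", 0.0, 90000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE playstore (category TEXT, type TEXT, price REAL, installs INTEGER)"
)
conn.executemany("INSERT INTO playstore VALUES (?, ?, ?, ?)", rows)


def top_category(aggregate):
    # aggregate is "AVG" or "SUM"; it is chosen by this script, never by user input.
    sql = f"""
        SELECT category, ROUND({aggregate}(price * installs), 2) AS revenue
        FROM playstore
        WHERE type = 'Paid'
        GROUP BY category
        ORDER BY revenue DESC
        LIMIT 1
    """
    return conn.execute(sql).fetchone()


by_average = top_category("AVG")  # best category by average revenue per app
by_total = top_category("SUM")    # best category by total revenue
print(by_average, by_total)
```

In this toy data, one expensive app puts LIFESTYLE on top by average, while two cheaper but widely installed apps put FINANCE on top by total. Which aggregate is right depends on the business question being asked.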
As a data analyst for a gaming company, you're tasked with calculating the percentage of games within each category. This information will help the company understand the distribution of gaming apps across different categories.
INTUITION
APPROACH
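One way to read this question is as each category's share of all apps in the table. A minimal sketch of that calculation in Python, using SQLite as a stand-in for MySQL and made-up rows (the rows and the reading of the question are assumptions):

```python
import sqlite3

# Hypothetical app rows for illustration only.
rows = [("GAME", "a"), ("GAME", "b"), ("GAME", "c"), ("TOOLS", "d"), ("FAMILY", "e")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE playstore (category TEXT, app TEXT)")
conn.executemany("INSERT INTO playstore VALUES (?, ?)", rows)

# Each category's app count divided by the total number of apps, as a percentage.
share_sql = """
    SELECT category,
           ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM playstore), 2) AS percentage
    FROM playstore
    GROUP BY category
    ORDER BY percentage DESC, category
"""
shares = conn.execute(share_sql).fetchall()
print(shares)
```

Multiplying by 100.0 rather than 100 keeps the division in floating point, so the percentages are not truncated to whole numbers.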
The query below compares the average rating of free and paid apps in each category and suggests which type to develop. It uses two common table expressions (CTEs) and joins them on category.
-- CTE with the average rating of free apps per category
with freeapp as
(
    select category, round(avg(rating),2) as avg_rating_free
    from playstore
    where type = 'Free'
    group by category
),
-- CTE with the average rating of paid apps per category
paidapp as
(
    select category, round(avg(rating),2) as avg_rating_paid
    from playstore
    where type = 'Paid'
    group by category
)
-- Join the two CTEs and compare ratings (a tie falls to 'Develop Paid app')
select *,
    if(avg_rating_free > avg_rating_paid, 'Develop Free app', 'Develop Paid app') as Development
from
(
    select f.category, f.avg_rating_free, p.avg_rating_paid
    from freeapp as f
    inner join paidapp as p
    on f.category = p.category
) k;
Category | Free_avg_rating | Paid_avg_rating | Decision |
BUSINESS | 4.12 | 4.2 | paid app |
COMMUNICATION | 4.17 | 4.06 | free app |
DATING | 3.98 | 3.62 | free app |
EDUCATION | 4.38 | 4.75 | paid app |
ENTERTAINMENT | 4.12 | 4.6 | paid app |
FOOD_AND_DRINK | 4.16 | 4.35 | paid app |
HEALTH_AND_FITNESS | 4.27 | 4.39 | paid app |
GAME | 4.28 | 4.37 | paid app |
FAMILY | 4.18 | 4.3 | paid app |
MEDICAL | 4.17 | 4.26 | paid app |
PHOTOGRAPHY | 4.2 | 4.04 | free app |
SPORTS | 4.22 | 4.25 | paid app |
PERSONALIZATION | 4.31 | 4.44 | paid app |
PRODUCTIVITY | 4.21 | 4.21 | paid app |
WEATHER | 4.23 | 4.37 | paid app |
TOOLS | 4.04 | 4.17 | paid app |
TRAVEL_AND_LOCAL | 4.11 | 4.1 | free app |
LIFESTYLE | 4.09 | 4.25 | paid app |
AUTO_AND_VEHICLES | 4.18 | 4.6 | paid app |
NEWS_AND_MAGAZINES | 4.13 | 4.8 | paid app |
SHOPPING | 4.26 | 4.5 | paid app |
BOOKS_AND_REFERENCE | 4.35 | 4.28 | free app |
SOCIAL | 4.26 | 3.7 | free app |
ART_AND_DESIGN | 4.36 | 4.73 | paid app |
VIDEO_PLAYERS | 4.06 | 4.1 | paid app |
FINANCE | 4.14 | 3.83 | free app |
MAPS_AND_NAVIGATION | 4.06 | 3.86 | free app |
PARENTING | 4.34 | 3.35 | free app |
Suppose you're a database administrator. Your databases have been hacked, and the hackers are changing the prices of certain apps. It is taking the IT team a long time to neutralize the hack. As a responsible manager, you don't want to lose track of the original data, and since you can't stop the hackers from making changes, you need a measure that records every change in price.
Now this is an interesting problem and also an industry-relevant question. You have to create a trigger.
Trigger Definition: A trigger is a database object that automatically executes a specified action in response to certain events on a particular table or view. Triggers are useful for maintaining data integrity, enforcing business rules, and recording changes to data for auditing purposes.
INTUITION
APPROACH
We create a logging table and then define an AFTER UPDATE trigger on the play table. This trigger will record the relevant details into the logging table. Here is the approach with code snippets:
Create the logging table:
-- This table will store the details of each price change.
CREATE TABLE PriceChangeLog (
App VARCHAR(255),
Old_Price DECIMAL(10, 2),
New_Price DECIMAL(10, 2),
Operation_Type VARCHAR(10),
Operation_Date TIMESTAMP
);
Create a copy of the playstore table:
-- This step creates a working table (play) from the existing
-- playstore table.
CREATE TABLE play AS
SELECT * FROM playstore;
-- Create the trigger:
DELIMITER //
CREATE TRIGGER price_change_update
AFTER UPDATE ON play
FOR EACH ROW
BEGIN
INSERT INTO PriceChangeLog (App, Old_Price, New_Price, Operation_Type, Operation_Date)
VALUES (NEW.App, OLD.Price, NEW.Price, 'update', CURRENT_TIMESTAMP);
END;
//
DELIMITER ;
This trigger is set to activate after any update operation on the play table. It logs the app name, old price, new price, operation type (update), and the current timestamp into the PriceChangeLog table.
Use this snippet to check the effect of the trigger:
SET SQL_SAFE_UPDATES = 0; -- disable safe-update mode so UPDATEs that filter on a non-key column are allowed
UPDATE play
SET price = 4
WHERE app = 'Infinite Painter';
UPDATE play
SET price = 5
WHERE app = 'Sketch - Draw & Paint';
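SELECT * FROM play WHERE app = 'Sketch - Draw & Paint'; -- shows the updated price
SELECT * FROM PriceChangeLog; -- shows the rows written by the trigger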
Your IT team has neutralized the threat. However, the hackers made some changes to the prices, and because of your measure those changes were recorded. Now you want the correct data restored in the database.
INTUITION
Since every price change was recorded in the PriceChangeLog table, you can use this information to update the play table and revert the prices back to their correct values. This ensures data integrity and accuracy in your app pricing.
APPROACH
We update the play table by joining it with PriceChangeLog based on the app names. This update reverts the prices back to the values recorded before the hacking incident. Finally, we verify the correct restoration of prices by checking a specific app's data in the play table. This ensures the database reflects accurate app prices following the security breach.
Drop the trigger and update the prices in the play table:
DROP TRIGGER price_change_update;
UPDATE play AS p1
INNER JOIN pricechangelog AS p2 ON p1.app = p2.app
SET p1.price = p2.old_price;
SELECT * FROM play WHERE app='Sketch - Draw & Paint'; -- To verify
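One case worth checking is an app whose price was changed more than once. The log then holds several rows for that app, and it is not obvious which old_price a plain join will pick, while only the earliest one is the true original. The sketch below shows one possible way to handle this, in Python with SQLite standing in for MySQL. The log_id column, the sample prices, and the correlated-subquery form of the restore are assumptions added for illustration, not part of the original tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE play (app TEXT, price REAL);
    INSERT INTO play VALUES ('Infinite Painter', 2.5), ('Sketch - Draw & Paint', 1.0);

    -- log_id is an assumed extra column that records the order of changes.
    CREATE TABLE PriceChangeLog (
        log_id INTEGER PRIMARY KEY AUTOINCREMENT,
        app TEXT, old_price REAL, new_price REAL
    );

    CREATE TRIGGER price_change_update
    AFTER UPDATE OF price ON play
    FOR EACH ROW
    BEGIN
        INSERT INTO PriceChangeLog (app, old_price, new_price)
        VALUES (NEW.app, OLD.price, NEW.price);
    END;
""")

# Simulated tampering: one app is changed twice, the other once.
conn.execute("UPDATE play SET price = 4 WHERE app = 'Infinite Painter'")
conn.execute("UPDATE play SET price = 9 WHERE app = 'Infinite Painter'")
conn.execute("UPDATE play SET price = 5 WHERE app = 'Sketch - Draw & Paint'")
log_rows = conn.execute("SELECT COUNT(*) FROM PriceChangeLog").fetchone()[0]

# Stop logging, then restore each app from its earliest logged old_price.
conn.executescript("""
    DROP TRIGGER price_change_update;
    UPDATE play
    SET price = (
        SELECT l.old_price FROM PriceChangeLog AS l
        WHERE l.app = play.app
        ORDER BY l.log_id
        LIMIT 1
    )
    WHERE app IN (SELECT app FROM PriceChangeLog);
""")
restored = conn.execute("SELECT app, price FROM play ORDER BY app").fetchall()
print(log_rows, restored)
```

In MySQL, the Operation_Date column from the logging table could play the role of log_id when ordering the log rows, provided the timestamps are fine-grained enough to tell consecutive changes apart.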
As a data person, you are assigned the task of investigating the correlation between two numeric factors: app ratings and the number of reviews.
INTUITION
We compute the correlation coefficient between ratings and reviews in the playstore dataset. A positive coefficient suggests that higher ratings typically correlate with more reviews, indicating strong user engagement. Conversely, a negative coefficient would imply the opposite relationship. This analysis provides valuable insights into how user perception (ratings) aligns with user activity (reviews), essential for strategic decisions in app development and marketing.
APPROACH
To compute the correlation in the playstore dataset, we begin by calculating the average rating (@x) and average number of reviews (@y). Using these averages, we compute deviations from the mean for both ratings and reviews, along with their squared values, within a common table expression (t). This prepares the necessary components for calculating the correlation coefficient: the sum of products of these deviations (@numerator) and the sums of their squares (@deno_1 and @deno_2). Finally, we compute the correlation coefficient itself by dividing @numerator by the square root of the product of @deno_1 and @deno_2, providing a quantitative measure of the relationship between app ratings and reviews.
-- Calculate average rating and average reviews
SET @x = (SELECT ROUND(AVG(rating), 2) FROM playstore);
SET @y = (SELECT ROUND(AVG(reviews), 2) FROM playstore);
-- Create a temporary table to compute deviations and their squares
with t as
(
select *,
round((rating - @x), 2) as 'rat',
round((reviews - @y), 2) as 'rev',
round((rating - @x) * (rating - @x), 2) as 'sqr_x',
round((reviews - @y) * (reviews - @y), 2) as 'sqr_y'
from playstore
)
-- Calculate numerator and denominators for correlation coefficient
select
@numerator := round(sum(rat * rev), 2),
@deno_1 := round(sum(sqr_x), 2),
@deno_2 := round(sum(sqr_y), 2)
from t;
-- Calculate correlation coefficient
select round((@numerator) / (sqrt(@deno_1 * @deno_2)), 2) as corr_coeff;
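To sanity-check the formula outside the database, the same steps can be sketched in a few lines of Python. The (rating, reviews) pairs below are made up for illustration, and the function name pearson is an assumption. Unlike the SQL above, this sketch keeps full precision and rounds only when printing:

```python
import math

# Hypothetical (rating, reviews) pairs for illustration only.
pairs = [(4.0, 100), (4.5, 300), (3.5, 50), (5.0, 400)]


def pearson(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations from the means (the numerator).
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Sums of squared deviations (the two denominator components).
    deno_1 = sum((x - mean_x) ** 2 for x, _ in points)
    deno_2 = sum((y - mean_y) ** 2 for _, y in points)
    return numerator / math.sqrt(deno_1 * deno_2)


corr_coeff = pearson(pairs)
print(round(corr_coeff, 2))
```

Rounding the means and deviations at every step, as the SQL version does, can shift the final coefficient slightly, so comparing both results on the real table is a quick way to see whether that matters here.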
Your boss noticed that some rows in the genres column contain multiple genres, which was creating issues when developing the recommender system from the data. He/she assigned you the task of cleaning the genres column and splitting it into two genre columns; rows that have only one genre will have the second column blank.
INTUITION
The task is to clean the genres column in the dataset to facilitate the development of a recommender system. Many rows contain multiple genres separated by semicolons, which complicates the analysis. To address this, we need to split the genres column into two separate columns: one for the primary genre and another for the secondary genre. Rows that originally had only one genre will have the secondary genre column as blank. This cleanup ensures that each app's genre information is structured consistently, which is crucial for accurate recommendations in the system.
APPROACH
We create two user-defined functions, f_name and l_name. The f_name function extracts the first genre from the genres column, handling cases where multiple genres are separated by semicolons. It identifies the position of the semicolon and retrieves the substring before it. The l_name function extracts the second genre, returning an empty string if there's only one genre present. By applying these functions in a query, we can transform the genres column into two separate columns (gene 1 and gene 2), ensuring each row is structured correctly for the recommender system's needs.
Function to extract the first genre by taking the text to the left of ';':
DELIMITER //
CREATE FUNCTION f_name(a VARCHAR(100))
RETURNS VARCHAR(100)
DETERMINISTIC
BEGIN
DECLARE l INT;
DECLARE s VARCHAR(100);
SET l = LOCATE(';', a);
SET s = IF(l > 0, LEFT(a, l - 1), a);
RETURN s;
END//
DELIMITER ;
Function to get right of ';'
-- Function to extract the second genre
DELIMITER //
CREATE FUNCTION l_name(a VARCHAR(100))
RETURNS VARCHAR(100)
DETERMINISTIC
BEGIN
DECLARE l INT;
DECLARE s VARCHAR(100);
SET l = LOCATE(';', a);
SET s = IF(l = 0, '', SUBSTRING(a, l + 1, LENGTH(a)));
RETURN s;
END//
DELIMITER ;
-- Query to transform genres column
SELECT app, genres, f_name(genres) AS 'gene 1', l_name(genres) AS 'gene 2'
FROM playstore;
Finally, calling f_name and l_name in this query gives us the two cleaned genre columns.
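The same split logic can be sketched in Python to see what the two functions should return for each kind of row. The function name split_genres and the sample strings are assumptions for illustration:

```python
def split_genres(genres):
    # partition returns (text before ';', ';', text after ';'),
    # or (text, '', '') when there is no ';' in the string.
    first, _, second = genres.partition(";")
    return first, second


print(split_genres("Art & Design;Creativity"))
print(split_genres("Tools"))
```

Like the SQL functions, this splits on the first semicolon only, so a row with three genres would keep the last two together in the second column.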
Your senior manager wants to know which apps are not performing up to par in their particular category. However, he/she is not interested in handling a separate file or list for every category, and has assigned you the task of creating a dynamic tool: he/she inputs a category of interest, and the tool provides real-time feedback by displaying the apps within that category whose ratings are lower than the average rating for that category.
INTUITION
APPROACH
A stored procedure named checking is created. The procedure takes a category (cate) as input and calculates the average rating (@c) for that category from the playstore dataset. It uses a subquery to compute the average ratings grouped by category. Once the average rating for the specified category is determined, the procedure then selects and displays all apps within that category (cate) where the rating is lower than the computed average (@c). This approach provides a direct, real-time feedback mechanism for evaluating app performance relative to category norms, supporting informed managerial decisions.
Stored Procedure to check underperforming apps in a specific category
DELIMITER //
CREATE PROCEDURE checking(IN cate VARCHAR(30))
BEGIN
-- Calculate the average rating for the specified category
SET @c = (
SELECT average
FROM (
SELECT category, ROUND(AVG(rating), 2) AS average
FROM playstore
GROUP BY category
) m
WHERE category = cate
);
-- Select apps within the specified category that have ratings lower than the category average
SELECT *
FROM playstore
WHERE category = cate
AND rating < @c;
END//
DELIMITER ;
Call the stored procedure with a specific category ('business')
CALL checking('business');
This query will give underperforming apps in business category
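The same check can also be written as a single parameterized query, without a session variable. Here is a minimal sketch in Python with SQLite standing in for MySQL; the rows and the function name underperforming are made up for illustration:

```python
import sqlite3

# Hypothetical rows for illustration only.
rows = [
    ("App A", "BUSINESS", 4.5),
    ("App B", "BUSINESS", 3.5),
    ("App C", "BUSINESS", 4.0),
    ("App D", "GAME", 2.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE playstore (app TEXT, category TEXT, rating REAL)")
conn.executemany("INSERT INTO playstore VALUES (?, ?, ?)", rows)


def underperforming(category):
    # The subquery computes the average for the requested category only.
    sql = """
        SELECT app, rating
        FROM playstore
        WHERE category = :cate
          AND rating < (SELECT AVG(rating) FROM playstore WHERE category = :cate)
        ORDER BY rating
    """
    return conn.execute(sql, {"cate": category}).fetchall()


print(underperforming("BUSINESS"))
```

This sketch compares against the unrounded average, whereas the procedure above rounds it to two decimals first, so apps sitting very close to the average may be classified differently. Note also that SQLite compares text case-sensitively by default, which is why the category is passed in upper case here.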
Bonus Theory Question
What are duration time and fetch time?
Duration Time: Duration time is how long the server takes to execute the query, from receiving the statement until the result is ready.
Fetch Time: Once the query has been executed, fetch time is the time it takes to transfer the results back to the client. It depends mainly on how much data is returned and how quickly it can be delivered, not on how complex the query is.
For example, if a query is simple but needs to display a large volume of data, the fetch time will likely be longer as the system processes and retrieves extensive records. Conversely, if the query is complex with multiple criteria or parameters, the duration time may be extended as the system comprehensively processes the intricacies of the request before initiating the fetch process.
Wrapping Up: Exploring SQL Concepts and Looking Ahead!
That concludes this blog post on these concepts, which can be quite intricate. If you have any suggestions or questions, feel free to let me know. Don't forget to subscribe for updates on the next blog, where we'll dive into data cleaning using SQL. Goodbye!