Leetcode questions analysis
Insights and topic prediction
Code
About Dataset
Leetcode is very popular among programmers and offers many quality questions on a wide range of topics. For a content provider, it is crucial to understand how question quality relates to difficulty level, which topics are most and least popular among programmers, and which questions are most liked or disliked. This helps reveal the overall trend, a pathway for skill improvement, and which topics need more attention than others. Each question also lists the titles and text of similar questions, which can be used to suggest similar questions.
The dataset is available on Kaggle.
The entire project is divided into three major tasks:
- Data Scraping
- Data Analysis
- Topic prediction
Data Scraping
The data is scraped from this webpage using this file.
Disclaimer: The purpose of the data scraping is solely for generating insights.
Data Analysis
At the time of working on this project, the Leetcode website offered a total of 2239 questions spanning 72 distinct topics.
Among these, the ‘Medium’ difficulty level emerges as the category with the highest number of questions, coming in at just under 1200.
The graph visually represents the distribution of questions by topic, showcasing the total number of questions for each topic. The predominant topics with the highest number of questions include ‘Array’, ‘String’, ‘Hash Table’, ‘Dynamic Programming’, and ‘Math’.
Looking at difficulty level and question count per topic together, the ‘Medium’ category accounts for a significant number of questions across various topics.
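The counts behind these two charts can be sketched as follows. This is a minimal illustration, not the project's notebook code: the record layout (`difficulty`, `topics`) and the toy records are assumptions made for the example, and the real dataset's column names may differ.

```python
from collections import Counter

# Hypothetical toy records for illustration only; not taken from the dataset.
questions = [
    {"title": "Q1", "difficulty": "Easy", "topics": ["Array", "Hash Table"]},
    {"title": "Q2", "difficulty": "Medium", "topics": ["String", "Dynamic Programming"]},
    {"title": "Q3", "difficulty": "Medium", "topics": ["Array", "Math"]},
    {"title": "Q4", "difficulty": "Hard", "topics": ["Array", "Dynamic Programming"]},
]


def count_by_topic(records):
    """Count questions per topic; a question with several topics counts once for each."""
    counts = Counter()
    for record in records:
        counts.update(record["topics"])
    return counts


def count_by_topic_and_difficulty(records):
    """Count questions per (topic, difficulty) pair."""
    counts = Counter()
    for record in records:
        for topic in record["topics"]:
            counts[(topic, record["difficulty"])] += 1
    return counts


topic_counts = count_by_topic(questions)
topic_difficulty_counts = count_by_topic_and_difficulty(questions)
print(topic_counts.most_common(3))
```

Because a question may carry several topics, the per-topic totals can add up to more than the number of questions.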
We can infer that problem solvers found ‘Shell’ questions to be notably challenging and strenuous to solve, while a majority of coders were able to solve ‘Database’ questions with relative ease.
The four most popular topics on Leetcode among problem solvers are:
- Array
- String
- Hash Table
- Dynamic Programming
Topic prediction
To predict the topic from the question and description text, the data is first processed: text cleaning, word frequency analysis, and record preparation.
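The cleaning and word-frequency steps might look roughly like the sketch below. The stopword list, the HTML-stripping rule, and the function names `clean_text` and `word_frequencies` are assumptions for illustration; the project's notebooks may clean the text differently.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "that", "such"}


def clean_text(text):
    """Strip HTML tags, lowercase, drop non-letters, and remove stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags left over from scraping
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return [token for token in text.split() if token not in STOPWORDS]


def word_frequencies(texts):
    """Count cleaned tokens across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(clean_text(text))
    return counts


tokens = clean_text("<p>Given an Array of integers, return the SUM.</p>")
print(tokens)
```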
The data is then divided into training, validation, and test sets. Topics are predicted with either a simple baseline such as logistic regression or an advanced model such as BERT.
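A three-way split can be sketched with the standard library alone. The 80/10/10 proportions, the fixed seed, and the helper name `train_val_test_split` are assumptions for the example; the source does not state the ratios actually used.

```python
import random


def train_val_test_split(records, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle records reproducibly and split them into train, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Keeping the validation set separate from the test set lets model choices be tuned without touching the data used for the final score.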
Results
Method | F1-score |
---|---|
Logistic Regression | 0.55 |
BERT | 0.88 |
Both notebooks are available here
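For readers unfamiliar with the metric in the results table, the sketch below computes a per-label F1-score and its macro average. The source does not say which averaging was used, so macro averaging here is an assumption, and the labels in the example are made up.

```python
def f1_score_for_label(y_true, y_pred, label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_score_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up labels for illustration only.
y_true = ["Array", "Array", "String", "Math"]
y_pred = ["Array", "String", "String", "Math"]
print(round(macro_f1(y_true, y_pred), 4))
```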