A chocolate company is looking to uncover insights into the factors that most strongly influence chocolate ratings. The goal is to conduct an exploratory analysis of their data to identify key characteristics that consumers prioritize when rating chocolate products.
Investigate the impact of cocoa percentage on chocolate ratings.
Analyze the influence of the origin of cocoa beans on customer preferences.
Explore the relationship between the company's geographic location and product ratings.
Identify potential patterns or trends in the data to guide product improvement and market positioning.
Assess the interactions between variables to determine if certain combinations yield consistently higher ratings.
Python Jupyter Notebook
Libraries: Pandas, NumPy, Matplotlib, Sklearn, Folium, Seaborn
Tableau
Data Preparation & Cleaning: Loading, profiling, and cleaning data for analysis.
Data Exploration & Analysis: Filtering, grouping, aggregating, and wrangling data, while deriving new variables.
Visualization: Creating histograms, bar charts, line charts, geographic data mapping and scatter plots.
Reporting: Summarizing and reporting insights in Tableau.
Chocolate Bar Ratings
Tatman: https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings/data
As part of the analysis, we'll investigate the key factors that most strongly influence chocolate ratings, focusing on variables such as cocoa percentage, origin of the beans, and the company's geographic location. By examining these factors, we aim to identify which characteristics consumers prioritize when rating chocolate products. This will allow us to uncover potential patterns or trends in the data, enabling the client to make data-driven decisions for product improvement and market positioning. Additionally, we will assess interactions between these factors to determine if certain combinations yield consistently higher ratings.
I wanted to explore the relationship between cocoa percentage and chocolate rating to determine whether higher cocoa content correlates with higher consumer satisfaction. Using a histogram, I split the data into three subgroups that fit the distribution well: low cocoa content (below 65%), moderate cocoa content (65%–79%), and high cocoa content (80% and above). The results showed that most people in the survey preferred chocolate with a moderate amount of cocoa.
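The grouping described above can be sketched with `pandas.cut`. This is a minimal illustration, not the notebook's actual code: the column names `cocoa_percent` and `rating` are assumed, the rows are made up, and the low group is taken to mean anything below 65%.

```python
import pandas as pd

# Illustrative rows only, not taken from the dataset; column names are assumed.
df = pd.DataFrame({
    "cocoa_percent": [55.0, 64.0, 65.0, 70.0, 72.0, 79.0, 80.0, 100.0],
    "rating": [3.0, 3.25, 3.5, 3.75, 3.5, 3.0, 2.75, 2.0],
})

# right=False makes each bin include its left edge:
# [0, 65) -> low, [65, 80) -> moderate, [80, 101) -> high.
bins = [0, 65, 80, 101]
labels = ["low", "moderate", "high"]
df["cocoa_group"] = pd.cut(df["cocoa_percent"], bins=bins, labels=labels, right=False)

# Number of bars and average rating per cocoa group.
summary = df.groupby("cocoa_group", observed=False)["rating"].agg(["count", "mean"])
print(summary)
```

Reporting the count next to the mean makes it easier to see whether a group's average rests on enough bars to be meaningful.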
Using a regression line in a scatterplot, I observed that lower cocoa percentages tend to be associated with higher ratings. However, the relationship isn’t purely linear (i.e., it doesn’t follow a single, consistent trend). The correlation heatmap I created shows a weak negative correlation, so although a relationship exists, it is not strong enough to support a solid hypothesis.
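As a rough sketch of the numbers behind a regression line and a correlation heatmap, the Pearson coefficient and a least-squares line can be computed directly. The column names and the rows are assumptions made for illustration, so the values produced here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Illustrative rows only, not taken from the dataset; column names are assumed.
df = pd.DataFrame({
    "cocoa_percent": [60.0, 65.0, 70.0, 72.0, 75.0, 80.0, 85.0, 100.0],
    "rating": [3.5, 3.25, 3.75, 3.5, 3.0, 3.25, 2.75, 2.0],
})

# Pearson correlation (the default method) between cocoa percentage and rating.
r = df["cocoa_percent"].corr(df["rating"])

# Degree-1 least-squares fit: rating is approximated by slope * cocoa_percent + intercept.
slope, intercept = np.polyfit(df["cocoa_percent"], df["rating"], deg=1)

print(f"Pearson r: {r:.2f}")
print(f"Fitted line: rating = {slope:.3f} * cocoa_percent + {intercept:.2f}")
```

For the plots themselves, Seaborn's `regplot` draws the scatterplot with a fitted line, and `heatmap` applied to `df.corr()` gives the correlation matrix view.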
After reviewing expert findings on the previous investigation, I shifted my focus to the relationship between country and chocolate ratings. My approach was twofold: first, I analyzed how the company's location influences the rating; second, I examined how the rarity of companies from each country (how few companies a country has) relates to their ratings. The aim was to uncover potential geographic advantages in chocolate production. However, this yielded minimal findings, suggesting no strong correlation between these factors and ratings.
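One way to set up this two-part comparison is a single grouped aggregation per country. The sketch below uses made-up rows, placeholder country names, and assumed column names (`company`, `company_location`, `rating`); counting distinct companies per country is just one possible reading of "rarity".

```python
import pandas as pd

# Illustrative rows only; names and values are placeholders, not dataset entries.
df = pd.DataFrame({
    "company": ["A", "A", "B", "C", "D", "E"],
    "company_location": ["Country X", "Country X", "Country X",
                         "Country Y", "Country Y", "Country Z"],
    "rating": [3.5, 3.0, 3.25, 3.0, 2.75, 3.75],
})

# Per country: distinct companies, number of reviews, and average rating.
by_country = (
    df.groupby("company_location")
      .agg(
          n_companies=("company", "nunique"),
          n_reviews=("rating", "size"),
          mean_rating=("rating", "mean"),
      )
      .sort_values("mean_rating", ascending=False)
)

# Does having fewer companies go along with higher or lower average ratings?
rarity_corr = by_country["n_companies"].corr(by_country["mean_rating"])

print(by_country)
print(f"Correlation between company count and mean rating: {rarity_corr:.2f}")
```

Keeping `n_reviews` in the table helps flag countries whose average comes from only a handful of bars.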
Intriguingly, one standout was Chile, which had the highest-rated chocolate among all the countries. I then investigated which beans Chilean companies were using and found that they sourced their beans primarily from Peru. However, this bean variety did not seem to correlate with higher ratings, as other chocolates using the same beans did not achieve similarly high scores.
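The drill-down described here (find the top location's main bean origin, then check how the same beans score elsewhere) could look roughly like this. Column names, location labels, and ratings are all placeholders chosen for illustration.

```python
import pandas as pd

# Illustrative rows only; "Z" stands in for the top-rated location and
# "P" for its most common bean origin. None of this is dataset content.
df = pd.DataFrame({
    "company_location": ["Z", "Z", "X", "X", "Y", "Y"],
    "bean_origin": ["P", "P", "P", "Q", "P", "Q"],
    "rating": [3.75, 3.75, 3.0, 3.25, 3.25, 3.0],
})

top_location = "Z"

# Most common bean origin among bars made in the top-rated location.
top_rows = df[df["company_location"] == top_location]
main_origin = top_rows["bean_origin"].value_counts().idxmax()

# Compare ratings for that bean origin inside and outside the top location.
same_origin = df[df["bean_origin"] == main_origin]
is_top = same_origin["company_location"] == top_location
rating_inside = same_origin.loc[is_top, "rating"].mean()
rating_outside = same_origin.loc[~is_top, "rating"].mean()

print(main_origin, rating_inside, rating_outside)
```

If bars from other locations using the same origin score clearly lower, the bean alone is unlikely to explain the top location's ratings.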
The data suggest a weak link between how rare a cacao bean is and how few reviews exist for chocolates made from it. However, the challenge lies in determining whether these beans are genuinely rare or simply under-reviewed compared to others.
At this point, none of the hypotheses could be supported, because no strong correlation was found between the variables. The only conclusion that is truly apparent is that people generally like chocolate regardless of the bean type, production location, or cocoa percentage.
I believe the small and uneven number of reviews per chocolate is a significant factor that makes it difficult to find a clear correlation. Averages based on only a few reviews are easily skewed, which can be observed in the bar graph below. Notice that the beans with fewer entries tend to sit at the extremes of the average ratings, while those with more reviews cluster around the middle. A similar phenomenon can be seen in a simple coin toss: although the probability of landing heads or tails is 50%, observing a ratio close to 50/50 requires many flips. The more times the coin is flipped, the more closely the ratio reflects the expected 50/50 outcome.
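The coin-toss analogy can be checked with a short simulation. This is an illustrative sketch using NumPy's random generator; the flip counts, number of trials, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def heads_share_spread(n_flips, n_trials=2000):
    """Standard deviation of the observed share of heads across many trials."""
    # Each trial counts the heads in n_flips fair coin flips.
    heads = rng.binomial(n=n_flips, p=0.5, size=n_trials)
    return (heads / n_flips).std()

# Fewer flips per trial means the observed share strays further from 0.5.
spread = {n: heads_share_spread(n) for n in (5, 50, 500)}
for n, s in spread.items():
    print(f"{n:>3} flips per trial: spread of heads share = {s:.3f}")
```

The same effect applies to average ratings: a bean type with a handful of reviews can land far from the overall mean purely by chance, while one with many reviews will tend to sit closer to it.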
My recommendation for next steps is to gather more data, particularly on chocolates that received fewer ratings in this survey, and to take a random sample of those with higher ratings to gain a clearer view of emerging trends.