Evaluating customer emotions in coffee shop reviews using machine learning

Sentiment analysis is a crucial technique for understanding and classifying the emotions expressed in written content. With the widespread use of social media, these platforms have become an essential channel for gathering customer feedback in several industries, including the café sector. In response to the growing online presence of coffee establishments, this research introduces a method that uses the Soft K-means clustering algorithm to automatically evaluate sentiment. The approach is used to categorize customer reviews from a popular coffee shop review website. The sentiment analysis of coffee shop reviews employs several machine learning methods, including Naïve Bayes, Gradient Boosting Machines (GBM), K-Nearest Neighbor, Support Vector Machines, Logistic Regression, and Random Forest. Furthermore, this paper proposes an ensemble learning technique that integrates these six classifiers to improve prediction accuracy. The efficacy of these approaches in capturing the opinions of coffee shop patrons is thoroughly compared.


Introduction
In today's world of digital technology, individuals often choose to interact and communicate with others online, regularly sharing their experiences and opinions on various service platforms [1]. Coffee shops have become a key focal point on review websites, where consumers express their thoughts about their visits. Harnessing these huge volumes of review data can be highly beneficial for coffee shop owners and industry analysts seeking to improve profitability and customer satisfaction. Reviews may be subjective, expressing personal opinions, feelings, and viewpoints, or objective, stating verifiable facts and observable data. A successful approach to sentiment analysis therefore distinguishes between subjective and objective statements, and then analyzes the emotional tone of the subjective content to determine whether the feelings conveyed are positive or negative.
To address this problem, we first use the Soft K-means clustering approach to identify subjectivity. This algorithm categorizes reviews as either subjective or objective, enabling us to discard the objective statements and focus on the informative subjective ones. For sentiment classification, this research employs a range of machine learning classifiers, including Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Random Forest, Logistic Regression, and Gradient Boosting Machines (GBM). Furthermore, an ensemble learning model is implemented to combine the strengths of these individual classifiers and enhance the accuracy and reliability of the sentiment assessments.

Literature Review
Sentiment analysis seeks to determine the polarity of language, namely whether the content is favorable or adverse. The volume of tourism-related information accessible on the web has experienced abrupt and substantial growth [2].
Sentiment analysis offers a chance to significantly enhance decision-making within the tourist sector by providing a deeper comprehension of the overall visitor experience [3].
Various machine learning methods are employed to categorize the sentiments conveyed in coffee house reviews. Machine learning approaches can be categorized into three basic groups: supervised, semi-supervised, and unsupervised methods. Sourav, Sinha, and their colleagues conducted a study examining the application of data mining and sentiment evaluation to consumer reviews, with a special emphasis on assessments of coffee shops [4]. They examined the use of machine learning algorithms to evaluate customer sentiment from these reviews, aiming to enhance rating predictions and decision-making by analyzing the sentiment conveyed in the evaluations.
In their study, Neha Nandal et al. [5] employed Support Vector Machines (SVM) to examine customer reviews obtained from Amazon. The researchers detected aspect phrases within the reviews and assigned polarity ratings to measure sentiment. Of the SVM kernels examined (Linear, Polynomial, and Radial Basis Function), the Radial Basis Function kernel produced the best results, suggesting its efficacy in capturing intricate connections between aspect phrases and emotion. Their research highlights the value of SVM with the RBF kernel for evaluating the sentiment of customer reviews.
Vishal S. and his colleagues conducted sentence-level sentiment evaluation [6]. Their main focus was to detect instances of negation in internet news articles. The machine learning methods used were Support Vector Machine and Naïve Bayes; the Support Vector Machine attained a precision of 96.46%, while Naïve Bayes achieved 94.16%. Sonal et al. present a new method for improving consumer decision-making in restaurant selection by using an automated sentiment summarization system designed specifically for restaurant reviews [7]. The system uses machine learning methods to assess the quality of restaurants by extracting the emotions expressed in reviews. The proposed approach offers a comprehensive framework for analyzing sentiment in restaurant reviews, making it a beneficial tool for busy consumers who need to quickly assess a restaurant's quality and make informed choices. By offering summaries of the sentiments expressed in reviews, the framework simplifies review analysis, improves decision-making, and saves consumers significant time.
Rehab M. Duwairi et al. [8] used machine learning techniques to perform sentiment analysis on Arabic reviews. Three classifiers were employed: Naïve Bayes, Support Vector Machine, and K-Nearest Neighbour. The test results showed that the Support Vector Machine had the highest accuracy, while the K-Nearest Neighbour approach had the highest recall. Anjuman Prabhat et al. [9] performed sentiment classification on Twitter ratings. The reviews were categorized with Naïve Bayes and logistic regression, and the results revealed that logistic regression surpassed Naïve Bayes in both accuracy and precision. Bhavitha et al. [10] carried out a comparative assessment of Random Forest and Support Vector Machine methods for sentiment evaluation. The Random Forest method exhibited better accuracy in sentiment analysis of hotel reviews; however, it requires greater computational resources and a longer training period.
Minh-Hao Nguyen et al. investigate the application of machine learning methods to assess consumer emotions, particularly the detection and classification of the various emotions expressed in reviews. They argue that such analysis can provide significant benefits to companies by enabling them to evaluate services and optimize customer satisfaction strategies effectively [11]. The paper introduces a model for recognizing consumer emotions and a data mining method that uses a substantial dataset of 80,593 online reviews gathered from agoda.com and booking.com. The technique is specifically designed to extract significant insights from the huge amount of client feedback on these sites. Their objective is to give enterprises a deeper understanding of customer sentiment and help them refine their products and services, ultimately improving overall customer satisfaction [12].
Chaturvedi et al. [13] employed Bayesian networks and fuzzy recurrent neural networks to detect subjectivity. Bayesian networks are used to represent relationships in high-dimensional data, while the fuzzy recurrent models capture temporal attributes. The authors claim that their proposed method can effectively tackle the issues associated with traditional subjectivity detection, and they have demonstrated its applicability to different languages.

Methodology
The paper presents a detailed methodology for sentiment analysis of coffee shop reviews, based on a thorough examination of consumer feedback obtained from a prominent coffee shop website. The procedure begins with data collection: a large dataset of labeled coffee shop reviews is gathered from the Kaggle website. Data preprocessing follows, cleansing the text by eliminating noise such as typographical errors and extraneous information. Afterward, the Fuzzy C-Means clustering method is used to identify and remove objective statements, focusing the study on subjective content. Word2Vec is employed to extract features and produce embeddings that capture the semantic relationships between words. The overall polarity of the comments is classified using several machine learning methods: Naïve Bayes, K-Nearest Neighbour, Support Vector Machine, Logistic Regression, Random Forest, and Gradient Boosting Machines (GBM). An ensemble approach that integrates these classifiers aims to enhance prediction accuracy by leveraging their collective strengths, providing a robust framework for understanding consumer sentiment in reviews.

Dataset
The original dataset consists of 1,246 coffee reviews that were labeled and acquired from the Kaggle website. The dataset consists exclusively of reviews of a single coffee business. The data underwent preprocessing to clean and prepare it for sentiment analysis (Fig. 1). Text preprocessing entails removing superfluous content from the written material [15]. Online sources often contain a large amount of unnecessary content, such as spelling errors, poor grammar, web addresses, common filler phrases, and statements that do not contribute to the overall meaning. The primary objective of data preprocessing is to minimize the presence of superfluous or undesirable data in the text, thereby improving the efficiency of the algorithm and speeding up classification.

Data Preprocessing
One task during the preprocessing phase is eliminating stop words. Removing stop words such as "a", "this", "that", and "on" is beneficial because it reduces unnecessary processing. Removing punctuation was another crucial stage in data cleansing. More precisely, certain punctuation marks, such as ".", ",", and "?", carry meaning and should be preserved, while others should be eliminated. Furthermore, to eliminate ambiguity in word meanings, it is crucial to perform an apostrophe check and expand contractions into regular text. For example, the expression "it's an excellent place" is transformed into "it is an excellent place". Standardizing terminology is also crucial for data cleansing when phrases do not appear in the correct format. Word embedding was then used to extract features and represent the words in each sentence of a review. Word embedding is a technique that maps words or phrases to vectors of numerical values [16]. The Word2vec methodology [17] was employed to obtain features.
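The cleaning steps described above can be sketched in Python. The stop word list, contraction table, and function name below are illustrative assumptions, not the paper's actual implementation:

```python
import re

# Illustrative (assumed) stop word list and contraction table
STOP_WORDS = {"a", "an", "the", "this", "that", "on", "is", "it"}
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}

def preprocess(text):
    """Lowercase, expand apostrophes, strip unwanted punctuation, drop stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep ".", ",", and "?" as the text suggests; replace other punctuation with spaces
    text = re.sub(r"[^\w\s.,?]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("It's an excellent place!"))  # -> "excellent place"
```

In a real pipeline the stop word list would typically come from a library such as NLTK and the contraction table would be far larger.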
The word2vec model was trained on a corpus encompassing all the terms present in the dataset. Word2Vec is a widely used technique that employs shallow neural networks. Word2vec can infer deep semantic connections among words [17]. It computes continuous vector representations of words. The computed word vectors retain a substantial amount of the syntactic and semantic patterns found in the language [18] and translate these into positional relationships in the resulting vector space. Each word is associated with 100 extracted features.
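The construction of a fixed-length review representation from 100-dimensional per-word vectors can be illustrated as follows. The random embeddings here are a toy stand-in for trained Word2Vec vectors, assumed purely for this sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in for trained Word2Vec vectors: 100 features per word (assumed vocabulary)
embeddings = {w: rng.normal(size=100) for w in ["great", "coffee", "cozy", "place"]}

def review_vector(tokens, embeddings, dim=100):
    """Sum the word vectors of a review's tokens; out-of-vocabulary words are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.sum(vecs, axis=0)

v = review_vector(["great", "coffee", "unseen"], embeddings)
print(v.shape)  # (100,)
```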

Figure 2. Removal of stop words
Each review consists of multiple sentences. The analysis breaks each sentence down into its words and sums the corresponding word vectors to form a single feature vector. Before applying subjectivity identification, it was crucial to standardize these features. The Fuzzy C-means clustering method was then employed to detect subjectivity and categorize the data into positive, negative, and objective groups. Fuzzy C-means is analogous to K-means, except that it performs soft clustering instead of hard clustering. Soft clustering allows each data point to be assigned to several clusters, with a membership function determining the degree of association: observations close to the center of a cluster show a strong association with that cluster, while data points far from the center have a weak association. After the clustering procedure, statements classified as objective are removed, and only sentences categorized as either positive or negative are retained for further classification.
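A minimal NumPy sketch of the Fuzzy C-means update rules may clarify the soft-clustering idea (this is our illustration, not the paper's code): memberships per point sum to 1, and the fuzzifier m controls how soft the assignments are.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-means: returns cluster centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)              # each row sums to 1
    for _ in range(max_iter):
        Um = U ** m                                # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)                # avoid division by zero
        inv = dist ** (-2.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)   # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U
```

Library implementations (e.g. scikit-fuzzy's `cmeans`) would normally be preferred over a hand-rolled loop.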
The sentiment classification system was constructed using six distinct machine learning methods: Naïve Bayes, K-Nearest Neighbour, Support Vector Machine, Logistic Regression, Gradient Boosting Machines (GBM), and Random Forest. Ensemble learning was employed to improve prediction accuracy by combining the results of the six classifiers.
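The combination step can be illustrated with a simple majority vote over per-classifier predictions. The prediction values below are made up for illustration, and the paper's ensemble may use a different combination rule:

```python
import numpy as np

# Hypothetical predictions (1 = positive, 0 = negative) from the six classifiers
# for five reviews; rows: NB, KNN, SVM, LR, RF, GBM
preds = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
])

# Majority vote per review; ties (3 vs 3) fall to the negative class here
votes = preds.sum(axis=0)
ensemble = (votes > preds.shape[0] // 2).astype(int)
print(ensemble)  # [1 0 1 1 0]
```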

Machine Learning Techniques
Naïve Bayes is a straightforward model based on Bayes' theorem. Due to its speed and precision, it is highly efficient on large datasets while incurring minimal processing cost [19]. A Naïve Bayes classifier assumes that all features are independent of one another.
Although this independence assumption [20] is frequently made, it may not hold in real scenarios; nevertheless, Naïve Bayes frequently performs well in practice. Given a predictor A, the term P(B|A) represents the posterior probability of class B, whereas P(A|B) is the likelihood of A given the class. P(B) is the prior probability of class B, and P(A) is the prior probability of A.

P(B|A) = P(A|B) * P(B)/P(A)
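As a worked instance of the formula, with assumed toy probabilities (not taken from the paper's data):

```python
# Assumed toy numbers: 90% of reviews are positive, and the word "bitter"
# appears in 5% of positive reviews and 60% of negative ones
p_pos, p_neg = 0.9, 0.1
p_w_given_pos, p_w_given_neg = 0.05, 0.6

# P(A): total probability of observing the word (law of total probability)
p_w = p_w_given_pos * p_pos + p_w_given_neg * p_neg

# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
post_pos = p_w_given_pos * p_pos / p_w
post_neg = p_w_given_neg * p_neg / p_w
print(round(post_neg, 3))  # 0.571 -- despite the positive prior, the word tilts negative
```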
K-Nearest Neighbor (KNN): KNN is a straightforward and easily implemented technique [20]. It operates on the assumption that data points sharing similar features are located close to each other. The conventional measure of the distance between two points is the classical Euclidean distance, although other distance measures exist and the choice depends on the task at hand. The input consists of the K closest training samples in the feature space. To achieve optimal results, researchers typically run KNN with several values of K and choose the one that produces the highest accuracy. KNN has proven effective across multiple tasks, including regression and classification; however, a significant drawback is that its computational efficiency decreases notably as the data volume increases.

Support Vector Machine (SVM): SVM is a very efficient and widely recognized machine learning technique for classification, grounded in statistical learning theory [19]. It is also memory-efficient, since it uses only a subset of the training data for prediction. SVMs are highly effective when there is a clear margin of separation between classes or when the data has a large number of dimensions. However, they perform suboptimally when classes overlap, and results depend heavily on the choice of kernel.

Random Forest: A Random Forest is an ensemble of multiple decision trees. Each tree produces a categorical prediction, and the forest as a whole decides by majority vote over these predictions. This ensemble approach mitigates the overfitting of individual decision trees and improves overall precision [20]. The method is remarkably stable, handles missing data and non-linear features effectively, and is relatively resilient to noise. Its drawbacks include longer training times and greater complexity: it produces a large number of trees and therefore requires more computing capacity and resources than a single decision tree.

Logistic Regression: Logistic regression is a statistical technique primarily employed for binary classification. It models the relationship between an outcome variable and independent factors by computing probabilities with its core Sigmoid function. The Sigmoid function, distinguished by its S-shaped curve, maps any real number to a value between 0 and 1; these values can then be converted to 0 or 1 by applying a classification threshold. The method is widely valued for its minimal computational requirements. It is also flexible: it does not require input feature standardization or parameter tuning and is straightforward to implement. A fundamental limitation is its inability to handle non-linear problems, owing to its linear decision boundary [21]; it is also susceptible to overfitting.

Gradient Boosting Machines (GBM): GBMs are a robust ensemble learning method used for both regression and classification tasks. They construct models iteratively, continuously refining them by correcting the mistakes of prior models, which improves performance with each successive iteration. This is accomplished by repeatedly fitting new models to the residual errors of previous models' predictions. Each subsequent model in the chain is typically a weak learner, commonly a decision tree, which is added to the ensemble with a carefully determined weight to minimize the total prediction error. The process continues until a specified number of models has been constructed or further improvements become insignificant. The result is a resilient prediction model that consistently performs well across various types of data by capturing intricate patterns and relationships within it.
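The Sigmoid-and-threshold step described for logistic regression can be shown directly. The weights and inputs below are arbitrary illustrative values, not fitted parameters:

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary illustrative weights, bias, and two feature vectors
w, b = np.array([1.5, -2.0]), 0.1
X = np.array([[2.0, 0.5], [0.2, 1.8]])

probs = sigmoid(X @ w + b)           # probabilities in (0, 1)
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 to obtain class labels
print(labels)  # [1 0]
```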

Result and Discussion
The dataset obtained from the Kaggle platform contains 1,158 positive reviews and 84 negative reviews of coffee businesses. The subjectivity detection step identified a total of 1,242 reviews as subjective. The performance of sentiment categorization was assessed using accuracy, precision, recall, and F-score metrics. Precision is the number of correctly predicted positive cases divided by the total number of instances predicted as positive. Recall is the number of correctly predicted positive instances divided by the sum of true positives and false negatives. True Positives (TP) are cases where the model correctly classifies positive data points as positive, while False Negatives (FN) occur when the model incorrectly labels positive data points as negative. The F-score combines Precision and Recall, accounting for both false positives and false negatives. Based on these measures, a comparison was performed between the machine learning techniques employed, and the outcomes are presented in Table 1.
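The metrics above follow directly from the confusion counts. A small helper (the function name and example counts are ours, not the paper's) makes the definitions concrete:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F-score from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 10 false negatives
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.889 0.842
```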

Conclusion
This study examined different methodologies for classifying the sentiment of coffee shop reviews. The methodologies employed were Naïve Bayes, Random Forest, Gradient Boosting Machines (GBM), Support Vector Machine, K-Nearest Neighbour, and Logistic Regression. The results obtained with logistic regression were highly favorable, while the Naïve Bayes classifier achieved the lowest accuracy. The Support Vector Machine and the Random Forest algorithm both reached 92.2% accuracy, and the K-Nearest Neighbour classifier reached 89.6%. In addition, an ensemble learning model was built by combining the individual algorithms in pursuit of even better accuracy. The ensemble model achieved 92.5% accuracy in testing, indicating no improvement in accuracy; the logistic regression model therefore works better in this case.

Figure 4. Evaluation of the ensemble model