Made byBobr AI

Amazon Review Rating Prediction with NLP and Machine Learning

Discover how to predict Amazon star ratings using NLP models like XGBoost, Random Forest, and TF-IDF in this comprehensive machine learning case study.

#nlp#machine-learning#xgboost#sentiment-analysis#python#streamlit#data-science
Watch
Pitch

Amazon Review Rating Prediction using NLP

Multi-Model Machine Learning Approach

Logistic Regression • Random Forest • XGBoost • TF-IDF • Streamlit

Course: Natural Language Processing
Made byBobr AI

Problem Statement & Dataset

🎯 The Goal

Predict the star rating (1–5) from unstructured customer review text.

📊 The Dataset

Source: Amazon Cell Phones & Accessories Reviews
Format: JSON Lines
Key Feature: reviewText (Input)
Target: overall (1–5 Star Rating)

Chart
Made byBobr AI

Project Architecture

A clean, modern flat-style flowchart diagram on a white background. Shows a vertical pipeline steps: User Icon -> Streamlit UI -> JSON Data Upload -> Text Preprocessing -> TF-IDF Vectorizer -> ML Model Selection -> Prediction Output. Data science architecture.

System Components

  • Frontend: Streamlit (Interactive UI)
  • Processing: Pandas & Regex (Data Cleaning)
  • Vectorization: Scikit-Learn TF-IDF
  • Modeling: XGBoost, Random Forest, Logistic Regression
Made byBobr AI

System Pipeline: NLP & ML Workflow

1. Load Amazon Review Data (JSON Lines)
2. Extract Phone Models (Regex)
3. Text Preprocessing (Clean & Tokenize)
4. TF-IDF Feature Extraction
5. Train-Test Split (80/20)
6. Train 3 ML Models
7. Evaluate (Accuracy & ROC-AUC)
8. Live Prediction
abstract isometric illustration of a data factory, raw text going in one side, gears processing, and star ratings coming out the other side, clean minimal vector style, blue and white colors
Made byBobr AI

Data Loading & Dynamic Extraction

📂 JSON Loading

# Load JSON Data
df = pd.read_json(uploaded_file, lines=True)
df['summary'] = df['summary'].astype(str)

📱 Regex Extraction

# Regex for Phone Models
iphone_pattern = r'\biPhone\s*(\d+|[XxRr][Ss]?)\b'
samsung_pattern = r'\bSamsung\b.*\b(Galaxy\s*)?S(\d+)\b'

✔ Enables dynamic model selection in the UI based on product extracted from review text.

Made byBobr AI

Feature Engineering: TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency)

• Converts text -> Numerical Vectors
• Penalizes common words (e.g., 'the', 'is')
• Highlights sentiment-rich terms
tfidf = TfidfVectorizer(
    max_features=800,
    stop_words='english'
)
X = tfidf.fit_transform(model_df['reviewText'])
Visualization of a word cloud or text matrix transforming into digital binary code, representing natural language processing vectorization, clean technical style
Made byBobr AI

Prediction Models Implemented

1. Logistic Regression

Linear baseline. Fast & interpretable.

2. Random Forest

Ensemble of trees. Captures non-linear patterns.

3. XGBoost (Best Performer)

Gradient boosting. Optimized for ROC-AUC.

Implementation Snippet (XGBoost)

xgb_clf = XGBClassifier(
    objective='multi:softprob',
    n_estimators=200
)
Made byBobr AI

Model Evaluation Metrics

  • Accuracy: Overall correctness.
  • ROC-AUC (Macro): Critical for handling class imbalance (reviews are heavily skewed positive). Evaluates performance across all rating classes equally.
# Evaluation Code
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_test_bin, y_proba, average="macro")
Iconographic illustration of a ROC Curve chart and a checkmark, clean design in blue and grey
Made byBobr AI

Results & Comparison

Chart

🔍 Key Observations

  • XGBoost achieves the highest performance (ROC-AUC ~0.82).
  • Random Forest is a strong contender but slower to train.
  • Logistic Regression struggled with non-linear sentiment patterns.
  • Sentiment dominance: strongly positive/negative words drive predictions.
Made byBobr AI

Live Prediction (Streamlit App)

App Features:
• Upload Custom Dataset (JSON)
• Dynamic Phone Model Selection
• Visualization of Rating Distribution
Real-time Confidence Scores for typed text
# Real-time Inference
vec = tfidf.transform([user_review])
confidence = xgb_clf.predict_proba(vec).max()
prediction = xgb_clf.predict(vec)
Streamlit framework logo, clean vector
Made byBobr AI

Running Application Demo

A professional screenshot of a Streamlit web application dashboard for 'Amazon Review Classifier'. Left sidebar has 'Upload JSON' and 'Select Model'. Main area has a blue bar chart, a ROC curve, and a text input box saying 'Great phone battery life!' with a result label '5 Stars (98% Confidence)'. Clean data science UI.

Streamlit Dashboard: Data upload, Model Selection, and Real-time Inference.

Made byBobr AI

Conclusion & Future Work

✅ Key Takeaways

• TF-IDF effectively captures sentiment from short reviews.
• XGBoost outperformed linear models.
• Streamlit provides a rapid path from notebook to deployed app.

🚀 Future Enhancements

• Implement Transformers (BERT/RoBERTa) for contextual understanding.
• Aspect-based sentiment (e.g., Battery vs. Camera).
• Cloud deployment (AWS/Heroku).
Made byBobr AI
Bobr AI

DESIGNER-MADE
PRESENTATION,
GENERATED FROM
YOUR PROMPT

Create your own professional slide deck with real images, data charts, and unique design in under a minute.

Generate For Free

Amazon Review Rating Prediction with NLP and Machine Learning

Discover how to predict Amazon star ratings using NLP models like XGBoost, Random Forest, and TF-IDF in this comprehensive machine learning case study.

Amazon Review Rating Prediction using NLP

Multi-Model Machine Learning Approach

Logistic Regression • Random Forest • XGBoost • TF-IDF • Streamlit

Course: Natural Language Processing

Problem Statement & Dataset

🎯 The Goal

Predict the star rating (1–5) from unstructured customer review text.

📊 The Dataset

Source: Amazon Cell Phones & Accessories Reviews<br>Format: JSON Lines<br>Key Feature: reviewText (Input)<br>Target: overall (1–5 Star Rating)

Project Architecture

<ul><li><b>Frontend:</b> Streamlit (Interactive UI)</li><li><b>Processing:</b> Pandas & Regex (Data Cleaning)</li><li><b>Vectorization:</b> Scikit-Learn TF-IDF</li><li><b>Modeling:</b> XGBoost, Random Forest, Logistic Regression</li></ul>

System Pipeline: NLP & ML Workflow

1. Load Amazon Review Data (JSON Lines)<br>2. Extract Phone Models (Regex)<br>3. Text Preprocessing (Clean & Tokenize)<br>4. TF-IDF Feature Extraction<br>5. Train-Test Split (80/20)<br>6. Train 3 ML Models<br>7. Evaluate (Accuracy & ROC-AUC)<br>8. Live Prediction

Data Loading & Dynamic Extraction

# Load JSON Data df = pd.read_json(uploaded_file, lines=True) df['summary'] = df['summary'].astype(str)

# Regex for Phone Models iphone_pattern = r'\biPhone\s*(\d+|[XxRr][Ss]?)\b' samsung_pattern = r'\bSamsung\b.*\b(Galaxy\s*)?S(\d+)\b'

✔ Enables dynamic model selection in the UI based on product extracted from review text.

Feature Engineering: TF-IDF

<b>TF-IDF (Term Frequency-Inverse Document Frequency)</b><br><br>• Converts text -> Numerical Vectors<br>• Penalizes common words (e.g., 'the', 'is')<br>• Highlights sentiment-rich terms

tfidf = TfidfVectorizer( max_features=800, stop_words='english' ) X = tfidf.fit_transform(model_df['reviewText'])

Prediction Models Implemented

1. Logistic Regression

Linear baseline. Fast & interpretable.

2. Random Forest

Ensemble of trees. Captures non-linear patterns.

3. XGBoost (Best Performer)

Gradient boosting. Optimized for ROC-AUC.

xgb_clf = XGBClassifier( objective='multi:softprob', n_estimators=200 )

Model Evaluation Metrics

<ul><li><b>Accuracy:</b> Overall correctness.</li><li><b>ROC-AUC (Macro):</b> Critical for handling class imbalance (reviews are heavily skewed positive). Evaluates performance across all rating classes equally.</li></ul>

# Evaluation Code acc = accuracy_score(y_true, y_pred) auc = roc_auc_score(y_test_bin, y_proba, average="macro")

Results & Comparison

<ul><li><b>XGBoost</b> achieves the highest performance (ROC-AUC ~0.82).</li><li>Random Forest is a strong contender but slower to train.</li><li>Logistic Regression struggled with non-linear sentiment patterns.</li><li>Sentiment dominance: strongly positive/negative words drive predictions.</li></ul>

Live Prediction (Streamlit App)

# Real-time Inference vec = tfidf.transform([user_review]) confidence = xgb_clf.predict_proba(vec).max() prediction = xgb_clf.predict(vec)

<b>App Features:</b><br>• Upload Custom Dataset (JSON)<br>• Dynamic Phone Model Selection<br>• Visualization of Rating Distribution<br>• <b>Real-time</b> Confidence Scores for typed text

Running Application Demo

Streamlit Dashboard: Data upload, Model Selection, and Real-time Inference.

Conclusion & Future Work

✅ Key Takeaways

• TF-IDF effectively captures sentiment from short reviews.<br>• XGBoost outperformed linear models.<br>• Streamlit provides a rapid path from notebook to deployed app.

🚀 Future Enhancements

• Implement Transformers (BERT/RoBERTa) for contextual understanding.<br>• Aspect-based sentiment (e.g., Battery vs. Camera).<br>• Cloud deployment (AWS/Heroku).

  • nlp
  • machine-learning
  • xgboost
  • sentiment-analysis
  • python
  • streamlit
  • data-science