DBS Foundation Coding Camp 2024/AI/ML/course

sentimentAnalysis Dukcapil

This project is a practical NLP learning artefact: the raw and cleaned datasets, notebooks, requirements, and README narrative are all preserved for traceability.

Domain: AI/ML
Role: Machine Learning Engineer
Output: ML Pipeline
Category: NLP Sentiment Analysis

Project Framing

A source-backed case study built for recruiter review

This reading path makes the problem choice, evidence quality, user framing, execution decisions, and proof trail visible without overstating what the sources support.

Project Type
course

NLP sentiment-analysis project using notebook-driven preprocessing, local datasets, and model benchmarking for Dukcapil app reviews.

Orientation
Tech

Shows disciplined NLP experimentation by keeping data artefacts, requirements, and analysis notebooks together for source-level review.

Core Stack
Python · Jupyter Notebook · Pandas · Scikit-learn

Notebook-first ML workflow with raw and processed CSV datasets, slang-word resources, scraping notebooks, and model-analysis notebooks documented in the local archive.

Why This Problem Mattered

Problem framing before execution

The case study starts with why this problem was selected and how its context justified the effort.

Problem Framing Map

Issue

Public-app review text needs repeatable preprocessing before sentiment models can be compared responsibly.

Context

The project preserves raw and cleaned datasets, scraping notebooks, analysis notebooks, requirements, and README narrative so the full NLP path stays reviewable.

Why Selected

It adds language-specific ML depth to the portfolio by showing that preprocessing and auditability matter as much as the final model choice.

Problem statement

Public-app review text needs repeatable preprocessing before sentiment models can be compared responsibly.

Solution thesis

Built a notebook-based workflow covering data collection artefacts, cleaned datasets, Indonesian text preprocessing, and model benchmarking.

Research and Evidence

What supports the narrative

Evidence is surfaced with its source type and credibility note so the recruiter can quickly see what is directly backed versus intentionally constrained.

Data trail preservation
local

The project keeps raw and cleaned Dukcapil review datasets together with scraping and analysis notebooks.

Credibility: Directly supported by the metrics, architecture description, and insight archive references.

Language-specific preprocessing
local

The workflow explicitly handles Indonesian review text before benchmarking models.

Credibility: Backed by the responsibilities, stack decisions, and README-backed sources.
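
As a rough illustration of what handling Indonesian review text can involve, the sketch below lowercases a review, strips URLs and non-letter characters, collapses elongated letters, and maps slang tokens to standard forms. The `SLANG_MAP` entries and the `normalize_review` function are hypothetical; the project's actual slang-word resource and notebook code are not reproduced here.

```python
import re

# Hypothetical slang dictionary for illustration only; the project's real
# slang-word resource is not shown in the source material.
SLANG_MAP = {
    "gk": "tidak",
    "ga": "tidak",
    "bgt": "banget",
    "yg": "yang",
    "apk": "aplikasi",
}


def normalize_review(text: str, slang_map: dict[str, str] = SLANG_MAP) -> str:
    """Normalize one informal Indonesian review into a cleaned string."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    text = re.sub(r"(.)\1{2,}", r"\1", text)    # "bangeeet" -> "banget"
    tokens = [slang_map.get(token, token) for token in text.split()]
    return " ".join(tokens)


print(normalize_review("Apk nya gk bisa login!!! Lambat bgt"))
# -> aplikasi nya tidak bisa login lambat banget
```

Keeping the slang map as plain data rather than hard-coded rules means the mapping itself can be reviewed alongside the datasets.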

Credibility Notes

  • The project is framed as notebook-driven NLP experimentation, not as a deployed sentiment product.
  • No live user feedback, service integration, or product-impact claim is added beyond the source-backed analysis artefacts.

Who The User Was

User framing stays explicit

When formal research artefacts are not available, the page still explains who the work served and why that user framing is justified by the existing sources.

Primary user
Reviewers or ML practitioners who need a traceable raw-to-cleaned text-processing workflow.

The strongest project value lies in the explicit data trail and preprocessing discipline, not a production-facing interface.

Secondary stakeholder
Teams interested in understanding public-review sentiment through structured NLP preprocessing and model comparison.

The source-backed narrative connects cleaned review text and comparative experimentation to practical analysis goals.

Decision Flow

How design thinking translated into decisions

The goal is to show the trace from research and insight to concrete product or system decisions, then to the outcomes those decisions supported.

Design Thinking Flow

Each step keeps the movement from evidence to action explicit before the rationale expands it.

  1. Data acquisition framing

    Started by preserving the source and cleaned review datasets before optimizing model comparison.

    Signal: Raw-to-processed traceability became part of the evidence story.
  2. Preprocessing discipline

    Handled Indonesian review normalization as a first-class problem rather than a hidden notebook detail.

    Signal: Language-specific text preparation became central to model credibility.
  3. Benchmarking transparency

    Used notebook-based experimentation so model behavior remains reviewable alongside the data pipeline.

    Signal: The workflow supports auditable comparison rather than opaque model claims.

Decision Rationale

Each decision keeps the path from insight to execution visible before ending on the outcome signal.

Raw and cleaned dataset separation
Insight

Sentiment experiments become harder to audit when source text and processed data are merged too early.

Decision

Kept raw and cleaned review datasets separate inside the project workflow.

Outcome

The analysis path becomes easier to verify and discuss during portfolio review.
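
One way to keep the raw and cleaned files separate is sketched below, assuming pandas and a review column named `content`. The file names, column names, and the `build_cleaned_dataset` helper are illustrative assumptions, and the demo rows are made up rather than taken from the project data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_cleaned_dataset(raw_path: Path, cleaned_path: Path) -> pd.DataFrame:
    """Read the raw file, clean a copy, and write it to a separate path."""
    raw = pd.read_csv(raw_path)  # the raw file is only read, never rewritten
    cleaned = raw.dropna(subset=["content"]).copy()
    cleaned["content"] = cleaned["content"].str.strip().str.lower()
    cleaned = cleaned[cleaned["content"] != ""]
    cleaned = cleaned.drop_duplicates(subset=["content"]).reset_index(drop=True)
    cleaned.to_csv(cleaned_path, index=False)
    return cleaned


# Demo in a temporary directory with made-up rows (not project data).
work_dir = Path(tempfile.mkdtemp())
raw_path = work_dir / "reviews_raw.csv"          # hypothetical file name
cleaned_path = work_dir / "reviews_cleaned.csv"  # hypothetical file name
pd.DataFrame(
    {"content": ["Bagus", "bagus ", None, "Lambat sekali"], "score": [5, 5, 1, 2]}
).to_csv(raw_path, index=False)

cleaned = build_cleaned_dataset(raw_path, cleaned_path)
print(len(pd.read_csv(raw_path)), "raw rows;", len(cleaned), "cleaned rows")
```

Because the cleaned file is always derived from the raw one, any cleaning rule can be re-run or challenged without losing the original text.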

Notebook-first NLP traceability
Insight

Preprocessing choices can change model conclusions as much as algorithm selection.

Decision

Used notebooks to keep preprocessing and model benchmarking steps explicit.

Outcome

The project demonstrates disciplined NLP reasoning rather than only a final score narrative.
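
The insight that preprocessing can matter as much as algorithm choice can be made checkable by scoring every preprocessing and model combination under the same cross-validation split. The sketch below assumes scikit-learn; the toy reviews, labels, the `normalize` function, and the two model choices are illustrative assumptions, not the models or data used in the project's notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy reviews (1 = positive, 0 = negative); not project data.
texts = [
    "aplikasi bagus dan cepat", "sangat membantu, mantap",
    "pelayanan mudah dan bagus", "login lancar, mantap",
    "aplikasi lambat dan error", "gk bisa login, error terus",
    "sangat buruk, lambat bgt", "verifikasi gagal terus",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

SLANG = {"gk": "tidak", "bgt": "banget"}  # hypothetical slang entries


def normalize(doc: str) -> str:
    """Lowercase and replace slang tokens before vectorization."""
    return " ".join(SLANG.get(tok, tok) for tok in doc.lower().split())


# None keeps the vectorizer's default preprocessing (lowercasing only).
preprocessors = {"default": None, "normalized": normalize}
models = {
    "naive_bayes": lambda: MultinomialNB(),
    "logreg": lambda: LogisticRegression(max_iter=1000),
}

# One fixed split so every combination is scored on the same folds.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
results = {}
for prep_name, prep in preprocessors.items():
    for model_name, make_model in models.items():
        pipeline = make_pipeline(TfidfVectorizer(preprocessor=prep), make_model())
        scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
        results[(prep_name, model_name)] = float(scores.mean())

for key, score in sorted(results.items()):
    print(key, round(score, 3))
```

With only eight toy rows the scores themselves mean nothing; the point is the grid structure, which keeps preprocessing variants and model variants visible side by side.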

Solution and System Execution

Execution choices and delivery details

This section preserves the technical and operational substance: architecture, responsibilities, trade-offs, and implementation quality signals.

System Design

Notebook-first ML workflow with raw and processed CSV datasets, slang-word resources, scraping notebooks, and model-analysis notebooks documented in the local archive.

Source-backed Impact

Shows disciplined NLP experimentation by keeping data artefacts, requirements, and analysis notebooks together for source-level review.

Responsibilities

  • Prepared source and cleaned review datasets for analysis
  • Implemented preprocessing steps for Indonesian review text
  • Compared model behavior through notebook-based experimentation

Stack Decisions

  • Used notebooks to keep exploratory NLP decisions auditable
  • Kept raw and processed datasets separate to preserve reviewability
  • Used requirements metadata so the experiment environment can be reconstructed
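
As a sketch of how requirements metadata can help reconstruct the experiment environment, the snippet below parses pinned `name==version` lines and compares them with what is installed, using the standard-library `importlib.metadata` module. The `parse_pins` and `check_environment` helpers and the sample pins are hypothetical; the project's actual requirements file contents are not shown in the source.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Collect exact pins (name==version); other specifiers are skipped."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def check_environment(pins: dict[str, str]) -> dict[str, tuple]:
    """Map each package to (wanted, installed, matches)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (wanted, installed, installed == wanted)
    return report


# Illustrative pins only; these versions are not taken from the project.
sample = "pandas==2.1.0\n# tooling\nscikit-learn==1.3.2  # pinned\nnumpy>=1.0\n"
for package, status in check_environment(parse_pins(sample)).items():
    print(package, status)
```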

Trade-offs

  • Accepted notebook workflow limits in exchange for transparent experimentation
  • Avoided production-readiness claims because the available source is an analysis artefact, not a deployed service

Challenges

  • Handling informal Indonesian review text before modeling
  • Keeping data acquisition, preprocessing, and model comparison traceable across notebooks

Outcomes and Proof

What was delivered and what can be verified

Outcome claims remain conservative and source-backed, while proof records and recruiter-safe links surface the strongest verification trail available.

Validation Signals

  • Local archive includes raw and cleaned Dukcapil app review datasets.
  • Project contains scraping and analysis notebooks plus requirements metadata.
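
Signals like these could be checked mechanically with a small manifest script. The sketch below walks a directory, assigns each file a category by extension, and records a SHA-256 hash so later changes are detectable. The category mapping, the `build_manifest` helper, and the placeholder files are assumptions for illustration and do not reproduce the actual archive layout.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path

# Assumed mapping; treating every .txt file as requirements is a simplification.
CATEGORY_BY_SUFFIX = {
    ".csv": "dataset",
    ".ipynb": "notebook",
    ".md": "readme",
    ".txt": "requirements",
}


def build_manifest(root: Path) -> list[dict]:
    """List every file under root with its category and content hash."""
    entries = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        entries.append({
            "path": path.relative_to(root).as_posix(),
            "category": CATEGORY_BY_SUFFIX.get(path.suffix.lower(), "other"),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return entries


# Placeholder files in a temporary directory (hypothetical names).
demo_root = Path(tempfile.mkdtemp())
for rel in ["data/reviews_raw.csv", "data/reviews_cleaned.csv",
            "notebooks/analysis.ipynb", "README.md", "requirements.txt"]:
    target = demo_root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(f"placeholder for {rel}\n", encoding="utf-8")

manifest = build_manifest(demo_root)
counts = Counter(entry["category"] for entry in manifest)
print(dict(counts))
```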

Source-backed Outcomes

  • Insight archive records 9 files across dataset, notebook, README, and requirements artefacts

Retrospective and Limits

What the project proves, and what it does not

Strong case studies show both what was learned and where the current evidence stops.

Retrospective

The next iteration should add a concise model card and a reproducible evaluation summary before turning notebook results into product claims.
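
A reproducible evaluation summary might look something like the sketch below, assuming scikit-learn: a fixed seed controls the split and the model, and the result is a small JSON-serializable record that could feed a model card. The `evaluation_summary` function, the model choice, and the toy reviews are illustrative assumptions, not results or code from the project.

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

SEED = 42  # fixed so the split and the fit can be repeated exactly


def evaluation_summary(texts, labels, seed=SEED):
    """Train on a seeded stratified split and return a compact metrics record."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000, random_state=seed)
    )
    model.fit(x_train, y_train)
    report = classification_report(
        y_test, model.predict(x_test), output_dict=True, zero_division=0
    )
    return {
        "seed": seed,
        "model": "TfidfVectorizer + LogisticRegression",
        "n_train": len(x_train),
        "n_test": len(x_test),
        "accuracy": float(report["accuracy"]),
        "macro_f1": float(report["macro avg"]["f1-score"]),
    }


# Made-up toy reviews (1 = positive, 0 = negative); not project data.
toy_texts = [
    "aplikasi bagus dan cepat", "sangat membantu mantap",
    "pelayanan mudah dan bagus", "login lancar mantap",
    "aplikasi lambat dan error", "tidak bisa login error terus",
    "sangat buruk lambat banget", "verifikasi gagal terus",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

summary = evaluation_summary(toy_texts, toy_labels)
print(json.dumps(summary, indent=2))
```

Recording the seed and split sizes next to the metrics lets a reviewer rerun the evaluation and compare numbers directly.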

Evidence Limits

  • Current sources do not support deployment, online inference, or ongoing feedback-loop claims.
  • The project should remain framed as auditable NLP experimentation and analysis.

Lessons

  • Language-specific preprocessing can matter as much as algorithm choice in sentiment analysis
  • A clear raw-to-cleaned data trail improves ML project auditability