General Guidelines

Problem statement

Riverline’s AI agent must decide what to do next when a customer’s issue remains open. Building a robust Next-Best-Action (NBA) system requires three core skills: (1) turning messy, multi-source text into a reliable daily feed, (2) learning from user patterns, and (3) choosing the smartest follow-up. The assignment below lets you showcase all three using only public text datasets.

Datasets

  1. Customer Support on Twitter (CST) – 3 M+ tweets and brand replies, with user IDs, timestamps and language tags. It is large enough to show multi-turn issues and reply patterns. (Dataset: kaggle.com)
  2. (Bonus) Reddit MBTI – 11 773 authors labelled with Myers-Briggs type and 13 M posts; use this only if you want to enrich customer personas. (Dataset: kaggle.com)

You may reference additional open text sources for sentiment, author profiling or persona cues (e.g., Sentiment140 kaggle.com, PAN Author Profiling pan.webis.de, Synthetic-Persona-Chat github.com, Empathetic Dialogues kaggle.com) if they strengthen your logic.

Core Tasks

1. Data Pipeline

Create an ingestion job that pulls new CST records, normalises them into a single interaction table, and guarantees idempotent re-runs. You decide the storage, scheduling, and deduplication approach; just explain the rationale. Best-practice references on ML data pipelines may help but are not prescriptive.
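As an illustration of what "idempotent re-runs" can mean in practice, here is a minimal sketch that uses SQLite and a primary key on the tweet ID, so loading the same batch twice leaves the table unchanged. The storage choice, the table name `interactions`, the `ingest` helper, and the column names (assumed to mirror the CST CSV) are illustrative assumptions, not requirements of the assignment.

```python
import sqlite3

# Hypothetical schema for the single interaction table; column names are
# assumed to mirror the CST CSV and may need adjusting to the real file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS interactions (
    tweet_id INTEGER PRIMARY KEY,
    author_id TEXT NOT NULL,
    inbound INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    text TEXT NOT NULL,
    in_response_to_tweet_id INTEGER
)
"""


def normalise(record):
    """Convert one raw record (a dict of strings) into a table row."""
    parent = record.get("in_response_to_tweet_id")
    return (
        int(record["tweet_id"]),
        record["author_id"],
        1 if str(record["inbound"]).lower() in ("true", "1") else 0,
        record["created_at"],
        " ".join(record["text"].split()),  # collapse runs of whitespace
        int(parent) if parent else None,
    )


def ingest(conn, records):
    """Load a batch and return the total row count in the table.

    INSERT OR IGNORE skips rows whose tweet_id already exists, so
    re-running the job on the same batch does not create duplicates.
    """
    conn.execute(SCHEMA)
    with conn:  # one transaction per batch: commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO interactions VALUES (?, ?, ?, ?, ?, ?)",
            [normalise(r) for r in records],
        )
    return conn.execute("SELECT COUNT(*) FROM interactions").fetchone()[0]
```

Deduplicating on a natural key is only one option; a content hash or an upsert that refreshes changed rows would serve equally well, provided the rationale is explained.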

2. Observe user behavior

Users behave in their own ways, which leads to different “conversation flows” (i.e., how a user responds over the course of a conversation). In this part, study patterns in user behavior to identify the distinct conversation flows, and see whether you can tag or cohort customers based on attributes like “nature_of_support_request”, “customer_sentiment” or “conversation_history”.
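One possible starting point for the “nature_of_support_request” tag is a simple keyword baseline, sketched below. The labels, keyword lists, and the helper names `tag_request` and `cohort_counts` are hypothetical; a real solution would likely replace this with a learned classifier or clustering once the data has been explored.

```python
from collections import Counter

# Hypothetical labels and keywords, chosen only for illustration.
KEYWORDS = {
    "billing": ("refund", "charge", "invoice", "payment"),
    "delivery": ("delivery", "shipping", "package", "order"),
    "technical": ("error", "crash", "bug", "not working", "login"),
}


def tag_request(text):
    """Return the first label whose keywords appear in the text, else 'other'."""
    lowered = text.lower()
    for label, words in KEYWORDS.items():
        # Plain substring matching: crude, so "border" would match "order".
        if any(word in lowered for word in words):
            return label
    return "other"


def cohort_counts(first_messages):
    """Count customers per tag, given a mapping of user ID to opening message."""
    return Counter(tag_request(text) for text in first_messages.values())
```

The same pattern extends to “customer_sentiment” or “conversation_history” by swapping in a different tagging function per attribute.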

Important: Some support requests will already have been resolved. Tag them, log their count, and then exclude them, as we will evaluate your model on how it acts for open customer-support tickets.
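The tag, count, and exclude step could be sketched as follows. The resolution rule used here (the customer's last message contains a closing phrase such as "thanks" or "fixed") is an assumed heuristic, and `RESOLVED_PATTERN`, `is_resolved`, and `split_open_resolved` are hypothetical names; the brief does not define how a resolved ticket should be detected.

```python
import re

# Assumed closing phrases; this list is a guess and will misfire on
# sarcasm or on messages like "no thanks to you".
RESOLVED_PATTERN = re.compile(
    r"\b(thanks|thank you|resolved|fixed|sorted|working now)\b",
    re.IGNORECASE,
)


def is_resolved(messages):
    """messages: time-ordered list of (inbound, text); inbound=True means customer."""
    customer_texts = [text for inbound, text in messages if inbound]
    if not customer_texts:
        return False
    return bool(RESOLVED_PATTERN.search(customer_texts[-1]))


def split_open_resolved(conversations):
    """Split a mapping of conversation ID to messages into (open, resolved)."""
    open_convs, resolved_convs = {}, {}
    for conv_id, messages in conversations.items():
        target = resolved_convs if is_resolved(messages) else open_convs
        target[conv_id] = messages
    return open_convs, resolved_convs
```

Logging `len(resolved_convs)` gives the requested count, and only `open_convs` is passed on to the NBA model.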