Wyona/message-classification-question-other-smalltalk-modified

+---
+annotations_creators:
+- expert-generated
+language:
+- en
+language_creators:
+- found
+license: []
+multilinguality:
+- monolingual
+pretty_name: message-classification
+size_categories:
+- 10K<n<100K
+source_datasets:
+- https://github.com/zeloru/small-english-smalltalk-corpus
+tags: []
+task_categories:
+- text-classification
+task_ids:
+- semantic-similarity-scoring
+---
+## Table of Contents
+- [Table of Contents](#table-of-contents)
+- [Description](#description)
+  - [Summary](#summary)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+- [Considerations for Using the Model](#considerations-for-using-the-model)
+  - [Known Limitations](#known-limitations)
+## Description
+### Summary
+https://ukatie.com
+This dataset is intended for training a model that detects whether a user message is a question or just a comment, so that chatbots in chat rooms such as Slack, MS Teams, Discord and Matrix know when to respond.
+It is also used to identify the questions and their context within an input sentence.
+### Languages
+So far, English is the only supported language.
+## Dataset Structure
+### Data Fields
+- Text: short input sentence
+- Label: "Question" or "Other"
+### Data Splits
+- Question: 10K samples
+- Other: 10K samples
+- Training: 18K samples (shuffled)
+- Validation: 2K samples (shuffled)
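
The 18K/2K split above can be reproduced in spirit with a seeded shuffle. This is only a sketch: the helper name `shuffle_and_split`, the seed, and the `(text, label)` tuple layout are assumptions for illustration, not the procedure that produced the published splits.

```python
import random


def shuffle_and_split(samples, n_validation, seed=0):
    """Shuffle a copy of `samples` and cut off `n_validation` items.

    Returns (training, validation). The seed is a hypothetical choice
    that only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(samples)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]


# Synthetic placeholders standing in for the real (text, label) pairs.
placeholder_samples = (
    [(f"question {i}", "question") for i in range(10_000)]
    + [(f"other {i}", "other") for i in range(10_000)]
)
training, validation = shuffle_and_split(placeholder_samples, n_validation=2_000)
```

With 20K samples and `n_validation=2_000`, this yields 18K training and 2K validation samples, with both labels mixed in each part.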
+## Dataset Creation
+### Curation Rationale
+Short, simple examples were chosen because they contain the same kinds of words and word order as long-winded questions with rare words.
+The goal is also to detect questions in chat rooms, a medium where people often use very short sentences and a lot of greetings and small talk.
+### Source Data
+#### Initial Data Collection
+https://github.com/zeloru/small-english-smalltalk-corpus
+The source corpus was scraped from ESL (English as a second language) learning material.
+Only samples from certain conversations were taken from the scraped data, because some of the conversations had quality issues.
+Because we also want to detect questions that are missing a question mark, the question mark was removed from most of the question samples. The same was done for the "other" label, where the "." at the end of a sentence was removed. This keeps these markers from becoming the only feature the model looks at.
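
The punctuation stripping described above could look roughly like the following sketch. The function name `strip_final_marker`, the lowercase label strings, and the `drop_prob` parameter are assumptions for illustration; the card does not state what fraction of samples was stripped or how it was implemented.

```python
import random


def strip_final_marker(text, label, rng, drop_prob=0.9):
    """Drop the trailing '?' (for "question") or '.' (for "other").

    `drop_prob` is a hypothetical knob: the card only says that "most"
    question marks were removed, so the default here is a guess.
    """
    marker = "?" if label == "question" else "."
    stripped = text.rstrip()
    if stripped.endswith(marker) and rng.random() < drop_prob:
        return stripped[:-1].rstrip()
    return stripped


rng = random.Random(0)
always = strip_final_marker("Where is the station?", "question", rng, drop_prob=1.0)
never = strip_final_marker("Where is the station?", "question", rng, drop_prob=0.0)
```

Leaving a small share of samples with their final marker, as `drop_prob < 1.0` does, keeps both punctuated and unpunctuated forms in the training data.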
+### Annotations
+#### Annotation process
+The "question" and "other" labels were assigned automatically: questions in the conversations were labelled "question" and the answers were labelled "other".
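
The card does not say how questions were told apart from answers. One plausible heuristic, shown here purely as an assumption, is to label an utterance by its final question mark before that mark is stripped. The function name `label_utterances` and the list-of-strings conversation format are hypothetical.

```python
def label_utterances(conversation):
    """Label each utterance of one conversation as "question" or "other".

    Assumption: the raw corpus text still carries its punctuation, so a
    trailing '?' marks a question and everything else counts as an answer.
    """
    labelled = []
    for utterance in conversation:
        text = utterance.strip()
        if not text:
            continue  # skip empty lines in the raw conversation
        label = "question" if text.endswith("?") else "other"
        labelled.append((text, label))
    return labelled
```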
+## Considerations for Using the Model
+### Known Limitations
+There seems to be an imbalance of greeting word combinations at the beginning of sentences. For example, "Hi, has anyone deployed X in Y" is wrongly not detected as a question because of the "Hi, " part. This issue will be addressed in updates.
+Sentences of the form "Wondering if 'question'..." and "I'm asking for help about 'question'..." appear to be hard to detect and need more samples in the data.
+Code fragments in input sentences are sometimes detected as questions. If code is present, it should probably be filtered out beforehand.
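
One way to filter code out beforehand is sketched below. It assumes that code in chat messages is marked up with backticks (triple-backtick blocks or inline spans), as is common in Markdown-style chat clients; unmarked code would pass through untouched. The name `strip_code` is hypothetical.

```python
import re

# Triple-backtick blocks, possibly spanning several lines.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
# Inline spans between single backticks on one line.
INLINE_CODE = re.compile(r"`[^`\n]+`")


def strip_code(message):
    """Remove backtick-delimited code and collapse leftover whitespace."""
    without_fenced = FENCED_CODE.sub(" ", message)
    without_inline = INLINE_CODE.sub(" ", without_fenced)
    return " ".join(without_inline.split())
```

The cleaned message, rather than the raw one, would then be passed to the classifier.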