---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license: []
multilinguality:
- monolingual
pretty_name: message-classification
size_categories:
- n=10K
source_datasets:
- https://github.com/zeloru/small-english-smalltalk-corpus
tags: []
task_categories:
- text-classification
task_ids:
- semantic-similarity-scoring
---

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Description](#description)
  - [Summary](#summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
- [Considerations for Using the Model](#considerations-for-using-the-model)
  - [Known Limitations](#known-limitations)

## Description

### Summary

https://ukatie.com

This model is used to trigger chatbots to respond in chatrooms such as Slack, MS Teams, Discord and Matrix by detecting whether a user message is a question or just a comment. It is also used to determine the questions and context within an input sentence.

### Languages

So far, English is the only supported language.

## Dataset Structure

### Data Fields

- Text: Short input sentence
- Label: Question or Other

### Data Splits

- Question: 10K samples
- Other: 10K samples
- Training: 18K samples, shuffled
- Validation: 2K samples, shuffled

## Dataset Creation

### Curation Rationale

Simple, short examples in basic language were chosen because they contain the same kinds of words and word order as long-winded questions with rare words. The goal is also to detect questions in a chatroom, a medium where people often use very short sentences and a lot of greetings or small talk.

### Source Data

#### Initial Data Collection

https://github.com/zeloru/small-english-smalltalk-corpus

The data was scraped from ESL language learning material.
Of the scraped data, only samples from certain conversations were used, because some conversations had quality issues. Because questions that are missing a question mark should also be detected, most of the samples had the question mark removed. The same was done for the "other" label, where the "." was removed from the end of a sentence. This keeps these punctuation marks from becoming the only feature the model looks at.

### Annotations

#### Annotation process

The "question" and "other" labels were assigned automatically: questions in the conversations were labeled "question" and the answers were labeled "other".

## Considerations for Using the Model

### Known Limitations

There seems to be an imbalance of greeting word combinations at the beginning of sentences. For example, "Hi, has anyone deployed X in Y" is wrongly not detected as a question because of the "Hi, " part. This issue will be addressed in updates.

Sentences of the form "Wondering if 'question'..." and "I'm asking for help about 'question'..." seem hard to detect and need more samples in the data.

Code fragments in input sentences are sometimes detected as questions. If code is present, it should probably be filtered out beforehand.
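
The filtering suggested for code fragments could look roughly like the sketch below. It is not part of this dataset or model. The function name `strip_code` is hypothetical, and the sketch assumes that code in chat messages is marked with backticks (fenced blocks or inline spans), which is common in chat clients but not guaranteed.

```python
import re

# Assumption: code is wrapped in backticks, either as a fenced block
# (three backticks) or as an inline span (single backticks).
FENCED_BLOCK = re.compile(r"`{3}.*?`{3}", re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]+`")
WHITESPACE = re.compile(r"\s+")


def strip_code(message: str) -> str:
    """Remove backtick-delimited code from a chat message (hypothetical helper)."""
    without_blocks = FENCED_BLOCK.sub(" ", message)
    without_inline = INLINE_CODE.sub(" ", without_blocks)
    # Collapse the gaps left behind by the removed code.
    return WHITESPACE.sub(" ", without_inline).strip()


if __name__ == "__main__":
    print(strip_code("Hi, has anyone seen this `KeyError` before"))
    # prints: Hi, has anyone seen this before
```

Code that is pasted without backticks is not caught by this approach. If a message consists only of code, the function returns an empty string, in which case the caller would likely skip classification altogether.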