scasps committed on
Commit f391d00 · 1 Parent(s): e19f181

Delete README.md

Files changed (1)
  1. README.md +0 -105
README.md DELETED
@@ -1,105 +0,0 @@
1
- ---
2
- annotations_creators:
3
- - expert-generated
4
- language:
5
- - en
6
- language_creators:
7
- - found
8
- license: []
9
- multilinguality:
10
- - monolingual
11
- pretty_name: message-classification
12
- size_categories:
13
- - n=10K
14
- source_datasets:
15
- - https://github.com/zeloru/small-english-smalltalk-corpus
16
- tags: []
17
- task_categories:
18
- - text-classification
19
- task_ids:
20
- - semantic-similarity-scoring
21
- ---
22
-
23
- # Dataset Card for [Dataset Name]
24
-
25
- ## Table of Contents
26
- - [Table of Contents](#table-of-contents)
27
- - [Description](#description)
28
- - [Summary](#summary)
29
- - [Languages](#languages)
30
- - [Dataset Structure](#dataset-structure)
31
- - [Data Fields](#data-fields)
32
- - [Data Splits](#data-splits)
33
- - [Dataset Creation](#dataset-creation)
34
- - [Curation Rationale](#curation-rationale)
35
- - [Source Data](#source-data)
36
- - [Annotations](#annotations)
37
- - [Considerations for Using the Data](#considerations-for-using-the-data)
38
- - [Known Limitations](#known-limitations)
39
-
40
- ## Description
41
- ### Summary
42
- https://ukatie.com
43
-
44
- This model is used to trigger chatbots to respond in chatrooms such as Slack, MSTeams, Discord and Matrix by detecting whether the user comment is a question or just a comment.
45
-
46
- It is also used to determine the questions and context within an input sentence.
47
-
48
- ### Languages
49
-
50
- So far, English is the only supported language.
51
-
52
- ## Dataset Structure
53
-
54
- ### Data Fields
55
-
56
- Text: Short input sentence
57
-
58
- Label: Question or Other
59
-
60
- ### Data Splits
61
-
62
- Question: 10K samples
63
-
64
- Other: 10K samples
65
-
66
-
67
- Training: 18K samples shuffled
68
-
69
- Validation: 2K samples shuffled
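
The 18K/2K split described above can be sketched as a seeded shuffle followed by a hold-out. This is a minimal illustration, not the procedure actually used for this dataset: the `shuffle_and_split` helper, the seed, and the toy samples are all assumptions.

```python
import random


def shuffle_and_split(samples, validation_size, seed=0):
    """Shuffle a copy of `samples` and hold out `validation_size` items."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[validation_size:], shuffled[:validation_size]


# Toy stand-in for the 10K "question" and 10K "other" samples (hypothetical data).
samples = [
    (f"sample {i}", "question" if i % 2 == 0 else "other") for i in range(20_000)
]
train, validation = shuffle_and_split(samples, validation_size=2_000)
print(len(train), len(validation))  # 18000 2000
```

Shuffling before the hold-out mixes both labels into each split, which is presumably what "shuffled" refers to in the card.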
70
-
71
- ## Dataset Creation
72
-
73
- ### Curation Rationale
74
-
75
- Simple, short examples in basic language were chosen because they use the same kinds of words and word order as long-winded questions with rare words.
76
-
77
- The goal is also to detect questions in a chatroom, a medium where people often use very short sentences and a lot of greetings and small talk.
78
-
79
- ### Source Data
80
-
81
- #### Initial Data Collection
82
-
83
- https://github.com/zeloru/small-english-smalltalk-corpus
84
-
85
- The data was scraped from ESL (English as a second language) learning material.
86
-
87
- From the scraped data, only samples from certain conversations were used, because some conversations had quality issues.
88
-
89
- Because we also want to detect questions that are missing a question mark, most of the "question" samples had the question mark removed. The same was done for the "other" label, where the trailing "." was removed from the end of a sentence. This was done so that these marks don't become the only feature the model looks at.
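
The punctuation removal described above might look roughly like the following. This is a hedged sketch only: the card says "most" samples were stripped without giving a ratio, so the `keep_probability` parameter and the `strip_final_mark` helper are hypothetical.

```python
import random


def strip_final_mark(text, label, rng, keep_probability=0.1):
    """Drop the trailing '?' (question) or '.' (other) from most samples.

    `keep_probability` is an assumed knob: the share of samples that keep
    their final mark so the model still sees some punctuated examples.
    """
    mark = "?" if label == "question" else "."
    stripped = text.rstrip()
    if stripped.endswith(mark) and rng.random() >= keep_probability:
        return stripped.rstrip(mark).rstrip()
    return stripped


rng = random.Random(0)
# With keep_probability=0.0 every matching mark is removed.
print(strip_final_mark("How are you?", "question", rng, keep_probability=0.0))  # How are you
print(strip_final_mark("I am fine.", "other", rng, keep_probability=0.0))  # I am fine
```

Keeping a small share of punctuated samples is a design choice of this sketch, so that inputs with and without the final mark both appear during training.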
90
-
91
- ### Annotations
92
-
93
- #### Annotation process
94
-
95
- The "question" and "other" annotations were assigned automatically: questions in the conversations were labeled "question" and the answers were labeled "other".
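
As an illustration of this automatic labeling, the sketch below assumes each conversation is already available as (question, answer) pairs. The card does not say how the source corpus is structured, so the pair format and the `label_pairs` helper are hypothetical.

```python
def label_pairs(pairs):
    """Turn (question, answer) pairs into (text, label) samples."""
    samples = []
    for question, answer in pairs:
        samples.append((question, "question"))  # the asking turn
        samples.append((answer, "other"))  # the replying turn
    return samples


conversation = [("Where do you live?", "I live in Berlin.")]
print(label_pairs(conversation))
# [('Where do you live?', 'question'), ('I live in Berlin.', 'other')]
```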
96
-
97
- ## Considerations for Using the Data
98
-
99
- ### Known Limitations
100
-
101
- There seems to be an imbalance of greeting word combinations at the beginning of sentences. For example, "Hi, has anyone deployed X in Y" is not detected as a question because of the "Hi, " part. This issue will be addressed in updates.
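
One possible way to reduce the greeting imbalance is to add greeting-prefixed copies of samples from both labels, so that a leading "Hi, " stops correlating with a single label. This is only a sketch of that idea, not the update the authors had planned: the `GREETINGS` list, the `fraction` parameter, and the `add_greeting_variants` helper are all assumptions.

```python
import random

GREETINGS = ["Hi, ", "Hello, ", "Hey, "]  # hypothetical greeting prefixes


def add_greeting_variants(samples, rng, fraction=0.2):
    """Return the samples plus greeting-prefixed copies of a random fraction."""
    augmented = list(samples)
    for text, label in samples:
        if rng.random() < fraction:
            # The label is unchanged: a greeting alone should not decide it.
            augmented.append((rng.choice(GREETINGS) + text, label))
    return augmented


base = [("has anyone deployed X in Y", "question"), ("I am fine", "other")]
augmented = add_greeting_variants(base, random.Random(0), fraction=1.0)
print(len(augmented))  # 4
```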
102
-
103
- Sentences in the form of "Wondering if 'question'..." and "I'm asking for help about 'question'..." are seemingly hard to detect and need more samples in the data.
104
-
105
- Code fragments in input sentences are sometimes detected as questions. If code is present, it should probably be filtered out beforehand.
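
Filtering code out beforehand could be done along these lines for chat platforms that mark code with backticks. This is a rough sketch under that assumption: the `strip_code` helper and both patterns are hypothetical, and code pasted without backticks would not be caught.

```python
import re

FENCE = "`" * 3  # triple-backtick fence, built this way to keep the example readable
FENCED_CODE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]+`")


def strip_code(message):
    """Remove fenced and inline code spans, then normalize whitespace."""
    without_fences = FENCED_CODE.sub(" ", message)
    without_inline = INLINE_CODE.sub(" ", without_fences)
    return " ".join(without_inline.split())


print(strip_code("Why does `foo()` fail"))  # Why does fail
```

The cleaned message can then be passed to the classifier in place of the raw input.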