scasps committed on
Commit 07c4a7f · 1 Parent(s): f391d00

Upload README.md

Files changed (1)
  1. README.md +105 -0
README.md ADDED
@@ -0,0 +1,105 @@
1
+ ---
2
+ annotations_creators:
3
+ - expert-generated
4
+ language:
5
+ - en
6
+ language_creators:
7
+ - found
8
+ license: []
9
+ multilinguality:
10
+ - monolingual
11
+ pretty_name: message-classification
12
+ size_categories:
13
+ - 10K<n<100K
14
+ source_datasets:
15
+ - https://github.com/zeloru/small-english-smalltalk-corpus
16
+ tags: []
17
+ task_categories:
18
+ - text-classification
19
+ task_ids:
20
+ - semantic-similarity-scoring
21
+ ---
22
+
23
+ # Dataset Card for message-classification
24
+
25
+ ## Table of Contents
26
+ - [Table of Contents](#table-of-contents)
27
+ - [Description](#description)
28
+ - [Summary](#summary)
29
+ - [Languages](#languages)
30
+ - [Dataset Structure](#dataset-structure)
31
+ - [Data Fields](#data-fields)
32
+ - [Data Splits](#data-splits)
33
+ - [Dataset Creation](#dataset-creation)
34
+ - [Curation Rationale](#curation-rationale)
35
+ - [Source Data](#source-data)
36
+ - [Annotations](#annotations)
37
+ - [Considerations for Using the Model](#considerations-for-using-the-model)
38
+ - [Known Limitations](#known-limitations)
39
+
40
+ ## Description
41
+ ### Summary
42
+ https://ukatie.com
43
+
44
+ The model trained on this dataset is used to trigger chatbots to respond in chat rooms such as Slack, Microsoft Teams, Discord and Matrix by detecting whether a user message is a question or just a comment.
45
+
46
+ It is also used to determine the questions and context within an input sentence.
47
+
48
+ ### Languages
49
+
50
+ So far, English is the only supported language.
51
+
52
+ ## Dataset Structure
53
+
54
+ ### Data Fields
55
+
56
+ Text: Short input sentence
57
+
58
+ Label: Question or Other
59
+
60
+ ### Data Splits
61
+
62
+ Question: 10K samples
63
+
64
+ Other: 10K samples
65
+
66
+
67
+ Training: 18K samples shuffled
68
+
69
+ Validation: 2K samples shuffled
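
The split described above can be sketched in a few lines of Python. This is a minimal sketch, not the author's actual script: the helper name `shuffle_and_split`, the fixed seed, and the placeholder samples are all assumptions made for illustration.

```python
import random

def shuffle_and_split(samples, n_validation, seed=0):
    """Shuffle (text, label) pairs and split off a validation set.

    Hypothetical helper: the seed and the split procedure are assumptions,
    the card only states the resulting sizes.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]

# Placeholder rows standing in for the 10K "question" and 10K "other" samples.
samples = [(f"question {i}", "question") for i in range(10_000)]
samples += [(f"other {i}", "other") for i in range(10_000)]

train, validation = shuffle_and_split(samples, n_validation=2_000)
print(len(train), len(validation))  # 18000 2000
```

Shuffling before splitting matters here because the two labels would otherwise sit in separate blocks, and a plain slice would put only one label into the validation set.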
70
+
71
+ ## Dataset Creation
72
+
73
+ ### Curation Rationale
74
+
75
+ Short, simple examples in basic language were chosen because they contain the same kinds of words and word order as long-winded questions with rare words.
76
+
77
+ The goal is also to detect questions in a chat room, a medium where people often use very short sentences and a lot of greetings or small talk.
78
+
79
+ ### Source Data
80
+
81
+ #### Initial Data Collection
82
+
83
+ https://github.com/zeloru/small-english-smalltalk-corpus
84
+
85
+ The data was scraped from ESL (English as a second language) learning material.
86
+
87
+ From this scraped data, only samples from certain conversations were kept, because some conversations had quality issues.
88
+
89
+ Because we also want to detect questions that are missing a question mark, most of the question samples had the question mark removed. The same was done for the "other" label, where a "." was removed from the end of the sentence. This was done so that these markers don't become the only feature the model looks at.
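
The punctuation removal described above could look roughly like the following. This is a sketch under assumptions: the function name `strip_end_marker` and the lowercase label strings are hypothetical, and the card does not say which fraction of samples was stripped, so the function simply handles one sample at a time.

```python
def strip_end_marker(text: str, label: str) -> str:
    """Remove a trailing "?" from questions or a trailing "." from other sentences.

    Assumption: labels are the strings "question" and "other".
    """
    marker = "?" if label == "question" else "."
    stripped = text.rstrip()
    if stripped.endswith(marker):
        # Drop exactly one marker character, then any space left before it.
        return stripped[:-1].rstrip()
    return stripped

print(strip_end_marker("Has anyone deployed X in Y?", "question"))
print(strip_end_marker("I am fine.", "other"))
```

Applying this to most but not all samples keeps some punctuated examples in the data, so a model sees both forms.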
90
+
91
+ ### Annotations
92
+
93
+ #### Annotation process
94
+
95
+ The "question" and "other" annotations were assigned automatically: questions in the conversations were labeled "question" and the answers were labeled "other".
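
As an illustration of this automatic labeling, here is a minimal sketch. It assumes each conversation is already available as a list of (question, answer) pairs, which is a simplification and not necessarily how the source corpus is stored. The name `label_turns` and the lowercase field names `text` and `label` are hypothetical as well.

```python
def label_turns(qa_pairs):
    """Turn (question, answer) pairs into labeled samples.

    Assumption: the conversation has already been paired up, so the first
    item of each pair is a question and the second is its answer.
    """
    samples = []
    for question, answer in qa_pairs:
        samples.append({"text": question, "label": "question"})
        samples.append({"text": answer, "label": "other"})
    return samples

# Made-up conversation used only to show the output shape.
conversation = [("How are you", "I am fine"), ("Where do you live", "I live in town")]
for sample in label_turns(conversation):
    print(sample)
```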
96
+
97
+ ## Considerations for Using the Model
98
+
99
+ ### Known Limitations
100
+
101
+ There seems to be an imbalance of greeting word combinations at the beginning of sentences. For example, "Hi, has anyone deployed X in Y" is wrongly not detected as a question because of the "Hi, " part. This issue will be addressed in updates.
102
+
103
+ Sentences in the form of "Wondering if 'question'..." and "I'm asking for help about 'question'..." are seemingly hard to detect and need more samples in the data.
104
+
105
+ Code fragments in input sentences are sometimes detected as questions. If code is present, it should probably be filtered out beforehand.
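
One possible way to filter code out beforehand is to drop Markdown-style code spans, which chat clients such as Slack and Discord use for code. This is only a heuristic sketch: the function name `strip_code` is hypothetical, and code pasted without backticks is not caught.

```python
import re

FENCE = "`" * 3  # three backticks, the Markdown fence used for code blocks in chat
FENCED_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]+`")

def strip_code(message: str) -> str:
    """Remove fenced code blocks and inline code spans from a chat message."""
    without_blocks = FENCED_BLOCK.sub(" ", message)
    without_inline = INLINE_CODE.sub(" ", without_blocks)
    # Collapse the whitespace left behind by the removed spans.
    return " ".join(without_inline.split())

print(strip_code("Does anyone know why `foo()` fails"))
```

Fenced blocks are removed first so that the inline pattern never has to deal with the backticks of a fence.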