Commit e52adc8 (1 parent: ed1c474), committed by scasps

Upload README.md

Files changed (1)
  1. README.md +103 -0
README.md ADDED
@@ -0,0 +1,103 @@
+ ---
+ annotations_creators:
+ - expert-generated
+ language:
+ - en
+ language_creators:
+ - found
+ license: []
+ multilinguality:
+ - monolingual
+ pretty_name: message-classification
+ size_categories:
+ - n=10K
+ source_datasets:
+ - https://github.com/zeloru/small-english-smalltalk-corpus
+ tags: []
+ task_categories:
+ - text-classification
+ task_ids:
+ - semantic-similarity-scoring
+ ---
+
+ ## Table of Contents
+ - [Table of Contents](#table-of-contents)
+ - [Description](#description)
+   - [Summary](#summary)
+   - [Languages](#languages)
+ - [Dataset Structure](#dataset-structure)
+   - [Data Fields](#data-fields)
+   - [Data Splits](#data-splits)
+ - [Dataset Creation](#dataset-creation)
+   - [Curation Rationale](#curation-rationale)
+   - [Source Data](#source-data)
+   - [Annotations](#annotations)
+ - [Considerations for Using the Model](#considerations-for-using-the-model)
+   - [Known Limitations](#known-limitations)
+
+ ## Description
+ ### Summary
+ https://ukatie.com
+
+ This model triggers chatbots to respond in chat rooms such as Slack, Microsoft Teams, Discord and Matrix by detecting whether a user's message is a question or just a comment.
+
+ It is also used to identify the questions and their context within an input sentence.
+
+ ### Languages
+
+ So far, English is the only supported language.
+
+ ## Dataset Structure
+
+ ### Data Fields
+
+ - Text: a short input sentence
+ - Label: Question or Other
+
+ ### Data Splits
+
+ Samples per label:
+
+ - Question: 10K samples
+ - Other: 10K samples
+
+ Samples per split, after shuffling:
+
+ - Training: 18K samples
+ - Validation: 2K samples
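
The split described above can be reproduced roughly as follows. This is only a minimal sketch: the function name `make_splits`, the fixed seed, and the placeholder sentences are assumptions for illustration, and the card does not say how the original shuffle was done.

```python
import random

def make_splits(questions, others, n_validation=2_000, seed=0):
    """Label, shuffle, and split samples into training and validation sets."""
    # Pair every sentence with its label, mirroring the Text/Label fields.
    rows = [{"Text": t, "Label": "Question"} for t in questions]
    rows += [{"Text": t, "Label": "Other"} for t in others]
    # Shuffle with a fixed seed (an assumption) so the split is reproducible.
    random.Random(seed).shuffle(rows)
    # The first n_validation rows become validation, the rest training.
    return rows[n_validation:], rows[:n_validation]

# Placeholder stand-ins for the 10K + 10K sentences, not real data.
questions = [f"question {i}" for i in range(10_000)]
others = [f"other {i}" for i in range(10_000)]
train, validation = make_splits(questions, others)
print(len(train), len(validation))
```

Shuffling before splitting keeps both labels mixed in each split, although it does not guarantee an exact 50/50 balance in the validation set.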
+
+ ## Dataset Creation
+
+ ### Curation Rationale
+
+ Simple, short and basic language examples were chosen because they contain the same kinds of words and word placements as long-winded questions with rare words.
+
+ The goal is also to detect questions in a chat room, a medium where people often use very short sentences and a lot of greetings or small talk.
+
+ ### Source Data
+
+ #### Initial Data Collection
+
+ https://github.com/zeloru/small-english-smalltalk-corpus
+
+ The corpus consists of data scraped from ESL (English as a second language) learning material.
+
+ Of the already scraped data, only samples from certain conversations were used, because some conversations had quality issues.
+
+ Because we also want to detect questions that are missing a question mark, the question mark was removed from most of the samples. The same was done for the "other" label, where the "." was removed from the end of the sentence. This was done so that these markers don't become the only feature the model looks at.
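
The punctuation removal described above could look roughly like this. It is a sketch under assumptions: the card only says "most" samples were changed, so the helper name `strip_end_mark` and the `strip_prob=0.9` default are hypothetical, not the values actually used.

```python
import random

def strip_end_mark(text, label, rng, strip_prob=0.9):
    """Drop the trailing "?" (Question) or "." (Other) from most samples."""
    # Pick the end mark that would give the label away.
    mark = "?" if label == "Question" else "."
    text = text.rstrip()
    # strip_prob is an assumed fraction; a few samples keep their mark.
    if text.endswith(mark) and rng.random() < strip_prob:
        return text[:-1].rstrip()
    return text

rng = random.Random(0)
print(strip_end_mark("Where are you from?", "Question", rng, strip_prob=1.0))
print(strip_end_mark("I am from Canada.", "Other", rng, strip_prob=1.0))
```

Leaving the mark on a small share of samples lets the model still see punctuated input without relying on it as the only cue.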
+
+ ### Annotations
+
+ #### Annotation process
+
+ The "question" and "other" annotations were created automatically by labeling the questions in the conversations as "question" and the answers as "other".
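
As an illustration of this kind of automatic labeling, the sketch below tags each conversation turn. The card does not state how questions were told apart from answers, so using a trailing "?" as the signal is purely an assumption, as is the function name `label_turns`.

```python
def label_turns(turns):
    """Label each conversation turn as "Question" or "Other"."""
    labeled = []
    for turn in turns:
        text = turn.strip()
        if not text:
            continue  # skip empty lines in the scraped conversation
        # Assumption: a trailing "?" marks a question in the source material.
        label = "Question" if text.endswith("?") else "Other"
        labeled.append((text, label))
    return labeled

conversation = ["How are you?", "I'm fine, thanks.", "Where do you live?", "In Berlin."]
for text, label in label_turns(conversation):
    print(label, "|", text)
```

Labeling has to run before any end marks are removed, otherwise the signal it depends on is already gone.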
+
+ ## Considerations for Using the Model
+
+ ### Known Limitations
+
+ There seems to be an imbalance of greeting word combinations at the beginning of sentences. For example, "Hi, has anyone deployed X in Y" is not detected as a question because of the "Hi, " part. This issue will be addressed in updates.
+
+ Sentences of the form "Wondering if 'question'..." and "I'm asking for help about 'question'..." seem hard to detect and need more samples in the data.
+
+ Code fragments in input sentences are sometimes detected as questions. If code is present, it should probably be filtered out beforehand.
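
One possible way to filter code out beforehand is sketched below. It assumes code is wrapped in backticks, the Markdown-style convention that Slack and Discord use; the helper name `strip_code` is hypothetical, and unmarked code would need a different heuristic.

```python
import re

# Triple-backtick blocks, which may span several lines.
FENCED = re.compile(r"`{3}.*?`{3}", re.DOTALL)
# Single-backtick inline snippets on one line.
INLINE = re.compile(r"`[^`\n]+`")

def strip_code(message):
    """Remove backtick-delimited code before the text is classified."""
    message = FENCED.sub(" ", message)
    message = INLINE.sub(" ", message)
    # Collapse the whitespace left behind by the removed spans.
    return " ".join(message.split())

print(strip_code("Has anyone seen this error `KeyError: 'x'` before"))
```

Fenced blocks are removed first so that their backticks are not mistaken for inline snippets.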