# Weakly Supervised Domain Detection
This repository releases the code and data for weakly supervised domain detection. Please cite the following paper [[bib](https://www.mitpressjournals.org/action/showCitFormats?doi=10.1162/tacl_a_00287)] if you find our code or data useful:
> In this paper we introduce domain detection as a new natural language processing task. We argue that the ability to detect textual segments which are domain-heavy, i.e., sentences or phrases which are representative of and provide evidence for a given domain, would enable the development of domain-aware tools and increase the domain coverage of practical applications. We propose an encoder-detector framework for domain detection and bootstrap classifiers with multiple instance learning (MIL). The models are hierarchically organized and suited to multilabel classification. We demonstrate that despite learning from minimal supervision, our models can be applied to text spans of different granularities, languages, and genres. We also explore the potential of domain detection for text summarization.
Should you have any queries, please contact me at [[email protected]](mailto:[email protected]).
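To make the encoder-detector idea concrete, here is a minimal PyTorch sketch of MIL-style domain detection. It is illustrative only, not the released DetNet implementation: the GRU sentence encoder, all dimensions, the number of domains, and the max-pooling MIL aggregation are assumptions for exposition.

```python
# A minimal sketch (NOT the authors' DetNet) of an encoder-detector model
# for multilabel domain detection with multiple instance learning (MIL):
# sentences are encoded independently, a detector scores each sentence per
# domain, and document scores are the max over sentence scores, so only
# document-level labels are needed for training.
import torch
import torch.nn as nn

class EncoderDetector(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, num_domains=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Sentence encoder: bidirectional GRU over word embeddings (assumed).
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True,
                              bidirectional=True)
        # Detector: one logit per domain for each sentence.
        self.detector = nn.Linear(2 * hid_dim, num_domains)

    def forward(self, docs):
        # docs: (batch, num_sents, num_words) tensor of word ids.
        b, s, w = docs.shape
        emb = self.embed(docs.view(b * s, w))        # (b*s, w, emb_dim)
        _, h = self.encoder(emb)                     # h: (2, b*s, hid_dim)
        sent_repr = torch.cat([h[0], h[1]], dim=-1)  # (b*s, 2*hid_dim)
        sent_scores = torch.sigmoid(self.detector(sent_repr)).view(b, s, -1)
        # MIL aggregation: a document carries a domain if at least one of
        # its sentences does, i.e. max-pool over sentence scores.
        doc_scores = sent_scores.max(dim=1).values   # (batch, num_domains)
        return sent_scores, doc_scores

model = EncoderDetector(vocab_size=10000)
docs = torch.randint(1, 10000, (4, 12, 30))          # toy batch
sent_scores, doc_scores = model(docs)
loss = nn.functional.binary_cross_entropy(
    doc_scores, torch.rand(4, 7).round())            # document labels only
```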
## Project Structure
```bash
DomainDetection
│   README.md
│   spec-file.txt
│
├───src
│   ├───frame          # DetNet framework
│   │       encoder.py
│   │       detector.py
│   │       ...
│   ├───config         # configuration files
│   ├───data           # dataset parsing, building and piping
│   └───utils          # miscellaneous utils
├───dataset
│   ├───en             # English dataset
│   │       ...
│   └───zh             # Chinese dataset
│           ...
├───res                # resources (vocabulary)
│   └───vocab
│       ├───en         # English vocabulary
│       │       vocab
│       └───zh         # Chinese vocabulary
│               vocab
├───model              # trained models
│   ├───en             # English models
│   │       DetNet
│   │       ...
│   └───zh             # Chinese models
│           DetNet
│           ...
└───log
```
## Environment Setup
The `spec-file.txt` provided in this project lists the required packages.
To create a suitable environment conveniently with `conda`, run:
```bash
conda create --name myenv --file spec-file.txt
```
Alternatively, you can install the required packages into an existing environment:
```bash
conda install --name myenv --file spec-file.txt
```
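Once the environment is ready, activate it before running any of the code:

```bash
conda activate myenv
```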
## Dataset
You can download our datasets for both English and Chinese via [Google Drive](https://drive.google.com/drive/folders/1K5TdwoezGzzb19_2QjTuNipOX9kf1tUY?usp=sharing).
After uncompressing the `*.zip` files, put their contents under `dataset/en` and `dataset/zh`, respectively. These include data for model training, development, and testing. Note that `test` is for document-level testing, while `syn_docs` is for sentence-level testing with synthesized contexts (see the algorithm proposed in our paper for details).
The `*.json` files contain documents sampled from Wikipedia (in both `en` and `zh`) and NYT (in `en`); these documents were manually labeled via MTurk at both the sentence level and the word level for testing purposes.
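To inspect the data programmatically, a sketch like the following may help. It assumes each file is a JSON array of document objects; the filename and field names (`docs.json`, `sents`, `labels`) are hypothetical, so check the actual keys in the downloaded files.

```python
# Illustrative dataset inspection. The filename and the "sents"/"labels"
# keys below are assumptions, not the released schema; inspect the
# downloaded *.json files for the actual structure.
import json

with open("dataset/en/docs.json", encoding="utf-8") as f:
    docs = json.load(f)  # assuming a JSON array of labeled documents

doc = docs[0]
print(doc.get("labels"))         # document-level domain labels (assumed key)
print(doc.get("sents", [])[:2])  # first two sentences (assumed key)
```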