pdjohn committed · Commit 4c7688b · verified · 1 Parent(s): 121336b

Update README.md

Files changed (1): README.md (+31 −21)
README.md CHANGED
@@ -9,36 +9,46 @@ pipeline_tag: token-classification
 ---
 
 # C-EBERT
-
-C-EBERT is a multi-task fine-tuned German EuroBERT to extract causal attribution.
 
 ## Model details
-- **Model architecture**: EuroBERT-210m + token & relation heads
-- **Fine-tuned on**: environmental causal attribution corpus (German)
-- **Tasks**:
-  1. Token classification (BIO tags for INDICATOR / ENTITY)
-  2. Relation classification (CAUSE, EFFECT, INTERDEPENDENCY)
 
 ## Usage
 Find the custom [library](https://github.com/padjohn/causalbert). Once installed, run inference like so:
 ```python
-from transformers import AutoTokenizer
-from causalbert.infer import load_model, analyze_sentence_with_confidence
 
 model, tokenizer, config, device = load_model("pdjohn/C-EBERT")
-result = analyze_sentence_with_confidence(
-    model, tokenizer, config, "Autoverkehr verursacht Bienensterben.", []
-)
-```
 
-## Training
-- **Base model**: `EuroBERT/EuroBERT-210m`
-- **Epochs**: 3, **LR**: 2e-5, **Batch size**: 8
-- See [train.py](https://github.com/padjohn/causalbert/blob/main/causalbert/train.py) for details.
 
-## Limitations
-- German only.
-- Sentence-level; doesn’t handle cross-sentence causality.
-- Relation classification depends on detected spans — errors in token tagging propagate.
 ---
 
 # C-EBERT
+A multi-task model to extract **causal attribution** from German texts.
 
 
 ## Model details
+- **Model architecture**: [EuroBERT-210m](https://huggingface.co/EuroBERT/EuroBERT-210m) with two custom classification heads (one for token spans and one for relations).
+- **Fine-tuned on**: a custom corpus focused on environmental causal attribution in German.
+| Task | Output Type | Labels / Classes |
+| :--- | :--- | :--- |
+| **1. Token Classification** | Sequence Labeling (BIO) | **5 Span Labels** (O, B-INDICATOR, I-INDICATOR, B-ENTITY, I-ENTITY) |
+| **2. Relation Classification** | Sentence-Pair Classification | **14 Relation Labels** (e.g., MONO\_POS\_CAUSE, DIST\_NEG\_EFFECT, INTERDEPENDENCY, NO\_RELATION) |
 
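The five BIO span labels determine how predicted tags are grouped into INDICATOR and ENTITY spans. A minimal decoding sketch (a hypothetical helper for illustration, not part of the causalbert API):

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (label, tokens) spans."""
    spans = []
    current = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [token])   # start a new span
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(token)       # continue the open span
        else:
            current = None                 # "O" or a dangling I- tag closes the span
    return spans

tokens = ["Autoverkehr", "verursacht", "Bienensterben", "."]
tags = ["B-ENTITY", "B-INDICATOR", "B-ENTITY", "O"]
print(bio_to_spans(tokens, tags))
# → [('ENTITY', ['Autoverkehr']), ('INDICATOR', ['verursacht']), ('ENTITY', ['Bienensterben'])]
```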
 ## Usage
 Find the custom [library](https://github.com/padjohn/causalbert). Once installed, run inference like so:
 ```python
+from causalbert.infer import load_model, sentence_analysis
+
+# NOTE: The model path accepts either a local directory or a Hugging Face Hub ID.
 model, tokenizer, config, device = load_model("pdjohn/C-EBERT")
+
+# Analyze a batch of sentences
+sentences = ["Autoverkehr verursacht Bienensterben.", "Lärm ist der Grund für Stress."]
+
+all_results = sentence_analysis(
+    model,
+    tokenizer,
+    config,
+    sentences,
+    batch_size=8
+)
+
+# The result is a list of dictionaries containing token_predictions and derived_relations.
+print(all_results[0]['derived_relations'])
+# Example output:
+# [(['Autoverkehr', 'verursacht'], ['Bienensterben']), {'label': 'MONO_POS_CAUSE', 'confidence': 0.954}]
+```
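Downstream code will typically filter `derived_relations` by confidence. A minimal sketch, assuming each relation pairs its spans with the label/confidence dict shown in the example output above (that pairing shape is an assumption, not a documented guarantee):

```python
def confident_relations(derived_relations, threshold=0.9):
    """Keep only relations whose confidence clears the threshold."""
    kept = []
    for spans, info in derived_relations:
        if info["confidence"] >= threshold:
            kept.append((spans, info["label"]))
    return kept

# Sample data mirroring the example output above (second entry is invented).
sample = [
    ((["Autoverkehr", "verursacht"], ["Bienensterben"]),
     {"label": "MONO_POS_CAUSE", "confidence": 0.954}),
    ((["Lärm"], ["Stress"]),
     {"label": "MONO_POS_CAUSE", "confidence": 0.41}),
]
print(confident_relations(sample))
# → [((['Autoverkehr', 'verursacht'], ['Bienensterben']), 'MONO_POS_CAUSE')]
```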
 
+## Training
+- Base model: EuroBERT/EuroBERT-210m
+- Training parameters (approx.):
+  - Epochs: 8
+  - Learning rate: 1e-4
+  - Batch size: 32
+  - PEFT/LoRA: enabled with r = 16
+See [train.py](https://github.com/padjohn/causalbert/blob/main/causalbert/train.py) for the full configuration details.
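The LoRA line above could be expressed with Hugging Face's PEFT library roughly as follows. This is a sketch: only `r = 16` comes from this README, while `lora_alpha`, `lora_dropout`, and `target_modules` are illustrative assumptions that depend on the base model's actual layer names (see train.py for the real configuration):

```python
from peft import LoraConfig, get_peft_model

# Sketch of a LoRA setup matching the parameters listed above.
lora_config = LoraConfig(
    r=16,                                  # stated in this README
    lora_alpha=32,                         # assumption
    lora_dropout=0.05,                     # assumption
    target_modules=["q_proj", "v_proj"],   # assumption: attention projections
)
# Wrap the base model so only the low-rank adapter weights are trained:
# model = get_peft_model(base_model, lora_config)
```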