---
license: cc-by-nc-4.0
---


# ICKG Model Card

## Model Details
ICKG (Integrated Contextual Knowledge Graph Generator) v3.2 is an instruction-following large language model (LLM) specialized for knowledge graph construction (KGC), fine-tuned from [Mistral 7B](https://arxiv.org/abs/2310.06825). It outperforms the earlier [ICKG v2.0](https://huggingface.co/victorlxh/ICKG-v2.0), which was fine-tuned from the Vicuna LLM.

- **Developed by**: [Xiaohui Victor Li](https://xiaohui-victor-li.github.io/)
- **Model type**: Auto-regressive language model based on the transformer architecture.
- **License**: CC BY-NC 4.0 (non-commercial)
- **Fine-tuned from model**: [Mistral-7B](https://arxiv.org/abs/2310.06825)

## Model Sources
- **Repository**: [https://github.com/xiaohui-victor-li/FinDKG](https://github.com/xiaohui-victor-li/FinDKG)
- **Website**: [https://xiaohui-victor-li.github.io/FinDKG/](https://xiaohui-victor-li.github.io/FinDKG/)
- **Paper**:
  - [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4608445](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4608445)
  - [https://arxiv.org/abs/2407.10909](https://arxiv.org/abs/2407.10909)

## Use Guidance

The primary use of the ICKG LLM is generating knowledge graphs (KGs) from text through instruction following with specialized prompts. It is intended for researchers, data scientists, and developers interested in natural language processing and knowledge graph construction.

- Generative Knowledge Graph Construction (KGC) is the process of using LLMs to systematically extract entities and relationships from textual data via given prompts, then assembling them into event triplets (see [Li (2023)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4608445) for details).

- Aspect-Based Sentiment Analysis (ABSA) represents a refined facet of sentiment analysis that specifically targets the sentiments associated with distinct aspects or attributes within a text. This granular approach is crucial for applications where understanding nuanced opinions about specific features is essential. ABSA not only discerns the overall sentiment of the text but also pinpoints and evaluates sentiments related to individual aspects mentioned within the document. 

- ⚠️ As a prerequisite, ensure the Hugging Face Python library `transformers` and the auxiliary [`peft`](https://github.com/huggingface/peft) library are installed.

- 📒 For optimal performance, running the LLM on a GPU server is recommended.


## How to Get Started with the Model

- **Python Code**: [https://github.com/xiaohui-victor-li/FinDKG](https://github.com/xiaohui-victor-li/FinDKG)

## Training Details
ICKG v3.2 is fine-tuned from Mistral-7B on ~5K instruction-following demonstrations, each pairing a KG construction input document with the extracted KG triplets as the response. ICKG thus learns to extract a list of KG triplets from a given text document via prompting. For more in-depth training details, refer to the "Generative Knowledge Graph Construction with Fine-tuned LLM" section of [the accompanying paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4608445).

- **Prompt Template**:

  [ Generative Knowledge Graph Construction ]: The entities and relationships can be customized for specific tasks. Replace `<input_text>` with the document text.

  ```
  From the provided document labeled as INPUT_TEXT, your task is to extract structured information from it in the form of triplet for constructing a knowledge graph. Each tuple should be in the form of ('h', 'type',  'r', 'o', 'type'), where 'h' stands for the head entity, 'r' for the relationship, and 'o' for the tail entity. The 'type' denotes the category of the corresponding entity. Do NOT include redundant triplets, NOT include triplets with relationship that occurs in the past.   

  Note that the entities should not be generic, numerical or temporal (like dates or percentages).  Entities must be classified into the following categories:
  ORG: Organizations other than government or regulatory bodies
  ORG/GOV: Government bodies (e.g., "United States Government")
  ORG/REG: Regulatory bodies (e.g., "Federal Reserve")
  PERSON: Individuals (e.g., "Elon Musk")
  GPE: Geopolitical entities such as countries, cities, etc. (e.g., "Germany")
  COMP: Companies (e.g., "Google")
  PRODUCT: Products or services (e.g., "iPhone")
  EVENT: Specific and Material Events (e.g., "Olympic Games", "Covid-19")
  SECTOR: Company sectors or industries (e.g., "Technology sector")
  ECON_INDICATOR: Economic indicators (e.g., "Inflation rate"), numerical value like "10%" is not a ECON_INDICATOR;
  FIN_INSTRUMENT: Financial and market instruments (e.g., "Stocks", "Global Markets")
  CONCEPT: Abstract ideas or notions or themes (e.g., "Inflation", "AI", "Climate Change")
  
  The relationships 'r' between these entities must be represented by one of the following relation verbs set: Has, Announce, Operate_In, Introduce, Produce, Control, Participates_In, Impact, Positive_Impact_On, Negative_Impact_On, Relate_To, Is_Member_Of, Invests_In, Raise, Decrease.
  
  Remember to conduct entity disambiguation, consolidating different phrases or acronyms that refer to the same entity (for instance,  "UK Central Bank", "BOE" and "Bank of England" should be unified as "Bank of England"). Simplify each entity of the triplet to be less than four words.  
  
  Your output should strictly be in a list format of triplets in the JSON list format of ('h', 'type', 'r', 'o', 'type'), where the relationship 'r' must be in the given relation verbs set above. Only output the list. 
  ===========================================================
  As an Example, consider the following news excerpt:
  'Apple Inc. is set to introduce the new iPhone 14 in the technology sector this month. The product's release is likely to positively impact Apple's stock value.'
  
  From this text, your output should be:
  [('Apple Inc.', 'COMP', 'Introduce', 'iPhone 14', 'PRODUCT'),
   ('Apple Inc.', 'COMP', 'Operate_In', 'Technology Sector', 'SECTOR'),
   ('iPhone 14', 'PRODUCT', 'Positive_Impact_On', 'Apple's Stock Value', 'FIN_INSTRUMENT')]
  
  INPUT_TEXT:
  <input_text>
  ```
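
The template above can be filled in and the model's reply parsed with a few lines of Python. This is a minimal sketch, not the author's pipeline: `build_kgc_prompt` and `parse_triplets` are hypothetical helper names, and the parser assumes the model emits single-quoted 5-tuples exactly as in the example. It uses a regular expression rather than `ast.literal_eval` because an entity such as 'Apple's Stock Value' contains an unescaped apostrophe, which is not a valid Python literal.

```python
from __future__ import annotations

import re

# Entity categories and relation verbs copied from the prompt template above.
ENTITY_TYPES = {
    "ORG", "ORG/GOV", "ORG/REG", "PERSON", "GPE", "COMP", "PRODUCT",
    "EVENT", "SECTOR", "ECON_INDICATOR", "FIN_INSTRUMENT", "CONCEPT",
}
RELATIONS = {
    "Has", "Announce", "Operate_In", "Introduce", "Produce", "Control",
    "Participates_In", "Impact", "Positive_Impact_On", "Negative_Impact_On",
    "Relate_To", "Is_Member_Of", "Invests_In", "Raise", "Decrease",
}

# Assumption: each tuple looks like ('h', 'type', 'r', 'o', 'type').
FIELD = r"'(.*?)'"
TRIPLET_RE = re.compile(r"\(\s*" + r"\s*,\s*".join([FIELD] * 5) + r"\s*\)")


def build_kgc_prompt(template: str, document: str) -> str:
    """Substitute the document for the <input_text> placeholder."""
    return template.replace("<input_text>", document.strip())


def parse_triplets(raw_output: str) -> list[tuple[str, str, str, str, str]]:
    """Keep only tuples whose entity types and relation are in the allowed sets."""
    triplets = []
    for head, head_type, rel, tail, tail_type in TRIPLET_RE.findall(raw_output):
        if (head_type in ENTITY_TYPES and tail_type in ENTITY_TYPES
                and rel in RELATIONS):
            triplets.append((head, head_type, rel, tail, tail_type))
    return triplets
```

Filtering against the allowed sets drops any tuple whose relation or entity type falls outside the template's vocabulary. If the entity categories or relation verbs are customized, the two sets need to be updated to match.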

  <br/>
  
  [ Aspect-Based Sentiment Analysis ]: Replace `<input_doc>` with the document text and `<input_ent_set>` with the list of aspects (entities) whose sentiment scores should be identified.

  ```
  Act as if you are a senior financial analyst, from the provided news article labeled as 'INPUT_TEXT', your task is to analyze and extract sentiment scores for specific key entities. These key entities are marked as 'KEY_ENTITY' in the text.
  You are required to evaluate the sentiment surrounding each of these key entities within the context of the transcript. The sentiment score should be a continuous value ranging from -1 (most negative) to +1 (most positive), with 0 representing a neutral sentiment. For each key entity, you will present the results in a JSON format where the entity name is the key, and the sentiment score is the value. Ensure the scores accurately reflect the sentiment expressed in the transcript concerning each key entity. ONLY output the JSON result.
  ========== Example ==============
  "Global markets experienced volatility this week, with tech stocks taking a significant hit due to rising interest rates. However, the energy sector showed resilience, buoyed by increasing oil prices. Meanwhile, consumer confidence remained neutral despite economic uncertainties."
  Key Entities: Tech Stocks, Energy Sector, Consumer Confidence
  Your formatted output should be: { "Tech Stocks": -0.8, "Energy Sector": 0.6, "Consumer Confidence": 0 }
  =================================
  INPUT_TEXT: <input_doc>
  KEY_ENT: <input_ent_set>
  ```
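
A similar sketch applies to the sentiment template above. The helper names `build_absa_prompt` and `parse_sentiment` are hypothetical, and joining the entities with commas is an assumption based on the "Key Entities" line in the example. The parser takes the first JSON object in the reply and keeps only numeric scores in the range -1 to +1 for the requested entities.

```python
from __future__ import annotations

import json
import re


def build_absa_prompt(template: str, document: str, entities: list[str]) -> str:
    """Fill both placeholders; entities are comma-joined (an assumption)."""
    return (template
            .replace("<input_doc>", document.strip())
            .replace("<input_ent_set>", ", ".join(entities)))


def parse_sentiment(raw_output: str, entities: list[str]) -> dict[str, float]:
    """Extract the first flat JSON object and keep valid scores for known entities."""
    match = re.search(r"\{.*?\}", raw_output, re.DOTALL)
    if match is None:
        return {}
    try:
        scores = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    wanted = set(entities)
    result = {}
    for name, score in scores.items():
        # bool is a subclass of int, so exclude it explicitly.
        if (name in wanted and isinstance(score, (int, float))
                and not isinstance(score, bool) and -1.0 <= score <= 1.0):
            result[name] = float(score)
    return result
```

Returning an empty dictionary on malformed output lets the caller decide whether to retry or skip the document.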

  

## Evaluation

ICKG-v3.2 has undergone preliminary evaluation against GPT-3.5, GPT-4, Vicuna-7B, the original Mistral-7B, and its earlier model variants (e.g., ICKG-v2.0). On the KG construction task, it outperforms GPT-3.5, Vicuna-7B, and Mistral-7B, while showing capability comparable to GPT-4. ICKG excels at generating instruction-based knowledge graphs, with a particular emphasis on quality and adherence to format.

For a more detailed introduction, refer to [the accompanying paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4608445).