Bostoncake committed on
Commit
227f71f
·
1 Parent(s): b160990

Initial commit

README.md CHANGED
@@ -1,13 +1,34 @@
- ---
- title: ChatAssistant
- emoji: 🐠
- colorFrom: purple
- colorTo: pink
- sdk: gradio
- sdk_version: 3.24.1
- app_file: app.py
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # ChatGPT Paper Reading Assistant
+
+
+ ## Usage
+ 1. Fill in your OpenAI API key in `apikey.ini` ([How to get an API key](https://chatgpt.cn.obiscr.com/blog/posts/2023/How-to-get-api-key/)).
+
+ 2. Set up a network proxy in one of the following ways:
+
+    - Use a VPN with global proxy mode enabled;
+
+    - Use proxy software that supports SOCKS5 and point the terminal at it (the commands below use Windows `cmd` syntax; in bash, use `export` instead of `set`):
+
+    ```bash
+    set http_proxy=http://127.0.0.1:<PORT>
+    set https_proxy=http://127.0.0.1:<PORT>
+    ```
+
+ 3. Create a virtual environment and install the dependencies from a mirror in mainland China:
+
+    ```
+    conda create -n chatgpt python=3.8
+    conda activate chatgpt
+    pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
+    ```
+
+ 4. Get reading assistance for a local paper: run `chat_assistant.py` and specify the paper path and the research fields
+
+    ```bash
+    python chat_assistant.py --paper_path "paper/FedSR - A Simple and Effective Domain Generalization Method for Federated Learning.pdf" --research_fields "computer science, artificial intelligence and transfer learning"
+    ```
+
+ ## Credits
+ - The code that extracts paper content from PDF files is adapted from [kaixindelele/ChatPaper](https://github.com/kaixindelele/ChatPaper);
+ - The code that sends requests to and receives responses from the OpenAI API is adapted from [nishiwen1214/ChatReviewer](https://github.com/nishiwen1214/ChatReviewer).
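As an alternative to setting the proxy in the terminal, the same two variables can be set from Python before any request is made. This is a minimal sketch, assuming the HTTP client used by the `openai` package honors the `http_proxy` and `https_proxy` environment variables; the helper name `proxy_env` and the port `7890` are hypothetical placeholders, not part of this repository.

```python
import os


def proxy_env(port, host="127.0.0.1"):
    """Build the two proxy variables for a local proxy listening on `port`."""
    url = f"http://{host}:{port}"
    return {"http_proxy": url, "https_proxy": url}


# 7890 is a placeholder; use the port your proxy software actually listens on.
os.environ.update(proxy_env(7890))
print(os.environ["http_proxy"], os.environ["https_proxy"])
```

Setting the variables inside the process only affects that process, so it leaves the rest of the terminal session untouched.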
apikey.ini ADDED
@@ -0,0 +1,3 @@
+ [OpenAI]
+ OPENAI_API_KEYS = [sk-XXX, ]
+
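The bracketed, comma-separated value in `apikey.ini` is not a native INI list, so the reader has to split it by hand. The sketch below follows the same approach as `chat_assistant.py` (drop the brackets, split on commas, discard short fragments); the function name `load_api_keys` and the sample keys are illustrative assumptions.

```python
import configparser


def load_api_keys(ini_text):
    """Parse the OPENAI_API_KEYS list from apikey.ini-style text."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    raw = config.get("OpenAI", "OPENAI_API_KEYS")
    # Drop the surrounding brackets and any quotes, then split on commas.
    entries = raw.strip()[1:-1].replace("'", "").split(",")
    # Strip whitespace first so the length filter sees the bare key.
    keys = [entry.strip() for entry in entries]
    # Fragments of five characters or fewer (e.g. after a trailing comma) are dropped.
    return [key for key in keys if len(key) > 5]


sample = "[OpenAI]\nOPENAI_API_KEYS = [sk-XXXXXX, sk-YYYYYY, ]\n"
print(load_api_keys(sample))
```

A trailing comma, as in the committed template, is harmless because the empty fragment it produces is filtered out.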
app.py ADDED
@@ -0,0 +1,166 @@
+ import numpy as np
+ import os
+ import re
+ from io import BytesIO
+ import datetime
+ import time
+ import openai, tenacity
+ import argparse
+ import configparser
+ import json
+ import tiktoken
+ import PyPDF2
+ import gradio
+
+ # Define the Reviewer class
+ class Reviewer:
+     # Initializer: store the settings as attributes
+     def __init__(self, api, review_format, paper_pdf, language):
+         self.api = api
+         self.review_format = review_format
+
+         self.language = language
+         self.paper_pdf = paper_pdf
+         self.max_token_num = 4097
+         self.encoding = tiktoken.get_encoding("gpt2")
+
+
+     def review_by_chatgpt(self, paper_list):
+         # paper_list is unused; the PDF bytes stored on the instance are reviewed
+         text = self.extract_chapter(self.paper_pdf)
+         chat_review_text, total_token_used = self.chat_review(text=text)
+         return chat_review_text, total_token_used
+
+
+
+     @tenacity.retry(wait=tenacity.wait_exponential(multiplier=1, min=4, max=10),
+                     stop=tenacity.stop_after_attempt(5),
+                     reraise=True)
+     def chat_review(self, text):
+         openai.api_key = self.api   # set the API key
+         review_prompt_token = 1000
+         text_token = len(self.encoding.encode(text))
+         # Truncate the text in proportion to the tokens left after reserving room for the prompt
+         input_text_index = int(len(text)*(self.max_token_num-review_prompt_token)/(text_token+1))
+         input_text = "This is the paper for your review:" + text[:input_text_index]
+         messages=[
+             {"role": "system", "content": "You are a professional reviewer. Now I will give you a paper. You need to give a complete review opinion according to the following requirements and format:"+ self.review_format +" Must be output in {}.".format(self.language)},
+             {"role": "user", "content": input_text},
+         ]
+
+         response = openai.ChatCompletion.create(
+             model="gpt-3.5-turbo",
+             messages=messages,
+         )
+         result = ''
+         for choice in response.choices:
+             result += choice.message.content
+         print("********"*10)
+         print(result)
+         print("********"*10)
+         print("prompt_token_used:", response.usage.prompt_tokens)
+         print("completion_token_used:", response.usage.completion_tokens)
+         print("total_token_used:", response.usage.total_tokens)
+         print("response_time:", response.response_ms/1000.0, 's')
+         return result, response.usage.total_tokens
+
+     def extract_chapter(self, pdf_path):
+         # pdf_path holds the uploaded PDF as bytes
+         file_object = BytesIO(pdf_path)
+         # Create a PDF reader object
+         pdf_reader = PyPDF2.PdfReader(file_object)
+         # Get the total number of pages in the PDF
+         num_pages = len(pdf_reader.pages)
+         # Initialize the extraction state and the extracted text
+         extraction_started = False
+         extracted_text = ""
+         # Iterate over every page in the PDF
+         for page_number in range(num_pages):
+             page = pdf_reader.pages[page_number]
+             page_text = page.extract_text()
+
+             # Start extracting at the first page that mentions "Abstract"
+             if 'Abstract'.lower() in page_text.lower() and not extraction_started:
+                 extraction_started = True
+                 page_number_start = page_number
+             # Once extraction has started, append the page text
+             if extraction_started:
+                 extracted_text += page_text
+                 # Stop after the start page and the two pages that follow it
+                 if page_number_start + 1 < page_number:
+                     break
+         return extracted_text
+
+ def main(api, review_format, paper_pdf, language):
+     start_time = time.time()
+     if not api or not review_format or not paper_pdf:
+         # Return one value per output component
+         return "Please fill in all fields!", ""
+     else:
+         # Create a Reviewer object
+         reviewer1 = Reviewer(api, review_format, paper_pdf, language)
+         # Review the uploaded PDF (paper_pdf holds the file's bytes)
+         comments, total_token_used = reviewer1.review_by_chatgpt(paper_list=paper_pdf)
+     time_used = time.time() - start_time
+     output2 = "Tokens used: " + str(total_token_used) + "\nTime taken: " + str(round(time_used, 2)) + " s"
+     return comments, output2
+
+
+
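+ ########################################################################################################
+ # Title
+ title = "🤖ChatReviewer🤖"
+ # Description
+
+ description = '''<div align='left'>
+ <img align='right' src='http://i.imgtg.com/2023/03/22/94PLN.png' width="270">
+ <strong>ChatReviewer is an AI assistant for automatic paper review, built on the ChatGPT-3.5 API.</strong> Its uses are as follows:
+ ⭐️Quickly summarize and review papers, helping researchers read and understand the literature more efficiently and keep up with the research frontier.
+ ⭐️Review your own paper and use the comments generated by ChatReviewer to find and fix gaps, further improving its quality.
+ ⭐️Assist with peer review by offering reference opinions, improving review efficiency and quality. (🈲: Do not copy the generated comments directly into any paper review!)
+ If the app is slow, click "Duplicate this Space" in the upper-right corner to copy ChatReviewer into your own Space!
+ The project is on [Github](https://github.com/nishiwen1214/ChatReviewer); Stars and Forks are welcome, and so is sponsorship to help the project grow quickly!💗([Get an API key](https://chatgpt.cn.obiscr.com/blog/posts/2023/How-to-get-api-key/))
+ </div>
+ '''
+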
+ # Build the Gradio interface
+ inp = [gradio.inputs.Textbox(label="Enter your API key (a string starting with sk)",
+                              default="",
+                              type='password'),
+        gradio.inputs.Textbox(lines=5,
+                              label="Enter specific review requirements and format (otherwise the default format is used)",
+                              default="""* Overall Review
+ Please briefly summarize the main points and contributions of this paper.
+ xxx
+ * Paper Strength
+ Please provide a list of the strengths of this paper, including but not limited to: innovative and practical methodology, insightful empirical findings or in-depth theoretical analysis,
+ well-structured review of relevant literature, and any other factors that may make the paper valuable to readers. (Maximum length: 2,000 characters)
+ (1) xxx
+ (2) xxx
+ (3) xxx
+ * Paper Weakness
+ Please provide a numbered list of your main concerns regarding this paper (so authors could respond to the concerns individually).
+ These may include, but are not limited to: inadequate implementation details for reproducing the study, limited evaluation and ablation studies for the proposed method,
+ correctness of the theoretical analysis or experimental results, lack of comparisons or discussions with widely-known baselines in the field, lack of clarity in exposition,
+ or any other factors that may impede the reader's understanding or benefit from the paper. Please kindly refrain from providing a general assessment of the paper's novelty without providing detailed explanations. (Maximum length: 2,000 characters)
+ (1) xxx
+ (2) xxx
+ (3) xxx
+ * Questions To Authors And Suggestions For Rebuttal
+ Please provide a numbered list of specific and clear questions that pertain to the details of the proposed method, evaluation setting, or additional results that would aid in supporting the authors' claims.
+ The questions should be formulated in a manner that, after the authors have answered them during the rebuttal, it would enable a more thorough assessment of the paper's quality. (Maximum length: 2,000 characters)
+ *Overall score (1-10)
+ The paper is scored on a scale of 1-10, with 10 being the full mark, and 6 stands for borderline accept. Then give the reason for your rating.
+ xxx"""
+                              ),
+        gradio.inputs.File(label="Upload the paper PDF (required)", type="bytes"),
+        gradio.inputs.Radio(choices=["English", "Chinese"],
+                            default="English",
+                            label="Select the output language"),
+ ]
+
+ chat_reviewer_gui = gradio.Interface(fn=main,
+                                      inputs=inp,
+                                      outputs=[gradio.Textbox(lines=25, label="Review result"), gradio.Textbox(lines=2, label="Resource usage")],
+                                      title=title,
+                                      description=description)
+
+ # Start server
+ chat_reviewer_gui.launch(quiet=True, show_api=False)
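The truncation step in `chat_review` can be looked at in isolation. The sketch below reproduces the proportional cut (characters scaled by the ratio of token budget to token count) as a standalone function. To stay self-contained it uses a hypothetical whitespace word counter in place of the `tiktoken` encoder, so the numbers are illustrative only and the function names are assumptions.

```python
def truncate_to_token_budget(text, text_tokens, max_tokens=4097, reserved_tokens=1000):
    """Cut `text` so that it roughly fits the tokens left after reserving room for the prompt."""
    budget = max_tokens - reserved_tokens
    # Scale the character count by budget / token count; the +1 avoids division by zero.
    cut = int(len(text) * budget / (text_tokens + 1))
    return text[:cut]


def count_tokens_roughly(text):
    # Hypothetical stand-in for len(encoding.encode(text)): one token per word.
    return len(text.split())


paper = "word " * 10000
short = truncate_to_token_budget(paper, count_tokens_roughly(paper))
print(len(paper), len(short), count_tokens_roughly(short))
```

Because the cut is made on characters rather than tokens, the result only approximates the budget; text whose token density varies a lot along its length can still land over or under it.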
chat_assistant.py ADDED
@@ -0,0 +1,208 @@
+ import numpy as np
+ import os
+ import re
+ import datetime
+ import time
+ import openai, tenacity
+ import argparse
+ import configparser
+ import json
+ import tiktoken
+ from get_paper_from_pdf import Paper
+
+ class Assistant:
+     def __init__(self, args=None):
+         # Map the language code to the name used in the prompt (defaults to Chinese)
+         if args.language == 'en':
+             self.language = 'English'
+         elif args.language == 'zh':
+             self.language = 'Chinese'
+         else:
+             self.language = 'Chinese'
+         # Read the API key list from apikey.ini
+         self.config = configparser.ConfigParser()
+         self.config.read('apikey.ini')
+         self.chat_api_list = self.config.get('OpenAI', 'OPENAI_API_KEYS')[1:-1].replace('\'', '').split(',')
+         self.chat_api_list = [api.strip() for api in self.chat_api_list if len(api) > 5]
+         self.cur_api = 0
+         self.file_format = args.file_format
+         self.max_token_num = 4096
+         self.encoding = tiktoken.get_encoding("gpt2")
+         self.result_backup = ''
+
+     def validateTitle(self, title):
+         # Replace characters that are not allowed in file names with underscores
+         rstr = r"[\/\\\:\*\?\"\<\>\|]"
+         new_title = re.sub(rstr, "_", title)
+         return new_title
+
+
+     def assist_reading_by_chatgpt(self, paper_list):
+         htmls = []
+         for paper_index, paper in enumerate(paper_list):
+             sections_of_interest = self.extract_paper(paper)
+             # extract the essential parts of the paper
+             text = ''
+             text += 'Title:' + paper.title + '. '
+             text += 'Abstract: ' + paper.section_texts['Abstract']
+             intro_title = next((item for item in paper.section_names if 'ntroduction' in item.lower()), None)
+             if intro_title is not None:
+                 text += 'Introduction: ' + paper.section_texts[intro_title]
+             # Similar for conclusion section
+             conclusion_title = next((item for item in paper.section_names if 'onclusion' in item), None)
+             if conclusion_title is not None:
+                 text += 'Conclusion: ' + paper.section_texts[conclusion_title]
+             for heading in sections_of_interest:
+                 # Strip stray whitespace so the heading can match a parsed section name
+                 heading = heading.strip()
+                 if heading in paper.section_names:
+                     text += heading + ': ' + paper.section_texts[heading]
+             chat_review_text = self.chat_assist(text=text)
+             htmls.append('## Paper:' + str(paper_index+1))
+             htmls.append('\n\n\n')
+             htmls.append(chat_review_text)
+
+             # Save the questions and answers
+             date_str = str(datetime.datetime.now())[:19].replace(' ', '-').replace(':', '-')
+             export_path = os.path.join('./', 'output_file')
+             os.makedirs(export_path, exist_ok=True)
+             mode = 'w' if paper_index == 0 else 'a'
+             file_name = os.path.join(export_path, date_str+'-'+self.validateTitle(paper.title)+"."+self.file_format)
+             self.export_to_markdown("\n".join(htmls), file_name=file_name, mode=mode)
+             htmls = []
+
+
+     def extract_paper(self, paper):
+         htmls = []
+         text = ''
+         text += 'Title: ' + paper.title + '. '
+         text += 'Abstract: ' + paper.section_texts['Abstract']
+         text_token = len(self.encoding.encode(text))
+         # Truncate the title and abstract if they exceed the token budget for this request
+         if text_token > self.max_token_num/2 - 800:
+             input_text_index = int(len(text)*((self.max_token_num/2)-800)/text_token)
+             text = text[:input_text_index]
+         openai.api_key = self.chat_api_list[self.cur_api]
+         # Rotate to the next API key, wrapping around at the end of the list
+         self.cur_api = (self.cur_api + 1) % len(self.chat_api_list)
+         print("\n\n"+"********"*10)
+         print("Extracting content from PDF.")
+         print("********"*10)
+         messages = [
+             {"role": "system",
+              "content": f"You are a professional researcher in the field of {args.research_fields}. You are the mentor of a student who is new to this field. "
+                         f"I will give you a paper. You need to help your student to read this paper by instructing him to read the important sections in this paper and answer his questions towards these sections. "
+                         f"Due to the length limitations, I am only allowed to provide you the abstract, introduction, conclusion and at most two sections of this paper. "
+                         f"Now I will give you the title and abstract and the headings of potential sections. "
+                         f"You need to reply with at most two headings. Then I will further provide you the full information, including the aforementioned sections and at most two sections you called for.\n\n"
+                         f"Title: {paper.title}\n\n"
+                         f"Abstract: {paper.section_texts['Abstract']}\n\n"
+                         f"Potential Sections: {paper.section_names[2:-1]}\n\n"
+                         f"Follow the following format to output your choice of sections: "
+                         f"{{chosen section 1}}, {{chosen section 2}}\n\n"},
+             {"role": "user", "content": text},
+         ]
+         response = openai.ChatCompletion.create(
+             model="gpt-3.5-turbo",
+             messages=messages,
+         )
+         result = ''
+         for choice in response.choices:
+             result += choice.message.content
+         print("\n\n"+"********"*10)
+         print("Important sections of this paper:")
+         print(result)
+         print("********"*10)
+         print("prompt_token_used:", response.usage.prompt_tokens)
+         print("completion_token_used:", response.usage.completion_tokens)
+         print("total_token_used:", response.usage.total_tokens)
+         print("response_time:", response.response_ms/1000.0, 's')
+         # Strip whitespace so each heading can be matched against the parsed section names
+         return [heading.strip() for heading in result.split(',')]
+
+     @tenacity.retry(wait=tenacity.wait_exponential(multiplier=1, min=4, max=10),
+                     stop=tenacity.stop_after_attempt(5),
+                     reraise=True)
+     def chat_assist(self, text):
+         openai.api_key = self.chat_api_list[self.cur_api]
+         # Rotate to the next API key, wrapping around at the end of the list
+         self.cur_api = (self.cur_api + 1) % len(self.chat_api_list)
+         review_prompt_token = 1000
+         text_token = len(self.encoding.encode(text))
+         # Truncate the text in proportion to the remaining token budget (guard against empty text)
+         input_text_index = int(len(text)*(self.max_token_num-review_prompt_token)/max(text_token, 1))
+         input_text = "This is the paper for your review:" + text[:input_text_index] + "\n\n"
+         input_text_backup = input_text
+         while True:
+             print("\n\n"+"********"*10)
+             print("Ask ChatGPT questions of the important sections. Type \"quit\" to exit the program. To receive better responses, please describe why you ask the question.\nFor example, ask \"Why does the author use residual connections? I want to know how the residual connections work in the model structure.\" instead of \"Why does the author use residual connections?\"")
+             print("********"*10)
+             student_question = input()
+             if student_question == "quit":
+                 break
+             # Each question is sent with the paper text only; earlier questions are not included
+             input_text = input_text_backup
+             input_text = input_text + "The question from your student is: " + student_question
+             messages=[
+                 {"role": "system", "content": "You are a professional researcher in the field of "+args.research_fields+". You are the mentor of a student who is new to this field. Now I will give you a paper. You need to help your student to read this paper by instructing him to read the important sections in this paper and answer his questions towards these sections. Please answer in {}.".format(self.language)},
+                 {"role": "user", "content": input_text},
+             ]
+
+             response = openai.ChatCompletion.create(
+                 model="gpt-3.5-turbo",
+                 messages=messages,
+             )
+             result = ''
+             for choice in response.choices:
+                 result += choice.message.content
+             # Keep a running transcript of questions and answers for export
+             self.result_backup = self.result_backup + "\n\n" + student_question + "\n"
+             self.result_backup += result
+             print("\n\n"+"********"*10)
+             print(result)
+             print("********"*10)
+             print("prompt_token_used:", response.usage.prompt_tokens)
+             print("completion_token_used:", response.usage.completion_tokens)
+             print("total_token_used:", response.usage.total_tokens)
+             print("response_time:", response.response_ms/1000.0, 's')
+         return self.result_backup
+
+     def export_to_markdown(self, text, file_name, mode='w'):
+         # The text is written as-is; conversion to HTML with the markdown module is disabled
+         # html = markdown.markdown(text)
+         # Open the file in the given mode ('w' to overwrite, 'a' to append)
+         with open(file_name, mode, encoding="utf-8") as f:
+             # Write the text to the file
+             f.write(text)
+
+ def main(args):
+
+     # Paper reading assistant instructions
+     print("********"*10)
+     print("Extracting content from PDF.")
+     print("********"*10)
+
+
+     assistant1 = Assistant(args=args)
+     # Check whether the argument is a single PDF file or a directory:
+     paper_list = []
+     if args.paper_path.endswith(".pdf"):
+         paper_list.append(Paper(path=args.paper_path))
+     else:
+         for root, dirs, files in os.walk(args.paper_path):
+             print("root:", root, "dirs:", dirs, 'files:', files)  # current directory path
+             for filename in files:
+                 # If a PDF file is found, parse it and add it to the paper list
+                 if filename.endswith(".pdf"):
+                     paper_list.append(Paper(path=os.path.join(root, filename)))
+     print("------------------paper_num: {}------------------".format(len(paper_list)))
+     for paper_index, paper_name in enumerate(paper_list):
+         print(paper_index, os.path.basename(paper_name.path))
+     assistant1.assist_reading_by_chatgpt(paper_list=paper_list)
+
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--paper_path", type=str, default='', help="path of papers")
+     parser.add_argument("--file_format", type=str, default='txt', help="output file format")
+     parser.add_argument("--research_fields", type=str, default='computer science, artificial intelligence and transfer learning', help="the research fields of paper")
+     parser.add_argument("--language", type=str, default='en', help="output language, en or zh")
+
+     args = parser.parse_args()
+     start_time = time.time()
+     main(args=args)
+     print("total time:", time.time() - start_time)
+
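Both `extract_paper` and `chat_assist` cycle through the configured API keys before each request. A modulo-based rotation is one simple way to make sure every key is visited in turn. The sketch below isolates that idea; the class name `KeyRotator` is a hypothetical helper and does not exist in this repository.

```python
class KeyRotator:
    """Hand out API keys in round-robin order."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("at least one API key is required")
        self.keys = list(keys)
        self.index = 0

    def next_key(self):
        key = self.keys[self.index]
        # Advance and wrap around so that every key, including the last, is used.
        self.index = (self.index + 1) % len(self.keys)
        return key


rotator = KeyRotator(["sk-first", "sk-second", "sk-third"])
print([rotator.next_key() for _ in range(5)])
```

With a single key the rotation degenerates to returning that key every time, so the same code path works for one key or many.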
get_paper_from_pdf.py ADDED
@@ -0,0 +1,194 @@
+ import fitz, io, os
+ from PIL import Image
+ from collections import Counter
+ import json
+ import re
+
+
+ class Paper:
+     def __init__(self, path, title='', url='', abs='', authors=None):
+         # Initialize a Paper object from the path to a PDF
+         self.url = url  # paper URL
+         self.path = path  # path to the PDF
+         self.section_names = []  # section headings
+         self.section_texts = {}  # section contents
+         self.abs = abs
+         self.title_page = 0
+         if title == '':
+             self.pdf = fitz.open(self.path)  # PDF document
+             self.title = self.get_title()
+             self.parse_pdf()
+         else:
+             self.title = title
+         # Avoid sharing one mutable default list between instances
+         self.authors = authors if authors is not None else []
+         self.roman_num = ["I", "II", 'III', "IV", "V", "VI", "VII", "VIII", "IIX", "IX", "X"]
+         self.digit_num = [str(d + 1) for d in range(10)]
+         self.first_image = ''
+
+     def parse_pdf(self):
+         self.pdf = fitz.open(self.path)  # PDF document
+         self.text_list = [page.get_text() for page in self.pdf]
+         self.all_text = ' '.join(self.text_list)
+         self.extract_section_infomation()
+         self.section_texts.update({"title": self.title})
+         self.pdf.close()
+
+     # Identify chapter names from lines that start with a Roman or Arabic number followed by a dot, and return them as a list
+     def get_chapter_names(self, ):
+         # Open the PDF file
+         doc = fitz.open(self.path)  # PDF document
+         text_list = [page.get_text() for page in doc]
+         all_text = ''
+         for text in text_list:
+             all_text += text
+         # Create an empty list to store the chapter names
+         chapter_names = []
+         for line in all_text.split('\n'):
+             line_list = line.split(' ')
+             if '.' in line:
+                 point_split_list = line.split('.')
+                 space_split_list = line.split(' ')
+                 if 1 < len(space_split_list) < 5:
+                     if 1 < len(point_split_list) < 5 and (
+                             point_split_list[0] in self.roman_num or point_split_list[0] in self.digit_num):
+                         # print("line:", line)
+                         chapter_names.append(line)
+
+         return chapter_names
+
+     def get_title(self):
+         doc = self.pdf  # the opened PDF document
+         max_font_size = 0  # largest font size seen so far
+         max_string = ""  # text set in the largest font size
+         max_font_sizes = [0]
+         for page_index, page in enumerate(doc):  # iterate over every page
+             text = page.get_text("dict")  # get the text information on the page
+             blocks = text["blocks"]  # get the list of text blocks
+             for block in blocks:  # iterate over every text block
+                 if block["type"] == 0 and len(block['lines']):  # text blocks only
+                     if len(block["lines"][0]["spans"]):
+                         font_size = block["lines"][0]["spans"][0]["size"]  # font size of the first span on the first line
+                         max_font_sizes.append(font_size)
+                         if font_size > max_font_size:  # larger than the current maximum
+                             max_font_size = font_size  # update the maximum
+                             max_string = block["lines"][0]["spans"][0]["text"]  # update the matching text
+         max_font_sizes.sort()
+         # print("max_font_sizes", max_font_sizes[-10:])
+         cur_title = ''
+         for page_index, page in enumerate(doc):  # iterate over every page
+             text = page.get_text("dict")  # get the text information on the page
+             blocks = text["blocks"]  # get the list of text blocks
+             for block in blocks:  # iterate over every text block
+                 if block["type"] == 0 and len(block['lines']):  # text blocks only
+                     if len(block["lines"][0]["spans"]):
+                         cur_string = block["lines"][0]["spans"][0]["text"]  # text of the first span on the first line
+                         font_flags = block["lines"][0]["spans"][0]["flags"]  # font flags of the first span on the first line
+                         font_size = block["lines"][0]["spans"][0]["size"]  # font size of the first span on the first line
+                         # print(font_size)
+                         # Treat text within 0.3 pt of the two largest recorded sizes as part of the title
+                         if abs(font_size - max_font_sizes[-1]) < 0.3 or abs(font_size - max_font_sizes[-2]) < 0.3:
+                             # print("The string is bold.", max_string, "font_size:", font_size, "font_flags:", font_flags)
+                             if len(cur_string) > 4 and "arXiv" not in cur_string:
+                                 # print("The string is bold.", max_string, "font_size:", font_size, "font_flags:", font_flags)
+                                 if cur_title == '':
+                                     cur_title += cur_string
+                                 else:
+                                     cur_title += ' ' + cur_string
+                             self.title_page = page_index
+                             # break
+         title = cur_title.replace('\n', ' ')
+         return title
+
+     def extract_section_infomation(self):
+         doc = fitz.open(self.path)
+
+         # Collect every font size used in the document
+         font_sizes = []
+         for page in doc:
+             blocks = page.get_text("dict")["blocks"]
+             for block in blocks:
+                 if 'lines' not in block:
+                     continue
+                 lines = block["lines"]
+                 for line in lines:
+                     for span in line["spans"]:
+                         font_sizes.append(span["size"])
+         most_common_size, _ = Counter(font_sizes).most_common(1)[0]
+
+         # Use the most frequent font size as the threshold for heading font sizes
+         threshold = most_common_size * 1
+
+         section_dict = {}
+         section_dict["Abstract"] = ""
+         last_heading = None
+         subheadings = []
+         heading_font = -1
+         # Iterate over every page and look for subheadings
+         found_abstract = False
+         upper_heading = False
+         font_heading = False
+         for page in doc:
+             blocks = page.get_text("dict")["blocks"]
+             for block in blocks:
+                 if not found_abstract:
+                     try:
+                         text = json.dumps(block)
+                     except Exception:
+                         # Blocks that cannot be serialized (e.g. image blocks) are skipped
+                         continue
+                     if re.search(r"\bAbstract\b", text, re.IGNORECASE):
+                         found_abstract = True
+                         last_heading = "Abstract"
+                 if found_abstract:
+                     if 'lines' not in block:
+                         continue
+                     lines = block["lines"]
+                     for line in lines:
+                         for span in line["spans"]:
+                             # If the current text is a subheading
+                             if not font_heading and span["text"].isupper() and sum(1 for c in span["text"] if c.isupper() and ('A' <= c <='Z')) > 4:  # for papers whose headings share the body font size but are all uppercase
+                                 upper_heading = True
+                                 heading = span["text"].strip()
+                                 if "References" in heading:  # ignore everything after the references
+                                     self.section_names = subheadings
+                                     self.section_texts = section_dict
+                                     return
+                                 subheadings.append(heading)
+                                 if last_heading is not None:
+                                     section_dict[last_heading] = section_dict[last_heading].strip()
+                                 section_dict[heading] = ""
+                                 last_heading = heading
+                             if not upper_heading and span["size"] > threshold and re.match(  # the usual case: decide by font size
+                                     r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*",
+                                     span["text"].strip()):
+                                 font_heading = True
+                                 if heading_font == -1:
+                                     heading_font = span["size"]
+                                 elif heading_font != span["size"]:
+                                     continue
+                                 heading = span["text"].strip()
+                                 if "References" in heading:  # ignore everything after the references
+                                     self.section_names = subheadings
+                                     self.section_texts = section_dict
+                                     return
+                                 subheadings.append(heading)
+                                 if last_heading is not None:
+                                     section_dict[last_heading] = section_dict[last_heading].strip()
+                                 section_dict[heading] = ""
+                                 last_heading = heading
+                             # Otherwise append the current text to the previous subheading's text
+                             elif last_heading is not None:
+                                 section_dict[last_heading] += " " + span["text"].strip()
+         self.section_names = subheadings
+         self.section_texts = section_dict
+
+
+ def main():
+     path = r'demo.pdf'
+     paper = Paper(path=path)
+     # Paper.__init__ already parses the PDF when no title is given; this call parses it again
+     paper.parse_pdf()
+     # for key, value in paper.section_texts.items():
+     #     print(key, value)
+     #     print("*"*40)
+
+
+ if __name__ == '__main__':
+     main()
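The core idea of `extract_section_infomation`, treating spans larger than the most common font size as headings, can be tried without a PDF. The sketch below runs on hand-written dictionaries that only mimic the shape of `page.get_text("dict")["blocks"]`. The function `find_headings`, the helper `make_block`, and the sample text are illustrative assumptions, and the all-uppercase branch and the regular-expression check from the original are left out for brevity.

```python
from collections import Counter


def find_headings(blocks):
    """Return span texts whose font size exceeds the most common (body) size."""
    sizes = [span["size"]
             for block in blocks
             for line in block.get("lines", [])
             for span in line["spans"]]
    if not sizes:
        return []
    # The most frequent size is assumed to be the body text size.
    body_size, _ = Counter(sizes).most_common(1)[0]
    headings = []
    for block in blocks:
        for line in block.get("lines", []):
            for span in line["spans"]:
                if span["size"] > body_size and span["text"].strip():
                    headings.append(span["text"].strip())
    return headings


def make_block(text, size):
    # Hand-written stand-in for one text block of a parsed page.
    return {"type": 0, "lines": [{"spans": [{"text": text, "size": size}]}]}


blocks = [
    make_block("1 Introduction", 12.0),
    make_block("Federated learning trains models", 10.0),
    make_block("across many clients.", 10.0),
    make_block("2 Method", 12.0),
    make_block("We regularize the representation.", 10.0),
    {"type": 1},  # image block without "lines"
]
print(find_headings(blocks))
```

A size-only rule also picks up the title and other large text, which is why the original additionally locks onto the first heading size it encounters and stops at "References".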
output_file/2023-04-08-15-32-46-FedSR_ A Simple and Effective Domain Generalization Abstract References Checklist.txt ADDED
@@ -0,0 +1,12 @@
+ ## Paper:1
+
+
+
+
+
+
+ Would you please tell me why does the author use conditional mutual information instead of mutual information? The author explained that conditional mutual information is not as restrictive as mutual information. I want to know more details.
+ The author chose to use conditional mutual information (CMI) instead of mutual information because CMI is less restrictive and provides more flexibility in achieving domain generalization. CMI measures the amount of information shared between two variables (in this case, the representation and the data given the label) while controlling for the influence of a third variable (in this case, the label). By using CMI, the model can focus on learning only the essential information relevant to the prediction task while ignoring spurious correlations such as background noise. In contrast, mutual information measures the total amount of information shared between two variables regardless of whether it is relevant to the prediction task or not. Therefore, CMI is a more suitable choice for achieving domain generalization where the goal is to learn a representation that is invariant across domains while still capturing relevant information for the prediction task.
+
+ Would you please tell me why does the author report results of 3 runs?
+ The author reports the results of 3 runs to ensure the stability and reproducibility of their proposed method. By running their experiment multiple times, they can observe the variance and the consistency of their method's performance. Variance in results could be due to various reasons such as initialization, randomization, or the stochastic nature of the algorithm used. Therefore, the author performs multiple runs and reports the mean and standard deviation of the results. This helps to ensure that their proposed method's performance is not a random or outlier result and is instead an accurate representation of its actual performance.
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ PyMuPDF==1.21.1
+ tiktoken==0.2.0
+ tenacity==8.2.2
+ pybase64==1.2.3
+ Pillow==9.4.0
+ openai==0.27.0
+ markdown
+ gradio==3.20.1
+ PyPDF2