Davidsamuel101 committed on
Commit
9cc8ef2
·
1 Parent(s): 5f21add
Files changed (3)
  1. BDP LEC Report.md +116 -0
  2. __pycache__/app.cpython-38.pyc +0 -0
  3. app.py +4 -3
BDP LEC Report.md ADDED
@@ -0,0 +1,116 @@
1
+
2
+ BDP LEC Report Presentation
3
+ ===========================
4
+
5
+
6
+ ---
7
+ # Group Members:
8
+
9
+
10
+ Stella Shania Mintara, David Samuel and Egivenia.
11
+
12
+ ---
13
+ # Case Problem
14
+
15
+
16
+ FreshMart is an established large-scale supermarket chain in Jakarta and other big cities such as Surabaya and Bandung. FreshMart stores typically offer both cashier-guided checkout and self-checkout. With cashier-guided checkout, the cashier is responsible for scanning products and managing transactions using cash registers.
17
+
18
+ Because FreshMart is already well established, it has enough resources to buy and own servers. However, it prefers to outsource server management to another party so that it doesn't need to search for and hire talent to run and manage the servers.
19
+
20
+ ---
21
+ # I. Data Source
22
+
23
+
24
+ The data comes from FreshMart's surveillance cameras. This data will be ingested in the ingestion layer using Apache Kafka.
25
+
26
+ ---
27
+ # II. Ingestion Layer
28
+
29
+
30
+ The ingestion layer is where raw data from the data sources is collected and converted for the layers that follow. Because the source is live video, the most suitable type of ingestion is real-time data processing.
31
+
32
+ We use Apache Kafka to ingest the data source. We chose Apache Kafka because it can handle a large volume of data per unit of time (high throughput).
33
+
34
+ A Kafka cluster is made up of multiple nodes and several components: producers, brokers, and consumers. The cluster keeps multiple copies of the same data, called replicas, which make the system more fault-tolerant, stable, and reliable.
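
To make the role of replicas concrete, here is a minimal in-memory sketch of the producer, broker, and consumer roles. It is a toy model only: `Broker` and `ToyCluster` are hypothetical names invented for illustration and are not part of any Kafka client library, and real Kafka replicates per partition with leader election rather than copying to every node.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Toy stand-in for a broker holding one replica of a single log."""
    name: str
    log: list = field(default_factory=list)
    alive: bool = True


class ToyCluster:
    """Hypothetical in-memory cluster; this is not the Kafka API."""

    def __init__(self, broker_names):
        self.brokers = [Broker(name) for name in broker_names]

    def produce(self, message):
        # Producer side: the message is copied to every live broker,
        # so each live broker holds a replica of the log.
        for broker in self.brokers:
            if broker.alive:
                broker.log.append(message)

    def consume(self, offset=0):
        # Consumer side: read from the first live replica.
        for broker in self.brokers:
            if broker.alive:
                return broker.log[offset:]
        raise RuntimeError("no live replica available")


cluster = ToyCluster(["broker-1", "broker-2", "broker-3"])
cluster.produce("frame-0001")
cluster.produce("frame-0002")
cluster.brokers[0].alive = False  # simulate a broker failure
frames_after_failure = cluster.consume()
```

Even with `broker-1` down, the consumer still reads both messages from another replica, which is the fault-tolerance property described above.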
35
+
36
+ We use Apache Spark for real-time and batch analysis of the data. We chose Spark Streaming because it provides fault-tolerant, scalable, high-throughput stream processing. We can then analyze the data using machine learning and deep learning algorithms.
37
+
38
+ Apache Spark supports in-memory cluster computing and promises to be faster than Hadoop. It supports multiple high-level tools for data analysis such as Spark Streaming.
39
+
40
+ ---
41
+ # III. Infrastructure Layer
42
+
43
+
44
+ This layer is built on a distributed computing concept, which means the data will be stored, on physical servers or in the cloud, in many distinct locations.
45
+
46
+ ---
47
+ # IV. Storage Layer
48
+
49
+
50
+ This layer is where all of the data will be stored in a specific type of database. The storage we use is a non-relational, distributed, flexible, and scalable database.
51
+
52
+ For the database, we will use MongoDB, which is a document store. A document is MongoDB's fundamental unit of data storage. The MongoDB database is also used to store inferences after deep learning models are applied to the data streams.
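
As an illustration of what one stored inference might look like as a document, here is a minimal sketch using a plain Python dictionary. All field names and values (`camera_id`, `detections`, `activity`, and so on) are assumptions made for this example; the report does not define a schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical inference document; every field name here is illustrative.
inference_doc = {
    "camera_id": "cam-03",
    "captured_at": datetime(2023, 1, 1, 9, 30, tzinfo=timezone.utc).isoformat(),
    "detections": [
        {"label": "person", "confidence": 0.97, "bbox": [34, 50, 120, 310]},
        {"label": "product", "confidence": 0.88, "bbox": [140, 200, 60, 40]},
    ],
    "activity": "picked_up",
}

# Documents are JSON-like, so the structure survives a JSON round trip.
serialized = json.dumps(inference_doc)
restored = json.loads(serialized)
```

Nested lists and sub-documents such as `detections` are what make a document store a natural fit for model output of varying shape.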
53
+
54
+ ---
55
+ # V. Analytics Engines
56
+
57
+
58
+ This section explains the different analytics engine steps that are required.
59
+ ## A. Data Preparation
60
+
61
+
62
+ To perform the data preparation tasks, we will use OpenRefine.
63
+ ## B. Analysis Type and Mode
64
+
65
+
66
+ Real-time analysis of the sensor data (live video) will be performed for FreshMart. It covers person detection, object detection, and customer association. Activity analysis is performed to determine whether a person has picked up a product from a shelf or returned it.
67
+
68
+ The Spark Streaming instance connects to Kafka by creating a discretized stream (DStream). The streaming data from Kafka will be ingested and analyzed in micro-batches. We can then use deep learning models such as YOLO and other CNN models for joint detection and pose estimation.
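
The micro-batch idea can be sketched in plain Python as follows. This is a simplification under stated assumptions: `micro_batches` is a hypothetical helper, it groups items by count, whereas Spark Streaming forms batches by time interval, and no Spark or Kafka API is used here.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of at most batch_size items from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


# Hypothetical frame identifiers standing in for the video stream.
frames = [f"frame-{i:04d}" for i in range(7)]
batches = list(micro_batches(frames, batch_size=3))
```

Each batch would then be handed to the detection model as a unit, instead of invoking the model once per frame.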
69
+
70
+ The second analysis is a statistical analysis of a batch of transactional data. To perform this basic statistical analysis, we can use Spark.
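
To show the kind of basic statistics meant here, the sketch below aggregates a handful of made-up transactions per store. It uses only the Python standard library rather than Spark, and the records and field names (`store`, `total`) are hypothetical.

```python
from statistics import mean, median

# Hypothetical transaction records; values are invented for illustration.
transactions = [
    {"store": "Jakarta", "total": 120_000},
    {"store": "Jakarta", "total": 80_000},
    {"store": "Surabaya", "total": 50_000},
    {"store": "Bandung", "total": 150_000},
]


def summarize_by_store(rows):
    """Group transaction totals by store and compute simple statistics."""
    totals = {}
    for row in rows:
        totals.setdefault(row["store"], []).append(row["total"])
    return {
        store: {"count": len(values), "mean": mean(values), "median": median(values)}
        for store, values in totals.items()
    }


summary = summarize_by_store(transactions)
```

In Spark, the same grouping and aggregation would be expressed over a distributed dataset instead of an in-memory list.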
71
+
72
+ ---
73
+ # VI. Visualization Layer
74
+
75
+ ## Web Framework
76
+
77
+
78
+ We use Django to develop a web application to display analysis results.
79
+
80
+ Seaborn is an open-source Python library based on matplotlib. It is used for exploratory data analysis and data visualization.
81
+
82
+ Django is a popular Python web framework for developing web applications. In Django, the model serves as a definition for stored data and manages database interactions.
83
+
84
+ Django is a web framework for big data analytics applications.
85
+
86
+ Django is a great match with MongoDB for building powerful, secure, easy-to-maintain applications. Support for a non-relational database like MongoDB can be added by installing an additional Django database engine for MongoDB.
87
+ ## Serving Database
88
+
89
+
90
+ A MongoDB database is used to store inferences after the deep learning models are applied to the data streams. For storing large files such as images and videos that exceed MongoDB's 16 MB document size limit, we can use GridFS, a MongoDB specification for storing and retrieving large files.
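
GridFS stores a large file as a sequence of chunks rather than as one document. The sketch below imitates that chunking in plain Python; it does not call MongoDB or any driver, `split_into_chunks` and `reassemble` are hypothetical helpers, and the 255 KiB chunk size mirrors the GridFS default.

```python
CHUNK_SIZE = 255 * 1024  # GridFS default chunk size (255 KiB)


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split raw bytes into numbered chunks, similar to GridFS chunk documents."""
    return [
        {"n": index, "data": data[start:start + chunk_size]}
        for index, start in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(chunks):
    """Rebuild the original bytes by concatenating chunks in order of n."""
    return b"".join(chunk["data"] for chunk in sorted(chunks, key=lambda c: c["n"]))


video = bytes(600 * 1024)  # 600 KiB of zero bytes standing in for a video file
chunks = split_into_chunks(video)
```

Because each chunk is far below the document size limit, the file as a whole can be much larger than any single document.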
91
+ ## Interactive Querying
92
+
93
+
94
+ The MongoDB Connector for Spark provides integration between MongoDB and Apache Spark. Spark SQL also includes a cost-based optimizer, which uses table statistics to determine the most efficient query execution plan.
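
The idea of using table statistics to pick a plan can be illustrated with a toy decision: choosing which table to use as the build side of a hash join. This is only a sketch of the principle, not Spark's optimizer; the table names and row counts are invented, and `choose_build_side` is a hypothetical helper.

```python
# Hypothetical table statistics (row counts) used to compare plans.
table_stats = {"transactions": 5_000_000, "stores": 40}


def choose_build_side(left, right, stats):
    """Pick the table with fewer rows as the hash-join build side."""
    return left if stats[left] <= stats[right] else right


build_side = choose_build_side("transactions", "stores", table_stats)
```

Building the hash table on the small `stores` table and streaming the large table past it is cheaper than the reverse, which is the kind of choice statistics make possible.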
95
+
96
+ ---
97
+ # VII. Security Layer
98
+
99
+
100
+ MongoDB provides native encryption, so we don't need to pay extra or hire a third party to protect sensitive data.
101
+
102
+ MongoDB supports encryption protocols such as TLS (Transport Layer Security) and SSL (Secure Sockets Layer) to send and receive data securely over networks.
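
As a rough illustration of what a TLS-secured client connection requires, the sketch below builds a client-side TLS context with Python's standard `ssl` module. It does not connect to MongoDB or use a MongoDB driver; requiring TLS 1.2 as the minimum version is an assumption chosen for this example.

```python
import ssl

# Client-side context with certificate verification and hostname checking enabled.
context = ssl.create_default_context()

# Assumption for this example: refuse protocol versions older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2
```

A context like this verifies the server's certificate before any data is exchanged, which is what protects data in transit.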
103
+
104
+ MongoDB uses database encryption known as Transparent Data Encryption (TDE).
105
+
106
+ MongoDB encrypts each database using an encrypted storage engine. WiredTiger optimizes encryption further by encrypting the database files at the page level.
107
+
108
+ MongoDB encrypts data in motion and at rest with a high-performance storage engine. All of the company's private and sensitive information, such as credit card numbers, will be encrypted and kept safe in storage.
109
+
110
+ ---
111
+ # VIII. Monitoring Layer
112
+
113
+
114
+ This layer helps us monitor all of the moving parts in the distributed Hadoop grid architecture. Monitoring is needed so that the data ingested into the process flow is reliable and consumable.
115
+
116
+ Nagios provides us with high-efficiency worker processes. Nagios can also report failures as soon as they happen and send notifications.
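
Nagios learns about failures through check plugins that report a status code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The sketch below shows that convention with a hypothetical check on consumer lag; the function name, the metric, and the thresholds are assumptions made for illustration.

```python
# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_consumer_lag(lag, warn=1000, crit=5000):
    """Return a (status_code, message) pair for a hypothetical lag metric."""
    if lag is None or lag < 0:
        return UNKNOWN, "UNKNOWN - consumer lag not available"
    if lag >= crit:
        return CRITICAL, f"CRITICAL - consumer lag is {lag}"
    if lag >= warn:
        return WARNING, f"WARNING - consumer lag is {lag}"
    return OK, f"OK - consumer lag is {lag}"


status, message = check_consumer_lag(6200)
```

A real plugin would print the message and exit with the status code, for example via `sys.exit(status)`, so that Nagios can trigger a notification.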
__pycache__/app.cpython-38.pyc CHANGED
Binary files a/__pycache__/app.cpython-38.pyc and b/__pycache__/app.cpython-38.pyc differ
 
app.py CHANGED
@@ -62,13 +62,14 @@ def inference(document):
     slides = preprocess.get_slides(texts)
     generated_slides = summarize(slides)
     markdown_path = convert2markdown(generated_slides)
-
-    return markdown_path
+    with open(markdown_path, 'rt') as f:
+        markdown_str = f.read()
+    return markdown_str
 
 
 with gr.Blocks() as demo:
     inp = gr.File( file_types=['pdf'])
-    out = gr.File(type="file", label="Markdown")
+    out = gr.Textbox(label="Markdown Content")
     inference_btn = gr.Button("Summarized PDF")
     inference_btn.click(fn=inference, inputs=inp, outputs=out, show_progress=True, api_name="summarize")
 
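
The change above makes `inference` return the Markdown text instead of the file path. As a stand-alone sketch of that file-reading step, the hypothetical helper `read_markdown` below does the same with `pathlib`; the explicit UTF-8 encoding is an assumption, since the commit uses the platform default.

```python
import tempfile
from pathlib import Path


def read_markdown(markdown_path):
    """Return the contents of a Markdown file as a string."""
    # Assumption: the file is UTF-8 encoded.
    return Path(markdown_path).read_text(encoding="utf-8")


# Usage example with a temporary file standing in for the generated slides.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "slides.md"
    sample.write_text("# Title\n", encoding="utf-8")
    content = read_markdown(sample)
```

Returning a string matches a text output component, whereas returning a path matches a file output component.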