DBST 667 – Data Mining

Part I.

Even as technology improves, user latency remains a major problem on the World Wide Web. User latency is the time a user waits for a requested web page to load. If the page takes too long to respond, the user may abandon the visit to the website. For e-commerce sites, a visit abandoned because of latency translates directly into lost revenue for the business. It is therefore essential to measure the delay users experience on an e-commerce website. These measurements are necessary for analyzing the components of the site and their behavior. In this paper, we explore https://www.amazon.com, an e-commerce website.

From the analysis of our website, https://www.amazon.com, the optimization reports reveal that the page contains more objects than recommended, which causes delay. The page contains 24 objects in total, whereas the recommended number of objects per page is 20. With more than 20 objects per page, the overhead of fetching the objects accounts for more than 75% of whole-page latency. The reports also reveal that the page contains 22 images, which is more than recommended and adds to page load delays. Another report shows that the page’s total size is 185966 bytes, which takes over 20 seconds to load on a 56 kbps modem, whereas the recommended load time is less than 20 seconds. The total HTML size of the page is 176542 bytes, whereas the recommended maximum is 100K.
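The load-time figure can be checked with simple arithmetic. The following is a minimal sketch, assuming an ideal link that delivers the nominal 56 kbps with no latency or protocol overhead; the function name `transfer_seconds` is illustrative and not part of any report.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Ideal transfer time: payload bits divided by the link rate.

    Ignores round-trip latency and protocol overhead, so real loads are slower.
    """
    return size_bytes * 8 / link_bits_per_second


PAGE_BYTES = 185_966   # total page size from the report
MODEM_BPS = 56_000     # nominal 56 kbps modem rate (assumed ideal)

seconds = transfer_seconds(PAGE_BYTES, MODEM_BPS)
print(f"{seconds:.1f} s")  # prints "26.6 s"
```

Even under these optimistic assumptions the page needs roughly 26.6 seconds, which is consistent with the report's finding that it exceeds the 20-second recommendation.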

The report is evidence that the number of objects dominates the web page delay. As website manager, I will replace graphic rollovers with CSS rollovers to refine, combine, and optimize the page's graphics, which will speed up display and minimize HTTP requests. I will also use CSS sprites to consolidate decorative images, and optimize parallel downloads by using different hostnames or a CDN to reduce object overhead.

The analysis also points to another cause of page delay: the number of images. I will combine and optimize the graphics, replacing graphic rollover menus with CSS rollover menus to speed up display and minimize HTTP requests.

From the report, the total size of the page is 185966 bytes, which is over 100K. To solve this issue, I will substitute abstract CSS rules for repeated embedded styles and eliminate unnecessary comments and whitespace. I will also use HTTP compression, compressing the XHTML with GZIP, which reduces HTML size by an average of 75%.
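The effect of GZIP on markup can be previewed offline with Python's standard `gzip` module. This is only a sketch: the sample markup below is hypothetical, with deliberately repeated inline styles, and the saving on a real page will differ from both this sample and the 75% average quoted above.

```python
import gzip

# Hypothetical sample markup with repeated embedded styles (not the real page).
row = '<div style="color:#333;font-size:12px;margin:0">item</div>\n'
html = ("<html><body>\n" + row * 500 + "</body></html>\n").encode("utf-8")

compressed = gzip.compress(html)
saving = 1 - len(compressed) / len(html)

# Report original size, compressed size, and the fraction saved.
print(len(html), len(compressed), f"{saving:.0%}")
```

Highly repetitive markup like this sample compresses far better than typical pages, so the printed saving should be read as an upper-end illustration rather than an expected result.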

Part II.

Time series data is a collection of quantities collected at intervals over time and arranged in chronological order. The interval at which the data is collected is regular, such as monthly, and is known as the time series frequency. In this paper, we will use Pharma Sales data that compares sales of pharmaceutical products over time between Qlik Brands and Non-Qlik Brands. Each product in the data is either a Qlik Brand or a Non-Qlik Brand. The data also includes the physicians, who are doctors in our case. The doctor prescribes the product to the visitor (patient). The data contains the dates on which prescriptions are made and the dates of purchases. The data collection also includes the geographical locations of both physicians and visitors, the markets in which the products are sold, and calls made to the pharma.

We can use a time series analysis of the data set above to determine which product brands have more sales over time. We can also determine how doctors have prescribed products over time, and which of the prescribed products have been bought over time. Finally, we can identify loyal customers who have been purchasing products over time.
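The first of these questions, which brand group sells more over time, amounts to grouping sales by period and brand. A minimal sketch follows, assuming monthly frequency; the records and field layout are hypothetical and are not taken from the Pharma Sales data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (purchase date, brand group, units sold).
records = [
    (date(2020, 1, 5), "Qlik", 10),
    (date(2020, 1, 20), "Non-Qlik", 7),
    (date(2020, 2, 3), "Qlik", 12),
    (date(2020, 2, 14), "Qlik", 3),
    (date(2020, 2, 28), "Non-Qlik", 9),
]


def monthly_totals(rows):
    """Sum units per (year, month, brand), returned in chronological order."""
    totals = defaultdict(int)
    for day, brand, units in rows:
        totals[(day.year, day.month, brand)] += units
    return dict(sorted(totals.items()))


for (year, month, brand), units in monthly_totals(records).items():
    print(f"{year}-{month:02d} {brand}: {units}")
```

The same grouping, keyed by physician or by visitor instead of brand, would serve the prescription and loyal-customer questions.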

For our study, I will use the CluStream method, which stands out among time series methods. Usually, unsupervised clustering tasks are carried out in batch mode, where the data is physically stored and several passes are made over it. However, in the big data setting, not all data can be stored, and it does not arrive all at once. Such data flows reach processing systems at high speed and may be generated by a non-stationary process. This makes storing the data impractical and leaves the number of clusters unknown. The high rate at which data is transmitted can also cause high noise levels. These factors make traditional data clustering unsuitable for big data. CluStream has emerged from active research that aims to tackle the challenges faced by traditional analysis methods. The CluStream algorithm is one of the state-of-the-art stream clustering methods and has two phases. Phase one of CluStream is online micro-clustering, and phase two is offline clustering. In the online micro-clustering phase, statistics describing the incoming data are gathered.

In the offline phase, in contrast, a conventional non-stream clustering algorithm is executed on the high-level statistics gathered in the online phase (Singh, Kumar, & Singh, 2017). CluStream uses k-means in the micro-clustering phase, which enables it to keep computation times short while maintaining data integrity in a high-dimensional setting. In our study, the online clustering phase can capture all calls, prescriptions, and geo-locations from the stream without storing them.
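The idea of summarizing a stream without storing it can be illustrated with a simplified micro-cluster update. This is a sketch under stated assumptions, not the published algorithm: `max_distance` is a hypothetical fixed threshold standing in for CluStream's radius-based boundary test, and the micro-cluster budget, the merging and deletion of old micro-clusters, the time-frame snapshots, and the offline phase are all omitted. The sample stream is invented.

```python
import math


class MicroCluster:
    """Additive summary of absorbed points; the raw points are never kept."""

    def __init__(self, point, timestamp):
        self.n = 1
        self.linear_sum = list(point)
        self.squared_sum = [x * x for x in point]
        self.time_sum = timestamp
        self.time_squared_sum = timestamp * timestamp

    def absorb(self, point, timestamp):
        """Fold one more point into the running sums."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x
        self.time_sum += timestamp
        self.time_squared_sum += timestamp * timestamp

    def centroid(self):
        return [s / self.n for s in self.linear_sum]


def update(clusters, point, timestamp, max_distance):
    """Absorb the point into the nearest micro-cluster, or start a new one.

    max_distance is a hypothetical fixed threshold (a simplification).
    """
    nearest = min(
        clusters, key=lambda c: math.dist(c.centroid(), point), default=None
    )
    if nearest is not None and math.dist(nearest.centroid(), point) <= max_distance:
        nearest.absorb(point, timestamp)
    else:
        clusters.append(MicroCluster(point, timestamp))


# Invented stream of (point, timestamp) pairs forming two well-separated groups.
stream = [((1.0, 1.0), 1), ((1.2, 0.9), 2), ((8.0, 8.0), 3), ((7.9, 8.2), 4)]
clusters = []
for point, t in stream:
    update(clusters, point, t, max_distance=1.0)

print([(c.n, c.centroid()) for c in clusters])
```

Because each summary is a set of running sums, a point can be discarded as soon as it has been absorbed, and an offline step could later cluster the centroids with a conventional algorithm such as k-means.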

The CluStream method therefore meets the purpose of the time series analysis discussed above. The analysis can be run on a personal computer with a specification such as an Intel(R) Core(TM) i3 processor with a frequency of 2.10 GHz, running the Windows operating system.

References.

Singh, A., Kumar, A., & Singh, R. (2017). An Efficient hybrid-clustream algorithm for stream mining. https://www.researchgate.net/publication/321670718_An_Efficient_Hybrid-Clustream_Algorithm_for_Stream_Mining
