DBST 667 – Data Mining: Week 9 Individual Exercise

Deliverables: One file. Submit this lab report, with answers to all questions and output screenshots included, to the ‘Individual Exercises Week 9’ assignment folder.

Grading: This exercise is worth 2% of the course grade. All questions must be answered in your own words with any paraphrased references properly cited using in-text citations and a reference list as needed. In addition, grammatical and spelling errors may affect the grade.

The purpose of this exercise is to make practical sense of mining data streams. This assignment DOES NOT require using R Studio. The assignment consists of the two parts below. Put all of your answers in the spaces provided, and answer all questions in your own words. This assignment is designed so that no external research is needed; should you include outside sources anyway, ensure that you properly cite and attribute all non-original content.

Part 1 – Website Optimization

  1. Visit http://www.websiteoptimization.com/services/analyze/ and analyze any web page of your choice in the ‘Enter URL to diagnose’ field. For the purpose of this exercise, analysis will be easier if you choose a web page with perceptible latency when responding to your web requests.
  2. Discuss which page you elected to analyze and what the website optimization report reveals about that web page. Include the report and your interpretation below.

(Hint: Focus on the ‘Analysis and Recommendations’ section)

Despite improvements in current technology, a major problem faced by the World Wide Web is user latency: the time a user waits for a requested web page to load. If a page takes too long to respond, the user may terminate the visit to the website. For e-commerce sites, visits abandoned because of latency translate directly into lost revenue for the business (Marshak & Levy, n.d.). It is therefore essential to measure the delay experienced by users of an e-commerce website; these measurements are necessary for analyzing the site's components and their behavior. In this paper, we explore https://www.amazon.com, an e-commerce website.

From the analysis of our website, https://www.amazon.com, the optimization report reveals that the page has too many objects, which causes delay. The page contains 24 objects in total, while the recommended number of objects per page is 20; object overhead above this threshold can account for more than 75% of whole-page latency. The report also reveals that the page contains 22 images, more than recommended, which delays page loads. Another finding is that the page's total size is 185,966 bytes, which takes over 20 seconds to download on a 56 kbps modem, while the recommended load time is under 20 seconds. Finally, the total HTML size of the page is 176,542 bytes, above the recommended 100K.
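The report's load-time finding can be sanity-checked with a back-of-the-envelope calculation. This is a minimal sketch, assuming pure serialization delay over the modem link (ignoring protocol overhead and round trips); the function name is my own, not part of the report:

```python
# Hypothetical sketch: estimating page load time from the report's
# figures (total size 185,966 bytes on a 56 kbps modem).
def estimated_load_seconds(total_bytes: int, link_kbps: float = 56.0) -> float:
    """Naive serialization-delay estimate: total bits / link rate."""
    return (total_bytes * 8) / (link_kbps * 1000)

seconds = estimated_load_seconds(185_966)
print(f"{seconds:.1f} s")  # roughly 26.6 s, above the 20 s guideline
```

Even this optimistic lower bound exceeds the 20-second guideline, which is consistent with the report's recommendation to shrink the page.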

  3. Imagine that you are a website manager. Explain how you could use the website optimization report to improve the performance of your website. Provide at least two examples from the website optimization report.

The report is evidence that object overhead dominates the web page's delay. As a website manager, I would replace graphic rollovers with CSS rollovers to refine, combine, and optimize the page, which speeds up display and minimizes HTTP requests. I would also use CSS sprites to consolidate decorative images, and optimize parallel downloads by serving objects from different hostnames or a CDN to reduce object overhead.

Another issue causing page delays, according to the analysis, is the number of images. I would combine and optimize the graphics, replacing graphic rollover menus with CSS rollover menus to speed display and minimize HTTP requests.

From the report, the total size of the page is 185,966 bytes, which is over the recommended 100K. I would substitute abstract CSS rules for repeated embedded styles and eliminate unnecessary comments and whitespace to address this. We would also use HTTP compression, gzipping the XHTML, which reduces HTML size by an average of 75%.
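The effect of HTTP compression is easy to demonstrate. This is an illustrative sketch, not a measurement of the real Amazon page: the sample HTML string is made up, though the repetitive markup is typical of the listing pages that compress well.

```python
import gzip

# Hypothetical sketch: measuring the size reduction HTTP compression
# could give. sample_html is illustrative, not the real page's markup.
sample_html = "<html><body>" + "<p>product listing row</p>" * 500 + "</body></html>"
raw = sample_html.encode("utf-8")
compressed = gzip.compress(raw)

ratio = 1 - len(compressed) / len(raw)
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes "
      f"({ratio:.0%} smaller)")
```

Real pages are less repetitive than this toy string, but the roughly 75% average reduction the report cites is plausible for text-heavy HTML.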

Part 2 – Time Series Analysis

  1. Visit https://datamarket.com/data/list/?q=provider:tsdl and choose one of the datasets in the list that would be good for time series analysis.
  2. Discuss what data is in the dataset at a high level and why you selected this data for time series analysis.

Time series data is a collection of data quantities recorded over time in chronological order. The interval at which the data is collected is usually regular (for example, monthly) and is known as the time series frequency. In this paper, we will use Pharma Sales data, which compares sales of pharmaceutical products over time between Qlik Brands and Non-Qlik Brands. Each product in the data is either a Qlik Brand or a non-Qlik Brand. The data also includes the physicians (doctors, in our case) who prescribe products to visitors (patients), the dates prescriptions are made, and the dates of purchases. The collection further includes the geographical locations of both physicians and visitors, the markets in which the products are sold, and calls made to the pharma company.

  3. Imagine that your course project will be using this dataset for time series analysis. Now, imagine what insight or purpose your project would try to uncover by exploring this dataset with time series analysis. (Note: This is a purposefully abstract prompt. The only wrong answers are those that do not use your imagination and analytical abilities.)

From the dataset above, a time series analysis could determine which product brands have more sales over time. We could also determine how doctors' prescribing of products has changed over time, which of the prescribed products were actually purchased over time, and which loyal customers have kept purchasing products over time.
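The first of these insights, brand sales over time, amounts to grouping the records by month and brand. This is a minimal sketch with made-up records; the field names (`date`, `brand`, `units`) are my assumptions, not the dataset's actual column names:

```python
from collections import defaultdict

# Hypothetical sketch: monthly sales totals per brand from records like
# those described above. Field names and values are illustrative.
records = [
    {"date": "2017-01-15", "brand": "Qlik", "units": 120},
    {"date": "2017-01-20", "brand": "Non-Qlik", "units": 80},
    {"date": "2017-02-03", "brand": "Qlik", "units": 150},
    {"date": "2017-02-11", "brand": "Non-Qlik", "units": 95},
]

monthly = defaultdict(int)
for r in records:
    month = r["date"][:7]                    # e.g. "2017-01"
    monthly[(month, r["brand"])] += r["units"]

for (month, brand), units in sorted(monthly.items()):
    print(month, brand, units)
```

The resulting per-month, per-brand totals are exactly the kind of series a time series method would then model for trend and seasonality.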

  4. Choose one of the following time series methods and discuss how you would use the method in your study:
  • Lossy Counting
  • Random Sampling
  • Very Fast Decision Tree (VFDT)
  • Concept-Adapting Very Fast Decision Tree (CVFDT)
  • Hoeffding Tree
  • CluStream
  • Sequential Pattern Mining

(Hint: Your weekly lab projects and course project have been building towards this form of thought so look to the structure/implementation of these for your answer.)

For our study, I will use the CluStream method, which stands out among the time series stream methods. Unsupervised clustering tasks are usually carried out in batch mode, where the data is stored physically and several passes over it are made. In the big data setting, however, not all data can be stored, and it arrives continuously: such data flows reach processing systems at high speed and may be non-stationary, which makes storing the data inconvenient and leaves the number of clusters unknown. The high transmission rate can also introduce high noise levels. These factors make traditional data clustering unsuitable for big data streams. CluStream has emerged as a heavily researched method that aims to tackle the challenges faced by traditional stream analysis methods. It is one of the most advanced state-of-the-art stream clustering methods and has two phases: phase one is online micro-clustering, and phase two is offline clustering. In the online micro-clustering phase, statistics describing the incoming data are gathered.

In contrast, the offline phase executes a conventional non-stream clustering algorithm over the high-level statistics produced by the online phase (Singh, Kumar, & Singh, 2017). CluStream uses k-means in the micro-clustering phase, which enables short-time calculations while maintaining data integrity in a high-dimensional setting. In our study, the online phase of CluStream can be used to capture all calls, prescriptions, and geo-locations without storing the raw stream.

  5. Would the method you selected meet the purpose of the study? Are there any potential drawbacks and/or any additional considerations that must be made?

Yes, the CluStream method will meet the purpose of the time series analysis discussed above. One additional consideration is the computing resources required: the study would run on a personal computer with an Intel(R) Core(TM) i3 CPU at a processor frequency of 2.10 GHz, running the Windows operating system.

References

Marshak, M., & Levy, H. (n.d.). Evaluating web user perceived latency using server-side measurements. Link: ? (psu.edu)

Singh, A., Kumar, A., & Singh, R. (2017). An efficient hybrid-CluStream algorithm for stream mining. https://www.researchgate.net/publication/321670718_An_Efficient_Hybrid-Clustream_Algorithm_for_Stream_Mining

