# Data Science with R

## Project 1

Problem Statement:

An education department in the US needs to analyze the factors that influence the admission of a student into a college.

Analyze the historical data and determine the key drivers.

Analysis information:

Predictive:

• Run a logistic model to determine the factors that influence the admission process of a student (drop insignificant variables).
• Transform variables to factors wherever required.
• Calculate the accuracy of the model.
• Try other modeling techniques, such as decision tree and SVM, and select a champion model.
• Determine the accuracy rates for each model.
• Select the most accurate model.
• Identify other machine learning or statistical techniques that can be used.

Descriptive:

• Categorize the grade point average into High, Medium, and Low (with admission probability percentages) and plot it on a point chart.
• Create a cross grid for the admission variables with GRE categorized. The GRE categorization is shown below:

| GRE | Categorized |
|---|---|
| 0-440 | Low |
| 440-580 | Medium |
| 580+ | High |
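The GRE banding above can be sketched with base R's `cut()`. This is a minimal illustration, not the required solution: the scores and the `admit` flags are made-up toy values, `categorize_gre` is a hypothetical helper name, and the brief does not say which band the boundary scores 440 and 580 belong to, so the sketch assumes they go to the higher band.

```r
# Hypothetical helper: bands GRE scores using the breakpoints from the brief.
# Assumption: boundary values (440, 580) fall into the higher band.
categorize_gre <- function(gre) {
  cut(gre,
      breaks = c(0, 440, 580, Inf),
      labels = c("Low", "Medium", "High"),
      right = FALSE)  # intervals are [0, 440), [440, 580), [580, Inf)
}

# Toy values for illustration only (not taken from the project data)
gre_scores <- c(380, 440, 520, 580, 700)
admit      <- c(0, 0, 1, 1, 1)

gre_cat <- categorize_gre(gre_scores)

# Cross grid of the admit flag against the GRE category
cross_grid <- table(admit, gre_cat)
print(cross_grid)
```

The same `cut()` pattern could be applied to GPA once High, Medium, and Low thresholds are chosen; the brief does not give those thresholds.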

Variables in the Dataset:

• GRE: Graduate Record Exam scores.
• GPA: Grade point average.
• Rank: Prestige of the undergraduate institution. The variable takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
• Admit: The response variable. It is binary, where 1 indicates that the student is admitted and 0 indicates that the student is not admitted.
• SES: Socioeconomic status: 1 – low, 2 – medium, 3 – high.
• Gender_male: 0 -> Female, 1 -> Male.
• Race: 1, 2, and 3 represent Hispanic, Asian, and African-American.
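The predictive steps (factor conversion, logistic model, accuracy) might be sketched as follows in base R. Everything here is an assumption made so the example runs on its own: the data frame `college` is simulated rather than loaded from the project file, the lower-case column names are placeholders, and the 70/30 split and 0.5 cutoff are arbitrary choices.

```r
set.seed(42)
n <- 400

# Simulated stand-in for the admissions data; column names are assumptions
college <- data.frame(
  gre  = round(runif(n, 300, 800)),
  gpa  = round(runif(n, 2.0, 4.0), 2),
  rank = sample(1:4, n, replace = TRUE)
)

# Simulated outcome so the example runs end to end
lin <- -8 + 0.006 * college$gre + 1.2 * college$gpa - 0.5 * college$rank
college$admit <- rbinom(n, 1, plogis(lin))

# Transform the categorical variable to a factor
college$rank <- factor(college$rank)

# 70/30 train/test split (an arbitrary choice for this sketch)
train_idx <- sample(seq_len(n), size = 0.7 * n)
train <- college[train_idx, ]
test  <- college[-train_idx, ]

# Logistic regression; inspect the p-values before dropping variables
fit <- glm(admit ~ gre + gpa + rank, data = train, family = binomial)
print(summary(fit))

# Accuracy on the held-out rows with a 0.5 probability cutoff
pred_prob  <- predict(fit, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
accuracy   <- mean(pred_class == test$admit)
print(accuracy)
```

Comparing against a decision tree or an SVM would need add-on packages (for example `rpart` and `e1071`); scoring every model on the same held-out rows keeps the accuracy rates comparable.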

## Project 2

Problem Statement:

A UK-based online retail store has captured the sales data for different products for the period of one year (Nov 2016 to Dec 2017). The organization sells gifts primarily on the online platform. Some customers buy directly for their own use, while small businesses buy in bulk and sell to other customers through the retail outlet channel.

Analysis information:

Find the significant customers for the business who make high purchases of their favorite products. The organization wants to roll out an offer to the high-value customers after identification of segments. Use the clustering methodology to segment customers into groups:

• Use the following clustering algorithms:
  o K-means
  o Hierarchical
• Identify the right number of customer segments.
• Provide the number of customers who are highly valued.
• Identify the clustering algorithm that gives maximum accuracy and explains robust clusters.
• If most of the observations are concentrated in one of the clusters, break that cluster down further using a clustering algorithm.

Variables in the Dataset:

• This is a transnational dataset that contains all the transactions occurring between Nov 2016 and Dec 2017 for a UK-based online retail store.
• Variable information:
  o InvoiceNo: Invoice number (a 6-digit integral number uniquely assigned to each transaction)
  o StockCode: Product (item) code
  o Description: Product (item) name
  o Quantity: The quantity of each product (item) per transaction
  o InvoiceDate: The day when each transaction was generated
  o UnitPrice: Unit price (product price per unit)
  o CustomerID: Customer number (unique ID assigned to each customer)
  o Country: Country name (the name of the country where each customer resides)
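One possible shape for the segmentation, sketched in base R under stated assumptions: the transactions are simulated (only the column names come from the variable list above), the customer features `TotalSpend` and `Invoices` are one illustrative choice among many, and `k = 3` is a placeholder that should really be read off the elbow curve.

```r
set.seed(1)
n <- 600

# Simulated transactions; only the column names follow the brief
retail <- data.frame(
  CustomerID = sample(1001:1060, n, replace = TRUE),
  InvoiceNo  = sample(500000:500299, n, replace = TRUE),
  Quantity   = sample(1:50, n, replace = TRUE),
  UnitPrice  = round(runif(n, 0.5, 20), 2)
)
retail$Amount <- retail$Quantity * retail$UnitPrice

# One row per customer: total spend and number of distinct invoices
spend    <- aggregate(Amount ~ CustomerID, data = retail, FUN = sum)
invoices <- aggregate(InvoiceNo ~ CustomerID, data = retail,
                      FUN = function(x) length(unique(x)))
customers <- merge(spend, invoices, by = "CustomerID")
names(customers) <- c("CustomerID", "TotalSpend", "Invoices")

# Standardize so both features contribute equally to the distances
features <- scale(customers[, c("TotalSpend", "Invoices")])

# Elbow curve: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

k <- 3  # assumed for illustration; pick it from the elbow in practice
km <- kmeans(features, centers = k, nstart = 10)

# Hierarchical clustering (Ward linkage) cut into the same number of groups
hc <- hclust(dist(features), method = "ward.D2")
hc_groups <- cutree(hc, k = k)

# Compare the two assignments and report the mean spend per k-means cluster
print(table(kmeans = km$cluster, hierarchical = hc_groups))
print(aggregate(TotalSpend ~ cluster,
                data = data.frame(customers, cluster = km$cluster),
                FUN = mean))
```

The cluster with the highest mean spend is a natural candidate for the "highly valued" segment, and its size is the count the brief asks for.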

## Project 3
Question:

A nationwide survey of hospital costs conducted by the US Agency for Healthcare consists of hospital records of inpatient samples. The given data is restricted to the state of Wisconsin and relates to patients in the age group 0-17 years. The agency wants to analyze the data to research healthcare costs and their utilization.

Here is a detailed description of the given dataset:

• AGE: Age of the patient discharged
• FEMALE: Binary variable that indicates if the patient is female
• LOS: Length of stay, in days
• RACE: Race of the patient (specified numerically)
• TOTCHG: Hospital discharge costs
• APRDRG: All Patient Refined Diagnosis Related Groups

To complete this project, you will require a strong understanding of the following concepts:

• Lesson 3: Histogram, Summaries, ANOVA
• Lesson 4: Linear Regression

The goals of this project are:

• To record the patient statistics, the agency wants to find the age category of people who visit the hospital most often and have the maximum expenditure.
• To rank diagnoses and treatments by severity and to find the expensive treatments, the agency wants to find the diagnosis related group that has the maximum hospitalization and expenditure.
• To make sure that there is no malpractice, the agency needs to analyze whether the race of the patient is related to the hospitalization costs.
• To properly utilize the costs, the agency has to analyze the severity of the hospital costs by age and gender for proper allocation of resources.
• Since the length of stay is the crucial factor for inpatients, the agency wants to find out whether the length of stay can be predicted from age, gender, and race.
• To perform a complete analysis, the agency wants to find the variable that mainly affects the hospital costs.

The data can be downloaded from the URL mentioned below (under the name HospitalCosts):

http://instruction.bus.wisc.edu/jfrees/jfreesbooks… /BookWebDec2010/data.html

The total time provided to complete this task is 2 hours.
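A few of these goals map directly onto base R summaries, ANOVA, and linear regression. The sketch below uses a simulated data frame `hosp` in place of the HospitalCosts file; the column names follow the dataset description, but the values, the race codes 1 to 6, and the way `TOTCHG` is generated are assumptions made purely so the code runs.

```r
set.seed(7)
n <- 300

# Simulated stand-in for HospitalCosts; values are illustrative only
hosp <- data.frame(
  AGE    = sample(0:17, n, replace = TRUE),
  FEMALE = rbinom(n, 1, 0.5),
  LOS    = rpois(n, 3),
  RACE   = sample(1:6, n, replace = TRUE)  # assumed numeric race codes
)
hosp$TOTCHG <- round(1000 + 700 * hosp$LOS + rnorm(n, sd = 500))

# Goal 1: age with the most hospital records and the highest total charges
visits_by_age  <- table(hosp$AGE)
charges_by_age <- tapply(hosp$TOTCHG, hosp$AGE, sum)
print(names(which.max(visits_by_age)))
print(names(which.max(charges_by_age)))

# Goal 3: one-way ANOVA of charges against race
race_aov <- aov(TOTCHG ~ factor(RACE), data = hosp)
print(summary(race_aov))

# Goal 5: can length of stay be predicted from age, gender, and race?
los_fit <- lm(LOS ~ AGE + FEMALE + factor(RACE), data = hosp)
print(summary(los_fit))

# Goal 6: which variables are associated with the charges?
cost_fit <- lm(TOTCHG ~ AGE + FEMALE + LOS + factor(RACE), data = hosp)
print(summary(cost_fit))
```

On the real data, rows with missing `RACE` values would need handling (for example with `na.omit()`) before fitting.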

## Project 4
Question:

The data gives the details of third party motor insurance claims in Sweden for the year 1977. In Sweden, all motor insurance companies apply identical risk arguments to classify customers, and thus their portfolios and their claims statistics can be combined. The data were compiled by a Swedish Committee on the Analysis of Risk Premium in Motor Insurance. The Committee was asked to look into the problem of analyzing the real influence on the claims of the risk arguments and to compare this structure with the actual tariff.

The insurance dataset holds 7 variables, and the description of these variables is given below:

• Kilometers: Kilometers travelled per year
  o 1: < 1000
  o 2: 1000-15000
  o 3: 15000-20000
  o 4: 20000-25000
  o 5: > 25000
• Zone: Geographical zone
  o 1: Stockholm, Göteborg, and Malmö with surroundings
  o 2: Other large cities with surroundings
  o 3: Smaller cities with surroundings in southern Sweden
  o 4: Rural areas in southern Sweden
  o 5: Smaller cities with surroundings in northern Sweden
  o 6: Rural areas in northern Sweden
  o 7: Gotland
• Bonus: No claims bonus; equal to the number of years, plus one, since the last claim
• Make: 1-8 represent eight different common car models. All other models are combined in class 9.
• Insured: Number of insured in policy-years
• Claims: Number of claims
• Payment: Total value of payments in Skr (Swedish Krona)

To complete this project, you will require a strong understanding of the following concepts:

• Lesson 3: Correlation, summarizing, subset summary, and data visualization
• Lesson 4: Linear Regression

After understanding the data, you need to help the committee with the following by the use of the R tool:

• The committee is interested to know each field of the data collected through descriptive analysis to gain basic insights into the data set and to prepare for further analysis.
• The total value of payment by an insurance company is an important factor to be monitored. So the committee has decided to find whether this payment is related to the number of claims and the number of insured policy years. They also want to visualize the results for better understanding.
• The committee wants to figure out the reasons for insurance payment increase and decrease. So they have decided to find whether distance, location, bonus, make, and insured amount or claims are affecting the payment, or whether all or some of these are affecting it.
• The insurance company is planning to establish a new branch office, so they are interested to find at what location, kilometer, and bonus level their insured amount, claims, and payment get increased. (Hint: Aggregate Dataset)
• The committee wants to understand what affects their claim rates so as to decide the right premiums for a certain set of situations. Hence, they need to find whether the insured amount, zone, kilometer, bonus, or make affects the claim rates and to what extent.

The data can be downloaded from the URL mentioned below (Swedish motor insurance):

http://instruction.bus.wisc.edu/jfrees/jfreesbooks… /BookWebDec2010/data.html

The total time provided to complete this task is 4 hours.
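The payment regression and the "Aggregate Dataset" hint could look roughly like this in base R. The data frame `ins` is simulated so the example is self-contained; its column names follow the variable list, but the values and the relationship between `Claims` and `Payment` are invented, and defining the claim rate as `Claims / Insured` is one reasonable reading of the brief rather than a stated requirement.

```r
set.seed(11)
n <- 200

# Simulated stand-in for the Swedish motor insurance data
ins <- data.frame(
  Kilometers = sample(1:5, n, replace = TRUE),
  Zone       = sample(1:7, n, replace = TRUE),
  Bonus      = sample(1:7, n, replace = TRUE),
  Make       = sample(1:9, n, replace = TRUE),
  Insured    = round(runif(n, 1, 5000), 2)
)
ins$Claims  <- rpois(n, lambda = 0.05 * ins$Insured)
ins$Payment <- pmax(0, round(5000 * ins$Claims + rnorm(n, sd = 2000)))

# Descriptive overview of every field
print(summary(ins))

# Is payment related to the number of claims and insured policy-years?
print(cor(ins[, c("Payment", "Claims", "Insured")]))
pay_fit <- lm(Payment ~ Claims + Insured, data = ins)
print(summary(pay_fit))

# Hint from the brief: aggregate insured, claims, and payment by zone
by_zone <- aggregate(cbind(Insured, Claims, Payment) ~ Zone,
                     data = ins, FUN = sum)
print(by_zone[order(-by_zone$Payment), ])

# Claim rate (assumed here to be claims per insured policy-year)
ins$ClaimRate <- ins$Claims / ins$Insured
rate_fit <- lm(ClaimRate ~ Insured + factor(Zone) + factor(Kilometers) +
                 Bonus + factor(Make), data = ins)
print(summary(rate_fit))
```

Swapping `Zone` for `Kilometers` or `Bonus` in the `aggregate()` call gives the other two groupings the branch-office question asks about.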

## Project 5
Question:

The web analytics team of www.datadb.com wants to understand the web activity of the site, including the sources used to access it. They have a database that records time on page, source group, bounces, exits, unique page views, and visits.

The variables in the dataset are defined here for better understanding:

• Bounces: It represents the percentage of visitors who enter the site and "bounce" (leave the site) rather than continuing to view other pages within the same site.
• Exits: It represents the percentage of visitors to a site who actively click away to a different site from a specific page, after possibly having visited any other page on the site.
• Continent: It shows the continent from which the site has been accessed.
• Source group: It shows how the visitor has accessed the site.
• Time on page: It shows how long the user has spent on that particular page of the website.
• Unique page view: It represents the number of sessions during which that page was viewed one or more times.
• Visits: A visit counts all visitors, no matter how many times the same visitor may have been to your site.

To complete this project, you will require a strong understanding of the following concepts:

• Lesson 3: Explore data using R, diagnostic analysis using R
• Lesson 4: Logistic Regression

The team is targeting the following issues:

• The team wants to analyze each variable of the data collected through data summarization to get a basic understanding of the dataset and to prepare for further analysis.
• As mentioned earlier, a unique page view represents the number of sessions during which that page was viewed one or more times. A visit counts all instances, no matter how many times the same visitor may have been to your site. So the team needs to know whether the unique page view value depends on visits.
• Find out the probable factors from the dataset that could affect the exits. Exit page analysis is usually required to get an idea about why a user leaves the website for a session and moves on to another one. Please keep in mind that exits should not be confused with bounces.
• Every site wants to increase the time on page for a visitor. This increases the chances of the visitor understanding the site content better, and hence there are more chances of a transaction taking place. Find the variables that possibly have an effect on the time on page.
• A high bounce rate is a cause of alarm for websites that depend on visitor engagement. Help the team in determining the factors that are impacting the bounce.

The data can be downloaded from the Excel file given below:

InternetCaseStudy.csv

The total time provided to complete this task is 2 hours.
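As a rough starting point, the dependence of unique page views on visits and the factor screening for exits can both be explored with `lm()` and `anova()`. The data frame `web` below is simulated, and its column names (`Visits`, `Uniquepageviews`, `Timeinpage`, `Continent`, `Sourcegroup`, `Exits`) are guesses at how the fields in InternetCaseStudy.csv might be named, so they would need adjusting to the real header.

```r
set.seed(3)
n <- 250

# Simulated stand-in for the web analytics data; column names are assumptions
web <- data.frame(
  Visits      = rpois(n, 5) + 1,
  Continent   = factor(sample(c("AF", "AS", "EU", "NA", "OC", "SA"),
                              n, replace = TRUE)),
  Sourcegroup = factor(sample(c("google", "direct", "facebook", "other"),
                              n, replace = TRUE)),
  Timeinpage  = round(rexp(n, rate = 1 / 60))
)
web$Uniquepageviews <- pmax(1, web$Visits - rpois(n, 1))
web$Exits <- runif(n)  # share of visitors exiting, simulated independently

# Data summarization for every variable
print(summary(web))

# Does the unique page view value depend on visits?
upv_fit <- lm(Uniquepageviews ~ Visits, data = web)
print(summary(upv_fit))

# Screen candidate factors for exits; the ANOVA table gives one row per term
exit_fit <- lm(Exits ~ Visits + Timeinpage + Continent + Sourcegroup,
               data = web)
exit_table <- anova(exit_fit)
print(exit_table)
```

The same model formula with `Timeinpage` or a bounce column on the left-hand side covers the remaining two questions. If bounce is recorded per visit as a 0/1 flag rather than a percentage, `glm(..., family = binomial)` would be the closer match to the logistic regression lesson.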

## Project 6
Question:

A high-end fashion retail store is looking to expand its products. It wants to understand the market and find the current trends in the industry. It has a database of all products with attributes, such as style, material, season, and the sales of the products over a period of two months.

There are two files provided, and the detailed description of each is given below:

Attribute DataSet.csv

• Dress_ID: A unique identifier for each dress
• Style: Style of dress, can belong to one of 12 styles, including casuals, novelty, etc.
• Price: Price category of the dress (low, average, medium, high, and very high)
• Rating: A number between 0 and 5, specifying the rating of the dress
• Size: Size of the dress (small, medium, large, XL, and free)
• Season: Season category of the dress, i.e., summer, spring, etc.
• NeckLine: Type of neckline, for example, V-neck, collar, etc.
• Sleeve length: Length of the sleeve, for example, full, three-quarters, etc.
• Waistline: Waistline of the dress
• Material: Material of the dress, for example, silk, cotton, etc.
• Fabric type: Fabric type of the dress
• Decoration: Decoration of the dress, like ruffles, embroidery, etc.
• Pattern Type: Pattern type of the dress, for example, dot, animal print, etc.
• Recommendation: A binary value suggesting a recommendation (1) or not (0)

Dress Sales.xlsx

• Dress_ID: A unique identifier for each dress
• The remaining columns depict the sales for each dress on a particular date. Dates range from 29/8/2013 to 12/10/2013, and the sales are registered for alternate days.

To complete this project, you will require a strong understanding of the following concepts:

• Lesson 3: ANOVA and Correlation
• Lesson 4: Linear Regression, Logistic Regression, and Time Series Analysis

The goals of this project are:

• To automate the process of recommendations, the store needs to analyze the given attributes of the product, like style, season, etc., and come up with a model to predict the recommendation of products (in binary output: 0 or 1) accordingly.
• In order to stock the inventory, the store wants to analyze the sales data and predict the trend of total sales for each dress for an extended period of three more alternate days.
• To decide the pricing for various upcoming clothes, they wish to find how the style, season, and material affect the sales of a dress and if the style of the dress is more influential than its price.
• Also, to increase the sales, the management wants to analyze the attributes of dresses and find which are the leading factors affecting the sale of a dress.
• To regularize the rating procedure and find its efficiency, the store wants to find if the rating of the dress affects the total sales.

The data can be downloaded from the URL mentioned below (under the Data Folder link):

http://archive.ics.uci.edu/ml/datasets/Dresses_Att…

The total time provided to complete this task is 4 hours.
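For the sales trend goal, one simple option is a straight-line trend fitted with `lm()` and extended by three alternate days. This is a deliberately basic stand-in for a proper time series model, and the sales figures below are simulated for a single hypothetical dress; only the date range and the two-day spacing come from the brief.

```r
set.seed(5)

# Alternate-day dates covering the range stated in the brief
dates <- seq(as.Date("2013-08-29"), as.Date("2013-10-12"), by = "2 days")

# Simulated sales for one hypothetical dress (illustrative values only)
dress <- data.frame(
  t     = seq_along(dates),
  sales = round(100 + 3 * seq_along(dates) + rnorm(length(dates), sd = 10))
)

# Linear trend over the observation index
trend_fit <- lm(sales ~ t, data = dress)
print(summary(trend_fit))

# Extend the trend by three more alternate days
future <- data.frame(t = max(dress$t) + 1:3)
future$date <- max(dates) + 2 * (1:3)
future$forecast <- predict(trend_fit, newdata = future)
print(future)
```

With the real Dress Sales file, the same fit would be repeated per `Dress_ID` after reshaping the date columns into rows. A smoothing or ARIMA approach may suit dresses whose sales are not close to linear.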

Follow the steps below to draft the project report:

1. The project report has to be created in a “single” Word document, PPT, or PDF.