Description
Learn to use Microsoft R Server to analyze large datasets in Big Data environments with Hadoop, Spark clusters, and SQL Server databases. After the course, participants will be able to:
- Explain how Microsoft R Server and Microsoft R Client work together
- Use R Client with R Server to explore large datasets from different data stores
- Visualize data using graphs and plots
- Transform and clean large datasets
- Implement options for splitting analysis jobs into parallel tasks
- Build and evaluate regression models generated from large datasets
- Create and deploy partitioning models generated from large datasets
- Use R in SQL Server and Hadoop environments
Contents
Module 1: Microsoft R Server and R Client
- Explain how Microsoft R Server and Microsoft R Client work.
- Lessons
- What is Microsoft R Server
- Using Microsoft R Client
- The ScaleR functions
- Lab : Exploring Microsoft R Server and Microsoft R Client
- Using R Client in VSTR and RStudio
- Exploring ScaleR functions
- Connecting to a remote server
Module 2: Exploring Big Data
- At the end of this module the student will be able to use R Client with R Server to explore big data held in different data stores.
- Lessons
- Understanding ScaleR data sources
- Reading data into an XDF object
- Summarizing data in an XDF object
- Lab : Exploring Big Data
- Reading a local CSV file into an XDF file
- Transforming data on input
- Reading data from SQL Server into an XDF file
- Generating summaries over the XDF data
Module 3: Visualizing Big Data
- Explain how to visualize data by using graphs and plots.
- Lessons
- Visualizing in-memory data
- Visualizing big data
- Lab : Visualizing data
- Using ggplot to create a faceted plot with overlays
- Using rxLinePlot and rxHistogram
Module 4: Processing Big Data
- Explain how to transform and clean big data sets.
- Lessons
- Transforming Big Data
- Managing datasets
- Lab : Processing big data
- Transforming big data
- Sorting and merging big data
- Connecting to a remote server
Module 5: Parallelizing Analysis Operations
- Explain how to implement options for splitting analysis jobs into parallel tasks.
- Lessons
- Using the RxLocalParallel compute context with rxExec
- Using the RevoPemaR package
- Lab : Using rxExec and RevoPemaR to parallelize operations
- Using rxExec to maximize resource use
- Creating and using a PEMA class
Module 6: Creating and Evaluating Regression Models
- Explain how to build and evaluate regression models generated from big data
- Lessons
- Clustering Big Data
- Generating regression models and making predictions
- Lab : Creating a linear regression model
- Creating a cluster
- Creating a regression model
- Generating data for making predictions
- Using the models to make predictions and compare the results
Module 7: Creating and Evaluating Partitioning Models
- Explain how to create and score partitioning models generated from big data.
- Lessons
- Creating partitioning models based on decision trees
- Testing partitioning models by making and comparing predictions
- Lab : Creating and evaluating partitioning models
- Splitting the dataset
- Building models
- Running predictions and testing the results
- Comparing results
Module 8: Processing Big Data in SQL Server and Hadoop
- Explain how to use R in SQL Server and Hadoop environments.
- Lessons
- Using R in SQL Server
- Using Hadoop Map/Reduce
- Using Hadoop Spark
- Lab : Processing big data in SQL Server and Hadoop
- Creating a model and predicting outcomes in SQL Server
- Performing an analysis and plotting the results using Hadoop Map/Reduce
- Integrating a sparklyr script into a ScaleR workflow