Title/ Speakers
R Beginner Workshop | Joe Kambourakis (Trainer at EMC), John Verostek | Fri 6:00 - 8:30 | The workshop has three main parts: RStudio, Data Management, and Graphics. We will walk through how to use RStudio for inputting scripts (multiple lines of code), entering code, installing packages and libraries, running a script, looking at data, and visualizing output. For "Data Management" we will input data and manipulate it using Vectors and DataFrames: think Excel columns (Vectors) and an Excel spreadsheet (DataFrame). We will cover some stats along the way, too. We will finish with Graphics, including Histograms, Boxplots, and Scatterplots.
Beginner SQL | Sat Morning
Data Science | How to Think About Data - David Weisman | Sat 9:00 - 10:00 | David will present on "How to Think About Data". This talk is for attendees who know the basics of data science, and it offers ways to reason critically about their data.
Data Science | Generalized Linear Mixed Models (GLMMs) in R to model population dynamics and temporal autocorrelation in long-lived plant species - Julia Pilowsky, a biology PhD student at Tufts University | Sat 10:00 - 11:00 | Plant population models help ecologists and managers direct conservation efforts toward those populations that are at greatest risk of collapse. Research has shown that vital rates such as yearly survival and reproductive rate are best estimated using generalized linear mixed models (GLMMs). Long-lived plant species add an extra layer of complexity because their vital rates may be autocorrelated from one year to the next. Mixed Model Case Study
Data Science | Dimensionality Reduction using Principal Components Analysis in MATLAB - Sri Krishnamurthy | Sat 11:00 - 11:45 | We will cover both traditional and modern techniques to address high-dimensional data sets. In part 1, we will lay the foundation by discussing some of the common traditional techniques to handle high-dimensional data. We begin by discussing the problems in high-dimensional data sets, including the famous “Curse of Dimensionality” problem. We then discuss two methods to deal with high-dimensional data sets. The goal of the first method is to reduce the number of variables by variable selection, and that of the second is to reduce the number of variables by deriving new variables. We will illustrate these methods through sample techniques (regression, decision trees, and principal component analysis) and give pointers on implementing these techniques in MATLAB.
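As a rough illustration of "deriving new variables", the sketch below computes the first principal component of a small data set in plain Python. It is not taken from the talk, which points to MATLAB. The data is made up, the function name is hypothetical, and power iteration is used here only because it keeps the example free of dependencies.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction, 1-D scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each variable on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the dominant
    # eigenvector, provided the start vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project the centered data onto the component: one new variable per row.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Made-up 2-D data lying close to the line y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, scores = first_principal_component(data)
```

Here the two original variables collapse into a single derived variable (`scores`) along a direction close to the line y = 2x. A full PCA would also extract the remaining components and their explained variance.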
Data Engineering | Baseball and Data Engineering | Sat 1:30 - 2:30
Data Engineering | Data Engineering (Python) | Sat 2:30 - 3:30
Data Engineering | Using R to Build a ... - Dag Holmboe | Sat 12:30 - 1:00 | R is used to build a TBD. Steps covered include: (1) using the Google API package to access Google Analytics for keywords, then (2) using the Twitter API package to download Twitter searches based on these keywords, then (3) using the WordNet package to find synonyms for the keywords, and finally (4) using the rCharts package to interactively visualize the findings. Also, (5) I use the tm (text mining) package to analyze the tweets.
Text Analytics | Using Twitter to Analyze Switching Across Cellphone Carriers - Tanya Cashorali | Sat 1:00 - 1:30 | Twitter data is scraped to analyze subscribers switching between wireless carriers. For example, if someone mentions 'switch[ed||ing] to T-Mobile', we mark that as a switch to T-Mobile. We provide context around these switches to understand in a programmatic way why people are switching to and from carriers. Simple word association and patterns determine, for instance, that a tweet is marked as 'To Verizon' and 'From AT&T' and that it involves 'Data'. Simple word dictionaries are then used to assign tweets to each bucket.
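The pattern-and-dictionary approach described above might look roughly like the following Python sketch. It is not the speaker's code: the carrier list, the topic dictionary, the "from" trigger words, and the `classify` function are all hypothetical stand-ins.

```python
import re

# Hypothetical carrier list and topic dictionary, for illustration only.
CARRIERS = ["T-Mobile", "Verizon", "AT&T", "Sprint"]
TOPICS = {"Data": ["data", "lte", "4g"], "Price": ["price", "bill", "cheaper"]}

_carrier_alt = "|".join(re.escape(c) for c in CARRIERS)
# Matches "switch to X", "switched to X", or "switching to X".
TO_PATTERN = re.compile(rf"\bswitch(?:ed|ing)?\s+to\s+({_carrier_alt})", re.IGNORECASE)
# Assumed trigger words for the carrier being left.
FROM_PATTERN = re.compile(rf"\b(?:from|leaving|left)\s+({_carrier_alt})", re.IGNORECASE)


def classify(tweet):
    """Return the carrier switched to, the carrier switched from, and topics."""
    to_match = TO_PATTERN.search(tweet)
    from_match = FROM_PATTERN.search(tweet)
    # Dictionary lookup: a topic applies if any of its keywords appears.
    words = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    topics = sorted(t for t, keywords in TOPICS.items() if words & set(keywords))
    return {
        "to": to_match.group(1) if to_match else None,
        "from": from_match.group(1) if from_match else None,
        "topics": topics,
    }


result = classify("Switching to Verizon from AT&T, their data speeds are terrible")
```

For the sample tweet, `result` marks a switch to Verizon, from AT&T, involving the "Data" topic. Real tweets would need far more patterns (negation, sarcasm, misspelled carrier names) than this sketch handles.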
Text Analytics | Topic Modeling | Sat 1:30 - 2:30
Text Analytics | Optimizing Multilingual Search - David Troiano, a Principal Software Engineer at Basis Technology | Sat 2:30 - 3:00 | Multilingual search requires the developer to address challenges that don’t exist in the monolingual case. In Solr, a robust multilingual search engine requires different analysis chains for each language because each language has its own logic for tokenization, lemmatization, stemming, synonyms, and stop words. To make multilingual search even harder, query strings are typically no longer than a handful of words, making language identification of query strings more difficult, or at worst ambiguous even to a human (“pie” could be an English or Spanish query). We’ll explore the breadth of Solr schema and configuration options available to a multilingual search application developer to balance functionality, performance, and complexity. We’ll dive deep into specific experiments with a practical application.
Beginner Python | IPython Tutorial - Imran Malek | Sun 9:00 - 10:00 | Covers the basics of IPython Notebooks, pandas (data frames, reading CSVs, etc.), and a light touch on Matplotlib to render a visualization of traffic at MBTA stations during the “late night” hours.
Beginner Python | Regression using Pandas and StatsModels - Allen Downey | Sun 10:00 - 11:00 | Regression is a powerful tool for fitting data and making predictions. In this talk I present the basics of linear regression and logistic regression and show how to use them in Python. I demonstrate pandas, a Python module that provides structures for data analysis, and StatsModels, a module that provides tools for regression and other statistical analysis.

As an example, I will use data from the National Survey of Family Growth to generate predictions for the date of birth, weight, and sex of an expected baby. This presentation is based on material from the recent revision of Think Stats, an introduction to data analysis and statistics with Python.

This talk is appropriate for people with no prior experience with regression. Basic familiarity with Python is recommended but not required.
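
For readers new to regression, the core of simple linear regression fits in a few lines of plain Python. The sketch below is an illustration only: it deliberately avoids pandas and StatsModels (which the talk demonstrates) to stay self-contained, the function names are hypothetical, and the data is made up rather than drawn from the National Survey of Family Growth.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted y for a new x under the fitted line."""
    return intercept + slope * x


# Made-up data that lies exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_line(xs, ys)
prediction = predict(intercept, slope, 6)
```

A statistics library adds what this sketch leaves out: standard errors, p-values, multiple explanatory variables, and logistic models.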
Working with Big Data & R | Massive Feature Selection Using Supercomputing in R - Jean-Loup | Sat 12:30 | Problems at the Big Data scale regularly involve hundreds to thousands of features and millions to billions of observations. Data scientists are often interested in identifying only a few dozen of the most relevant features in order to generate actionable analyses. Since the density distributions of the features and their interactions are usually very complex, aggregating the results from several sophisticated feature selection techniques often yields more robust results. In this presentation, we will show how R can be used in practice to select features on a large scale, based on various feature selection techniques - traditional statistical tests (R packages "stats" and "fBasics"), information theory (infotheo, mpmi) and Machine Learning (e1071, class, randomForest, gbm) - coupled with parallel code (foreach, doParallel) and distributed computing (Rmpi). Elementary theoretical aspects will be illustrated on a complex real dataset related to the predictive maintenance of jet engines, covering data visualization, R code, algorithmic complexity, and computational issues.
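One simple way to aggregate results from several feature selection techniques is to average each feature's rank across methods. The Python sketch below assumes that approach; the abstract does not say which aggregation the speaker uses, and the method names, feature names, and scores are invented for illustration. It also leaves out the parallel and distributed parts.

```python
def aggregate_ranks(scores_by_method):
    """Order features by mean rank across methods (rank 1 = most relevant).

    scores_by_method maps method name -> {feature: score}, where a higher
    score means more relevant. Every method is assumed to score every feature.
    """
    totals = {}
    for scores in scores_by_method.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, feature in enumerate(ordered, start=1):
            totals[feature] = totals.get(feature, 0) + rank
    n_methods = len(scores_by_method)
    mean_rank = {f: total / n_methods for f, total in totals.items()}
    # Ties on mean rank are broken alphabetically to keep the output stable.
    return sorted(mean_rank, key=lambda f: (mean_rank[f], f))


# Hypothetical relevance scores from three techniques.
scores = {
    "t_test": {"temp": 0.9, "pressure": 0.4, "vibration": 0.7},
    "mutual_info": {"temp": 0.5, "pressure": 0.1, "vibration": 0.8},
    "random_forest": {"temp": 0.6, "pressure": 0.2, "vibration": 0.3},
}
ranking = aggregate_ranks(scores)
```

Working on ranks rather than raw scores sidesteps the fact that a p-value, a mutual information estimate, and a tree-based importance are not on comparable scales.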
Working with Big Data & R | R for high-throughput screening data | Sat 2:30
Working with Big Data & R | Introduction to Massively Parallel Databases - Wes Reing | Sat 1:30 | MPP databases allow you to use traditional database architectures to implement a parallel database that can span dozens of machines. Unlike MapReduce and NoSQL systems, MPP databases behave like traditional databases in most ways. For example, production systems access them through JDBC drivers, and Python users just change their SQLAlchemy config to point to the MPP database as though it were a PostgreSQL database. The advantages and disadvantages of the MPP approach will be discussed in the abstract, along with some vendor-specific implementations (Greenplum, Redshift, and Vertica).

DataViz Design - Angela Bassa | Sun 9:00 - 10:00 | Angela will cover dataviz design concepts, including but not limited to choosing colors that work for color-blind viewers, activating negative space, using whitespace, clarity, Abela’s choice diagram, and a review of Tufte’s Challenger disaster analysis. These principles are tool-agnostic: they apply across data visualization, whether in PowerPoint or on the Web.
Interactive DataViz Using R - Abhinav Sarapure | Sun 10:00 - 11:00 | This talk will cover a variety of new data visualization packages for R, including ggvis, rCharts, and shiny.
MBTA DataViz - Michael Barry and Brian Card | Sun 12:30 - 1:30 | A Case Study Visualizing Boston's Subway System: Michael and Brian recently built Visualizing MBTA Data, an interactive report of the performance and behavior of Boston's subway system. This talk will outline their visualization design process as well as the exploratory visual analysis techniques used while building the project.
DataViz | Interactive DataViz using Python Glue - Chris Beaumont | Sun 1:30 - 12:30 | Chris will present Glue, a project he is working on to enable rapid, linked-view visualization in Python.
Machine Learning Using Python | ML in the Cloud and Python - Roope Astala of Microsoft | Sun 12:30 - 1:30 | This presentation is an introduction to Microsoft Azure Machine Learning: a cloud-based service for building and experimenting with machine learning models, and putting them in production as web service REST endpoints, using visual data flow composition with no coding required. We discuss different kinds of machine learning models - classification, regression, recommendation - as well as features such as data importing, visualization, and experiment sharing. We walk through an example of authoring and publishing a machine learning model, and making predictions by calling that model as a request-response service from a Python script. MS Azure and Python
Machine Learning Using Python | IP-Reputation Scoring System in Python and Hadoop - Stuart Layton of BitSight Technologies | Sun 1:30 - 2:30 | An IP-reputation scoring system in Python that integrates with Hadoop.
Machine Learning Using Python | Dynamic Control for Purchasing of Online Advertisements - Michael Els | Sun 2:30 - 3:30 | In the computational advertising space, bids are selected from a possible 12-15 billion ads a day. An algorithm was needed to control our bid volume across thousands of ad campaigns simultaneously. The goal was to serve ads optimally across the day and maximize ad quality and efficiency for each ad campaign. A real-time, machine learning-based product was developed that incorporates a PID algorithm, Kalman filters, and particle swarm optimization.
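To give a feel for the PID part of such a system, here is a minimal discrete PID controller in Python that steers a bid volume toward a target. It is a sketch under heavy assumptions, not the product described above: the gains, the target, and the toy model (the realized volume simply equals the commanded volume, with no noise or delay) are all invented, and the Kalman filter and particle swarm components are not shown.

```python
class PIDController:
    """Discrete PID controller; the gains used below are illustrative only."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measured, dt=1.0):
        """Return the control adjustment for one time step."""
        error = setpoint - measured
        self.integral += error * dt
        # No derivative term on the first step, since there is no prior error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy simulation: drive a hypothetical bid volume from 20 toward 100 per interval.
TARGET_VOLUME = 100.0
pid = PIDController(kp=0.5, ki=0.1, kd=0.05)
volume = 20.0
history = []
for _ in range(60):
    volume += pid.update(TARGET_VOLUME, volume)
    history.append(volume)
```

In this toy run the volume overshoots the target once and then settles near it. A production system would run one such loop per campaign and would have to cope with noisy, delayed measurements, which is presumably where the filtering and optimization pieces come in.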