Cases and Research

Here is a subset of the projects I've been working on, recent and not-so-recent. They range from more theoretical projects to case studies highlighting ways businesses use analytics in practice.

Projects are ordered roughly chronologically within each section. Arrows denote expandable content. Click on them for more details.

See also my teaching and notes for additional work.

Business Analytics

XLKitLearn - A tool for data science in Excel

See the XLKitLearn page.

Image Recognition at the USPS

Even in these days of email and electronic communication, the United States Postal Service handles almost 150 billion pieces of mail a year. Doing this quickly and efficiently is a truly monumental task. In this case study, we focus on one small part of this problem - the optical character recognition (OCR) that must happen every time a piece of mail is processed, so that it can be automatically sorted into the correct bin. We first cover the history of these systems at the USPS, and then show how random forests can be used to construct a simple image recognition model that achieves 92% accuracy. We also discuss how image data are encoded for data science.
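
For a flavor of the approach, here is a minimal sketch (not the case's code or data - it uses scikit-learn's bundled 8x8 digit images as a stand-in for the USPS scans) of a random-forest digit classifier:

```python
# Minimal sketch of a random-forest digit classifier, in the spirit of the case.
# Uses scikit-learn's bundled 8x8 digits data as a stand-in for the USPS images.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()                      # each image is flattened into 64 pixel features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```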

This case is still being polished. Please be in touch for an early copy.

Evisort: An AI-Powered Start-up Uses Text Mining to Become Google for Contracts

AI-driven text mining, a relatively new business analytics tool, allows users to unlock troves of information contained in documents and make them searchable by content and metadata. In this two-part case, I first introduce Evisort, a start-up seeking to create AI-enhanced software providing contract management and processing solutions for attorneys and business professionals, and discuss the challenges and opportunities inherent in such a startup. I then provide an introduction to the science of text analytics.
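
As a toy illustration of the underlying idea (this is not Evisort's system - just TF-IDF vectors and cosine similarity over a few invented contract snippets), content-based search over documents can be sketched as follows:

```python
# Toy illustration of content-based document search: represent each contract as a
# TF-IDF vector and rank documents against a query by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

contracts = [
    "This lease agreement commences on January 1 and renews annually.",
    "The vendor shall indemnify the client against third-party claims.",
    "Either party may terminate this agreement with 30 days written notice.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(contracts)

query_vector = vectorizer.transform(["termination notice period"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()
best = scores.argmax()
print(f"Best match (score {scores[best]:.2f}): {contracts[best]}")
```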

Review copy (stamped 'do not copy') of part 1 and part 2. Published at Columbia Caseworks (part 1 and part 2), and available there with full teaching notes, data files, and solutions. Featured by Business Wire.

Markdown Management: An Introduction to Dynamic Pricing

In this case/game combination, I introduce the concept of markdown management in a situation with complex demand patterns that depend on the price of an item, its location, and time. As a result, the optimal markdown strategy can only be calculated using a dynamic program. I first show how to solve a simplified version of the dynamic program, leading to a heuristic, and then solve the full DP in Excel, leading to an optimal solution. The case comes with an online game which allows students to test out their strategies, and automatically logs the results of these strategies in a Google doc that can be used to analyze them in class.
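
To give a sense of the structure (this toy model is purely illustrative - the case's demand model, price menu, and Excel implementation are different), a stripped-down, deterministic markdown DP might look like this:

```python
# A hypothetical, deterministic mini-version of the markdown problem: choose one of
# a fixed menu of non-increasing prices each week to maximize revenue from a finite
# stock, with demand depending on price and week.
from functools import lru_cache

PRICES = [60, 50, 40, 30]            # allowed price points; markdowns only go down
WEEKS = 8
START_INVENTORY = 100

def demand(price, week):
    # Illustrative demand curve: lower prices sell more units; demand fades over time.
    return max(0, int(40 - 0.5 * price - 2 * week))

@lru_cache(maxsize=None)
def best_revenue(week, inventory, price_idx):
    if week == WEEKS or inventory == 0:
        return 0.0, None
    best_value, best_price = -1.0, None
    for idx in range(price_idx, len(PRICES)):      # keep the current price or mark down
        price = PRICES[idx]
        sales = min(inventory, demand(price, week))
        future, _ = best_revenue(week + 1, inventory - sales, idx)
        value = price * sales + future
        if value > best_value:
            best_value, best_price = value, price
    return best_value, best_price

value, first_price = best_revenue(0, START_INVENTORY, 0)
print(f"Optimal revenue: {value:.0f}, week-1 price: {first_price}")
```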

This case is still being polished. Please be in touch for an early copy.

Data-Driven Investment Strategies for Peer-to-Peer Lending: A Case Study for Teaching Data Science

with Maxime Cohen, Kevin Jiao, and Foster Provost, finalist in the 2019 INFORMS case competition

We develop a number of data-driven investment strategies that demonstrate how machine learning and data analytics can be used to guide investments in peer-to-peer loans. We detail the process starting with the acquisition of (real) data from a peer-to-peer lending platform all the way to the development and evaluation of investment strategies based on a variety of approaches. We focus heavily on how to apply and evaluate the data science methods, and resulting strategies, in a real-world business setting. The material presented in this article can be used by instructors who teach data science courses, at the undergraduate or graduate levels. Importantly, we go beyond just evaluating predictive performance of models, to assess how well the strategies would actually perform, using real, publicly available data. Our treatment is comprehensive and ranges from qualitative to technical, but is also modular—which gives instructors the flexibility to focus on specific parts of the case, depending on the topics they want to cover. The learning concepts include the following: data cleaning and ingestion, classification/probability estimation modeling, regression modeling, analytical engineering, calibration curves, data leakage, evaluation of model performance, basic portfolio optimization, evaluation of investment strategies, and using Python for data science.
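
In compressed form - and on synthetic data rather than the real peer-to-peer loan data used in the article - the central pipeline of estimating each loan's default probability and building a portfolio from the loans predicted to be safest looks something like this:

```python
# A compressed sketch of the central idea, on synthetic data: estimate each loan's
# default probability, then invest in the loans predicted to be least risky.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))                       # stand-ins for borrower features
default_prob = 1 / (1 + np.exp(-(X @ np.array([1.0, -0.5, 0.8, 0.3]) - 1.5)))
defaulted = rng.binomial(1, default_prob)
returns = np.where(defaulted == 1, -0.4, 0.08)    # crude payoff: lose 40% or earn 8%

X_tr, X_te, y_tr, y_te, r_tr, r_te = train_test_split(
    X, defaulted, returns, test_size=0.5, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
risk = model.predict_proba(X_te)[:, 1]            # predicted default probability

picked = np.argsort(risk)[:200]                   # the 200 safest-looking loans
print(f"Average return, selected loans: {r_te[picked].mean():.3f}")
print(f"Average return, all loans:      {r_te.mean():.3f}")
```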

Published in Big Data, and available here. Supporting notebooks and a separate student note available here.

Regression Analytics at the New York City Department of Education

With over 1,700 schools serving 1.1 million students and an annual budget of almost $25 billion, the New York City Department of Education is the largest public school system in the United States. It is also one of the most diverse. In this case, we first use Python to parse and prepare publicly available data, and we then use regression analysis to explore various factors that affect performance at these schools.
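
The regression step itself is straightforward; here is a minimal sketch with made-up column names (the case builds its dataset from the actual public NYC DOE files):

```python
# Minimal sketch of school-level regression with hypothetical columns, for
# illustration only; the case prepares its data from public NYC DOE files.
import pandas as pd
import statsmodels.formula.api as smf

schools = pd.DataFrame({
    "avg_score":       [78, 82, 65, 90, 71, 85, 74, 88],
    "pct_free_lunch":  [60, 45, 80, 20, 70, 35, 65, 25],
    "student_teacher": [16, 14, 18, 12, 17, 13, 16, 12],
})

model = smf.ols("avg_score ~ pct_free_lunch + student_teacher", data=schools).fit()
print(model.summary())
```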

This case is still being polished. Please be in touch for an early copy.

Data Visualization in Tableau - The Case of Citibike

Citibike is a bike-sharing system in New York City. Bike stations are set up throughout the city, and users can pick up and drop off bikes from these stations. The system makes its data available, and in this case we use these data to learn the art of data visualization in Tableau, and discuss how data can be used to make operational decisions.

This case is still being polished. Please be in touch for an early copy.

High Dimensional Model Selection

This review paper summarizes many of the results in the field of high dimensional model selection in statistics. Broadly, the field concerns model fitting in situations in which there are many, many variables that might affect an outcome, and we seek the subset of variables best able to model this outcome.
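
As a concrete (illustrative, not drawn from the essay) example of the setting: even with far more candidate variables than observations, a penalized method such as the lasso can recover a small relevant subset.

```python
# Illustrative example: many candidate predictors, few truly relevant, and the
# lasso used to recover a sparse subset of variables.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 500                                  # far more variables than observations
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:5] = [3, -2, 1.5, 2.5, -1]            # only the first five variables matter
y = X @ true_coef + rng.normal(scale=0.5, size=n)

fit = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
selected = np.flatnonzero(fit.coef_)
print("Variables selected by the lasso:", selected)
```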

This was an essay submitted in partial fulfilment of the requirements for my Master's in Mathematics at the University of Cambridge, under the supervision of Richard Samworth.

Available here.

Package Sizing Decisions

with Oded Koenigsberg

What's the ideal size for a ketchup bottle (from Heinz's point of view)? If the bottle is too small, the company loses out on extra profit from consumers who would have been willing to buy more. If the bottle is too big, the company loses out on consumers who need much less, and therefore don't buy at all. This question was addressed in a 2010 paper by Koenigsberg, Kohli and Montoya. This piece of work reviews some of their results, but also examines their assumptions and reports preliminary numerical attempts at improving some of them to make the model more realistic.

This is work I carried out while at Columbia in the summer of 2009 - summary available here.


Supply Chain Management

Two-echelon distribution systems with random demands and storage constraints

with Awi Federgruen and Garud Iyengar

We consider a general two-echelon distribution system consisting of a depot and multiple sales outlets, henceforth referred to as retailers, which face random demands for a given item. The replenishment process consists of two stages: the depot procures the item from an outside supplier, while the retailers' inventories are replenished by shipments from the depot. Each replenishment stage is associated with a facility-specific leadtime. Both the depot and the retailers face limited inventory capacities. Inventories are reviewed and orders are placed on a periodic basis. When a retailer runs out of stock, unmet demand is backlogged. We develop effective strategies to handle these kinds of problems, and show that they perform exceptionally well in the vast majority of reasonable instances.
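
None of the paper's actual policies are reproduced here, but a toy periodic-review simulation of a single capacitated retailer following a simple base-stock rule gives a feel for the mechanics the abstract describes (leadtimes, capacity limits, backlogged demand):

```python
# Toy periodic-review simulation (not the paper's policy): one capacitated retailer
# ordering from a depot under a base-stock rule, with a fixed leadtime and backlogging.
import random

random.seed(0)
CAPACITY, BASE_STOCK, LEADTIME, PERIODS = 40, 35, 2, 52

inventory = BASE_STOCK          # on-hand minus backlog (can go negative)
pipeline = [0] * LEADTIME       # shipments ordered but not yet arrived
backlog_periods = 0

for t in range(PERIODS):
    inventory += pipeline.pop(0)                      # receive the oldest shipment
    demand = random.randint(0, 12)
    inventory -= demand                               # unmet demand is backlogged
    position = inventory + sum(pipeline)              # inventory position
    order = max(0, BASE_STOCK - position)
    order = min(order, CAPACITY - max(inventory, 0))  # respect storage capacity
    pipeline.append(order)
    backlog_periods += inventory < 0

print(f"Periods with backlog: {backlog_periods} out of {PERIODS}")
```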

Published in Naval Research Logistics. Draft version available here.

Supply Chain Coordination and Contracts in the Sharing Economy - a Case Study at Cargo

with Maxime Cohen and Wenqiang Xiao, winner of the 2018 INFORMS Case Competition

Cargo’s mission is to help "rideshare drivers earn more money by providing complimentary and premium products to passengers." Cargo sources goods from suppliers to provide a platform for gig economy drivers to run small convenience stores out of their vehicles. Drivers earn additional income, and riders enjoy convenient and affordable access to products during their rides. As the company grew, Cargo faced a number of supply-chain-related challenges, including determining the product mix in the Cargo box, replenishment of the product, and the cost of carrying inventory. In particular, would the replenishment decision be driven by the company or the driver, and who would bear the responsibility for the inventory cost? The founders also considered how to manage the company’s suppliers most efficiently: Would a centralized or decentralized model best serve Cargo and its drivers? And how might supply chain contracts with its suppliers help support the company’s profitable growth?

Review copy (stamped 'do not copy') available here. Published at Columbia Caseworks, and available there with full teaching notes, data files, and solutions.

Multi-Item Two Echelon Distribution Systems with Random Demands: Bounds and Effective Strategies

with Awi Federgruen and Garud Iyengar

We consider a general two-echelon distribution system consisting of a depot and multiple sales outlets, henceforth referred to as retailers, which face random demands for a given item. The replenishment process consists of two stages: the depot procures the item from an outside supplier, while the retailers' inventories are replenished by shipments from the depot. Each replenishment stage is associated with a facility-specific leadtime. Both the depot and the retailers face limited inventory capacities. Inventories are reviewed and orders are placed on a periodic basis. When a retailer runs out of stock, unmet demand is backlogged.

Here, we consider the additional complication in which there are multiple items, possibly correlated with one another, each of which competes for storage capacity. We develop effective strategies to handle these kinds of problems, and show that they perform exceptionally well in the vast majority of reasonable instances.

Draft version available here.

Information Relaxation-Based Lower Bounds for The Stochastic Lot Sizing Problem with Advanced Demand Information

with Awi Federgruen and Garud Iyengar

Most models of supply chain management assume that demand is uncorrelated across periods. In other words, if I have a blockbuster day at my store today, I'm no more or less likely to have a blockbuster day tomorrow. Not only that, but they also assume that nothing I observe today can affect my beliefs about demand tomorrow. Clearly, this is almost never true. However, dealing with models that take such dependencies into account turns out to be extremely difficult. Many people have devised approximate methods for dealing with such problems, but they've never had an easy way to show that their approximate methods work well. In this paper, we devise such a method.
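
As a quick numerical illustration of that point (not the paper's model), consider demand that follows a simple autocorrelated process - what happened today genuinely changes the best forecast for tomorrow:

```python
# Tiny illustration: under a hypothetical AR(1) demand process, today's observation
# shifts the expected demand for tomorrow, contradicting the independence assumption.
import numpy as np

rng = np.random.default_rng(0)
phi, mean = 0.7, 100
demand = np.empty(10_000)
demand[0] = mean
for t in range(1, demand.size):
    demand[t] = mean + phi * (demand[t - 1] - mean) + rng.normal(scale=10)

high_today = demand[:-1] > mean + 10              # "blockbuster" days
print(f"Avg demand after a blockbuster day: {demand[1:][high_today].mean():.1f}")
print(f"Avg demand after an ordinary day:   {demand[1:][~high_today].mean():.1f}")
```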

Draft version available here.

Two Echelon Distribution Systems: Applications to a Luxury Goods Retailer

In this project, I was able to obtain data from a luxury goods retailer pertaining to their supply chain, in the hope of applying the algorithms in this section to this supply chain. Unfortunately, the data was such that I was unable to apply the methods above directly to this retailer. In this paper, I adapted these methods as best I could to come up with an algorithm that could be used in practice.

PDF available here.


Miscellaneous

Analytics of TCUs in Californian Hospitals using Bayesian Networks

Intensive care units are invariably the most expensive units in any given hospital, and are often overloaded. Certain hospitals have introduced transitional care units (TCUs), which are cheaper to run, to try to reduce the load on intensive care units. However, it is unclear whether the introduction of these TCUs has had any positive effects. In this project, we analyse data using data mining techniques to better understand the effect of introducing TCUs. Specifically, we examine the effect TCUs have had on appropriate measures of treatment cost and quality of service.

This was a project I undertook as part of a service systems class with Prof Ward Whitt and with the help of Prof Carri Chan. Unfortunately, the data seemed to be of insufficient scale to get conclusive results. Final presentation here, minus any slides containing results, for confidentiality reasons.

The Odds Algorithm

The classical secretary problem concerns the following situation: an interviewer needs to hire a single secretary, and sets out to interview a fixed number of candidates. While interviewing a candidate, the interviewer ascertains how the candidate ranks compared to every previous candidate. After each candidate is seen, the interviewer can either accept the candidate and end the interview process, or reject the candidate, without any chance of ever returning to that candidate. The classical secretary problem seeks the best strategy to adopt in this case. Clearly, choosing an early candidate is a bad idea - indeed, having seen very few candidates, it is difficult to know what is available. Similarly, waiting until the last candidate might also not be the best choice - the last candidate might be lousy!

The Odds Algorithm was developed as a very elegant way to solve the secretary problem and many of its more complicated variations. In this presentation, we state and prove the Odds Theorem and consider a number of its applications.
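
For concreteness, here is one way (my illustrative sketch, not the report's code) to implement the Odds Theorem's stopping rule and check that it recovers the familiar "skip roughly n/e candidates" strategy for the secretary problem:

```python
# Bruss's odds algorithm: given independent events with success probabilities p[k],
# sum the odds p/(1-p) backwards from the end until the sum reaches 1; stop at the
# first success from that index onward. Returns the start index and win probability.
def odds_algorithm(p):
    odds_sum, q_prod, start = 0.0, 1.0, 0
    for k in range(len(p) - 1, -1, -1):
        q = 1.0 - p[k]
        odds_sum += p[k] / q if q > 0 else float("inf")
        q_prod *= q
        if odds_sum >= 1.0:
            start = k
            break
    return start, q_prod * odds_sum

# Secretary problem with n candidates: candidate k is the best seen so far with
# probability 1/k, so the rule recovers the classical ~n/e strategy.
n = 100
p = [1.0 / k for k in range(1, n + 1)]
start, win_prob = odds_algorithm(p)
print(f"Start accepting from candidate {start + 1}; win probability ~ {win_prob:.3f}")
```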

This was a final project in Prof Omar Besbes' and Vineet Goyal's course "Dynamic Learning and Optimization: Theory and Applications", which I took in the Spring 2011 semester at Columbia. I prepared a short presentation (which requires the MathType fonts) and report.

Detecting Bubbles Using Option Prices

In the context of financial markets, bubbles refer to asset prices that exceed the asset's fundamental, intrinsic value. Bubbles are often associated with a large increase in the asset price followed by a collapse when the bubble "bursts". A series of recent papers have developed a number of mathematical models for bubbles in financial markets, together with a number of analytical tests that could, in theory, be used to detect bubbles before they burst. These tests, however, only use information available in the stock prices themselves. In this project, we investigated a variation of these detection methods that relies on the prices of options on the stock, rather than on the price of the stock itself.

This was a summer project I undertook in the first year of my PhD with Prof Paul Glasserman. PowerPoint presentation available here.

The OLYMPUS experiment

What's inside a proton? We should be able to answer that question using lattice QCD (quantum chromodynamics), and when computers catch up with the theory, we probably will. In the meantime, however, we're stuck with a more primitive method - shoot things at protons, see what happens, and make deductions. The problem is that particle physicists have tried two ways to "shoot stuff at a proton", and the results have not been consistent. This could be because of second-order interactions polluting one of the methods. OLYMPUS is an experiment that should reveal whether this is the case. This poster summarizes the background and aims of the experiment.

This was a final presentation for class 8.276 (Particle Physics) at MIT. PDF available here.

The Path Integral Approach to Quantum Mechanics

Quantum mechanics and classical mechanics are both called "mechanics" - but they apparently have little in common. One deals with waves, operators and probabilities, whereas the other deals with particles, forces and deterministic variables. This paper is an introduction to the path integral formulation of quantum mechanics, which unifies quantum and classical mechanics under one common framework and reduces to the Lagrangian approach in the classical limit (the correspondence principle).
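
The central object is the propagator, written (in standard notation) as a sum over all paths between the endpoints, each weighted by its classical action:

```latex
% The propagator as a "sum over histories", weighted by the action S[x]:
K(x_b, t_b;\, x_a, t_a) = \int \mathcal{D}[x(t)]\, e^{\, i S[x]/\hbar},
\qquad
S[x] = \int_{t_a}^{t_b} L\!\left(x, \dot{x}, t\right) dt .
% When S \gg \hbar, stationary phase keeps only paths with \delta S = 0,
% i.e. the classical trajectory satisfying the Euler--Lagrange equations.
```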

This was a final project for class 8.06 (Quantum Mechanics) at MIT. PDF available here.

Enzyme-free constant-temperature DNA quantification and amplification

DNA is everywhere, and being able to accurately and reliably detect and amplify tiny amounts of the molecule is crucial. The most common DNA amplification method, PCR (Polymerase Chain Reaction), is ubiquitous, but requires highly specialized and expensive enzymes and tightly controlled reaction conditions, most commonly obtained using thermal-cycling machines. In this project, we attempted to extend a method developed by Zhang et al. (2007) to create an "enzyme-free" version of PCR.

This was part of a SURF project at Caltech's DNA lab. Progress report (more informative) here and final report here.

Excel Tools

This set of tools extends Excel's functionality:

- Formula explorer allows easy auditing of large and complex formulas: clicking on any cell reference brings up the relevant cell, and brackets are independently highlighted for clarity. To use, hit Ctrl+Shift+F in any cell with a formula. Hit F1 from the formula explorer for a list of features.
- Functions to perform rudimentary linear algebra operations: finding eigenvalues, eigenvectors, Cholesky decompositions, and inverse matrices.

In theory, downloading this xla file and opening it should make these tools available in any workbook. Unfortunately, this was written for a previous version of Excel - it is unlikely to still work.

A Turing Machine Development Environment

Turing Machines are one of the simplest computing models equivalent to today's computers - that is to say, anything computers can do, Turing Machines can do, and vice versa. Turing Machines can therefore be used to probe the limits of what computers can and cannot do. However, Turing Machines are rather difficult and tedious to program, and very few packages exist to help with this process. The aim of this project was to build a program to ease the creation of Turing Machines.
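
To illustrate what "programming" a Turing Machine involves (this sketch is not the project's environment - just a few lines of Python), a machine is simply a transition table mapping (state, symbol) to (new state, written symbol, head move):

```python
# Minimal Turing Machine simulator: a transition table maps (state, symbol) to
# (new state, symbol to write, head move). The example machine flips every bit.
def run_turing_machine(rules, tape, state="start", blank="_", max_steps=1000):
    tape = dict(enumerate(tape))            # sparse tape, indexed by position
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, blank)
        state, write, move = rules[(state, symbol)]
        tape[head] = write
        head += 1 if move == "R" else -1
    return "".join(tape[i] for i in sorted(tape))

# Bit-flipping machine: walk right, swapping 0s and 1s, and halt at the first blank.
rules = {
    ("start", "0"): ("start", "1", "R"),
    ("start", "1"): ("start", "0", "R"),
    ("start", "_"): ("halt",  "_", "R"),
}
print(run_turing_machine(rules, "101100"))   # prints 010011_
```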

This was a research project I carried out at the Technion during the summer of 2003 (I was 16 when I wrote this, so don't judge!). Short presentation, project report, and executable files (file 1 and file 2; I'd be amazed if these still run on a modern OS!)