Projects

Targeting New Retail Locations

Background

I interviewed at an international retail company. At the time of writing, they have an open position focused on growing business in the Americas. They sell their product in three different types of stores, and also online. Although the recruiter didn't ask me to prepare anything, I wanted to put together a short presentation to showcase ambition, technical capacity, and storytelling.

I think putting together this minimum viable product (MVP) served me well. At one point during the final interview, the head of the office asked about my general data analysis process and how I go about sharing findings. I said, "Well, I'm glad you asked that..." I'm supposed to hear back from them this week.

Methodology

I extracted the location and type of each store from the company's website. For each of the 11 cities where the company has a type 1 store, I downloaded demographic data from the Census Bureau's American FactFinder. I focused on type 1 stores on the assumption that they require the most investment but also generate the most revenue.

If you've ever dealt with socioeconomic and demographic data, you know that there is A LOT of information available. Based on my understanding of the company's target clientele and a quick scan of the available data, I settled on 10 demographic variables that fit into three broad categories: diversity, workforce participation, and affluence.

I then developed a profile of the "typical" city for this company: I calculated the average value of each of the 10 variables across the 11 cities. I used this profile to create a lead score for potential future locations. I compared each potential location's demographics against these average values: if a location's value for a given variable was better than the typical city's, I awarded that location one point. With 10 variables, a lead score can range from 0 to 10.
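The scoring rule above can be sketched in a few lines. This is only an illustration, not the code behind the project (which was done in R): the three variable names and all the numbers are made up, and it assumes that "better" means "higher" for every variable. A variable where lower is better would need the comparison flipped.

```python
from statistics import mean

# Hypothetical values for three existing type 1 cities (made-up numbers,
# and only 3 of the 10 variables, to keep the sketch short).
existing_cities = [
    {"median_income": 70000, "labor_force_pct": 68.0, "foreign_born_pct": 20.0},
    {"median_income": 80000, "labor_force_pct": 66.0, "foreign_born_pct": 30.0},
    {"median_income": 90000, "labor_force_pct": 70.0, "foreign_born_pct": 25.0},
]


def build_profile(cities):
    """Average each variable across the existing cities (the 'typical' city)."""
    variables = cities[0].keys()
    return {v: mean(city[v] for city in cities) for v in variables}


def lead_score(location, profile):
    """Award one point per variable where the location beats the profile.

    Assumption: higher is better for every variable.
    """
    return sum(1 for v, avg in profile.items() if location[v] > avg)


profile = build_profile(existing_cities)

# Hypothetical candidate county.
candidate = {"median_income": 85000, "labor_force_pct": 67.0, "foreign_born_pct": 26.0}
score = lead_score(candidate, profile)
print(score)  # 2: beats the profile on income and foreign-born share
```

Treating every variable as a simple pass/fail keeps the score easy to explain in a presentation, at the cost of ignoring how far above or below the average a location sits.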

Rather than scoring individual cities, I scored counties as potential future locations. I did this for two reasons: (1) simplicity; and (2) the company's philosophy. Regarding the latter, this company prefers depth over breadth. Instead of opening new stores in disparate, new locations, the company seeks to open many stores in geographically proximate locations.

Figure 1: Map of Existing Store Locations in the U.S.

Data Source: Company website, Census Bureau
Created with R, ggplot2

I focused on counties based on the weak assumption that cities within the same county could have similar demographics. I thought that multiple cities within the same county with high lead scores could satisfy the company's philosophy. The real driver, however, was simplicity: this weak assumption allowed me to skip the process of calculating geographic proximity (if anyone knows how to quickly do this, please let me know!). 
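One common way to approximate geographic proximity is the haversine (great-circle) distance between two latitude/longitude points. The sketch below is an assumption about how that step could look, not something the project actually did; the function names are hypothetical and the coordinates are approximate city centers used only for illustration, not the company's store locations.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def nearest_store_km(candidate, stores):
    """Distance from a candidate (lat, lon) to its closest existing store."""
    return min(haversine_km(*candidate, *store) for store in stores)


# Illustrative coordinates only (approximate New York and Los Angeles).
stores = [(40.7128, -74.0060), (34.0522, -118.2437)]
candidate = (40.0, -75.0)
print(round(nearest_store_km(candidate, stores)))
```

With a helper like this, a candidate location could be kept only if its nearest existing store is within some chosen radius, which would replace the county-level shortcut with an explicit distance rule.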

Figure 2: Map of U.S. Counties and Lead Scores - Darker Blue Equates to Higher Lead Score

Data Source: Company website, Census Bureau
Created with R, ggplot2

The next natural step was to compare these counties with existing locations. Because the profile was built from the cities that already have type 1 stores, it should be no surprise that high lead scores and existing type 1 locations overlap considerably.

Figure 3: Map of U.S. Counties and Lead Scores vs. Type 1 Stores

Data Source: Company website, Census Bureau
Created with R, ggplot2

Final Thoughts

I turned this project around over the course of two days, with about five hours of total work. If I had more time, and perhaps actually worked at the company, I would do things a little differently. The analysis does not include ANY information about the performance of each of these locations. With that information, I could build better models that relate demographics to performance metrics. With more time, I could build a Shiny app to allow users to choose their own demographics and thresholds for lead scores. Lastly, some of the demographic variables are likely correlated; after using intuition to choose the features, I could have performed principal component analysis (PCA) to reduce dimensionality.
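Before reaching for PCA, a quick way to see whether the chosen variables overlap is to compute pairwise Pearson correlations and flag the strongly related pairs. The sketch below is a hypothetical check, not part of the original analysis: the data are made up, the variable names are illustrative, and the 0.8 threshold is an arbitrary assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Made-up values for five hypothetical locations.
columns = {
    "median_income": [50, 60, 70, 80, 90],
    "bachelors_pct": [20, 25, 30, 35, 40],  # moves in lockstep with income
    "labor_force_pct": [66, 70, 64, 69, 65],
}
flagged = correlated_pairs(columns)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.2f}")
```

If two variables are flagged this way, they effectively earn a location the same point twice, which is the kind of redundancy PCA would fold into a single component.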