Benchmark the Best Companies in Any Sector, Location, or Stage
At DataFox, our mission is to provide machine-learned business insights to analysts, executives and investors. One of our proprietary systems is the DataFox scoring system, which uses machine learning in conjunction with curated training sets to develop models that calculate company scores.
These scores allow DataFox users to quickly search our database of more than 650,000 companies and identify the best or most suitable companies in any sector, location, or stage.
To do this, we’ve built a series of algorithms that automatically score companies on growth, influence, finance, and team, and that also produce an overall score.
How the DataFox Scoring Model is Determined
Our scores surface the best companies, just as Google's PageRank algorithm surfaces the best webpages.
Five Distinct Scores
Our customers have different perspectives on how to score a company, such as overall size, growth, funding strength, team quality, and more. Scoring is inherently subjective, so we calculate five different scores that draw on dozens of criteria.
Harnessing Machine-Learning to Score Companies
We use machine learning to train our algorithm to assign scores across companies. The role of machine learning in this process is to rapidly run simulations and assist us in determining the appropriate value of the coefficients to cause the algorithm’s output to match with the training sets.
Defining Training Sets
We build training sets to classify what success and failure look like for the algorithm.
A training set is a set of examples used to fit a model that predicts a type of response from the input variables, or features. We identify companies that exemplify the characteristics of high- and low-achieving companies for each score. Our training sets include companies at all stages of growth, from Early Stage to Late Stage.
Cleaning and Normalizing Features
Raw data must first be cleaned and normalized before it can be used. We remove outliers and noisy data, for example. For our growth score, we calculate an intermediate "liquidity score" that estimates the company's financial position on a basis that can be compared across industries.
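As a rough illustration of the cleaning step, the sketch below drops outliers and rescales what remains to a 0-to-1 range. The interquartile-range rule, the min-max scaling, and the sample headcount values are all assumptions chosen for the example; the article does not say which methods DataFox actually uses.

```python
from statistics import quantiles


def remove_outliers(values, fence=1.5):
    """Drop values outside the interquartile-range fences (an assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Rescale values to the range 0..1 so features are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up headcounts; the last entry is an obvious data error.
raw_headcounts = [12, 15, 18, 20, 22, 25, 30, 35, 40, 5000]
cleaned = remove_outliers(raw_headcounts)
normalized = min_max_normalize(cleaned)
```

Removing the bad value before normalizing matters: if 5000 stayed in, every legitimate company would be squeezed into a tiny band near zero.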
How Scores are Calculated
Like Google’s search algorithm, the DataFox scores are proprietary and we do not release the exact formulas. To give you an idea of how they work, here are some of the features that factor into them.
Finance score: What is the financial strength of the company?
- Investor Score: A function that estimates the number of ‘prestigious’ institutions that have invested in the company. DataFox analysts curate the set of ‘prestigious’ institutions.
- Estimated Revenues: While revenue estimates for private companies are imprecise, our analyses have shown them to be valuable indicators of the relative size of the business.
- Liquidity Score: A formula we developed that calculates how liquid the company’s finances are based on funds raised and the recency of those financings.
- 5 other metrics...
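To make the idea of a recency-weighted measure like the Liquidity Score more concrete, here is a purely hypothetical sketch. The exponential decay, the one-year half-life, and the function name `liquidity_score` are assumptions made for illustration; the actual DataFox formula is proprietary and is not described in this article.

```python
from datetime import date


def liquidity_score(financings, as_of, half_life_days=365.0):
    """Sum funds raised, discounting older rounds (hypothetical weighting).

    financings: list of (amount_usd, closing_date) tuples.
    Each round's contribution halves every `half_life_days`.
    """
    total = 0.0
    for amount_usd, closed_on in financings:
        age_days = (as_of - closed_on).days
        total += amount_usd * 0.5 ** (age_days / half_life_days)
    return total


today = date(2016, 1, 1)
old_round = [(10_000_000, date(2015, 1, 1))]
recent_round = [(10_000_000, date(2015, 12, 1))]
old_score = liquidity_score(old_round, today)
recent_score = liquidity_score(recent_round, today)
```

Under these assumptions, two companies that raised the same amount receive different values, with the more recent financing counting for more.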
HR score: How strong is the company's team?
- Retention Rate: The average number of previous jobs held by members of the company’s executive team.
- Educational Prestige: The fraction of degrees earned by the company’s executive team that come from ‘prestigious’ schools, defined to mean schools ranked in the top 30 in the US News & World Report rankings.
- Change in LinkedIn Followers: The rate at which the company is gaining or losing LinkedIn followers. This indicates whether a company is gaining or losing cachet in the job market.
- 4 other metrics...
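Of the team metrics listed above, Educational Prestige is defined precisely enough to sketch directly: it is a fraction of degrees. The school names and the contents of the prestigious set below are placeholders, and returning `None` for a team with no known degrees is an assumption.

```python
def educational_prestige(degree_schools, prestigious_schools):
    """Fraction of executive-team degrees earned at 'prestigious' schools.

    degree_schools: one entry per degree, so a school may appear repeatedly.
    Returns None when no degrees are known (an assumed convention).
    """
    if not degree_schools:
        return None
    hits = sum(1 for school in degree_schools if school in prestigious_schools)
    return hits / len(degree_schools)


# Placeholder names standing in for a curated top-30 list.
top_schools = {"School A", "School B"}
team_degrees = ["School A", "School C", "School D", "School B"]
prestige = educational_prestige(team_degrees, top_schools)
```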
Influence score: How significant is the company's online presence?
- Website Traffic: The amount of web traffic to the company’s website.
- News Mentions: The frequency with which the company is mentioned in the news, as calculated by our news auditing algorithms.
- Twitter Mentions and Followers: The number of @mentions of the company on Twitter, along with its follower count.
- 3 other metrics...
Growth score: Is the company likely to experience revenue growth?
- Headcount Growth: The increase in the number of employees at a firm over time.
- Investor Score: A function that estimates the number of ‘prestigious’ investment firms that have invested in the company. The set of ‘prestigious’ institutions is curated by DataFox analysts.
- Job Listings: The number of available positions the company has listed on Indeed and Jobvite.
- Growth Factor: A function of overall quality score with respect to time.
- 4 other metrics...
Overall DataFox Score: How successful is the company overall?
While we do not disclose the exact formulas, the DataFox Score leverages machine learning to build a model, selecting among all of the underlying features available across the four sub-scores.
Picking the Model
Given the training data, we train the chosen model to learn the optimal combination and weightings of the input features. This learning phase helps us determine which quantitative factors differentiate the successes from the failures among examples provided in the training set. The most naïve possible model would be to apply a formula like this:
- [(1*A) + (1*B) + (1*C) + (1*D) + (1*E)] / 5
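Written as code, the naive formula above is just an unweighted average of the features. The five sample values are made up for illustration.

```python
def naive_score(features):
    """Equal-weight average: every feature gets a coefficient of 1."""
    return sum(features) / len(features)


# Hypothetical values for features A through E.
example_features = [80, 60, 90, 70, 50]
score = naive_score(example_features)
```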
After doing all of the manual work gathering data points and computing them, even very simple functions can help separate the high achievers from the rest. Meanwhile, in-house teams do not normally have the wherewithal to use sophisticated functions to normalize the features or scientifically calculate what the model and its coefficients ought to be. We generate these scores so our clients don’t have to spend hours creating their own scores with less sophisticated models.
In the end, each of our scoring models takes a form that looks somewhat like the following:
- [(11.5*A^3) + (0.5*function(B)) + (32.3*function(C)) + (1.2*function(B&D)) + (19.2*function(J&C)) + (17.7*H^(1/2)) + (0.2*function(J))]
In this hypothetical, A might be employee retention, B might be headcount growth, and so on. In practice, the models tend to be more complex polynomial equations than the one above, but this helps illustrate the point.
The role of machine learning in this process is to rapidly run simulations and assist us in determining the appropriate value of the coefficients (11.5 and .5 and 32.3 and so on) to cause the algorithm’s output to match with the training sets. Once it finds the best fitting model, it then applies that formula to all of the other companies in our data sets.
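The coefficient-fitting step can be sketched with ordinary least squares, used here as a simple stand-in because the article does not name the fitting procedure. The two transformed features (A^3 and the square root of H), the synthetic training rows, and the target coefficients 11.5 and 17.7 are borrowed from the hypothetical formula above and are assumptions for the example only.

```python
def solve_linear_system(matrix, vector):
    """Solve matrix * x = vector by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_coefficients(feature_rows, targets):
    """Least-squares fit via the normal equations (X^T X) w = X^T y."""
    k = len(feature_rows[0])
    xtx = [[sum(row[i] * row[j] for row in feature_rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(feature_rows, targets))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def transform(a, h):
    """Hypothetical feature transforms: A cubed and the square root of H."""
    return [a ** 3, h ** 0.5]


# Synthetic training set generated from known coefficients.
true_weights = [11.5, 17.7]
raw_examples = [(0.2, 4.0), (0.5, 9.0), (0.8, 1.0), (1.0, 16.0)]
rows = [transform(a, h) for a, h in raw_examples]
targets = [sum(w * f for w, f in zip(true_weights, row)) for row in rows]

weights = fit_coefficients(rows, targets)
```

Because the targets here were generated without noise, the fit recovers the original coefficients to within floating-point error. Real training labels would be noisy, and the fitted weights would only approximate any underlying relationship.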
Iteration & Cross-validation
We output the results of the algorithm and examine its performance, looking at measures such as root mean squared error and confusion matrices to identify weaknesses in the quality of the input data, normalization, or training sets. We then optimize and repeat. We apply cross-validation to get a good estimate of how our model will generalize to data that it has not been trained on.
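The evaluation measures named above can be sketched in a few lines. This is a minimal illustration, not DataFox's pipeline: the contiguous k-fold split, the mean-value baseline model, and the sample scores are assumptions chosen to keep the example self-contained.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def confusion_matrix(predicted_labels, actual_labels):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, a in zip(predicted_labels, actual_labels):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1
        elif a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validated_rmse(targets, k):
    """Average held-out RMSE of a baseline that predicts the training mean."""
    errors = []
    for train, test in k_fold_indices(len(targets), k):
        mean_prediction = sum(targets[i] for i in train) / len(train)
        held_out = [targets[i] for i in test]
        errors.append(rmse([mean_prediction] * len(held_out), held_out))
    return sum(errors) / len(errors)


# Made-up scores on the 0 to 1,250 scale.
sample_scores = [400, 650, 900, 300, 1100, 750]
baseline_error = cross_validated_rmse(sample_scores, k=3)
```

Each fold is scored only on companies the model did not see during fitting, which is what makes the averaged error an estimate of performance on new data.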
What is the scale of our Scores?
Clients often ask us how they should interpret the DataFox scores. Our scores are calculated on a scale between 0 and 1,250. The following is a histogram of the “Overall Quality Score” for the top 110,856 businesses in our data set:
Why do we not present all 550,000+ companies here? First, many companies are either too early, too small, or for other reasons have not hit enough milestones to earn a good score. Second, many smaller companies operate in stealth and have little to no publicly visible footprint.
We are especially careful about our score calculations, and we in fact calculate our confidence in a company’s score. When a key data point is missing for a given company, we do not publish that score. We still build a profile for the company and collect events for it, but we leave the score blank. Meanwhile, among established, Top 100K companies, the distribution gives you a sense of what the score means.
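The rule of leaving a score blank when key data is missing might look like the sketch below. The names in `REQUIRED_FEATURES` are hypothetical; the article does not say which data points count as key.

```python
from typing import Optional

# Hypothetical list of data points treated as "key" for this example.
REQUIRED_FEATURES = ("headcount", "funds_raised_usd", "web_traffic")


def publishable_score(features: dict, raw_score: float) -> Optional[float]:
    """Return the score only if every key data point is present.

    A result of None means the profile is kept but the score is left blank.
    """
    missing = [name for name in REQUIRED_FEATURES if features.get(name) is None]
    if missing:
        return None
    return raw_score


complete = {"headcount": 120, "funds_raised_usd": 25_000_000, "web_traffic": 48_000}
stealth = {"headcount": 8, "funds_raised_usd": None, "web_traffic": None}

published = publishable_score(complete, 870.0)
withheld = publishable_score(stealth, 310.0)
```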
Room for Improvement and Conclusion
More data. At the core of this algorithm, there are data inputs, transformations to organize the data, and formulas to create and output these scores. Improvements to the scores therefore are achievable in just a few areas: more data, better transformations, or better formulas.
Our previous work in machine learning and statistics, our own internal iterations and testing, and interviews with our advisors and customers all point to the same conclusion: formula improvements eventually yield diminishing returns. While we frequently tune our training sets and the polynomial equations or linear regressions that calculate our scores, we spend more of our institutional effort on collecting more proprietary data.
The next tweak to a score’s algorithm, using only the existing data, may yield a 1% improvement in the scores (which we measure using our training data). By adding an altogether new and proprietary feature to the data set, we might instead see an improvement of 5% or 10%. Thus, our new data collection algorithms and initiatives are largely driven by their expected contribution to score quality.