Five data sources you're probably not using to improve your software planning

We live in an era of big data. So it stands to reason that we should be looking at using data for software planning... but of course not all data is actually useful in improving outcomes. Here are five useful data sources that I’ve come across in my project management work. Remember, it’s all about finding insights that get you closer to your goal!

1. Actuals versus estimates

One of the most important software planning data sources is the software project itself. Once a project is underway, your goal as a project manager is to predict whether the project is still on track. It makes sense that the very best source of data to predict the remainder of the project is the work completed so far, as this is effectively high-quality calibration data.

Prior to starting a project you can imagine the estimates you are using as being calibrated by some external reference, whether that's data-led or judgement-led. Either way the act of up front estimation is essentially analysing a work breakdown calibrated against an expected effort profile for similar work done by a similar team.

Re-calibrate as soon as you can

One of the easiest ways to improve the quality of software planning once a project is underway is to re-calibrate the remaining estimates based on the work completed so far. You should be aiming to switch as soon as you can from external to internal calibration to produce highly accurate estimates for the remainder of the project.

In many projects actual effort is only tracked at a resolution-level, with tasks, sprints or milestones being marked as 'Done', but without recording how long they took - this is throwing away your most valuable data! Even velocity burn-down tracking merges the signals of estimate variance and velocity variance into a single conflated datapoint.

Error-adjusted work breakdown

An error-adjusted work breakdown that is kept up to date is a project management superpower, as it lets you re-calibrate the estimates for the remainder of the project using the best available information. The only way to get that information is to track the actual effort that is being applied to each item in the work breakdown so far, and if you can find a way to do that easily, it will pay you back a hundred-fold.
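
As a rough illustration of re-calibration, here is a minimal Python sketch. It assumes you have recorded an (estimate, actual) pair for each completed item, and it uses one simple approach: scale every remaining estimate by the ratio of total actual effort to total estimated effort so far. The function names and the sample numbers are hypothetical, and a real project might prefer a per-category or weighted factor.

```python
def recalibration_factor(completed):
    """Ratio of total actual effort to total estimated effort.

    `completed` is a list of (estimated_days, actual_days) pairs for
    finished work items. A factor above 1 means work is taking longer
    than estimated.
    """
    total_estimate = sum(estimate for estimate, _ in completed)
    total_actual = sum(actual for _, actual in completed)
    if total_estimate <= 0:
        raise ValueError("need completed items with a positive total estimate")
    return total_actual / total_estimate


def recalibrate(remaining_estimates, completed):
    """Scale each remaining estimate by the factor observed so far."""
    factor = recalibration_factor(completed)
    return [estimate * factor for estimate in remaining_estimates]


# Hypothetical data: (estimated days, actual days) for completed items.
completed = [(3.0, 4.5), (5.0, 6.0), (2.0, 3.0)]
remaining = [4.0, 8.0]

factor = recalibration_factor(completed)
adjusted = recalibrate(remaining, completed)
print(f"factor: {factor:.2f}, adjusted remaining: {adjusted}")
```

With these made-up numbers the completed work took 35% longer than estimated, so the 12 remaining estimated days become roughly 16.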

2. Estimate variance by source

Speaking of estimates... where do they come from in the first place? Best practice today is probably to reach a consensus view across a number of engineers. This might mean taking an arithmetic aggregate based on all inputs, or simply debating the point until everyone agrees. There's some value in that latter approach because the discussion gets everyone aligned, but truthfully, it's probably overkill to discuss every estimate like this - only the truly divergent estimates will warrant an in-depth conversation. There's also a lot of potential for group-think to dominate here if it is done face-to-face. Even with planning-poker inspired private estimates there's an underlying social pressure to be 'in-the-ballpark' amongst peers.

Where bad estimates come from

The reality, however, might be that engineers aren't even in the estimation loop. Planning might be done on the basis of a project manager (or worse, a salesperson) just plucking the number out of the air. Wouldn't it be useful to track the level of estimate variance depending on where the estimate came from? I'll save you some effort - the estimates that come from a salesperson are going to be the worst! And the estimates that come from inexperienced engineers are going to be overly optimistic compared to those from experienced engineers. That's a big part of why we call them 'experienced' in the first place.

Single versus consensus

In fact, leaving aside the value of talking through divergent estimates, it's a reasonable assumption that one experienced engineer who actually thinks about it deeply will probably end up giving a more accurate estimate than consensus-based approaches. The problem is, which engineer is it? That is why tracking estimate variance by source might actually be a useful thing to do. Even if you don't use it for planning, it must surely be a useful data point for figuring out who in the team is making the best estimates. A word of warning though - don't give that person the job of making all future estimates... that's a sure-fire way of getting someone to quit!
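
To make 'estimate variance by source' concrete, here is a small Python sketch. It assumes each completed item has been logged with who estimated it, the estimate, and the actual effort. It reports two numbers per source: the mean relative error (bias, where positive means under-estimating) and the mean absolute relative error (overall accuracy). The source labels and figures are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean


def error_by_source(records):
    """Summarise estimate error for each estimate source.

    `records` is an iterable of (source, estimate, actual) tuples with
    positive estimates. Relative error is (actual - estimate) / estimate.
    """
    errors = defaultdict(list)
    for source, estimate, actual in records:
        errors[source].append((actual - estimate) / estimate)
    return {
        source: {
            "mean_error": mean(errs),  # bias: > 0 means under-estimated
            "mean_abs_error": mean([abs(e) for e in errs]),  # accuracy
        }
        for source, errs in errors.items()
    }


# Hypothetical records: (source, estimated days, actual days).
records = [
    ("sales", 5, 10),
    ("sales", 4, 10),
    ("senior engineer", 8, 10),
    ("senior engineer", 10, 9),
]

summary = error_by_source(records)
for source, stats in summary.items():
    print(source, stats)
```

Keeping bias and accuracy separate matters: a source that is consistently 50% low is easy to correct for, whereas one that is sometimes high and sometimes low is not.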

3. Velocity per resource

We all know that when it comes time to actually implementing the requirements, engineers are not interchangeable cogs in a machine - that's why an experienced project manager will put up a fight when their 'best' engineer is poached for another project. So why do we plan like they are?

Rolled-up velocity for the whole team is definitely a convenient shorthand for planning at the whole-project level. In fact, it's not unreasonable to say that rolled up velocity is good enough for planning at the individual sprint level too, so long as we do a reasonable job of taking into account availability of resources.

Backwards looking velocity

But where does the rolled-up velocity come from? As an agile project progresses, it's usually just a sliding window of the velocity the team has achieved so far in the project. But that's not really a great help at the planning phase of a project unless the exact same team is going to be working on a similar type of project to the last one they did. Ordinarily this isn't going to be the case, so a project manager will simply multiply the number of resources available by a standard velocity per resource. That's not terrible, although we've known for almost 50 years at this stage that resources don't scale linearly in software[1].

Making it resource specific

An improvement over using a standard estimated velocity is to use an estimated velocity that's appropriate to the level of experience of the engineer. A senior engineer's expected velocity might be higher than a junior engineer's. Or perhaps lower, if they have a lot of non-productive responsibilities - it's situation-specific.

Better again would be to use an estimated velocity that is specific to the actual engineer themselves. That's not to get dragged into a conversation about whether 10X engineers truly exist[2]; it's simply to reflect that some engineers are more productive than others, at some material level. I'm not sure why that's a taboo subject inside most companies. Surely everyone knows it, they just don't like using it for planning purposes.

Per-project, per-resource velocity

Finally, the very best way to arrive at a rolled-up velocity is to look at the actual velocity of a resource on a project rather than estimating it. There are lots of reasons why a specific person on a specific project might achieve more or less than an idealised velocity. Perhaps they have less familiarity with the technology used on this project. Perhaps for once, you decided to just choose boring technology[3] so productivity soars! Either way, just as the best source of data to predict the remainder of the project estimates is the work completed so far, the best way to predict the remainder of the velocity is the velocity per resource so far.
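
A minimal Python sketch of per-project, per-resource velocity might look like the following. It assumes completed items are logged as (person, scope points, full-time-equivalent weeks spent), and it forecasts the remaining duration by simply summing individual velocities, which ignores the non-linear scaling effects mentioned earlier. The names and numbers are hypothetical.

```python
def velocity_per_resource(completed_items):
    """Observed velocity (scope points per full-time week) for each person.

    `completed_items` is an iterable of (person, points, weeks_spent).
    """
    totals = {}
    for person, points, weeks in completed_items:
        done, spent = totals.get(person, (0.0, 0.0))
        totals[person] = (done + points, spent + weeks)
    # People with no recorded time are skipped rather than dividing by zero.
    return {p: done / spent for p, (done, spent) in totals.items() if spent > 0}


def weeks_remaining(remaining_points, velocities):
    """Naive forecast: remaining scope divided by summed velocities."""
    team_velocity = sum(velocities.values())
    if team_velocity <= 0:
        raise ValueError("no observed velocity to forecast from")
    return remaining_points / team_velocity


# Hypothetical log of completed work on the current project.
log = [("ana", 8, 2.0), ("ana", 4, 1.0), ("ben", 3, 1.5)]

velocities = velocity_per_resource(log)
forecast = weeks_remaining(30, velocities)
print(velocities, forecast)
```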

4. Time booked versus planned

So far we've talked about velocity. I like to define velocity very specifically to mean the potential ability for a resource to eat scope, if they were working on the project full-time. I then like to multiply that potential velocity by the percentage of time a resource is actually allocated to the project to use in the rolled-up velocity for the project in any particular week.

You might have a different way to think about this, but at the very least we can agree that the ability to eat scope needs to take into account both that raw potential and the percentage of time allocated. At the level of granularity I typically care about, it's meaningful enough to track allocation as a percentage of a full-time equivalent on a weekly basis. I certainly won't bother doing this at an hourly or even more granular level (we're not lawyers!).
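
The calculation described above can be sketched in a few lines of Python: each person's potential (full-time) velocity is multiplied by their allocation for the week, and the results are summed. The dictionaries, names and values here are assumptions for illustration, with allocation expressed as a fraction of a full-time equivalent.

```python
def rolled_up_velocity(potential_velocity, allocation):
    """Team velocity for one week.

    `potential_velocity` maps person -> scope points per full-time week.
    `allocation` maps person -> fraction of a full-time equivalent (0.0-1.0)
    for that week; anyone missing is treated as not allocated.
    """
    return sum(
        velocity * allocation.get(person, 0.0)
        for person, velocity in potential_velocity.items()
    )


# Hypothetical potential velocities and a two-week allocation plan.
potential = {"ana": 4.0, "ben": 2.0}
plan = [
    {"ana": 0.5, "ben": 1.0},   # week 1: ana is half-time
    {"ana": 1.0, "ben": 0.25},  # week 2: ben is mostly elsewhere
]

weekly = [rolled_up_velocity(potential, week) for week in plan]
print(weekly)
```

Running the same function over booked rather than planned allocation gives a like-for-like comparison of what the team could have achieved with the time it actually had.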

Billable time

One of the obvious sources of variance that disrupts our ability to plan software accurately is the amount of time actually spent on the project versus what was planned. This information is hopefully tracked quite closely if the work is being billed to a 3rd party. Sometimes (often) where it is not driving billable hours, it's not tracked very closely at all. Timesheets are definitely one of those things that have poor compliance if they're not business critical. That's a shame because when it comes to planning the rest of the project it really is quite hard to pull a valuable signal out of the velocity achieved so far if you can't determine how much time people actually worked on the project.

Knowable unknowns

When a) estimates are inaccurate, b) scope changes frequently, c) engineers have differing productivity and d) time allocated to a project varies with the ebb and flow of other commitments, it leads to the common complaint that it's TOO DAMN HARD to forecast software projects accurately! I have some sympathy for that point of view, but I'd counter with the idea that some of these things can be known. Of these, probably only scope changes are truly unpredictable, but it's worth noting that scope changes are the one thing you probably aren't expected to absorb into the initial plan. That might be desirable, but from a project manager's perspective it's not reasonable to expect it. If we can access data that helps us understand the variables other than scope changes, then we will be doing far better than if we let four planning variables multiply their uncertainties against each other.
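
To get a feel for how quickly multiplied uncertainties grow, here is a small Python sketch. It makes a deliberately crude assumption: each planning variable acts as an independent multiplier on the forecast and could be off by a given relative error in either direction, so the worst cases are the products of the individual extremes. The 20% figure is purely illustrative, not a measured value.

```python
from math import prod


def combined_range(relative_errors):
    """Worst-case multiplier range when independent factors compound.

    Each entry is a relative error, e.g. 0.2 for plus or minus 20%.
    Returns (low, high) multipliers on the nominal forecast.
    """
    low = prod(1 - e for e in relative_errors)
    high = prod(1 + e for e in relative_errors)
    return low, high


# Four unknown variables, each assumed to be off by up to 20%.
four_unknowns = combined_range([0.2, 0.2, 0.2, 0.2])

# Pin three of them down with data, leaving only one unknown.
one_unknown = combined_range([0.2])

print(four_unknowns, one_unknown)
```

Under these assumptions, four unknowns at 20% each stretch a forecast to anywhere between roughly 0.4 and 2.1 times the nominal figure, while a single unknown keeps it between 0.8 and 1.2 times.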

5. History of projects

One of the problems with using these data sources is what to do when no data is available. At the start of the project we are most prone to planning uncertainties but we also haven't generated much data we can make use of. In a previous article I argued that projects tend to cluster together. So why don't we make use of previous project data from a similar type of project? The answer is usually simply because we don't have access to any. For many of us the moment that a requirement moves to 'Done' it spirals off into a data black-hole, never to be seen (or used) again...

Black holes - where data typically ends up [Credits: Event Horizon Telescope collaboration et al.]

Snapshots versus retrospectives

One approach is to gather data at the end of a project during a retrospective, but that data is very prone to error if too much time has passed since the work was done. Far better is to gather data snapshots as a project is underway. Ideally you'd get it as a by-product of your software planning tool.

Other people's data

Finally we don't have a good track record as an industry of using other people's data to inform our own planning efforts. This is surely a missed opportunity! There is little commercial sensitivity around sharing aggregated historical data from software projects. And the upside is that we can all benefit from a combined history of projects to solve our cold-start data problems. This is the future of software planning and the sooner we realise we've been trying to plan our software projects whilst wearing a data blindfold, the better outcomes we will all be able to achieve.

Bonus not useful: SLOC

Counting Source Lines of Code (SLOC) has long been held up as a useful data source for planning and estimating future software projects. In fact, I'm using SLOC as a proxy term here for anything that uses underlying code metrics as a predictor of project effort. I'm not saying SLOC and effort aren't correlated at all... but I would say that a) it has such high potential to be flawed and b) it's so widely mocked by engineers, that in my opinion, it just isn't worth the effort for any individual project manager to try to adopt it themselves. If you have a project management office that is hell-bent on gathering and analysing SLOC data on your behalf, then it may be a different assessment... but in terms of a data source you're probably not currently using, that I recommend you should consider using... it's a no from me.

Weak correlation

In terms of why SLOC might be flawed, at a simple level it's to do with poor correlation between effort, productivity, problem solving and actual code. Furthermore, these days there's at best a fuzzy correlation between SLOC and what any engineer actually typed out themselves. How do you handle open source code libraries (the mainstay of many modern tech stacks)? How do you handle code reuse and don't-repeat-yourself (DRY) coding practices? Do you really want to de-incentivise software refactoring and removal of technical debt if it ends up removing code lines rather than adding them? I'd posit the answer is no.


There is lots of untapped potential in data for software planning. Your job as a project manager is to pick the best available data sources to maximise your ability to predict costs, schedule, and resourcing while minimising the effort you have to put in to gather them, given your busy schedule. By choosing one or two of these data sources to drive your insight, you can get out in front of problems before they knock you off course, and with new data-centric project management tools coming to the market, you won't need to become an expert in data science to make use of them.

  1. The Mythical Man-Month: Essays on Software Engineering by Frederick Brooks, 1975.
  2. Peopleware: Productive Projects and Teams by Tom DeMarco and Timothy Lister, 1999, showed that organisations tend to employ engineers from within a similar tier, which compresses the variance in performance somewhat (although it is still large!).