How to Judge Your Dev Team's Work When You Can't Read Code

You can't read code, but you can judge the process. Learn the 6-signal framework for evaluating your dev team: rollback capability, testing discipline, deployment frequency, lead time, trade-off transparency, and failure handling. Based on Google's DORA research into delivery performance, reliability, and risk.

You can’t read code.

That’s fine. Neither can most non-technical founders who’ve built successful tech companies.

But here’s where most non-technical founders go wrong: they nod at code reviews, say “looks good,” and hope their developer knows what they’re doing.

This is dangerous.

You cannot judge the code. But you MUST judge the process.

The difference matters. “It works” is not the same as “it’s well-made.” A feature can work perfectly today and become an expensive disaster in six months. The code itself won’t tell you which—but the process will.

This framework gives you six observable signals that indicate engineering quality without reading a single line of code. Four come directly from Google’s DORA research (which studied thousands of teams over six years); two are qualitative signals that complement the data. Each can be assessed through a simple conversation.

TL;DR:

  • You can’t evaluate code quality directly—but you can evaluate engineering practices
  • Six signals indicate delivery performance and risk: rollback capability, testing discipline, deployment frequency, lead time, trade-off transparency, and failure handling
  • Four of these (rollback/recovery, deployment frequency, lead time, change failure rate) are DORA metrics; two (trade-off transparency, failure handling) are qualitative add-ons
  • Ask these questions monthly to catch problems before they become expensive

Why Process Beats Code Review

Here’s what the research tells us:

Google’s DORA program spent six years studying thousands of software teams to understand what separates elite performers from the rest. Their conclusion: process metrics predict outcomes better than technical metrics.

Microsoft’s research on code reviews found that the presence of subject-matter experts and participation levels were better predictors of code quality than the review content itself.

The pattern is consistent: rigorous engineering processes drive quality, not individual code inspection.

As a non-technical founder, you’re not qualified to judge whether code is good. But you’re absolutely qualified to judge whether your team:

  • Can recover quickly when things break
  • Tests their work before shipping
  • Ships frequently in small batches
  • Delivers changes quickly from idea to production
  • Makes deliberate trade-offs they can explain
  • Plans for failure scenarios

These six signals map to proven research. Let’s break them down.


Signal #1: The Rollback Test

The question: “If we deploy this and it breaks, how do we revert?”

Good answer: “One-click rollback to the previous version, takes under 5 minutes.”

Red flag: “We’d push a fix.” (This means they have no rollback plan.)

Why This Matters

Amazon’s published deployment practices call for teams to “fully prepare for rollback before every deployment.” Any version that can’t be rolled back safely is considered not ready for production.

The DORA research program found that recovery time—how quickly you can restore service after a failure—is one of four key metrics that separate elite engineering teams from average ones. Elite teams recover in under an hour; low performers take days.

What You’re Really Testing

This question reveals:

  1. Does your team plan for failure? Good engineers assume things will break
  2. Is deployment reversible? Or is every release a one-way door?
  3. How long would an outage last? Minutes vs. hours vs. days?

The Follow-Up Questions

If they say “one-click rollback,” ask:

  • “When did you last test the rollback process?”
  • “How long does rollback actually take?”
  • “What data would we lose if we rolled back?”

If they say “we’d push a fix,” dig deeper:

  • “How long would that take at 2 AM on Saturday?”
  • “What happens to customers during that time?”
  • “Have we ever had to do this?”

The benchmark: Elite teams recover from failures in under an hour. If your answer is “we’d figure it out,” you’re exposed.
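
For context, “one-click rollback” is usually not much more than the sketch below. This is a hypothetical Python example; DEPLOY_API, the /releases endpoints, and the DEPLOY_TOKEN variable are stand-ins for whatever deployment platform your team actually uses.

```python
# Hypothetical one-command rollback: re-activate the previous release.
# DEPLOY_API, the /releases endpoints, and DEPLOY_TOKEN are placeholders for
# your hosting platform's real deployment API.
import os
import requests

DEPLOY_API = "https://deploy.example.com/api"
HEADERS = {"Authorization": f"Bearer {os.environ['DEPLOY_TOKEN']}"}

def rollback() -> None:
    # List releases, newest first; the second entry is the previous good version.
    releases = requests.get(f"{DEPLOY_API}/releases", headers=HEADERS, timeout=10).json()
    previous = releases[1]["id"]
    # Re-activate it: no rebuild, no "pushing a fix", just switch back.
    requests.post(f"{DEPLOY_API}/releases/{previous}/activate", headers=HEADERS, timeout=10)
    print(f"Rolled back to release {previous}")

if __name__ == "__main__":
    rollback()
```

The specifics will differ by platform, but the shape is the point: rolling back is a single rehearsed action, not an improvised late-night fix.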


Signal #2: The Testing Signal

The question: “Does this change include tests?”

You can’t read the test code. But you can count.

Good sign: Tests exist, even if you can’t evaluate their quality.

Warning sign: A change with 500 lines of code and 0 tests.

Why This Matters

The research here is nuanced but important.

A landmark study published at ICSE found that “coverage is not strongly correlated with test suite effectiveness when the number of test cases is controlled for.”

Translation: having more tests doesn’t automatically mean better quality.

But the absence of tests is a reliable signal. Research on code review effectiveness shows that combining code review with testing catches significantly more defects than either practice alone.

What You’re Actually Looking For

You’re not evaluating test quality. You’re checking for the existence of testing discipline.

Zero tests on a significant code change means one of three things:

  1. The developer doesn’t write tests (process problem)
  2. The change was rushed (schedule problem)
  3. The code is untestable (architecture problem)

All three are red flags.

The Practical Application

Ask your developer: “For the last feature we shipped, can you show me the tests?”

Good answer: Shows you a list of test files, explains what they cover

Concerning answer: “We tested it manually” or “It works, we checked”

Warning sign: “Tests slow us down” or “We’ll add them later”

The data contradicts the “tests slow us down” argument. The upfront investment pays off: teams with testing discipline spend less time debugging and more time building.

What “Good” Looks Like

You don’t need 100% test coverage. A study of Apache open-source systems found that test-related factors have “a limited connection to post-release defects” when controlling for other metrics.

What you need:

  • Tests exist for critical business logic
  • Tests run automatically before deployment
  • Test failures block deployment

This is binary. Either these things happen or they don’t.
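
To make “tests exist for critical business logic” concrete, here’s a minimal sketch using Python’s pytest. The calculate_invoice_total function and its discount rule are invented for illustration; what matters is that the business rule is written down and checked automatically.

```python
# test_invoicing.py -- an illustrative test of one critical business rule.
# calculate_invoice_total is a made-up example; in a real project it lives in
# your application code, and these tests run automatically before deployment.
import pytest

def calculate_invoice_total(subtotal: float, discount_eligible: bool) -> float:
    """Hypothetical business rule: eligible orders over 1000 get 10% off."""
    if subtotal < 0:
        raise ValueError("subtotal cannot be negative")
    if discount_eligible and subtotal > 1000:
        return round(subtotal * 0.9, 2)
    return subtotal

def test_discount_applied_over_threshold():
    assert calculate_invoice_total(1200, discount_eligible=True) == 1080

def test_no_discount_below_threshold():
    assert calculate_invoice_total(900, discount_eligible=True) == 900

def test_negative_subtotal_rejected():
    with pytest.raises(ValueError):
        calculate_invoice_total(-50, discount_eligible=False)
```

In a disciplined setup, a pipeline runs files like this on every change and refuses to deploy if any test fails; that is the “test failures block deployment” part in practice.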


Signal #3: Deployment Frequency

The question: “How often do we deploy to production?”

Elite benchmark: Multiple times per day

High performer: At least weekly

Warning sign: Less than monthly

Why This Matters

Google’s DORA research has studied this for years across thousands of engineering teams. Their findings are clear: deployment frequency is a key indicator of software delivery performance.

Why? Because frequent deployment means:

  • Smaller changes (less risk per deployment)
  • Faster feedback loops (problems found sooner)
  • Better automation (manual deployments don’t scale)
  • More confidence (teams that fear deployment have problems)

The DORA data shows elite teams deploy on-demand—multiple times per day—while low performers deploy monthly or less. Elite teams are twice as likely to meet or exceed their organizational performance targets.

What Infrequent Deployment Reveals

If your team deploys monthly or less:

  • Changes are batched - More changes per deployment = more risk per deployment
  • Deployment is manual - Someone has to “do the deployment” rather than it happening automatically
  • Deployment is scary - The team holds their breath and hopes nothing breaks
  • Feedback is slow - A bug introduced today won’t be found for weeks

The Conversation

Ask: “How many times did we deploy last month?”

Good answer: Specific number, ideally weekly or more

Concerning: “A couple times” or “when we have features ready”

Red flag: “We do big releases quarterly”

If deployment is infrequent, ask why:

  • “What would it take to deploy weekly?”
  • “What’s blocking more frequent releases?”
  • “Is deployment manual or automated?”

Industry Context

DORA defines Change Failure Rate as the percentage of deployments causing problems. Elite teams keep this under 15%; the industry average is significantly higher.

If your team deploys rarely AND has high failure rates when they do deploy, you have compound problems: big batches of risky changes going out without adequate safety nets.
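
If your team records deployments anywhere, both numbers take only a few lines to compute. Here’s a minimal Python sketch, assuming a hypothetical deploys.csv log with one row per deployment and a caused_incident column (the file name and columns are illustrative):

```python
# Count monthly deployments and change failure rate from a simple log.
# deploys.csv is a hypothetical file with one row per deployment, e.g.:
#   date,caused_incident
#   2024-05-02,no
#   2024-05-09,yes
import csv
from collections import Counter

deploys_per_month = Counter()
failures_per_month = Counter()

with open("deploys.csv", newline="") as f:
    for row in csv.DictReader(f):
        month = row["date"][:7]  # e.g. "2024-05"
        deploys_per_month[month] += 1
        if row["caused_incident"].strip().lower() == "yes":
            failures_per_month[month] += 1

for month in sorted(deploys_per_month):
    total = deploys_per_month[month]
    failed = failures_per_month[month]
    print(f"{month}: {total} deployments, change failure rate {failed / total:.0%}")
```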


Signal #4: Lead Time

The question: “Once code is written, how long until it’s live for customers?”

Elite benchmark: Less than one day

High performer: Less than one week

Warning sign: More than one month

Why This Matters

Lead time for changes is the fourth DORA metric. Technically, DORA measures from code commit to production deployment—but for a non-technical founder, the practical question is simpler: once your developer says “it’s done,” how long until customers see it?

This matters for two reasons:

  1. Fast feedback: Shorter lead times mean you learn faster whether something works
  2. Competitive advantage: Teams that ship in days can respond to market changes; teams that ship in months cannot

What Long Lead Times Reveal

If your team takes weeks to ship a small change:

  • Too much process: Approvals, reviews, and handoffs are creating bottlenecks
  • Manual steps: Deployments require human intervention at multiple points
  • Large batches: Changes are being bundled together instead of shipped incrementally
  • Environment problems: Getting code from development to production is difficult

The Conversation

Ask: “If we wanted to add a small feature—say, changing button text—how long until customers see it?”

Good answer: “A day or two, including review and testing.”

Concerning: “A week, maybe two.”

Warning sign: “We’d bundle it into the next release, so probably a month.”

The feature itself might be trivial, but the answer reveals your entire delivery pipeline.
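
Lead time is just as easy to measure if the team notes when a change was ready and when it went live. A minimal sketch, again against a hypothetical log (changes.csv and its column names are illustrative):

```python
# Median lead time: from "code ready" to "live for customers".
# changes.csv is a hypothetical log with one row per change, e.g.:
#   ready_at,deployed_at
#   2024-05-02T10:00,2024-05-03T16:30
import csv
from datetime import datetime
from statistics import median

lead_times_days = []
with open("changes.csv", newline="") as f:
    for row in csv.DictReader(f):
        ready = datetime.fromisoformat(row["ready_at"])
        deployed = datetime.fromisoformat(row["deployed_at"])
        lead_times_days.append((deployed - ready).total_seconds() / 86400)

print(f"Median lead time: {median(lead_times_days):.1f} days "
      f"across {len(lead_times_days)} changes")
```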


Signal #5: Trade-Off Transparency

The question: “What did we compromise to ship this?”

Every feature has a cost. Senior engineers know this and can articulate it. Junior engineers (or engineers trying to hide problems) pretend everything is perfect.

Good answer: A specific list of trade-offs made and why

Warning sign: “Nothing, it’s done right” (this is almost never true)

Why This Matters

Every piece of software carries technical debt—shortcuts taken to ship faster, optimizations deferred, documentation skipped. This isn’t inherently bad; it’s a conscious business decision to trade speed now for work later.

The problem is invisible debt. When engineers don’t acknowledge trade-offs, debt accumulates silently until it becomes a crisis.

The question isn’t whether your team makes trade-offs—they do. The question is whether they’re aware of them and transparent about them.

What Honest Answers Sound Like

“We shipped the feature, but:

  • The database queries aren’t optimized—fine for 100 users, will need work at 10,000
  • Error messages are generic—users won’t know exactly what went wrong
  • We hardcoded some values that should be configurable later
  • The mobile experience is functional but not polished”

This is healthy engineering. Trade-offs were made consciously and documented.

What Concerning Answers Sound Like

  • “It’s production-ready” (with no caveats)
  • “We built it right the first time”
  • “There’s no technical debt”

No software is debt-free. If your developer claims zero debt, they’re either not aware of it (concerning) or not telling you (more concerning).

How to Use This

After a feature ships, ask: “Walk me through the trade-offs you made.”

Listen for:

  • Specificity - Vague answers suggest they haven’t thought about it
  • Business context - Good engineers connect technical choices to business impact
  • Future awareness - What will need attention later?

This isn’t about catching them in problems. It’s about ensuring they’re thinking about the full picture.


Signal #6: The Failure Demo

The question: “Show me what happens when [X] fails.”

Don’t ask if it works. Assume it works. Ask what happens when it doesn’t.

Good answer: Demonstrates graceful error handling, clear user messages, logging

Red flag: “I’m not sure” or the app crashes

Why This Matters

Systems fail. APIs go down. Users enter unexpected data. Networks disconnect. The question isn’t whether these things will happen—they will. The question is whether your software handles them gracefully or crashes spectacularly.

The Scenarios to Test

Pick scenarios relevant to your business:

  • “What happens if the payment API is down?”
  • “What if a user enters a 10,000-character message?”
  • “What if the database is slow?”
  • “What if someone submits the form twice quickly?”

What Good Failure Handling Looks Like

  1. User sees a helpful message - Not “Error 500” but “We couldn’t process your payment. Please try again or contact support.”

  2. The system doesn’t crash - One failure shouldn’t take down everything

  3. Errors are logged - Someone can investigate what happened

  4. Data isn’t corrupted - A failed transaction shouldn’t leave things in a broken state
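
As an illustration of points 1 through 4, here’s a minimal Python sketch of graceful failure handling around a payment call. The payment_gateway object and PaymentGatewayError are placeholders for whatever provider your product actually uses:

```python
# Graceful failure handling around a payment call (illustrative sketch).
# payment_gateway and PaymentGatewayError stand in for your real provider.
import logging

logger = logging.getLogger("checkout")

class PaymentGatewayError(Exception):
    """Raised when the (hypothetical) payment provider fails or is unreachable."""

def charge_customer(payment_gateway, order_id: str, amount_cents: int) -> dict:
    try:
        payment_gateway.charge(order_id=order_id, amount_cents=amount_cents)
    except PaymentGatewayError:
        # 2. The failure is caught here, so it doesn't crash the whole checkout flow.
        # 3. The error is logged with enough detail to investigate later.
        logger.exception("Payment failed for order %s", order_id)
        # 4. The order is left untouched, so nothing is half-charged or corrupted.
        # 1. The user sees a helpful message instead of "Error 500".
        return {
            "ok": False,
            "message": "We couldn't process your payment. Please try again or contact support.",
        }
    return {"ok": True, "message": "Payment received."}
```

The failure is contained, logged, and explained to the user, so one broken dependency doesn’t turn into a crashed app.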

The Conversation

Pick one realistic failure scenario and ask: “Can you show me what users would see?”

If they can demo it confidently, good. If they say “let me check” or the demo reveals generic error messages and crashes, you’ve found work to prioritize.

Why This Signal Matters

Your goal isn’t to eliminate all failures—that’s impossible. Your goal is to ensure failures are handled gracefully and visibly. The difference between a minor incident and a customer crisis is often just error handling.


Putting It Together: Monthly Check-Ins

Don’t ask all six questions at once. Instead, rotate through them:

Week 1: Rollback Test

  • “If our last deployment broke something, how would we revert?”

Week 2: Testing Signal

  • “For the feature we just shipped, can you show me the tests?”

Week 3: Deployment Frequency + Lead Time + Change Failure Rate

  • “How many times did we deploy this month?”
  • “Once the code was ready, how long until it was live?”
  • “How many of those deployments caused problems we had to fix?” (This is the change failure rate)

Week 4: Trade-Off Transparency + Failure Demo

  • “What compromises did we make on [recent feature]?”
  • “What happens if [relevant failure scenario]?”

Building a Dashboard

Track these over time:

Metric               | This Month  | Last Month  | Trend
Deployments          | 12          | 8           | ↑ Good
Change failure rate  | 1/12 (8%)   | 2/8 (25%)   | ↓ Good
Recovery time        | 5 min       | 15 min      | ↓ Good
Lead time (avg)      | 3 days      | 5 days      | ↓ Good
Tests per feature    | Yes         | Yes         | → Stable

You don’t need sophisticated tools. A spreadsheet works fine. The point is visibility over time.


When These Signals Reveal Problems

Scenario A: Your Team Scores Well

Great. You have engineering discipline in place.

Still worth doing:

  • Continue monthly check-ins to maintain standards
  • Use our website scanner to verify externally visible issues
  • Document the processes so they survive team changes

Scenario B: One or Two Signals Are Weak

Common. Most teams have gaps.

What to do:

  • Don’t panic or assign blame
  • Prioritize: Rollback capability and basic testing first
  • Work with your developer on a plan to address gaps
  • Revisit in 30 days

Scenario C: Multiple Signals Are Red Flags

You may have a systemic problem.

Indicators of systemic issues:

  • No rollback capability AND no tests AND manual deployments
  • Defensive answers to basic questions
  • “You don’t need to worry about that”

What to do:

  • Get an independent technical assessment
  • Consider whether this is a skills gap or a priorities gap
  • Book a consultation for objective evaluation

The Real Pattern

These six signals share common traits:

✅ They’re observable without reading code
✅ They’re backed by research and industry best practices
✅ They correlate with delivery performance, reliability, and risk
✅ Any competent developer should be able to discuss them openly

You’re not evaluating the code. You’re evaluating whether your team follows practices that lead to good code.

The DORA research is clear: process metrics predict performance better than technical metrics. Teams with good practices build good software. Teams without them don’t—regardless of how talented the individuals are.


What to Do Next

Start here:

  1. Pick Signal #1 (Rollback Test) for your next conversation
  2. Ask the question without judgment
  3. Note the answer and follow up if needed


Frequently Asked Questions

Can I really evaluate developer work without understanding code?
Yes. Research from Google’s DORA program shows that process metrics like deployment frequency, lead time, change failure rate, and recovery time are stronger predictors of software quality than code-level metrics. You can’t judge the code, but you can judge whether your team has good engineering practices—and that’s what actually matters for business outcomes.
Won't my developers feel micromanaged if I ask these questions?
Frame it as understanding, not interrogation. Good developers appreciate informed questions from business owners. Say ‘I want to understand our process better’ rather than ‘prove to me you’re doing good work.’ The questions in this framework are standard engineering practices—any competent team should welcome discussing them.
What if my developer gets defensive about these questions?
Defensiveness about basic process questions is itself a signal. Competent engineers discuss testing, deployments, and error handling openly. If your developer says ‘you don’t need to worry about that’ or ‘it’s too technical to explain,’ that’s a warning sign. Everything can be explained in simple terms.
How often should I use this framework?
Monthly for the deployment, lead time, and error metrics. Quarterly for deeper conversations about trade-offs and technical decisions. Don’t ask all six signals at once—pick one or two per conversation and rotate through them over time.
What's the most important signal to start with?
Start with the Rollback Test. Ask ‘If we deploy something and it breaks, how do we undo it?’ The answer reveals whether your team has basic safety practices. If they can’t answer clearly, that’s your priority issue before anything else.
My developer says tests slow them down. Are they wrong?
Research shows combining testing with code review catches significantly more defects than either alone. The upfront investment pays off quickly. However, 100% test coverage isn’t the goal—the question is whether critical paths are tested. Zero tests on a 500-line change is a warning sign; not having tests on a config file change is fine.
What if we're a small startup and can't afford all these practices?
Start with the essentials: one-click rollback capability, automated tests on critical paths, and basic error tracking. You don’t need enterprise-grade DevOps—but you do need to know when things break before your customers tell you. The practices in this framework scale down; even a solo developer should have these basics.
How do I know if deployment frequency is too low or too high?
Elite teams deploy multiple times per day; high performers deploy weekly. If you’re deploying less than monthly, you’re likely batching too much risk. But frequency without stability is dangerous—check the change failure rate too. A team deploying daily but breaking production weekly has a process problem.