Artificial intelligence in police research: A preliminary examination of feasibility and replication

Abstract

Objectives: This manuscript tests whether artificial intelligence (AI) review of body-worn camera footage replicates the findings of a randomized-controlled trial of police de-escalation training using systematic social observation (SSO). Methods: Body-worn camera video that was previously analyzed using SSO was subjected to review by an AI program. The study then replicates the analysis of the randomized-controlled trial using outcomes generated by AI. The results are compared to those of the original study. Results: All five measures produced by the AI correlated with at least one SSO measure, though several SSO measures appeared completely unrelated to any measure produced by AI. The analysis of treatment effects by AI reaches the same conclusion as the original study. Conclusions: SSO provides greater nuance to understanding police interactions but is weakened by its time-consuming and expensive nature. AI provides a promising avenue for conducting similar analyses more quickly and efficiently.

Publication
Journal of Experimental Criminology

Understanding how police officers actually behave during encounters with the public is harder than it sounds. You can’t rely on officers’ own reports because self-reporting inevitably misses the nuances that matter most. So researchers have spent decades training teams of student coders to ride along with officers and record what they see in real time. For example: did the officer use force? Attempt to de-escalate? This method — called systematic social observation — produces rich, detailed data. It also takes months (just to collect the data, let alone analyze and publish results) and costs a fortune, which is why most studies that use it are funded by federal grants. Only in the past decade or so has the widespread adoption of body-worn cameras opened a second path: instead of riding along, coders can watch footage after the fact, which is more flexible but still quite labor-intensive.

In a previous study, our team used this approach to evaluate whether Polis Solutions’ de-escalation training program actually changed how officers in Virginia Beach communicated with the public. Officers were randomly assigned to receive the training immediately or wait a year. Human coders then watched hundreds of body-worn camera clips from before and after training. Key findings were that trained officers were more likely to try to start a personal conversation, less likely to repeat commands over and over, and more likely to demonstrate empathy than their untrained colleagues.

For this follow-up study, we asked a simpler question: could an AI have told us the same thing? We fed the same body-worn camera footage into TrustStat, a commercially available AI tool made by Polis Solutions that uses speech analysis, natural language processing, and computer vision to score officer performance. The AI was given no information about which officers had been trained. It evaluated each clip on five dimensions: respect, calm, tone, clarity, and rapport.

Before testing whether training made a difference, we needed to answer a more basic question: are the AI’s scores even measuring something related to what the human coders observed? They don’t use the same indicators. Human coders looked for specific behaviors — Did the officer try to start a conversation? Did they repeat commands? Did they express empathy? TrustStat produces holistic ratings — How calm did this officer seem? How respectful? If the two systems aren’t at least somewhat correlated, comparing their conclusions would be comparing apples to oranges.

It turns out they are correlated — not strongly, but enough to matter. Officers whom human coders caught repeating commands received lower AI scores for respect (r = −.38), calm (r = −.35), and tone (r = −.37). Officers coded as more empathetic also received higher AI scores for respect and calm (r = .22 each). The correlations were weaker for conversation-starting — which helps explain why the AI didn’t replicate that particular finding when it came to treatment effects.

Human-coded measure           | AI: Respect | AI: Calm | AI: Tone
Tried to start a conversation | .11         | .13*     | .09
Repeated commands             | −.38**      | −.35**   | −.37**
Empathy (rating)              | .22**       | .22**    | .14*

* p < .05, ** p < .01.

One AI measure, clarity, moved in the opposite direction from the others, rising when officers issued more commands and falling when human coders noted empathy. It appears to capture directness or authority rather than relational quality.
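For readers who want to see what a correlation like these involves, here is a minimal sketch in Python. It assumes r is the ordinary Pearson coefficient, which this summary does not state explicitly, and it pairs a 0/1 human code with a continuous AI score for each clip. The function name `pearson_r`, the variable names, and every number in the example are made up for illustration; none of them come from the study or from TrustStat.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each observation from its own mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data, NOT values from the study.
# 1 = human coder noted repeated commands in the clip, 0 = did not.
repeated_commands = [1, 0, 0, 1, 0, 1, 0, 0]
# Hypothetical AI "respect" score for the same eight clips.
ai_respect = [0.41, 0.72, 0.65, 0.38, 0.80, 0.55, 0.69, 0.74]

r = pearson_r(repeated_commands, ai_respect)
print(f"r = {r:.2f}")
```

A negative r here means that clips flagged for repeated commands tended to get lower AI scores, which is the direction of the pattern reported in the table. The sketch does not compute a p-value, so it says nothing about the significance stars.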

Both systems appear to be picking up on related aspects of the same underlying thing, how well officers communicate with the public, just through different lenses. With that established, here’s what each approach found when we tested whether training actually changed officer behavior:

What we measured                 | Human coders found…                         | AI (TrustStat) found…
Starting a personal conversation | Trained officers more likely ✓              | No significant difference
Repeating commands               | Trained officers less likely ✓              | No significant difference
Showing empathy                  | Trained officers more empathetic (marginal) | No significant difference
Respect in tone/manner           | Not measured                                | Trained officers scored higher ✓
Calmness in voice/manner         | Not measured                                | Trained officers scored higher ✓
Overall tone of voice            | Not measured                                | Trained officers scored higher ✓

Both systems found that de-escalation training improved how officers communicated — each in its own vocabulary.
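To make "trained officers scored higher" concrete, the sketch below compares average scores between a trained group and a control group using a simple permutation test. This is a stand-in chosen for illustration: the summary does not describe the statistical model the study actually used, and the sketch ignores the before-and-after structure of the design. The function `permutation_test` and all of the scores are hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_test(treated, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns (observed difference, approximate p-value).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels, as if training had no effect.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical toy scores, NOT values from the study.
trained_scores = [0.71, 0.68, 0.75, 0.80, 0.66, 0.73]
control_scores = [0.60, 0.64, 0.58, 0.70, 0.62, 0.61]

difference, p_value = permutation_test(trained_scores, control_scores)
print(f"difference in means = {difference:.3f}, p = {p_value:.4f}")
```

The idea is that random assignment makes the group labels interchangeable if training had no effect, so shuffling the labels many times shows how large a difference chance alone would produce.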

That’s encouraging — but there’s a catch worth taking seriously. When we asked Polis Solutions to explain exactly how TrustStat generates its scores, they declined, citing trade secrets. That’s their prerogative as a company, but it’s a real problem for researchers. If you can’t look inside the machine, you can’t fully trust what it’s measuring. For now, off-the-shelf AI tools like TrustStat are best suited as a first pass — a way to quickly identify patterns worth investigating more carefully. The bigger prize would be AI tools built specifically for research, with transparent methods and independent validation. If we can get there, the payoff for evidence-based policing could be substantial: findings that used to take years to produce might be available in time to actually inform the decisions that prompted the research.