The 2021 failure, why fWAR is broken, and Eric Hosmer

Well, shit. The most promising year of Padres baseball since George Bush was President appears to be just another disappointing season in a long streak of disappointing seasons. We should have known better. The Padres are a San Diego sports team after all, and San Diego sports teams are jinxed, right?

This, more or less, seems to be the sentiment among Padres fans at the moment. 2021 was like any other year and perhaps even more heartbreaking given our “real” hopes to start the year. Allow me to disagree.

I actually think the franchise is in its best shape ever and I’m not particularly bothered by two months of bad baseball. At least half of you guys hyped up all those shitty ballclubs who sported a payroll that started with a 3 or 4. Now you can’t stomach a few months of bad baseball in the middle of the Padres Renaissance? I knew the fanbase was soft, but you’re giving Andy Green a run for his money here. Frankly, it’s embarrassing.

Let’s examine some details. The Padres have one of the most beloved, best young players in all of baseball and another likable superstar, both signed through their primes. They have a number of team-controlled, good pieces alongside those two. A San Diego native threw the first no-hitter in franchise history this season. The Padres fielded the largest payroll they ever have, both nominally and relatively. Miraculously, the Padres even seem reasonably competent off-the-field: they play in beautiful brown uniforms, Mud and Don continue to be incredible on the broadcast, Jesse and Tony form the best radio tandem we’ve had in at least twenty years¹, and there wasn’t a single off-field incident this season that caused national embarrassment, unless I missed a Ron Fowler interview on one of Dan Sileo’s (single digit viewer) Twitch streams or something.

¹ I’m sure I’ll catch flak for this, but I didn’t enjoy listening to Ted for the past decade, as he struggled to describe the action. It’s the same concept for why I wouldn’t enjoy Trevor Hoffman pitching for the Padres at age 60. They’re both legends and deserve to be in the Padres Hall-of-Fame, but like players, broadcasters also have an aging curve.

It’s true: the Padres didn’t win the World Series this year. It turns out you can’t just try for once, after decades of not giving a shit, and expect to instantly succeed. It’s going to take time and it’s entirely possible Tatis and Machado never win us a World Series. Cover your ears if you must, but that’s actually the most likely outcome. Just do the math: there are 30 teams and one World Series title, yet Tatis isn’t playing 30 seasons. I find it absolutely incredible how many fans don’t appear to be enjoying the actually enjoyable era, right after choking on Jeff Moorad’s cock and telling me how Jose Pirela’s sprint speed will make his .400 BABIP sustainable long-term. With the likelihood of failure, even during the good years, in mind, maybe spending the entire Tatis era loudly complaining about a player with a positive contribution to the team’s championship odds isn’t a healthy approach to life.

Things aren’t perfect, obviously. They’ve fired the manager and some scouts. Tommy Pham’s attempted assassin is still at-large. Dave Cameron resigned. There’s some clubhouse turmoil, allegedly.

But wouldn’t it have been worse if these things didn’t happen? Would you be happier if they kept the manager after this choke job? Would you be happier if they didn’t fire scouts who, let’s face it, whiffed on a number of recent decisions? ~~Would you be happier if SDPD was a competent group, capable of investigating crimes and getting assassins off the street?~~ Would you be happier if Dave Cameron stayed on, despite clearly not consistently demonstrating the ability to influence sound data-driven decisions? Would you be happier if the clubhouse sang Kumbaya after going 26-43 after the All-Star break?

Of course all of this shit was going to happen after the way the season ended!

And don’t get me started on the vehement “fire Preller” movement on Twitter. Far too many of them bought ‘In AJ We Trust’ or ‘Rockstar GM’ shirts two years ago, cheered Josh Byrnes openly scoffing at signing “quality players in their prime” eight years ago, and completely ignored letting Jed Hoyer leave the organization for literally no compensation whatsoever ten years ago. You’ve got no credibility on this one.

That said, I am mildly surprised Preller is still here and unless his job security is guaranteed going forward, he probably shouldn’t be. The last thing this franchise can afford is a GM trading away future assets to prolong their own employment. That’s the dangerous Chicago Bears trading for Jay Cutler territory that I’ve discussed many times. So even if Preller’s job security is tenuous (which it has to be, right?) ownership needs to give him every impression that it isn’t. To that end, Seidler’s recent comments are good and everyone should accept that’s what we actually want to hear.

This year, the arms just exploded – seriously, you can’t get just 90 starts out of six of your starters and expect success – and it seems Preller barely got beat out for Max Scherzer at the trade deadline. For me, it’s hard to complain too loudly about that, especially after it was literally reported as a done deal by leading media members. Truthfully, I think Preller is just a mediocre gambler, but that beats the shit out of the mediocre pussy we had before him. AJ is more likely to trade for Seth Smith as the final piece to his pickup basketball squad than the final piece to the roster, and there’s something to be said about YOLOing in situations/games where only the absolute best outcome is the one you care about. If you aren’t first, you’re last. You can’t win the lottery if you never buy a ticket, Josh, expected value be damned.

Listen, the entire purpose of the documentary, “Change the Padres”, and whatever else I’ve rambled on about over the years was about influencing the Padres (with however much I can muster) to try to win an actual World Series. Even without winning one this season, I am very confident that the purpose of my account is no longer relevant. Peter Seidler is probably the best sports owner we’ve ever had in San Diego, and I can finally just sit back and enjoy the team, content that the outcome is being decided by fortune, not pre-determined by ownership inaction. My watch has ended. You win, Pete.

It seems outrageous that I – the one who was maligned for years, by fans, local media, and individuals within the Padres organization, for *correctly* identifying that the owner was pilfering funds, the team President was lying about it, and that the Padres would never improve without major changes – am the one who now sits before you, furiously typing on the same computer I used to create the documentary … about how things are actually pretty good. But, *looks into Paul Rudd’s eyes as he pours me a glass* here we are.

For the record, I plan on either re-branding (Bring Back Rye Triscuits) or just going away altogether and spending that time in more constructive ways, like making homemade Rye Triscuits and trying to become a titled chess and backgammon master. To be honest, the current Padres discussion climate just isn’t worth the time and effort anyway, except in rare, periodic long form.

It wasn’t always this way, though, and to a large extent, the prevailing climate – aversion to constructive conversation and the emergence of insulated circle jerks – reflects broader societal issues. It just so happens that this problem has polluted even basic sports talk, including the Padres, let alone everything else like politics, medicine, anthropology, and even comedy.

To better illustrate this point, I need to give a brief overview on the history of Padres discussions on the internet. This will be a long-winded aside, but these are my rules.

Like I hinted at above, most of what I now experience when it comes to Padres discourse, in general, is complete fucking garbage. And when I say Padres discourse, we all know we’re really just talking about Padres Twitter, which is where the overwhelming majority of it occurs. I recall once reading that Twitter is just a whole bunch of Letters to the Editor that never got published, but even that gives Twitter too much credit. Letters to the Editor require more effort and thoughtfulness, and generally exceed 140 words, let alone 140 characters. The character limitation is exacerbated by the fact that almost any asshole can post on Twitter: you, me, Mahmoud Ahmadinejad, Wendy’s, a paid troll in Russia or China, and even OJ Simpson. At least Letters to the Editor are curated. The entire world has been dulled by the reduction of discourse to limited characters with some implicit goal to acquire a dopamine kick from a free, intrinsically worthless “like”. The revival of tribal, 1930s identity politics during the era of Twitter is not a mere coincidence. It shouldn’t be a surprise, therefore, that Padres Twitter seems to solely push simple, naïve solutions (DeRp, DfA hOsMeR) when the truth for why the Padres failed in 2021 is far more detailed.

At the risk of dating myself, Padres discourse just kind of sucks now in a way that it didn’t back in the SignOn San Diego (and then SDUT forum) days. Back then, there were no likes or other forms of social validation guiding your posts. The forum was “intellectually” pure; it was an outlet to actually discuss various Padres-related topics, where posts were ordered chronologically and the flow of information was strictly Padres-related. The word limit was in the thousands and that lent itself to posters spending their time formulating coherent arguments and being able to discuss more complex, nuanced views than you can with a character limit in the low hundreds. Juxtapose that against the current situation, where users just blurt out whatever easy statement will rack up quick likes from the same inner circle of TikTok addicts, aspiring communists, and thirsty fives. Now, that doesn’t mean the forum didn’t also have its garbage – just ask someone there to explain the shortstop thread or Henry trade proposals – but overall, the level of discourse was probably the most well-developed that there ever was for Padres chatter. And the jokes and memes didn’t suffer for it, either.

The forum slowly died when (ironically) the UT took over. They didn’t kill the forum; it was just poor timing. A new form of Padres dialogue had entered the fray: Gaslamp Ball. Originally, GLB was an independent blog run by Padres fanatics Dex and Wonko. (Full disclosure: we have history.) Eventually, GLB became an SB Nation syndicate and with that acquired all the bells and whistles that come with a larger parent company. Among those resources was a real-time comment section. And that was how GLB became the intermediary between the forums and Twitter.

The real-time comments section was basically just a Twitter thread, but restricted only to replies about a specific post. It had its own version of likes and if you got enough of them, it’d be color-coded green. (Good job, popular shithead!) Some threads would get off-topic, but over 95% of the comments were directly related to the Padres. Replies had a character limit which was lower than the forum, but greatly exceeded Twitter. It, like Twitter, had censorship issues and was subject to the Padres front office trying to curry favor and control the message. In short, it was a more limited circle jerk than what we experience today, but it wasn’t a total shit show. Actual Padres topics were discussed in a more brief manner than the forums and lacked the rich text formatting and quote format that made the forums more elegant, but GLB still retained an “intellectual” flair on most Padres discussions, even if its leadership had questionable conflicts of interest, like receiving free mentorships from Tom Garfinkel or having Brooks Conrad nut in your sister.

But there’s a reason the documentary came from a SOSD user and not Padres Twitter, GLB, MLB.com comments, or Padres Facebook. Most topics (and especially those as important to clearly explain and nuanced as public policy) require long-form and suffer from outside influence, like the dopamine kicks and social validation garbage that guides Twitter discourse. Even basic ass topics like “San Diego Padres baseball” are difficult to adequately discuss in the word vomit format that is Twitter. I don’t want to put words in his mouth, but I’m pretty sure that’s a contributing reason to why the best Padres writer we’ve ever had isn’t a part of Padres discussions anymore. I would trade the entirety of Padres Twitter – every last one of you – to get Sac Bunt Dustin posts in my inbox again.

So yeah, this post will be long. It needs to be long. That’s fine. I can’t do what I’m about to do – basically dismantle the offensive portion of fWAR – in 140 or 280 characters and you should balk at the thought of any topic (let alone public policy!) being reduced to character limits.

And all this leads me back to Eric Hosmer, obviously. If you spend any time at all on Padres Twitter, you’ve likely heard quite a bit about Eric Hosmer. Nothing coherent, probably, but I’m sure you’ve heard it. In the off-chance that you haven’t, just look for the crowd pirating its mannerisms from the seagulls in Finding Nemo: “HOSMER”, “HOSMER”, “HOSMER”.

I’m probably one of the lone people still standing here believing that Eric Hosmer isn’t one of our biggest problems. I know your arguments well: he’s not worth his contract, his fWAR is negative, he grounds into double plays, he looks shitty defensively, and so on. You state your claims every day, even in situations completely irrelevant to Eric Hosmer, and after every benign out. I get it, seagull. Now it’s my turn to present to you why his fWAR – and many others’, including Tony Gwynn’s – is plain wrong and shortchanging him many wins, in a format conducive to formulating a coherent argument.

Statistics are like bikinis. They show a lot, but not everything.
Lou Piniella

One thing that I’ve always found interesting about Eric Hosmer is that his situational hitting metrics are off-the-charts good, in comparison to his generic line: in 1243 high leverage plate appearances, he has an .820 OPS, 51 points above his medium leverage line and 79 points above his low leverage line. This is abnormal; the league on the whole bats roughly the same in each of these situations, with the high and low leverage OPS’s sometimes the exact same, but occasionally differing by up to 9 points. It’s especially abnormal when you consider that pitchers are more likely to be allowed to hit in low leverage situations (deflating the aggregate numbers) while teams tend to bring in better pitchers for high leverage situations (deflating the numbers for non-pitchers).

There are a lot of plausible explanations for why this might be. Chief among them are clutch, extreme luck, and laziness. But those first two are unsatisfactory explanations for me; I’m not a huge believer in an innate clutch skill while, on the flip side, his sample size is large enough that he deserves some benefit of the doubt that this disparity is arising from something other than dumb luck. The third explanation concedes that Eric Hosmer is more talented and valuable than his baseline statistics indicate, which de facto wins the argument.

But allow me to propose a fourth plausible explanation, which I intend to prove: certain skillsets are more valuable in high-leverage situations than other skillsets. This would imply, then, that statistics which do not consider context – everything using linear weights, including wOBA and fWAR – are unfit for use, or require heavy editing, when applied to players who possess this other skillset. And it turns out that there are many players who possess this skillset, including Eric Hosmer.

For this, we’re going to turn to a different Fangraphs statistic: Clutch. Fangraphs’ Clutch number measures how many more wins a player added in high leverage situations than would be expected based on their overall performance level. For example, a player that adds 1.5 wins (by win probability added²) in high-leverage situations when their overall performance level suggests they should have added 2.5 wins in those situations would have a clutch score of -1.0. They may have still been good in those situations (which is true, since they added 1.5 wins) but relative to their baseline performance, they didn’t live up to expectations. They weren’t “clutch”.

² Please read this primer on win probability added if you need to understand how this works.

Now, before I continue, I want you to close your eyes and think about who you believe are the most “clutch” batters since 1987 (i.e. my lifetime).

Who did you name for the first one? David Ortiz? Derek Jeter?

*buzzer*

The answer is Mark McLemore, with a clutch score of 8.47, indicating that he produced roughly 8.5 more wins in high-leverage situations than we would have expected given his baseline performance. Next on the list is Ichiro, followed by Daniel Descalso (!?!), Omar Vizquel, Yadier Molina, Lance Johnson, Scott Fletcher, Jose Lind, Mark Grace, Bill Spiers, Tony Gwynn (+6.27 since ’87, +9.50 career), and Ozzie Guillen.

You might notice a trend here: these aren’t really clutch hitters. Sure, some of the names you may have come up with are on this list – Yadier, Johnny Damon and Jim Leyritz are all top 25 out of 2,018 players since 1987 with at least 500 plate appearances – but the majority of guys towards the top are simply contact hitters. It turns out that classic contact hitters “overperform” in high-leverage situations for two primary reasons:

1. In high-leverage situations, the value of a single is much closer to the value of a homerun than in an ordinary situation. For example, in a tie game with two outs and a runner on 2nd in the 8th inning, a single adds 27% in win probability while a homerun adds 34%. But in the third inning, that single adds 10% while the homerun adds 20%. The ratio of the value between a single and homerun is 1.26 in the first scenario, but 2.00 in the second.

For a handful of different score/base/out scenarios, the following chart shows the relationship between the inning of the game vs. the aforementioned ratio between the win probability added by a homerun and a single.

Whereas fWAR (through wOBA’s linear weights³, shown below) treats a homerun as being precisely 2.28 times as valuable as a single for all plate appearances in 2021, that ratio clearly isn’t true during actual baseball game situations. Sometimes a homerun is even more valuable than that, while other times a single and homerun are identical in value. Late game settings tend to bend these curves and really augment the relative values.

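³ Here’s the calculation of fWAR, so you can check the homerun-to-single ratio for yourself and see that all outs (except sac flies) are indeed treated the same, a topic discussed further down the post. The weights are recalculated every season, so the exact ratio depends on which year’s coefficients you plug in; the ones printed below work out to 2.101/0.888 ≈ 2.37.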

Working backward one step at a time:

WAR = (Batting Runs + Base Running Runs + Fielding Runs + Positional Adjustment + League Adjustment + Replacement Runs) / (Runs Per Win)

Batting Runs = wRAA + (lgR/PA – (PF*lgR/PA))*PA + (lgR/PA – (AL or NL non-pitcher wRC/PA))*PA

wRAA = ((wOBA – lgwOBA)/wOBA Scale) * PA

wOBA = (0.690×uBB + 0.722×HBP + 0.888×1B + 1.271×2B + 1.616×3B + 2.101×HR) / (AB + BB – IBB + SF + HBP)

That’s it. According to fWAR, a position player’s offensive WAR is ultimately decided by how many walks, hit-by-pitches, singles, doubles, triples, homers, sacrifice flies, and intentional walks the batter accumulated, normalized against league-wide performance.
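For anyone who wants to poke at the formulas rather than take my word for it, here is a minimal Python sketch of the wOBA and wRAA steps. It uses the coefficients exactly as printed above, which are one season's snapshot. The stat line is invented purely for illustration, and the function names are mine, not Fangraphs'.

```python
# Linear weights exactly as printed in the wOBA formula above (one season's snapshot).
WEIGHTS = {"uBB": 0.690, "HBP": 0.722, "1B": 0.888, "2B": 1.271, "3B": 1.616, "HR": 2.101}


def woba(line):
    """wOBA for a stat line given as a dict of counting stats."""
    # Only unintentional walks earn credit in the numerator.
    events = dict(line, uBB=line["BB"] - line["IBB"])
    numerator = sum(weight * events[name] for name, weight in WEIGHTS.items())
    denominator = line["AB"] + line["BB"] - line["IBB"] + line["SF"] + line["HBP"]
    return numerator / denominator


def wraa(player_woba, lg_woba, woba_scale, pa):
    """Weighted runs above average, per the wRAA formula above."""
    return (player_woba - lg_woba) / woba_scale * pa


# Hypothetical stat line, invented for illustration only.
hypothetical = {"AB": 500, "BB": 50, "IBB": 5, "SF": 5, "HBP": 5,
                "1B": 100, "2B": 30, "3B": 2, "HR": 20}

print(f"wOBA: {woba(hypothetical):.3f}")
print(f"HR-to-1B weight ratio: {WEIGHTS['HR'] / WEIGHTS['1B']:.2f}")
```

Notice what never shows up as an input: inning, score, outs, or runners on base. Every single is worth the same fixed amount, and so is every homerun.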

You might be wondering: how are fWAR’s weights for singles, doubles, and so on determined? Linear weights describe the value of these events (single, double, etc.) in terms of runs added, based on an analysis of historical games and how these events impacted total runs scored. It is not in terms of wins added, which is the crucial error. In effect, it ends up assuming that a run produced at any juncture is just as valuable as a run produced at any other, which is demonstrably false. A win, on the other hand, is always worth a win.

Ultimately, the game situation dictates the relative value of a homerun vs. a single. That should be accounted for in the way we assess player performance, unless there is hard evidence that player outcomes are randomly distributed. (You’ll be searching for a long time, because that evidence does not exist; players do participate in situational hitting.)

2. Contact hitters are more likely to make productive outs, like moving a runner over. Much more on this later…
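To put numbers on the first reason, here is a small Python sketch that compares the homerun-to-single ratio in the two game situations quoted above against the fixed ratio that linear weights assume. The win probability figures and the 2.28 ratio are the ones cited in this post. The dictionary layout and function name are just my own scaffolding.

```python
# Win probability added, in percentage points, as quoted in the text.
SCENARIOS = {
    "8th inning": {"single": 27, "homerun": 34},
    "3rd inning": {"single": 10, "homerun": 20},
}
LINEAR_WEIGHT_RATIO = 2.28  # the fixed 2021 ratio cited in the text


def hr_to_single_ratio(wpa):
    """How many singles' worth of win probability one homerun buys."""
    return wpa["homerun"] / wpa["single"]


for name, wpa in SCENARIOS.items():
    ratio = hr_to_single_ratio(wpa)
    print(f"{name}: HR is worth {ratio:.2f}x a single "
          f"(linear weights assume {LINEAR_WEIGHT_RATIO}x)")
```

The later and closer the game, the further the real ratio drifts from the one baked into the weights.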

My observation about contact hitters being disproportionately towards the top isn’t merely an observation. There’s a clear statistical relationship between clutch and several relevant statistics: swinging strike rate, contact percentage, single rate, groundball rate, and many others.

To show this, I’m going to use a data science pre-processing tool called weight-of-evidence. This is a widely used method to determine if a particular variable has a meaningful relationship with some target variable. In this case, my target variable is whether or not the player is in the 90th percentile or above for Fangraphs’ Clutch statistic: this is our “clutch” population.

You can skip ahead to the color-coded tables below if you wish, but I’m going to give a very quick primer on WoE for those who are curious:

Basically, you divide your variable into separate bins – generally even width or even volume – and calculate both the share of target events and share of non-target events in each bucket. You then take the natural logarithm of the ratio of these shares: it’ll be positive for buckets where our target variable disproportionately occurs and negative for buckets where our target variable disproportionately does not occur.

Generally, we aren’t just satisfied with the weight-of-evidence values themselves, but with how much total information value is provided. The information value for each bucket is the weight of evidence value multiplied by the difference between the shares. To get the total information value for the table, you simply sum up the information value for all buckets.

A total information value greater than 0.1 is considered to be reasonably useful, while anything above, say, 0.3 is considered to be fairly strong. (I’m simplifying, as there’s a bit of hands-on experience required to really understand this concept and its potential pitfalls – i.e. data leakage – but this basic understanding is sufficient for this post.)
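If you'd rather see the primer as code, here is a rough Python sketch of equal-width binning, weight-of-evidence, and information value. The toy data and the 0.5 smoothing term (added so an empty bucket doesn't blow up the logarithm) are assumptions for illustration. They are not the dataset or the settings behind the tables in this post.

```python
import math


def woe_table(values, is_target, bin_width):
    """Equal-width WoE/IV table. Returns (rows, total_iv)."""
    bins = {}
    for value, target in zip(values, is_target):
        key = math.floor(value / bin_width)
        counts = bins.setdefault(key, [0, 0])  # [target count, non-target count]
        counts[0 if target else 1] += 1

    # Assumption: add 0.5 to every count so empty buckets don't produce log(0).
    smooth = 0.5
    total_target = sum(c[0] for c in bins.values()) + smooth * len(bins)
    total_other = sum(c[1] for c in bins.values()) + smooth * len(bins)

    rows, total_iv = [], 0.0
    for key in sorted(bins):
        share_target = (bins[key][0] + smooth) / total_target
        share_other = (bins[key][1] + smooth) / total_other
        woe = math.log(share_target / share_other)
        iv = (share_target - share_other) * woe
        rows.append({"bin_start": key * bin_width, "woe": woe, "iv": iv})
        total_iv += iv
    return rows, total_iv


# Toy data: swinging strike rates (in %) and a made-up "is clutch" flag.
rates = [5.2, 5.5, 5.9, 6.1, 12.3, 12.8, 13.4, 13.9]
clutch = [True, True, False, True, False, False, True, False]

rows, total_iv = woe_table(rates, clutch, bin_width=1.0)
for row in rows:
    print(f"{row['bin_start']:>5.1f}%  WoE={row['woe']:+.2f}  IV={row['iv']:.3f}")
print(f"Total IV: {total_iv:.2f}")
```

Each bucket's information value is never negative, because the difference in shares and the logarithm of their ratio always carry the same sign. That is why summing them gives a sensible overall strength score.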

Here’s this method applied to swinging strike rate, where the binning is in equal widths of 1%:

There’s a pretty clear relationship here: the more you swing and miss, the less likely you are to be in the 90th+ percentile for Clutch⁴. Applying weight-of-evidence shows this is a strong relationship, as the total information value (IV) is 0.49.

⁴ Note that I’ve reduced the original 2000+ players down to 1478. That’s because batted ball data doesn’t exist before a certain date, so many players had to be removed from the dataset for this portion of the analysis.

The same general relationship, but inverted, applies for contact percentage:

Increasing contact rate increases the likelihood of being clutch – the 0.44 total information value confirms that there’s definitely something to this.

But more than anything, single rate seems to be a key discriminator:

I mean, look at that. Hitting more singles means you are considerably more likely to overperform in high-leverage situations, from a win probability added perspective, and thus be considered “clutch”. Players with single rates greater than 18% represent almost three-tenths of all 90th percentile Clutch players, but only one-tenth of all un-Clutch players. Amazingly, four of the six most single-heavy players are in the 90th percentile or above in Clutch: Ichiro, Juan Pierre, Dee Gordon, and some fuckin dude named Harold Castro. (The late, great Tony Gwynn would be included here, too, but the dataset does not include players without granular batted ball data, per the prior footnote.)

Even groundball rate shows some promise, but this is probably because groundballs often result in singles, which are disproportionately more valuable in high-leverage situations:

What does this all mean? At the end of the day, it means Mark McLemore’s 17.4 career fWAR is short-changing him by basically 8.5 wins. McLemore was worth about 50% more than fWAR indicates.⁵

⁵ This isn’t exactly how the math would work. Taking Clutch into consideration would alter so much of the underlying calculations that, pragmatically speaking, it would break fWAR altogether. (fWAR already is broken, once you ingest these realizations, but that’s another topic.) That’s because the total sum of the Clutch variable for all batters is actually negative; the relative value of events is bent so much during high-leverage situations that the only players who are able to maintain a positive relative performance in these conditions are those with specific underlying profiles. This is also surely affected by the quality of pitcher used in high-leverage situations. However, the units of Clutch are indeed wins, so a rudimentary exercise like this works to some degree.

But obviously, McLemore isn’t the only player affected. Our beloved Tony Gwynn was worth 9.5 more wins than his 65.0 fWAR states; that’s 14.6% more! Adrian Gonzalez was worth 5.73 more wins, while – good news! – Adam Frazier has produced 4.5 more wins. Most screwed of all in the old friend department is Geoff Blum, whose 3.3 extra ‘clutch’ wins is almost as much as his entire career fWAR of 4.4. Ultimately, there are a lot of players whose reputations as baseball players have been unfairly altered by fWAR’s failure to account for these effects.

On the other side of the spectrum, Sammy Sosa amassed a whopping -14.7 Clutch score, indicating that his performance in high-leverage situations was basically 15 wins worse than expected. Now consider that Sammy Sosa’s overall fWAR is 60.0, almost the same as Tony Gwynn’s (65.0). With the Clutch correction, however, their career difference widens from 5 wins up to 30 wins, from an 8% relative difference up to 41%! So while fWAR says these two players are very comparable theoretically, the story of what actually happened on the baseball field indicates that Sammy Sosa was nowhere close to the player that Tony Gwynn was.
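The back-of-the-napkin version of this correction is easy to reproduce. The Python sketch below simply adds each player's Clutch score to his career fWAR, using the figures quoted in this post. As the footnote warns, a real correction would change the underlying calculations, so treat the naive sum as a rough illustration only.

```python
# (career fWAR, Clutch) pairs as quoted in the text.
PLAYERS = {
    "Mark McLemore": (17.4, 8.47),
    "Tony Gwynn": (65.0, 9.50),
    "Sammy Sosa": (60.0, -14.7),
}


def clutch_adjusted(fwar, clutch):
    """Naive correction: add Clutch wins to fWAR; also return the % change."""
    return fwar + clutch, clutch / fwar * 100


for name, (fwar, clutch) in PLAYERS.items():
    total, pct = clutch_adjusted(fwar, clutch)
    print(f"{name}: {fwar:.1f} fWAR -> {total:.1f} adjusted ({pct:+.1f}%)")

gwynn_total, _ = clutch_adjusted(*PLAYERS["Tony Gwynn"])
sosa_total, _ = clutch_adjusted(*PLAYERS["Sammy Sosa"])
print(f"Gwynn minus Sosa, adjusted: {gwynn_total - sosa_total:.1f} wins")
```

The percentage gap between two players depends on which total you use as the baseline, so the raw gap in wins is the cleaner number to compare.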

There are other prominent players who suffer the same fate: Jim Thome, ARod, Adrian Beltre, and Giancarlo Stanton, among others, all have clutch values of -8.0 wins or worse. Mike Trout is at -7.88, Barry Bonds at -7.38, and old “friend” Matt Kemp at -6.82. The list goes on. Again, this doesn’t mean these players were bad. Quite the opposite, really; they were so good in general that more was expected out of them in these situations. It just turns out that the way value is generated in high-leverage situations is not the same as the way value is generated in ordinary situations, to the benefit/detriment of certain types of hitters.

And in what I’m sure you’ll find to be both humorous and validating, at the bottom of the list for the Padres – the least clutch Padre since 1987 – is none other than Chase Headley: -5.1 wins! Wil Myers sits in second-to-last at -4.35 wins. Phil Plantier, Darrin Jackson, and Dave Roberts (lol) round out the bottom five.

But you know where this is going, right? Right?!?

Of the 2,018 MLB players since 1987 with at least 500 plate appearances, Eric Hosmer ranks 18th, wedged between Adrian Gonzalez and Jim Leyritz. In his career, he has produced 5.84 more wins in high-leverage situations than would be expected from his overall performance. That’s a 53% increase on his 11.0 career fWAR. As a Padre, he’s put up 1.44 of those 5.84 additional clutch wins, worth something close to $12 million in value (but actually more because, as mentioned above, total league Clutch does not add up to zero).

For Hosmer, it appears his Clutch rating stems from a high groundball rate, fairly high single rate, productive outs, and plain ‘ole timely hitting, but not his contact or swinging strike rates.

Don’t get me wrong, though. He’s still very mediocre, perhaps worth 2 to 2.5 wins in his Padres career with the Clutch correction instead of fWAR’s paltry 0.5. But with this 300% increase in value as a Padre, we’ve started to enter the “maybe we shouldn’t just DFA him” territory. At the end of the day, Hosmer’s contract is a sunk cost: we’re paying that fucker no matter what. But if his actual production in 2022 is 1 win, rather than replacement-level, it makes justifying an outright DFA difficult. And that’s before considering platooning him, which would increase his per-unit value, and considering that finding a platoon partner is a far cheaper and easier option.

But the relative value of events changing in different situations isn’t the only reason fWAR is so wrong, particularly on Eric Hosmer.

Context neutral statistics, like Wins Above Replacement (WAR), are also predicated on the assumption that all plate appearances are equally indicative of player quality. That implicitly assumes that a player who hits a homerun in the ninth inning when down by six runs randomly hit this homerun and that this player is equally likely (from a skill perspective) to hit a homerun in the ninth inning of a tie game. It simply credits the player for hitting a homerun, removing everything that has to do with game context from the calculation.

In this way, WAR penalizes players who “overperform” in important situations in comparison to players who “underperform” in important situations. The creators of WAR contend that the outcomes are random – that a player who overperformed was lucky and that a player who underperformed was unlucky. Put another way, WAR assumes that the outcome in all situations is as simple as drawing a marble (groundball out, single, homerun, etc.) out of the player’s bag of outcomes, where the ratio of marbles in the bag is determined solely by the overall statistics from the player. This assumes that the ratio of marbles does not change in any circumstance – it assumes that a player is just as likely to hit a flyball with the bases empty as with a runner on third. In summary, WAR is completely predicated on the idea that situational hitting does not exist.

So in addition to ignoring that the value of events changes in different circumstances, fWAR also ignores that players might change the type of events they influence via situational hitting.

But situational hitting certainly does exist and it isn’t merely random. Hitters are more likely to groundout to the right side of the infield when there’s a runner on second and no outs. They’re more likely to fly out with a runner on third and less than two outs. And so on. These are well-known and studied. Going back to the analogy above: the ratio of marbles in the bag isn’t fixed and different players have different bags of marbles in different situations. We need to quit pretending they don’t.

There’s another logical fallacy in the argument against Hosmer worth highlighting here, as we discuss batted ball profiles. For years, both fans and writers have beseeched Hosmer to change his launch angle in order to increase the number of flyballs he hits. This would lead to more homeruns and fewer hard-hit groundball outs, increasing his theoretical value. I agree with this line of reasoning, although the effect (due to what I’ve highlighted above, but also due to what I’ll describe below) is considerably less than the theorists publicize. But underpinning this idea is the notion that players actually can control and therefore change their batted ball profiles.

That begs the question: if players can change their batted ball profiles, why is this idea only extended to the aggregate level, but not at the individual event level? Why do we assume that Hosmer could only change his overall flyball rate, while simultaneously assuming that Hosmer cannot control his batted ball profile in specific situations?

The situational hitting error doesn’t just end at failing to consider that players might hit more singles when a single is disproportionately more valuable. The other major glaring situational hitting omission is the failure to differentiate between types of outs. Productive outs, avoiding grounding into double plays, and sacrifice bunts aren’t considered. Revisit the formula in the footnote above, if you want to confirm. Players who excel at these things receive no credit for these skills, even though they are real baseball plays which result in improved win/loss outcomes.

To illustrate this point, let’s look at a few examples where different types of outs have significantly different values. With no outs and a runner on second, down one run in the bottom of the 7th, a strikeout reduces the odds of winning from 48.7% to 39.9%, a reduction in win probability of 8.8%. A productive out in this situation – via groundball to the right side of the infield – reduces the odds of winning from 48.7% to 46.5%, a reduction in win probability of 2.2%. Relative to the strikeout, the value of a well-placed groundball out is worth 6.6% of a win! That groundball out is more valuable in comparison to the strikeout than if a player on this same team had hit a leadoff single two innings earlier with the same score: 42.4% up to 48.2%.
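The arithmetic in that example is easy to check. A quick sketch, using only the win probabilities quoted above (the function name `wp_change` is just a label for this illustration):

```python
# Win probabilities, in percent, as quoted in the text for: no outs,
# runner on second, down one run, bottom of the 7th.
WP_BEFORE = 48.7
WP_AFTER_STRIKEOUT = 39.9
WP_AFTER_PRODUCTIVE_OUT = 46.5


def wp_change(before, after):
    """Change in win probability, in percentage points (rounded to 0.1)."""
    return round(after - before, 1)


strikeout_cost = wp_change(WP_BEFORE, WP_AFTER_STRIKEOUT)        # -8.8
productive_cost = wp_change(WP_BEFORE, WP_AFTER_PRODUCTIVE_OUT)  # -2.2

# How much better the productive out is than the strikeout.
gap = round(productive_cost - strikeout_cost, 1)                 # 6.6

# Leadoff single two innings earlier with the same score: 42.4% -> 48.2%.
single_gain = wp_change(42.4, 48.2)                              # 5.8

print(gap, single_gain, gap > single_gain)
```

So the swing from strikeout to productive groundout (6.6 points) really is bigger than the gain from that leadoff single (5.8 points).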

When we assess player value and talk about how players perform, we tend to focus solely on positive outcomes that players produce: the hits, the walks, the homers, and so on. And make no mistake, productive outs are still outs and they still mostly decrease the odds of winning. But all outs aren’t the same and some are much better than others. Unless we have good evidence that players can’t control the type of outs they produce and when they occur, we probably shouldn’t be assessing players with that rigid assumption. This is especially unfair since it double penalizes players if they sacrifice any portion of their positive outcomes in order to hedge against striking out; they are both penalized for getting out and then not credited for the real value produced when the hedge works.

You know where this is going, right? Right?!?

Surprise! It turns out Eric Hosmer is really good at making productive outs.

In his career, in situations where there is a runner on and nobody out, Eric Hosmer’s outs have advanced the runner(s) 195 times out of 535 chances: 36.5%. (As a Padre, it’s 36.6%, aligned with his career, and in 2021 it was 43.8%!) League average over this time period is 29.3%, which comes to an absolute difference of 38 of this type of productive out over his career. That’s only a few extra good outs per year, but it’s also only one type of situation where a productive out is helpful.
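For anyone who wants to check the math, here's a small sketch using the counts and league rate quoted above. The helper name `extra_productive_outs` is hypothetical. Note that straight division of 195 by 535 lands a touch under the quoted 36.5%, so the last digit depends on how you round.

```python
def extra_productive_outs(advances, chances, league_rate):
    """Return the player's advance rate and his surplus of productive outs
    over what a league-average hitter would have managed in the same chances."""
    player_rate = advances / chances
    expected_at_league_rate = league_rate * chances
    return player_rate, advances - expected_at_league_rate


# Figures from the text: 195 advances in 535 chances, league average 29.3%.
rate, extra = extra_productive_outs(195, 535, 0.293)
print(f"Hosmer advance rate: {rate:.3f}")
print(f"Extra productive outs vs. league average: {extra:.1f}")
```

The surplus rounds to the 38 extra productive outs cited above.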

Examining these situations more granularly, it turns out that many of them occur with a runner on second and no outs. Hosmer has advanced this runner 61.0% of the time in his career, well above the league average of 53.4%. Again, this is only a few plate appearances a year, worth a small fraction of a win.

Another situation where the value of outs differs significantly is when there’s a runner on third and fewer than two outs. Even in one of these basic ass early game sac fly situations – say, the bottom of the first, tie ballgame, runner on third with one out – the difference between the sac fly and a strikeout is a whopping 8.2% of a win (+2.0% versus -6.2%). As you may now expect, Hosmer is better than league average in this family of situations, scoring this runner 55.9% of the time in comparison to the league average of 50.5%. Like before, this isn’t worth all that much on its own – a few runs or fractions of a win here and there.

One good way to understand this difference is to compare Hosmer directly to his teammates. Wil Myers, who we noted earlier as being poor in Fangraphs’ Clutch measure, is really shitty in these situations. With a runner on and nobody out, his outs advance the runner(s) just 22.2% of the time, 7.1% less than league average. Astonishingly, Myers would have produced 77 fewer of this type of productive out than Hosmer over the same number of opportunities that Hosmer has had in his career! And in the subset where there’s a runner on second and nobody out, Myers has advanced the runner just 42.0% of the time, 19.0% less than Hosmer! In other words, for every two times that Myers moves this runner over, Hosmer does it three times.
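Here's the same comparison as a rough sketch, using only the rates and counts quoted in the text. Working from Hosmer's raw counts, the shortfall comes out around 76; plugging in his rounded 36.5% rate instead nudges it to the 77 quoted above, so treat the last digit as rounding noise.

```python
# Runner on, nobody out (figures from the text).
HOSMER_CHANCES = 535
HOSMER_ADVANCES = 195
MYERS_RATE = 0.222

# Productive outs Myers would be expected to make in Hosmer's 535 chances.
myers_expected = MYERS_RATE * HOSMER_CHANCES
shortfall = HOSMER_ADVANCES - myers_expected

# Runner on second, nobody out (rates from the text).
HOSMER_SECOND = 0.610
MYERS_SECOND = 0.420
ratio = HOSMER_SECOND / MYERS_SECOND  # roughly 3 advances for every 2

print(f"Myers shortfall over Hosmer's chances: {shortfall:.1f}")
print(f"Hosmer-to-Myers ratio with a runner on second: {ratio:.2f}")
```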

Myers isn’t the only teammate we can compare to. Hunter Renfroe also was dogshit in these situations – 25.0% and 38.3% – and Cody Decker literally never got a hit in any of these situations. Manny Machado, on the other hand, is pretty good in these situations, with career percentages of 30.1% and 60.8%. In case you were wondering, Tony was absolutely nails in both of these situations – 43.0% and 67.9% – as well as every other niche situational hitting category, except sacrifice bunting for some weird reason.

Again, I want to stress that those items above aren’t worth much individually. But, like remembering to turn off the light when you leave the room, the value will add up over a long period of time if you do it often and do all of them. In order to illustrate this, and avoid accusations of “cherry-picking”, let’s just look at every plate appearance and how it impacts wins/losses. Isn’t that what matters anyway? Every plate appearance, weighted for its impact? That’s called Win Probability Added.
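In case the idea is fuzzy, WPA is just the change in the team's win probability across each plate appearance, summed up. A minimal sketch follows; the `PlateAppearance` class and `win_probability_added` function are hypothetical names for illustration, and the three sample plate appearances reuse the win probabilities from the productive out example earlier rather than any real game log.

```python
from dataclasses import dataclass


@dataclass
class PlateAppearance:
    wp_before: float  # team's win probability before the PA (0 to 1)
    wp_after: float   # team's win probability after the PA (0 to 1)


def win_probability_added(plate_appearances):
    """Sum the win probability change across every plate appearance."""
    return sum(pa.wp_after - pa.wp_before for pa in plate_appearances)


# Illustrative sample only, built from numbers quoted earlier in the text.
sample = [
    PlateAppearance(0.487, 0.465),  # productive groundout, runner moves up
    PlateAppearance(0.487, 0.399),  # strikeout in the same spot
    PlateAppearance(0.424, 0.482),  # leadoff single
]

total = win_probability_added(sample)
print(f"WPA over the sample: {total:+.3f}")
```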

Whenever WPA and similar statistics get brought up, they are often dismissed with a simple “they’re noisy” complaint. In one sense this is true: the aggregate WPA figure is significantly influenced by a few high-leverage plate appearances, and that isn’t always an accurate representation of player skill. But this is a lazy dismissal. Most plate appearances aren’t noisy, while the studies done by Fangraphs and The Hardball Times focus solely on the consistency of year-to-year total WPA instead of on percentiles or career totals. Even then, the high magnitude events – clutch or lucky or unlucky, however you want to characterize them – are still real events which happened; they added the value they purport to have added and I don’t mind erring on the side of properly crediting players for their achievements.

Anyway, one way to really visualize this concept is to rank order every plate appearance from most helpful to most harmful and plot it on an x-y plane: the x-axis is the “badness” percentile of the event while the y-axis is the value of the event in terms of win probability added.

Let’s brainstorm what we expect to see when we make this plot. We should expect to see that the plot crosses the x-axis at roughly the same percent as the player’s on-base percentage, since getting on-base is almost always positive while making outs is almost always negative. We should expect that players who hit more homers and have a higher slugging percentage will have more high-value events than the others. And, at the other end, we should expect players with more productive outs to have a softened tail, as their outs will appear less negative.
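Building the curve itself is simple. Here's a bare-bones sketch, assuming the plate appearances are sorted from best to worst by win probability added; the ten WPA values are invented for illustration and the function names are hypothetical. It returns the points you'd plot and finds where the curve crosses zero.

```python
def badness_curve(wpa_values):
    """Sort plate appearances from best to worst and pair each one with its
    percentile (0 to 100) along the x-axis."""
    ordered = sorted(wpa_values, reverse=True)
    n = len(ordered)
    return [(100 * (i + 1) / n, wpa) for i, wpa in enumerate(ordered)]


def crossing_percentile(curve):
    """Percentile of the last plate appearance with positive WPA, which is
    roughly where the curve crosses the x-axis."""
    positive = [pct for pct, wpa in curve if wpa > 0]
    return positive[-1] if positive else 0.0


# Made-up ten-PA sample: four times on base, six outs of varying badness.
sample = [0.12, 0.03, 0.02, 0.01, -0.01, -0.01, -0.02, -0.02, -0.03, -0.06]
curve = badness_curve(sample)
print(crossing_percentile(curve))  # 40.0 for this sample, like a .400 OBP
```

A hitter with more productive outs would show smaller negative numbers in the back half of that list, which is the softened tail described above.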

This is exactly what we see, plotted here for Hosmer, as well as a number of other first basemen (Josh Bell, Jesus Aguilar, Miguel Sano, and Trey Mancini) who allegedly were all more valuable offensively in 2021 than Hosmer:

It’s hard to really process everything here, as the y-axis is stretched to view the extreme events – game-winning hits, etc. – so let’s zoom in on the boring region: the other 80% of plate appearances that we generally don’t focus on when looking at player performance.

Well, it’s pretty clear from this graph that Eric Hosmer is better than all four of these other first basemen in basically half of all plate appearances. For example, from the 70th percentile onwards, Hosmer’s average plate appearance is like a 0.5% to 1.0% win probability improvement over Miguel Sano and Josh Bell per plate appearance. Doing some back-of-envelope math…

From the 70th to the 90th percentiles is 20% of all events, which is something like 125 plate appearances, or 0.6-1.2 wins added just by making less shitty outs than these other guys, either situationally or through committing productive outs. And again, almost none of this is captured by Batting Runs and, therefore, this real value is not a part of Hosmer’s fWAR. Some portion of it will be captured by Clutch, but not all of it.
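Spelled out, the back-of-envelope math looks like this. The season total of 625 plate appearances is an assumption, chosen because it's what "20% is something like 125 plate appearances" implies; the function name is hypothetical.

```python
def wins_from_softer_outs(total_pas, share_of_pas, wp_gain_per_pa):
    """Wins added = number of affected PAs times win probability gained per PA."""
    return total_pas * share_of_pas * wp_gain_per_pa


TOTAL_PAS = 625  # assumption: implied by 125 PAs being 20% of the season
SHARE = 0.20     # the 70th through 90th percentile band

low = wins_from_softer_outs(TOTAL_PAS, SHARE, 0.005)   # 0.5% per PA
high = wins_from_softer_outs(TOTAL_PAS, SHARE, 0.010)  # 1.0% per PA
print(f"{low:.3f} to {high:.3f} wins")
```

That works out to roughly 0.6 to 1.2 wins, give or take rounding at the top end.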

Another interesting way to see this difference is to examine their outcomes on a basic boxplot.

The top of Hosmer’s range of best outcomes is worse than the others, his mean and median outcomes are about the same, while his worst outcomes are clearly better. And Miguel Sano, who has more Batting Runs than Eric Hosmer, is actually worse in all of these categories, so another point to the fWAR “it’s just lazy theoretical dogshit” counter.

The win probability charts aren’t displaying some 2021 fluke, either. Hosmer has pretty much always been like this as a Padre:

Some years it has been more pronounced than others, but the overall line here is pretty consistent with his 2021 season: his chart reflects that his outs are just not as bad as the other guys’. There are more productive outs and a change in approach in important situations. This hurts his overall line, since fWAR doesn’t consider type of out, but helps the team win more than these other first base options. I don’t know about you, but that kinda sounds like a team player.

Taking all of this into consideration – the changing value of events based on game situation, the batted ball profile of a player influencing Clutch, the changing distribution of outcomes from situational hitting, productive outs – it’s hard for me to ever take fWAR seriously again. Its fundamental assumptions are just, well … wrong. The nicest thing I can say about it is that it’s a fun theoretical concept whose offensive component, at the very least, needs to be completely rethought.

But the problem doesn’t just end with the fact that the statistic itself is poorly constructed.

What exacerbates the issue is the fact that players know they are being assessed based on metrics like fWAR. This incentivizes players to jettison areas of real value – fuck singles and productive outs, amirite?!? – in exchange for maximizing the fake value (fWAR) by which they appear to get compensated. It shouldn’t surprise that players continue to trend towards “three true outcomes” given this dynamic.

This bastardization of an otherwise helpful metric is not something foreign to baseball. In fact, these perverse incentives abound in real life and are the root of many major scandals in recent memory. Take the Wells Fargo Account Opening scandal, as a prime example.

Internally, Wells Fargo started assessing and compensating their employees based on how many new accounts they opened. Shortly thereafter, Wells Fargo employees started opening accounts for existing customers without their consent. Millions and millions of them. Perhaps number of accounts opened was a good metric when the employees themselves were not aware that this was how they were being assessed. But as soon as Wells Fargo made this a known metric, it shouldn’t surprise that unscrupulous activity followed. And whatever value the metric may have had in assessing employee performance before the metric was known, it certainly didn’t have the same meaning when employees started opening millions of accounts without consent.

The same thing has happened in baseball. The second we started assessing players with a bunk set of assumptions which completely devalued productive outs and situational hitting, we incentivized players to strike out hundreds of times, so long as they hit a few more dingers in probably meaningless situations. We finally have someone to blame for Ryan Schimpf. Good riddance, Dave Cameron!

What it all boils down to is quite simple: in our noble quest to upgrade from boomers talking about pitcher wins and “ribbies”, we’ve literally removed any semblance of what actually decides games – how much an event impacts the odds of winning and losing – from our player assessments. We’ve gone all the way off the deep end. As Kirk Lazarus would say…

So where do we go from here? Should we use fWAR? Should we use any of the new-fangled statistics at all? My answer is a resounding yes, with a caveat. We use the tools provided to us in the way that they’re supposed to be used: for the problems they were designed to solve when the assumptions within them are satisfied.

In one sense, Fangraphs giving the world fWAR was like giving the world a hammer. The problem is that the world includes Padres Twitter, with the mental capacity of a toddler. While a hammer is a great tool for a narrow set of use-cases, if you give a hammer to a toddler, they’re just going to break a ton of shit and only rarely will they actually strike a fucking nail. Put down the hammer, boys, and consider using the wrench or flathead every once in a while.

In the case of Eric Hosmer and similar players with a history of overperforming in important situations and making productive outs, we can’t run around quoting fWAR like it’s some divine truth. At the very least, for these players we should be adding up their fWAR and Clutch values, and making other minor adjustments for things like productive situational grounders.

Speaking of which, we need a term for productive situational grounders that isn’t ten syllables or thirty characters, if it’s ever going to become part of our vernacular or used on Twitter. We can’t just go the acronym route either because it (PSG) conflicts with French soccer club Paris Saint-Germain and will never be taken seriously.

Wait a second…productive situational grounders.

Productive. Situational. Grounders.

Productive. Situational Grounders.

Prestige value! FINALLY!

Gwynntelligence

Think. Laugh. Enjoy Padres Misery.
