GDPR. The Facebook scandal. Amid mounting privacy concerns, Data Guy continues to sell authors’ private sales data to Big Pub. But is it even accurate?

It’s now several months since the January 2018 Author Earnings Report arrived. You can still find some of it on the Author Earnings site.

But only some of it. Because within hours of the report going live Data Guy was redacting information, and soon after removing entire sections of the report as anger rose in the indie community about how Author Earnings had become a sideshow of a commercial enterprise called BookStat.

We knew this much because when the January 2018 report went live Data Guy told us, complete with a link, about the BookStat site.

That link quickly disappeared, one of the first casualties as the January 2018 Author Earnings comments thread became a damage limitation exercise.

Critical discussion about the accuracy or otherwise of the data presented and its implications for the industry was lost in a debate – and here I use that term loosely, as there are claims Data Guy was deleting awkward questions – about Data Guy’s selling indie author sales information to corporate publishing.

This was not what we signed up to.

First, some background. Let’s cast our minds back to those heady days of 2014, when the Author Earnings Report first emerged.

The 7k Report

Said Hugh Howey at the time,

It’s no great secret that the world of publishing is changing. What is a secret is how much. Is it changing a lot? Has most of the change already happened? What does the future look like?

The problem with these questions is that we don’t have the data that might give us reliable answers. Distributors like Amazon and Barnes & Noble don’t share their e-book sales figures. At most, they comment on the extreme outliers, which is about as useful as sharing yesterday’s lottery numbers.

This lack of data has been frustrating.

I have longed for greater transparency so that up-and-coming authors can make better-informed decisions.

Well, Howey certainly gave us data, courtesy of the “data guru”, who in February 2014 offered us the “7K Report”, taking a look at 7,000 titles on Amazon to show how indie authors were, apparently, outselling and out-earning Big Pub.

It was revolutionary stuff back then, and given it told us what we all wanted to hear, no-one was too bothered about the fact that the “data guru” was simply taking a carefully selected single day’s data, itself of questionable accuracy, and extrapolating from that a full year’s sales.

It was a recipe for questioning the validity of the findings, and that kicked off just a week later when the second Author Earnings Report came out, looking at 50,000 titles. The comments thread is dominated by remarks questioning the results on numerous counts.

February 2014 Author Earnings Report

But by and large there was a ring of truth to the results, and over the following years Data Guy delivered what we wanted to hear. Evidence that indies were doing well and traditional publishing was struggling.

True or not, it sounded good.

The problem was, Data Guy, who supposedly had all the numbers at his fingertips, chose never to give us the whole picture.

As Jane Friedman and Porter Anderson said in the Hot Sheet this past February,

The reports were erratic in focus, each changing the basis for analysis, so that no consistent comparative picture could be built from them.

This, of course, was quite deliberate. We can be sure of that, because no-one with even elementary training in statistics would deliver such a muddled picture by accident.

Besides, Data Guy himself touted his credentials, dismissing any criticism with the comment,

I get a real chuckle every time I see the assumptions folks are making about the team’s data analysis expertise or lack thereof. I suspect that, if that background were common knowledge, a lot of the most vocal folks promoting their own presumably-superior qualifications would get real quiet real quick.

But of course that background is not common knowledge. In all the years of the Author Earnings Reports we have always been asked to accept Data Guy’s anonymity and take his expertise on trust. That’s not to say many in the industry are unaware of who he is, but we all respect his decision to keep his real identity private.

Or at least, that was the case. But now we are in the bizarre situation where Data Guy, still declining to reveal his own true identity, is happily selling indie author earnings data through BookStat to anyone who can meet the price.

Just what the price is remains unclear, but to even qualify for consideration you need to be a company with a turnover of $10 million or more.

If your company’s annual revenues are $10 million or more, and you think Bookstat’s sales data dashboard might be a good match for your business needs, give us a shout. We’d be happy to set up a live demo or explore different subscription options.

What sort of data is Data Guy selling?

From the largest Big Five trade publishers down to the scrappiest garage micropresses, to sales from Amazon’s in-house publishing imprints and format-dominating Audible Studios to J.K. Rowling’s Pottermore — data that you’ll find nowhere else — even the sales of individual self-published authors: it’s all right there, live at your fingertips, ready for you to ask it the questions that drive your business.

And yes, “even the sales of individual self-published authors” means how many sales you or I made today / last week / last year. All available to corporate publishing to examine at their leisure.

Will that data be accurate? Data Guy is claiming around 95% accuracy. But one of the many problems here is that we who aren’t in the $10 million turnover club are not going to be given the chance to see, challenge, or correct the claims Data Guy is making about our sales.

Now that may be no big deal for the hobbyist author selling a handful of books each month. It gets more serious for the career author. Those numbers of Data Guy’s could be way off the mark. We don’t know, of course, because we aren’t being given the option to inspect the data BookStat is selling about us.

There have always been serious questions about the accuracy of the Author Earnings Reports, both publicly and, especially, in discussions in private Facebook groups where professional indie authors congregate. And the latest report was no exception.

Private Facebook discussions will remain private – here at TNPS we do respect author privacy – but to make the point I will pick up on the public utterances of one indie author who has been vocal on both the privacy and the accuracy issues relating to Author Earnings January 2018 and to BookStat.

When the issue of Data Guy selling our sales info came up on The Passive Voice in January, Marie Force was one among many to say this was totally unacceptable.

New Author Earnings Report

They had no right to publish our names and data without our permission or to sell and profit from our data, especially when we have reason to believe he’s WAY off on a few things. Just because you CAN doesn’t mean you SHOULD. Even businesses have a right to keep their proprietary information private from competitors.

Accuracy

Marie Force is one of the most respected figures in the indie community, with millions of sales behind her, and her comment touches on three key areas that concern us here.

Data Guy publishing our names and data without our permission; Data Guy selling that information for profit; and Data Guy asserting the data being sold is accurate.

The latter, especially, should be a red flag for the corporate publishers and other businesses shelling out for this data, as Marie Force is also a respected figure in the traditional publishing community.

When an author of this calibre, experience and sales numbers says some of Data Guy’s numbers are “WAY off” (her capitals) that is not something to dismiss lightly.

Marie Force returned to this (and the privacy issue) in April, again in comments on The Passive Voice, when the topic of top-selling indies came up.

Top-Selling Indie Authors?

Here running several Marie Force comments together.

NO ONE KNOWS what anyone is actually selling. NO ONE knows. Not even data guy … NO ONE has any way to know what an author makes in audio, foreign, traditional advances, sales on other platforms that are often AS robust as Amazon. Any speculation is total and complete BULLSHIT. There is no way to know unless you are collecting the proceeds … (T)hese lists are PURE speculation. They add nothing of any real value to the conversation. And if you are a person who believes that some things should be private, seeing your name on a list that details your earnings is rather infuriating. Especially when you weren’t consulted BEFORE the list went public and especially when the people behind the list plan to then SELL the data. Since there is absolutely no way to conclusively determine who is making what, especially if you’re using skewed Amazon rankings as your guide, it’s probably best not to speculate.

At which point let’s bring in another “indie big name”, Dean Wesley Smith, who notes in comments on The Passive Voice and on his own blog that whatever numbers Data Guy may have come up with for his sales will be totally off the mark.

My books per title don’t sell well on Amazon. I make most of my money from my books in all the places Data Guy can’t get to, such as movie options, overseas sales, secondary markets, kickstarter projects, book funnel sales, sales through eBay, sales through our own B&M stores, and so on and so on. But I still don’t want that information on my 400 titles out there.

In addition, of course, there’s also the small matter of the 225 million downloads OverDrive registered last year that are completely ignored by Data Guy. How can anyone take seriously a report on the US book market that totally disregards almost a quarter of a billion library downloads, but is happy to include Kindle Unlimited subscription downloads?

Remember what BookStat is claiming:

From the largest Big Five trade publishers down to the scrappiest garage micropresses, to sales from Amazon’s in-house publishing imprints and format-dominating Audible Studios to J.K. Rowling’s Pottermore — data that you’ll find nowhere else — even the sales of individual self-published authors: it’s all right there, live at your fingertips, ready for you to ask it the questions that drive your business.

So to be clear, BookStat can tell us Joe Debut Amateur Author sold five copies last month, but 225 million OverDrive library downloads, and countless millions more from other digital library operators, are dismissed as irrelevant, even though the libraries are paying for these books to be in their catalogues.

But it’s not just untracked sales and revenue from other sources that is the issue here.

The bigger question is about the ebook sales and revenue that Data Guy is, through BookStat, claiming to be tracking with such precision.

Now we capture over a million top selling titles a day. Every day.

Our analytics run in real-time, 24/7.

Which means that if a book sold even a single online copy since April 2017, no matter whom the publisher or author, we can probably find it in our ever-growing dataset. Whether that title sold two copies yesterday or two thousand, we can see those sales. We can total them up in our dashboard. And for next week’s unreleased titles–or next month’s–we can tally up their accumulated online preorders, too.

With over 250 million rows of live ebook, audiobook, and online-print data at our fingertips, we can now, with the click of a mouse, slice & dice online book sales from last quarter, last month, or last week, any way we like. So let’s take a look back over the last three quarters of 2017, from late April thru the end of December, to see what our dashboard can tell us about which books US consumers were actually buying online during that 9-month period.

That from the January 2018 Author Earnings Report.

January 2018 Report: US online book sales, Q2-Q4 2017

The problem being, the numbers offered to us in the January 2018 report (itself full of contradictions – see below) are wildly at odds with the numbers Data Guy has offered us in previous Author Earnings Reports, which we have also been led to believe were accurate.

As ever, Data Guy has been careful to show us almost no continuity with previous reports that might let us build a realistic picture, and as ever we are expected to take his numbers on trust.

Yet on the rare occasion we do have something to compare, Data Guy goes out of his way to ensure we do not compare the new data with the old.

Says Data Guy in the January 2018 AE Report, which bears little relation to previous AE reports,

It doesn’t make sense to compare our new, continuously aggregated market-wide data against one of our old single-day, single-retailer AE snapshots; that would be an apples-vs-oranges comparison, and relatively meaningless.

No, sorry, Data Guy, but it doesn’t matter here how you derived your figures. What matters is that the numbers you told us before, and were claiming were accurate, and the numbers you are telling us now, and are claiming are accurate, are totally different.

Data Guy asserts in the January 2018 Author Earnings Report that in Q2-Q4 2017 the total value of ebooks sold in the US was $1.3 billion.

January 2018 Report: US online book sales, Q2-Q4 2017

Data Guy makes the point that seasonal differences in the ebook sector are negligible, so it would not be unreasonable to assume that, if $1.3 billion worth of ebooks were sold in Q2-Q4 then for the whole year the value would be $1.7 billion.
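For anyone who wants to check that arithmetic, here is a minimal sketch. The function name `annualize` is my own, and the even-spread assumption simply follows Data Guy’s point that seasonal differences are negligible; it is a back-of-envelope scaling, not anything taken from BookStat.

```python
def annualize(partial_total: float, months_covered: int) -> float:
    """Scale a partial-year total to twelve months.

    Assumes sales are spread evenly across the year (no seasonality).
    """
    if not 0 < months_covered <= 12:
        raise ValueError("months_covered must be between 1 and 12")
    return partial_total * 12 / months_covered


# Figure quoted from the January 2018 report: $1.3 billion over nine months.
nine_month_value_bn = 1.3
full_year_estimate_bn = annualize(nine_month_value_bn, 9)

print(f"Estimated full-year value: ${full_year_estimate_bn:.1f} billion")
```

The same scaling can be applied to any of the nine-month figures in the report, provided you accept the even-spread assumption.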

Now that’s an impressive number, of course. But here’s the thing.

In the February 2017 Author Earnings Report looking at the full year 2016, Data Guy told us categorically that total US ebook sales were worth $3.2 billion.

February 2017 Big, Bad, Wide & International Report: covering Amazon, Apple, B&N, and Kobo ebook sales in the US, UK, Canada, Australia, and New Zealand

Wait. What? Has the US ebook market collapsed between 2016 and 2017?

Well, we all know the ebook sales of the Big 5 have plummeted. But by that much?

If the market has collapsed, why isn’t Data Guy telling us about it? If it hasn’t, then allowing for Data Guy’s claims of precision with the new numbers, were Data Guy’s numbers wrong before?

Have we previously been told, erroneously, that the US ebook market was worth much more than it actually was?

Let’s take another example: The glaring difference in the number of romance ebooks being sold, as per Data Guy’s statistics.

Back at the RWA Conference in 2016 Data Guy confidently stated 235 million romance ebooks were sold in 2015.

2016 Romance Writers of America RWA PAN Presentation

Impressive! If accurate. But was it?

Here’s the thing: when Data Guy in January 2018 offered us his Q2-Q4 numbers for unit romance sales for 2017 he clearly gives us a number of just over 50 million.

Extrapolating for the extra three months to give us a full year we arrive at a number of around 66.7 million. A far cry from the 235 million romance sales for 2015 that Data Guy was asserting in 2016.

Romance, so we are told, is the biggest selling genre (but hold that thought – see below) so this much smaller number of romance ebooks being sold, as per Data Guy 2018 compared to 2016, is consistent with the overall ebook dollar value being much smaller as reported in 2018 compared with 2017. Again, all this is Data Guy’s publicly revealed numbers, so who knows what other contradictions lie in the BookStat data we are not being allowed to see.

I asked Data Guy about these discrepancies and he was at pains to explain that,

The “differences now emerging” that you mention, aren’t real differences. They are misperceptions resulting from attempting to compare two numbers that aren’t measuring the same thing.

Data Guy elaborated,

Think about what you are comparing here, and you’ll see the problem. The old AE reports captured roughly 1% of a year’s sales (i.e. four single-day snapshots, out of 365 days) and then projected that statistically to the other 99% of the year (the days not captured).  The new data set reports ONLY the fraction of book sales that were explicitly and individually captured during the time period, on the days they were captured; it makes zero attempt to project that total to account for days not covered, titles not captured on particular days, other retailers not tracked, etc.  Comparing the two measures directly and inferring past overestimates or market collapses makes no mathematical or statistical sense.
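To make the two measures Data Guy describes concrete, here is a rough sketch of each. Everything in it is an assumption for illustration: the daily sales figures are randomly generated, the snapshot days and capture window are picked arbitrarily, and the function names are mine, not BookStat’s.

```python
import random

random.seed(0)  # fixed seed so the made-up data is reproducible

# Hypothetical daily unit sales for one year. These are NOT real figures.
daily_sales = [random.randint(800, 1200) for _ in range(365)]


def project_from_snapshots(sales, snapshot_days):
    """Old-style estimate: average the snapshot days, then scale to the full period."""
    sampled = [sales[day] for day in snapshot_days]
    return sum(sampled) / len(sampled) * len(sales)


def captured_only(sales, captured_days):
    """New-style total: add up only the days actually captured, with no projection."""
    return sum(sales[day] for day in captured_days)


snapshots = [30, 120, 210, 300]   # four single-day snapshots (arbitrary choice)
captured = range(115, 365)        # roughly late April to the end of the year

projected_total = project_from_snapshots(daily_sales, snapshots)
captured_total = captured_only(daily_sales, captured)
true_total = sum(daily_sales)

print(f"Projected from four snapshots: {projected_total:,.0f}")
print(f"Captured days only:            {captured_total:,}")
print(f"Actual full-year total:        {true_total:,}")
```

In this toy setup the captured-only total is always below the full-year total, because the days outside the capture window contribute nothing to it.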

So just to be crystal clear here, Data Guy is stating that when I compare the twelve-month $3.2 billion value for the US ebook market published in Author Earnings in February 2017 with the nine-month $1.3 billion value for the US ebook market as published in Author Earnings in January 2018, I am comparing apples and oranges.

No, Data Guy, I am comparing the value of the US ebook market YOU gave us in two consecutive years. Two values that are massively different, telling us either the US ebook market collapsed during that twelve month period, or the values for one or other year were erroneous, or a mixture of both.

Data Guy of course was fully aware of the discrepancy, which is why he implores us not to compare the old and new data. What the eye doesn’t see…

Curiously, while Data Guy is going out of his way here to say there is no inconsistency, and extolling the accuracy of the new BookStat data, in comments on the January 2018 Author Earnings report he admits there are still many weak areas. And some are deeply disturbing.

One author notes,

Some of your top 50 ebook earning authors have incorrect title counts. At a glance, a few are just way bloated, more titles than that author actually seem to have. What data are you pulling that’s making the title lists look like this?

Data Guy responds,

Sometimes, especially when an indie isn’t using ISBNs (and most aren’t), the algorithms don’t do a great job of auto-correlating different versions of the same book selling at different retailers, so they look like different books to our spider. Foreign-language editions, even very low-selling ones, also creep into the title count if they are available from US retailers.

So let’s get this straight. Big corporations are paying through the nose for Bookstat data derived from algorithms that Data Guy admits “don’t do a great job.”

To another author Data Guy responds,

Sales from two-author collaborations and multi-author box sets aren’t divided up neatly between authors in our data; so authors who derive a big chunk of their revenue from either or both could be getting assigned too large a share of revenue from those box sets, or too little.

So this 95% accurate BookStat data actually could be too large or too little. Goldilocks, eat your heart out.

Yet another author says,

Out of curiosity, why is JK Rowling listed twice on the print book list (5 & 17)?

Data Guy explains,

Inconsistency in the publisher-entered “author name” metadata at the different retailers, or for different books by the same author at the same retailer. One of the many items on the cleanup list…

But hold on, Data Guy. Just now it was the fault of indies not using ISBNs. But ALL print books have ISBNs. And exactly which publisher would inconsistently enter the name JK Rowling?

There is no room for confusion here. This is not a problem at the publisher or retailer end. This is a problem at the BookStat end.
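Merging variant spellings of one author name is not exotic work. Below is a minimal sketch of the kind of cleanup involved. The helper `author_key`, the sample rows and the title counts are all hypothetical, and I am not suggesting this is how BookStat’s pipeline works; a real system would also need to handle accented characters, pen names and co-authors.

```python
import re
from collections import defaultdict


def author_key(name: str) -> str:
    """Collapse case, punctuation and spacing so simple name variants share one key.

    Crude on purpose: anything that is not a-z is dropped.
    """
    return re.sub(r"[^a-z]", "", name.lower())


# Hypothetical metadata rows: (author name as entered, title count). Not real data.
rows = [
    ("J.K. Rowling", 3),
    ("JK Rowling", 2),
    ("J. K. Rowling", 1),
    ("Marie Force", 4),
]

titles_by_author = defaultdict(int)
for name, count in rows:
    titles_by_author[author_key(name)] += count

for key, total in titles_by_author.items():
    print(key, total)
```

With a shared key, the three spellings collapse into a single entry instead of appearing as separate authors on a chart.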

But it gets worse. A closer inspection reveals that JK Rowling is not only listed twice, but at position #7 she is attributed 76 titles and at position #17 her title count rises to 93.

At which point we might ask, is 76 correct, or is 93 correct, or do we need to count them both and make 169, or are none of these numbers actually accurate? After all, when we turn to JK Rowling in the ebook list she has 220 titles!

For a player of Rowling’s stature this matters enormously, because Data Guy is attributing sales and revenue to these titles when he can’t decide from one minute to the next how many titles there actually are.

How can we trust any numbers the BookStat report tells us when Data Guy is admitting he can’t even track the title count of the most iconic author on the planet?

A reminder of what Data Guy bragged about his team:

I get a real chuckle every time I see the assumptions folks are making about the team’s data analysis expertise or lack thereof. I suspect that, if that background were common knowledge, a lot of the most vocal folks promoting their own presumably-superior qualifications would get real quiet real quick.

All this “data analysis expertise” and Data Guy and co. not only don’t know how many titles JK Rowling has out, but they didn’t even bother to look at the charts they published and spot the glaring errors any Middle Grade kid would have picked up.

Messing up with JK Rowling is simply inexcusable.

But it’s not just Rowling. On the Sell More Books Show, among many public and many more private discussions, this issue of title count accuracy comes up again. In the podcast it is mentioned that some indie authors whose names were initially published in the January AE Report (the list was later deleted) are claiming the title count bears no relation to the books they have out.

Episode 200 – Kobo & Walmart, Apple Books, and BookStat Backlash

Let’s tease this a little further since the accuracy question got subsumed by the privacy question in January. (I’ll come to the privacy issues later in this post.)

Late in comments on the January AE Report, long after most visitors will have departed, someone picks up on the discrepancy between the 2016 RWA Report and the 2018 AE Report I mentioned above.

In your 2016 presentation to the RWA, you put the size of the US romance ebook market at 235m units. In the data above, 9 months is 50m units – which seems a precipitous drop. What’s responsible for this? Improved accuracy in new crawling?

It seems an enormous difference, would that put into question your previous estimates of the market, or am I misreading the data?

Data Guy explains,

The difference between the old 235M-ish number and what’s calculated here comes down to how particular book sales get assigned to particular genres, basically, and how to account for titles that are listed under multiple genre categories.

The old number you referenced included every sale of every single book that was listed under any Romance subcategory, even if that same book was *also* listed under Mystery/Thriller and/or Scifi/Fantasy and/or Literature & Fiction, etc. On the other hand, these newer calculations shown here only give “partial credit” to Romance for sales of books that are double-, triple-, or quadruple-listed under other genres.

Doing it this way gives a truer, non-overlapping measure of what percentage share of all book sales each genre or subgenre commands. The percentage breakdowns by genre or subgenre will now add up to 100%, which is the kind of intuitive slice-of-the-pie measure most people tend to expect.

There we have, in a nutshell, at least one reason why the earlier Author Earnings Reports wildly exaggerated the size of the US ebook market. Data Guy was multiple-counting titles because they showed up in more than one category.
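The difference between the two counting methods is easy to see with a toy example. The titles and unit figures below are invented, and the equal split between genres is my assumption; Data Guy says only that multi-listed books now get “partial credit”, not how the credit is divided.

```python
from collections import defaultdict

# Hypothetical titles: (units sold, genres the title is listed under). Not real data.
titles = [
    (100, ["Romance"]),
    (60, ["Romance", "Mystery"]),
    (90, ["Romance", "Fantasy", "Mystery"]),
]


def full_credit(titles):
    """Old method: every genre a title is listed under gets all of its sales."""
    totals = defaultdict(float)
    for units, genres in titles:
        for genre in genres:
            totals[genre] += units
    return dict(totals)


def partial_credit(titles):
    """New method (assumed equal split): sales are divided across the listed genres."""
    totals = defaultdict(float)
    for units, genres in titles:
        share = units / len(genres)
        for genre in genres:
            totals[genre] += share
    return dict(totals)


actual_units = sum(units for units, _ in titles)
old_totals = full_credit(titles)
new_totals = partial_credit(titles)

print("Actual units sold:", actual_units)
print("Full credit:", old_totals, "sum =", sum(old_totals.values()))
print("Partial credit:", new_totals, "sum =", sum(new_totals.values()))
```

In this example 250 units were actually sold. Full credit gives Romance all 250 and the genre totals add up to 490, while partial credit gives Romance 160 and the genre totals add back up to 250.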

The big question is just how much more accurate the latest BookStat information is. I’ve outlined above some reasons for concern, and there are many others, but of course we mere mortals are not going to get to see that Bookstat data.

Which really leaves us none the wiser as to what the true numbers are.

Before moving on to privacy matters, I’ll leave you with this observation.

According to Data Guy’s January 2018 report Literature & Fiction was the single biggest selling genre in ebooks in the nine months measured. Some 70 million unit ebooks valued at $330 million were recorded.

Below that comes Mystery Thriller & Suspense with 35 million unit sales and $187m in revenue, while Romance comes in with a lower dollar value of $160 million for more unit sales – 50 million.

But hold on, since when was Literature & Fiction the biggest selling genre? Come to that, since when was Literature & Fiction a genre at all? It’s a catch-all category for most adult fiction, to separate it from children’s and YA titles and non-fiction.

Seriously, who amongst us has ever listed a book just as Literature & Fiction? The retailers and aggregators might put our titles into that category automatically, in order to reach relevant sub-categories like mystery, thriller, suspense, romance, sci-fi or whatever, but Literature & Fiction as a genre is utterly meaningless.

What genre do you write, friend? Oh I write literature and fiction. It outsells everything else. It’s the only genre to be in.

Yet here is Data Guy telling us Literature & Fiction sold twenty million more units than romance and twice as many as mystery, suspense and thrillers.

To confuse matters still further, Data Guy’s January 2018 AE Report – using the data Big Pub is forking out big money for – has as many issues with genre as it does with title counts.

In the list of top audio book genres Data Guy tells us that #1 is, surprise, surprise, Literature & Fiction.

At #2? Well, there’s another top-selling genre we’ve all been ignoring and missing out on millions of sales from. Fiction & Literature. No, seriously.

At #3 is Mystery, Thriller and Suspense. Not, of course, to be confused with Mysteries and Thrillers at #10. Go figure.

Bizarrely the audiobook sales for Literature & Fiction and Fiction & Literature combined are equal to the combined sales of Mystery, Thriller, Suspense and Science Fiction and Fantasy and Romance and Biography and Memoir.

Bearing in mind this is just from the sliver of data we’ve been allowed to see, it raises the question of what other inconsistencies are lurking in the full BookStat report behind the $10 million turnover paywall.

Privacy

But let’s move on from issues about the accuracy or otherwise of the Author Earnings and BookStat reports and address the thorny issue of privacy.

There’s also an ongoing debate about the legality of what Data Guy is doing, but I’m no lawyer, so will stick to the ethical issue here. For those wanting to ponder the legalities, check out The Passive Voice blog referenced above, or Dean Wesley Smith’s blog.

New Author Earnings Report

But I’ll start here with Smith, because he offers one example of why, legal or not, what Data Guy is doing could harm authors.

Data Guy explains in the 2018 AE Report that,

not just publishers have sought our help, but also book distributors, aggregators, global consulting firms, international publishing startups, and even private-equity firms investing in or advising major transactions in the publishing space.

At which point enter Dean Wesley Smith:

I’ll talk tonight about “perception” because (Data Guy) calls this authors “earning” and except for beginning writers, these earnings he can find are only a tiny part of what we make. And by releasing this kind of personal business data (no matter how he got it) and selling it (again no matter how he got it), he can cause extreme damage to businesses.

Just one example. A movie producer is looking to buy a book for a movie, looks at Data Guy’s numbers and sees the book didn’t sell very well over a certain period of time, decides to not do the movie, and the author loses millions. And that is only one minor way this can be damaging. This is why businesses, which all of you are as indie authors, don’t release private data.

There are any number of other damaging scenarios. For example, overestimating an author’s sales. Author A is reporting $100,000 in revenue to the taxman and there’s Data Guy with his multiple title overlap counting and mix-n-match genres asserting Author A raked in $150,000. Author A will be facing an expensive audit.

Which is why, regardless of whether what Data Guy is doing is legal or ethical, we should all be given the right to know what information Data Guy is selling about us.

At a time when the European Union is enacting globally-impacting legislation to protect data from abuse, and when Facebook is being hammered for failing to protect the data it holds on us, is it even remotely ethical for Data Guy, who hides his own identity and who is certainly not publishing how much he earns, to publish and to sell our data for private gain on a website, BookStat, we have restricted access to, when the “Terms & Conditions / Privacy Policy” button on BookStat goes nowhere?

Clearly Marie Force and Dean Wesley Smith think not, and they are far from alone. The indie blogs and discussion boards of January-February reek of a sense of betrayal. The very existence of BookStat, after all, is down to indie authors privately and confidentially sharing their sales information with Data Guy so he could create Author Earnings.

Dean Wesley Smith summed it up:

Having this data out in public (sort of) is one thing, but collecting and selling it with the express intent to help traditional publishers and hurt indie publishers is what startles me.

I just am stunned that Data Guy and Hugh have picked this path to hurt us all after everything they have done to help indies over the last few years. Stunning to me.

At which point I should add that my understanding, via the Hot Sheet mentioned above, is that Hugh Howey is no longer involved with Author Earnings, and may therefore have no involvement with BookStat.

So what safeguards does Data Guy have in place to protect indie authors? What is his ethics and privacy policy? As above, the Privacy Policy button on BookStat is just for show, so I asked Data Guy directly:

Given the current concerns about data harvesting and privacy in the light of the Facebook fiasco, can you offer some explanation and justification as to

  1. why BookStat is i) harvesting and ii) selling the private and confidential sales data of individual authors without their consent
  2. what safeguards BookStat has in place to prevent the further dissemination and potential misuse of that data once sold to third parties
  3. what remedy Bookstat offers individual authors to verify and challenge the accuracy of the data being sold
  4. what remedy individual authors may have to request you desist from i) harvesting and ii) selling their private and confidential sales data
  5. what other data about individual authors you may be i) harvesting and ii) selling through BookStat

As above, these concerns have new poignancy in the light of the on-going saga with Facebook. With Facebook we at least chose to be part of it and have the option to delete our accounts. We have no such option either way with BookStat.

Data Guy responded,

Bookstat collects all of its data from publicly available sources via web scraping. None of it is anyone’s private or confidential data, so your entire premise here appears to be based on a fundamental misunderstanding.

An even more apt comparison might be to Nielsen Bookscan and other industry data providers, that similarly track retailer point of sale data and book metadata at the title level as a product — it’s a long established practice in any professional industry.

Viewed in that context, answers should be self-evident.

An interesting counterpoint from Data Guy, but one that doesn’t stand up to scrutiny.

For starters Bookscan is dealing with real values, as provided by the publishers. There is no website scraping and guesswork about how chart ranking equates to sales.

On the other hand, BookStat is only functional because of the initial private and confidential information provided in good faith to the non-profit Author Earnings project. Without that privately volunteered data to guesstimate likely sales per chart position and calibrate the results, the scraped data is meaningless.

I wonder how many of those who volunteered that information would have done so knowing that data would be later used to operate a for-profit enterprise available only to corporate business, that would sell the sales data of indie authors to our perceived competitors, while giving us no facility to question the accuracy of that information, and that would happily publish the names of the biggest selling indie authors without their permission.

But hold that thought. Did I say the initial private and confidential information provided?

Forget that. Data Guy is still, after BookStat has been launched with corporate funding, asking indies to donate their private and confidential data so he can sharpen up BookStat further for the benefit of his corporate buyers.

In comments on the January 2018 AE Report someone mentions some weak areas and Data Guy responds,

If you or the high-selling authors you mention are willing to share specific numbers, shoot me an email; it would be interesting to chat.

At which point let’s trip across to the said comments on the Author Earnings Report of January 2018 where Data Guy, who was noticeably and unusually absent from the discussion on The Passive Voice and on Dean Wesley Smith’s blog, had little choice but to respond to at least some awkward questions. But even here on his home turf Data Guy chose to ignore most questions he didn’t like, and by some accounts even deleted questions.

More importantly, what he had to say there was at odds with his “this is not private and confidential data” response to me.

When the new report went live on January 23, Data Guy happily published a list of what he believes to be the top-selling indies in the US: full author names, but with unit sales and revenue blurred out on the AE Report. That extra data, of course, is only for those paying big bucks.

Diane Capri asked, having seen a list of the top fifty indie sellers published, where we might see the top 2,000.

Data Guy responded,

We probably won’t be making that public; it’s always a balancing act between individual author privacy versus sharing data that helps all authors.

So all is clear. The top fifty indie authors are fair game, but the rest have a right to privacy. Except if someone can stump up the big bucks, in which case privacy goes out of the window.

Only, hold on, Data Guy, remind me again what you said to me:

Bookstat collects all of its data from publicly available sources via web scraping. None of it is anyone’s private or confidential data.

So what was all this about respecting “individual author privacy”?

As the indies identified in the AE Report began to express their concern at this violation of their right to privacy, Data Guy began deleting the names of the objectors.

Kathryn Guare commented,

I notice that between yesterday and today several more independent author names in the top 50 list have been obscured. I assume they’ve requested it?  I’ve taken screen shots in case more of them disappear!

Disappear they did, as more authors publicly and privately expressed dismay at this new turn in the Author Earnings story. With a quarter of the authors’ names already pulled to allay discontent, Data Guy suddenly pulled the whole list, explaining:

We had initially shared a ranked Top 50 Indie Ebook Sellers list here (with units and dollars blurred out, of course). But then some of the authors on it started emailing us and asking for their names to be blurred out, too. As a courtesy to those authors, upon request we did so, but after the first few it became too much of a hassle… So we yanked the whole list, and will just simply state what we observed on it.

But hold on. If “none of it is anyone’s private or confidential data” then why blur the numbers? And how is it courteous to remove the list from public view while still selling the data to those with deep enough pockets?

But let’s just ponder further Data Guy’s assertion that none of this is “anyone’s private or confidential data.”

I’ll let Debora Geary explain, quoting her comments on the January AE Report. Geary argues that while,

this data can be compiled through publicly available information, the level of work Data Guy and his team went through to get it, in my mind, could be likened to having a PI stalk you for a month and then put a “Personal Profile” on the Internet. It makes me very uncomfortable to consider that level of detail “industry data.”

Besides, Data Guy’s assertion this information is “out there” is simply untrue. Yes, anyone can look at an Amazon page and see Author A’s book is at #2 in this or that chart category. But nowhere is there any comprehensive list of how many books are selling for that category at that particular hour. If there was, Data Guy wouldn’t have asked, and still be asking, for indie authors to volunteer their sales data.

Geary then asked,

Data Guy, can you state your policy (or your commercial arm’s policy) on selling individual-identified sales data to your trad-pub clients? Have you done it? Will you do it in future?

Data Guy chose not to respond to that, but was still happily answering less controversial questions.

At which point Dean Wesley Smith arrived on the scene.

Behind your paywall to big publishers, you are not releasing individual data of any writer’s sales. Correct?

Or at least not without written permission. Correct?

Please, please tell me I am correct.

I know you blocked sales and names here, but the sounds of what you are doing behind the paywall scares many I have talked to today about this. Major business issues with business privacy and so on. Because if traditional publishers and movie industry and gaming industry and such start making decisions on books because they know exact sales from you of any author, that will lead to more lawsuits than I can imagine.

Debora Geary chimed in,

I asked a very similar question above. Data Guy, if you could answer it here in both places, that would be much appreciated.

Smith being too big a name to just ignore, Data Guy responded to him, taking care not to actually answer the question.

No part of any business I’m involved with is sharing or releasing actual sales data that has been shared with me confidentially by any author, publisher, or agent.

Nor will it ever happen.

That is not what Dean asked, Data Guy.

At which point Debora Geary tried again.

Data Guy, that’s a partial answer, but not the part that at least I was asking. You calculate individual-level estimates of units sold and income earned (with a very sophisticated algorithm you are comparing to data sources like Bookscan, which are actual sales.) That data is blurred in this report, but with a “lock” symbol. Are you selling unlocked versions of that individual-identified data to your paying clients? What is your policy on selling this data if you have not already done so? (Adding) Please give us honest answers and develop a clear and ethical data use policy moving forward.

Another author, AW, came in to say,

You may not release the actual sales data that has been released to you by others, but you have used that data, as you admit, to check and fine tune your algorithm. Hence, if you release the data provided by your algorithm, as it appears you do as indicated by the paywall, you are in part, betraying the confidential sharing of data, as this has been specifically used to make sure that your algorithm ends up giving results that will be pretty close to the data that was shared. So to claim that you will release only “your” algorithm data, and not the shared data, is a legalistic workaround.

Neither AW nor Debora Geary received a response from Data Guy, although he was still about happily answering less awkward questions.

But not for long. Next up was Kristine Kathryn Rusch who, noting Data Guy had removed the BookStat link that the AE Report had originally included, asked,

Why is the paywall structured so that only companies with revenues of $10 million can subscribe?

Again Data Guy chose not to ignore such a big name in the indie movement, and explained why (too long to repeat here), but concluded,

Further responses will be limited to questions only about the AE report data above.

Very conveniently meaning he can ignore, for example, J.M. Ney-Grimm who, talking about Data Guy’s selling our data, says,

But the information provided by Bookstat won’t be available to most authors. It will be available to multi-million-dollar corporations (publishers, movie makers, etc.), giving them yet more power in their negotiations with authors.

What we don’t know is how many, if any, awkward comments were removed or blocked. Several authors in the comments thread on the Dean Wesley Smith blog, including Smith himself, mention comments on the AE Report being deleted.

Whether Data Guy did or did not delete awkward comments we can’t say for certain, but there’s no question Data Guy evaded awkward questions on the January AE Report comments thread and was uncharacteristically absent from the discussions elsewhere.

Time was, the Author Earnings Reports, while always open to question, did appear to shine a dim light on the ebook market.

Given most of the BookStat data is behind a very expensive paywall, none of us can say just how accurate or inaccurate the latest guestimates from Data Guy are.

But as should be clear from the little I’ve covered here, there are so many contradictions, not just with past reports but within the latest report, as to throw us back into the dark ages.

4 Replies to “GDPR. The Facebook scandal. Amid mounting privacy concerns, Data Guy continues to sell author’s private sales data to Big Pub. But is it even accurate?”

  1. On just one area of data accuracy, let me comment from observation that seeing an author name spelled variously on various retail sites is not unusual, since the metadata that supports it comes from various sources.

    When I supply an ONIX data feed to a distributor with my own carefully entered (but subject to human error) data, the distributor and (more commonly) his downstream retailers have heterogeneous systems for digesting it, including ordinary spreadsheets and human data entry. These retailers are slow or impossible to correct. The further metadata travels, the more the damage spreads.

    Take the trivial issue of accommodating Title/Subtitle/Series/Series Entry metadata. Out of my 100-150 end destination retailers (not to mention their different locations), I can think of at least 4-5 ways of handling that information, or combining fields together, or throwing data away. If I tried to get a count of all my titles using ONLY destination retailers, the numbers would be all over the place, because they don’t all handle the data “title” the same way.

    Take a look at B&N, for example, where the source of an audiobook (distributor) causes it to end up in entirely different areas of their retail site, so that simply getting a count of audiobook titles is non-trivial, either overall, or as part of a formats-by-title exercise, where the audiobook distributor you use skews the data badly.

    Real world data scraped from end-destination sources will always be riddled with errors — data repair is an expensive and significant task, no matter how many metadata standards we propagate, particularly as different participants step up at various rates of speed (or never).

    1. Thanks, Karen.

      I don’t doubt the problems, but seriously, did no-one proof the charts before publishing them and spot that JK Rowling (exact same name, no variation) was listed twice?

      The point is, if all these problems have yet to be resolved, the level of accuracy being claimed is clearly untrue.

      If Data Guy stands up and says “we’ve made improvements, this is our latest guestimate, but we still have weakness A, B and C, so take these numbers as a guideline,” then fine.

      As you say, “Real world data scraped from end-destination sources will always be riddled with errors.”

      If Data Guy had been that honest there would be no problem here.

      But Data Guy is proclaiming this fantastic new level of precision where every single book sold is logged, and then when it is pointed out the numbers don’t add up Data Guy agrees, hidden away in comments most people will never see, that this is wrong, that is wrong, etc.

      In the main post intro everyone reads: “With over 250 million rows of live ebook, audiobook, and online-print data at our fingertips, we can now, with the click of a mouse, slice & dice online book sales from last quarter, last month, or last week, any way we like.”

      Hidden away in comments: “The algorithms don’t do a great job of auto-correlating different versions of the same book selling at different retailers, so they look like different books to our spider.”

  2. Some interesting backstory. Back in 2016 I caught wind of Author Earnings. I am a professor of quantitative psychology (a particular flavor of statistics well-suited to analyzing this sort of data) and spent three years as a biostatistician. I silently balked at the Author Earnings report; no respected statistician would:
    * Present data with pie charts (it’s nearly impossible for human brains to accurately compare areas; bar charts work much better).
    * Aggregate values across time. That violates the assumption of independence, and any estimates (e.g., mean of romance sales) would be misleading.
    * Similarly, treat each author’s data as independent (as would be required to aggregate genres as Data Guy has done).
    * Present no “conditioned” models, that is, models that consider other variables simultaneously.
    * Compute simple means with data so skewed.
    I suspected then (and I suspect now) that Data Guy is a data guy, not a stats guy. That he knows how to scrape data and clearly knows how to handle a database doesn’t mean he has the prowess to answer the very interesting questions we want to ask (such as whether it is more beneficial to self-publish or to traditionally publish).

    Despite not being a stats guy, data guy is trying to do statistics. That’s like a basketball player trying to do gymnastics. Yes, both require athleticism, but the skill set is entirely different.

    Anyway, back to the story. So I contacted Hugh Howey and asked if he wanted some more sophisticated modeling done on the data. He eagerly agreed and I drafted a blog post that gave a more cautious (and more sound) picture of what was going on (based on the dataset). I emailed the draft to him. He was excited then said he’d share it with Data Guy and get back to me.

    Nothing.

    I emailed again, and again received no reply. I thought it rather odd at the time that he was once so eager and now he was ignoring me. I wondered then (and I wonder now) whether Data Guy had something to do with that. I wonder if Data Guy was concerned that somebody who actually knew what they were doing would soon discover his ineptitude.

    I ended up having it published on another indie blog (the original blog has since migrated to another domain name and with that migration, the post was lost, but I published it on my blog here: dustinfife.net/blog/amazon-vs-self-publishing-what-do-the-numbers-say).

    It’s interesting how all this has panned out. If someone deflects attacks by gloating about anonymous and invisible credentials, it’s probably a hoax.
